Merge branch 'master' into spacy.io

2025-09-17 17:42:43 +03:00 · 2020-02-23 12:04:20 +01:00 · 89967f3701
commit 89967f3701
parent 13b516289b ddf63b97a8
36 changed files with 704 additions and 84 deletions
--- a/.github/contributors/Jan-711.md
+++ b/.github/contributors/Jan-711.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Jan Jessewitsch      |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 16.02.2020           |
+| GitHub username                | Jan-711              |
+| Website (optional)             |                      |
--- a/.github/contributors/MisterKeefe.md
+++ b/.github/contributors/MisterKeefe.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [ ] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           |      Tom Keefe       |
+| Company name (if applicable)   |          /           |
+| Title or role (if applicable)  |          /           |
+| Date                           |   18 February 2020   |
+| GitHub username                |     MisterKeefe      |
+| Website (optional)             |          /           |
--- a/MANIFEST.in
+++ b/MANIFEST.in
@ -1,5 +1,5 @@
 recursive-include include *.h
-recursive-include spacy *.txt
+recursive-include spacy *.txt *.pyx *.pxd
 include LICENSE
 include README.md
 include bin/spacy
--- a/examples/streamlit_spacy.py
+++ b/examples/streamlit_spacy.py
@ -26,12 +26,12 @@ DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
 HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""


-@st.cache(ignore_hash=True)
+@st.cache(allow_output_mutation=True)
 def load_model(name):
    return spacy.load(name)


-@st.cache(ignore_hash=True)
+@st.cache(allow_output_mutation=True)
 def process_text(model_name, text):
    nlp = load_model(model_name)
    return nlp(text)
@ -79,7 +79,9 @@ if "ner" in nlp.pipe_names:
    st.header("Named Entities")
    st.sidebar.header("Named Entities")
    label_set = nlp.get_pipe("ner").labels
-    labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
+    labels = st.sidebar.multiselect(
+        "Entity labels", options=label_set, default=list(label_set)
+    )
    html = displacy.render(doc, style="ent", options={"ents": labels})
    # Newlines seem to mess with the rendering
    html = html.replace("\n", " ")
--- a/spacy/attrs.pxd
+++ b/spacy/attrs.pxd
@ -92,3 +92,5 @@ cdef enum attr_id_t:
    LANG
    ENT_KB_ID = symbols.ENT_KB_ID
    ENT_ID = symbols.ENT_ID
+
+    IDX
--- a/spacy/attrs.pyx
+++ b/spacy/attrs.pyx
@ -91,6 +91,7 @@ IDS = {
    "SPACY": SPACY,
    "PROB": PROB,
    "LANG": LANG,
+    "IDX": IDX
 }


--- a/spacy/cli/pretrain.py
+++ b/spacy/cli/pretrain.py
@ -34,7 +34,7 @@ from .train import _load_pretrained_tok2vec
    vectors_model=("Name or path to spaCy model with vectors to learn from"),
    output_dir=("Directory to write models to on each epoch", "positional", None, str),
    width=("Width of CNN layers", "option", "cw", int),
-    depth=("Depth of CNN layers", "option", "cd", int),
+    conv_depth=("Depth of CNN layers", "option", "cd", int),
    cnn_window=("Window size for CNN layers", "option", "cW", int),
    cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int),
    use_chars=("Whether to use character-based embedding", "flag", "chr", bool),
@ -84,7 +84,7 @@ def pretrain(
    vectors_model,
    output_dir,
    width=96,
-    depth=4,
+    conv_depth=4,
    bilstm_depth=0,
    cnn_pieces=3,
    sa_depth=0,
@ -132,9 +132,15 @@ def pretrain(
    msg.info("Using GPU" if has_gpu else "Not using GPU")

    output_dir = Path(output_dir)
+    if output_dir.exists() and [p for p in output_dir.iterdir()]:
+        msg.warn(
+            "Output directory is not empty",
+            "It is better to use an empty directory or refer to a new output path, "
+            "then the new directory will be created for you.",
+        )
    if not output_dir.exists():
        output_dir.mkdir()
-        msg.good("Created output directory")
+        msg.good("Created output directory: {}".format(output_dir))
    srsly.write_json(output_dir / "config.json", config)
    msg.good("Saved settings to config.json")

@ -162,7 +168,7 @@ def pretrain(
        Tok2Vec(
            width,
            embed_rows,
-            conv_depth=depth,
+            conv_depth=conv_depth,
            pretrained_vectors=pretrained_vectors,
            bilstm_depth=bilstm_depth,  # Requires PyTorch. Experimental.
            subword_features=not use_chars,  # Set to False for Chinese etc
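
Review note on the output-directory handling in `pretrain` above: the new check builds a full list of directory entries only to test whether the directory is empty. A minimal sketch of the same logic follows, assuming a hypothetical helper name `prepare_output_dir` (not part of spaCy) and returning flags instead of calling `msg.warn` / `msg.good`. As in the diff, `mkdir()` is called without `parents=True`, so a missing parent directory still raises.

```python
from pathlib import Path


def prepare_output_dir(output_dir):
    """Return (path, created, non_empty) for an output directory.

    Hypothetical helper for illustration only; the CLI inlines this logic.
    """
    output_dir = Path(output_dir)
    # any() stops at the first entry instead of listing the whole directory.
    non_empty = output_dir.exists() and any(output_dir.iterdir())
    created = False
    if not output_dir.exists():
        output_dir.mkdir()
        created = True
    return output_dir, created, non_empty
```

A caller could then warn when `non_empty` is true and report the new path when `created` is true, which matches the two messages added in this hunk.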
--- a/spacy/cli/train.py
+++ b/spacy/cli/train.py
@ -14,6 +14,7 @@ import contextlib
 import random

 from .._ml import create_default_optimizer
+from ..util import use_gpu as set_gpu
 from ..attrs import PROB, IS_OOV, CLUSTER, LANG
 from ..gold import GoldCorpus
 from ..compat import path2str
@ -32,6 +33,13 @@ from .. import about
    pipeline=("Comma-separated names of pipeline components", "option", "p", str),
    replace_components=("Replace components from base model", "flag", "R", bool),
    vectors=("Model to load vectors from", "option", "v", str),
+    width=("Width of CNN layers of Tok2Vec component", "option", "cw", int),
+    conv_depth=("Depth of CNN layers of Tok2Vec component", "option", "cd", int),
+    cnn_window=("Window size for CNN layers of Tok2Vec component", "option", "cW", int),
+    cnn_pieces=("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int),
+    use_chars=("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool),
+    bilstm_depth=("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int),
+    embed_rows=("Number of embedding rows of Tok2Vec component", "option", "er", int),
    n_iter=("Number of iterations", "option", "n", int),
    n_early_stopping=("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int),
    n_examples=("Number of examples", "option", "ns", int),
@ -63,6 +71,13 @@ def train(
    pipeline="tagger,parser,ner",
    replace_components=False,
    vectors=None,
+    width=96,
+    conv_depth=4,
+    cnn_window=1,
+    cnn_pieces=3,
+    use_chars=False,
+    bilstm_depth=0,
+    embed_rows=2000,
    n_iter=30,
    n_early_stopping=None,
    n_examples=0,
@ -115,6 +130,7 @@ def train(
        )
    if not output_path.exists():
        output_path.mkdir()
+        msg.good("Created output directory: {}".format(output_path))

    # Take dropout and batch size as generators of values -- dropout
    # starts high and decays sharply, to force the optimizer to explore.
@ -147,6 +163,18 @@ def train(
    disabled_pipes = None
    pipes_added = False
    msg.text("Training pipeline: {}".format(pipeline))
+    if use_gpu >= 0:
+        activated_gpu = None
+        try:
+            activated_gpu = set_gpu(use_gpu)
+        except Exception as e:
+            msg.warn("Exception: {}".format(e))
+        if activated_gpu is not None:
+            msg.text("Using GPU: {}".format(use_gpu))
+        else:
+            msg.warn("Unable to activate GPU: {}".format(use_gpu))
+            msg.text("Using CPU only")
+            use_gpu = -1
    if base_model:
        msg.text("Starting with base model '{}'".format(base_model))
        nlp = util.load_model(base_model)
@ -237,7 +265,15 @@ def train(
        optimizer = create_default_optimizer(Model.ops)
    else:
        # Start with a blank model, call begin_training
-        optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
+        cfg = {"device": use_gpu}
+        cfg["conv_depth"] = conv_depth
+        cfg["token_vector_width"] = width
+        cfg["bilstm_depth"] = bilstm_depth
+        cfg["cnn_maxout_pieces"] = cnn_pieces
+        cfg["embed_size"] = embed_rows
+        cfg["conv_window"] = cnn_window
+        cfg["subword_features"] = not use_chars
+        optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg)

    nlp._optimizer = None

@ -362,13 +398,19 @@ def train(
                    if not batch:
                        continue
                    docs, golds = zip(*batch)
-                    nlp.update(
-                        docs,
-                        golds,
-                        sgd=optimizer,
-                        drop=next(dropout_rates),
-                        losses=losses,
-                    )
+                    try:
+                        nlp.update(
+                            docs,
+                            golds,
+                            sgd=optimizer,
+                            drop=next(dropout_rates),
+                            losses=losses,
+                        )
+                    except ValueError as e:
+                        msg.warn("Error during training")
+                        if init_tok2vec:
+                            msg.warn("Did you provide the same parameters during 'train' as during 'pretrain'?")
+                        msg.fail("Original error message: {}".format(e), exits=1)
                    if raw_text:
                        # If raw text is available, perform 'rehearsal' updates,
                        # which use unlabelled data to reduce overfitting.
@ -495,6 +537,8 @@ def train(
                            "score = {}".format(best_score, current_score)
                        )
                        break
+    except Exception as e:
+        msg.warn("Aborting and saving the final best model. Encountered exception: {}".format(e))
    finally:
        best_pipes = nlp.pipe_names
        if disabled_pipes:
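
Review note on the new Tok2Vec options in `train`: the CLI option names do not match the config keys passed to `nlp.begin_training`, which is easy to get wrong when keeping `train` and `pretrain` settings in sync. The sketch below restates the mapping from the hunk above as a single function. The function name `build_tok2vec_cfg` is an assumption made for illustration; the keys, defaults and the inverted `use_chars` flag are taken from the diff.

```python
def build_tok2vec_cfg(
    use_gpu=-1,
    width=96,
    conv_depth=4,
    cnn_window=1,
    cnn_pieces=3,
    use_chars=False,
    bilstm_depth=0,
    embed_rows=2000,
):
    """Map `spacy train` option names to the cfg keys used in the diff.

    Hypothetical helper; the diff builds this dict inline.
    """
    return {
        "device": use_gpu,
        "conv_depth": conv_depth,
        "token_vector_width": width,      # CLI: --width / -cw
        "bilstm_depth": bilstm_depth,
        "cnn_maxout_pieces": cnn_pieces,  # CLI: --cnn-pieces / -cP
        "embed_size": embed_rows,         # CLI: --embed-rows / -er
        "conv_window": cnn_window,        # CLI: --cnn-window / -cW
        # Inverted, as in the diff: use_chars=True turns subword features off.
        "subword_features": not use_chars,
    }
```

The resulting dict would be splatted into the call, as in `nlp.begin_training(lambda: corpus.train_tuples, **cfg)` in the hunk above.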
--- a/spacy/displacy/__init__.py
+++ b/spacy/displacy/__init__.py
@ -144,10 +144,12 @@ def parse_deps(orig_doc, options={}):
            for span, tag, lemma, ent_type in spans:
                attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
                retokenizer.merge(span, attrs=attrs)
-    if options.get("fine_grained"):
-        words = [{"text": w.text, "tag": w.tag_} for w in doc]
-    else:
-        words = [{"text": w.text, "tag": w.pos_} for w in doc]
+    fine_grained = options.get("fine_grained")
+    add_lemma = options.get("add_lemma")
+    words = [{"text": w.text,
+              "tag": w.tag_ if fine_grained else w.pos_,
+              "lemma": w.lemma_ if add_lemma else None} for w in doc]
+
    arcs = []
    for word in doc:
        if word.i < word.head.i:
--- a/spacy/displacy/render.py
+++ b/spacy/displacy/render.py
@ -3,7 +3,7 @@ from __future__ import unicode_literals

 import uuid

-from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
+from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_WORDS_LEMMA, TPL_DEP_ARCS, TPL_ENTS
 from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
 from ..util import minify_html, escape_html, registry
 from ..errors import Errors
@ -83,7 +83,7 @@ class DependencyRenderer(object):
        self.width = self.offset_x + len(words) * self.distance
        self.height = self.offset_y + 3 * self.word_spacing
        self.id = render_id
-        words = [self.render_word(w["text"], w["tag"], i) for i, w in enumerate(words)]
+        words = [self.render_word(w["text"], w["tag"], w.get("lemma", None), i) for i, w in enumerate(words)]
        arcs = [
            self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
            for i, a in enumerate(arcs)
@ -101,7 +101,7 @@ class DependencyRenderer(object):
            lang=self.lang,
        )

-    def render_word(self, text, tag, i):
+    def render_word(self, text, tag, lemma, i):
        """Render individual word.

        text (unicode): Word text.
@ -114,6 +114,8 @@ class DependencyRenderer(object):
        if self.direction == "rtl":
            x = self.width - x
        html_text = escape_html(text)
+        if lemma is not None:
+            return TPL_DEP_WORDS_LEMMA.format(text=html_text, tag=tag, lemma=lemma, x=x, y=y)
        return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)

    def render_arrow(self, label, start, end, direction, i):
--- a/spacy/displacy/templates.py
+++ b/spacy/displacy/templates.py
@ -18,6 +18,15 @@ TPL_DEP_WORDS = """
 """


+TPL_DEP_WORDS_LEMMA = """
+<text class="displacy-token" fill="currentColor" text-anchor="middle" y="{y}">
+    <tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
+    <tspan class="displacy-lemma" dy="2em" fill="currentColor" x="{x}">{lemma}</tspan>
+    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
+</text>
+"""
+
+
 TPL_DEP_ARCS = """
 <g class="displacy-arrow">
    <path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
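
Review note on the lemma rendering: the renderer switches templates depending on whether a lemma was supplied. The sketch below reproduces that selection outside spaCy. `TPL_DEP_WORDS_LEMMA` is copied from the hunk above; `TPL_WORDS_STANDIN` is a simplified stand-in for `TPL_DEP_WORDS` (whose body is not shown in this diff), and `html.escape` stands in for spaCy's `escape_html`. Both stand-ins are assumptions.

```python
from html import escape

# Copied from the diff above.
TPL_DEP_WORDS_LEMMA = """
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="{y}">
    <tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
    <tspan class="displacy-lemma" dy="2em" fill="currentColor" x="{x}">{lemma}</tspan>
    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
</text>
"""

# Simplified stand-in: the real TPL_DEP_WORDS body is not part of this diff.
TPL_WORDS_STANDIN = """
<text class="displacy-token" y="{y}">
    <tspan class="displacy-word" x="{x}">{text}</tspan>
    <tspan class="displacy-tag" dy="2em" x="{x}">{tag}</tspan>
</text>
"""


def render_word(text, tag, lemma, x, y):
    """Use the lemma template only when a lemma was supplied."""
    html_text = escape(text)  # stand-in for spaCy's escape_html
    if lemma is not None:
        return TPL_DEP_WORDS_LEMMA.format(
            text=html_text, tag=tag, lemma=lemma, x=x, y=y
        )
    return TPL_WORDS_STANDIN.format(text=html_text, tag=tag, x=x, y=y)
```

As in the diff, only the word text is escaped; the lemma and tag are inserted as given. Whether the lemma should also be escaped may be worth a follow-up.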
--- a/spacy/lang/de/stop_words.py
+++ b/spacy/lang/de/stop_words.py
@ -22,14 +22,14 @@ dort drei drin dritte dritten dritter drittes du durch durchaus dürfen dürft
 durfte durften

 eben ebenso ehrlich eigen eigene eigenen eigener eigenes ein einander eine
-einem einen einer eines einigeeinigen einiger einiges einmal einmaleins elf en
+einem einen einer eines einige einigen einiger einiges einmal einmaleins elf en
 ende endlich entweder er erst erste ersten erster erstes es etwa etwas euch

 früher fünf fünfte fünften fünfter fünftes für

 gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
 geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
-gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen
+gewesen gewollt geworden gibt ging gleich gross groß grosse große grossen
 großen grosser großer grosses großes gut gute guter gutes

 habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
@ -47,9 +47,8 @@ kleines kommen kommt können könnt konnte könnte konnten kurz
 lang lange leicht leider lieber los

 machen macht machte mag magst man manche manchem manchen mancher manches mehr
-mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel
-mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst
-musste mussten
+mein meine meinem meinen meiner meines mich mir mit mittel mochte möchte mochten
+mögen möglich mögt morgen muss muß müssen musst müsst musste mussten

 na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
 neuntes nicht nichts nie niemand niemandem niemanden noch nun nur
--- a/spacy/lang/el/__init__.py
+++ b/spacy/lang/el/__init__.py
@ -3,7 +3,7 @@
 from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .tag_map_general import TAG_MAP
+from ..tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .lemmatizer import GreekLemmatizer
--- a/spacy/lang/el/tag_map_fine.py
+++ b/spacy/lang/el/tag_map_fine.py
--- a/spacy/lang/el/tag_map_general.py
+++ b/spacy/lang/el/tag_map_general.py
@ -1,27 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
-from ...symbols import PUNCT, NUM, AUX, X, ADJ, VERB, PART, SPACE, CCONJ
-
-
-TAG_MAP = {
-    "ADJ": {POS: ADJ},
-    "ADV": {POS: ADV},
-    "INTJ": {POS: INTJ},
-    "NOUN": {POS: NOUN},
-    "PROPN": {POS: PROPN},
-    "VERB": {POS: VERB},
-    "ADP": {POS: ADP},
-    "CCONJ": {POS: CCONJ},
-    "SCONJ": {POS: SCONJ},
-    "PART": {POS: PART},
-    "PUNCT": {POS: PUNCT},
-    "SYM": {POS: SYM},
-    "NUM": {POS: NUM},
-    "PRON": {POS: PRON},
-    "AUX": {POS: AUX},
-    "SPACE": {POS: SPACE},
-    "DET": {POS: DET},
-    "X": {POS: X},
-}
--- a/spacy/lang/fi/tokenizer_exceptions.py
+++ b/spacy/lang/fi/tokenizer_exceptions.py
@ -14,6 +14,7 @@ for exc_data in [
    {ORTH: "alv.", LEMMA: "arvonlisävero"},
    {ORTH: "ark.", LEMMA: "arkisin"},
    {ORTH: "as.", LEMMA: "asunto"},
+    {ORTH: "eaa.", LEMMA: "ennen ajanlaskun alkua"},
    {ORTH: "ed.", LEMMA: "edellinen"},
    {ORTH: "esim.", LEMMA: "esimerkki"},
    {ORTH: "huom.", LEMMA: "huomautus"},
@ -27,6 +28,7 @@ for exc_data in [
    {ORTH: "läh.", LEMMA: "lähettäjä"},
    {ORTH: "miel.", LEMMA: "mieluummin"},
    {ORTH: "milj.", LEMMA: "miljoona"},
+    {ORTH: "Mm.", LEMMA: "muun muassa"},
    {ORTH: "mm.", LEMMA: "muun muassa"},
    {ORTH: "myöh.", LEMMA: "myöhempi"},
    {ORTH: "n.", LEMMA: "noin"},
--- a/spacy/lang/ro/__init__.py
+++ b/spacy/lang/ro/__init__.py
@ -3,6 +3,8 @@ from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
+from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
+from .punctuation import TOKENIZER_SUFFIXES

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
@ -24,6 +26,9 @@ class RomanianDefaults(Language.Defaults):
    )
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
+    prefixes = TOKENIZER_PREFIXES
+    suffixes = TOKENIZER_SUFFIXES
+    infixes = TOKENIZER_INFIXES
    tag_map = TAG_MAP


--- a/spacy/lang/ro/punctuation.py
+++ b/spacy/lang/ro/punctuation.py
@ -0,0 +1,164 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+import itertools
+
+from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
+from ..char_classes import LIST_ICONS, CURRENCY
+from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT
+
+
+_list_icons = [x for x in LIST_ICONS if x != "°"]
+_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
+
+
+_ro_variants = {
+    "Ă": ["Ă", "A"],
+    "Â": ["Â", "A"],
+    "Î": ["Î", "I"],
+    "Ș": ["Ș", "Ş", "S"],
+    "Ț": ["Ț", "Ţ", "T"],
+}
+
+
+def _make_ro_variants(tokens):
+    variants = []
+    for token in tokens:
+        upper_token = token.upper()
+        upper_char_variants = [_ro_variants.get(c, [c]) for c in upper_token]
+        upper_variants = ["".join(x) for x in itertools.product(*upper_char_variants)]
+        for variant in upper_variants:
+            variants.extend([variant, variant.lower(), variant.title()])
+    return sorted(list(set(variants)))
+
+
+# UD_Romanian-RRT closed class prefixes
+# POS: ADP|AUX|CCONJ|DET|NUM|PART|PRON|SCONJ
+_ud_rrt_prefixes = [
+    "a-",
+    "c-",
+    "ce-",
+    "cu-",
+    "d-",
+    "de-",
+    "dintr-",
+    "e-",
+    "făr-",
+    "i-",
+    "l-",
+    "le-",
+    "m-",
+    "mi-",
+    "n-",
+    "ne-",
+    "p-",
+    "pe-",
+    "prim-",
+    "printr-",
+    "s-",
+    "se-",
+    "te-",
+    "v-",
+    "într-",
+    "ș-",
+    "și-",
+    "ți-",
+]
+_ud_rrt_prefix_variants = _make_ro_variants(_ud_rrt_prefixes)
+
+
+# UD_Romanian-RRT closed class suffixes without NUM
+# POS: ADP|AUX|CCONJ|DET|PART|PRON|SCONJ
+_ud_rrt_suffixes = [
+    "-a",
+    "-aceasta",
+    "-ai",
+    "-al",
+    "-ale",
+    "-alta",
+    "-am",
+    "-ar",
+    "-astea",
+    "-atâta",
+    "-au",
+    "-aș",
+    "-ați",
+    "-i",
+    "-ilor",
+    "-l",
+    "-le",
+    "-lea",
+    "-mea",
+    "-meu",
+    "-mi",
+    "-mă",
+    "-n",
+    "-ndărătul",
+    "-ne",
+    "-o",
+    "-oi",
+    "-or",
+    "-s",
+    "-se",
+    "-si",
+    "-te",
+    "-ul",
+    "-ului",
+    "-un",
+    "-uri",
+    "-urile",
+    "-urilor",
+    "-veți",
+    "-vă",
+    "-ăștia",
+    "-și",
+    "-ți",
+]
+_ud_rrt_suffix_variants = _make_ro_variants(_ud_rrt_suffixes)
+
+
+_prefixes = (
+    ["§", "%", "=", "—", "–", r"\+(?![0-9])"]
+    + _ud_rrt_prefix_variants
+    + LIST_PUNCT
+    + LIST_ELLIPSES
+    + LIST_QUOTES
+    + LIST_CURRENCY
+    + LIST_ICONS
+)
+
+
+_suffixes = (
+    _ud_rrt_suffix_variants
+    + LIST_PUNCT
+    + LIST_ELLIPSES
+    + LIST_QUOTES
+    + _list_icons
+    + ["—", "–"]
+    + [
+        r"(?<=[0-9])\+",
+        r"(?<=°[FfCcKk])\.",
+        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
+        r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
+            al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
+        ),
+        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
+    ]
+)
+
+_infixes = (
+    LIST_ELLIPSES
+    + _list_icons
+    + [
+        r"(?<=[0-9])[+\*^](?=[0-9-])",
+        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
+            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
+        ),
+        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
+    ]
+)
+
+TOKENIZER_PREFIXES = _prefixes
+TOKENIZER_SUFFIXES = _suffixes
+TOKENIZER_INFIXES = _infixes
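
Review note on `_make_ro_variants`: the helper expands each token into every combination of diacritic spellings and then adds upper, lower and title case for each. The standalone sketch below copies the logic from the diff so it can be run without spaCy. The table is written with Unicode escapes because the comma-below and cedilla forms are hard to tell apart on screen; this assumes the diff lists the comma-below form first and the cedilla form second. The public names `RO_VARIANTS` and `make_ro_variants` are renamed here for illustration.

```python
import itertools

# Uppercase Romanian letters and the spellings commonly typed in their place.
RO_VARIANTS = {
    "\u0102": ["\u0102", "A"],            # A with breve
    "\u00c2": ["\u00c2", "A"],            # A with circumflex
    "\u00ce": ["\u00ce", "I"],            # I with circumflex
    "\u0218": ["\u0218", "\u015e", "S"],  # S: comma below, cedilla, plain
    "\u021a": ["\u021a", "\u0162", "T"],  # T: comma below, cedilla, plain
}


def make_ro_variants(tokens):
    """Return the sorted, de-duplicated spelling and case variants of tokens."""
    variants = []
    for token in tokens:
        upper_token = token.upper()
        # One list of alternatives per character; unknown characters map to themselves.
        per_char = [RO_VARIANTS.get(c, [c]) for c in upper_token]
        # Cartesian product over the per-character alternatives.
        upper_variants = ["".join(x) for x in itertools.product(*per_char)]
        for variant in upper_variants:
            variants.extend([variant, variant.lower(), variant.title()])
    return sorted(set(variants))
```

A prefix without diacritics such as `"de-"` yields only its three case forms, while a prefix with one `S`-type letter yields six entries (three spellings, upper and lower case; title case coincides with upper case for a single letter). The number of variants grows multiplicatively with the number of diacritic letters in a token, which matters for longer entries in the exception list.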
--- a/spacy/lang/ro/tokenizer_exceptions.py
+++ b/spacy/lang/ro/tokenizer_exceptions.py
@ -2,6 +2,7 @@
 from __future__ import unicode_literals

 from ...symbols import ORTH
+from .punctuation import _make_ro_variants


 _exc = {}
@ -45,8 +46,52 @@ for orth in [
    "dpdv",
    "șamd.",
    "ș.a.m.d.",
+    # below: from UD_Romanian-RRT:
+    "A.c.",
+    "A.f.",
+    "A.r.",
+    "Al.",
+    "Art.",
+    "Aug.",
+    "Bd.",
+    "Dem.",
+    "Dr.",
+    "Fig.",
+    "Fr.",
+    "Gh.",
+    "Gr.",
+    "Lt.",
+    "Nr.",
+    "Obs.",
+    "Prof.",
+    "Sf.",
+    "a.m.",
+    "a.r.",
+    "alin.",
+    "art.",
+    "d-l",
+    "d-lui",
+    "d-nei",
+    "ex.",
+    "fig.",
+    "ian.",
+    "lit.",
+    "lt.",
+    "p.a.",
+    "p.m.",
+    "pct.",
+    "prep.",
+    "sf.",
+    "tel.",
+    "univ.",
+    "îngr.",
+    "într-adevăr",
+    "Șt.",
+    "ș.a.",
 ]:
-    _exc[orth] = [{ORTH: orth}]
+    # note: does not distinguish capitalized-only exceptions from others
+    for variant in _make_ro_variants([orth]):
+        _exc[variant] = [{ORTH: variant}]


 TOKENIZER_EXCEPTIONS = _exc
--- a/spacy/language.py
+++ b/spacy/language.py
@ -608,6 +608,7 @@ class Language(object):
        link_vectors_to_models(self.vocab)
        if self.vocab.vectors.data.shape[1]:
            cfg["pretrained_vectors"] = self.vocab.vectors.name
+            cfg["pretrained_dims"] = self.vocab.vectors.data.shape[1]
        if sgd is None:
            sgd = create_default_optimizer(Model.ops)
        self._optimizer = sgd
--- a/spacy/pipeline/entityruler.py
+++ b/spacy/pipeline/entityruler.py
@ -8,7 +8,7 @@ from ..language import component
 from ..errors import Errors
 from ..compat import basestring_
 from ..util import ensure_path, to_disk, from_disk
-from ..tokens import Span
+from ..tokens import Doc, Span
 from ..matcher import Matcher, PhraseMatcher

 DEFAULT_ENT_ID_SEP = "||"
@ -162,6 +162,7 @@ class EntityRuler(object):
    @property
    def patterns(self):
        """Get all patterns that were added to the entity ruler.
+
        RETURNS (list): The original patterns, one dictionary per pattern.

        DOCS: https://spacy.io/api/entityruler#patterns
@ -194,6 +195,7 @@ class EntityRuler(object):

        DOCS: https://spacy.io/api/entityruler#add_patterns
        """
+
        # disable the nlp components after this one in case they hadn't been initialized / deserialised yet
        try:
            current_index = self.nlp.pipe_names.index(self.name)
@ -203,7 +205,33 @@ class EntityRuler(object):
        except ValueError:
            subsequent_pipes = []
        with self.nlp.disable_pipes(subsequent_pipes):
+            token_patterns = []
+            phrase_pattern_labels = []
+            phrase_pattern_texts = []
+            phrase_pattern_ids = []
+
            for entry in patterns:
+                if isinstance(entry["pattern"], basestring_):
+                    phrase_pattern_labels.append(entry["label"])
+                    phrase_pattern_texts.append(entry["pattern"])
+                    phrase_pattern_ids.append(entry.get("id"))
+                elif isinstance(entry["pattern"], list):
+                    token_patterns.append(entry)
+
+            phrase_patterns = []
+            for label, pattern, ent_id in zip(
+                phrase_pattern_labels,
+                self.nlp.pipe(phrase_pattern_texts),
+                phrase_pattern_ids
+            ):
+                phrase_pattern = {
+                    "label": label, "pattern": pattern, "id": ent_id
+                }
+                if ent_id:
+                    phrase_pattern["id"] = ent_id
+                phrase_patterns.append(phrase_pattern)
+
+            for entry in token_patterns + phrase_patterns:
                label = entry["label"]
                if "id" in entry:
                    ent_label = label
@ -212,8 +240,8 @@ class EntityRuler(object):
                    self._ent_ids[key] = (ent_label, entry["id"])

                pattern = entry["pattern"]
-                if isinstance(pattern, basestring_):
-                    self.phrase_patterns[label].append(self.nlp(pattern))
+                if isinstance(pattern, Doc):
+                    self.phrase_patterns[label].append(pattern)
                elif isinstance(pattern, list):
                    self.token_patterns[label].append(pattern)
                else:
@ -226,6 +254,8 @@ class EntityRuler(object):
    def _split_label(self, label):
        """Split Entity label into ent_label and ent_id if it contains self.ent_id_sep

+        label (str): The value of label in a pattern entry
+
        RETURNS (tuple): ent_label, ent_id
        """
        if self.ent_id_sep in label:
@ -239,6 +269,9 @@ class EntityRuler(object):
    def _create_label(self, label, ent_id):
        """Join Entity label with ent_id if the pattern has an `id` attribute

+        label (str): The label to set for ent.label_
+        ent_id (str): The entity ID from the pattern's `id` attribute
+
        RETURNS (str): The ent_label joined with configured `ent_id_sep`
        """
        if isinstance(ent_id, basestring_):
@ -250,6 +283,7 @@ class EntityRuler(object):

        patterns_bytes (bytes): The bytestring to load.
        **kwargs: Other config paramters, mostly for consistency.
+
        RETURNS (EntityRuler): The loaded entity ruler.

        DOCS: https://spacy.io/api/entityruler#from_bytes
@ -292,6 +326,7 @@ class EntityRuler(object):

        path (unicode / Path): The JSONL file to load.
        **kwargs: Other config paramters, mostly for consistency.
+
        RETURNS (EntityRuler): The loaded entity ruler.

        DOCS: https://spacy.io/api/entityruler#from_disk
--- a/spacy/pipeline/pipes.pyx
+++ b/spacy/pipeline/pipes.pyx
@ -1044,6 +1044,7 @@ class TextCategorizer(Pipe):
                    self.add_label(cat)
        if self.model is True:
            self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
+            self.cfg["pretrained_dims"] = kwargs.get("pretrained_dims")
            self.require_labels()
            self.model = self.Model(len(self.labels), **self.cfg)
            link_vectors_to_models(self.vocab)
--- a/spacy/symbols.pxd
+++ b/spacy/symbols.pxd
@ -463,3 +463,5 @@ cdef enum symbol_t:

    ENT_KB_ID
    ENT_ID
+
+    IDX
--- a/spacy/symbols.pyx
+++ b/spacy/symbols.pyx
@ -93,6 +93,7 @@ IDS = {
    "SPACY": SPACY,
    "PROB": PROB,
    "LANG": LANG,
+    "IDX": IDX,

    "ADJ": ADJ,
    "ADP": ADP,
--- a/spacy/tests/doc/test_array.py
+++ b/spacy/tests/doc/test_array.py
@ -66,3 +66,14 @@ def test_doc_array_to_from_string_attrs(en_vocab, attrs):
    words = ["An", "example", "sentence"]
    doc = Doc(en_vocab, words=words)
    Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
+
+
+def test_doc_array_idx(en_vocab):
+    """Test that Doc.to_array can retrieve token start indices"""
+    words = ["An", "example", "sentence"]
+    doc = Doc(en_vocab, words=words)
+    offsets = doc.to_array("IDX")
+
+    assert offsets[0] == 0
+    assert offsets[1] == 3
+    assert offsets[2] == 11
--- a/spacy/tests/doc/test_doc_api.py
+++ b/spacy/tests/doc/test_doc_api.py
@ -7,7 +7,7 @@ import numpy
 from spacy.tokens import Doc, Span
 from spacy.vocab import Vocab
 from spacy.errors import ModelsWarning
-from spacy.attrs import ENT_TYPE, ENT_IOB
+from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP

 from ..util import get_doc

@ -274,6 +274,39 @@ def test_doc_is_nered(en_vocab):
    assert new_doc.is_nered


+def test_doc_from_array_sent_starts(en_vocab):
+    words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
+    heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
+    deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep"]
+    doc = Doc(en_vocab, words=words)
+    for i, (dep, head) in enumerate(zip(deps, heads)):
+        doc[i].dep_ = dep
+        doc[i].head = doc[head]
+        if head == i:
+            doc[i].is_sent_start = True
+    doc.is_parsed = True
+
+    attrs = [SENT_START, HEAD]
+    arr = doc.to_array(attrs)
+    new_doc = Doc(en_vocab, words=words)
+    with pytest.raises(ValueError):
+        new_doc.from_array(attrs, arr)
+
+    attrs = [SENT_START, DEP]
+    arr = doc.to_array(attrs)
+    new_doc = Doc(en_vocab, words=words)
+    new_doc.from_array(attrs, arr)
+    assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc]
+    assert not new_doc.is_parsed
+
+    attrs = [HEAD, DEP]
+    arr = doc.to_array(attrs)
+    new_doc = Doc(en_vocab, words=words)
+    new_doc.from_array(attrs, arr)
+    assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc]
+    assert new_doc.is_parsed
+
+
 def test_doc_lang(en_vocab):
    doc = Doc(en_vocab, words=["Hello", "world"])
    assert doc.lang_ == "en"
--- a/spacy/tests/doc/test_span.py
+++ b/spacy/tests/doc/test_span.py
@ -279,3 +279,12 @@ def test_filter_spans(doc):
    assert len(filtered[1]) == 5
    assert filtered[0].start == 1 and filtered[0].end == 4
    assert filtered[1].start == 5 and filtered[1].end == 10
+
+
+def test_span_eq_hash(doc, doc_not_parsed):
+    assert doc[0:2] == doc[0:2]
+    assert doc[0:2] != doc[1:3]
+    assert doc[0:2] != doc_not_parsed[0:2]
+    assert hash(doc[0:2]) == hash(doc[0:2])
+    assert hash(doc[0:2]) != hash(doc[1:3])
+    assert hash(doc[0:2]) != hash(doc_not_parsed[0:2])
--- a/spacy/tests/lang/fi/test_tokenizer.py
+++ b/spacy/tests/lang/fi/test_tokenizer.py
@ -10,28 +10,33 @@ ABBREVIATION_TESTS = [
        ["Hyvää", "uutta", "vuotta", "t.", "siht.", "Niemelä", "!"],
    ),
    ("Paino on n. 2.2 kg", ["Paino", "on", "n.", "2.2", "kg"]),
+    (
+        "Vuonna 1 eaa. tapahtui kauheita.",
+        ["Vuonna", "1", "eaa.", "tapahtui", "kauheita", "."],
+    ),
 ]

 HYPHENATED_TESTS = [
    (
-        "1700-luvulle sijoittuva taide-elokuva",
-        ["1700-luvulle", "sijoittuva", "taide-elokuva"],
+        "1700-luvulle sijoittuva taide-elokuva Wikimedia-säätiön Varsinais-Suomen",
+        [
+            "1700-luvulle",
+            "sijoittuva",
+            "taide-elokuva",
+            "Wikimedia-säätiön",
+            "Varsinais-Suomen",
+        ],
    )
 ]

 ABBREVIATION_INFLECTION_TESTS = [
    (
        "VTT:ssa ennen v:ta 2010 suoritetut mittaukset",
-        ["VTT:ssa", "ennen", "v:ta", "2010", "suoritetut", "mittaukset"]
+        ["VTT:ssa", "ennen", "v:ta", "2010", "suoritetut", "mittaukset"],
    ),
-    (
-        "ALV:n osuus on 24 %.",
-        ["ALV:n", "osuus", "on", "24", "%", "."]
-    ),
-    (
-        "Hiihtäjä oli kilpailun 14:s.",
-        ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]
-    )
+    ("ALV:n osuus on 24 %.", ["ALV:n", "osuus", "on", "24", "%", "."]),
+    ("Hiihtäjä oli kilpailun 14:s.", ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]),
+    ("EU:n toimesta tehtiin jotain.", ["EU:n", "toimesta", "tehtiin", "jotain", "."]),
 ]


--- a/spacy/tests/test_displacy.py
+++ b/spacy/tests/test_displacy.py
@ -31,10 +31,10 @@ def test_displacy_parse_deps(en_vocab):
    deps = displacy.parse_deps(doc)
    assert isinstance(deps, dict)
    assert deps["words"] == [
-        {"text": "This", "tag": "DET"},
-        {"text": "is", "tag": "AUX"},
-        {"text": "a", "tag": "DET"},
-        {"text": "sentence", "tag": "NOUN"},
+        {"lemma": None, "text": "This", "tag": "DET"},
+        {"lemma": None, "text": "is", "tag": "AUX"},
+        {"lemma": None, "text": "a", "tag": "DET"},
+        {"lemma": None, "text": "sentence", "tag": "NOUN"},
    ]
    assert deps["arcs"] == [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
--- a/spacy/tests/util.py
+++ b/spacy/tests/util.py
@ -95,7 +95,11 @@ def assert_docs_equal(doc1, doc2):

    assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
    assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]
-    assert [ent for ent in doc1.ents] == [ent for ent in doc2.ents]
+    for ent1, ent2 in zip(doc1.ents, doc2.ents):
+        assert ent1.start == ent2.start
+        assert ent1.end == ent2.end
+        assert ent1.label == ent2.label
+        assert ent1.kb_id == ent2.kb_id


 def assert_packed_msg_equal(b1, b2):
--- a/spacy/tokens/doc.pyx
+++ b/spacy/tokens/doc.pyx
@ -23,7 +23,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
 from ..typedefs cimport attr_t, flags_t
 from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
 from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
-from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, attr_id_t
+from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
 from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t

 from ..attrs import intify_attrs, IDS
@ -73,6 +73,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
        return token.ent_id
    elif feat_name == ENT_KB_ID:
        return token.ent_kb_id
+    elif feat_name == IDX:
+        return token.idx
    else:
        return Lexeme.get_struct_attr(token.lex, feat_name)

@ -813,7 +815,7 @@ cdef class Doc:
                if attr_ids[j] != TAG:
                    Token.set_struct_attr(token, attr_ids[j], array[i, j])
        # Set flags
-        self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
+        self.is_parsed = bool(self.is_parsed or HEAD in attrs)
        self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
        # If document is parsed, set children
        if self.is_parsed:
--- a/spacy/tokens/span.pyx
+++ b/spacy/tokens/span.pyx
@ -127,22 +127,27 @@ cdef class Span:
                return False
            else:
                return True
-        # Eq
+        # <
        if op == 0:
            return self.start_char < other.start_char
+        # <=
        elif op == 1:
            return self.start_char <= other.start_char
+        # ==
        elif op == 2:
-            return self.start_char == other.start_char and self.end_char == other.end_char
+            return (self.doc, self.start_char, self.end_char, self.label, self.kb_id) == (other.doc, other.start_char, other.end_char, other.label, other.kb_id)
+        # !=
        elif op == 3:
-            return self.start_char != other.start_char or self.end_char != other.end_char
+            return (self.doc, self.start_char, self.end_char, self.label, self.kb_id) != (other.doc, other.start_char, other.end_char, other.label, other.kb_id)
+        # >
        elif op == 4:
            return self.start_char > other.start_char
+        # >=
        elif op == 5:
            return self.start_char >= other.start_char

    def __hash__(self):
-        return hash((self.doc, self.label, self.start_char, self.end_char))
+        return hash((self.doc, self.start_char, self.end_char, self.label, self.kb_id))

    def __len__(self):
        """Get the number of tokens in the span.
--- a/spacy/vectors.pyx
+++ b/spacy/vectors.pyx
@ -283,7 +283,11 @@ cdef class Vectors:

        DOCS: https://spacy.io/api/vectors#add
        """
-        key = get_string_id(key)
+        # use int for all keys and rows in key2row for more efficient access
+        # and serialization
+        key = int(get_string_id(key))
+        if row is not None:
+            row = int(row)
        if row is None and key in self.key2row:
            row = self.key2row[key]
        elif row is None:
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@ -239,6 +239,7 @@ If a setting is not present in the options, the default value will be used.
 | Name               | Type    | Description                                                                                                     | Default                 |
 | ------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
 | `fine_grained`     | bool    | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`).              | `False`                 |
+| `add_lemma`        | bool    | Print the lemmas in a separate row below the token texts in the `dep` visualisation.                            | `False`                 |
 | `collapse_punct`   | bool    | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True`                  |
 | `collapse_phrases` | bool    | Merge noun phrases into one token.                                                                              | `False`                 |
 | `compact`          | bool    | "Compact mode" with square arrows that takes up less space.                                                     | `False`                 |
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@ -1096,6 +1096,33 @@ with the patterns. When you load the model back in, all pipeline components will
 be restored and deserialized – including the entity ruler. This lets you ship
 powerful model packages with binary weights _and_ rules included!

+### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"}
+
+When using a large number of **phrase patterns** (roughly more than 10,000), it's useful to understand how the `add_patterns` function of the `EntityRuler` works. For each **phrase pattern**,
+the `EntityRuler` calls the `nlp` object to construct a `Doc` object. This is necessary because you might
+add the `EntityRuler` at the end of an existing pipeline with, for example, a POS tagger and want to
+extract matches based on the pattern's POS signature.
+
+In this case you would pass a config value of `phrase_matcher_attr="POS"` to the `EntityRuler`.
+
+Running the full language pipeline on every pattern takes time proportional to the number of patterns, so it can take a long time for large lists of phrase patterns.
+
+As of spaCy 2.2.4, the `add_patterns` function has been refactored to use `nlp.pipe` on all phrase patterns, resulting in a speedup of about 10x to 20x with 5,000 to 100,000 phrase patterns, respectively.
+
+Even with this speedup (but especially if you're using an older version), the `add_patterns` function can still take a long time.
+
+An easy workaround to make this function run faster is to disable the other pipeline components
+while adding the phrase patterns.
+
+```python
+entityruler = EntityRuler(nlp)
+patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
+
+other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
+with nlp.disable_pipes(*other_pipes):
+    entityruler.add_patterns(patterns)
+```
+
 ## Combining models and rules {#models-rules}

 You can combine statistical and rule-based components in a variety of ways.
--- a/website/meta/universe.json
+++ b/website/meta/universe.json
@ -999,6 +999,17 @@
            "author": "Graphbrain",
            "category": ["standalone"]
        },
+        {
+            "type": "education",
+            "id": "nostarch-nlp-python",
+            "title": "Natural Language Processing Using Python",
+            "slogan": "No Starch Press, 2020",
+            "description": "Natural Language Processing Using Python is an introduction to natural language processing (NLP), the task of converting human language into data that a computer can process. The book uses spaCy, a leading Python library for NLP, to guide readers through common NLP tasks related to generating and understanding human language with code. It addresses problems like understanding a user's intent, continuing a conversation with a human, and maintaining the state of a conversation.",
+            "cover": "https://nostarch.com/sites/default/files/styles/uc_product_full/public/NaturalLanguageProcessing_final_v01.jpg",
+            "url": "https://nostarch.com/NLPPython",
+            "author": "Yuli Vasiliev",
+            "category": ["books"]
+        },
        {
            "type": "education",
            "id": "oreilly-python-ds",