mirror of https://github.com/explosion/spaCy.git (synced 2024-11-13 13:17:06 +03:00)

Merge branch 'master' into spacy.io
This commit is contained in: commit 89967f3701

.github/contributors/Jan-711.md (vendored, new file, +106)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry           |
| ------------------------------ | --------------- |
| Name                           | Jan Jessewitsch |
| Company name (if applicable)   |                 |
| Title or role (if applicable)  |                 |
| Date                           | 16.02.2020      |
| GitHub username                | Jan-711         |
| Website (optional)             |                 |
.github/contributors/MisterKeefe.md (vendored, new file, +106)
@@ -0,0 +1,106 @@
(Same standard spaCy contributor agreement text as in the file above, with both
signing statements left unmarked and the following contributor details.)

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry            |
| ------------------------------ | ---------------- |
| Name                           | Tom Keefe        |
| Company name (if applicable)   | /                |
| Title or role (if applicable)  | /                |
| Date                           | 18 February 2020 |
| GitHub username                | MisterKeefe      |
| Website (optional)             | /                |
@@ -1,5 +1,5 @@
 recursive-include include *.h
-recursive-include spacy *.txt
+recursive-include spacy *.txt *.pyx *.pxd
 include LICENSE
 include README.md
 include bin/spacy
@@ -26,12 +26,12 @@ DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
 HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""


-@st.cache(ignore_hash=True)
+@st.cache(allow_output_mutation=True)
 def load_model(name):
     return spacy.load(name)


-@st.cache(ignore_hash=True)
+@st.cache(allow_output_mutation=True)
 def process_text(model_name, text):
     nlp = load_model(model_name)
     return nlp(text)

@@ -79,7 +79,9 @@ if "ner" in nlp.pipe_names:
     st.header("Named Entities")
     st.sidebar.header("Named Entities")
     label_set = nlp.get_pipe("ner").labels
-    labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
+    labels = st.sidebar.multiselect(
+        "Entity labels", options=label_set, default=list(label_set)
+    )
     html = displacy.render(doc, style="ent", options={"ents": labels})
     # Newlines seem to mess with the rendering
     html = html.replace("\n", " ")
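The only functional change in this demo script is the caching decorator: newer Streamlit releases replaced `st.cache`'s `ignore_hash` argument with `allow_output_mutation`, which tells Streamlit not to hash or warn about in-place mutation of the cached object. A minimal sketch of the pattern, assuming Streamlit's legacy `st.cache` API and an installed `en_core_web_sm` model (both are assumptions, not part of this diff):

```python
import spacy
import streamlit as st


@st.cache(allow_output_mutation=True)  # don't hash or warn about mutation of the cached nlp object
def load_model(name):
    return spacy.load(name)


nlp = load_model("en_core_web_sm")
doc = nlp("Mark Zuckerberg is the CEO of Facebook.")
st.write([(ent.text, ent.label_) for ent in doc.ents])
```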
@@ -92,3 +92,5 @@ cdef enum attr_id_t:
     LANG
     ENT_KB_ID = symbols.ENT_KB_ID
     ENT_ID = symbols.ENT_ID
+
+    IDX
@@ -91,6 +91,7 @@ IDS = {
     "SPACY": SPACY,
     "PROB": PROB,
     "LANG": LANG,
+    "IDX": IDX
 }
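Registering `IDX` in the attribute enum and the `IDS` lookup table (together with the `get_token_attr` change further down) makes token character offsets retrievable through the array API. A small sketch, assuming a spaCy 2.x build that includes this change:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["An", "example", "sentence"])

# "IDX" now resolves to the new attribute ID; the array holds each token's start offset
offsets = doc.to_array("IDX")
print(list(offsets))  # [0, 3, 11]
```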
@@ -34,7 +34,7 @@ from .train import _load_pretrained_tok2vec
     vectors_model=("Name or path to spaCy model with vectors to learn from"),
     output_dir=("Directory to write models to on each epoch", "positional", None, str),
     width=("Width of CNN layers", "option", "cw", int),
-    depth=("Depth of CNN layers", "option", "cd", int),
+    conv_depth=("Depth of CNN layers", "option", "cd", int),
     cnn_window=("Window size for CNN layers", "option", "cW", int),
     cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int),
     use_chars=("Whether to use character-based embedding", "flag", "chr", bool),

@@ -84,7 +84,7 @@ def pretrain(
     vectors_model,
     output_dir,
     width=96,
-    depth=4,
+    conv_depth=4,
     bilstm_depth=0,
     cnn_pieces=3,
     sa_depth=0,

@@ -132,9 +132,15 @@ def pretrain(
     msg.info("Using GPU" if has_gpu else "Not using GPU")

     output_dir = Path(output_dir)
+    if output_dir.exists() and [p for p in output_dir.iterdir()]:
+        msg.warn(
+            "Output directory is not empty",
+            "It is better to use an empty directory or refer to a new output path, "
+            "then the new directory will be created for you.",
+        )
     if not output_dir.exists():
         output_dir.mkdir()
-        msg.good("Created output directory")
+        msg.good("Created output directory: {}".format(output_dir))
     srsly.write_json(output_dir / "config.json", config)
     msg.good("Saved settings to config.json")

@@ -162,7 +168,7 @@ def pretrain(
         Tok2Vec(
             width,
             embed_rows,
-            conv_depth=depth,
+            conv_depth=conv_depth,
             pretrained_vectors=pretrained_vectors,
             bilstm_depth=bilstm_depth,  # Requires PyTorch. Experimental.
             subword_features=not use_chars,  # Set to False for Chinese etc
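The rename is purely a surface change: the `-cd` option now fills `conv_depth`, which is passed straight through to the `Tok2Vec` model builder in the last hunk, so the flag name matches the model parameter. A hedged sketch of that call, assuming spaCy 2.x where `Tok2Vec` lives in `spacy._ml`; the values follow the defaults visible in these hunks:

```python
from spacy._ml import Tok2Vec

# Mirrors the arguments the pretrain CLI forwards to the model builder.
tok2vec = Tok2Vec(
    96,                     # width of the CNN layers
    2000,                   # embed_rows
    conv_depth=4,           # previously exposed as plain `depth`
    bilstm_depth=0,         # requires PyTorch if > 0
    subword_features=True,  # False when using character-based embeddings
)
```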
@@ -14,6 +14,7 @@ import contextlib
 import random

 from .._ml import create_default_optimizer
+from ..util import use_gpu as set_gpu
 from ..attrs import PROB, IS_OOV, CLUSTER, LANG
 from ..gold import GoldCorpus
 from ..compat import path2str

@@ -32,6 +33,13 @@ from .. import about
     pipeline=("Comma-separated names of pipeline components", "option", "p", str),
     replace_components=("Replace components from base model", "flag", "R", bool),
     vectors=("Model to load vectors from", "option", "v", str),
+    width=("Width of CNN layers of Tok2Vec component", "option", "cw", int),
+    conv_depth=("Depth of CNN layers of Tok2Vec component", "option", "cd", int),
+    cnn_window=("Window size for CNN layers of Tok2Vec component", "option", "cW", int),
+    cnn_pieces=("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int),
+    use_chars=("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool),
+    bilstm_depth=("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int),
+    embed_rows=("Number of embedding rows of Tok2Vec component", "option", "er", int),
     n_iter=("Number of iterations", "option", "n", int),
     n_early_stopping=("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int),
     n_examples=("Number of examples", "option", "ns", int),

@@ -63,6 +71,13 @@ def train(
     pipeline="tagger,parser,ner",
     replace_components=False,
     vectors=None,
+    width=96,
+    conv_depth=4,
+    cnn_window=1,
+    cnn_pieces=3,
+    use_chars=False,
+    bilstm_depth=0,
+    embed_rows=2000,
     n_iter=30,
     n_early_stopping=None,
     n_examples=0,

@@ -115,6 +130,7 @@ def train(
         )
     if not output_path.exists():
         output_path.mkdir()
+        msg.good("Created output directory: {}".format(output_path))

     # Take dropout and batch size as generators of values -- dropout
     # starts high and decays sharply, to force the optimizer to explore.

@@ -147,6 +163,18 @@ def train(
     disabled_pipes = None
     pipes_added = False
     msg.text("Training pipeline: {}".format(pipeline))
+    if use_gpu >= 0:
+        activated_gpu = None
+        try:
+            activated_gpu = set_gpu(use_gpu)
+        except Exception as e:
+            msg.warn("Exception: {}".format(e))
+        if activated_gpu is not None:
+            msg.text("Using GPU: {}".format(use_gpu))
+        else:
+            msg.warn("Unable to activate GPU: {}".format(use_gpu))
+            msg.text("Using CPU only")
+            use_gpu = -1
     if base_model:
         msg.text("Starting with base model '{}'".format(base_model))
         nlp = util.load_model(base_model)

@@ -237,7 +265,15 @@ def train(
         optimizer = create_default_optimizer(Model.ops)
     else:
         # Start with a blank model, call begin_training
-        optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
+        cfg = {"device": use_gpu}
+        cfg["conv_depth"] = conv_depth
+        cfg["token_vector_width"] = width
+        cfg["bilstm_depth"] = bilstm_depth
+        cfg["cnn_maxout_pieces"] = cnn_pieces
+        cfg["embed_size"] = embed_rows
+        cfg["conv_window"] = cnn_window
+        cfg["subword_features"] = not use_chars
+        optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg)

     nlp._optimizer = None

@@ -362,13 +398,19 @@ def train(
             if not batch:
                 continue
             docs, golds = zip(*batch)
-            nlp.update(
-                docs,
-                golds,
-                sgd=optimizer,
-                drop=next(dropout_rates),
-                losses=losses,
-            )
+            try:
+                nlp.update(
+                    docs,
+                    golds,
+                    sgd=optimizer,
+                    drop=next(dropout_rates),
+                    losses=losses,
+                )
+            except ValueError as e:
+                msg.warn("Error during training")
+                if init_tok2vec:
+                    msg.warn("Did you provide the same parameters during 'train' as during 'pretrain'?")
+                msg.fail("Original error message: {}".format(e), exits=1)
             if raw_text:
                 # If raw text is available, perform 'rehearsal' updates,
                 # which use unlabelled data to reduce overfitting.

@@ -495,6 +537,8 @@ def train(
                         "score = {}".format(best_score, current_score)
                     )
                     break
+    except Exception as e:
+        msg.warn("Aborting and saving the final best model. Encountered exception: {}".format(e))
     finally:
         best_pipes = nlp.pipe_names
         if disabled_pipes:
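The new command-line options are collected into a `cfg` dict and forwarded to `nlp.begin_training()`, so a blank model's Tok2Vec layer can be built with the same architecture that was used during `spacy pretrain`; the new `try`/`except` around `nlp.update()` then turns a mismatch into a readable error instead of a bare traceback. A sketch of the idea, assuming the spaCy 2.x `begin_training(**cfg)` API:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("tagger"))

# The keys mirror the cfg built above; they should match what was used with `spacy pretrain`.
optimizer = nlp.begin_training(
    device=-1,               # CPU
    token_vector_width=96,
    conv_depth=4,
    conv_window=1,
    cnn_maxout_pieces=3,
    embed_size=2000,
    bilstm_depth=0,
    subword_features=True,   # False when training with character-based embeddings
)
```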
@@ -144,10 +144,12 @@ def parse_deps(orig_doc, options={}):
         for span, tag, lemma, ent_type in spans:
             attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
             retokenizer.merge(span, attrs=attrs)
-    if options.get("fine_grained"):
-        words = [{"text": w.text, "tag": w.tag_} for w in doc]
-    else:
-        words = [{"text": w.text, "tag": w.pos_} for w in doc]
+    fine_grained = options.get("fine_grained")
+    add_lemma = options.get("add_lemma")
+    words = [{"text": w.text,
+              "tag": w.tag_ if fine_grained else w.pos_,
+              "lemma": w.lemma_ if add_lemma else None} for w in doc]

     arcs = []
     for word in doc:
         if word.i < word.head.i:
@@ -3,7 +3,7 @@ from __future__ import unicode_literals

 import uuid

-from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
+from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_WORDS_LEMMA, TPL_DEP_ARCS, TPL_ENTS
 from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
 from ..util import minify_html, escape_html, registry
 from ..errors import Errors

@@ -83,7 +83,7 @@ class DependencyRenderer(object):
         self.width = self.offset_x + len(words) * self.distance
         self.height = self.offset_y + 3 * self.word_spacing
         self.id = render_id
-        words = [self.render_word(w["text"], w["tag"], i) for i, w in enumerate(words)]
+        words = [self.render_word(w["text"], w["tag"], w.get("lemma", None), i) for i, w in enumerate(words)]
         arcs = [
             self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
             for i, a in enumerate(arcs)

@@ -101,7 +101,7 @@ class DependencyRenderer(object):
             lang=self.lang,
         )

-    def render_word(self, text, tag, i):
+    def render_word(self, text, tag, lemma, i,):
         """Render individual word.

         text (unicode): Word text.

@@ -114,6 +114,8 @@ class DependencyRenderer(object):
         if self.direction == "rtl":
             x = self.width - x
         html_text = escape_html(text)
+        if lemma is not None:
+            return TPL_DEP_WORDS_LEMMA.format(text=html_text, tag=tag, lemma=lemma, x=x, y=y)
         return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)

     def render_arrow(self, label, start, end, direction, i):
@@ -18,6 +18,15 @@ TPL_DEP_WORDS = """
 """


+TPL_DEP_WORDS_LEMMA = """
+<text class="displacy-token" fill="currentColor" text-anchor="middle" y="{y}">
+    <tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
+    <tspan class="displacy-lemma" dy="2em" fill="currentColor" x="{x}">{lemma}</tspan>
+    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
+</text>
+"""
+
+
 TPL_DEP_ARCS = """
 <g class="displacy-arrow">
     <path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
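Taken together, the three displacy hunks add an opt-in lemma row to the dependency visualisation: `parse_deps()` records `w.lemma_` when `add_lemma` is set, `render_word()` receives the lemma, and the new `TPL_DEP_WORDS_LEMMA` template draws it between the token text and its tag. A hedged usage sketch; the `en_core_web_sm` model is an assumption, while the option keys come from the diff itself:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")

# "add_lemma" switches render_word() to the TPL_DEP_WORDS_LEMMA template,
# adding a lemma line under each token; "fine_grained" keeps the detailed tags.
html = displacy.render(
    doc, style="dep", options={"add_lemma": True, "fine_grained": True}
)
```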
@@ -22,14 +22,14 @@ dort drei drin dritte dritten dritter drittes du durch durchaus dürfen dürft
 durfte durften

 eben ebenso ehrlich eigen eigene eigenen eigener eigenes ein einander eine
-einem einen einer eines einigeeinigen einiger einiges einmal einmaleins elf en
+einem einen einer eines einige einigen einiger einiges einmal einmaleins elf en
 ende endlich entweder er erst erste ersten erster erstes es etwa etwas euch

 früher fünf fünfte fünften fünfter fünftes für

 gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
 geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
-gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen
+gewesen gewollt geworden gibt ging gleich gross groß grosse große grossen
 großen grosser großer grosses großes gut gute guter gutes

 habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier

@@ -47,9 +47,8 @@ kleines kommen kommt können könnt konnte könnte konnten kurz
 lang lange leicht leider lieber los

 machen macht machte mag magst man manche manchem manchen mancher manches mehr
-mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel
-mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst
-musste mussten
+mein meine meinem meinen meiner meines mich mir mit mittel mochte möchte mochten
+mögen möglich mögt morgen muss muß müssen musst müsst musste mussten

 na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
 neuntes nicht nichts nie niemand niemandem niemanden noch nun nur
@@ -3,7 +3,7 @@
 from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .tag_map_general import TAG_MAP
+from ..tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .lemmatizer import GreekLemmatizer
@@ -1,27 +0,0 @@ (file deleted)
-# coding: utf8
-from __future__ import unicode_literals
-
-from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
-from ...symbols import PUNCT, NUM, AUX, X, ADJ, VERB, PART, SPACE, CCONJ
-
-
-TAG_MAP = {
-    "ADJ": {POS: ADJ},
-    "ADV": {POS: ADV},
-    "INTJ": {POS: INTJ},
-    "NOUN": {POS: NOUN},
-    "PROPN": {POS: PROPN},
-    "VERB": {POS: VERB},
-    "ADP": {POS: ADP},
-    "CCONJ": {POS: CCONJ},
-    "SCONJ": {POS: SCONJ},
-    "PART": {POS: PART},
-    "PUNCT": {POS: PUNCT},
-    "SYM": {POS: SYM},
-    "NUM": {POS: NUM},
-    "PRON": {POS: PRON},
-    "AUX": {POS: AUX},
-    "SPACE": {POS: SPACE},
-    "DET": {POS: DET},
-    "X": {POS: X},
-}
@@ -14,6 +14,7 @@ for exc_data in [
     {ORTH: "alv.", LEMMA: "arvonlisävero"},
     {ORTH: "ark.", LEMMA: "arkisin"},
     {ORTH: "as.", LEMMA: "asunto"},
+    {ORTH: "eaa.", LEMMA: "ennen ajanlaskun alkua"},
     {ORTH: "ed.", LEMMA: "edellinen"},
     {ORTH: "esim.", LEMMA: "esimerkki"},
     {ORTH: "huom.", LEMMA: "huomautus"},

@@ -27,6 +28,7 @@ for exc_data in [
     {ORTH: "läh.", LEMMA: "lähettäjä"},
     {ORTH: "miel.", LEMMA: "mieluummin"},
     {ORTH: "milj.", LEMMA: "miljoona"},
+    {ORTH: "Mm.", LEMMA: "muun muassa"},
     {ORTH: "mm.", LEMMA: "muun muassa"},
     {ORTH: "myöh.", LEMMA: "myöhempi"},
     {ORTH: "n.", LEMMA: "noin"},
@@ -3,6 +3,8 @@ from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
+from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
+from .punctuation import TOKENIZER_SUFFIXES

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS

@@ -24,6 +26,9 @@ class RomanianDefaults(Language.Defaults):
     )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
+    prefixes = TOKENIZER_PREFIXES
+    suffixes = TOKENIZER_SUFFIXES
+    infixes = TOKENIZER_INFIXES
     tag_map = TAG_MAP
spacy/lang/ro/punctuation.py (new file, +164)
@@ -0,0 +1,164 @@
# coding: utf8
from __future__ import unicode_literals

import itertools

from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
from ..char_classes import LIST_ICONS, CURRENCY
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT


_list_icons = [x for x in LIST_ICONS if x != "°"]
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]


_ro_variants = {
    "Ă": ["Ă", "A"],
    "Â": ["Â", "A"],
    "Î": ["Î", "I"],
    "Ș": ["Ș", "Ş", "S"],
    "Ț": ["Ț", "Ţ", "T"],
}


def _make_ro_variants(tokens):
    variants = []
    for token in tokens:
        upper_token = token.upper()
        upper_char_variants = [_ro_variants.get(c, [c]) for c in upper_token]
        upper_variants = ["".join(x) for x in itertools.product(*upper_char_variants)]
        for variant in upper_variants:
            variants.extend([variant, variant.lower(), variant.title()])
    return sorted(list(set(variants)))


# UD_Romanian-RRT closed class prefixes
# POS: ADP|AUX|CCONJ|DET|NUM|PART|PRON|SCONJ
_ud_rrt_prefixes = [
    "a-", "c-", "ce-", "cu-", "d-", "de-", "dintr-", "e-", "făr-", "i-",
    "l-", "le-", "m-", "mi-", "n-", "ne-", "p-", "pe-", "prim-", "printr-",
    "s-", "se-", "te-", "v-", "într-", "ș-", "și-", "ți-",
]
_ud_rrt_prefix_variants = _make_ro_variants(_ud_rrt_prefixes)


# UD_Romanian-RRT closed class suffixes without NUM
# POS: ADP|AUX|CCONJ|DET|PART|PRON|SCONJ
_ud_rrt_suffixes = [
    "-a", "-aceasta", "-ai", "-al", "-ale", "-alta", "-am", "-ar", "-astea",
    "-atâta", "-au", "-aș", "-ați", "-i", "-ilor", "-l", "-le", "-lea", "-mea",
    "-meu", "-mi", "-mă", "-n", "-ndărătul", "-ne", "-o", "-oi", "-or", "-s",
    "-se", "-si", "-te", "-ul", "-ului", "-un", "-uri", "-urile", "-urilor",
    "-veți", "-vă", "-ăștia", "-și", "-ți",
]
_ud_rrt_suffix_variants = _make_ro_variants(_ud_rrt_suffixes)


_prefixes = (
    ["§", "%", "=", "—", "–", r"\+(?![0-9])"]
    + _ud_rrt_prefix_variants
    + LIST_PUNCT
    + LIST_ELLIPSES
    + LIST_QUOTES
    + LIST_CURRENCY
    + LIST_ICONS
)


_suffixes = (
    _ud_rrt_suffix_variants
    + LIST_PUNCT
    + LIST_ELLIPSES
    + LIST_QUOTES
    + _list_icons
    + ["—", "–"]
    + [
        r"(?<=[0-9])\+",
        r"(?<=°[FfCcKk])\.",
        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
        r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
            al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
        ),
        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
    ]
)

_infixes = (
    LIST_ELLIPSES
    + _list_icons
    + [
        r"(?<=[0-9])[+\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
    ]
)

TOKENIZER_PREFIXES = _prefixes
TOKENIZER_SUFFIXES = _suffixes
TOKENIZER_INFIXES = _infixes
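The helper `_make_ro_variants()` does most of the work in this module: each prefix, suffix and (below) tokenizer exception is expanded into its diacritic, cedilla-versus-comma and case variants, so one entry covers the spellings found in real Romanian text. An illustrative call, assuming a spaCy build that ships this module:

```python
from spacy.lang.ro.punctuation import _make_ro_variants

# "și-" expands over the Ș/Ş/S letter forms combined with lower/upper/title case.
variants = _make_ro_variants(["și-"])
print(variants)  # nine variants: SI-/Si-/si-, ŞI-/Şi-/şi- and ȘI-/Și-/și-
```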
@@ -2,6 +2,7 @@
 from __future__ import unicode_literals

 from ...symbols import ORTH
+from .punctuation import _make_ro_variants


 _exc = {}

@@ -45,8 +46,52 @@ for orth in [
     "dpdv",
     "șamd.",
     "ș.a.m.d.",
+    # below: from UD_Romanian-RRT:
+    "A.c.", "A.f.", "A.r.", "Al.", "Art.", "Aug.", "Bd.", "Dem.", "Dr.",
+    "Fig.", "Fr.", "Gh.", "Gr.", "Lt.", "Nr.", "Obs.", "Prof.", "Sf.",
+    "a.m.", "a.r.", "alin.", "art.", "d-l", "d-lui", "d-nei", "ex.",
+    "fig.", "ian.", "lit.", "lt.", "p.a.", "p.m.", "pct.", "prep.",
+    "sf.", "tel.", "univ.", "îngr.", "într-adevăr", "Șt.", "ș.a.",
 ]:
-    _exc[orth] = [{ORTH: orth}]
+    # note: does not distinguish capitalized-only exceptions from others
+    for variant in _make_ro_variants([orth]):
+        _exc[variant] = [{ORTH: variant}]


 TOKENIZER_EXCEPTIONS = _exc
@@ -608,6 +608,7 @@ class Language(object):
             link_vectors_to_models(self.vocab)
             if self.vocab.vectors.data.shape[1]:
                 cfg["pretrained_vectors"] = self.vocab.vectors.name
+                cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
         if sgd is None:
             sgd = create_default_optimizer(Model.ops)
         self._optimizer = sgd
@@ -8,7 +8,7 @@ from ..language import component
 from ..errors import Errors
 from ..compat import basestring_
 from ..util import ensure_path, to_disk, from_disk
-from ..tokens import Span
+from ..tokens import Doc, Span
 from ..matcher import Matcher, PhraseMatcher

 DEFAULT_ENT_ID_SEP = "||"

@@ -162,6 +162,7 @@ class EntityRuler(object):
     @property
     def patterns(self):
         """Get all patterns that were added to the entity ruler.

         RETURNS (list): The original patterns, one dictionary per pattern.

         DOCS: https://spacy.io/api/entityruler#patterns

@@ -194,6 +195,7 @@ class EntityRuler(object):

         DOCS: https://spacy.io/api/entityruler#add_patterns
         """

         # disable the nlp components after this one in case they hadn't been initialized / deserialised yet
         try:
             current_index = self.nlp.pipe_names.index(self.name)

@@ -203,7 +205,33 @@ class EntityRuler(object):
         except ValueError:
             subsequent_pipes = []
         with self.nlp.disable_pipes(subsequent_pipes):
+            token_patterns = []
+            phrase_pattern_labels = []
+            phrase_pattern_texts = []
+            phrase_pattern_ids = []
+
             for entry in patterns:
+                if isinstance(entry["pattern"], basestring_):
+                    phrase_pattern_labels.append(entry["label"])
+                    phrase_pattern_texts.append(entry["pattern"])
+                    phrase_pattern_ids.append(entry.get("id"))
+                elif isinstance(entry["pattern"], list):
+                    token_patterns.append(entry)
+
+            phrase_patterns = []
+            for label, pattern, ent_id in zip(
+                phrase_pattern_labels,
+                self.nlp.pipe(phrase_pattern_texts),
+                phrase_pattern_ids
+            ):
+                phrase_pattern = {
+                    "label": label, "pattern": pattern, "id": ent_id
+                }
+                if ent_id:
+                    phrase_pattern["id"] = ent_id
+                phrase_patterns.append(phrase_pattern)
+
+            for entry in token_patterns + phrase_patterns:
                 label = entry["label"]
                 if "id" in entry:
                     ent_label = label

@@ -212,8 +240,8 @@ class EntityRuler(object):
                     self._ent_ids[key] = (ent_label, entry["id"])

                 pattern = entry["pattern"]
-                if isinstance(pattern, basestring_):
-                    self.phrase_patterns[label].append(self.nlp(pattern))
+                if isinstance(pattern, Doc):
+                    self.phrase_patterns[label].append(pattern)
                 elif isinstance(pattern, list):
                     self.token_patterns[label].append(pattern)
                 else:

@@ -226,6 +254,8 @@ class EntityRuler(object):
     def _split_label(self, label):
         """Split Entity label into ent_label and ent_id if it contains self.ent_id_sep
+
+        label (str): The value of label in a pattern entry

         RETURNS (tuple): ent_label, ent_id
         """
         if self.ent_id_sep in label:

@@ -239,6 +269,9 @@ class EntityRuler(object):
     def _create_label(self, label, ent_id):
         """Join Entity label with ent_id if the pattern has an `id` attribute
+
+        label (str): The label to set for ent.label_
+        ent_id (str): The label

         RETURNS (str): The ent_label joined with configured `ent_id_sep`
         """
         if isinstance(ent_id, basestring_):

@@ -250,6 +283,7 @@ class EntityRuler(object):
         patterns_bytes (bytes): The bytestring to load.
         **kwargs: Other config paramters, mostly for consistency.

         RETURNS (EntityRuler): The loaded entity ruler.

         DOCS: https://spacy.io/api/entityruler#from_bytes

@@ -292,6 +326,7 @@ class EntityRuler(object):
         path (unicode / Path): The JSONL file to load.
         **kwargs: Other config paramters, mostly for consistency.

         RETURNS (EntityRuler): The loaded entity ruler.

         DOCS: https://spacy.io/api/entityruler#from_disk
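The substance of the `EntityRuler` change is that string (phrase) patterns are now collected and fed through `self.nlp.pipe()` in one batch instead of calling `self.nlp()` once per pattern, which makes adding many phrase patterns considerably faster; the public `add_patterns()` API is unchanged. A usage sketch under spaCy 2.x, with a model-free blank pipeline assumed:

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.blank("en")
ruler = EntityRuler(nlp)

patterns = [
    # string patterns are tokenized in one nlp.pipe() batch internally
    {"label": "ORG", "pattern": "Explosion AI", "id": "explosion"},
    # token patterns go to the Matcher as before
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp("Explosion AI is based in San Francisco.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents])
```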
@@ -1044,6 +1044,7 @@ class TextCategorizer(Pipe):
             self.add_label(cat)
         if self.model is True:
             self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
+            self.cfg["pretrained_dims"] = kwargs.get("pretrained_dims")
             self.require_labels()
             self.model = self.Model(len(self.labels), **self.cfg)
             link_vectors_to_models(self.vocab)
@@ -463,3 +463,5 @@ cdef enum symbol_t:

     ENT_KB_ID
     ENT_ID
+
+    IDX
@@ -93,6 +93,7 @@ IDS = {
     "SPACY": SPACY,
     "PROB": PROB,
     "LANG": LANG,
+    "IDX": IDX,

     "ADJ": ADJ,
     "ADP": ADP,
@@ -66,3 +66,14 @@ def test_doc_array_to_from_string_attrs(en_vocab, attrs):
     words = ["An", "example", "sentence"]
     doc = Doc(en_vocab, words=words)
     Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
+
+
+def test_doc_array_idx(en_vocab):
+    """Test that Doc.to_array can retrieve token start indices"""
+    words = ["An", "example", "sentence"]
+    doc = Doc(en_vocab, words=words)
+    offsets = Doc(en_vocab, words=words).to_array("IDX")
+
+    assert offsets[0] == 0
+    assert offsets[1] == 3
+    assert offsets[2] == 11
@@ -7,7 +7,7 @@ import numpy
 from spacy.tokens import Doc, Span
 from spacy.vocab import Vocab
 from spacy.errors import ModelsWarning
-from spacy.attrs import ENT_TYPE, ENT_IOB
+from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP

 from ..util import get_doc

@@ -274,6 +274,39 @@ def test_doc_is_nered(en_vocab):
     assert new_doc.is_nered


+def test_doc_from_array_sent_starts(en_vocab):
+    words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
+    heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
+    deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"]
+    doc = Doc(en_vocab, words=words)
+    for i, (dep, head) in enumerate(zip(deps, heads)):
+        doc[i].dep_ = dep
+        doc[i].head = doc[head]
+        if head == i:
+            doc[i].is_sent_start = True
+    doc.is_parsed
+
+    attrs = [SENT_START, HEAD]
+    arr = doc.to_array(attrs)
+    new_doc = Doc(en_vocab, words=words)
+    with pytest.raises(ValueError):
+        new_doc.from_array(attrs, arr)
+
+    attrs = [SENT_START, DEP]
+    arr = doc.to_array(attrs)
+    new_doc = Doc(en_vocab, words=words)
+    new_doc.from_array(attrs, arr)
+    assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc]
+    assert not new_doc.is_parsed
+
+    attrs = [HEAD, DEP]
+    arr = doc.to_array(attrs)
+    new_doc = Doc(en_vocab, words=words)
+    new_doc.from_array(attrs, arr)
+    assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc]
+    assert new_doc.is_parsed
+
+
 def test_doc_lang(en_vocab):
     doc = Doc(en_vocab, words=["Hello", "world"])
     assert doc.lang_ == "en"
@@ -279,3 +279,12 @@ def test_filter_spans(doc):
     assert len(filtered[1]) == 5
     assert filtered[0].start == 1 and filtered[0].end == 4
     assert filtered[1].start == 5 and filtered[1].end == 10
+
+
+def test_span_eq_hash(doc, doc_not_parsed):
+    assert doc[0:2] == doc[0:2]
+    assert doc[0:2] != doc[1:3]
+    assert doc[0:2] != doc_not_parsed[0:2]
+    assert hash(doc[0:2]) == hash(doc[0:2])
+    assert hash(doc[0:2]) != hash(doc[1:3])
+    assert hash(doc[0:2]) != hash(doc_not_parsed[0:2])
@@ -10,28 +10,33 @@ ABBREVIATION_TESTS = [
         ["Hyvää", "uutta", "vuotta", "t.", "siht.", "Niemelä", "!"],
     ),
     ("Paino on n. 2.2 kg", ["Paino", "on", "n.", "2.2", "kg"]),
+    (
+        "Vuonna 1 eaa. tapahtui kauheita.",
+        ["Vuonna", "1", "eaa.", "tapahtui", "kauheita", "."],
+    ),
 ]

 HYPHENATED_TESTS = [
     (
-        "1700-luvulle sijoittuva taide-elokuva",
-        ["1700-luvulle", "sijoittuva", "taide-elokuva"],
+        "1700-luvulle sijoittuva taide-elokuva Wikimedia-säätiön Varsinais-Suomen",
+        [
+            "1700-luvulle",
+            "sijoittuva",
+            "taide-elokuva",
+            "Wikimedia-säätiön",
+            "Varsinais-Suomen",
+        ],
     )
 ]

 ABBREVIATION_INFLECTION_TESTS = [
     (
         "VTT:ssa ennen v:ta 2010 suoritetut mittaukset",
-        ["VTT:ssa", "ennen", "v:ta", "2010", "suoritetut", "mittaukset"]
+        ["VTT:ssa", "ennen", "v:ta", "2010", "suoritetut", "mittaukset"],
     ),
-    (
-        "ALV:n osuus on 24 %.",
-        ["ALV:n", "osuus", "on", "24", "%", "."]
-    ),
-    (
-        "Hiihtäjä oli kilpailun 14:s.",
-        ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]
-    )
+    ("ALV:n osuus on 24 %.", ["ALV:n", "osuus", "on", "24", "%", "."]),
+    ("Hiihtäjä oli kilpailun 14:s.", ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]),
+    ("EU:n toimesta tehtiin jotain.", ["EU:n", "toimesta", "tehtiin", "jotain", "."]),
 ]
@@ -31,10 +31,10 @@ def test_displacy_parse_deps(en_vocab):
     deps = displacy.parse_deps(doc)
     assert isinstance(deps, dict)
     assert deps["words"] == [
-        {"text": "This", "tag": "DET"},
-        {"text": "is", "tag": "AUX"},
-        {"text": "a", "tag": "DET"},
-        {"text": "sentence", "tag": "NOUN"},
+        {"lemma": None, "text": "This", "tag": "DET"},
+        {"lemma": None, "text": "is", "tag": "AUX"},
+        {"lemma": None, "text": "a", "tag": "DET"},
+        {"lemma": None, "text": "sentence", "tag": "NOUN"},
     ]
     assert deps["arcs"] == [
         {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
@@ -95,7 +95,11 @@ def assert_docs_equal(doc1, doc2):

     assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
     assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]
-    assert [ent for ent in doc1.ents] == [ent for ent in doc2.ents]
+    for ent1, ent2 in zip(doc1.ents, doc2.ents):
+        assert ent1.start == ent2.start
+        assert ent1.end == ent2.end
+        assert ent1.label == ent2.label
+        assert ent1.kb_id == ent2.kb_id


 def assert_packed_msg_equal(b1, b2):
@@ -23,7 +23,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
 from ..typedefs cimport attr_t, flags_t
 from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
 from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
-from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, attr_id_t
+from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
 from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t

 from ..attrs import intify_attrs, IDS
@@ -73,6 +73,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
         return token.ent_id
     elif feat_name == ENT_KB_ID:
         return token.ent_kb_id
+    elif feat_name == IDX:
+        return token.idx
     else:
         return Lexeme.get_struct_attr(token.lex, feat_name)

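As a rough illustration of what the new `IDX` branch enables, here is a sketch, assuming spaCy v2.2.4+ and that `Doc.to_array` resolves attributes through `get_token_attr`:

```python
import spacy
from spacy.attrs import ORTH, IDX

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")

# With IDX handled in get_token_attr, token character offsets can be
# exported alongside other attributes
arr = doc.to_array([ORTH, IDX])
print(arr[:, 1])  # start offset of each token in the original text
```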
@@ -813,7 +815,7 @@ cdef class Doc:
                 if attr_ids[j] != TAG:
                     Token.set_struct_attr(token, attr_ids[j], array[i, j])
         # Set flags
-        self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
+        self.is_parsed = bool(self.is_parsed or HEAD in attrs)
         self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
         # If document is parsed, set children
         if self.is_parsed:
@@ -127,22 +127,27 @@ cdef class Span:
                 return False
            else:
                 return True
-        # Eq
+        # <
         if op == 0:
             return self.start_char < other.start_char
+        # <=
         elif op == 1:
             return self.start_char <= other.start_char
+        # ==
         elif op == 2:
-            return self.start_char == other.start_char and self.end_char == other.end_char
+            return (self.doc, self.start_char, self.end_char, self.label, self.kb_id) == (other.doc, other.start_char, other.end_char, other.label, other.kb_id)
+        # !=
         elif op == 3:
-            return self.start_char != other.start_char or self.end_char != other.end_char
+            return (self.doc, self.start_char, self.end_char, self.label, self.kb_id) != (other.doc, other.start_char, other.end_char, other.label, other.kb_id)
+        # >
         elif op == 4:
             return self.start_char > other.start_char
+        # >=
         elif op == 5:
             return self.start_char >= other.start_char

     def __hash__(self):
-        return hash((self.doc, self.label, self.start_char, self.end_char))
+        return hash((self.doc, self.start_char, self.end_char, self.label, self.kb_id))

     def __len__(self):
         """Get the number of tokens in the span.
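To make the behavioral change concrete, here is a small sketch, assuming spaCy v2.2.4+ where `label` and `kb_id` take part in `Span` equality and hashing:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup")

# Same doc and same offsets, but different labels
span_org = Span(doc, 0, 1, label="ORG")
span_person = Span(doc, 0, 1, label="PERSON")

# With label and kb_id included in __richcmp__ and __hash__, these spans
# are no longer considered equal and no longer collide in sets or dicts
print(span_org == span_person)       # expected: False
print(len({span_org, span_person}))  # expected: 2
```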
@@ -283,7 +283,11 @@ cdef class Vectors:

         DOCS: https://spacy.io/api/vectors#add
         """
-        key = get_string_id(key)
+        # use int for all keys and rows in key2row for more efficient access
+        # and serialization
+        key = int(get_string_id(key))
+        if row is not None:
+            row = int(row)
         if row is None and key in self.key2row:
             row = self.key2row[key]
         elif row is None:
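A minimal sketch of the `Vectors.add` call affected here, assuming the standalone `spacy.vectors.Vectors` API: string keys are hashed and, after this change, stored as plain Python ints in `key2row`:

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(shape=(3, 4))
row = vectors.add("apple", vector=numpy.zeros((4,), dtype="f"))

# key2row now holds plain ints, which keeps lookups and serialization
# consistent regardless of how the key was passed in
print(row, list(vectors.key2row.items())[:1])
```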
@@ -239,6 +239,7 @@ If a setting is not present in the options, the default value will be used.
 | Name               | Type | Description | Default |
 | ------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
 | `fine_grained`     | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
+| `add_lemma`        | bool | Print the lemmas in a separate row below the token texts in the `dep` visualization. | `False` |
 | `collapse_punct`   | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
 | `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
 | `compact`          | bool | "Compact mode" with square arrows that takes up less space. | `False` |
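For reference, a short sketch of how the new `add_lemma` option would be passed to displaCy, assuming spaCy v2.2.4+ and an installed `en_core_web_sm` model:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
doc = nlp("This is a sentence")

# add_lemma prints the lemmas in a row below the token texts
html = displacy.render(doc, style="dep", options={"add_lemma": True, "compact": True})
```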
@@ -1096,6 +1096,33 @@ with the patterns. When you load the model back in, all pipeline components will
 be restored and deserialized – including the entity ruler. This lets you ship
 powerful model packages with binary weights _and_ rules included!

+### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"}
+
+When using a large number of **phrase patterns** (roughly > 10000), it's useful to understand how
+the `add_patterns` function of the EntityRuler works. For each **phrase pattern**, the EntityRuler
+calls the nlp object to construct a doc object. This happens so that, for example, an EntityRuler
+added at the end of an existing pipeline with a POS tagger can extract matches based on the
+pattern's POS signature. In this case you would pass a config value of `phrase_matcher_attr="POS"`
+for the EntityRuler.
+
+Running the full language pipeline across every pattern in a large list scales linearly and can
+therefore take a long time with a large number of phrase patterns.
+
+As of spaCy 2.2.4, the `add_patterns` function has been refactored to use `nlp.pipe` on all phrase
+patterns, resulting in about a 10x-20x speedup with 5,000-100,000 phrase patterns respectively.
+
+Even with this speedup (and especially if you're using an older version), the `add_patterns`
+function can still take a long time. An easy workaround to make it run faster is to disable the
+other language pipes while adding the phrase patterns:
+
+```python
+entityruler = EntityRuler(nlp)
+patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
+
+other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
+with nlp.disable_pipes(*other_pipes):
+    entityruler.add_patterns(patterns)
+```
+
 ## Combining models and rules {#models-rules}

 You can combine statistical and rule-based components in a variety of ways.
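As a complement to the snippet above, a hedged sketch of the `phrase_matcher_attr` case the docs mention: matching phrase patterns on their POS signature, which is exactly why each pattern has to be processed by the pipeline. It assumes an installed `en_core_web_sm` model and the spaCy v2.x `EntityRuler` API:

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")  # needs a tagger so POS tags are available

# phrase_matcher_attr="POS" matches phrase patterns on their POS signature,
# which is why add_patterns runs every pattern through the pipeline
ruler = EntityRuler(nlp, phrase_matcher_attr="POS")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
nlp.add_pipe(ruler)
```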
@@ -999,6 +999,17 @@
             "author": "Graphbrain",
             "category": ["standalone"]
         },
+        {
+            "type": "education",
+            "id": "nostarch-nlp-python",
+            "title": "Natural Language Processing Using Python",
+            "slogan": "No Starch Press, 2020",
+            "description": "Natural Language Processing Using Python is an introduction to natural language processing (NLP), the task of converting human language into data that a computer can process. The book uses spaCy, a leading Python library for NLP, to guide readers through common NLP tasks related to generating and understanding human language with code. It addresses problems like understanding a user's intent, continuing a conversation with a human, and maintaining the state of a conversation.",
+            "cover": "https://nostarch.com/sites/default/files/styles/uc_product_full/public/NaturalLanguageProcessing_final_v01.jpg",
+            "url": "https://nostarch.com/NLPPython",
+            "author": "Yuli Vasiliev",
+            "category": ["books"]
+        },
         {
             "type": "education",
             "id": "oreilly-python-ds",