Merge branch 'master' into feature/nel-wiki

Ines Montani 2019-07-09 21:57:47 +02:00 committed by GitHub
commit f2ea3e3ea2
36 changed files with 245809 additions and 523 deletions

.github/contributors/ameyuuno.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Alexey Kim |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-07-09 |
| GitHub username | ameyuuno |
| Website (optional) | https://ameyuuno.io |

.github/contributors/askhogan.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Patrick Hogan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 7/7/2019 |
| GitHub username | askhogan@gmail.com |
| Website (optional) | |

.github/contributors/khellan.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Knut O. Hellan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 02.07.2019 |
| GitHub username | khellan |
| Website (optional) | knuthellan.com |

.github/contributors/kognate.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Joshua B. Smith |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | July 7, 2019 |
| GitHub username | kognate |
| Website (optional) | |

.github/contributors/rokasramas.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ----------------------- |
| Name | Rokas Ramanauskas |
| Company name (if applicable) | TokenMill |
| Title or role (if applicable) | Software Engineer |
| Date | 2019-07-02 |
| GitHub username | rokasramas |
| Website (optional) | http://www.tokenmill.lt |

@@ -1,6 +1,6 @@
@ARTICLE{spacy2,
AUTHOR = {Honnibal, Matthew AND Montani, Ines},
TITLE = {spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing},
YEAR = {2017},
JOURNAL = {To appear}
@unpublished{spacy2,
AUTHOR = {Honnibal, Matthew and Montani, Ines},
TITLE = {{spaCy 2}: Natural language understanding with {B}loom embeddings, convolutional neural networks and incremental parsing},
YEAR = {2017},
Note = {To appear}
}

@@ -51,7 +51,6 @@ def filter_spans(spans):
def extract_currency_relations(doc):
# Merge entities and noun chunks into one token
seen_tokens = set()
spans = list(doc.ents) + list(doc.noun_chunks)
spans = filter_spans(spans)
with doc.retokenize() as retokenizer:

@@ -5,6 +5,7 @@ import plac
import random
import numpy
import time
import re
from collections import Counter
from pathlib import Path
from thinc.v2v import Affine, Maxout
@@ -23,19 +24,39 @@ from .train import _load_pretrained_tok2vec
@plac.annotations(
texts_loc=("Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the "
"key 'tokens'", "positional", None, str),
texts_loc=(
"Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the "
"key 'tokens'",
"positional",
None,
str,
),
vectors_model=("Name or path to spaCy model with vectors to learn from"),
output_dir=("Directory to write models to on each epoch", "positional", None, str),
width=("Width of CNN layers", "option", "cw", int),
depth=("Depth of CNN layers", "option", "cd", int),
embed_rows=("Number of embedding rows", "option", "er", int),
loss_func=("Loss function to use for the objective. Either 'L2' or 'cosine'", "option", "L", str),
loss_func=(
"Loss function to use for the objective. Either 'L2' or 'cosine'",
"option",
"L",
str,
),
use_vectors=("Whether to use the static vectors as input features", "flag", "uv"),
dropout=("Dropout rate", "option", "d", float),
batch_size=("Number of words per training batch", "option", "bs", int),
max_length=("Max words per example. Longer examples are discarded", "option", "xw", int),
min_length=("Min words per example. Shorter examples are discarded", "option", "nw", int),
max_length=(
"Max words per example. Longer examples are discarded",
"option",
"xw",
int,
),
min_length=(
"Min words per example. Shorter examples are discarded",
"option",
"nw",
int,
),
seed=("Seed for random number generators", "option", "s", int),
n_iter=("Number of iterations to pretrain", "option", "i", int),
n_save_every=("Save model every X batches.", "option", "se", int),
@@ -45,6 +66,13 @@ from .train import _load_pretrained_tok2vec
"t2v",
Path,
),
epoch_start=(
"The epoch to start counting at. Only relevant when using '--init-tok2vec' and the given weight file has been "
"renamed. Prevents unintended overwriting of existing weight files.",
"option",
"es",
int
),
)
def pretrain(
texts_loc,
@@ -63,6 +91,7 @@ def pretrain(
seed=0,
n_save_every=None,
init_tok2vec=None,
epoch_start=None,
):
"""
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
@@ -131,9 +160,29 @@ def pretrain(
if init_tok2vec is not None:
components = _load_pretrained_tok2vec(nlp, init_tok2vec)
msg.text("Loaded pretrained tok2vec for: {}".format(components))
# Parse the epoch number from the given weight file
model_name = re.search(r"model\d+\.bin", str(init_tok2vec))
if model_name:
# Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
epoch_start = int(model_name.group(0)[5:][:-4]) + 1
else:
if not epoch_start:
msg.fail(
"You have to use the '--epoch-start' argument when using a renamed weight file for "
"'--init-tok2vec'", exits=True
)
elif epoch_start < 0:
msg.fail(
"The argument '--epoch-start' has to be greater or equal to 0. '%d' is invalid" % epoch_start,
exits=True
)
else:
# Without '--init-tok2vec' the '--epoch-start' argument is ignored
epoch_start = 0
optimizer = create_default_optimizer(model.ops)
tracker = ProgressTracker(frequency=10000)
msg.divider("Pre-training tok2vec layer")
msg.divider("Pre-training tok2vec layer - starting at epoch %d" % epoch_start)
row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")}
msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings)
@@ -154,7 +203,7 @@ def pretrain(
file_.write(srsly.json_dumps(log) + "\n")
skip_counter = 0
for epoch in range(n_iter):
for epoch in range(epoch_start, n_iter + epoch_start):
for batch_id, batch in enumerate(
util.minibatch_by_words(((text, None) for text in texts), size=batch_size)
):
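
The resume logic above recovers the epoch number from default weight file names of the form `model<N>.bin`. A standalone sketch of the same behaviour (the helper name and paths are illustrative, not part of the commit):

```python
import re

def infer_epoch_start(init_tok2vec):
    # Default weight files are written as 'model<N>.bin'; resume at N + 1
    match = re.search(r"model(\d+)\.bin", str(init_tok2vec))
    if match:
        return int(match.group(1)) + 1
    # Renamed files carry no epoch number, so --epoch-start must be passed
    return None

assert infer_epoch_start("output/model14.bin") == 15
assert infer_epoch_start("output/best-weights.bin") is None
```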

@@ -116,7 +116,7 @@ def parse_deps(orig_doc, options={}):
doc (Doc): Document to parse.
RETURNS (dict): Generated dependency parse keyed by words and arcs.
"""
doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes(exclude=["user_data"]))
if not doc.is_parsed:
user_warning(Warnings.W005)
if options.get("collapse_phrases", False):
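
The added `exclude=["user_data"]` matters because `user_data` may hold arbitrary Python objects that msgpack cannot encode. A minimal sketch of the failure mode, assuming a set stored in `user_data`:

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Hello", "world"])
doc.user_data["cache"] = set()  # sets are not msgpack-serializable

# Without the exclude, to_bytes() would try to serialize the set and fail;
# with it, the copy used for rendering is created safely.
copy = Doc(doc.vocab).from_bytes(doc.to_bytes(exclude=["user_data"]))
```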

@@ -537,6 +537,7 @@ for orth in [
"Sen.",
"St.",
"vs.",
"v.s."
]:
_exc[orth] = [{ORTH: orth}]
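
With the new exception in place, `v.s.` should survive tokenization as a single token; a quick check against a blank English pipeline:

```python
import spacy

nlp = spacy.blank("en")
assert [t.text for t in nlp("cats v.s. dogs")] == ["cats", "v.s.", "dogs"]
```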

@@ -5,7 +5,7 @@ from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.en.examples import sentences
>>> from spacy.lang.id.examples import sentences
>>> docs = nlp.pipe(sentences)
"""

@@ -1,15 +1,37 @@
# coding: utf8
from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .tag_map import TAG_MAP
from .lemmatizer import LOOKUP
from .morph_rules import MORPH_RULES
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
def _return_lt(_):
return "lt"
class LithuanianDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "lt"
lex_attr_getters[LANG] = _return_lt
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
)
lex_attr_getters.update(LEX_ATTRS)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
tag_map = TAG_MAP
morph_rules = MORPH_RULES
lemma_lookup = LOOKUP
class Lithuanian(Language):
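
A plausible motivation for replacing the lambda with the module-level `_return_lt` is that language defaults must be picklable (for example for multiprocessing); module-level functions pickle by reference, lambdas do not. This rationale is an inference, not stated in the diff:

```python
import pickle

def _return_lt(_):
    return "lt"

pickle.dumps(_return_lt)  # works: named module-level functions pickle by reference

try:
    pickle.dumps(lambda text: "lt")
except (pickle.PicklingError, AttributeError) as err:
    print("lambdas cannot be pickled:", err)
```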

spacy/lang/lt/examples.py (new file)

@@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.lt.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Jaunikis pirmąją vestuvinę naktį iškeitė į areštinės gultą",
"Bepiločiai automobiliai išnaikins vairavimo mokyklas, autoservisus ir eismo nelaimes",
"Vilniuje galvojama uždrausti naudoti skėčius",
"Londonas yra didelis miestas Jungtinėje Karalystėje",
"Kur tu?",
"Kas yra Prancūzijos prezidentas?",
"Kokia yra Jungtinių Amerikos Valstijų sostinė?",
"Kada gimė Dalia Grybauskaitė?",
]

spacy/lang/lt/lemmatizer.py (new file, 234,227 lines)

File diff suppressed because it is too large.

spacy/lang/lt/lex_attrs.py (new file, 1,153 lines)

File diff suppressed because it is too large.

spacy/lang/lt/morph_rules.py (new file, 3,075 lines)

File diff suppressed because it is too large.

File diff suppressed because it is too large.

spacy/lang/lt/tag_map.py (new file, 4,798 lines)

File diff suppressed because it is too large.

@@ -0,0 +1,268 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH
_exc = {}
for orth in [
"G.",
"J. E.",
"J. Em.",
"J.E.",
"J.Em.",
"K.",
"N.",
"V.",
"Vt.",
"a.",
"a.k.",
"a.s.",
"adv.",
"akad.",
"aklg.",
"akt.",
"al.",
"ang.",
"angl.",
"aps.",
"apskr.",
"apyg.",
"arbat.",
"asist.",
"asm.",
"asm.k.",
"asmv.",
"atk.",
"atsak.",
"atsisk.",
"atsisk.sąsk.",
"atv.",
"aut.",
"avd.",
"b.k.",
"baud.",
"biol.",
"bkl.",
"bot.",
"bt.",
"buv.",
"ch.",
"chem.",
"corp.",
"d.",
"dab.",
"dail.",
"dek.",
"deš.",
"dir.",
"dirig.",
"doc.",
"dol.",
"dr.",
"drp.",
"dvit.",
"dėst.",
"dš.",
"dž.",
"e.b.",
"e.bankas",
"e.p.",
"e.parašas",
"e.paštas",
"e.v.",
"e.valdžia",
"egz.",
"eil.",
"ekon.",
"el.",
"el.bankas",
"el.p.",
"el.parašas",
"el.paštas",
"el.valdžia",
"etc.",
"ež.",
"fak.",
"faks.",
"feat.",
"filol.",
"filos.",
"g.",
"gen.",
"geol.",
"gerb.",
"gim.",
"gr.",
"gv.",
"gyd.",
"gyv.",
"habil.",
"inc.",
"insp.",
"inž.",
"ir pan.",
"ir t. t.",
"isp.",
"istor.",
"it.",
"just.",
"k.",
"k. a.",
"k.a.",
"kab.",
"kand.",
"kart.",
"kat.",
"ketv.",
"kh.",
"kl.",
"kln.",
"km.",
"kn.",
"koresp.",
"kpt.",
"kr.",
"kt.",
"kub.",
"kun.",
"kv.",
"kyš.",
"l. e. p.",
"l.e.p.",
"lenk.",
"liet.",
"lot.",
"lt.",
"ltd.",
"ltn.",
"m.",
"m.e..",
"m.m.",
"mat.",
"med.",
"mgnt.",
"mgr.",
"min.",
"mjr.",
"ml.",
"mln.",
"mlrd.",
"mob.",
"mok.",
"moksl.",
"mokyt.",
"mot.",
"mr.",
"mst.",
"mstl.",
"mėn.",
"nkt.",
"no.",
"nr.",
"ntk.",
"nuotr.",
"op.",
"org.",
"orig.",
"p.",
"p.d.",
"p.m.e.",
"p.s.",
"pab.",
"pan.",
"past.",
"pav.",
"pavad.",
"per.",
"perd.",
"pirm.",
"pl.",
"plg.",
"plk.",
"pr.",
"pr.Kr.",
"pranc.",
"proc.",
"prof.",
"prom.",
"prot.",
"psl.",
"pss.",
"pvz.",
"pšt.",
"r.",
"raj.",
"red.",
"rez.",
"rež.",
"rus.",
"rš.",
"s.",
"sav.",
"saviv.",
"sek.",
"sekr.",
"sen.",
"sh.",
"sk.",
"skg.",
"skv.",
"skyr.",
"sp.",
"spec.",
"sr.",
"st.",
"str.",
"stud.",
"sąs.",
"t.",
"t. p.",
"t. y.",
"t.p.",
"t.t.",
"t.y.",
"techn.",
"tel.",
"teol.",
"th.",
"tir.",
"trit.",
"trln.",
"tšk.",
"tūks.",
"tūkst.",
"up.",
"upl.",
"v.s.",
"vad.",
"val.",
"valg.",
"ved.",
"vert.",
"vet.",
"vid.",
"virš.",
"vlsč.",
"vnt.",
"vok.",
"vs.",
"vtv.",
"vv.",
"vyr.",
"vyresn.",
"zool.",
"Įn",
"įl.",
"š.m.",
"šnek.",
"šv.",
"švč.",
"ž.ū.",
"žin.",
"žml.",
"žr.",
]:
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = _exc

@@ -22,6 +22,7 @@ NOUN_RULES = [
VERB_RULES = [
["er", "e"], # vasker -> vaske
["et", "e"], # vasket -> vaske
["a", "e"], # vaska -> vaske
["es", "e"], # vaskes -> vaske
["te", "e"], # stekte -> steke
["år", "å"], # får -> få

@@ -10,7 +10,15 @@ _exc = {}
for exc_data in [
{ORTH: "jan.", LEMMA: "januar"},
{ORTH: "feb.", LEMMA: "februar"},
{ORTH: "mar.", LEMMA: "mars"},
{ORTH: "apr.", LEMMA: "april"},
{ORTH: "jun.", LEMMA: "juni"},
{ORTH: "jul.", LEMMA: "juli"},
{ORTH: "aug.", LEMMA: "august"},
{ORTH: "sep.", LEMMA: "september"},
{ORTH: "okt.", LEMMA: "oktober"},
{ORTH: "nov.", LEMMA: "november"},
{ORTH: "des.", LEMMA: "desember"},
]:
_exc[exc_data[ORTH]] = [exc_data]
@@ -18,11 +26,13 @@ for exc_data in [
for orth in [
"adm.dir.",
"a.m.",
"andelsnr",
"Aq.",
"b.c.",
"bl.a.",
"bla.",
"bm.",
"bnr.",
"bto.",
"ca.",
"cand.mag.",
@@ -41,6 +51,7 @@ for orth in [
"el.",
"e.l.",
"et.",
"etc.",
"etg.",
"ev.",
"evt.",
@@ -76,6 +87,7 @@ for orth in [
"kgl.res.",
"kl.",
"komm.",
"kr.",
"kst.",
"lø.",
"ma.",
@@ -106,6 +118,7 @@ for orth in [
"o.l.",
"on.",
"op.",
"org."
"osv.",
"ovf.",
"p.",
@@ -130,6 +143,7 @@ for orth in [
"sep.",
"siviling.",
"sms.",
"snr.",
"spm.",
"sr.",
"sst.",

spacy/lang/sq/examples.py (new file)

@@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.sq.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Apple po shqyrton blerjen e nje shoqërie të U.K. për 1 miliard dollarë",
"Makinat autonome ndryshojnë përgjegjësinë e sigurimit ndaj prodhuesve",
"San Francisko konsideron ndalimin e robotëve të shpërndarjes",
"Londra është një qytet i madh në Mbretërinë e Bashkuar.",
]

@@ -1,15 +1,17 @@
# coding: utf8
from __future__ import unicode_literals
from collections import defaultdict
from collections import defaultdict, OrderedDict
import srsly
from ..errors import Errors
from ..compat import basestring_
from ..util import ensure_path
from ..util import ensure_path, to_disk, from_disk
from ..tokens import Span
from ..matcher import Matcher, PhraseMatcher
DEFAULT_ENT_ID_SEP = '||'
class EntityRuler(object):
"""The EntityRuler lets you add spans to the `Doc.ents` using token-based
@@ -24,7 +26,7 @@ class EntityRuler(object):
name = "entity_ruler"
def __init__(self, nlp, **cfg):
def __init__(self, nlp, phrase_matcher_attr=None, **cfg):
"""Initialize the entitiy ruler. If patterns are supplied here, they
need to be a list of dictionaries with a `"label"` and `"pattern"`
key. A pattern can either be a token pattern (list) or a phrase pattern
@@ -32,6 +34,8 @@ class EntityRuler(object):
nlp (Language): The shared nlp object to pass the vocab to the matchers
and process phrase patterns.
phrase_matcher_attr (int / unicode): Token attribute to match on, passed
to the internal PhraseMatcher as `attr`
patterns (iterable): Optional patterns to load in.
overwrite_ents (bool): If existing entities are present, e.g. entities
added by the model, overwrite them by matches if necessary.
@@ -47,8 +51,13 @@ class EntityRuler(object):
self.token_patterns = defaultdict(list)
self.phrase_patterns = defaultdict(list)
self.matcher = Matcher(nlp.vocab)
self.phrase_matcher = PhraseMatcher(nlp.vocab)
self.ent_id_sep = cfg.get("ent_id_sep", "||")
if phrase_matcher_attr is not None:
self.phrase_matcher_attr = phrase_matcher_attr
self.phrase_matcher = PhraseMatcher(nlp.vocab, attr=self.phrase_matcher_attr)
else:
self.phrase_matcher_attr = None
self.phrase_matcher = PhraseMatcher(nlp.vocab)
self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
patterns = cfg.get("patterns")
if patterns is not None:
self.add_patterns(patterns)
@@ -212,8 +221,17 @@ class EntityRuler(object):
DOCS: https://spacy.io/api/entityruler#from_bytes
"""
patterns = srsly.msgpack_loads(patterns_bytes)
self.add_patterns(patterns)
cfg = srsly.msgpack_loads(patterns_bytes)
if isinstance(cfg, dict):
self.add_patterns(cfg.get('patterns', cfg))
self.overwrite = cfg.get('overwrite', False)
self.phrase_matcher_attr = cfg.get('phrase_matcher_attr', None)
if self.phrase_matcher_attr is not None:
self.phrase_matcher = PhraseMatcher(self.nlp.vocab,
attr=self.phrase_matcher_attr)
self.ent_id_sep = cfg.get('ent_id_sep', DEFAULT_ENT_ID_SEP)
else:
self.add_patterns(cfg)
return self
def to_bytes(self, **kwargs):
@@ -223,7 +241,13 @@ class EntityRuler(object):
DOCS: https://spacy.io/api/entityruler#to_bytes
"""
return srsly.msgpack_dumps(self.patterns)
serial = OrderedDict((
('overwrite', self.overwrite),
('ent_id_sep', self.ent_id_sep),
('phrase_matcher_attr', self.phrase_matcher_attr),
('patterns', self.patterns)))
return srsly.msgpack_dumps(serial)
def from_disk(self, path, **kwargs):
"""Load the entity ruler from a file. Expects a file containing
@@ -236,9 +260,23 @@ class EntityRuler(object):
DOCS: https://spacy.io/api/entityruler#from_disk
"""
path = ensure_path(path)
path = path.with_suffix(".jsonl")
patterns = srsly.read_jsonl(path)
self.add_patterns(patterns)
if path.is_file():
patterns = srsly.read_jsonl(path)
self.add_patterns(patterns)
else:
cfg = {}
deserializers = {
'patterns': lambda p: self.add_patterns(srsly.read_jsonl(p.with_suffix('.jsonl'))),
'cfg': lambda p: cfg.update(srsly.read_json(p))
}
from_disk(path, deserializers, {})
self.overwrite = cfg.get('overwrite', False)
self.phrase_matcher_attr = cfg.get('phrase_matcher_attr')
self.ent_id_sep = cfg.get('ent_id_sep', DEFAULT_ENT_ID_SEP)
if self.phrase_matcher_attr is not None:
self.phrase_matcher = PhraseMatcher(self.nlp.vocab,
attr=self.phrase_matcher_attr)
return self
def to_disk(self, path, **kwargs):
@@ -251,6 +289,13 @@ class EntityRuler(object):
DOCS: https://spacy.io/api/entityruler#to_disk
"""
cfg = {'overwrite': self.overwrite,
'phrase_matcher_attr': self.phrase_matcher_attr,
'ent_id_sep': self.ent_id_sep}
serializers = {
'patterns': lambda p: srsly.write_jsonl(p.with_suffix('.jsonl'),
self.patterns),
'cfg': lambda p: srsly.write_json(p, cfg)
}
path = ensure_path(path)
path = path.with_suffix(".jsonl")
srsly.write_jsonl(path, self.patterns)
to_disk(path, serializers, {})
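
Taken together, these changes let the ruler's configuration survive a serialization round-trip. A minimal sketch (the label and pattern are illustrative):

```python
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")  # case-insensitive phrases
ruler.add_patterns([{"label": "ORG", "pattern": "apple"}])

new_ruler = EntityRuler(nlp).from_bytes(ruler.to_bytes())
assert new_ruler.phrase_matcher_attr == "LOWER"
assert new_ruler.patterns == ruler.patterns
```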

@@ -1003,7 +1003,7 @@ cdef class DependencyParser(Parser):
@property
def postprocesses(self):
return [nonproj.deprojectivize] # , merge_subtokens]
return [nonproj.deprojectivize]
def add_multitask_objective(self, target):
if target == "cloze":

@@ -52,6 +52,7 @@ class Scorer(object):
self.labelled = PRFScore()
self.tags = PRFScore()
self.ner = PRFScore()
self.ner_per_ents = dict()
self.eval_punct = eval_punct
@property
@@ -104,6 +105,15 @@ class Scorer(object):
"ents_f": self.ents_f,
"tags_acc": self.tags_acc,
"token_acc": self.token_acc,
"ents_per_type": self.__scores_per_ents(),
}
def __scores_per_ents(self):
"""RETURNS (dict): Scores per NER entity
"""
return {
k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100}
for k, v in self.ner_per_ents.items()
}
def score(self, doc, gold, verbose=False, punct_labels=("p", "punct")):
@@ -149,13 +159,31 @@ class Scorer(object):
cand_deps.add((gold_i, gold_head, token.dep_.lower()))
if "-" not in [token[-1] for token in gold.orig_annot]:
cand_ents = set()
current_ent = {k.label_: set() for k in doc.ents}
current_gold = {k.label_: set() for k in doc.ents}
for ent in doc.ents:
if ent.label_ not in self.ner_per_ents:
self.ner_per_ents[ent.label_] = PRFScore()
first = gold.cand_to_gold[ent.start]
last = gold.cand_to_gold[ent.end - 1]
if first is None or last is None:
self.ner.fp += 1
self.ner_per_ents[ent.label_].fp += 1
else:
cand_ents.add((ent.label_, first, last))
current_ent[ent.label_].add(
tuple(x for x in cand_ents if x[0] == ent.label_)
)
current_gold[ent.label_].add(
tuple(x for x in gold_ents if x[0] == ent.label_)
)
# Scores per ent
[
v.score_set(current_ent[k], current_gold[k])
for k, v in self.ner_per_ents.items()
if k in current_ent
]
# Score for all ents
self.ner.score_set(cand_ents, gold_ents)
self.tags.score_set(cand_tags, gold_tags)
self.labelled.score_set(cand_deps, gold_deps)
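
A short sketch of how the new per-type scores surface through `Scorer.scores` (the spans and values are illustrative):

```python
from spacy.gold import GoldParse
from spacy.lang.en import English
from spacy.scorer import Scorer
from spacy.tokens import Span

nlp = English()
doc = nlp.make_doc("Ada Lovelace lived in London")
gold = GoldParse(doc, entities=[(0, 12, "PERSON"), (22, 28, "GPE")])
# Pretend the pipeline predicted PERSON correctly but missed GPE
doc.ents = [Span(doc, 0, 2, label=doc.vocab.strings.add("PERSON"))]

scorer = Scorer()
scorer.score(doc, gold)
print(scorer.scores["ents_per_type"])
# e.g. {'PERSON': {'p': 100.0, 'r': 100.0, 'f': 100.0}}
```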

@@ -124,6 +124,16 @@ def ja_tokenizer():
return get_lang_class("ja").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def lt_tokenizer():
return get_lang_class("lt").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def lt_lemmatizer():
return get_lang_class("lt").Defaults.create_lemmatizer()
@pytest.fixture(scope="session")
def nb_tokenizer():
return get_lang_class("nb").Defaults.create_tokenizer()

@@ -0,0 +1,15 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize("tokens,lemmas", [
(["Galime", "vadinti", "gerovės", "valstybe", ",", "turime", "išvystytą", "socialinę", "apsaugą", ",",
"sveikatos", "apsaugą", "ir", "prieinamą", "švietimą", "."],
["galėti", "vadintas", "gerovė", "valstybė", ",", "turėti", "išvystytas", "socialinis",
"apsauga", ",", "sveikata", "apsauga", "ir", "prieinamas", "švietimas", "."]),
(["taip", ",", "uoliai", "tyrinėjau", "ir", "pasirinkau", "geriausią", "variantą", "."],
["taip", ",", "uolus", "tyrinėti", "ir", "pasirinkti", "geras", "variantas", "."])])
def test_lt_lemmatizer(lt_lemmatizer, tokens, lemmas):
assert lemmas == [lt_lemmatizer.lookup(token) for token in tokens]

@@ -0,0 +1,44 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
def test_lt_tokenizer_handles_long_text(lt_tokenizer):
text = """Tokios sausros kriterijus atitinka pirmadienį atlikti skaičiavimai, palyginus faktinį ir žemiausią
vidutinį daugiametį vandens lygį. Nustatyta, kad 48 šalies vandens matavimo stočių 28-iose stotyse vandens lygis
yra žemesnis arba lygus žemiausiam vidutiniam daugiamečiam šiltojo laikotarpio vandens lygiui."""
tokens = lt_tokenizer(text.replace("\n", ""))
assert len(tokens) == 42
@pytest.mark.parametrize('text,length', [
("177R Parodų rūmaiOzo g. nuo vasario 18 d. bus skelbiamas interneto tinklalapyje.", 15),
("ISM universiteto doc. dr. Ieva Augutytė-Kvedaravičienė pastebi, kad tyrimais nustatyti elgesio pokyčiai.", 16)])
def test_lt_tokenizer_handles_punct_abbrev(lt_tokenizer, text, length):
tokens = lt_tokenizer(text)
assert len(tokens) == length
@pytest.mark.parametrize("text", ["km.", "pvz.", "biol."])
def test_lt_tokenizer_abbrev_exceptions(lt_tokenizer, text):
tokens = lt_tokenizer(text)
assert len(tokens) == 1
@pytest.mark.parametrize("text,match", [
("10", True),
("1", True),
("10,000", True),
("10,00", True),
("999.0", True),
("vienas", True),
("du", True),
("milijardas", True),
("šuo", False),
(",", False),
("1/2", True)])
def test_lt_lex_attrs_like_number(lt_tokenizer, text, match):
tokens = lt_tokenizer(text)
assert len(tokens) == 1
assert tokens[0].like_num == match

@@ -106,5 +106,24 @@ def test_entity_ruler_serialize_bytes(nlp, patterns):
assert len(new_ruler) == 0
assert len(new_ruler.labels) == 0
new_ruler = new_ruler.from_bytes(ruler_bytes)
assert len(new_ruler) == len(patterns)
assert len(new_ruler.labels) == 4
assert len(new_ruler.patterns) == len(ruler.patterns)
for pattern in ruler.patterns:
assert pattern in new_ruler.patterns
assert new_ruler.labels == ruler.labels
def test_entity_ruler_serialize_phrase_matcher_attr_bytes(nlp, patterns):
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER", patterns=patterns)
assert len(ruler) == len(patterns)
assert len(ruler.labels) == 4
ruler_bytes = ruler.to_bytes()
new_ruler = EntityRuler(nlp)
assert len(new_ruler) == 0
assert len(new_ruler.labels) == 0
assert new_ruler.phrase_matcher_attr is None
new_ruler = new_ruler.from_bytes(ruler_bytes)
assert len(new_ruler) == len(patterns)
assert len(new_ruler.labels) == 4
assert new_ruler.phrase_matcher_attr == "LOWER"

@@ -0,0 +1,86 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.tokens import Span
from spacy.language import Language
from spacy.pipeline import EntityRuler
from spacy import load
import srsly
from ..util import make_tempdir
@pytest.fixture
def patterns():
return [
{"label": "HELLO", "pattern": "hello world"},
{"label": "BYE", "pattern": [{"LOWER": "bye"}, {"LOWER": "bye"}]},
{"label": "HELLO", "pattern": [{"ORTH": "HELLO"}]},
{"label": "COMPLEX", "pattern": [{"ORTH": "foo", "OP": "*"}]},
{"label": "TECH_ORG", "pattern": "Apple", "id": "a1"},
]
@pytest.fixture
def add_ent():
def add_ent_component(doc):
doc.ents = [Span(doc, 0, 3, label=doc.vocab.strings["ORG"])]
return doc
return add_ent_component
def test_entity_ruler_existing_overwrite_serialize_bytes(patterns, en_vocab):
nlp = Language(vocab=en_vocab)
ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
ruler_bytes = ruler.to_bytes()
assert len(ruler) == len(patterns)
assert len(ruler.labels) == 4
assert ruler.overwrite
new_ruler = EntityRuler(nlp)
new_ruler = new_ruler.from_bytes(ruler_bytes)
assert len(new_ruler) == len(ruler)
assert len(new_ruler.labels) == 4
assert new_ruler.overwrite == ruler.overwrite
assert new_ruler.ent_id_sep == ruler.ent_id_sep
def test_entity_ruler_existing_bytes_old_format_safe(patterns, en_vocab):
nlp = Language(vocab=en_vocab)
ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
bytes_old_style = srsly.msgpack_dumps(ruler.patterns)
new_ruler = EntityRuler(nlp)
new_ruler = new_ruler.from_bytes(bytes_old_style)
assert len(new_ruler) == len(ruler)
for pattern in ruler.patterns:
assert pattern in new_ruler.patterns
assert new_ruler.overwrite is not ruler.overwrite
def test_entity_ruler_from_disk_old_format_safe(patterns, en_vocab):
nlp = Language(vocab=en_vocab)
ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True)
with make_tempdir() as tmpdir:
out_file = tmpdir / "entity_ruler.jsonl"
srsly.write_jsonl(out_file, ruler.patterns)
new_ruler = EntityRuler(nlp)
new_ruler = new_ruler.from_disk(out_file)
for pattern in ruler.patterns:
assert pattern in new_ruler.patterns
assert len(new_ruler) == len(ruler)
assert new_ruler.overwrite is not ruler.overwrite
def test_entity_ruler_in_pipeline_from_issue(patterns, en_vocab):
nlp = Language(vocab=en_vocab)
ruler = EntityRuler(nlp, overwrite_ents=True)
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
nlp.add_pipe(ruler)
with make_tempdir() as tmpdir:
nlp.to_disk(tmpdir)
assert nlp.pipeline[-1][-1].patterns == [{"label": "ORG", "pattern": "Apple"}]
assert nlp.pipeline[-1][-1].overwrite is True
nlp2 = load(tmpdir)
assert nlp2.pipeline[-1][-1].patterns == [{"label": "ORG", "pattern": "Apple"}]
assert nlp2.pipeline[-1][-1].overwrite is True

@@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.displacy import parse_deps
from spacy.tokens import Doc
def test_issue3882(en_vocab):
"""Test that displaCy doesn't serialize the doc.user_data when making a
copy of the Doc.
"""
doc = Doc(en_vocab, words=["Hello", "world"])
doc.is_parsed = True
doc.user_data["test"] = set()
parse_deps(doc)

@@ -284,9 +284,9 @@ same between pretraining and training. The API and errors around this need some
improvement.
```bash
$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width]
[--depth] [--embed-rows] [--loss_func] [--dropout] [--seed] [--n-iter] [--use-vectors]
[--n-save_every]
$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
[--width] [--depth] [--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length] [--min-length]
[--seed] [--n-iter] [--use-vectors] [--n-save_every] [--init-tok2vec] [--epoch-start]
```
| Argument | Type | Description |
@@ -306,7 +306,8 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width]
| `--n-iter`, `-i` | option | Number of iterations to pretrain. |
| `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. |
| `--n-save-every`, `-se` | option | Save model every X batches. |
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental.|
| `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files.|
| **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. |
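
For example, resuming from a weight file that has been renamed (the paths are illustrative):

```bash
$ python -m spacy pretrain texts.jsonl en_vectors_web_lg output_dir
--init-tok2vec output_dir/best_model.bin --epoch-start 20
```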
### JSONL format for raw text {#pretrain-jsonl}

@@ -34,6 +34,7 @@ be a token pattern (list) or a phrase pattern (string). For example:
| ---------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `nlp` | `Language` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. |
| `patterns` | iterable | Optional patterns to load in. |
| `phrase_matcher_attr` | int / unicode | Optional attr to pass to the internal [`PhraseMatcher`](/api/phrasematcher). Defaults to `None`. |
| `overwrite_ents` | bool | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. |
| `**cfg` | - | Other config parameters. If pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to `spacy.load`. |
| **RETURNS** | `EntityRuler` | The newly constructed object. |
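
A usage sketch for the new argument (the label and pattern are illustrative):

```python
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")
ruler.add_patterns([{"label": "ORG", "pattern": "apple"}])
nlp.add_pipe(ruler)

doc = nlp("She works at APPLE")
assert [(ent.text, ent.label_) for ent in doc.ents] == [("APPLE", "ORG")]
```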

@@ -305,11 +305,11 @@ match on the uppercase versions, in case someone has written it as "Google i/o".
```python
### {executable="true"}
import spacy
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.tokens import Span
nlp = spacy.load("en_core_web_sm")
nlp = English()
matcher = Matcher(nlp.vocab)
def add_event_ent(matcher, doc, i, matches):
@@ -322,7 +322,7 @@ def add_event_ent(matcher, doc, i, matches):
pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
matcher.add("GoogleIO", add_event_ent, pattern)
doc = nlp(u"This is a text about Google I/O.")
doc = nlp(u"This is a text about Google I/O")
matches = matcher(doc)
```

@@ -106,7 +106,12 @@
{ "code": "hi", "name": "Hindi", "example": "यह एक वाक्य है।", "has_examples": true },
{ "code": "kn", "name": "Kannada" },
{ "code": "ta", "name": "Tamil", "has_examples": true },
{ "code": "id", "name": "Indonesian", "has_examples": true },
{
"code": "id",
"name": "Indonesian",
"example": "Ini adalah sebuah kalimat.",
"has_examples": true
},
{ "code": "tl", "name": "Tagalog" },
{ "code": "af", "name": "Afrikaans" },
{ "code": "bg", "name": "Bulgarian" },
@@ -116,7 +121,12 @@
{ "code": "lv", "name": "Latvian" },
{ "code": "sk", "name": "Slovak" },
{ "code": "sl", "name": "Slovenian" },
{ "code": "sq", "name": "Albanian" },
{
"code": "sq",
"name": "Albanian",
"example": "Kjo është një fjali.",
"has_examples": true
},
{ "code": "et", "name": "Estonian" },
{
"code": "th",