Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher

2025-11-07 11:27:37 +03:00 · 2018-02-17 16:47:35 +01:00 · f7dc64d2a3
commit f7dc64d2a3
parent afbd46adfb 95c1de90fd
33 changed files with 613 additions and 52 deletions
--- a/.github/contributors/emulbreh.md
+++ b/.github/contributors/emulbreh.md
@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your 
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statements below. Please do NOT
 mark both statements:
    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           | Johannes Dollinger   |
 | Company name (if applicable)   |                      |
 | Title or role (if applicable)  |                      |
 | Date                           | 2018-02-13           |
 | GitHub username                | emulbreh             |
 | Website (optional)             |                      |
--- a/.github/contributors/enerrio.md
+++ b/.github/contributors/enerrio.md
@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statements below. Please do NOT
 mark both statements:
    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           | Aaron Marquez        |
 | Company name (if applicable)   |                      |
 | Title or role (if applicable)  |                      |
 | Date                           | 2/15/2018            |
 | GitHub username                | enerrio              |
 | Website (optional)             |                      |
--- a/.github/contributors/oxinabox.md
+++ b/.github/contributors/oxinabox.md
@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your 
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statements below. Please do NOT
 mark both statements:
    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           | Lyndon White         |
 | Company name (if applicable)   |                      |
 | Title or role (if applicable)  |                      |
 | Date                           | 9/2/2018             |
 | GitHub username                | oxinabox             |
 | Website (optional)             | white.ucc.asn.au     |
--- a/.github/contributors/ursachec.md
+++ b/.github/contributors/ursachec.md
@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your 
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statements below. Please do NOT
 mark both statements:
    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                     |
 |------------------------------- | ------------------------- |
 | Name                           | Claudiu-Vlad Ursache      |
 | Company name (if applicable)   |                           |
 | Title or role (if applicable)  |                           |
 | Date                           | 2018-02-04                |
 | GitHub username                | ursachec                  |
 | Website (optional)             | https://www.cvursache.com |
--- a/spacy/attrs.pxd
+++ b/spacy/attrs.pxd
@ -18,9 +18,9 @@ cdef enum attr_id_t:
    IS_QUOTE
    IS_LEFT_PUNCT
    IS_RIGHT_PUNCT
+    IS_CURRENCY
-    FLAG18 = 18
-    FLAG19
+    FLAG19 = 19
    FLAG20
    FLAG21
    FLAG22
--- a/spacy/attrs.pyx
+++ b/spacy/attrs.pyx
@ -21,7 +21,7 @@ IDS = {
    "IS_QUOTE": IS_QUOTE,
    "IS_LEFT_PUNCT": IS_LEFT_PUNCT,
    "IS_RIGHT_PUNCT": IS_RIGHT_PUNCT,
-    "FLAG18": FLAG18,
+    "IS_CURRENCY": IS_CURRENCY,
    "FLAG19": FLAG19,
    "FLAG20": FLAG20,
    "FLAG21": FLAG21,
--- a/spacy/cli/evaluate.py
+++ b/spacy/cli/evaluate.py
@ -3,8 +3,6 @@ from __future__ import unicode_literals, division, print_function
 import plac
 from timeit import default_timer as timer
-import random
-import numpy.random
 from ..gold import GoldCorpus
 from ..util import prints
@ -12,10 +10,6 @@ from .. import util
 from .. import displacy
-random.seed(0)
-numpy.random.seed(0)
@plac.annotations(
    model=("model name or path", "positional", None, str),
    data_path=("location of JSON-formatted evaluation data", "positional",
@ -31,6 +25,8 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
    Evaluate a model. To render a sample of parses in a HTML file, set an
    output directory as the displacy_path argument.
    """
+    util.fix_random_seed()
    if gpu_id >= 0:
        util.use_gpu(gpu_id)
    util.set_env_log(False)
--- a/spacy/cli/train.py
+++ b/spacy/cli/train.py
@ -6,8 +6,6 @@ from pathlib import Path
 import tqdm
 from thinc.neural._classes.model import Model
 from timeit import default_timer as timer
-import random
-import numpy.random
 from ..gold import GoldCorpus, minibatch
 from ..util import prints
@ -16,9 +14,6 @@ from .. import about
 from .. import displacy
 from ..compat import json_dumps
-random.seed(0)
-numpy.random.seed(0)
@plac.annotations(
    lang=("model language", "positional", None, str),
@ -45,6 +40,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
    """
    Train a model. Expects data in spaCy's JSON format.
    """
+    util.fix_random_seed()
    util.set_env_log(True)
    n_sents = n_sents or None
    output_path = util.ensure_path(output_dir)
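Both CLI commands now call `util.fix_random_seed()` inside the command function instead of seeding at import time. The helper's body is not part of this diff, so the following is only a sketch of what such a helper might look like, assuming it seeds the same two generators that the removed module-level lines seeded. The `seed` parameter and the optional NumPy import are illustrative assumptions.

```python
import random

try:
    # NumPy is optional in this sketch; the diff itself imports numpy.random.
    import numpy.random as np_random
except ImportError:
    np_random = None


def fix_random_seed(seed=0):
    """Seed the stdlib generator and, if available, NumPy's global generator."""
    random.seed(seed)
    if np_random is not None:
        np_random.seed(seed)


# Re-seeding with the same value reproduces the same draws.
fix_random_seed()
first = [random.random() for _ in range(3)]
fix_random_seed()
second = [random.random() for _ in range(3)]
print(first == second)
```

Seeding inside the command rather than at import time means that merely importing the CLI module no longer resets the global random state.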
--- a/spacy/compat.py
+++ b/spacy/compat.py
@ -43,15 +43,15 @@ fix_text = ftfy.fix_text
 copy_array = copy_array
 izip = getattr(itertools, 'izip', zip)
-is_python2 = six.PY2
-is_python3 = six.PY3
 is_windows = sys.platform.startswith('win')
 is_linux = sys.platform.startswith('linux')
 is_osx = sys.platform == 'darwin'
+is_python2 = six.PY2
+is_python3 = six.PY3
+is_python_pre_3_5 = is_python2 or (is_python3 and sys.version_info[1]<5)
 if is_python2:
-    import imp
    bytes_ = str
    unicode_ = unicode  # noqa: F821
    basestring_ = basestring  # noqa: F821
@ -60,7 +60,6 @@ if is_python2:
    path2str = lambda path: str(path).decode('utf8')
 elif is_python3:
-    import importlib.util
    bytes_ = bytes
    unicode_ = str
    basestring_ = str
@ -111,9 +110,11 @@ def normalize_string_keys(old):
 def import_file(name, loc):
    loc = str(loc)
-    if is_python2:
+    if is_python_pre_3_5:
+        import imp
        return imp.load_source(name, loc)
    else:
+        import importlib.util
        spec = importlib.util.spec_from_file_location(name, str(loc))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
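The `is_python_pre_3_5` check matters because `importlib.util.module_from_spec` was added in Python 3.5, so earlier Python 3 releases need the `imp` fallback. Below is a minimal sketch of the Python 3.5+ branch only. The hunk ends before the function returns, so the final `return module` is an assumption, and the temporary file and `demo_mod` name are purely illustrative.

```python
import importlib.util
import os
import tempfile


def import_file(name, loc):
    """Load a module from a file path (Python 3.5+ branch only)."""
    spec = importlib.util.spec_from_file_location(name, str(loc))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module  # assumed: the return statement is outside the hunk shown


# Usage: write a throwaway module to disk and load it by path.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_mod.py")
    with open(path, "w") as file_:
        file_.write("ANSWER = 6 * 7\n")
    demo = import_file("demo_mod", path)

print(demo.ANSWER)
```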
--- a/spacy/glossary.py
+++ b/spacy/glossary.py
@ -115,7 +115,7 @@ GLOSSARY = {
    'ADJA':         'adjective, attributive',
    'ADJD':         'adjective, adverbial or predicative',
    'APPO':         'postposition',
-    'APRP':         'preposition; circumposition left',
+    'APPR':         'preposition; circumposition left',
    'APPRART':      'preposition with article',
    'APZR':         'circumposition right',
    'ART':          'definite or indefinite article',
--- a/spacy/lang/lex_attrs.py
+++ b/spacy/lang/lex_attrs.py
@ -69,6 +69,14 @@ def is_right_punct(text):
    return text in right_punct
+def is_currency(text):
+    # can be overwritten by lang with list of currency words, e.g. dollar, euro
+    for char in text:
+        if unicodedata.category(char) != 'Sc':
+            return False
+    return True
 def like_email(text):
    return bool(_like_email(text))
@ -164,5 +172,6 @@ LEX_ATTRS = {
    attrs.IS_QUOTE: is_quote,
    attrs.IS_LEFT_PUNCT: is_left_punct,
    attrs.IS_RIGHT_PUNCT: is_right_punct,
+    attrs.IS_CURRENCY: is_currency,
    attrs.LIKE_URL: like_url
 }
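The new `is_currency` lexical attribute relies on the Unicode general category `Sc` (Symbol, currency). The standalone sketch below copies the function body from the hunk above and runs it on a few sample strings, which are illustrative and not taken from spaCy's tests.

```python
import unicodedata


def is_currency(text):
    # Same logic as the added function: every character must be in the
    # Unicode "Sc" (Symbol, currency) category.
    for char in text:
        if unicodedata.category(char) != "Sc":
            return False
    return True


for token in ["$", "€", "$$", "USD", "$5"]:
    print(repr(token), is_currency(token))
```

Because the check is per character, words such as "USD" or "dollar" are not flagged, which is what the in-code comment about language-specific overrides refers to. As written, the loop never runs for an empty string, so `is_currency("")` returns `True`.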
--- a/spacy/language.py
+++ b/spacy/language.py
@ -624,7 +624,7 @@ class Language(object):
        deserializers = OrderedDict((
            ('vocab', lambda p: self.vocab.from_disk(p)),
            ('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
-            ('meta.json', lambda p: self.meta.update(ujson.load(p.open('r'))))
+            ('meta.json', lambda p: self.meta.update(util.read_json(p)))
        ))
        for name, proc in self.pipeline:
            if name in disable:
@ -720,5 +720,5 @@ class DisabledPipes(list):
 def _pipe(func, docs):
    for doc in docs:
-        func(doc)
+        doc = func(doc)
        yield doc
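The `_pipe` change only makes a difference for components that return a new object instead of modifying their argument in place. The sketch below illustrates this with plain lists standing in for `Doc` objects. `add_marker` and `_pipe_old` are hypothetical names introduced for the comparison.

```python
def _pipe(func, docs):
    # New behaviour: yield whatever the component returns.
    for doc in docs:
        doc = func(doc)
        yield doc


def _pipe_old(func, docs):
    # Old behaviour: the return value is discarded.
    for doc in docs:
        func(doc)
        yield doc


def add_marker(doc):
    # Hypothetical component that returns a new object rather than mutating.
    return doc + ["<processed>"]


docs = [["a"], ["b"]]
new_result = list(_pipe(add_marker, docs))
old_result = list(_pipe_old(add_marker, docs))
print(new_result)
print(old_result)
```

With the old version the caller receives the unmodified inputs, so the work done by a non-mutating component is silently lost.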
--- a/spacy/lexeme.pyx
+++ b/spacy/lexeme.pyx
@ -12,7 +12,7 @@ import numpy
 from .typedefs cimport attr_t, flags_t
 from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
 from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
-from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_OOV
+from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV
 from .attrs cimport PROB
 from .attrs import intify_attrs
 from . import about
@ -474,6 +474,14 @@ cdef class Lexeme:
        def __set__(self, bint x):
            Lexeme.c_set_flag(self.c, IS_RIGHT_PUNCT, x)
+    property is_currency:
+        """RETURNS (bool): Whether the lexeme is a currency symbol, e.g. $, €."""
+        def __get__(self):
+            return Lexeme.c_check_flag(self.c, IS_CURRENCY)
+        def __set__(self, bint x):
+            Lexeme.c_set_flag(self.c, IS_CURRENCY, x)
    property like_url:
        """RETURNS (bool): Whether the lexeme resembles a URL."""
        def __get__(self):
--- a/spacy/pipeline.pyx
+++ b/spacy/pipeline.pyx
@ -144,7 +144,8 @@ class Pipe(object):
        return create_default_optimizer(self.model.ops,
                                        **self.cfg.get('optimizer', {}))
-    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None):
+    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None,
+                       **kwargs):
        """Initialize the pipe for training, using data examples if available.
        If no model has been initialized yet, the model is added."""
        if self.model is True:
@ -214,7 +215,8 @@ class Pipe(object):
 def _load_cfg(path):
    if path.exists():
-        return ujson.load(path.open())
+        with path.open() as file_:
+            return ujson.load(file_)
    else:
        return {}
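Replacing `ujson.load(path.open())` with a `with` block closes the file handle deterministically instead of leaving it to the garbage collector. A minimal sketch of the same pattern follows. It uses the standard-library `json` module in place of `ujson`, and the name `load_cfg` and the sample config are illustrative.

```python
import json
import tempfile
from pathlib import Path


def load_cfg(path):
    # Stand-in for the diff's _load_cfg, using stdlib json instead of ujson.
    if path.exists():
        with path.open() as file_:
            return json.load(file_)
    else:
        return {}


with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "cfg"
    missing = load_cfg(cfg_path)          # file does not exist yet
    cfg_path.write_text('{"width": 128}')
    loaded = load_cfg(cfg_path)           # handle is closed on return

print(missing, loaded)
```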
@ -344,7 +346,8 @@ class Tensorizer(Pipe):
        loss = (d_scores**2).sum()
        return loss, d_scores
-    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None):
+    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None,
+                        **kwargs):
        """Allocate models, pre-process training data and acquire an
        optimizer.
@ -467,7 +470,8 @@ class Tagger(Pipe):
        d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs])
        return float(loss), d_scores
-    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None):
+    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None,
+                       **kwargs):
        orig_tag_map = dict(self.vocab.morphology.tag_map)
        new_tag_map = OrderedDict()
        for raw_text, annots_brackets in gold_tuples:
@ -580,7 +584,8 @@ class Tagger(Pipe):
        def load_model(p):
            if self.model is True:
                self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
-            self.model.from_bytes(p.open('rb').read())
+            with p.open('rb') as file_:
+                self.model.from_bytes(file_.read())
        def load_tag_map(p):
            with p.open('rb') as file_:
@ -641,7 +646,7 @@ class MultitaskObjective(Tagger):
        pass
    def begin_training(self, gold_tuples=tuple(), pipeline=None, tok2vec=None,
-                       sgd=None):
+                       sgd=None, **kwargs):
        gold_tuples = nonproj.preprocess_training_data(gold_tuples)
        for raw_text, annots_brackets in gold_tuples:
            for annots, brackets in annots_brackets:
@ -766,7 +771,7 @@ class SimilarityHook(Pipe):
    def update(self, doc1_doc2, golds, sgd=None, drop=0.):
        sims, bp_sims = self.model.begin_update(doc1_doc2, drop=drop)
-    def begin_training(self, _=tuple(), pipeline=None, sgd=None):
+    def begin_training(self, _=tuple(), pipeline=None, sgd=None, **kwargs):
        """Allocate model, using width from tensorizer in pipeline.
        gold_tuples (iterable): Gold-standard training data.
@ -887,6 +892,7 @@ cdef class DependencyParser(Parser):
        self._multitasks.append(labeller)
    def init_multitask_objectives(self, gold_tuples, pipeline, sgd=None, **cfg):
        self.add_multitask_objective('tag')
        for labeller in self._multitasks:
            tok2vec = self.model[0]
            labeller.begin_training(gold_tuples, pipeline=pipeline,
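Every `begin_training` signature in this file gains a trailing `**kwargs`. The diff does not state the motivation. One visible effect, sketched below, is that a caller can forward extra keyword arguments without raising `TypeError`. The two classes and the `token_vector_width` setting are hypothetical stand-ins, not spaCy classes.

```python
class OldPipe(object):
    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None):
        return []


class NewPipe(object):
    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None,
                       **kwargs):
        # Extra settings are accepted; here we just report their names.
        return sorted(kwargs)


cfg = {"token_vector_width": 128}  # hypothetical extra setting

try:
    OldPipe().begin_training([], pipeline=None, sgd=None, **cfg)
    old_outcome = "accepted"
except TypeError:
    old_outcome = "TypeError"

new_outcome = NewPipe().begin_training([], pipeline=None, sgd=None, **cfg)
print(old_outcome, new_outcome)
```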
--- a/spacy/symbols.pxd
+++ b/spacy/symbols.pxd
@ -17,9 +17,9 @@ cdef enum symbol_t:
    IS_QUOTE
    IS_LEFT_PUNCT
    IS_RIGHT_PUNCT
+    IS_CURRENCY
-    FLAG18 = 18
-    FLAG19
+    FLAG19 = 19
    FLAG20
    FLAG21
    FLAG22
--- a/spacy/symbols.pyx
+++ b/spacy/symbols.pyx
@ -22,8 +22,8 @@ IDS = {
    "IS_QUOTE": IS_QUOTE,
    "IS_LEFT_PUNCT": IS_LEFT_PUNCT,
    "IS_RIGHT_PUNCT": IS_RIGHT_PUNCT,
    "IS_CURRENCY": IS_CURRENCY,
    "FLAG18": FLAG18,
    "FLAG19": FLAG19,
    "FLAG20": FLAG20,
    "FLAG21": FLAG21,
--- a/spacy/syntax/arc_eager.pyx
+++ b/spacy/syntax/arc_eager.pyx
@ -390,6 +390,22 @@ cdef class ArcEager(TransitionSystem):
                gold.c.labels[i] = self.strings.add(label)
        return gold
    def get_beam_parses(self, Beam beam):
        parses = []
        probs = beam.probs
        for i in range(beam.size):
            state = <StateC*>beam.at(i)
            if state.is_final():
                self.finalize_state(state)
                prob = probs[i]
                parse = []
                for j in range(state.length):
                    head = state.H(j)
                    label = self.strings[state._sent[j].dep]
                    parse.append((head, j, label))
                parses.append((prob, parse))
        return parses
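
A minimal sketch of how the list returned by `get_beam_parses` might be consumed, assuming it has the shape built above: a list of `(prob, parse)` pairs, where each `parse` is a list of `(head, token_index, label)` triples. The helper name `arc_marginals` and the sample data are hypothetical and not part of spaCy.

```python
from collections import defaultdict


def arc_marginals(parses):
    """Sum beam probabilities per (head, token_index, label) arc.

    `parses` is assumed to look like the return value of get_beam_parses:
    [(prob, [(head, token_index, label), ...]), ...]. Hypothetical helper.
    """
    scores = defaultdict(float)
    for prob, parse in parses:
        for head, index, label in parse:
            scores[(head, index, label)] += prob
    return dict(scores)


# Made-up beam with two final states over a two-token sentence.
sample_parses = [
    (0.75, [(1, 0, 'nsubj'), (1, 1, 'ROOT')]),
    (0.25, [(0, 0, 'ROOT'), (0, 1, 'dobj')]),
]
marginals = arc_marginals(sample_parses)
```

Arcs that appear in several final beam states would accumulate a higher score, which is one plausible way to read the per-parse probabilities.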
    cdef Transition lookup_transition(self, object name) except *:
        if '-' in name:
            move_str, label_str = name.split('-', 1)
--- a/spacy/syntax/nn_parser.pyx
+++ b/spacy/syntax/nn_parser.pyx
@ -835,6 +835,7 @@ cdef class Parser:
                sgd = self.create_optimizer()
            self.model[1].begin_training(
                    self.model[1].ops.allocate((5, cfg['token_vector_width'])))
            if pipeline is not None:
                self.init_multitask_objectives(gold_tuples, pipeline, sgd=sgd, **cfg)
            link_vectors_to_models(self.vocab)
        else:
@ -887,7 +888,7 @@ cdef class Parser:
        deserializers = {
            'vocab': lambda p: self.vocab.from_disk(p),
            'moves': lambda p: self.moves.from_disk(p, strings=False),
-            'cfg': lambda p: self.cfg.update(ujson.load(p.open())),
+            'cfg': lambda p: self.cfg.update(util.read_json(p)),
            'model': lambda p: None
        }
        util.from_disk(path, deserializers, exclude)
--- a/spacy/tests/lang/test_attrs.py
+++ b/spacy/tests/lang/test_attrs.py
@ -2,7 +2,7 @@
 from __future__ import unicode_literals
 from ...attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
-from ...lang.lex_attrs import is_punct, is_ascii, like_url, word_shape
+from ...lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape
 import pytest
@ -37,6 +37,13 @@ def test_lex_attrs_is_ascii(text, match):
    assert is_ascii(text) == match
@pytest.mark.parametrize('text,match', [('$', True), ('£', True), ('♥', False),
    ('€', True), ('¥', True), ('¢', True),
    ('a', False), ('www.google.com', False), ('dog', False)])
 def test_lex_attrs_is_currency(text, match):
    assert is_currency(text) == match
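
The `is_currency` implementation itself is not shown in this part of the diff. As a rough sketch, and assuming the check is based on the Unicode "Sc" (Symbol, currency) category, a function that satisfies the parametrized cases above could look like this. The name `is_currency_sketch` and the empty-string behaviour are assumptions.

```python
import unicodedata


def is_currency_sketch(text):
    """Return True if every character is a Unicode currency symbol ('Sc')."""
    # Assumption: an empty string does not count as a currency symbol.
    if not text:
        return False
    return all(unicodedata.category(char) == 'Sc' for char in text)


# The same cases as in test_lex_attrs_is_currency.
cases = [('$', True), ('£', True), ('♥', False), ('€', True), ('¥', True),
         ('¢', True), ('a', False), ('www.google.com', False), ('dog', False)]
results = [(text, is_currency_sketch(text)) for text, _ in cases]
```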
@pytest.mark.parametrize('text,match', [
    ('www.google.com', True), ('google.com', True), ('sydney.com', True),
    ('2girls1cup.org', True), ('http://stupid', True), ('www.hi', True),
--- a/spacy/tests/regression/test_issue1959.py
+++ b/spacy/tests/regression/test_issue1959.py
@ -0,0 +1,23 @@
 # coding: utf8
 from __future__ import unicode_literals
 import pytest
@pytest.mark.models('en')
 def test_issue1959(EN):
    texts = ['Apple is looking at buying U.K. startup for $1 billion.']
    # nlp = load_test_model('en_core_web_sm')
    EN.add_pipe(clean_component, name='cleaner', after='ner')
    doc = EN(texts[0])
    doc_pipe = [doc_pipe for doc_pipe in EN.pipe(texts)]
    assert doc == doc_pipe[0]
 def clean_component(doc):
    """ Clean up text. Make lowercase and remove punctuation and stopwords """
    # Remove punctuation, symbols (#) and stopwords
    doc = [tok.text.lower() for tok in doc if (not tok.is_stop
                                               and tok.pos_ != 'PUNCT' and
                                               tok.pos_ != 'SYM')]
    doc = ' '.join(doc)
    return doc
--- a/spacy/tests/serialize/test_serialize_language.py
+++ b/spacy/tests/serialize/test_serialize_language.py
@ -0,0 +1,28 @@
 # coding: utf-8
 from __future__ import unicode_literals
 from ..util import make_tempdir
 from ...language import Language
 import pytest
@pytest.fixture
 def meta_data():
    return {
        'name': 'name-in-fixture',
        'version': 'version-in-fixture',
        'description': 'description-in-fixture',
        'author': 'author-in-fixture',
        'email': 'email-in-fixture',
        'url': 'url-in-fixture',
        'license': 'license-in-fixture',
    }
 def test_serialize_language_meta_disk(meta_data):
    language = Language(meta=meta_data)
    with make_tempdir() as d:
        language.to_disk(d)
        new_language = Language().from_disk(d)
    assert new_language.meta == language.meta
--- a/spacy/tokens/token.pyx
+++ b/spacy/tokens/token.pyx
@ -15,7 +15,7 @@ from ..lexeme cimport Lexeme
 from .. import parts_of_speech
 from ..attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
 from ..attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT
-from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL
+from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM, LIKE_EMAIL
 from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX
 from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP
 from ..compat import is_config
@ -855,6 +855,11 @@ cdef class Token:
        def __get__(self):
            return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)
    property is_currency:
        """RETURNS (bool): Whether the token is a currency symbol."""
        def __get__(self):
            return Lexeme.c_check_flag(self.c.lex, IS_CURRENCY)
    property like_url:
        """RETURNS (bool): Whether the token resembles a URL."""
        def __get__(self):
--- a/spacy/util.py
+++ b/spacy/util.py
@ -17,6 +17,7 @@ from thinc.neural._classes.model import Model
 import functools
 import cytoolz
 import itertools
 import numpy.random
 from .symbols import ORTH
 from .compat import cupy, CudaStream, path2str, basestring_, input_, unicode_
@ -623,3 +624,8 @@ def use_gpu(gpu_id):
    Model.ops = CupyOps()
    Model.Ops = CupyOps
    return device
 def fix_random_seed(seed=0):
    random.seed(seed)
    numpy.random.seed(seed)
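
A small sketch of what fixing the seed buys you: reseeding with the same value replays the same pseudo-random sequence. To stay dependency-free, this stand-in (`fix_random_seed_sketch`, a hypothetical name) seeds only the standard-library `random` module, whereas the function in the diff also seeds `numpy.random`.

```python
import random


def fix_random_seed_sketch(seed=0):
    # Stdlib-only stand-in; the function in the diff also seeds numpy.random.
    random.seed(seed)


fix_random_seed_sketch(0)
first_run = [random.random() for _ in range(3)]

# Reseeding with the same value restarts the generator from the same state.
fix_random_seed_sketch(0)
second_run = [random.random() for _ in range(3)]
```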
--- a/spacy/vectors.pyx
+++ b/spacy/vectors.pyx
@ -347,7 +347,8 @@ cdef class Vectors:
        """
        def load_key2row(path):
            if path.exists():
-                self.key2row = msgpack.load(path.open('rb'))
+                with path.open('rb') as file_:
                    self.key2row = msgpack.load(file_)
            for key, row in self.key2row.items():
                if row in self._unset:
                    self._unset.remove(row)
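
The change above, like the one in `Tagger.from_disk`, replaces a bare `path.open('rb')` with a `with` block so the file handle is closed deterministically. A minimal sketch of the pattern follows; it uses `json` instead of `msgpack` to stay within the standard library, and the helper name and file contents are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_key2row_sketch(path):
    """Load a key-to-row mapping; return it with the (closed) file handle."""
    if not path.exists():
        return {}, None
    # The handle is closed as soon as the block exits, even on error.
    with path.open('rb') as file_:
        data = json.loads(file_.read().decode('utf8'))
    return data, file_


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'key2row.json'
    path.write_text(json.dumps({'apple': 0, 'pear': 1}), encoding='utf8')
    key2row, handle = load_key2row_sketch(path)
    handle_closed = handle.closed
```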
--- a/website/_includes/_navigation.jade
+++ b/website/_includes/_navigation.jade
@ -10,6 +10,9 @@ nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null)
            li.c-nav__menu__item(class=is_active ? "is-active" : null)
                +a(url)(tabindex=is_active ? "-1" : null)=item
        li.c-nav__menu__item.u-hidden-xs
            +a("https://survey.spacy.io", true) User Survey 2018
        li.c-nav__menu__item.u-hidden-xs
            +a(gh("spaCy"))(aria-label="GitHub") #[+icon("github", 20)]
--- a/website/usage/_facts-figures/_benchmarks.jade
+++ b/website/usage/_facts-figures/_benchmarks.jade
@ -13,7 +13,7 @@ p
    |  Their results and subsequent discussions helped us develop a novel
    |  psychologically-motivated technique to improve spaCy's accuracy, which
    |  we published in joint work with Macquarie University
-    |  #[+a("https://aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)].
+    |  #[+a("https://www.aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)].
 include _benchmarks-choi-2015
--- a/website/usage/_install/_troubleshooting.jade
+++ b/website/usage/_install/_troubleshooting.jade
@ -38,9 +38,10 @@ p
    |  #[code spacy/data] directory. This means your user needs permission to do
    |  this. The above error mostly occurs when doing a system-wide installation,
    |  which will create the symlinks in a system directory. Run the
-    |  #[code download] or #[code link] command as administrator, or use a
+    |  #[code download] or #[code link] command as administrator (on Windows,
-    |  #[code virtualenv] to install spaCy in a user directory, instead
+    |  simply right-click on your terminal or shell and select "Run as
-    |  of doing a system-wide installation.
+    |  Administrator"), or use a #[code virtualenv] to install spaCy in a user
    |  directory, instead of doing a system-wide installation.
 +h(3, "no-cache-dir") No such option: --no-cache-dir
--- a/website/usage/_linguistic-features/_dependency-parse.jade
+++ b/website/usage/_linguistic-features/_dependency-parse.jade
@ -65,9 +65,9 @@ p
    - var style = [0, 1, 0, 1, 0]
    +annotation-row(["Autonomous", "amod", "cars", "NOUN", ""], style)
    +annotation-row(["cars", "nsubj", "shift", "VERB", "Autonomous"], style)
-    +annotation-row(["shift", "ROOT", "shift", "VERB", "cars, liability"], style)
+    +annotation-row(["shift", "ROOT", "shift", "VERB", "cars, liability, toward"], style)
    +annotation-row(["insurance", "compound", "liability", "NOUN", ""], style)
-    +annotation-row(["liability", "dobj", "shift", "VERB", "insurance, toward"], style)
+    +annotation-row(["liability", "dobj", "shift", "VERB", "insurance"], style)
    +annotation-row(["toward", "prep", "liability", "NOUN", "manufacturers"], style)
    +annotation-row(["manufacturers", "pobj", "toward", "ADP", ""], style)
--- a/website/usage/_linguistic-features/_named-entities.jade
+++ b/website/usage/_linguistic-features/_named-entities.jade
@ -80,7 +80,7 @@ p
    doc.ents = [netflix_ent]
    ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
-    assert ents = [(u'Netflix', 0, 7, u'ORG')]
+    assert ents == [(u'Netflix', 0, 7, u'ORG')]
 p
    |  Keep in mind that you need to create a #[code Span] with the start and
--- a/website/usage/_linguistic-features/_rule-based-matching.jade
+++ b/website/usage/_linguistic-features/_rule-based-matching.jade
@ -54,10 +54,21 @@ p
 p
    |  The matcher returns a list of #[code (match_id, start, end)] tuples – in
-    |  this case, #[code [('HelloWorld', 0, 2)]], which maps to the span
+    |  this case, #[code [('15578876784678163569', 0, 2)]], which maps to the
-    |  #[code doc[0:2]] of our original document. Optionally, we could also
+    |  span #[code doc[0:2]] of our original document. The #[code match_id]
-    |  choose to add more than one pattern, for example to also match sequences
+    |  is the #[+a("/usage/spacy-101#vocab") hash value] of the string ID
-    |  without punctuation between "hello" and "world":
+    |  "HelloWorld". To get the string value, you can look up the ID
    |  in the #[+api("stringstore") #[code StringStore]].
 +code.
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # 'HelloWorld'
        span = doc[start:end]                    # the matched span
 p
    |  Optionally, we could also choose to add more than one pattern, for
    |  example to also match sequences without punctuation between "hello" and
    |  "world":
 +code.
    matcher.add('HelloWorld', None,
@ -91,6 +102,10 @@ p
        +cell.u-nowrap #[code LOWER]
        +cell The lowercase form of the token text.
    +row
        +cell #[code LENGTH]
        +cell The length of the token text.
    +row
        +cell.u-nowrap #[code IS_ALPHA], #[code IS_ASCII], #[code IS_DIGIT]
        +cell
@ -117,6 +132,10 @@ p
            |  The token's simple and extended part-of-speech tag, dependency
            |  label, lemma, shape.
    +row
        +cell.u-nowrap #[code ENT_TYPE]
        +cell The token's entity label.
 +h(4, "adding-patterns-wildcard") Using wildcard token patterns
    +tag-new(2)
@ -335,7 +354,8 @@ p
    |  flag.
 +code.
-    IS_DEFINITELY = nlp.vocab.add_flag(re.compile(r'deff?in[ia]tely').match)
+    definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text))
    IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)
    matcher = Matcher(nlp.vocab)
    matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])
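
A standalone sketch of why the flag function is wrapped in `bool(...)`: `re.match` returns a match object or `None`, so the wrapper turns that into a plain `True`/`False`. This variant compiles the pattern once at module level, which is a stylistic choice rather than what the diff does; `DEFINITELY_RE` is an illustrative name, and no spaCy objects are involved.

```python
import re

# Compile once so the pattern is not rebuilt on every call.
DEFINITELY_RE = re.compile(r'deff?in[ia]tely')


def definitely_flag(text):
    # re.match returns a match object or None; bool() makes it True/False.
    return bool(DEFINITELY_RE.match(text))


checked = {word: definitely_flag(word)
           for word in ['definitely', 'deffinately', 'defintely']}
```

A function like this could then be passed to `nlp.vocab.add_flag` in the same way as the lambda in the diff.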
--- a/website/usage/_linguistic-features/_tokenization.jade
+++ b/website/usage/_linguistic-features/_tokenization.jade
@ -54,7 +54,7 @@ p
 +code.
    import spacy
-    from spacy.symbols import ORTH, LEMMA, POS
+    from spacy.symbols import ORTH, LEMMA, POS, TAG
    nlp = spacy.load('en')
    doc = nlp(u'gimme that') # phrase to tokenize
--- a/website/usage/_models/_install-basics.jade
+++ b/website/usage/_models/_install-basics.jade
@ -31,3 +31,13 @@ p
    import spacy
    nlp = spacy.load('en')
    doc = nlp(u'This is a sentence.')
 +infobox("Important note", "⚠️")
    |  To allow loading models via convenient shortcuts like #[code 'en'], spaCy
    |  will create a symlink within the #[code spacy/data] directory. This means
    |  that your user needs the #[strong required permissions].
    |  If you've installed spaCy to a system directory and don't have admin
    |  privileges, the model linking may fail. The easiest solution
    |  is to re-run the command as admin, or use a #[code virtualenv]. For more
    |  info on this, see the
    |  #[+a("/usage/#symlink-privilege") troubleshooting guide].
--- a/website/usage/_models/_install.jade
+++ b/website/usage/_models/_install.jade
@ -132,7 +132,7 @@ p
    # set up shortcut link to load local model as "my_amazing_model"
    python -m spacy link /Users/you/model my_amazing_model
-+infobox("Important note")
++infobox("Important note", "⚠️")
    |  In order to create a symlink, your user needs the #[strong required permissions].
    |  If you've installed spaCy to a system directory and don't have admin
    |  privileges, the #[code spacy link] command may fail. The easiest solution