Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher

2025-11-07 19:37:38 +03:00 · 2018-02-17 16:47:35 +01:00
commit f7dc64d2a3
parent afbd46adfb 95c1de90fd
33 changed files with 613 additions and 52 deletions
--- a/.github/contributors/emulbreh.md
+++ b/.github/contributors/emulbreh.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your 
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Johannes Dollinger   |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2018-02-13           |
+| GitHub username                | emulbreh             |
+| Website (optional)             |                      |
--- a/.github/contributors/enerrio.md
+++ b/.github/contributors/enerrio.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Aaron Marquez        |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2/15/2018            |
+| GitHub username                | enerrio              |
+| Website (optional)             |                      |
--- a/.github/contributors/oxinabox.md
+++ b/.github/contributors/oxinabox.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your 
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Lyndon White         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 9/2/2018             |
+| GitHub username                | oxinabox             |
+| Website (optional)             | white.ucc.asn.au     |
--- a/.github/contributors/ursachec.md
+++ b/.github/contributors/ursachec.md
@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your 
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                     |
+|------------------------------- | ------------------------- |
+| Name                           | Claudiu-Vlad Ursache      |
+| Company name (if applicable)   |                           |
+| Title or role (if applicable)  |                           |
+| Date                           | 2018-02-04                |
+| GitHub username                | ursachec                  |
+| Website (optional)             | https://www.cvursache.com |
--- a/spacy/attrs.pxd
+++ b/spacy/attrs.pxd
@ -18,9 +18,9 @@ cdef enum attr_id_t:
    IS_QUOTE
    IS_LEFT_PUNCT
    IS_RIGHT_PUNCT
+    IS_CURRENCY

-    FLAG18 = 18
-    FLAG19
+    FLAG19 = 19
    FLAG20
    FLAG21
    FLAG22
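
Note on the hunk above: `IS_CURRENCY` takes the enum slot that `FLAG18 = 18` used to occupy, and `FLAG19` is now pinned to 19 so the later IDs do not shift. The sketch below shows one way a boolean flag ID like this can be stored as a single bit of an integer. The numeric value 18 is inferred from the removed line, and `set_flag` / `check_flag` are hypothetical helper names for illustration, not spaCy's API.

```python
# Assumed IDs: 18 is inferred from the removed `FLAG18 = 18` line.
IS_CURRENCY = 18
FLAG19 = 19


def set_flag(flags, flag_id, value):
    """Return `flags` with the bit at position `flag_id` set or cleared."""
    mask = 1 << flag_id
    return (flags | mask) if value else (flags & ~mask)


def check_flag(flags, flag_id):
    """Return True if the bit at position `flag_id` is set."""
    return bool(flags & (1 << flag_id))


flags = set_flag(0, IS_CURRENCY, True)
print(check_flag(flags, IS_CURRENCY))  # the bit we just set
print(check_flag(flags, FLAG19))       # a neighbouring, untouched bit
```

Because each ID is a bit position in this scheme, reusing slot 18 rather than appending a new ID leaves every other flag's position unchanged.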
--- a/spacy/attrs.pyx
+++ b/spacy/attrs.pyx
@ -21,7 +21,7 @@ IDS = {
    "IS_QUOTE": IS_QUOTE,
    "IS_LEFT_PUNCT": IS_LEFT_PUNCT,
    "IS_RIGHT_PUNCT": IS_RIGHT_PUNCT,
-    "FLAG18": FLAG18,
+    "IS_CURRENCY": IS_CURRENCY,
    "FLAG19": FLAG19,
    "FLAG20": FLAG20,
    "FLAG21": FLAG21,
--- a/spacy/cli/evaluate.py
+++ b/spacy/cli/evaluate.py
@ -3,8 +3,6 @@ from __future__ import unicode_literals, division, print_function

 import plac
 from timeit import default_timer as timer
-import random
-import numpy.random

 from ..gold import GoldCorpus
 from ..util import prints
@ -12,10 +10,6 @@ from .. import util
 from .. import displacy


-random.seed(0)
-numpy.random.seed(0)
-
-
@plac.annotations(
    model=("model name or path", "positional", None, str),
    data_path=("location of JSON-formatted evaluation data", "positional",
@ -31,6 +25,8 @@ def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None
    Evaluate a model. To render a sample of parses in a HTML file, set an
    output directory as the displacy_path argument.
    """
+
+    util.fix_random_seed()
    if gpu_id >= 0:
        util.use_gpu(gpu_id)
    util.set_env_log(False)
--- a/spacy/cli/train.py
+++ b/spacy/cli/train.py
@ -6,8 +6,6 @@ from pathlib import Path
 import tqdm
 from thinc.neural._classes.model import Model
 from timeit import default_timer as timer
-import random
-import numpy.random

 from ..gold import GoldCorpus, minibatch
 from ..util import prints
@ -16,9 +14,6 @@ from .. import about
 from .. import displacy
 from ..compat import json_dumps

-random.seed(0)
-numpy.random.seed(0)
-

@plac.annotations(
    lang=("model language", "positional", None, str),
@ -45,6 +40,7 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
    """
    Train a model. Expects data in spaCy's JSON format.
    """
+    util.fix_random_seed()
    util.set_env_log(True)
    n_sents = n_sents or None
    output_path = util.ensure_path(output_dir)
--- a/spacy/compat.py
+++ b/spacy/compat.py
@ -43,15 +43,15 @@ fix_text = ftfy.fix_text
 copy_array = copy_array
 izip = getattr(itertools, 'izip', zip)

-is_python2 = six.PY2
-is_python3 = six.PY3
 is_windows = sys.platform.startswith('win')
 is_linux = sys.platform.startswith('linux')
 is_osx = sys.platform == 'darwin'

+is_python2 = six.PY2
+is_python3 = six.PY3
+is_python_pre_3_5 = is_python2 or (is_python3 and sys.version_info[1] < 5)

 if is_python2:
-    import imp
    bytes_ = str
    unicode_ = unicode  # noqa: F821
    basestring_ = basestring  # noqa: F821
@ -60,7 +60,6 @@ if is_python2:
    path2str = lambda path: str(path).decode('utf8')

 elif is_python3:
-    import importlib.util
    bytes_ = bytes
    unicode_ = str
    basestring_ = str
@ -111,9 +110,11 @@ def normalize_string_keys(old):

 def import_file(name, loc):
    loc = str(loc)
-    if is_python2:
+    if is_python_pre_3_5:
+        import imp
        return imp.load_source(name, loc)
    else:
+        import importlib.util
        spec = importlib.util.spec_from_file_location(name, str(loc))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
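
Note on the hunk above: `importlib.util.module_from_spec` was added in Python 3.5, which is why the branch condition changes from "Python 2" to "anything before 3.5". Below is a minimal, standalone sketch of the Python 3.5+ branch only. The trailing `return module` is an assumption (the hunk ends before it), and `demo` and the file name `example_plugin.py` are hypothetical.

```python
import importlib.util
import os
import sys
import tempfile

# Tuple comparison covers both Python 2.x and 3.0-3.4 in one check.
is_python_pre_3_5 = sys.version_info < (3, 5)


def import_file(name, loc):
    """Load a module object from a file path (Python 3.5+ only)."""
    spec = importlib.util.spec_from_file_location(name, str(loc))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module  # assumed: the diff context stops before this line


def demo():
    """Write a throwaway module to disk, import it by path, read a value."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example_plugin.py")
        with open(path, "w") as file_:
            file_.write("ANSWER = 42\n")
        return import_file("example_plugin", path).ANSWER


print(demo())
```

Moving the `import imp` / `import importlib.util` statements inside the function, as the diff does, means neither module is imported on interpreters that take the other branch.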
--- a/spacy/glossary.py
+++ b/spacy/glossary.py
@ -115,7 +115,7 @@ GLOSSARY = {
    'ADJA':         'adjective, attributive',
    'ADJD':         'adjective, adverbial or predicative',
    'APPO':         'postposition',
-    'APRP':         'preposition; circumposition left',
+    'APPR':         'preposition; circumposition left',
    'APPRART':      'preposition with article',
    'APZR':         'circumposition right',
    'ART':          'definite or indefinite article',
--- a/spacy/lang/lex_attrs.py
+++ b/spacy/lang/lex_attrs.py
@ -69,6 +69,14 @@ def is_right_punct(text):
    return text in right_punct


+def is_currency(text):
+    # can be overwritten by lang with list of currency words, e.g. dollar, euro
+    for char in text:
+        if unicodedata.category(char) != 'Sc':
+            return False
+    return True
+
+
 def like_email(text):
    return bool(_like_email(text))

@ -164,5 +172,6 @@ LEX_ATTRS = {
    attrs.IS_QUOTE: is_quote,
    attrs.IS_LEFT_PUNCT: is_left_punct,
    attrs.IS_RIGHT_PUNCT: is_right_punct,
+    attrs.IS_CURRENCY: is_currency,
    attrs.LIKE_URL: like_url
 }
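
Note on the hunk above: the new `is_currency` accepts a string only if every character is in the Unicode category `Sc` (Symbol, currency). The standalone copy below can be used to check the behaviour on a few inputs; the sample strings are illustrative and not taken from the test suite.

```python
import unicodedata


def is_currency(text):
    # Same logic as the added function: reject on the first non-'Sc' character.
    for char in text:
        if unicodedata.category(char) != 'Sc':
            return False
    return True


samples = ["$", "€", "$€", "US$", "5", ""]
for sample in samples:
    print(repr(sample), is_currency(sample))
```

Two consequences of this definition: mixed tokens such as `US$` are rejected because of the letters, and the empty string is accepted because the loop body never runs.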
--- a/spacy/language.py
+++ b/spacy/language.py
@ -624,7 +624,7 @@ class Language(object):
        deserializers = OrderedDict((
            ('vocab', lambda p: self.vocab.from_disk(p)),
            ('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
-            ('meta.json', lambda p: self.meta.update(ujson.load(p.open('r'))))
+            ('meta.json', lambda p: self.meta.update(util.read_json(p)))
        ))
        for name, proc in self.pipeline:
            if name in disable:
@ -720,5 +720,5 @@ class DisabledPipes(list):

 def _pipe(func, docs):
    for doc in docs:
-        func(doc)
+        doc = func(doc)
        yield doc
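
Note on the hunk above: the change to `_pipe` matters when a component returns a new object instead of modifying its argument in place. The sketch below contrasts the two versions, using plain strings as stand-ins for `Doc` objects; `pipe_old`, `pipe_new` and `upper_component` are hypothetical names for illustration.

```python
def pipe_old(func, docs):
    # Before the change: the return value of func is discarded.
    for doc in docs:
        func(doc)
        yield doc


def pipe_new(func, docs):
    # After the change: whatever func returns is what gets yielded.
    for doc in docs:
        doc = func(doc)
        yield doc


def upper_component(doc):
    # Stand-in component that returns a new object rather than mutating.
    return doc.upper()


print(list(pipe_old(upper_component, ["a b", "c d"])))
print(list(pipe_new(upper_component, ["a b", "c d"])))
```

With the old version the component's result is silently lost; with the new version it flows through to the caller.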
--- a/spacy/lexeme.pyx
+++ b/spacy/lexeme.pyx
@ -12,7 +12,7 @@ import numpy
 from .typedefs cimport attr_t, flags_t
 from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
 from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
-from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_OOV
+from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV
 from .attrs cimport PROB
 from .attrs import intify_attrs
 from . import about
@ -474,6 +474,14 @@ cdef class Lexeme:
        def __set__(self, bint x):
            Lexeme.c_set_flag(self.c, IS_RIGHT_PUNCT, x)

+    property is_currency:
+        """RETURNS (bool): Whether the lexeme is a currency symbol, e.g. $, €."""
+        def __get__(self):
+            return Lexeme.c_check_flag(self.c, IS_CURRENCY)
+
+        def __set__(self, bint x):
+            Lexeme.c_set_flag(self.c, IS_CURRENCY, x)
+
    property like_url:
        """RETURNS (bool): Whether the lexeme resembles a URL."""
        def __get__(self):
--- a/spacy/pipeline.pyx
+++ b/spacy/pipeline.pyx
@ -144,7 +144,8 @@ class Pipe(object):
        return create_default_optimizer(self.model.ops,
                                        **self.cfg.get('optimizer', {}))

-    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None):
+    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None,
+                       **kwargs):
        """Initialize the pipe for training, using data examples if available.
        If no model has been initialized yet, the model is added."""
        if self.model is True:
@ -214,7 +215,8 @@ class Pipe(object):

 def _load_cfg(path):
    if path.exists():
-        return ujson.load(path.open())
+        with path.open() as file_:
+            return ujson.load(file_)
    else:
        return {}

@ -344,7 +346,8 @@ class Tensorizer(Pipe):
        loss = (d_scores**2).sum()
        return loss, d_scores

-    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None):
+    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None,
+                       **kwargs):
        """Allocate models, pre-process training data and acquire an
        optimizer.

@ -467,7 +470,8 @@ class Tagger(Pipe):
        d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs])
        return float(loss), d_scores

-    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None):
+    def begin_training(self, gold_tuples=tuple(), pipeline=None, sgd=None,
+                       **kwargs):
        orig_tag_map = dict(self.vocab.morphology.tag_map)
        new_tag_map = OrderedDict()
        for raw_text, annots_brackets in gold_tuples:
@ -580,7 +584,8 @@ class Tagger(Pipe):
        def load_model(p):
            if self.model is True:
                self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
-            self.model.from_bytes(p.open('rb').read())
+            with p.open('rb') as file_:
+                self.model.from_bytes(file_.read())

        def load_tag_map(p):
            with p.open('rb') as file_:
@ -641,7 +646,7 @@ class MultitaskObjective(Tagger):
        pass

    def begin_training(self, gold_tuples=tuple(), pipeline=None, tok2vec=None,
-                       sgd=None):
+                       sgd=None, **kwargs):
        gold_tuples = nonproj.preprocess_training_data(gold_tuples)
        for raw_text, annots_brackets in gold_tuples:
            for annots, brackets in annots_brackets:
@ -766,7 +771,7 @@ class SimilarityHook(Pipe):
    def update(self, doc1_doc2, golds, sgd=None, drop=0.):
        sims, bp_sims = self.model.begin_update(doc1_doc2, drop=drop)

-    def begin_training(self, _=tuple(), pipeline=None, sgd=None):
+    def begin_training(self, _=tuple(), pipeline=None, sgd=None, **kwargs):
        """Allocate model, using width from tensorizer in pipeline.

        gold_tuples (iterable): Gold-standard training data.
@ -887,6 +892,7 @@ cdef class DependencyParser(Parser):
        self._multitasks.append(labeller)

    def init_multitask_objectives(self, gold_tuples, pipeline, sgd=None, **cfg):
+        self.add_multitask_objective('tag')
        for labeller in self._multitasks:
            tok2vec = self.model[0]
            labeller.begin_training(gold_tuples, pipeline=pipeline,
--- a/spacy/symbols.pxd
+++ b/spacy/symbols.pxd
@ -17,9 +17,9 @@ cdef enum symbol_t:
    IS_QUOTE
    IS_LEFT_PUNCT
    IS_RIGHT_PUNCT
+    IS_CURRENCY

-    FLAG18 = 18
-    FLAG19
+    FLAG19 = 19
    FLAG20
    FLAG21
    FLAG22
--- a/spacy/symbols.pyx
+++ b/spacy/symbols.pyx
@ -22,8 +22,8 @@ IDS = {
    "IS_QUOTE": IS_QUOTE,
    "IS_LEFT_PUNCT": IS_LEFT_PUNCT,
    "IS_RIGHT_PUNCT": IS_RIGHT_PUNCT,
+    "IS_CURRENCY": IS_CURRENCY,

-    "FLAG18": FLAG18,
    "FLAG19": FLAG19,
    "FLAG20": FLAG20,
    "FLAG21": FLAG21,
--- a/spacy/syntax/arc_eager.pyx
+++ b/spacy/syntax/arc_eager.pyx
@ -390,6 +390,22 @@ cdef class ArcEager(TransitionSystem):
                gold.c.labels[i] = self.strings.add(label)
        return gold

+    def get_beam_parses(self, Beam beam):
+        parses = []
+        probs = beam.probs
+        for i in range(beam.size):
+            state = <StateC*>beam.at(i)
+            if state.is_final():
+                self.finalize_state(state)
+                prob = probs[i]
+                parse = []
+                for j in range(state.length):
+                    head = state.H(j)
+                    label = self.strings[state._sent[j].dep]
+                    parse.append((head, j, label))
+                parses.append((prob, parse))
+        return parses
+
    cdef Transition lookup_transition(self, object name) except *:
        if '-' in name:
            move_str, label_str = name.split('-', 1)
--- a/spacy/syntax/nn_parser.pyx
+++ b/spacy/syntax/nn_parser.pyx
@ -835,7 +835,8 @@ cdef class Parser:
                sgd = self.create_optimizer()
            self.model[1].begin_training(
                    self.model[1].ops.allocate((5, cfg['token_vector_width'])))
-            self.init_multitask_objectives(gold_tuples, pipeline, sgd=sgd, **cfg)
+            if pipeline is not None:
+                self.init_multitask_objectives(gold_tuples, pipeline, sgd=sgd, **cfg)
            link_vectors_to_models(self.vocab)
        else:
            if sgd is None:
@ -887,7 +888,7 @@ cdef class Parser:
        deserializers = {
            'vocab': lambda p: self.vocab.from_disk(p),
            'moves': lambda p: self.moves.from_disk(p, strings=False),
-            'cfg': lambda p: self.cfg.update(ujson.load(p.open())),
+            'cfg': lambda p: self.cfg.update(util.read_json(p)),
            'model': lambda p: None
        }
        util.from_disk(path, deserializers, exclude)
--- a/spacy/tests/lang/test_attrs.py
+++ b/spacy/tests/lang/test_attrs.py
@ -2,7 +2,7 @@
 from __future__ import unicode_literals

 from ...attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
-from ...lang.lex_attrs import is_punct, is_ascii, like_url, word_shape
+from ...lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape

 import pytest

@ -37,6 +37,13 @@ def test_lex_attrs_is_ascii(text, match):
    assert is_ascii(text) == match


+@pytest.mark.parametrize('text,match', [('$', True), ('£', True), ('♥', False),
+    ('€', True), ('¥', True), ('¢', True),
+    ('a', False), ('www.google.com', False), ('dog', False)])
+def test_lex_attrs_is_currency(text, match):
+    assert is_currency(text) == match
+
+
@pytest.mark.parametrize('text,match', [
    ('www.google.com', True), ('google.com', True), ('sydney.com', True),
    ('2girls1cup.org', True), ('http://stupid', True), ('www.hi', True),
--- a/spacy/tests/regression/test_issue1959.py
+++ b/spacy/tests/regression/test_issue1959.py
@ -0,0 +1,23 @@
+# coding: utf8
+from __future__ import unicode_literals
+import pytest
+
+
+@pytest.mark.models('en')
+def test_issue1959(EN):
+    texts = ['Apple is looking at buying U.K. startup for $1 billion.']
+    # nlp = load_test_model('en_core_web_sm')
+    EN.add_pipe(clean_component, name='cleaner', after='ner')
+    doc = EN(texts[0])
+    doc_pipe = [doc_pipe for doc_pipe in EN.pipe(texts)]
+    assert doc == doc_pipe[0]
+
+
+def clean_component(doc):
+    """ Clean up text. Make lowercase and remove punctuation and stopwords """
+    # Remove punctuation, symbols (#) and stopwords
+    doc = [tok.text.lower() for tok in doc if (not tok.is_stop
+                                               and tok.pos_ != 'PUNCT' and
+                                               tok.pos_ != 'SYM')]
+    doc = ' '.join(doc)
+    return doc
--- a/spacy/tests/serialize/test_serialize_language.py
+++ b/spacy/tests/serialize/test_serialize_language.py
@ -0,0 +1,28 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+from ..util import make_tempdir
+from ...language import Language
+
+import pytest
+
+
+@pytest.fixture
+def meta_data():
+    return {
+        'name': 'name-in-fixture',
+        'version': 'version-in-fixture',
+        'description': 'description-in-fixture',
+        'author': 'author-in-fixture',
+        'email': 'email-in-fixture',
+        'url': 'url-in-fixture',
+        'license': 'license-in-fixture',
+    }
+
+
+def test_serialize_language_meta_disk(meta_data):
+    language = Language(meta=meta_data)
+    with make_tempdir() as d:
+        language.to_disk(d)
+        new_language = Language().from_disk(d)
+    assert new_language.meta == language.meta
--- a/spacy/tokens/token.pyx
+++ b/spacy/tokens/token.pyx
@ -15,7 +15,7 @@ from ..lexeme cimport Lexeme
 from .. import parts_of_speech
 from ..attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
 from ..attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT
-from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL
+from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM, LIKE_EMAIL
 from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX
 from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP
 from ..compat import is_config
@ -855,6 +855,11 @@ cdef class Token:
        def __get__(self):
            return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)

+    property is_currency:
+        """RETURNS (bool): Whether the token is a currency symbol."""
+        def __get__(self):
+            return Lexeme.c_check_flag(self.c.lex, IS_CURRENCY)
+
    property like_url:
        """RETURNS (bool): Whether the token resembles a URL."""
        def __get__(self):
--- a/spacy/util.py
+++ b/spacy/util.py
@ -17,6 +17,7 @@ from thinc.neural._classes.model import Model
 import functools
 import cytoolz
 import itertools
+import numpy.random

 from .symbols import ORTH
 from .compat import cupy, CudaStream, path2str, basestring_, input_, unicode_
@ -623,3 +624,8 @@ def use_gpu(gpu_id):
    Model.ops = CupyOps()
    Model.Ops = CupyOps
    return device
+
+
+def fix_random_seed(seed=0):
+    random.seed(seed)
+    numpy.random.seed(seed)
--- a/spacy/vectors.pyx
+++ b/spacy/vectors.pyx
@ -347,7 +347,8 @@ cdef class Vectors:
        """
        def load_key2row(path):
            if path.exists():
-                self.key2row = msgpack.load(path.open('rb'))
+                with path.open('rb') as file_:
+                    self.key2row = msgpack.load(file_)
            for key, row in self.key2row.items():
                if row in self._unset:
                    self._unset.remove(row)
--- a/website/_includes/_navigation.jade
+++ b/website/_includes/_navigation.jade
@ -10,6 +10,9 @@ nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null)
            li.c-nav__menu__item(class=is_active ? "is-active" : null)
                +a(url)(tabindex=is_active ? "-1" : null)=item

+        li.c-nav__menu__item.u-hidden-xs
+            +a("https://survey.spacy.io", true) User Survey 2018
+
        li.c-nav__menu__item.u-hidden-xs
            +a(gh("spaCy"))(aria-label="GitHub") #[+icon("github", 20)]

--- a/website/usage/_facts-figures/_benchmarks.jade
+++ b/website/usage/_facts-figures/_benchmarks.jade
@ -13,7 +13,7 @@ p
    |  Their results and subsequent discussions helped us develop a novel
    |  psychologically-motivated technique to improve spaCy's accuracy, which
    |  we published in joint work with Macquarie University
-    |  #[+a("https://aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)].
+    |  #[+a("https://www.aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)].

 include _benchmarks-choi-2015

--- a/website/usage/_install/_troubleshooting.jade
+++ b/website/usage/_install/_troubleshooting.jade
@ -38,9 +38,10 @@ p
    |  #[code spacy/data] directory. This means your user needs permission to do
    |  this. The above error mostly occurs when doing a system-wide installation,
    |  which will create the symlinks in a system directory. Run the
-    |  #[code download] or #[code link] command as administrator, or use a
-    |  #[code virtualenv] to install spaCy in a user directory, instead
-    |  of doing a system-wide installation.
+    |  #[code download] or #[code link] command as administrator (on Windows,
+    |  simply right-click on your terminal or shell and select "Run as
+    |  Administrator"), or use a #[code virtualenv] to install spaCy in a user
+    |  directory, instead of doing a system-wide installation.

 +h(3, "no-cache-dir") No such option: --no-cache-dir

--- a/website/usage/_linguistic-features/_dependency-parse.jade
+++ b/website/usage/_linguistic-features/_dependency-parse.jade
@ -65,9 +65,9 @@ p
    - var style = [0, 1, 0, 1, 0]
    +annotation-row(["Autonomous", "amod", "cars", "NOUN", ""], style)
    +annotation-row(["cars", "nsubj", "shift", "VERB", "Autonomous"], style)
-    +annotation-row(["shift", "ROOT", "shift", "VERB", "cars, liability"], style)
+    +annotation-row(["shift", "ROOT", "shift", "VERB", "cars, liability, toward"], style)
    +annotation-row(["insurance", "compound", "liability", "NOUN", ""], style)
-    +annotation-row(["liability", "dobj", "shift", "VERB", "insurance, toward"], style)
+    +annotation-row(["liability", "dobj", "shift", "VERB", "insurance"], style)
    +annotation-row(["toward", "prep", "liability", "NOUN", "manufacturers"], style)
    +annotation-row(["manufacturers", "pobj", "toward", "ADP", ""], style)

--- a/website/usage/_linguistic-features/_named-entities.jade
+++ b/website/usage/_linguistic-features/_named-entities.jade
@ -80,7 +80,7 @@ p
    doc.ents = [netflix_ent]

    ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
-    assert ents = [(u'Netflix', 0, 7, u'ORG')]
+    assert ents == [(u'Netflix', 0, 7, u'ORG')]

 p
    |  Keep in mind that you need to create a #[code Span] with the start and
--- a/website/usage/_linguistic-features/_rule-based-matching.jade
+++ b/website/usage/_linguistic-features/_rule-based-matching.jade
@ -54,10 +54,21 @@ p

 p
    |  The matcher returns a list of #[code (match_id, start, end)] tuples – in
-    |  this case, #[code [('HelloWorld', 0, 2)]], which maps to the span
-    |  #[code doc[0:2]] of our original document. Optionally, we could also
-    |  choose to add more than one pattern, for example to also match sequences
-    |  without punctuation between "hello" and "world":
+    |  this case, #[code [(15578876784678163569, 0, 2)]], which maps to the
+    |  span #[code doc[0:2]] of our original document. The #[code match_id]
+    |  is the #[+a("/usage/spacy-101#vocab") hash value] of the string ID
+    |  "HelloWorld". To get the string value, you can look up the ID
+    |  in the #[+api("stringstore") #[code StringStore]].
+
++code.
+    for match_id, start, end in matches:
+        string_id = nlp.vocab.strings[match_id]  # 'HelloWorld'
+        span = doc[start:end]                    # the matched span
+
+p
+    |  Optionally, we could also choose to add more than one pattern, for
+    |  example to also match sequences without punctuation between "hello" and
+    |  "world":

 +code.
    matcher.add('HelloWorld', None,
@ -91,6 +102,10 @@ p
        +cell.u-nowrap #[code LOWER]
        +cell The lowercase form of the token text.

+    +row
+        +cell #[code LENGTH]
+        +cell The length of the token text.
+
    +row
        +cell.u-nowrap #[code IS_ALPHA], #[code IS_ASCII], #[code IS_DIGIT]
        +cell
@ -117,6 +132,10 @@ p
            |  The token's simple and extended part-of-speech tag, dependency
            |  label, lemma, shape.

+    +row
+        +cell.u-nowrap #[code ENT_TYPE]
+        +cell The token's entity label.
+
 +h(4, "adding-patterns-wildcard") Using wildcard token patterns
    +tag-new(2)

@ -335,7 +354,8 @@ p
    |  flag.

 +code.
-    IS_DEFINITELY = nlp.vocab.add_flag(re.compile(r'deff?in[ia]tely').match)
+    definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text))
+    IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)

    matcher = Matcher(nlp.vocab)
    matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])
--- a/website/usage/_linguistic-features/_tokenization.jade
+++ b/website/usage/_linguistic-features/_tokenization.jade
@ -54,7 +54,7 @@ p

 +code.
    import spacy
-    from spacy.symbols import ORTH, LEMMA, POS
+    from spacy.symbols import ORTH, LEMMA, POS, TAG

    nlp = spacy.load('en')
    doc = nlp(u'gimme that') # phrase to tokenize
--- a/website/usage/_models/_install-basics.jade
+++ b/website/usage/_models/_install-basics.jade
@ -31,3 +31,13 @@ p
    import spacy
    nlp = spacy.load('en')
    doc = nlp(u'This is a sentence.')
+
++infobox("Important note", "⚠️")
+    |  To allow loading models via convenient shortcuts like #[code 'en'], spaCy
+    |  will create a symlink within the #[code spacy/data] directory. This means
+    |  that your user needs the #[strong required permissions].
+    |  If you've installed spaCy to a system directory and don't have admin
+    |  privileges, the model linking may fail. The easiest solution
+    |  is to re-run the command as admin, or use a #[code virtualenv]. For more
+    |  info on this, see the
+    |  #[+a("/usage/#symlink-privilege") troubleshooting guide].
--- a/website/usage/_models/_install.jade
+++ b/website/usage/_models/_install.jade
@ -132,7 +132,7 @@ p
    # set up shortcut link to load local model as "my_amazing_model"
    python -m spacy link /Users/you/model my_amazing_model

-+infobox("Important note")
++infobox("Important note", "⚠️")
    |  In order to create a symlink, your user needs the #[strong required permissions].
    |  If you've installed spaCy to a system directory and don't have admin
    |  privileges, the #[code spacy link] command may fail. The easiest solution