Merge branch 'master' into feature-improve-model-download

2025-08-23 21:44:54 +03:00 · 2018-01-10 18:21:55 +01:00 · 2018-01-10 18:21:55 +01:00 · 7ca49c2061
commit 7ca49c2061
parent ef210c73dd f246fab0c1
21 changed files with 174 additions and 22 deletions
--- a/.github/contributors/kwhumphreys.md
+++ b/.github/contributors/kwhumphreys.md
@@ -0,0 +1,107 @@
+
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [ ] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect my
+    contributions.
+
+    * [x] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                            |
+|------------------------------- | -------------------------------- |
+| Name                           |  Kevin Humphreys                 |
+| Company name (if applicable)   |  Textio Inc.                     |
+| Title or role (if applicable)  |                                  |
+| Date                           |  01-03-2018                      |
+| GitHub username                |  kwhumphreys                     |
+| Website (optional)             |                                  |
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -150,7 +150,7 @@ recipes, that does provide some argument for bringing it "in house".

 ### Getting started

-To make changes to spaCy's code base, you need to clone the GitHub repository
+To make changes to spaCy's code base, you need to fork then clone the GitHub repository
 and build spaCy from source. You'll need to make sure that you have a
 development environment consisting of a Python distribution including header
 files, a compiler, [pip](https://pip.pypa.io/en/latest/installing/),
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -45,20 +45,25 @@ This is a list of everyone who has made significant contributions to spaCy, in a
 * Maxim Samsonov, [@maxirmx](https://github.com/maxirmx)
 * Michael Wallin, [@wallinm1](https://github.com/wallinm1)
 * Miguel Almeida, [@mamoit](https://github.com/mamoit)
+* Motoki Wu, [@tokestermw](https://github.com/tokestermw)
 * Oleg Zd, [@olegzd](https://github.com/olegzd)
+* Orhan Bilgin, [@melanuria](https://github.com/melanuria)
 * Orion Montoya, [@mdcclv](https://github.com/mdcclv)
 * Paul O'Leary McCann, [@polm](https://github.com/polm)
 * Pokey Rule, [@pokey](https://github.com/pokey)
 * Ramanan Balakrishnan, [@ramananbalakrishnan](https://github.com/ramananbalakrishnan)
 * Raphaël Bournhonesque, [@raphael0202](https://github.com/raphael0202)
 * Rob van Nieuwpoort, [@RvanNieuwpoort](https://github.com/RvanNieuwpoort)
+* Roman Domrachev, [@ligser](https://github.com/ligser)
 * Roman Inflianskas, [@rominf](https://github.com/rominf)
 * Sam Bozek, [@sambozek](https://github.com/sambozek)
 * Sasho Savkov, [@savkov](https://github.com/savkov)
 * Shuvanon Razik, [@shuvanon](https://github.com/shuvanon)
+* Søren Lind Kristiansen, [@sorenlind](https://github.com/sorenlind)
 * Swier, [@swierh](https://github.com/swierh)
 * Thomas Tanon, [@Tpt](https://github.com/Tpt)
 * Tiago Rodrigues, [@TiagoMRodrigues](https://github.com/TiagoMRodrigues)
+* Vadim Mazaev, [@GreenRiverRUS](https://github.com/GreenRiverRUS)
 * Vimos Tan, [@Vimos](https://github.com/Vimos)
 * Vsevolod Solovyov, [@vsolovyov](https://github.com/vsolovyov)
 * Wah Loon Keng, [@kengz](https://github.com/kengz)
--- a/spacy/__init__.py
+++ b/spacy/__init__.py
@@ -25,4 +25,4 @@ def blank(name, **kwargs):


 def info(model=None, markdown=False):
-    return cli_info(None, model, markdown)
+    return cli_info(model, markdown)
--- a/spacy/__main__.py
+++ b/spacy/__main__.py
@@ -28,7 +28,7 @@ if __name__ == '__main__':
    command = sys.argv.pop(1)
    sys.argv[0] = 'spacy %s' % command
    if command in commands:
-        plac.call(commands[command])
+        plac.call(commands[command], sys.argv[1:])
    else:
        prints(
            "Available: %s" % ', '.join(commands),
--- a/spacy/cli/convert.py
+++ b/spacy/cli/convert.py
@@ -24,8 +24,7 @@ CONVERTERS = {
    n_sents=("Number of sentences per doc", "option", "n", int),
    converter=("Name of converter (auto, iob, conllu or ner)", "option", "c", str),
    morphology=("Enable appending morphology to tags", "flag", "m", bool))
-def convert(cmd, input_file, output_dir, n_sents=1, morphology=False,
-            converter='auto'):
+def convert(input_file, output_dir, n_sents=1, morphology=False, converter='auto'):
    """
    Convert files into JSON format for use with train command and other
    experiment management functions.
--- a/spacy/cli/download.py
+++ b/spacy/cli/download.py
@@ -16,7 +16,7 @@ from .. import about
    model=("model to download, shortcut or name)", "positional", None, str),
    direct=("force direct download. Needs model name with version and won't "
            "perform compatibility check", "flag", "d", bool))
-def download(cmd, model, direct=False):
+def download(model, direct=False):
    """
    Download compatible model from default download path using pip. Model
    can be shortcut, model name or, if --direct flag is set, full model name
--- a/spacy/cli/evaluate.py
+++ b/spacy/cli/evaluate.py
@@ -25,8 +25,8 @@ numpy.random.seed(0)
    displacy_path=("directory to output rendered parses as HTML", "option",
                   "dp", str),
    displacy_limit=("limit of parses to render as HTML", "option", "dl", int))
-def evaluate(cmd, model, data_path, gpu_id=-1, gold_preproc=False,
-             displacy_path=None, displacy_limit=25):
+def evaluate(model, data_path, gpu_id=-1, gold_preproc=False, displacy_path=None,
+             displacy_limit=25):
    """
    Evaluate a model. To render a sample of parses in a HTML file, set an
    output directory as the displacy_path argument.
--- a/spacy/cli/info.py
+++ b/spacy/cli/info.py
@@ -13,7 +13,7 @@ from .. import util
 @plac.annotations(
    model=("optional: shortcut link of model", "positional", None, str),
    markdown=("generate Markdown for GitHub issues", "flag", "md", str))
-def info(cmd, model=None, markdown=False):
+def info(model=None, markdown=False):
    """Print info about spaCy installation. If a model shortcut link is
    speficied as an argument, print model information. Flag --markdown
    prints details in Markdown for easy copy-pasting to GitHub issues.
--- a/spacy/cli/init_model.py
+++ b/spacy/cli/init_model.py
@@ -25,7 +25,7 @@ from ..util import prints, ensure_path, get_lang_class
    prune_vectors=("optional: number of vectors to prune to",
                   "option", "V", int)
 )
-def init_model(_cmd, lang, output_dir, freqs_loc, clusters_loc=None, vectors_loc=None, prune_vectors=-1):
+def init_model(lang, output_dir, freqs_loc, clusters_loc=None, vectors_loc=None, prune_vectors=-1):
    """
    Create a new model from raw data, like word frequencies, Brown clusters
    and word vectors.
--- a/spacy/cli/link.py
+++ b/spacy/cli/link.py
@@ -13,7 +13,7 @@ from .. import util
    origin=("package name or local path to model", "positional", None, str),
    link_name=("name of shortuct link to create", "positional", None, str),
    force=("force overwriting of existing link", "flag", "f", bool))
-def link(cmd, origin, link_name, force=False, model_path=None):
+def link(origin, link_name, force=False, model_path=None):
    """
    Create a symlink for models within the spacy/data directory. Accepts
    either the name of a pip package, or the local path to the model data
--- a/spacy/cli/package.py
+++ b/spacy/cli/package.py
@@ -20,7 +20,7 @@ from .. import about
                 "the command line prompt", "flag", "c", bool),
    force=("force overwriting of existing model directory in output directory",
           "flag", "f", bool))
-def package(cmd, input_dir, output_dir, meta_path=None, create_meta=False,
+def package(input_dir, output_dir, meta_path=None, create_meta=False,
            force=False):
    """
    Generate Python package for model data, including meta and required
--- a/spacy/cli/profile.py
+++ b/spacy/cli/profile.py
@@ -29,7 +29,7 @@ def read_inputs(loc):
 @plac.annotations(
    lang=("model/language", "positional", None, str),
    inputs=("Location of input file", "positional", None, read_inputs))
-def profile(cmd, lang, inputs=None):
+def profile(lang, inputs=None):
    """
    Profile a spaCy pipeline, to find out which functions take the most time.
    """
--- a/spacy/cli/train.py
+++ b/spacy/cli/train.py
@@ -38,7 +38,7 @@ numpy.random.seed(0)
    version=("Model version", "option", "V", str),
    meta_path=("Optional path to meta.json. All relevant properties will be "
               "overwritten.", "option", "m", Path))
-def train(cmd, lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
+def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
          use_gpu=-1, vectors=None, no_tagger=False,
          no_parser=False, no_entities=False, gold_preproc=False,
          version="0.0.0", meta_path=None):
--- a/spacy/cli/validate.py
+++ b/spacy/cli/validate.py
@@ -11,7 +11,7 @@ from ..util import prints, get_data_path, read_json
 from .. import about


-def validate(cmd):
+def validate():
    """Validate that the currently installed version of spaCy is compatible
    with the installed models. Should be run after `pip install -U spacy`.
    """
--- a/spacy/cli/vocab.py
+++ b/spacy/cli/vocab.py
@@ -21,8 +21,7 @@ from ..util import prints, ensure_path
    prune_vectors=("optional: number of vectors to prune to.",
                   "option", "V", int)
 )
-def make_vocab(cmd, lang, output_dir, lexemes_loc,
-               vectors_loc=None, prune_vectors=-1):
+def make_vocab(lang, output_dir, lexemes_loc, vectors_loc=None, prune_vectors=-1):
    """Compile a vocabulary from a lexicon jsonl file and word vectors."""
    if not lexemes_loc.exists():
        prints(lexemes_loc, title="Can't find lexical data", exits=1)
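Review note: with the leading `cmd` placeholder dropped from every CLI function above, the entry point can pop the subcommand name and hand the remaining arguments straight to the target function. The following is a minimal sketch of that dispatch pattern using only the standard library. It is not spaCy's implementation: `dispatch`, `COMMANDS`, and the stand-in `info` and `validate` bodies are hypothetical, and it forwards positional arguments only, whereas plac also parses flags and options from the annotations.

```python
# Stand-in commands (assumption): the real ones carry plac annotations.
def info(model=None, markdown=False):
    return "info model=%s markdown=%s" % (model, markdown)


def validate():
    return "validate"


COMMANDS = {"info": info, "validate": validate}


def dispatch(argv):
    """Pop the subcommand name and forward the remaining arguments."""
    argv = list(argv)      # copy, so the caller's list is not mutated
    command = argv.pop(1)  # e.g. 'info' from ['spacy', 'info', 'en']
    if command not in COMMANDS:
        raise SystemExit("Available: %s" % ", ".join(sorted(COMMANDS)))
    # Positional forwarding only; no flag or option parsing in this sketch.
    return COMMANDS[command](*argv[1:])


result = dispatch(["spacy", "info", "en"])
```

Because no function expects a placeholder first argument any more, the same functions can also be called directly from Python, as the updated regression test for issue 1622 does with `train(...)`.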
--- a/spacy/lang/en/tokenizer_exceptions.py
+++ b/spacy/lang/en/tokenizer_exceptions.py
@@ -213,7 +213,8 @@ for verb_data in [
    {ORTH: "could", NORM: "could", TAG: "MD"},
    {ORTH: "might", NORM: "might", TAG: "MD"},
    {ORTH: "must", NORM: "must", TAG: "MD"},
-    {ORTH: "should", NORM: "should", TAG: "MD"}]:
+    {ORTH: "should", NORM: "should", TAG: "MD"},
+    {ORTH: "would", NORM: "would", TAG: "MD"}]:
    verb_data_tc = dict(verb_data)
    verb_data_tc[ORTH] = verb_data_tc[ORTH].title()
    for data in [verb_data, verb_data_tc]:
--- a/spacy/tests/regression/test_issue1622.py
+++ b/spacy/tests/regression/test_issue1622.py
@@ -9,7 +9,6 @@ from ...cli.train import train

 @pytest.mark.xfail
 def test_cli_trained_model_can_be_saved(tmpdir):
-    cmd = None
    lang = 'nl'
    output_dir = str(tmpdir)
    train_file = NamedTemporaryFile('wb', dir=output_dir, delete=False)
@@ -86,6 +85,6 @@ def test_cli_trained_model_can_be_saved(tmpdir):

    # spacy train -n 1 -g -1 nl output_nl training_corpus.json training \
    # corpus.json
-    train(cmd, lang, output_dir, train_data, dev_data, n_iter=1)
+    train(lang, output_dir, train_data, dev_data, n_iter=1)

    assert True
--- a/spacy/tests/regression/test_issue1758.py
+++ b/spacy/tests/regression/test_issue1758.py
@@ -0,0 +1,13 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+@pytest.mark.parametrize('text', ["would've"])
+def test_issue1758(en_tokenizer, text):
+    """Test that "would've" is handled by the English tokenizer exceptions."""
+    tokens = en_tokenizer(text)
+    assert len(tokens) == 2
+    assert tokens[0].tag_ == "MD"
+    assert tokens[1].lemma_ == "have"
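Review note: the new test relies on the modal-verb loop in `tokenizer_exceptions.py` producing an entry for "would've" once "would" is in the list. The body of that loop is not shown in this diff, so the sketch below is an assumption about its shape, chosen only to be consistent with what the test asserts (two tokens, tag `MD` on the first, lemma `have` on the second). `build_exceptions` is a hypothetical helper, and plain strings stand in for spaCy's `ORTH`, `NORM`, `TAG` and `LEMMA` symbols.

```python
# Plain-string stand-ins for spaCy's attribute symbols (assumption).
ORTH, NORM, TAG, LEMMA = "orth", "norm", "tag", "lemma"


def build_exceptions(modals):
    """Map each contracted form to the two tokens it should split into."""
    exc = {}
    for verb_data in modals:
        # Title-cased copy, mirroring the context lines in the diff.
        verb_data_tc = dict(verb_data)
        verb_data_tc[ORTH] = verb_data_tc[ORTH].title()
        for data in [verb_data, verb_data_tc]:
            # Assumed entry shape: the modal itself, then the clitic.
            exc[data[ORTH] + "'ve"] = [
                dict(data),
                {ORTH: "'ve", LEMMA: "have", TAG: "VB"},
            ]
    return exc


modals = [
    {ORTH: "should", NORM: "should", TAG: "MD"},
    {ORTH: "would", NORM: "would", TAG: "MD"},
]
exceptions = build_exceptions(modals)
```

Under these assumptions, leaving "would" out of the list means no "would've" key is generated at all, which matches the gap the regression test is meant to guard against.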
--- a/website/api/_top-level/_util.jade
+++ b/website/api/_top-level/_util.jade
@@ -51,7 +51,9 @@ p
 p
    |  Import and load a #[code Language] class. Allows lazy-loading
    |  #[+a("/usage/adding-languages") language data] and importing
-    |  languages using the two-letter language code.
+    |  languages using the two-letter language code. To add a language code
+    |  for a custom language class, you can use the
+    |  #[+api("top-level#util.set_lang_class") #[code set_lang_class]] helper.

 +aside-code("Example").
    for lang_id in ['en', 'de']:
@@ -70,6 +72,33 @@ p
        +cell #[code Language]
        +cell Language class.

++h(3, "util.set_lang_class") util.set_lang_class
+    +tag function
+
+p
+    |  Set a custom #[code Language] class name that can be loaded via
+    |  #[+api("top-level#util.get_lang_class") #[code get_lang_class]]. If
+    |  your model uses a custom language, this is required so that spaCy can
+    |  load the correct class from the two-letter language code.
+
++aside-code("Example").
+    from spacy.lang.xy import CustomLanguage
+
+    util.set_lang_class('xy', CustomLanguage)
+    lang_class = util.get_lang_class('xy')
+    nlp = lang_class()
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code name]
+        +cell unicode
+        +cell Two-letter language code, e.g. #[code 'en'].
+
+    +row
+        +cell #[code cls]
+        +cell #[code Language]
+        +cell The language class, e.g. #[code English].
+
 +h(3, "util.load_model") util.load_model
    +tag function
    +tag-new(2)
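Review note: the documented pairing of `set_lang_class` and `get_lang_class` amounts to a registry keyed by language code. The sketch below shows that idea in isolation. It is an illustration under stated assumptions, not spaCy's code: `_LANGUAGES` and `CustomLanguage` are hypothetical, and the lazy import of built-in languages that the docs describe for `get_lang_class` is left out.

```python
_LANGUAGES = {}  # hypothetical registry: language code -> class


def set_lang_class(name, cls):
    """Register a language class under a language code."""
    _LANGUAGES[name] = cls


def get_lang_class(name):
    """Return the class registered for a language code."""
    try:
        return _LANGUAGES[name]
    except KeyError:
        raise ValueError("No language class registered for %r" % name)


class CustomLanguage(object):
    lang = "xy"  # stand-in for a custom Language subclass


set_lang_class("xy", CustomLanguage)
nlp = get_lang_class("xy")()
```

Registering the class before loading is what lets a model whose metadata only names the code `xy` resolve to the right class.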
--- a/website/api/goldparse.jade
+++ b/website/api/goldparse.jade
@@ -136,7 +136,7 @@ p
 +aside-code("Example").
    from spacy.gold import biluo_tags_from_offsets

-    doc = nlp('I like London.')
+    doc = nlp(u'I like London.')
    entities = [(7, 13, 'LOC')]
    tags = biluo_tags_from_offsets(doc, entities)
    assert tags == ['O', 'O', 'U-LOC', 'O']
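Review note: for readers unfamiliar with the BILUO scheme used in this example, the sketch below shows how character offsets can be mapped onto per-token tags. It is a simplified stand-in written for illustration, not the implementation in `spacy.gold`: `biluo_from_spans` is a hypothetical helper, tokens are passed as precomputed `(start, end)` character spans instead of a `Doc`, and entities that do not align with token boundaries are simply skipped.

```python
def biluo_from_spans(token_spans, entities):
    """Tag tokens with B/I/L/U/O labels from character-offset entities."""
    tags = ["O"] * len(token_spans)
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    for start, end, label in entities:
        first, last = starts.get(start), ends.get(end)
        if first is None or last is None:
            continue  # offsets do not line up with token boundaries
        if first == last:
            tags[first] = "U-" + label      # single-token entity
        else:
            tags[first] = "B-" + label      # beginning
            for i in range(first + 1, last):
                tags[i] = "I-" + label      # inside
            tags[last] = "L-" + label       # last
    return tags


# 'I like London.' -> I, like, London, .
token_spans = [(0, 1), (2, 6), (7, 13), (13, 14)]
tags = biluo_from_spans(token_spans, [(7, 13, "LOC")])
```

With the spans above, `tags` matches the list asserted in the documentation example.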