Merge master

Matthew Honnibal 2018-03-14 19:03:24 +01:00
commit ab3d860686
32 changed files with 1910 additions and 199 deletions

11 .buildkite/train.yml Normal file

@ -0,0 +1,11 @@
steps:
-
command: "fab env clean make test wheel"
label: ":dizzy: :python:"
artifact_paths: "dist/*.whl"
- wait
- trigger: "spacy-train-from-wheel"
label: ":dizzy: :train:"
build:
env:
SPACY_VERSION: "{$SPACY_VERSION}"

106 .github/contributors/alldefector.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Feng Niu |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | Feb 21, 2018 |
| GitHub username | alldefector |
| Website (optional) | |

106 .github/contributors/willismonroe.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Willis Monroe |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-3-5 |
| GitHub username | willismonroe |
| Website (optional) | |


@ -182,7 +182,7 @@ If you've made a contribution to spaCy, you should fill in the
[spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that
your contribution can be used across the project. If you agree to be bound by
the terms of the agreement, fill in the [template](.github/CONTRIBUTOR_AGREEMENT.md)
-and include it with your pull request, or sumit it separately to
and include it with your pull request, or submit it separately to
[`.github/contributors/`](/.github/contributors). The name of the file should be
your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.


@ -28,8 +28,10 @@ import cytoolz
import conll17_ud_eval
import spacy.lang.zh
import spacy.lang.ja
spacy.lang.zh.Chinese.Defaults.use_jieba = False
spacy.lang.ja.Japanese.Defaults.use_janome = False
random.seed(0)
numpy.random.seed(0)
@ -280,6 +282,30 @@ def print_progress(itn, losses, ud_scores):
    ))
    print(tpl.format(itn, **fields))
#def get_sent_conllu(sent, sent_id):
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
def get_token_conllu(token, i):
if token._.begins_fused:
n = 1
while token.nbor(n)._.inside_fused:
n += 1
id_ = '%d-%d' % (i, i+n)
lines = [id_, token.text, '_', '_', '_', '_', '_', '_', '_', '_']
else:
lines = []
if token.head.i == token.i:
head = 0
else:
head = i + (token.head.i - token.i) + 1
fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
str(head), token.dep_.lower(), '_', '_']
lines.append('\t'.join(fields))
return '\n'.join(lines)
Token.set_extension('get_conllu_lines', method=get_token_conllu)
Token.set_extension('begins_fused', default=False)
Token.set_extension('inside_fused', default=False)
##################
# Initialization #
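The `get_token_conllu` helper added here writes one tab-separated CoNLL-U row per token, using 1-based indices that are local to the sentence, and emits an extra `start-end` range line for fused surface tokens. A rough sketch of the head arithmetic, with hypothetical token positions and labels that are not taken from this diff:

# Suppose the caller passes i=3 for a token whose syntactic head sits two positions
# earlier in the same sentence, so token.head.i - token.i == -2.
# Then head = i + (token.head.i - token.i) + 1 = 3 - 2 + 1 = 2, the first field is
# str(i + 1) == '4', and the emitted (tab-separated) row looks roughly like:
#   4   sentence   sentence   NOUN   NN   _   2   nsubj   _   _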

81 fabfile.py vendored

@ -1,49 +1,92 @@
# coding: utf-8
from __future__ import unicode_literals, print_function

import contextlib
from pathlib import Path
from fabric.api import local, lcd, env, settings, prefix
-from fabtools.python import virtualenv
from os import path, environ
import shutil

PWD = path.dirname(__file__)
ENV = environ['VENV_DIR'] if 'VENV_DIR' in environ else '.env'
-VENV_DIR = path.join(PWD, ENV)
VENV_DIR = Path(PWD) / ENV

-def env(lang='python2.7'):
-    if path.exists(VENV_DIR):
-    local('pip install virtualenv')
-    local('python -m virtualenv -p {lang} {env}'.format(lang=lang, env=VENV_DIR))

@contextlib.contextmanager
def virtualenv(name, create=False, python='/usr/bin/python3.6'):
    python = Path(python).resolve()
    env_path = VENV_DIR
    if create:
        if env_path.exists():
            shutil.rmtree(str(env_path))
        local('{python} -m venv {env_path}'.format(python=python, env_path=VENV_DIR))
    def wrapped_local(cmd, env_vars=[], capture=False, direct=False):
        return local('source {}/bin/activate && {}'.format(env_path, cmd),
                     shell='/bin/bash', capture=False)
    yield wrapped_local


def env(lang='python3.6'):
    if VENV_DIR.exists():
        local('rm -rf {env}'.format(env=VENV_DIR))
    if lang.startswith('python3'):
        local('{lang} -m venv {env}'.format(lang=lang, env=VENV_DIR))
    else:
        local('{lang} -m pip install virtualenv --no-cache-dir'.format(lang=lang))
        local('{lang} -m virtualenv {env} --no-cache-dir'.format(lang=lang, env=VENV_DIR))
    with virtualenv(VENV_DIR) as venv_local:
        print(venv_local('python --version', capture=True))
        venv_local('pip install --upgrade setuptools --no-cache-dir')
        venv_local('pip install pytest --no-cache-dir')
        venv_local('pip install wheel --no-cache-dir')
        venv_local('pip install -r requirements.txt --no-cache-dir')
        venv_local('pip install pex --no-cache-dir')


def install():
-    with virtualenv(VENV_DIR):
-        local('pip install --upgrade setuptools')
-        local('pip install dist/*.tar.gz')
-        local('pip install pytest')
    with virtualenv(VENV_DIR) as venv_local:
        venv_local('pip install dist/*.tar.gz')


def make():
-    with virtualenv(VENV_DIR):
-        local('pip install cython')
-        local('pip install murmurhash')
-        local('pip install -r requirements.txt')
-        local('python setup.py build_ext --inplace')
    with lcd(path.dirname(__file__)):
        local('export PYTHONPATH=`pwd` && source .env/bin/activate && python setup.py build_ext --inplace',
              shell='/bin/bash')


def sdist():
-    with virtualenv(VENV_DIR):
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            local('python setup.py sdist')


def wheel():
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            venv_local('python setup.py bdist_wheel')


def pex():
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            sha = local('git rev-parse --short HEAD', capture=True)
            venv_local('pex dist/*.whl -e spacy -o dist/spacy-%s.pex' % sha,
                       direct=True)


def clean():
    with lcd(path.dirname(__file__)):
-        local('python setup.py clean --all')
        local('rm -f dist/*.whl')
        local('rm -f dist/*.pex')
    with virtualenv(VENV_DIR) as venv_local:
        venv_local('python setup.py clean --all')


def test():
-    with virtualenv(VENV_DIR):
-            local('py.test -x spacy/tests')
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            venv_local('pytest -x spacy/tests')


def train():
    args = environ.get('SPACY_TRAIN_ARGS', '')
    with virtualenv(VENV_DIR) as venv_local:
        venv_local('spacy train {args}'.format(args=args))
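All of the rewritten tasks share one pattern: enter the `virtualenv` context manager and run commands through the `venv_local` wrapper so they execute inside the `.env` environment that `env()` builds. A hypothetical extra task, not part of this commit, would follow the same shape:

def lint():
    # hypothetical task, shown only to illustrate how the venv_local wrapper is used
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            venv_local('flake8 spacy')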


@ -8,6 +8,7 @@ if __name__ == '__main__':
    import sys
    from spacy.cli import download, link, info, package, train, convert
    from spacy.cli import vocab, init_model, profile, evaluate, validate
    from spacy.cli import ud_train, ud_evaluate
    from spacy.util import prints

    commands = {
@ -15,7 +16,9 @@ if __name__ == '__main__':
        'link': link,
        'info': info,
        'train': train,
        'ud-train': ud_train,
        'evaluate': evaluate,
        'ud-evaluate': ud_evaluate,
        'convert': convert,
        'package': package,
        'vocab': vocab,


@ -3,7 +3,7 @@
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
__title__ = 'spacy'
-__version__ = '2.1.0.dev1'
__version__ = '2.1.0.dev3'
__summary__ = 'Industrial-strength Natural Language Processing (NLP) with Python and Cython'
__uri__ = 'https://spacy.io'
__author__ = 'Explosion AI'


@ -9,3 +9,5 @@ from .convert import convert
from .vocab import make_vocab as vocab
from .init_model import init_model
from .validate import validate
from .ud_train import main as ud_train
from .conll17_ud_eval import main as ud_evaluate


@ -0,0 +1,570 @@
#!/usr/bin/env python
# CoNLL 2017 UD Parsing evaluation script.
#
# Compatible with Python 2.7 and 3.2+, can be used either as a module
# or a standalone executable.
#
# Copyright 2017 Institute of Formal and Applied Linguistics (UFAL),
# Faculty of Mathematics and Physics, Charles University, Czech Republic.
#
# Changelog:
# - [02 Jan 2017] Version 0.9: Initial release
# - [25 Jan 2017] Version 0.9.1: Fix bug in LCS alignment computation
# - [10 Mar 2017] Version 1.0: Add documentation and test
# Compare HEADs correctly using aligned words
# Allow evaluation with errorneous spaces in forms
# Compare forms in LCS case insensitively
# Detect cycles and multiple root nodes
# Compute AlignedAccuracy
# Command line usage
# ------------------
# conll17_ud_eval.py [-v] [-w weights_file] gold_conllu_file system_conllu_file
#
# - if no -v is given, only the CoNLL17 UD Shared Task evaluation LAS metrics
# is printed
# - if -v is given, several metrics are printed (as precision, recall, F1 score,
# and in case the metric is computed on aligned words also accuracy on these):
# - Tokens: how well do the gold tokens match system tokens
# - Sentences: how well do the gold sentences match system sentences
# - Words: how well can the gold words be aligned to system words
# - UPOS: using aligned words, how well does UPOS match
# - XPOS: using aligned words, how well does XPOS match
# - Feats: using aligned words, how well does FEATS match
# - AllTags: using aligned words, how well does UPOS+XPOS+FEATS match
# - Lemmas: using aligned words, how well does LEMMA match
# - UAS: using aligned words, how well does HEAD match
# - LAS: using aligned words, how well does HEAD+DEPREL(ignoring subtypes) match
# - if weights_file is given (with lines containing deprel-weight pairs),
# one more metric is shown:
# - WeightedLAS: as LAS, but each deprel (ignoring subtypes) has different weight
# API usage
# ---------
# - load_conllu(file)
# - loads CoNLL-U file from given file object to an internal representation
# - the file object should return str on both Python 2 and Python 3
# - raises UDError exception if the given file cannot be loaded
# - evaluate(gold_ud, system_ud)
# - evaluate the given gold and system CoNLL-U files (loaded with load_conllu)
# - raises UDError if the concatenated tokens of gold and system file do not match
# - returns a dictionary with the metrics described above, each metrics having
# three fields: precision, recall and f1
# Description of token matching
# -----------------------------
# In order to match tokens of gold file and system file, we consider the text
# resulting from concatenation of gold tokens and text resulting from
# concatenation of system tokens. These texts should match -- if they do not,
# the evaluation fails.
#
# If the texts do match, every token is represented as a range in this original
# text, and tokens are equal only if their range is the same.
# Description of word matching
# ----------------------------
# When matching words of gold file and system file, we first match the tokens.
# The words which are also tokens are matched as tokens, but words in multi-word
# tokens have to be handled differently.
#
# To handle multi-word tokens, we start by finding "multi-word spans".
# Multi-word span is a span in the original text such that
# - it contains at least one multi-word token
# - all multi-word tokens in the span (considering both gold and system ones)
# are completely inside the span (i.e., they do not "stick out")
# - the multi-word span is as small as possible
#
# For every multi-word span, we align the gold and system words completely
# inside this span using LCS on their FORMs. The words not intersecting
# (even partially) any multi-word span are then aligned as tokens.
from __future__ import division
from __future__ import print_function
import argparse
import io
import sys
import unittest
# CoNLL-U column names
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)
# UD Error is used when raising exceptions in this module
class UDError(Exception):
pass
# Load given CoNLL-U file into internal representation
def load_conllu(file):
# Internal representation classes
class UDRepresentation:
def __init__(self):
# Characters of all the tokens in the whole file.
# Whitespace between tokens is not included.
self.characters = []
# List of UDSpan instances with start&end indices into `characters`.
self.tokens = []
# List of UDWord instances.
self.words = []
# List of UDSpan instances with start&end indices into `characters`.
self.sentences = []
class UDSpan:
def __init__(self, start, end, characters):
self.start = start
# Note that self.end marks the first position **after the end** of span,
# so we can use characters[start:end] or range(start, end).
self.end = end
self.characters = characters
@property
def text(self):
return ''.join(self.characters[self.start:self.end])
def __str__(self):
return self.text
def __repr__(self):
return self.text
class UDWord:
def __init__(self, span, columns, is_multiword):
# Span of this word (or MWT, see below) within ud_representation.characters.
self.span = span
# 10 columns of the CoNLL-U file: ID, FORM, LEMMA,...
self.columns = columns
# is_multiword==True means that this word is part of a multi-word token.
# In that case, self.span marks the span of the whole multi-word token.
self.is_multiword = is_multiword
# Reference to the UDWord instance representing the HEAD (or None if root).
self.parent = None
# Let's ignore language-specific deprel subtypes.
self.columns[DEPREL] = columns[DEPREL].split(':')[0]
ud = UDRepresentation()
# Load the CoNLL-U file
index, sentence_start = 0, None
linenum = 0
while True:
line = file.readline()
linenum += 1
if not line:
break
line = line.rstrip("\r\n")
# Handle sentence start boundaries
if sentence_start is None:
# Skip comments
if line.startswith("#"):
continue
# Start a new sentence
ud.sentences.append(UDSpan(index, 0, ud.characters))
sentence_start = len(ud.words)
if not line:
# Add parent UDWord links and check there are no cycles
def process_word(word):
if word.parent == "remapping":
raise UDError("There is a cycle in a sentence")
if word.parent is None:
head = int(word.columns[HEAD])
if head > len(ud.words) - sentence_start:
raise UDError("HEAD '{}' points outside of the sentence".format(word.columns[HEAD]))
if head:
parent = ud.words[sentence_start + head - 1]
word.parent = "remapping"
process_word(parent)
word.parent = parent
for word in ud.words[sentence_start:]:
process_word(word)
# Check there is a single root node
if len([word for word in ud.words[sentence_start:] if word.parent is None]) != 1:
raise UDError("There are multiple roots in a sentence")
# End the sentence
ud.sentences[-1].end = index
sentence_start = None
continue
# Read next token/word
columns = line.split("\t")
if len(columns) != 10:
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, line))
# Skip empty nodes
if "." in columns[ID]:
continue
# Delete spaces from FORM so gold.characters == system.characters
# even if one of them tokenizes the space.
columns[FORM] = columns[FORM].replace(" ", "")
if not columns[FORM]:
raise UDError("There is an empty FORM in the CoNLL-U file -- line %d" % linenum)
# Save token
ud.characters.extend(columns[FORM])
ud.tokens.append(UDSpan(index, index + len(columns[FORM]), ud.characters))
index += len(columns[FORM])
# Handle multi-word tokens to save word(s)
if "-" in columns[ID]:
try:
start, end = map(int, columns[ID].split("-"))
except:
raise UDError("Cannot parse multi-word token ID '{}'".format(columns[ID]))
for _ in range(start, end + 1):
word_line = file.readline().rstrip("\r\n")
word_columns = word_line.split("\t")
if len(word_columns) != 10:
print(columns)
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, word_line))
ud.words.append(UDWord(ud.tokens[-1], word_columns, is_multiword=True))
# Basic tokens/words
else:
try:
word_id = int(columns[ID])
except:
raise UDError("Cannot parse word ID '{}'".format(columns[ID]))
if word_id != len(ud.words) - sentence_start + 1:
raise UDError("Incorrect word ID '{}' for word '{}', expected '{}'".format(columns[ID], columns[FORM], len(ud.words) - sentence_start + 1))
try:
head_id = int(columns[HEAD])
except:
raise UDError("Cannot parse HEAD '{}'".format(columns[HEAD]))
if head_id < 0:
raise UDError("HEAD cannot be negative")
ud.words.append(UDWord(ud.tokens[-1], columns, is_multiword=False))
if sentence_start is not None:
raise UDError("The CoNLL-U file does not end with empty line")
return ud
# Evaluate the gold and system treebanks (loaded using load_conllu).
def evaluate(gold_ud, system_ud, deprel_weights=None):
class Score:
def __init__(self, gold_total, system_total, correct, aligned_total=None):
self.precision = correct / system_total if system_total else 0.0
self.recall = correct / gold_total if gold_total else 0.0
self.f1 = 2 * correct / (system_total + gold_total) if system_total + gold_total else 0.0
self.aligned_accuracy = correct / aligned_total if aligned_total else aligned_total
class AlignmentWord:
def __init__(self, gold_word, system_word):
self.gold_word = gold_word
self.system_word = system_word
self.gold_parent = None
self.system_parent_gold_aligned = None
class Alignment:
def __init__(self, gold_words, system_words):
self.gold_words = gold_words
self.system_words = system_words
self.matched_words = []
self.matched_words_map = {}
def append_aligned_words(self, gold_word, system_word):
self.matched_words.append(AlignmentWord(gold_word, system_word))
self.matched_words_map[system_word] = gold_word
def fill_parents(self):
# We represent root parents in both gold and system data by '0'.
# For gold data, we represent non-root parent by corresponding gold word.
# For system data, we represent non-root parent by either gold word aligned
# to parent system nodes, or by None if no gold words is aligned to the parent.
for words in self.matched_words:
words.gold_parent = words.gold_word.parent if words.gold_word.parent is not None else 0
words.system_parent_gold_aligned = self.matched_words_map.get(words.system_word.parent, None) \
if words.system_word.parent is not None else 0
def lower(text):
if sys.version_info < (3, 0) and isinstance(text, str):
return text.decode("utf-8").lower()
return text.lower()
def spans_score(gold_spans, system_spans):
correct, gi, si = 0, 0, 0
while gi < len(gold_spans) and si < len(system_spans):
if system_spans[si].start < gold_spans[gi].start:
si += 1
elif gold_spans[gi].start < system_spans[si].start:
gi += 1
else:
correct += gold_spans[gi].end == system_spans[si].end
si += 1
gi += 1
return Score(len(gold_spans), len(system_spans), correct)
def alignment_score(alignment, key_fn, weight_fn=lambda w: 1):
gold, system, aligned, correct = 0, 0, 0, 0
for word in alignment.gold_words:
gold += weight_fn(word)
for word in alignment.system_words:
system += weight_fn(word)
for words in alignment.matched_words:
aligned += weight_fn(words.gold_word)
if key_fn is None:
# Return score for whole aligned words
return Score(gold, system, aligned)
for words in alignment.matched_words:
if key_fn(words.gold_word, words.gold_parent) == key_fn(words.system_word, words.system_parent_gold_aligned):
correct += weight_fn(words.gold_word)
return Score(gold, system, correct, aligned)
def beyond_end(words, i, multiword_span_end):
if i >= len(words):
return True
if words[i].is_multiword:
return words[i].span.start >= multiword_span_end
return words[i].span.end > multiword_span_end
def extend_end(word, multiword_span_end):
if word.is_multiword and word.span.end > multiword_span_end:
return word.span.end
return multiword_span_end
def find_multiword_span(gold_words, system_words, gi, si):
# We know gold_words[gi].is_multiword or system_words[si].is_multiword.
# Find the start of the multiword span (gs, ss), so the multiword span is minimal.
# Initialize multiword_span_end characters index.
if gold_words[gi].is_multiword:
multiword_span_end = gold_words[gi].span.end
if not system_words[si].is_multiword and system_words[si].span.start < gold_words[gi].span.start:
si += 1
else: # if system_words[si].is_multiword
multiword_span_end = system_words[si].span.end
if not gold_words[gi].is_multiword and gold_words[gi].span.start < system_words[si].span.start:
gi += 1
gs, ss = gi, si
# Find the end of the multiword span
# (so both gi and si are pointing to the word following the multiword span end).
while not beyond_end(gold_words, gi, multiword_span_end) or \
not beyond_end(system_words, si, multiword_span_end):
if gi < len(gold_words) and (si >= len(system_words) or
gold_words[gi].span.start <= system_words[si].span.start):
multiword_span_end = extend_end(gold_words[gi], multiword_span_end)
gi += 1
else:
multiword_span_end = extend_end(system_words[si], multiword_span_end)
si += 1
return gs, ss, gi, si
def compute_lcs(gold_words, system_words, gi, si, gs, ss):
lcs = [[0] * (si - ss) for i in range(gi - gs)]
for g in reversed(range(gi - gs)):
for s in reversed(range(si - ss)):
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
lcs[g][s] = 1 + (lcs[g+1][s+1] if g+1 < gi-gs and s+1 < si-ss else 0)
lcs[g][s] = max(lcs[g][s], lcs[g+1][s] if g+1 < gi-gs else 0)
lcs[g][s] = max(lcs[g][s], lcs[g][s+1] if s+1 < si-ss else 0)
return lcs
def align_words(gold_words, system_words):
alignment = Alignment(gold_words, system_words)
gi, si = 0, 0
while gi < len(gold_words) and si < len(system_words):
if gold_words[gi].is_multiword or system_words[si].is_multiword:
# A: Multi-word tokens => align via LCS within the whole "multiword span".
gs, ss, gi, si = find_multiword_span(gold_words, system_words, gi, si)
if si > ss and gi > gs:
lcs = compute_lcs(gold_words, system_words, gi, si, gs, ss)
# Store aligned words
s, g = 0, 0
while g < gi - gs and s < si - ss:
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
alignment.append_aligned_words(gold_words[gs+g], system_words[ss+s])
g += 1
s += 1
elif lcs[g][s] == (lcs[g+1][s] if g+1 < gi-gs else 0):
g += 1
else:
s += 1
else:
# B: No multi-word token => align according to spans.
if (gold_words[gi].span.start, gold_words[gi].span.end) == (system_words[si].span.start, system_words[si].span.end):
alignment.append_aligned_words(gold_words[gi], system_words[si])
gi += 1
si += 1
elif gold_words[gi].span.start <= system_words[si].span.start:
gi += 1
else:
si += 1
alignment.fill_parents()
return alignment
# Check that underlying character sequences do match
if gold_ud.characters != system_ud.characters:
index = 0
while gold_ud.characters[index] == system_ud.characters[index]:
index += 1
raise UDError(
"The concatenation of tokens in gold file and in system file differ!\n" +
"First 20 differing characters in gold file: '{}' and system file: '{}'".format(
"".join(gold_ud.characters[index:index + 20]),
"".join(system_ud.characters[index:index + 20])
)
)
# Align words
alignment = align_words(gold_ud.words, system_ud.words)
# Compute the F1-scores
result = {
"Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
"Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
"Words": alignment_score(alignment, None),
"UPOS": alignment_score(alignment, lambda w, parent: w.columns[UPOS]),
"XPOS": alignment_score(alignment, lambda w, parent: w.columns[XPOS]),
"Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
"AllTags": alignment_score(alignment, lambda w, parent: (w.columns[UPOS], w.columns[XPOS], w.columns[FEATS])),
"Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
"UAS": alignment_score(alignment, lambda w, parent: parent),
"LAS": alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL])),
}
# Add WeightedLAS if weights are given
if deprel_weights is not None:
def weighted_las(word):
return deprel_weights.get(word.columns[DEPREL], 1.0)
result["WeightedLAS"] = alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL]), weighted_las)
return result
def load_deprel_weights(weights_file):
if weights_file is None:
return None
deprel_weights = {}
for line in weights_file:
# Ignore comments and empty lines
if line.startswith("#") or not line.strip():
continue
columns = line.rstrip("\r\n").split()
if len(columns) != 2:
raise ValueError("Expected two columns in the UD Relations weights file on line '{}'".format(line))
deprel_weights[columns[0]] = float(columns[1])
return deprel_weights
def load_conllu_file(path):
_file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
return load_conllu(_file)
def evaluate_wrapper(args):
# Load CoNLL-U files
gold_ud = load_conllu_file(args.gold_file)
system_ud = load_conllu_file(args.system_file)
# Load weights if requested
deprel_weights = load_deprel_weights(args.weights)
return evaluate(gold_ud, system_ud, deprel_weights)
def main():
# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("gold_file", type=str,
help="Name of the CoNLL-U file with the gold data.")
parser.add_argument("system_file", type=str,
help="Name of the CoNLL-U file with the predicted data.")
parser.add_argument("--weights", "-w", type=argparse.FileType("r"), default=None,
metavar="deprel_weights_file",
help="Compute WeightedLAS using given weights for Universal Dependency Relations.")
parser.add_argument("--verbose", "-v", default=0, action="count",
help="Print all metrics.")
args = parser.parse_args()
# Use verbose if weights are supplied
if args.weights is not None and not args.verbose:
args.verbose = 1
# Evaluate
evaluation = evaluate_wrapper(args)
# Print the evaluation
if not args.verbose:
print("LAS F1 Score: {:.2f}".format(100 * evaluation["LAS"].f1))
else:
metrics = ["Tokens", "Sentences", "Words", "UPOS", "XPOS", "Feats", "AllTags", "Lemmas", "UAS", "LAS"]
if args.weights is not None:
metrics.append("WeightedLAS")
print("Metrics | Precision | Recall | F1 Score | AligndAcc")
print("-----------+-----------+-----------+-----------+-----------")
for metric in metrics:
print("{:11}|{:10.2f} |{:10.2f} |{:10.2f} |{}".format(
metric,
100 * evaluation[metric].precision,
100 * evaluation[metric].recall,
100 * evaluation[metric].f1,
"{:10.2f}".format(100 * evaluation[metric].aligned_accuracy) if evaluation[metric].aligned_accuracy is not None else ""
))
if __name__ == "__main__":
main()
# Tests, which can be executed with `python -m unittest conll17_ud_eval`.
class TestAlignment(unittest.TestCase):
@staticmethod
def _load_words(words):
"""Prepare fake CoNLL-U files with fake HEAD to prevent multiple roots errors."""
lines, num_words = [], 0
for w in words:
parts = w.split(" ")
if len(parts) == 1:
num_words += 1
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, parts[0], int(num_words>1)))
else:
lines.append("{}-{}\t{}\t_\t_\t_\t_\t_\t_\t_\t_".format(num_words + 1, num_words + len(parts) - 1, parts[0]))
for part in parts[1:]:
num_words += 1
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, part, int(num_words>1)))
return load_conllu((io.StringIO if sys.version_info >= (3, 0) else io.BytesIO)("\n".join(lines+["\n"])))
def _test_exception(self, gold, system):
self.assertRaises(UDError, evaluate, self._load_words(gold), self._load_words(system))
def _test_ok(self, gold, system, correct):
metrics = evaluate(self._load_words(gold), self._load_words(system))
gold_words = sum((max(1, len(word.split(" ")) - 1) for word in gold))
system_words = sum((max(1, len(word.split(" ")) - 1) for word in system))
self.assertEqual((metrics["Words"].precision, metrics["Words"].recall, metrics["Words"].f1),
(correct / system_words, correct / gold_words, 2 * correct / (gold_words + system_words)))
def test_exception(self):
self._test_exception(["a"], ["b"])
def test_equal(self):
self._test_ok(["a"], ["a"], 1)
self._test_ok(["a", "b", "c"], ["a", "b", "c"], 3)
def test_equal_with_multiword(self):
self._test_ok(["abc a b c"], ["a", "b", "c"], 3)
self._test_ok(["a", "bc b c", "d"], ["a", "b", "c", "d"], 4)
self._test_ok(["abcd a b c d"], ["ab a b", "cd c d"], 4)
self._test_ok(["abc a b c", "de d e"], ["a", "bcd b c d", "e"], 5)
def test_alignment(self):
self._test_ok(["abcd"], ["a", "b", "c", "d"], 0)
self._test_ok(["abc", "d"], ["a", "b", "c", "d"], 1)
self._test_ok(["a", "bc", "d"], ["a", "b", "c", "d"], 2)
self._test_ok(["a", "bc b c", "d"], ["a", "b", "cd"], 2)
self._test_ok(["abc a BX c", "def d EX f"], ["ab a b", "cd c d", "ef e f"], 4)
self._test_ok(["ab a b", "cd bc d"], ["a", "bc", "d"], 2)
self._test_ok(["a", "bc b c", "d"], ["ab AX BX", "cd CX a"], 1)


@ -116,10 +116,9 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %") print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %")
try: try:
for i in range(n_iter):
train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0, train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
gold_preproc=gold_preproc, max_length=0) gold_preproc=gold_preproc, max_length=0)
train_docs = list(train_docs)
for i in range(n_iter):
with tqdm.tqdm(total=n_train_words, leave=False) as pbar: with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
losses = {} losses = {}
for batch in minibatch(train_docs, size=batch_sizes): for batch in minibatch(train_docs, size=batch_sizes):

390 spacy/cli/ud_train.py Normal file

@ -0,0 +1,390 @@
'''Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
.conllu format for development data, allowing the official scorer to be used.
'''
from __future__ import unicode_literals
import plac
import tqdm
from pathlib import Path
import re
import sys
import json
import spacy
import spacy.util
from ..tokens import Token, Doc
from ..gold import GoldParse
from ..syntax.nonproj import projectivize
from ..matcher import Matcher
from collections import defaultdict, Counter
from timeit import default_timer as timer
import itertools
import random
import numpy.random
import cytoolz
from . import conll17_ud_eval
from .. import lang
from .. import lang
from ..lang import zh
from ..lang import ja
lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False
random.seed(0)
numpy.random.seed(0)
def minibatch_by_words(items, size=5000):
random.shuffle(items)
if isinstance(size, int):
size_ = itertools.repeat(size)
else:
size_ = size
items = iter(items)
while True:
batch_size = next(size_)
batch = []
while batch_size >= 0:
try:
doc, gold = next(items)
except StopIteration:
if batch:
yield batch
return
batch_size -= len(doc)
batch.append((doc, gold))
if batch:
yield batch
else:
break
################
# Data reading #
################
space_re = re.compile('\s+')
def split_text(text):
return [space_re.sub(' ', par.strip()) for par in text.split('\n\n')]
def read_data(nlp, conllu_file, text_file, raw_text=True, oracle_segments=False,
max_doc_length=None, limit=None):
'''Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
include Doc objects created using nlp.make_doc and then aligned against
the gold-standard sequences. If oracle_segments=True, include Doc objects
created from the gold-standard segments. At least one must be True.'''
if not raw_text and not oracle_segments:
raise ValueError("At least one of raw_text or oracle_segments must be True")
paragraphs = split_text(text_file.read())
conllu = read_conllu(conllu_file)
# sd is spacy doc; cd is conllu doc
# cs is conllu sent, ct is conllu token
docs = []
golds = []
for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
sent_annots = []
for cs in cd:
sent = defaultdict(list)
for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
if '.' in id_:
continue
if '-' in id_:
continue
id_ = int(id_)-1
head = int(head)-1 if head != '0' else id_
sent['words'].append(word)
sent['tags'].append(tag)
sent['heads'].append(head)
sent['deps'].append('ROOT' if dep == 'root' else dep)
sent['spaces'].append(space_after == '_')
sent['entities'] = ['-'] * len(sent['words'])
sent['heads'], sent['deps'] = projectivize(sent['heads'],
sent['deps'])
if oracle_segments:
docs.append(Doc(nlp.vocab, words=sent['words'], spaces=sent['spaces']))
golds.append(GoldParse(docs[-1], **sent))
sent_annots.append(sent)
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
doc, gold = _make_gold(nlp, None, sent_annots)
sent_annots = []
docs.append(doc)
golds.append(gold)
if limit and len(docs) >= limit:
return docs, golds
if raw_text and sent_annots:
doc, gold = _make_gold(nlp, None, sent_annots)
docs.append(doc)
golds.append(gold)
if limit and len(docs) >= limit:
return docs, golds
return docs, golds
def read_conllu(file_):
docs = []
sent = []
doc = []
for line in file_:
if line.startswith('# newdoc'):
if doc:
docs.append(doc)
doc = []
elif line.startswith('#'):
continue
elif not line.strip():
if sent:
doc.append(sent)
sent = []
else:
sent.append(list(line.strip().split('\t')))
if len(sent[-1]) != 10:
print(repr(line))
raise ValueError
if sent:
doc.append(sent)
if doc:
docs.append(doc)
return docs
def _make_gold(nlp, text, sent_annots):
# Flatten the conll annotations, and adjust the head indices
flat = defaultdict(list)
for sent in sent_annots:
flat['heads'].extend(len(flat['words'])+head for head in sent['heads'])
for field in ['words', 'tags', 'deps', 'entities', 'spaces']:
flat[field].extend(sent[field])
# Construct text if necessary
assert len(flat['words']) == len(flat['spaces'])
if text is None:
text = ''.join(word+' '*space for word, space in zip(flat['words'], flat['spaces']))
doc = nlp.make_doc(text)
flat.pop('spaces')
gold = GoldParse(doc, **flat)
return doc, gold
#############################
# Data transforms for spaCy #
#############################
def golds_to_gold_tuples(docs, golds):
'''Get out the annoying 'tuples' format used by begin_training, given the
GoldParse objects.'''
tuples = []
for doc, gold in zip(docs, golds):
text = doc.text
ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
sents = [((ids, words, tags, heads, labels, iob), [])]
tuples.append((text, sents))
return tuples
##############
# Evaluation #
##############
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
with text_loc.open('r', encoding='utf8') as text_file:
texts = split_text(text_file.read())
docs = list(nlp.pipe(texts))
with sys_loc.open('w', encoding='utf8') as out_file:
write_conllu(docs, out_file)
with gold_loc.open('r', encoding='utf8') as gold_file:
gold_ud = conll17_ud_eval.load_conllu(gold_file)
with sys_loc.open('r', encoding='utf8') as sys_file:
sys_ud = conll17_ud_eval.load_conllu(sys_file)
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
return scores
def write_conllu(docs, file_):
merger = Matcher(docs[0].vocab)
merger.add('SUBTOK', None, [{'DEP': 'subtok', 'op': '+'}])
for i, doc in enumerate(docs):
matches = merger(doc)
spans = [doc[start:end+1] for _, start, end in matches]
offsets = [(span.start_char, span.end_char) for span in spans]
for start_char, end_char in offsets:
doc.merge(start_char, end_char)
file_.write("# newdoc id = {i}\n".format(i=i))
for j, sent in enumerate(doc.sents):
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
file_.write("# text = {text}\n".format(text=sent.text))
for k, token in enumerate(sent):
file_.write(token._.get_conllu_lines(k) + '\n')
file_.write('\n')
def print_progress(itn, losses, ud_scores):
fields = {
'dep_loss': losses.get('parser', 0.0),
'tag_loss': losses.get('tagger', 0.0),
'words': ud_scores['Words'].f1 * 100,
'sents': ud_scores['Sentences'].f1 * 100,
'tags': ud_scores['XPOS'].f1 * 100,
'uas': ud_scores['UAS'].f1 * 100,
'las': ud_scores['LAS'].f1 * 100,
}
header = ['Epoch', 'Loss', 'LAS', 'UAS', 'TAG', 'SENT', 'WORD']
if itn == 0:
print('\t'.join(header))
tpl = '\t'.join((
'{:d}',
'{dep_loss:.1f}',
'{las:.1f}',
'{uas:.1f}',
'{tags:.1f}',
'{sents:.1f}',
'{words:.1f}',
))
print(tpl.format(itn, **fields))
#def get_sent_conllu(sent, sent_id):
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
def get_token_conllu(token, i):
if token._.begins_fused:
n = 1
while token.nbor(n)._.inside_fused:
n += 1
id_ = '%d-%d' % (i, i+n)
lines = [id_, token.text, '_', '_', '_', '_', '_', '_', '_', '_']
else:
lines = []
if token.head.i == token.i:
head = 0
else:
head = i + (token.head.i - token.i) + 1
fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
str(head), token.dep_.lower(), '_', '_']
lines.append('\t'.join(fields))
return '\n'.join(lines)
Token.set_extension('get_conllu_lines', method=get_token_conllu)
Token.set_extension('begins_fused', default=False)
Token.set_extension('inside_fused', default=False)
##################
# Initialization #
##################
def load_nlp(corpus, config):
lang = corpus.split('_')[0]
nlp = spacy.blank(lang)
if config.vectors:
nlp.vocab.from_disk(config.vectors / 'vocab')
return nlp
def initialize_pipeline(nlp, docs, golds, config):
nlp.add_pipe(nlp.create_pipe('parser'))
if config.multitask_tag:
nlp.parser.add_multitask_objective('tag')
if config.multitask_sent:
nlp.parser.add_multitask_objective('sent_start')
nlp.parser.moves.add_action(2, 'subtok')
nlp.add_pipe(nlp.create_pipe('tagger'))
for gold in golds:
for tag in gold.tags:
if tag is not None:
nlp.tagger.add_label(tag)
# Replace labels that didn't make the frequency cutoff
actions = set(nlp.parser.labels)
label_set = set([act.split('-')[1] for act in actions if '-' in act])
for gold in golds:
for i, label in enumerate(gold.labels):
if label is not None and label not in label_set:
gold.labels[i] = label.split('||')[0]
return nlp.begin_training(lambda: golds_to_gold_tuples(docs, golds))
########################
# Command line helpers #
########################
class Config(object):
def __init__(self, vectors=None, max_doc_length=10, multitask_tag=True,
multitask_sent=True, nr_epoch=30, batch_size=1000, dropout=0.2):
for key, value in locals().items():
setattr(self, key, value)
@classmethod
def load(cls, loc):
with Path(loc).open('r', encoding='utf8') as file_:
cfg = json.load(file_)
return cls(**cfg)
class Dataset(object):
def __init__(self, path, section):
self.path = path
self.section = section
self.conllu = None
self.text = None
for file_path in self.path.iterdir():
name = file_path.parts[-1]
if section in name and name.endswith('conllu'):
self.conllu = file_path
elif section in name and name.endswith('txt'):
self.text = file_path
if self.conllu is None:
msg = "Could not find .txt file in {path} for {section}"
raise IOError(msg.format(section=section, path=path))
if self.text is None:
msg = "Could not find .txt file in {path} for {section}"
self.lang = self.conllu.parts[-1].split('-')[0].split('_')[0]
class TreebankPaths(object):
def __init__(self, ud_path, treebank, **cfg):
self.train = Dataset(ud_path / treebank, 'train')
self.dev = Dataset(ud_path / treebank, 'dev')
self.lang = self.train.lang
@plac.annotations(
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
corpus=("UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
"positional", None, str),
parses_dir=("Directory to write the development parses", "positional", None, Path),
config=("Path to json formatted config file", "positional"),
limit=("Size limit", "option", "n", int)
)
def main(ud_dir, parses_dir, config, corpus, limit=0):
config = Config.load(config)
paths = TreebankPaths(ud_dir, corpus)
if not (parses_dir / corpus).exists():
(parses_dir / corpus).mkdir()
print("Train and evaluate", corpus, "using lang", paths.lang)
nlp = load_nlp(paths.lang, config)
docs, golds = read_data(nlp, paths.train.conllu.open(), paths.train.text.open(),
max_doc_length=config.max_doc_length, limit=limit)
optimizer = initialize_pipeline(nlp, docs, golds, config)
for i in range(config.nr_epoch):
docs = [nlp.make_doc(doc.text) for doc in docs]
batches = minibatch_by_words(list(zip(docs, golds)), size=config.batch_size)
losses = {}
n_train_words = sum(len(doc) for doc in docs)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
for batch in batches:
batch_docs, batch_gold = zip(*batch)
pbar.update(sum(len(doc) for doc in batch_docs))
nlp.update(batch_docs, batch_gold, sgd=optimizer,
drop=config.dropout, losses=losses)
out_path = parses_dir / corpus / 'epoch-{i}.conllu'.format(i=i)
with nlp.use_params(optimizer.averages):
scores = evaluate(nlp, paths.dev.text, paths.dev.conllu, out_path)
print_progress(i, losses, scores)
if __name__ == '__main__':
plac.call(main)
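With the `ud-train` command registered in the `__main__` changes above, the trainer can be run from the command line or called directly from Python. A rough sketch with hypothetical paths, following the argument order of `main(ud_dir, parses_dir, config, corpus, limit=0)`:

from pathlib import Path
from spacy.cli.ud_train import main as ud_train

ud_train(Path('/data/ud-treebanks-v2.0'),  # hypothetical UD corpus directory
         Path('/tmp/parses'),              # where epoch-{i}.conllu dev parses are written
         'config.json',                    # hypothetical JSON config (keys as in Config above)
         'en',                             # treebank name, e.g. 'en' or 'es_ancora'
         limit=1000)                       # optional size limit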


@ -13,7 +13,7 @@ from . import _align
from .syntax import nonproj
from .tokens import Doc
from . import util
-from .util import minibatch
from .util import minibatch, itershuffle


def tags_to_entities(tags):
@ -133,15 +133,14 @@ class GoldCorpus(object):
    def train_docs(self, nlp, gold_preproc=False,
                   projectivize=False, max_length=None,
                   noise_level=0.0):
-        train_tuples = self.train_tuples
        if projectivize:
            train_tuples = nonproj.preprocess_training_data(
-                self.train_tuples, label_freq_cutoff=100)
                self.train_tuples, label_freq_cutoff=30)
-        random.shuffle(train_tuples)
        random.shuffle(self.train_locs)
        gold_docs = self.iter_gold_docs(nlp, train_tuples, gold_preproc,
                                        max_length=max_length,
                                        noise_level=noise_level)
-        yield from gold_docs
        yield from itershuffle(gold_docs, bufsize=100)

    def dev_docs(self, nlp, gold_preproc=False):
        gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc)


@ -21,7 +21,7 @@ class SpanishDefaults(Language.Defaults):
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    tag_map = TAG_MAP
    stop_words = STOP_WORDS
-    sytax_iterators = SYNTAX_ITERATORS
    syntax_iterators = SYNTAX_ITERATORS
    lemma_lookup = LOOKUP


@ -6,17 +6,19 @@ from ...symbols import NOUN, PROPN, PRON, VERB, AUX
def noun_chunks(obj):
    doc = obj.doc
-    np_label = doc.vocab.strings['NP']
    if not len(doc):
        return
    np_label = doc.vocab.strings.add('NP')
    left_labels = ['det', 'fixed', 'neg'] #['nunmod', 'det', 'appos', 'fixed']
    right_labels = ['flat', 'fixed', 'compound', 'neg']
    stop_labels = ['punct']
-    np_left_deps = [doc.vocab.strings[label] for label in left_labels]
    np_left_deps = [doc.vocab.strings.add(label) for label in left_labels]
-    np_right_deps = [doc.vocab.strings[label] for label in right_labels]
    np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
-    stop_deps = [doc.vocab.strings[label] for label in stop_labels]
    stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
    token = doc[0]
    while token and token.i < len(doc):
        if token.pos in [PROPN, NOUN, PRON]:
-            left, right = noun_bounds(token)
            left, right = noun_bounds(doc, token, np_left_deps, np_right_deps, stop_deps)
            yield left.i, right.i+1, np_label
            token = right
        token = next_token(token)
@ -33,7 +35,7 @@ def next_token(token):
    return None


-def noun_bounds(root):
def noun_bounds(doc, root, np_left_deps, np_right_deps, stop_deps):
    left_bound = root
    for token in reversed(list(root.lefts)):
        if token.dep in np_left_deps:
@ -41,7 +43,7 @@ def noun_bounds(root):
    right_bound = root
    for token in root.rights:
        if (token.dep in np_right_deps):
-            left, right = noun_bounds(token)
            left, right = noun_bounds(doc, token, np_left_deps, np_right_deps, stop_deps)
            if list(filter(lambda t: is_verb_token(t) or t.dep in stop_deps,
                           doc[left_bound.i: right.i])):
                break


@ -6,10 +6,25 @@ from __future__ import unicode_literals
STOP_WORDS = set("""
a
ah
aha
aj
ako
al
ali
arh
au
avaj
bar
baš
bez
bi
bih
bijah
bijahu
bijaše
bijasmo
bijaste
bila
bili
bilo
@ -17,25 +32,104 @@ bio
bismo
biste
biti
brr
buć
budavši
bude
budimo
budite
budu
budući
bum
bumo
će
ćemo
ćeš
ćete
čijem
čijim
čijima
ću
da
daj
dakle
de
deder
dem
djelomice
djelomično
do
doista
dok
dokle
donekle
dosad
doskoro
dotad
dotle
dovečer
drugamo
drugdje
duž
e
eh
ehe
ej
eno
eto
evo
ga
gdjekakav
gdjekoje
gic
god
halo
hej
hm
hoće
hoćemo
hoćeš
hoćete
hoću
hop
htijahu
htijasmo
htijaste
htio
htjedoh
htjedoše
htjedoste
htjela
htjele
htjeli
hura
i
iako
ih
iju
ijuju
ikada
ikakav
ikakva
ikakve
ikakvi
ikakvih
ikakvim
ikakvima
ikakvo
ikakvog
ikakvoga
ikakvoj
ikakvom
ikakvome
ili
im
iz
ja
je
jedna
jedne
jedni
jedno
jer
jesam
@ -57,6 +151,7 @@ koji
kojima
koju
kroz
lani
li
me
mene
@ -66,6 +161,8 @@ mimo
moj
moja
moje
moji
moju
mu
na
nad
@ -77,24 +174,27 @@ naš
naša
naše
našeg
naši
ne
neće
nećemo
nećeš
nećete
neću
nego
neka
neke
neki
nekog
neku
nema
nešto
netko
ni
nije
nikoga
nikoje
nikoji
nikoju
nisam
nisi
@ -123,33 +223,63 @@ od
odmah
on
ona
one
oni
ono
onu
onoj
onom
onim
onima
ova
ovaj
ovim
ovima
ovoj
pa
pak
pljus
po
pod
podalje
poimence
poizdalje
ponekad
pored
postrance
potajice
potrbuške
pouzdano
prije
s
sa
sam
samo
sasvim
sav
se
sebe
sebi
si
šic
smo
ste
što
šta
štogod
štagod
su
sva
sve
svi
svog
svoj
svoja
svoje
svoju
svom
svu
ta
tada
taj
@ -158,6 +288,8 @@ te
tebe tebe
tebi tebi
ti ti
tim
tima
to to
toj toj
tome tome
@ -165,23 +297,51 @@ tu
tvoj tvoj
tvoja tvoja
tvoje tvoje
tvoji
tvoju
u u
usprkos
utaman
uvijek
uz uz
uza
uzagrapce
uzalud
uzduž
valjda
vam vam
vama vama
vas vas
vaš vaš
vaša vaša
vaše vaše
vašim
vašima
već već
vi vi
vjerojatno
vjerovatno
vrh
vrlo vrlo
za za
zaista
zar zar
zatim
zato
zbija
zbog
želeći
željah
željela
željele
željeli
željelo
željen
željena
željene
željeni
željenu
željeo
zimus
zum
""".split())

@ -35,14 +35,32 @@ class JapaneseTokenizer(object):
    def from_disk(self, path, **exclude):
        return self


class JapaneseCharacterSegmenter(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = []
        spaces = []
        doc = self.tokenizer(text)
        for token in self.tokenizer(text):
            words.extend(list(token.text))
            spaces.extend([False]*len(token.text))
            spaces[-1] = bool(token.whitespace_)
        return Doc(self.vocab, words=words, spaces=spaces)


class JapaneseDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'ja'
    use_janome = True

    @classmethod
    def create_tokenizer(cls, nlp=None):
        if cls.use_janome:
            return JapaneseTokenizer(cls, nlp)
        else:
            return JapaneseCharacterSegmenter(cls, nlp.vocab)


class Japanese(Language):
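The character segmenter is a fallback for environments where the Janome tokenizer is unavailable; its intended output is simply one token per character. A standalone sketch of that output, built directly with Doc (independent of the class above, which still reaches for a word-level tokenizer internally):

from spacy.tokens import Doc
from spacy.vocab import Vocab

text = 'すもももももももものうち'
doc = Doc(Vocab(), words=list(text), spaces=[False] * len(text))
print([t.text for t in doc])  # one token per character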

22
spacy/lang/tr/examples.py Normal file
@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.tr.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
    "Neredesin?",
    "Neredesiniz?",
    "Bu bir cümledir.",
    "Sürücüsüz araçlar sigorta yükümlülüğünü üreticilere kaydırıyor.",
    "San Francisco kaldırımda kurye robotları yasaklayabilir.",
    "Londra İngiltere'nin başkentidir.",
    "Türkiye'nin başkenti neresi?",
    "Bakanlar Kurulu 180 günlük eylem planını açıkladı.",
    "Merkez Bankası, beklentiler doğrultusunda faizlerde değişikliğe gitmedi."
]
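A possible way to run these sentences, assuming a blank Turkish pipeline (no trained Turkish model is shipped, so this only exercises the tokenizer):

import spacy
from spacy.lang.tr.examples import sentences

nlp = spacy.blank('tr')
for doc in nlp.pipe(sentences):
    print([token.text for token in doc])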

@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
# Thirteen, fifteen etc. are written separately: on üç
_num_words = ['bir', 'iki', 'üç', 'dört', 'beş', 'altı', 'yedi', 'sekiz',
'dokuz', 'on', 'yirmi', 'otuz', 'kırk', 'elli', 'altmış',
'yetmiş', 'seksen', 'doksan', 'yüz', 'bin', 'milyon',
'milyar', 'katrilyon', 'kentilyon']
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
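A quick, standalone sanity check of the like_num hook defined above (no pipeline needed; the inputs are arbitrary examples):

print(like_num('12'))      # True - plain digits
print(like_num('3/4'))     # True - simple fractions
print(like_num('doksan'))  # True - a spelled-out number word
print(like_num('kitap'))   # False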

@ -10,16 +10,12 @@ acep
adamakıllı adamakıllı
adeta adeta
ait ait
altmýþ
altmış
altý
altı
ama ama
amma amma
anca anca
ancak ancak
arada arada
artık
aslında aslında
aynen aynen
ayrıca ayrıca
@ -29,46 +25,82 @@ açıkçası
bana bana
bari bari
bazen bazen
bazý
bazı bazı
bazısı
bazısına
bazısında
bazısından
bazısını
bazısının
başkası başkası
başkasına
başkasında
başkasından
başkasını
başkasının
başka
belki belki
ben ben
bende
benden benden
beni beni
benim benim
beri beri
beriki beriki
berikinin
berikiyi
berisi
bilcümle bilcümle
bile bile
bin
binaen binaen
binaenaleyh binaenaleyh
bir
biraz biraz
birazdan birazdan
birbiri birbiri
birbirine
birbirini
birbirinin
birbirinde
birbirinden
birden birden
birdenbire birdenbire
biri biri
birine
birini
birinin
birinde
birinden
birice birice
birileri birileri
birilerinde
birilerinden
birilerine
birilerini
birilerinin
birisi birisi
birisine
birisini
birisinin
birisinde
birisinden
birkaç birkaç
birkaçı birkaçı
birkaçına
birkaçını
birkaçının
birkaçında
birkaçından
birkez birkez
birlikte birlikte
birçok birçok
birçoğu birçoğu
birçoğuna
birçoğunda
birçoğundan
birçoğunu
birçoğunun
birşey birşey
birşeyi birşeyi
birţey
bitevi bitevi
biteviye biteviye
bittabi bittabi
@ -96,6 +128,11 @@ buracıkta
burada burada
buradan buradan
burası burası
burasına
burasını
burasının
burasında
burasından
böyle böyle
böylece böylece
böylecene böylecene
@ -106,8 +143,34 @@ büsbütün
bütün bütün
cuk cuk
cümlesi cümlesi
cümlesine
cümlesini
cümlesinin
cümlesinden
cümlemize
cümlemizi
cümlemizden
çabuk
çabukça
çeşitli
çok
çokları
çoklarınca
çokluk
çoklukla
çokça
çoğu
çoğun
çoğunca
çoğunda
çoğundan
çoğunlukla
çoğunu
çoğunun
çünkü
da da
daha daha
dahası
dahi dahi
dahil dahil
dahilen dahilen
@ -124,19 +187,17 @@ denli
derakap derakap
derhal derhal
derken derken
deđil
değil değil
değin değin
diye diye
diđer
diğer diğer
diğeri diğeri
diğerine
diğerini
diğerinden
dolayı dolayı
dolayısıyla dolayısıyla
doğru doğru
dört
edecek edecek
eden eden
ederek ederek
@ -146,7 +207,6 @@ edilmesi
ediyor ediyor
elbet elbet
elbette elbette
elli
emme emme
en en
enikonu enikonu
@ -168,10 +228,10 @@ evvelce
evvelden evvelden
evvelemirde evvelemirde
evveli evveli
eđer
eğer eğer
fakat fakat
filanca filanca
filancanın
gah gah
gayet gayet
gayetle gayetle
@ -197,6 +257,10 @@ haliyle
handiyse handiyse
hangi hangi
hangisi hangisi
hangisine
hangisine
hangisinde
hangisinden
hani hani
hariç hariç
hasebiyle hasebiyle
@ -207,17 +271,27 @@ hem
henüz henüz
hep hep
hepsi hepsi
hepsini
hepsinin
hepsinde
hepsinden
her her
herhangi herhangi
herkes herkes
herkesi
herkesin herkesin
herkesten
hiç hiç
hiçbir hiçbir
hiçbiri hiçbiri
hiçbirine
hiçbirini
hiçbirinin
hiçbirinde
hiçbirinden
hoş hoş
hulasaten hulasaten
iken iken
iki
ila ila
ile ile
ilen ilen
@ -240,43 +314,55 @@ iyicene
için için
işte işte
iţte
kadar kadar
kaffesi kaffesi
kah kah
kala kala
kanımca
karşın karşın
katrilyon
kaynak kaynak
kaçı kaçı
kaçına
kaçında
kaçından
kaçını
kaçının
kelli kelli
kendi kendi
kendilerinde
kendilerinden
kendilerine kendilerine
kendilerini
kendilerinin
kendini kendini
kendisi kendisi
kendisinde
kendisinden
kendisine kendisine
kendisini kendisini
kendisinin
kere kere
kez kez
keza keza
kezalik kezalik
keşke keşke
keţke
ki ki
kim kim
kimden kimden
kime kime
kimi kimi
kiminin
kimisi kimisi
kimisinde
kimisinden
kimisine
kimisinin
kimse kimse
kimsecik kimsecik
kimsecikler kimsecikler
külliyen külliyen
kýrk
kýsaca
kırk
kısaca kısaca
kısacası
lakin lakin
leh leh
lütfen lütfen
@ -289,13 +375,10 @@ međer
meğer meğer
meğerki meğerki
meğerse meğerse
milyar
milyon
mu mu
mı mı
mi
nasıl nasıl
nasılsa nasılsa
nazaran nazaran
@ -304,6 +387,8 @@ ne
neden neden
nedeniyle nedeniyle
nedenle nedenle
nedenler
nedenlerden
nedense nedense
nerde nerde
nerden nerden
@ -332,32 +417,27 @@ olduklarını
oldukça oldukça
olduğu olduğu
olduğunu olduğunu
olmadı
olmadığı
olmak olmak
olması olması
olmayan
olmaz
olsa olsa
olsun olsun
olup olup
olur olur
olursa olursa
oluyor oluyor
on
ona ona
onca onca
onculayın onculayın
onda onda
ondan ondan
onlar onlar
onlara
onlardan onlardan
onlari
onlarýn
onları onları
onların onların
onu onu
onun onun
ora
oracık oracık
oracıkta oracıkta
orada orada
@ -365,9 +445,26 @@ oradan
oranca oranca
oranla oranla
oraya oraya
otuz
oysa oysa
oysaki oysaki
öbür
öbürkü
öbürü
öbüründe
öbüründen
öbürüne
öbürünü
önce
önceden
önceleri
öncelikle
öteki
ötekisi
öyle
öylece
öylelikle
öylemesine
öz
pek pek
pekala pekala
peki peki
@ -379,8 +476,6 @@ sahi
sahiden sahiden
sana sana
sanki sanki
sekiz
seksen
sen sen
senden senden
seni seni
@ -393,6 +488,27 @@ sonra
sonradan sonradan
sonraları sonraları
sonunda sonunda
şayet
şey
şeyden
şeyi
şeyler
şu
şuna
şuncacık
şunda
şundan
şunlar
şunları
şunların
şunu
şunun
şura
şuracık
şuracıkta
şurası
şöyle
şimdi
tabii tabii
tam tam
tamam tamam
@ -400,8 +516,8 @@ tamamen
tamamıyla tamamıyla
tarafından tarafından
tek tek
trilyon
tüm tüm
üzere
var var
vardı vardı
vasıtasıyla vasıtasıyla
@ -429,84 +545,16 @@ yaptığını
yapılan yapılan
yapılması yapılması
yapıyor yapıyor
yedi
yeniden yeniden
yenilerde yenilerde
yerine yerine
yetmiþ
yetmiş
yetmiţ
yine yine
yirmi
yok yok
yoksa yoksa
yoluyla yoluyla
yüz
yüzünden yüzünden
zarfında zarfında
zaten zaten
zati zati
zira zira
çabuk
çabukça
çeşitli
çok
çokları
çoklarınca
çokluk
çoklukla
çokça
çoğu
çoğun
çoğunca
çoğunlukla
çünkü
öbür
öbürkü
öbürü
önce
önceden
önceleri
öncelikle
öteki
ötekisi
öyle
öylece
öylelikle
öylemesine
öz
üzere
üç
þey
þeyden
þeyi
þeyler
þu
þuna
þunda
þundan
þunu
şayet
şey
şeyden
şeyi
şeyler
şu
şuna
şuncacık
şunda
şundan
şunlar
şunları
şunu
şunun
şura
şuracık
şuracıkta
şurası
şöyle
ţayet
ţimdi
ţu
ţöyle
""".split())

@ -3,11 +3,6 @@ from __future__ import unicode_literals
from ...symbols import ORTH, NORM
# These exceptions are mostly for example purposes hoping that Turkish
# speakers can contribute in the future! Source of copy-pasted examples:
# https://en.wiktionary.org/wiki/Category:Turkish_language
_exc = {
    "sağol": [
        {ORTH: "sağ"},
@ -16,11 +11,112 @@ _exc = {
for exc_data in [
    {ORTH: "A.B.D.", NORM: "Amerika Birleşik Devletleri"},
{ORTH: "Alb.", NORM: "Albay"},
{ORTH: "Ar.Gör.", NORM: "Araştırma Görevlisi"},
{ORTH: "Arş.Gör.", NORM: "Araştırma Görevlisi"},
{ORTH: "Asb.", NORM: "Astsubay"},
{ORTH: "Astsb.", NORM: "Astsubay"},
{ORTH: "As.İz.", NORM: "Askeri İnzibat"},
{ORTH: "Atğm", NORM: "Asteğmen"},
{ORTH: "Av.", NORM: "Avukat"},
{ORTH: "Apt.", NORM: "Apartmanı"},
{ORTH: "Bçvş.", NORM: "Başçavuş"},
{ORTH: "bk.", NORM: "bakınız"},
{ORTH: "bknz.", NORM: "bakınız"},
{ORTH: "Bnb.", NORM: "Binbaşı"},
{ORTH: "bnb.", NORM: "binbaşı"},
{ORTH: "Böl.", NORM: "Bölümü"},
{ORTH: "Bşk.", NORM: "Başkanlığı"},
{ORTH: "Bştbp.", NORM: "Baştabip"},
{ORTH: "Bul.", NORM: "Bulvarı"},
{ORTH: "Cad.", NORM: "Caddesi"},
{ORTH: "çev.", NORM: "çeviren"},
{ORTH: "Çvş.", NORM: "Çavuş"},
{ORTH: "dak.", NORM: "dakika"},
{ORTH: "dk.", NORM: "dakika"},
{ORTH: "Doç.", NORM: "Doçent"},
{ORTH: "doğ.", NORM: "doğum tarihi"},
{ORTH: "drl.", NORM: "derleyen"},
{ORTH: "Dz.", NORM: "Deniz"},
{ORTH: "Dz.K.K.lığı", NORM: "Deniz Kuvvetleri Komutanlığı"},
{ORTH: "Dz.Kuv.", NORM: "Deniz Kuvvetleri"},
{ORTH: "Dz.Kuv.K.", NORM: "Deniz Kuvvetleri Komutanlığı"},
{ORTH: "dzl.", NORM: "düzenleyen"},
{ORTH: "Ecz.", NORM: "Eczanesi"},
{ORTH: "ekon.", NORM: "ekonomi"},
{ORTH: "Fak.", NORM: "Fakültesi"},
{ORTH: "Gn.", NORM: "Genel"},
{ORTH: "Gnkur.", NORM: "Genelkurmay"},
{ORTH: "Gn.Kur.", NORM: "Genelkurmay"},
{ORTH: "gr.", NORM: "gram"},
{ORTH: "Hst.", NORM: "Hastanesi"},
{ORTH: "Hs.Uzm.", NORM: "Hesap Uzmanı"},
{ORTH: "huk.", NORM: "hukuk"},
{ORTH: "Hv.", NORM: "Hava"},
{ORTH: "Hv.K.K.lığı", NORM: "Hava Kuvvetleri Komutanlığı"},
{ORTH: "Hv.Kuv.", NORM: "Hava Kuvvetleri"},
{ORTH: "Hv.Kuv.K.", NORM: "Hava Kuvvetleri Komutanlığı"},
{ORTH: "Hz.", NORM: "Hazreti"},
{ORTH: "Hz.Öz.", NORM: "Hizmete Özel"},
{ORTH: "İng.", NORM: "İngilizce"},
{ORTH: "Jeol.", NORM: "Jeoloji"},
{ORTH: "jeol.", NORM: "jeoloji"},
{ORTH: "Korg.", NORM: "Korgeneral"},
{ORTH: "Kur.", NORM: "Kurmay"},
{ORTH: "Kur.Bşk.", NORM: "Kurmay Başkanı"},
{ORTH: "Kuv.", NORM: "Kuvvetleri"},
{ORTH: "Ltd.", NORM: "Limited"},
{ORTH: "Mah.", NORM: "Mahallesi"},
{ORTH: "mah.", NORM: "mahallesi"},
{ORTH: "max.", NORM: "maksimum"},
{ORTH: "min.", NORM: "minimum"},
{ORTH: "Müh.", NORM: "Mühendisliği"},
{ORTH: "müh.", NORM: "mühendisliği"},
{ORTH: "MÖ.", NORM: "Milattan Önce"},
{ORTH: "Onb.", NORM: "Onbaşı"},
{ORTH: "Ord.", NORM: "Ordinaryüs"},
{ORTH: "Org.", NORM: "Orgeneral"},
{ORTH: "Ped.", NORM: "Pedagoji"},
{ORTH: "Prof.", NORM: "Profesör"},
{ORTH: "Sb.", NORM: "Subay"},
{ORTH: "Sn.", NORM: "Sayın"},
{ORTH: "sn.", NORM: "saniye"},
{ORTH: "Sok.", NORM: "Sokak"},
{ORTH: "Şb.", NORM: "Şube"},
{ORTH: "Şti.", NORM: "Şirketi"},
{ORTH: "Tbp.", NORM: "Tabip"},
{ORTH: "T.C.", NORM: "Türkiye Cumhuriyeti"},
{ORTH: "Tel.", NORM: "Telefon"},
{ORTH: "tel.", NORM: "telefon"},
{ORTH: "telg.", NORM: "telgraf"},
{ORTH: "Tğm.", NORM: "Teğmen"},
{ORTH: "tğm.", NORM: "teğmen"},
{ORTH: "tic.", NORM: "ticaret"},
{ORTH: "Tug.", NORM: "Tugay"},
{ORTH: "Tuğg.", NORM: "Tuğgeneral"},
{ORTH: "Tümg.", NORM: "Tümgeneral"},
{ORTH: "Uzm.", NORM: "Uzman"},
{ORTH: "Üçvş.", NORM: "Üstçavuş"},
{ORTH: "Üni.", NORM: "Üniversitesi"},
{ORTH: "Ütğm.", NORM: "Üsteğmen"},
{ORTH: "vb.", NORM: "ve benzeri"},
{ORTH: "vs.", NORM: "vesaire"},
{ORTH: "Yard.", NORM: "Yardımcı"},
{ORTH: "Yar.", NORM: "Yardımcı"},
{ORTH: "Yd.Sb.", NORM: "Yedek Subay"},
{ORTH: "Yard.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Yar.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Yb.", NORM: "Yarbay"},
{ORTH: "Yrd.", NORM: "Yardımcı"},
{ORTH: "Yrd.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Y.Müh.", NORM: "Yüksek mühendis"},
{ORTH: "Y.Mim.", NORM: "Yüksek mimar"}]:
    _exc[exc_data[ORTH]] = [exc_data]


for orth in [
        "Dr.", "yy."]:
    _exc[orth] = [{ORTH: orth}]
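A rough behavioural check of the exceptions above, assuming a blank Turkish pipeline picks them up: abbreviations such as "Dr." and "A.B.D." should survive tokenization as single tokens instead of being split at the period.

import spacy

nlp = spacy.blank('tr')
doc = nlp('Dr. Ayşe A.B.D. merkezli bir şirkette çalışıyor.')
print([token.text for token in doc])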

@ -319,7 +319,7 @@ cdef class ArcEager(TransitionSystem):
            (SHIFT, ['']),
            (REDUCE, ['']),
            (RIGHT, []),
            (LEFT, ['subtok']),
            (BREAK, ['ROOT']))
        ))
seen_actions = set() seen_actions = set()

@ -477,14 +477,15 @@ cdef class Parser:
        free(vectors)
        free(scores)

    def beam_parse(self, docs, int beam_width=3, float beam_density=0.001,
                   float drop=0.):
        cdef Beam beam
        cdef np.ndarray scores
        cdef Doc doc
        cdef int nr_class = self.moves.n_moves
        cuda_stream = util.get_cuda_stream()
        (tokvecs, bp_tokvecs), state2vec, vec2scores = self.get_batch_model(
            docs, cuda_stream, drop)
        cdef int offset = 0
        cdef int j = 0
        cdef int k
@ -523,8 +524,8 @@ cdef class Parser:
                    n_states += 1
            if n_states == 0:
                break
            vectors, _ = state2vec.begin_update(token_ids[:n_states], drop)
            scores, _ = vec2scores.begin_update(vectors, drop=drop)
            c_scores = <float*>scores.data
            for beam in todo:
                for i in range(beam.size):
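The beam_parse change just threads a dropout rate through to the step model. A hedged sketch of calling it (beam_parse is an internal Parser method; the model name, texts and dropout value here are only placeholders):

import spacy

nlp = spacy.load('en_core_web_sm')   # assumption: any pipeline with a parser
parser = nlp.get_pipe('parser')
docs = list(nlp.pipe(['This is a test.', 'Beam search is slower but broader.']))
beams = parser.beam_parse(docs, beam_width=8, beam_density=0.001, drop=0.0)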

@ -191,9 +191,12 @@ def _filter_labels(gold_tuples, cutoff, freqs):
    for raw_text, sents in gold_tuples:
        filtered_sents = []
        for (ids, words, tags, heads, labels, iob), ctnts in sents:
            filtered_labels = []
            for label in labels:
                if is_decorated(label) and freqs.get(label, 0) < cutoff:
                    filtered_labels.append(decompose(label)[0])
                else:
                    filtered_labels.append(label)
            filtered_sents.append(
                ((ids, words, tags, heads, filtered_labels, iob), ctnts))
        filtered.append((raw_text, filtered_sents))
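Spelled out, the new rule only collapses decorated (pseudo-projective) labels whose frequency falls below the cutoff, instead of collapsing every rare label. A standalone sketch of the same rule, reusing the nonproj helpers (assumed importable here; the function name is made up for illustration):

from spacy.syntax.nonproj import decompose, is_decorated

def filter_rare_decorated(labels, freqs, cutoff):
    filtered = []
    for label in labels:
        if is_decorated(label) and freqs.get(label, 0) < cutoff:
            filtered.append(decompose(label)[0])   # e.g. 'dobj||conj' -> 'dobj'
        else:
            filtered.append(label)
    return filtered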

@ -0,0 +1,74 @@
from ...vocab import Vocab
from ...pipeline import DependencyParser
from ...tokens import Doc
from ...gold import GoldParse
from ...syntax.nonproj import projectivize
annot_tuples = [
(0, 'When', 'WRB', 11, 'advmod', 'O'),
(1, 'Walter', 'NNP', 2, 'compound', 'B-PERSON'),
(2, 'Rodgers', 'NNP', 11, 'nsubj', 'L-PERSON'),
(3, ',', ',', 2, 'punct', 'O'),
(4, 'our', 'PRP$', 6, 'poss', 'O'),
(5, 'embedded', 'VBN', 6, 'amod', 'O'),
(6, 'reporter', 'NN', 2, 'appos', 'O'),
(7, 'with', 'IN', 6, 'prep', 'O'),
(8, 'the', 'DT', 10, 'det', 'B-ORG'),
(9, '3rd', 'NNP', 10, 'compound', 'I-ORG'),
(10, 'Cavalry', 'NNP', 7, 'pobj', 'L-ORG'),
(11, 'says', 'VBZ', 44, 'advcl', 'O'),
(12, 'three', 'CD', 13, 'nummod', 'U-CARDINAL'),
(13, 'battalions', 'NNS', 16, 'nsubj', 'O'),
(14, 'of', 'IN', 13, 'prep', 'O'),
(15, 'troops', 'NNS', 14, 'pobj', 'O'),
(16, 'are', 'VBP', 11, 'ccomp', 'O'),
(17, 'on', 'IN', 16, 'prep', 'O'),
(18, 'the', 'DT', 19, 'det', 'O'),
(19, 'ground', 'NN', 17, 'pobj', 'O'),
(20, ',', ',', 17, 'punct', 'O'),
(21, 'inside', 'IN', 17, 'prep', 'O'),
(22, 'Baghdad', 'NNP', 21, 'pobj', 'U-GPE'),
(23, 'itself', 'PRP', 22, 'appos', 'O'),
(24, ',', ',', 16, 'punct', 'O'),
(25, 'have', 'VBP', 26, 'aux', 'O'),
(26, 'taken', 'VBN', 16, 'dep', 'O'),
(27, 'up', 'RP', 26, 'prt', 'O'),
(28, 'positions', 'NNS', 26, 'dobj', 'O'),
(29, 'they', 'PRP', 31, 'nsubj', 'O'),
(30, "'re", 'VBP', 31, 'aux', 'O'),
(31, 'going', 'VBG', 26, 'parataxis', 'O'),
(32, 'to', 'TO', 33, 'aux', 'O'),
(33, 'spend', 'VB', 31, 'xcomp', 'O'),
(34, 'the', 'DT', 35, 'det', 'B-TIME'),
(35, 'night', 'NN', 33, 'dobj', 'L-TIME'),
(36, 'there', 'RB', 33, 'advmod', 'O'),
(37, 'presumably', 'RB', 33, 'advmod', 'O'),
(38, ',', ',', 44, 'punct', 'O'),
(39, 'how', 'WRB', 40, 'advmod', 'O'),
(40, 'many', 'JJ', 41, 'amod', 'O'),
(41, 'soldiers', 'NNS', 44, 'pobj', 'O'),
(42, 'are', 'VBP', 44, 'aux', 'O'),
(43, 'we', 'PRP', 44, 'nsubj', 'O'),
(44, 'talking', 'VBG', 44, 'ROOT', 'O'),
(45, 'about', 'IN', 44, 'prep', 'O'),
(46, 'right', 'RB', 47, 'advmod', 'O'),
(47, 'now', 'RB', 44, 'advmod', 'O'),
(48, '?', '.', 44, 'punct', 'O')]
def test_get_oracle_actions():
doc = Doc(Vocab(), words=[t[1] for t in annot_tuples])
parser = DependencyParser(doc.vocab)
parser.moves.add_action(0, '')
parser.moves.add_action(1, '')
parser.moves.add_action(1, '')
parser.moves.add_action(4, 'ROOT')
for i, (id_, word, tag, head, dep, ent) in enumerate(annot_tuples):
if head > i:
parser.moves.add_action(2, dep)
elif head < i:
parser.moves.add_action(3, dep)
ids, words, tags, heads, deps, ents = zip(*annot_tuples)
heads, deps = projectivize(heads, deps)
gold = GoldParse(doc, words=words, tags=tags, heads=heads, deps=deps)
parser.moves.preprocess_gold(gold)
actions = parser.moves.get_oracle_sequence(doc, gold)

@ -294,6 +294,7 @@ cdef class Span:
        cdef int i
        if self.doc.is_parsed:
            root = &self.doc.c[self.start]
            n = 0
            while root.head != 0:
                root += root.head
                n += 1
@ -307,8 +308,10 @@ cdef class Span:
            start += -1
        # find end of the sentence
        end = self.end
        n = 0
        while end < self.doc.length and self.doc.c[end].sent_start != 1:
            end += 1
            n += 1
            if n >= self.doc.length:
                break
        #

@ -279,8 +279,8 @@ cdef class Token:
        """
        def __get__(self):
            if self.c.lemma == 0:
                lemma_ = self.vocab.morphology.lemmatizer.lookup(self.orth_)
                return self.vocab.strings[lemma_]
            else:
                return self.c.lemma
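The fix above keeps the lemma hash consistent with the lemma string, i.e. the two accessors round-trip through the shared StringStore. A small hedged check (blank English is just an assumed stand-in for any pipeline):

import spacy

nlp = spacy.blank('en')
doc = nlp('cats')
token = doc[0]
assert token.lemma == doc.vocab.strings[token.lemma_]  # lemma is the hash of lemma_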

@ -451,7 +451,7 @@ def itershuffle(iterable, bufsize=1000):
    try:
        while True:
            for i in range(random.randint(1, bufsize-len(buf))):
                buf.append(next(iterable))
            random.shuffle(buf)
            for i in range(random.randint(1, bufsize)):
                if buf:
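itershuffle buffers items from a stream and yields them in pseudo-random order; the change simply swaps Python 2's iterable.next() for the built-in next() so it also runs on Python 3. A small usage sketch (pulling only a few items; the values are arbitrary):

from spacy.util import itershuffle

stream = iter(range(100))
gen = itershuffle(stream, bufsize=10)
print([next(gen) for _ in range(5)])  # a handful of items, in shuffled order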

@ -15,11 +15,8 @@ from .compat import basestring_, path2str
from . import util


def unpickle_vectors(bytes_data):
    return Vectors().from_bytes(bytes_data)


cdef class Vectors:
@ -86,8 +83,7 @@ cdef class Vectors:
        return len(self.key2row)

    def __reduce__(self):
        return (unpickle_vectors, (self.to_bytes(),))

    def __getitem__(self, key):
        """Get a vector by key. If the key is not found, a KeyError is raised.

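With pickling now routed through to_bytes()/from_bytes(), a pickle round trip should reproduce the vectors table. A hedged sketch using an assumed pipeline's (possibly empty) vectors rather than hand-built data:

import pickle
import spacy

nlp = spacy.blank('en')              # assumption: any vocab will do
vectors = nlp.vocab.vectors
restored = pickle.loads(pickle.dumps(vectors))
print(len(restored) == len(vectors), restored.shape == vectors.shape)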
Binary file not shown
@ -76,11 +76,13 @@
    },
    "MODEL_LICENSES": {
        "CC BY 4.0": "https://creativecommons.org/licenses/by/4.0/",
        "CC BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
        "CC BY-SA 3.0": "https://creativecommons.org/licenses/by-sa/3.0/",
        "CC BY-SA 4.0": "https://creativecommons.org/licenses/by-sa/4.0/",
        "CC BY-NC": "https://creativecommons.org/licenses/by-nc/3.0/",
        "CC BY-NC 3.0": "https://creativecommons.org/licenses/by-nc/3.0/",
        "CC-BY-NC-SA 3.0": "https://creativecommons.org/licenses/by-nc-sa/3.0/",
        "GPL": "https://www.gnu.org/licenses/gpl.html",
        "LGPL": "https://www.gnu.org/licenses/lgpl.html"
    },

@ -68,7 +68,7 @@ p
    +item #[strong spaCy is not research software].
        | It's built on the latest research, but it's designed to get
        | things done. This leads to fairly different design decisions than
        | #[+a("https://github.com/nltk/nltk") NLTK]
        | or #[+a("https://stanfordnlp.github.io/CoreNLP/") CoreNLP], which were
        | created as platforms for teaching and research. The main difference
        | is that spaCy is integrated and opinionated. spaCy tries to avoid asking