Merge master

Matthew Honnibal 2018-03-14 19:03:24 +01:00
commit ab3d860686
32 changed files with 1910 additions and 199 deletions

11
.buildkite/train.yml Normal file

@ -0,0 +1,11 @@
steps:
-
command: "fab env clean make test wheel"
label: ":dizzy: :python:"
artifact_paths: "dist/*.whl"
- wait
- trigger: "spacy-train-from-wheel"
label: ":dizzy: :train:"
build:
env:
SPACY_VERSION: "{$SPACY_VERSION}"

106
.github/contributors/alldefector.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Feng Niu |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | Feb 21, 2018 |
| GitHub username | alldefector |
| Website (optional) | |

106
.github/contributors/willismonroe.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Willis Monroe |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-3-5 |
| GitHub username | willismonroe |
| Website (optional) | |


@ -182,7 +182,7 @@ If you've made a contribution to spaCy, you should fill in the
[spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that
your contribution can be used across the project. If you agree to be bound by
the terms of the agreement, fill in the [template](.github/CONTRIBUTOR_AGREEMENT.md)
and include it with your pull request, or sumit it separately to
and include it with your pull request, or submit it separately to
[`.github/contributors/`](/.github/contributors). The name of the file should be
your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.


@ -28,8 +28,10 @@ import cytoolz
import conll17_ud_eval
import spacy.lang.zh
import spacy.lang.ja
spacy.lang.zh.Chinese.Defaults.use_jieba = False
spacy.lang.ja.Japanese.Defaults.use_janome = False
random.seed(0)
numpy.random.seed(0)
@ -280,6 +282,30 @@ def print_progress(itn, losses, ud_scores):
))
print(tpl.format(itn, **fields))
#def get_sent_conllu(sent, sent_id):
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
def get_token_conllu(token, i):
if token._.begins_fused:
n = 1
while token.nbor(n)._.inside_fused:
n += 1
id_ = '%d-%d' % (i, i+n)
lines = [id_, token.text, '_', '_', '_', '_', '_', '_', '_', '_']
else:
lines = []
if token.head.i == token.i:
head = 0
else:
head = i + (token.head.i - token.i) + 1
fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
str(head), token.dep_.lower(), '_', '_']
lines.append('\t'.join(fields))
return '\n'.join(lines)
Token.set_extension('get_conllu_lines', method=get_token_conllu)
Token.set_extension('begins_fused', default=False)
Token.set_extension('inside_fused', default=False)
##################
# Initialization #

83
fabfile.py vendored

@ -1,49 +1,92 @@
# coding: utf-8
from __future__ import unicode_literals, print_function
import contextlib
from pathlib import Path
from fabric.api import local, lcd, env, settings, prefix
from fabtools.python import virtualenv
from os import path, environ
import shutil
PWD = path.dirname(__file__)
ENV = environ['VENV_DIR'] if 'VENV_DIR' in environ else '.env'
VENV_DIR = path.join(PWD, ENV)
VENV_DIR = Path(PWD) / ENV
def env(lang='python2.7'):
if path.exists(VENV_DIR):
@contextlib.contextmanager
def virtualenv(name, create=False, python='/usr/bin/python3.6'):
python = Path(python).resolve()
env_path = VENV_DIR
if create:
if env_path.exists():
shutil.rmtree(str(env_path))
local('{python} -m venv {env_path}'.format(python=python, env_path=VENV_DIR))
def wrapped_local(cmd, env_vars=[], capture=False, direct=False):
return local('source {}/bin/activate && {}'.format(env_path, cmd),
shell='/bin/bash', capture=False)
yield wrapped_local
def env(lang='python3.6'):
if VENV_DIR.exists():
local('rm -rf {env}'.format(env=VENV_DIR))
local('pip install virtualenv')
local('python -m virtualenv -p {lang} {env}'.format(lang=lang, env=VENV_DIR))
if lang.startswith('python3'):
local('{lang} -m venv {env}'.format(lang=lang, env=VENV_DIR))
else:
local('{lang} -m pip install virtualenv --no-cache-dir'.format(lang=lang))
local('{lang} -m virtualenv {env} --no-cache-dir'.format(lang=lang, env=VENV_DIR))
with virtualenv(VENV_DIR) as venv_local:
print(venv_local('python --version', capture=True))
venv_local('pip install --upgrade setuptools --no-cache-dir')
venv_local('pip install pytest --no-cache-dir')
venv_local('pip install wheel --no-cache-dir')
venv_local('pip install -r requirements.txt --no-cache-dir')
venv_local('pip install pex --no-cache-dir')
def install():
with virtualenv(VENV_DIR):
local('pip install --upgrade setuptools')
local('pip install dist/*.tar.gz')
local('pip install pytest')
with virtualenv(VENV_DIR) as venv_local:
venv_local('pip install dist/*.tar.gz')
def make():
with virtualenv(VENV_DIR):
with lcd(path.dirname(__file__)):
local('pip install cython')
local('pip install murmurhash')
local('pip install -r requirements.txt')
local('python setup.py build_ext --inplace')
with lcd(path.dirname(__file__)):
local('export PYTHONPATH=`pwd` && source .env/bin/activate && python setup.py build_ext --inplace',
shell='/bin/bash')
def sdist():
with virtualenv(VENV_DIR):
with virtualenv(VENV_DIR) as venv_local:
with lcd(path.dirname(__file__)):
local('python setup.py sdist')
def wheel():
with virtualenv(VENV_DIR) as venv_local:
with lcd(path.dirname(__file__)):
venv_local('python setup.py bdist_wheel')
def pex():
with virtualenv(VENV_DIR) as venv_local:
with lcd(path.dirname(__file__)):
sha = local('git rev-parse --short HEAD', capture=True)
venv_local('pex dist/*.whl -e spacy -o dist/spacy-%s.pex' % sha,
direct=True)
def clean():
with lcd(path.dirname(__file__)):
local('python setup.py clean --all')
local('rm -f dist/*.whl')
local('rm -f dist/*.pex')
with virtualenv(VENV_DIR) as venv_local:
venv_local('python setup.py clean --all')
def test():
with virtualenv(VENV_DIR):
with virtualenv(VENV_DIR) as venv_local:
with lcd(path.dirname(__file__)):
local('py.test -x spacy/tests')
venv_local('pytest -x spacy/tests')
def train():
args = environ.get('SPACY_TRAIN_ARGS', '')
with virtualenv(VENV_DIR) as venv_local:
venv_local('spacy train {args}'.format(args=args))


@ -8,6 +8,7 @@ if __name__ == '__main__':
import sys
from spacy.cli import download, link, info, package, train, convert
from spacy.cli import vocab, init_model, profile, evaluate, validate
from spacy.cli import ud_train, ud_evaluate
from spacy.util import prints
commands = {
@ -15,7 +16,9 @@ if __name__ == '__main__':
'link': link,
'info': info,
'train': train,
'ud-train': ud_train,
'evaluate': evaluate,
'ud-evaluate': ud_evaluate,
'convert': convert,
'package': package,
'vocab': vocab,


@ -3,7 +3,7 @@
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
__title__ = 'spacy'
__version__ = '2.1.0.dev1'
__version__ = '2.1.0.dev3'
__summary__ = 'Industrial-strength Natural Language Processing (NLP) with Python and Cython'
__uri__ = 'https://spacy.io'
__author__ = 'Explosion AI'


@ -9,3 +9,5 @@ from .convert import convert
from .vocab import make_vocab as vocab
from .init_model import init_model
from .validate import validate
from .ud_train import main as ud_train
from .conll17_ud_eval import main as ud_evaluate


@ -0,0 +1,570 @@
#!/usr/bin/env python
# CoNLL 2017 UD Parsing evaluation script.
#
# Compatible with Python 2.7 and 3.2+, can be used either as a module
# or a standalone executable.
#
# Copyright 2017 Institute of Formal and Applied Linguistics (UFAL),
# Faculty of Mathematics and Physics, Charles University, Czech Republic.
#
# Changelog:
# - [02 Jan 2017] Version 0.9: Initial release
# - [25 Jan 2017] Version 0.9.1: Fix bug in LCS alignment computation
# - [10 Mar 2017] Version 1.0: Add documentation and test
# Compare HEADs correctly using aligned words
# Allow evaluation with erroneous spaces in forms
# Compare forms in LCS case insensitively
# Detect cycles and multiple root nodes
# Compute AlignedAccuracy
# Command line usage
# ------------------
# conll17_ud_eval.py [-v] [-w weights_file] gold_conllu_file system_conllu_file
#
# - if no -v is given, only the CoNLL17 UD Shared Task evaluation LAS metric
# is printed
# - if -v is given, several metrics are printed (as precision, recall, F1 score,
# and in case the metric is computed on aligned words also accuracy on these):
# - Tokens: how well do the gold tokens match system tokens
# - Sentences: how well do the gold sentences match system sentences
# - Words: how well can the gold words be aligned to system words
# - UPOS: using aligned words, how well does UPOS match
# - XPOS: using aligned words, how well does XPOS match
# - Feats: using aligned words, how well does FEATS match
# - AllTags: using aligned words, how well does UPOS+XPOS+FEATS match
# - Lemmas: using aligned words, how well does LEMMA match
# - UAS: using aligned words, how well does HEAD match
# - LAS: using aligned words, how well does HEAD+DEPREL(ignoring subtypes) match
# - if weights_file is given (with lines containing deprel-weight pairs),
# one more metric is shown:
# - WeightedLAS: as LAS, but each deprel (ignoring subtypes) has different weight
# API usage
# ---------
# - load_conllu(file)
# - loads CoNLL-U file from given file object to an internal representation
# - the file object should return str on both Python 2 and Python 3
# - raises UDError exception if the given file cannot be loaded
# - evaluate(gold_ud, system_ud)
# - evaluate the given gold and system CoNLL-U files (loaded with load_conllu)
# - raises UDError if the concatenated tokens of gold and system file do not match
# - returns a dictionary with the metrics described above, each metric having
# three fields: precision, recall and f1
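# Example under Python 3 (an illustrative sketch, not part of the original
# script; the file names "gold.conllu" and "system.conllu" are placeholders):
#
#   import conll17_ud_eval
#   with open("gold.conllu", encoding="utf-8") as gold_file, \
#           open("system.conllu", encoding="utf-8") as system_file:
#       gold_ud = conll17_ud_eval.load_conllu(gold_file)
#       system_ud = conll17_ud_eval.load_conllu(system_file)
#   scores = conll17_ud_eval.evaluate(gold_ud, system_ud)
#   print("LAS F1 = {:.2f}".format(100 * scores["LAS"].f1))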
# Description of token matching
# -----------------------------
# In order to match tokens of gold file and system file, we consider the text
# resulting from concatenation of gold tokens and text resulting from
# concatenation of system tokens. These texts should match -- if they do not,
# the evaluation fails.
#
# If the texts do match, every token is represented as a range in this original
# text, and tokens are equal only if their range is the same.
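# For example (an illustrative case, not from the original text): if the gold
# tokens are "New", "York" and the system produces the single token "NewYork",
# both concatenated texts read "NewYork"; the gold tokens cover the ranges
# [0, 3) and [3, 7), the system token covers [0, 7), and no token is counted
# as matching because no ranges are equal.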
# Description of word matching
# ----------------------------
# When matching words of gold file and system file, we first match the tokens.
# The words which are also tokens are matched as tokens, but words in multi-word
# tokens have to be handled differently.
#
# To handle multi-word tokens, we start by finding "multi-word spans".
# Multi-word span is a span in the original text such that
# - it contains at least one multi-word token
# - all multi-word tokens in the span (considering both gold and system ones)
# are completely inside the span (i.e., they do not "stick out")
# - the multi-word span is as small as possible
#
# For every multi-word span, we align the gold and system words completely
# inside this span using LCS on their FORMs. The words not intersecting
# (even partially) any multi-word span are then aligned as tokens.
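# For example (an illustrative case, not from the original text): if the gold
# file analyses the token "cannot" as a multi-word token with the words "can"
# and "not", while the system file simply emits the two tokens "can" and
# "not", the multi-word span covers "cannot"; the LCS over the lower-cased
# FORMs then aligns gold "can"/"not" with system "can"/"not", so both words
# are aligned even though the token boundaries differ.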
from __future__ import division
from __future__ import print_function
import argparse
import io
import sys
import unittest
# CoNLL-U column names
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)
# UD Error is used when raising exceptions in this module
class UDError(Exception):
pass
# Load given CoNLL-U file into internal representation
def load_conllu(file):
# Internal representation classes
class UDRepresentation:
def __init__(self):
# Characters of all the tokens in the whole file.
# Whitespace between tokens is not included.
self.characters = []
# List of UDSpan instances with start&end indices into `characters`.
self.tokens = []
# List of UDWord instances.
self.words = []
# List of UDSpan instances with start&end indices into `characters`.
self.sentences = []
class UDSpan:
def __init__(self, start, end, characters):
self.start = start
# Note that self.end marks the first position **after the end** of span,
# so we can use characters[start:end] or range(start, end).
self.end = end
self.characters = characters
@property
def text(self):
return ''.join(self.characters[self.start:self.end])
def __str__(self):
return self.text
def __repr__(self):
return self.text
class UDWord:
def __init__(self, span, columns, is_multiword):
# Span of this word (or MWT, see below) within ud_representation.characters.
self.span = span
# 10 columns of the CoNLL-U file: ID, FORM, LEMMA,...
self.columns = columns
# is_multiword==True means that this word is part of a multi-word token.
# In that case, self.span marks the span of the whole multi-word token.
self.is_multiword = is_multiword
# Reference to the UDWord instance representing the HEAD (or None if root).
self.parent = None
# Let's ignore language-specific deprel subtypes.
self.columns[DEPREL] = columns[DEPREL].split(':')[0]
ud = UDRepresentation()
# Load the CoNLL-U file
index, sentence_start = 0, None
linenum = 0
while True:
line = file.readline()
linenum += 1
if not line:
break
line = line.rstrip("\r\n")
# Handle sentence start boundaries
if sentence_start is None:
# Skip comments
if line.startswith("#"):
continue
# Start a new sentence
ud.sentences.append(UDSpan(index, 0, ud.characters))
sentence_start = len(ud.words)
if not line:
# Add parent UDWord links and check there are no cycles
def process_word(word):
if word.parent == "remapping":
raise UDError("There is a cycle in a sentence")
if word.parent is None:
head = int(word.columns[HEAD])
if head > len(ud.words) - sentence_start:
raise UDError("HEAD '{}' points outside of the sentence".format(word.columns[HEAD]))
if head:
parent = ud.words[sentence_start + head - 1]
word.parent = "remapping"
process_word(parent)
word.parent = parent
for word in ud.words[sentence_start:]:
process_word(word)
# Check there is a single root node
if len([word for word in ud.words[sentence_start:] if word.parent is None]) != 1:
raise UDError("There are multiple roots in a sentence")
# End the sentence
ud.sentences[-1].end = index
sentence_start = None
continue
# Read next token/word
columns = line.split("\t")
if len(columns) != 10:
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, line))
# Skip empty nodes
if "." in columns[ID]:
continue
# Delete spaces from FORM so gold.characters == system.characters
# even if one of them tokenizes the space.
columns[FORM] = columns[FORM].replace(" ", "")
if not columns[FORM]:
raise UDError("There is an empty FORM in the CoNLL-U file -- line %d" % linenum)
# Save token
ud.characters.extend(columns[FORM])
ud.tokens.append(UDSpan(index, index + len(columns[FORM]), ud.characters))
index += len(columns[FORM])
# Handle multi-word tokens to save word(s)
if "-" in columns[ID]:
try:
start, end = map(int, columns[ID].split("-"))
except:
raise UDError("Cannot parse multi-word token ID '{}'".format(columns[ID]))
for _ in range(start, end + 1):
word_line = file.readline().rstrip("\r\n")
word_columns = word_line.split("\t")
if len(word_columns) != 10:
print(columns)
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, word_line))
ud.words.append(UDWord(ud.tokens[-1], word_columns, is_multiword=True))
# Basic tokens/words
else:
try:
word_id = int(columns[ID])
except:
raise UDError("Cannot parse word ID '{}'".format(columns[ID]))
if word_id != len(ud.words) - sentence_start + 1:
raise UDError("Incorrect word ID '{}' for word '{}', expected '{}'".format(columns[ID], columns[FORM], len(ud.words) - sentence_start + 1))
try:
head_id = int(columns[HEAD])
except:
raise UDError("Cannot parse HEAD '{}'".format(columns[HEAD]))
if head_id < 0:
raise UDError("HEAD cannot be negative")
ud.words.append(UDWord(ud.tokens[-1], columns, is_multiword=False))
if sentence_start is not None:
raise UDError("The CoNLL-U file does not end with empty line")
return ud
# Evaluate the gold and system treebanks (loaded using load_conllu).
def evaluate(gold_ud, system_ud, deprel_weights=None):
class Score:
def __init__(self, gold_total, system_total, correct, aligned_total=None):
self.precision = correct / system_total if system_total else 0.0
self.recall = correct / gold_total if gold_total else 0.0
self.f1 = 2 * correct / (system_total + gold_total) if system_total + gold_total else 0.0
self.aligned_accuracy = correct / aligned_total if aligned_total else aligned_total
class AlignmentWord:
def __init__(self, gold_word, system_word):
self.gold_word = gold_word
self.system_word = system_word
self.gold_parent = None
self.system_parent_gold_aligned = None
class Alignment:
def __init__(self, gold_words, system_words):
self.gold_words = gold_words
self.system_words = system_words
self.matched_words = []
self.matched_words_map = {}
def append_aligned_words(self, gold_word, system_word):
self.matched_words.append(AlignmentWord(gold_word, system_word))
self.matched_words_map[system_word] = gold_word
def fill_parents(self):
# We represent root parents in both gold and system data by '0'.
# For gold data, we represent a non-root parent by the corresponding gold word.
# For system data, we represent a non-root parent by the gold word aligned
# to the parent system node, or by None if no gold word is aligned to the parent.
for words in self.matched_words:
words.gold_parent = words.gold_word.parent if words.gold_word.parent is not None else 0
words.system_parent_gold_aligned = self.matched_words_map.get(words.system_word.parent, None) \
if words.system_word.parent is not None else 0
def lower(text):
if sys.version_info < (3, 0) and isinstance(text, str):
return text.decode("utf-8").lower()
return text.lower()
def spans_score(gold_spans, system_spans):
correct, gi, si = 0, 0, 0
while gi < len(gold_spans) and si < len(system_spans):
if system_spans[si].start < gold_spans[gi].start:
si += 1
elif gold_spans[gi].start < system_spans[si].start:
gi += 1
else:
correct += gold_spans[gi].end == system_spans[si].end
si += 1
gi += 1
return Score(len(gold_spans), len(system_spans), correct)
def alignment_score(alignment, key_fn, weight_fn=lambda w: 1):
gold, system, aligned, correct = 0, 0, 0, 0
for word in alignment.gold_words:
gold += weight_fn(word)
for word in alignment.system_words:
system += weight_fn(word)
for words in alignment.matched_words:
aligned += weight_fn(words.gold_word)
if key_fn is None:
# Return score for whole aligned words
return Score(gold, system, aligned)
for words in alignment.matched_words:
if key_fn(words.gold_word, words.gold_parent) == key_fn(words.system_word, words.system_parent_gold_aligned):
correct += weight_fn(words.gold_word)
return Score(gold, system, correct, aligned)
def beyond_end(words, i, multiword_span_end):
if i >= len(words):
return True
if words[i].is_multiword:
return words[i].span.start >= multiword_span_end
return words[i].span.end > multiword_span_end
def extend_end(word, multiword_span_end):
if word.is_multiword and word.span.end > multiword_span_end:
return word.span.end
return multiword_span_end
def find_multiword_span(gold_words, system_words, gi, si):
# We know gold_words[gi].is_multiword or system_words[si].is_multiword.
# Find the start of the multiword span (gs, ss), so the multiword span is minimal.
# Initialize multiword_span_end characters index.
if gold_words[gi].is_multiword:
multiword_span_end = gold_words[gi].span.end
if not system_words[si].is_multiword and system_words[si].span.start < gold_words[gi].span.start:
si += 1
else: # if system_words[si].is_multiword
multiword_span_end = system_words[si].span.end
if not gold_words[gi].is_multiword and gold_words[gi].span.start < system_words[si].span.start:
gi += 1
gs, ss = gi, si
# Find the end of the multiword span
# (so both gi and si are pointing to the word following the multiword span end).
while not beyond_end(gold_words, gi, multiword_span_end) or \
not beyond_end(system_words, si, multiword_span_end):
if gi < len(gold_words) and (si >= len(system_words) or
gold_words[gi].span.start <= system_words[si].span.start):
multiword_span_end = extend_end(gold_words[gi], multiword_span_end)
gi += 1
else:
multiword_span_end = extend_end(system_words[si], multiword_span_end)
si += 1
return gs, ss, gi, si
def compute_lcs(gold_words, system_words, gi, si, gs, ss):
lcs = [[0] * (si - ss) for i in range(gi - gs)]
for g in reversed(range(gi - gs)):
for s in reversed(range(si - ss)):
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
lcs[g][s] = 1 + (lcs[g+1][s+1] if g+1 < gi-gs and s+1 < si-ss else 0)
lcs[g][s] = max(lcs[g][s], lcs[g+1][s] if g+1 < gi-gs else 0)
lcs[g][s] = max(lcs[g][s], lcs[g][s+1] if s+1 < si-ss else 0)
return lcs
def align_words(gold_words, system_words):
alignment = Alignment(gold_words, system_words)
gi, si = 0, 0
while gi < len(gold_words) and si < len(system_words):
if gold_words[gi].is_multiword or system_words[si].is_multiword:
# A: Multi-word tokens => align via LCS within the whole "multiword span".
gs, ss, gi, si = find_multiword_span(gold_words, system_words, gi, si)
if si > ss and gi > gs:
lcs = compute_lcs(gold_words, system_words, gi, si, gs, ss)
# Store aligned words
s, g = 0, 0
while g < gi - gs and s < si - ss:
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
alignment.append_aligned_words(gold_words[gs+g], system_words[ss+s])
g += 1
s += 1
elif lcs[g][s] == (lcs[g+1][s] if g+1 < gi-gs else 0):
g += 1
else:
s += 1
else:
# B: No multi-word token => align according to spans.
if (gold_words[gi].span.start, gold_words[gi].span.end) == (system_words[si].span.start, system_words[si].span.end):
alignment.append_aligned_words(gold_words[gi], system_words[si])
gi += 1
si += 1
elif gold_words[gi].span.start <= system_words[si].span.start:
gi += 1
else:
si += 1
alignment.fill_parents()
return alignment
# Check that underlying character sequences do match
if gold_ud.characters != system_ud.characters:
index = 0
while gold_ud.characters[index] == system_ud.characters[index]:
index += 1
raise UDError(
"The concatenation of tokens in gold file and in system file differ!\n" +
"First 20 differing characters in gold file: '{}' and system file: '{}'".format(
"".join(gold_ud.characters[index:index + 20]),
"".join(system_ud.characters[index:index + 20])
)
)
# Align words
alignment = align_words(gold_ud.words, system_ud.words)
# Compute the F1-scores
result = {
"Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
"Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
"Words": alignment_score(alignment, None),
"UPOS": alignment_score(alignment, lambda w, parent: w.columns[UPOS]),
"XPOS": alignment_score(alignment, lambda w, parent: w.columns[XPOS]),
"Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
"AllTags": alignment_score(alignment, lambda w, parent: (w.columns[UPOS], w.columns[XPOS], w.columns[FEATS])),
"Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
"UAS": alignment_score(alignment, lambda w, parent: parent),
"LAS": alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL])),
}
# Add WeightedLAS if weights are given
if deprel_weights is not None:
def weighted_las(word):
return deprel_weights.get(word.columns[DEPREL], 1.0)
result["WeightedLAS"] = alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL]), weighted_las)
return result
def load_deprel_weights(weights_file):
if weights_file is None:
return None
deprel_weights = {}
for line in weights_file:
# Ignore comments and empty lines
if line.startswith("#") or not line.strip():
continue
columns = line.rstrip("\r\n").split()
if len(columns) != 2:
raise ValueError("Expected two columns in the UD Relations weights file on line '{}'".format(line))
deprel_weights[columns[0]] = float(columns[1])
return deprel_weights
def load_conllu_file(path):
_file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
return load_conllu(_file)
def evaluate_wrapper(args):
# Load CoNLL-U files
gold_ud = load_conllu_file(args.gold_file)
system_ud = load_conllu_file(args.system_file)
# Load weights if requested
deprel_weights = load_deprel_weights(args.weights)
return evaluate(gold_ud, system_ud, deprel_weights)
def main():
# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("gold_file", type=str,
help="Name of the CoNLL-U file with the gold data.")
parser.add_argument("system_file", type=str,
help="Name of the CoNLL-U file with the predicted data.")
parser.add_argument("--weights", "-w", type=argparse.FileType("r"), default=None,
metavar="deprel_weights_file",
help="Compute WeightedLAS using given weights for Universal Dependency Relations.")
parser.add_argument("--verbose", "-v", default=0, action="count",
help="Print all metrics.")
args = parser.parse_args()
# Use verbose if weights are supplied
if args.weights is not None and not args.verbose:
args.verbose = 1
# Evaluate
evaluation = evaluate_wrapper(args)
# Print the evaluation
if not args.verbose:
print("LAS F1 Score: {:.2f}".format(100 * evaluation["LAS"].f1))
else:
metrics = ["Tokens", "Sentences", "Words", "UPOS", "XPOS", "Feats", "AllTags", "Lemmas", "UAS", "LAS"]
if args.weights is not None:
metrics.append("WeightedLAS")
print("Metrics | Precision | Recall | F1 Score | AligndAcc")
print("-----------+-----------+-----------+-----------+-----------")
for metric in metrics:
print("{:11}|{:10.2f} |{:10.2f} |{:10.2f} |{}".format(
metric,
100 * evaluation[metric].precision,
100 * evaluation[metric].recall,
100 * evaluation[metric].f1,
"{:10.2f}".format(100 * evaluation[metric].aligned_accuracy) if evaluation[metric].aligned_accuracy is not None else ""
))
if __name__ == "__main__":
main()
# Tests, which can be executed with `python -m unittest conll17_ud_eval`.
class TestAlignment(unittest.TestCase):
@staticmethod
def _load_words(words):
"""Prepare fake CoNLL-U files with fake HEAD to prevent multiple roots errors."""
lines, num_words = [], 0
for w in words:
parts = w.split(" ")
if len(parts) == 1:
num_words += 1
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, parts[0], int(num_words>1)))
else:
lines.append("{}-{}\t{}\t_\t_\t_\t_\t_\t_\t_\t_".format(num_words + 1, num_words + len(parts) - 1, parts[0]))
for part in parts[1:]:
num_words += 1
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, part, int(num_words>1)))
return load_conllu((io.StringIO if sys.version_info >= (3, 0) else io.BytesIO)("\n".join(lines+["\n"])))
def _test_exception(self, gold, system):
self.assertRaises(UDError, evaluate, self._load_words(gold), self._load_words(system))
def _test_ok(self, gold, system, correct):
metrics = evaluate(self._load_words(gold), self._load_words(system))
gold_words = sum((max(1, len(word.split(" ")) - 1) for word in gold))
system_words = sum((max(1, len(word.split(" ")) - 1) for word in system))
self.assertEqual((metrics["Words"].precision, metrics["Words"].recall, metrics["Words"].f1),
(correct / system_words, correct / gold_words, 2 * correct / (gold_words + system_words)))
def test_exception(self):
self._test_exception(["a"], ["b"])
def test_equal(self):
self._test_ok(["a"], ["a"], 1)
self._test_ok(["a", "b", "c"], ["a", "b", "c"], 3)
def test_equal_with_multiword(self):
self._test_ok(["abc a b c"], ["a", "b", "c"], 3)
self._test_ok(["a", "bc b c", "d"], ["a", "b", "c", "d"], 4)
self._test_ok(["abcd a b c d"], ["ab a b", "cd c d"], 4)
self._test_ok(["abc a b c", "de d e"], ["a", "bcd b c d", "e"], 5)
def test_alignment(self):
self._test_ok(["abcd"], ["a", "b", "c", "d"], 0)
self._test_ok(["abc", "d"], ["a", "b", "c", "d"], 1)
self._test_ok(["a", "bc", "d"], ["a", "b", "c", "d"], 2)
self._test_ok(["a", "bc b c", "d"], ["a", "b", "cd"], 2)
self._test_ok(["abc a BX c", "def d EX f"], ["ab a b", "cd c d", "ef e f"], 4)
self._test_ok(["ab a b", "cd bc d"], ["a", "bc", "d"], 2)
self._test_ok(["a", "bc b c", "d"], ["ab AX BX", "cd CX a"], 1)


@ -116,10 +116,9 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %")
try:
train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
gold_preproc=gold_preproc, max_length=0)
train_docs = list(train_docs)
for i in range(n_iter):
train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
gold_preproc=gold_preproc, max_length=0)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
losses = {}
for batch in minibatch(train_docs, size=batch_sizes):

390
spacy/cli/ud_train.py Normal file

@ -0,0 +1,390 @@
'''Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
.conllu format for development data, allowing the official scorer to be used.
'''
from __future__ import unicode_literals
import plac
import tqdm
from pathlib import Path
import re
import sys
import json
import spacy
import spacy.util
from ..tokens import Token, Doc
from ..gold import GoldParse
from ..syntax.nonproj import projectivize
from ..matcher import Matcher
from collections import defaultdict, Counter
from timeit import default_timer as timer
import itertools
import random
import numpy.random
import cytoolz
from . import conll17_ud_eval
from .. import lang
from ..lang import zh
from ..lang import ja
lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False
random.seed(0)
numpy.random.seed(0)
def minibatch_by_words(items, size=5000):
random.shuffle(items)
if isinstance(size, int):
size_ = itertools.repeat(size)
else:
size_ = size
items = iter(items)
while True:
batch_size = next(size_)
batch = []
while batch_size >= 0:
try:
doc, gold = next(items)
except StopIteration:
if batch:
yield batch
return
batch_size -= len(doc)
batch.append((doc, gold))
if batch:
yield batch
else:
break
################
# Data reading #
################
space_re = re.compile('\s+')
def split_text(text):
return [space_re.sub(' ', par.strip()) for par in text.split('\n\n')]
def read_data(nlp, conllu_file, text_file, raw_text=True, oracle_segments=False,
max_doc_length=None, limit=None):
'''Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
include Doc objects created using nlp.make_doc and then aligned against
the gold-standard sequences. If oracle_segments=True, include Doc objects
created from the gold-standard segments. At least one must be True.'''
if not raw_text and not oracle_segments:
raise ValueError("At least one of raw_text or oracle_segments must be True")
paragraphs = split_text(text_file.read())
conllu = read_conllu(conllu_file)
# sd is spacy doc; cd is conllu doc
# cs is conllu sent, ct is conllu token
docs = []
golds = []
for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
sent_annots = []
for cs in cd:
sent = defaultdict(list)
for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
if '.' in id_:
continue
if '-' in id_:
continue
id_ = int(id_)-1
head = int(head)-1 if head != '0' else id_
sent['words'].append(word)
sent['tags'].append(tag)
sent['heads'].append(head)
sent['deps'].append('ROOT' if dep == 'root' else dep)
sent['spaces'].append(space_after == '_')
sent['entities'] = ['-'] * len(sent['words'])
sent['heads'], sent['deps'] = projectivize(sent['heads'],
sent['deps'])
if oracle_segments:
docs.append(Doc(nlp.vocab, words=sent['words'], spaces=sent['spaces']))
golds.append(GoldParse(docs[-1], **sent))
sent_annots.append(sent)
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
doc, gold = _make_gold(nlp, None, sent_annots)
sent_annots = []
docs.append(doc)
golds.append(gold)
if limit and len(docs) >= limit:
return docs, golds
if raw_text and sent_annots:
doc, gold = _make_gold(nlp, None, sent_annots)
docs.append(doc)
golds.append(gold)
if limit and len(docs) >= limit:
return docs, golds
return docs, golds
def read_conllu(file_):
docs = []
sent = []
doc = []
for line in file_:
if line.startswith('# newdoc'):
if doc:
docs.append(doc)
doc = []
elif line.startswith('#'):
continue
elif not line.strip():
if sent:
doc.append(sent)
sent = []
else:
sent.append(list(line.strip().split('\t')))
if len(sent[-1]) != 10:
print(repr(line))
raise ValueError
if sent:
doc.append(sent)
if doc:
docs.append(doc)
return docs
def _make_gold(nlp, text, sent_annots):
# Flatten the conll annotations, and adjust the head indices
flat = defaultdict(list)
for sent in sent_annots:
flat['heads'].extend(len(flat['words'])+head for head in sent['heads'])
for field in ['words', 'tags', 'deps', 'entities', 'spaces']:
flat[field].extend(sent[field])
# Construct text if necessary
assert len(flat['words']) == len(flat['spaces'])
if text is None:
text = ''.join(word+' '*space for word, space in zip(flat['words'], flat['spaces']))
doc = nlp.make_doc(text)
flat.pop('spaces')
gold = GoldParse(doc, **flat)
return doc, gold
#############################
# Data transforms for spaCy #
#############################
def golds_to_gold_tuples(docs, golds):
'''Get out the annoying 'tuples' format used by begin_training, given the
GoldParse objects.'''
tuples = []
for doc, gold in zip(docs, golds):
text = doc.text
ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
sents = [((ids, words, tags, heads, labels, iob), [])]
tuples.append((text, sents))
return tuples
##############
# Evaluation #
##############
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
with text_loc.open('r', encoding='utf8') as text_file:
texts = split_text(text_file.read())
docs = list(nlp.pipe(texts))
with sys_loc.open('w', encoding='utf8') as out_file:
write_conllu(docs, out_file)
with gold_loc.open('r', encoding='utf8') as gold_file:
gold_ud = conll17_ud_eval.load_conllu(gold_file)
with sys_loc.open('r', encoding='utf8') as sys_file:
sys_ud = conll17_ud_eval.load_conllu(sys_file)
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
return scores
def write_conllu(docs, file_):
merger = Matcher(docs[0].vocab)
merger.add('SUBTOK', None, [{'DEP': 'subtok', 'op': '+'}])
for i, doc in enumerate(docs):
matches = merger(doc)
spans = [doc[start:end+1] for _, start, end in matches]
offsets = [(span.start_char, span.end_char) for span in spans]
for start_char, end_char in offsets:
doc.merge(start_char, end_char)
file_.write("# newdoc id = {i}\n".format(i=i))
for j, sent in enumerate(doc.sents):
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
file_.write("# text = {text}\n".format(text=sent.text))
for k, token in enumerate(sent):
file_.write(token._.get_conllu_lines(k) + '\n')
file_.write('\n')
def print_progress(itn, losses, ud_scores):
fields = {
'dep_loss': losses.get('parser', 0.0),
'tag_loss': losses.get('tagger', 0.0),
'words': ud_scores['Words'].f1 * 100,
'sents': ud_scores['Sentences'].f1 * 100,
'tags': ud_scores['XPOS'].f1 * 100,
'uas': ud_scores['UAS'].f1 * 100,
'las': ud_scores['LAS'].f1 * 100,
}
header = ['Epoch', 'Loss', 'LAS', 'UAS', 'TAG', 'SENT', 'WORD']
if itn == 0:
print('\t'.join(header))
tpl = '\t'.join((
'{:d}',
'{dep_loss:.1f}',
'{las:.1f}',
'{uas:.1f}',
'{tags:.1f}',
'{sents:.1f}',
'{words:.1f}',
))
print(tpl.format(itn, **fields))
#def get_sent_conllu(sent, sent_id):
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
def get_token_conllu(token, i):
if token._.begins_fused:
n = 1
while token.nbor(n)._.inside_fused:
n += 1
id_ = '%d-%d' % (i, i+n)
lines = [id_, token.text, '_', '_', '_', '_', '_', '_', '_', '_']
else:
lines = []
if token.head.i == token.i:
head = 0
else:
head = i + (token.head.i - token.i) + 1
fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
str(head), token.dep_.lower(), '_', '_']
lines.append('\t'.join(fields))
return '\n'.join(lines)
Token.set_extension('get_conllu_lines', method=get_token_conllu)
Token.set_extension('begins_fused', default=False)
Token.set_extension('inside_fused', default=False)
##################
# Initialization #
##################
def load_nlp(corpus, config):
lang = corpus.split('_')[0]
nlp = spacy.blank(lang)
if config.vectors:
nlp.vocab.from_disk(config.vectors / 'vocab')
return nlp
def initialize_pipeline(nlp, docs, golds, config):
nlp.add_pipe(nlp.create_pipe('parser'))
if config.multitask_tag:
nlp.parser.add_multitask_objective('tag')
if config.multitask_sent:
nlp.parser.add_multitask_objective('sent_start')
nlp.parser.moves.add_action(2, 'subtok')
nlp.add_pipe(nlp.create_pipe('tagger'))
for gold in golds:
for tag in gold.tags:
if tag is not None:
nlp.tagger.add_label(tag)
# Replace labels that didn't make the frequency cutoff
actions = set(nlp.parser.labels)
label_set = set([act.split('-')[1] for act in actions if '-' in act])
for gold in golds:
for i, label in enumerate(gold.labels):
if label is not None and label not in label_set:
gold.labels[i] = label.split('||')[0]
return nlp.begin_training(lambda: golds_to_gold_tuples(docs, golds))
########################
# Command line helpers #
########################
class Config(object):
def __init__(self, vectors=None, max_doc_length=10, multitask_tag=True,
multitask_sent=True, nr_epoch=30, batch_size=1000, dropout=0.2):
for key, value in locals().items():
setattr(self, key, value)
@classmethod
def load(cls, loc):
with Path(loc).open('r', encoding='utf8') as file_:
cfg = json.load(file_)
return cls(**cfg)
class Dataset(object):
def __init__(self, path, section):
self.path = path
self.section = section
self.conllu = None
self.text = None
for file_path in self.path.iterdir():
name = file_path.parts[-1]
if section in name and name.endswith('conllu'):
self.conllu = file_path
elif section in name and name.endswith('txt'):
self.text = file_path
if self.conllu is None:
msg = "Could not find .txt file in {path} for {section}"
raise IOError(msg.format(section=section, path=path))
if self.text is None:
msg = "Could not find .txt file in {path} for {section}"
self.lang = self.conllu.parts[-1].split('-')[0].split('_')[0]
class TreebankPaths(object):
def __init__(self, ud_path, treebank, **cfg):
self.train = Dataset(ud_path / treebank, 'train')
self.dev = Dataset(ud_path / treebank, 'dev')
self.lang = self.train.lang
@plac.annotations(
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
corpus=("UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
"positional", None, str),
parses_dir=("Directory to write the development parses", "positional", None, Path),
config=("Path to json formatted config file", "positional"),
limit=("Size limit", "option", "n", int)
)
def main(ud_dir, parses_dir, config, corpus, limit=0):
config = Config.load(config)
paths = TreebankPaths(ud_dir, corpus)
if not (parses_dir / corpus).exists():
(parses_dir / corpus).mkdir()
print("Train and evaluate", corpus, "using lang", paths.lang)
nlp = load_nlp(paths.lang, config)
docs, golds = read_data(nlp, paths.train.conllu.open(), paths.train.text.open(),
max_doc_length=config.max_doc_length, limit=limit)
optimizer = initialize_pipeline(nlp, docs, golds, config)
for i in range(config.nr_epoch):
docs = [nlp.make_doc(doc.text) for doc in docs]
batches = minibatch_by_words(list(zip(docs, golds)), size=config.batch_size)
losses = {}
n_train_words = sum(len(doc) for doc in docs)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
for batch in batches:
batch_docs, batch_gold = zip(*batch)
pbar.update(sum(len(doc) for doc in batch_docs))
nlp.update(batch_docs, batch_gold, sgd=optimizer,
drop=config.dropout, losses=losses)
out_path = parses_dir / corpus / 'epoch-{i}.conllu'.format(i=i)
with nlp.use_params(optimizer.averages):
scores = evaluate(nlp, paths.dev.text, paths.dev.conllu, out_path)
print_progress(i, losses, scores)
if __name__ == '__main__':
plac.call(main)
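# Example invocation (an illustrative sketch, not part of the original file;
# the paths and the treebank name "en_ewt" are placeholders, and it assumes
# spaCy's __main__ forwards the remaining arguments to this plac-annotated
# main(), so the positionals follow the signature: ud_dir, parses_dir,
# config, corpus):
#
#   python -m spacy ud-train /data/ud-treebanks-v2.0 /tmp/parses config.json en_ewt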


@ -13,7 +13,7 @@ from . import _align
from .syntax import nonproj
from .tokens import Doc
from . import util
from .util import minibatch
from .util import minibatch, itershuffle
def tags_to_entities(tags):
@ -133,15 +133,14 @@ class GoldCorpus(object):
def train_docs(self, nlp, gold_preproc=False,
projectivize=False, max_length=None,
noise_level=0.0):
train_tuples = self.train_tuples
if projectivize:
train_tuples = nonproj.preprocess_training_data(
self.train_tuples, label_freq_cutoff=100)
random.shuffle(train_tuples)
self.train_tuples, label_freq_cutoff=30)
random.shuffle(self.train_locs)
gold_docs = self.iter_gold_docs(nlp, train_tuples, gold_preproc,
max_length=max_length,
noise_level=noise_level)
yield from gold_docs
yield from itershuffle(gold_docs, bufsize=100)
def dev_docs(self, nlp, gold_preproc=False):
gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc)


@ -21,7 +21,7 @@ class SpanishDefaults(Language.Defaults):
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS
sytax_iterators = SYNTAX_ITERATORS
syntax_iterators = SYNTAX_ITERATORS
lemma_lookup = LOOKUP


@ -6,17 +6,19 @@ from ...symbols import NOUN, PROPN, PRON, VERB, AUX
def noun_chunks(obj):
doc = obj.doc
np_label = doc.vocab.strings['NP']
if not len(doc):
return
np_label = doc.vocab.strings.add('NP')
left_labels = ['det', 'fixed', 'neg'] #['nunmod', 'det', 'appos', 'fixed']
right_labels = ['flat', 'fixed', 'compound', 'neg']
stop_labels = ['punct']
np_left_deps = [doc.vocab.strings[label] for label in left_labels]
np_right_deps = [doc.vocab.strings[label] for label in right_labels]
stop_deps = [doc.vocab.strings[label] for label in stop_labels]
np_left_deps = [doc.vocab.strings.add(label) for label in left_labels]
np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
token = doc[0]
while token and token.i < len(doc):
if token.pos in [PROPN, NOUN, PRON]:
left, right = noun_bounds(token)
left, right = noun_bounds(doc, token, np_left_deps, np_right_deps, stop_deps)
yield left.i, right.i+1, np_label
token = right
token = next_token(token)
@ -33,7 +35,7 @@ def next_token(token):
return None
def noun_bounds(root):
def noun_bounds(doc, root, np_left_deps, np_right_deps, stop_deps):
left_bound = root
for token in reversed(list(root.lefts)):
if token.dep in np_left_deps:
@ -41,7 +43,7 @@ def noun_bounds(root):
right_bound = root
for token in root.rights:
if (token.dep in np_right_deps):
left, right = noun_bounds(token)
left, right = noun_bounds(doc, token, np_left_deps, np_right_deps, stop_deps)
if list(filter(lambda t: is_verb_token(t) or t.dep in stop_deps,
doc[left_bound.i: right.i])):
break


@ -6,10 +6,25 @@ from __future__ import unicode_literals
STOP_WORDS = set("""
a
ah
aha
aj
ako
al
ali
arh
au
avaj
bar
baš
bez
bi
bih
bijah
bijahu
bijaše
bijasmo
bijaste
bila
bili
bilo
@ -17,25 +32,104 @@ bio
bismo
biste
biti
brr
buć
budavši
bude
budimo
budite
budu
budući
bum
bumo
će
ćemo
ćeš
ćete
čijem
čijim
čijima
ću
da
daj
dakle
de
deder
dem
djelomice
djelomično
do
doista
dok
dokle
donekle
dosad
doskoro
dotad
dotle
dovečer
drugamo
drugdje
duž
e
eh
ehe
ej
eno
eto
evo
ga
gdjekakav
gdjekoje
gic
god
halo
hej
hm
hoće
hoćemo
hoćete
hoćeš
hoćete
hoću
hop
htijahu
htijasmo
htijaste
htio
htjedoh
htjedoše
htjedoste
htjela
htjele
htjeli
hura
i
iako
ih
iju
ijuju
ikada
ikakav
ikakva
ikakve
ikakvi
ikakvih
ikakvim
ikakvima
ikakvo
ikakvog
ikakvoga
ikakvoj
ikakvom
ikakvome
ili
im
iz
ja
je
jedna
jedne
jedni
jedno
jer
jesam
@ -57,6 +151,7 @@ koji
kojima
koju
kroz
lani
li
me
mene
@ -66,6 +161,8 @@ mimo
moj
moja
moje
moji
moju
mu
na
nad
@ -77,24 +174,27 @@ naš
naša
naše
našeg
naši
ne
neće
nećemo
nećeš
nećete
neću
nego
neka
neke
neki
nekog
neku
nema
netko
neće
nećemo
nećete
nećeš
neću
nešto
netko
ni
nije
nikoga
nikoje
nikoji
nikoju
nisam
nisi
@ -123,33 +223,63 @@ od
odmah
on
ona
one
oni
ono
onu
onoj
onom
onim
onima
ova
ovaj
ovim
ovima
ovoj
pa
pak
pljus
po
pod
podalje
poimence
poizdalje
ponekad
pored
postrance
potajice
potrbuške
pouzdano
prije
s
sa
sam
samo
sasvim
sav
se
sebe
sebi
si
šic
smo
ste
što
šta
štogod
štagod
su
sva
sve
svi
svi
svog
svoj
svoja
svoje
svoju
svom
svu
ta
tada
taj
@ -158,6 +288,8 @@ te
tebe
tebi
ti
tim
tima
to
toj
tome
@ -165,23 +297,51 @@ tu
tvoj
tvoja
tvoje
tvoji
tvoju
u
usprkos
utaman
uvijek
uz
uza
uzagrapce
uzalud
uzduž
valjda
vam
vama
vas
vaš
vaša
vaše
vašim
vašima
već
vi
vjerojatno
vjerovatno
vrh
vrlo
za
zaista
zar
će
ćemo
ćete
ćeš
ću
što
zatim
zato
zbija
zbog
želeći
željah
željela
željele
željeli
željelo
željen
željena
željene
željeni
željenu
željeo
zimus
zum
""".split())


@ -35,14 +35,32 @@ class JapaneseTokenizer(object):
def from_disk(self, path, **exclude):
return self
class JapaneseCharacterSegmenter(object):
def __init__(self, vocab):
self.vocab = vocab
def __call__(self, text):
words = []
spaces = []
doc = self.tokenizer(text)
for token in self.tokenizer(text):
words.extend(list(token.text))
spaces.extend([False]*len(token.text))
spaces[-1] = bool(token.whitespace_)
return Doc(self.vocab, words=words, spaces=spaces)
class JapaneseDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'ja'
use_janome = True
@classmethod
def create_tokenizer(cls, nlp=None):
return JapaneseTokenizer(cls, nlp)
if cls.use_janome:
return JapaneseTokenizer(cls, nlp)
else:
return JapaneseCharacterSegmenter(cls, nlp.vocab)
class Japanese(Language):

22
spacy/lang/tr/examples.py Normal file

@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.tr.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Neredesin?",
"Neredesiniz?",
"Bu bir cümledir.",
"Sürücüsüz araçlar sigorta yükümlülüğünü üreticilere kaydırıyor.",
"San Francisco kaldırımda kurye robotları yasaklayabilir."
"Londra İngiltere'nin başkentidir.",
"Türkiye'nin başkenti neresi?",
"Bakanlar Kurulu 180 günlük eylem planınııkladı.",
"Merkez Bankası, beklentiler doğrultusunda faizlerde değişikliğe gitmedi."
]
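# A runnable variant of the docstring example above (an illustrative sketch;
# it assumes a blank Turkish pipeline rather than a trained model):
#
#   import spacy
#   from spacy.lang.tr.examples import sentences
#   nlp = spacy.blank('tr')
#   for doc in nlp.pipe(sentences):
#       print([token.text for token in doc])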


@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
# Thirteen, fifteen, etc. are written separately: on üç
_num_words = ['bir', 'iki', 'üç', 'dört', 'beş', 'altı', 'yedi', 'sekiz',
'dokuz', 'on', 'yirmi', 'otuz', 'kırk', 'elli', 'altmış',
'yetmiş', 'seksen', 'doksan', 'yüz', 'bin', 'milyon',
'milyar', 'katrilyon', 'kentilyon']
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}


@ -10,16 +10,12 @@ acep
adamakıllı
adeta
ait
altmýþ
altmış
altý
altı
ama
amma
anca
ancak
arada
artýk
artık
aslında
aynen
ayrıca
@ -29,46 +25,82 @@ açıkçası
bana
bari
bazen
bazý
bazı
bazısı
bazısına
bazısında
bazısından
bazısını
bazısının
başkası
baţka
başkasına
başkasında
başkasından
başkasını
başkasının
başka
belki
ben
bende
benden
beni
benim
beri
beriki
beþ
beş
beţ
berikinin
berikiyi
berisi
bilcümle
bile
bin
binaen
binaenaleyh
bir
biraz
birazdan
birbiri
birbirine
birbirini
birbirinin
birbirinde
birbirinden
birden
birdenbire
biri
birine
birini
birinin
birinde
birinden
birice
birileri
birilerinde
birilerinden
birilerine
birilerini
birilerinin
birisi
birisine
birisini
birisinin
birisinde
birisinden
birkaç
birkaçı
birkaçına
birkaçını
birkaçının
birkaçında
birkaçından
birkez
birlikte
birçok
birçoğu
birþey
birþeyi
birçoğuna
birçoğunda
birçoğundan
birçoğunu
birçoğunun
birşey
birşeyi
birţey
bitevi
biteviye
bittabi
@ -96,6 +128,11 @@ buracıkta
burada
buradan
burası
burasına
burasını
burasının
burasında
burasından
böyle
böylece
böylecene
@ -106,8 +143,34 @@ büsbütün
bütün
cuk
cümlesi
cümlesine
cümlesini
cümlesinin
cümlesinden
cümlemize
cümlemizi
cümlemizden
çabuk
çabukça
çeşitli
çok
çokları
çoklarınca
çokluk
çoklukla
çokça
çoğu
çoğun
çoğunca
çoğunda
çoğundan
çoğunlukla
çoğunu
çoğunun
çünkü
da
daha
dahası
dahi
dahil
dahilen
@ -124,19 +187,17 @@ denli
derakap
derhal
derken
deđil
değil
değin
diye
diđer
diğer
diğeri
doksan
dokuz
diğerine
diğerini
diğerinden
dolayı
dolayısıyla
doğru
dört
edecek
eden
ederek
@ -146,7 +207,6 @@ edilmesi
ediyor
elbet
elbette
elli
emme
en
enikonu
@ -168,10 +228,10 @@ evvelce
evvelden
evvelemirde
evveli
eđer
eğer
fakat
filanca
filancanın
gah
gayet
gayetle
@ -197,6 +257,10 @@ haliyle
handiyse
hangi
hangisi
hangisine
hangisine
hangisinde
hangisinden
hani
hariç
hasebiyle
@ -207,17 +271,27 @@ hem
henüz
hep
hepsi
hepsini
hepsinin
hepsinde
hepsinden
her
herhangi
herkes
herkesi
herkesin
herkesten
hiç
hiçbir
hiçbiri
hiçbirine
hiçbirini
hiçbirinin
hiçbirinde
hiçbirinden
hoş
hulasaten
iken
iki
ila
ile
ilen
@ -240,43 +314,55 @@ iyicene
için
işte
iţte
kadar
kaffesi
kah
kala
kanýmca
kanımca
karşın
katrilyon
kaynak
kaçı
kaçına
kaçında
kaçından
kaçını
kaçının
kelli
kendi
kendilerinde
kendilerinden
kendilerine
kendilerini
kendilerinin
kendini
kendisi
kendisinde
kendisinden
kendisine
kendisini
kendisinin
kere
kez
keza
kezalik
keşke
keţke
ki
kim
kimden
kime
kimi
kiminin
kimisi
kimisinde
kimisinden
kimisine
kimisinin
kimse
kimsecik
kimsecikler
külliyen
kýrk
kýsaca
kırk
kısaca
kısacası
lakin
leh
lütfen
@ -289,13 +375,10 @@ međer
meğer
meğerki
meğerse
milyar
milyon
mu
mı
nasýl
mi
nasıl
nasılsa
nazaran
@ -304,6 +387,8 @@ ne
neden
nedeniyle
nedenle
nedenler
nedenlerden
nedense
nerde
nerden
@ -332,32 +417,27 @@ olduklarını
oldukça
olduğu
olduğunu
olmadı
olmadığı
olmak
olması
olmayan
olmaz
olsa
olsun
olup
olur
olursa
oluyor
on
ona
onca
onculayın
onda
ondan
onlar
onlara
onlardan
onlari
onlarýn
onları
onların
onu
onun
ora
oracık
oracıkta
orada
@ -365,9 +445,26 @@ oradan
oranca
oranla
oraya
otuz
oysa
oysaki
öbür
öbürkü
öbürü
öbüründe
öbüründen
öbürüne
öbürünü
önce
önceden
önceleri
öncelikle
öteki
ötekisi
öyle
öylece
öylelikle
öylemesine
öz
pek
pekala
peki
@ -379,8 +476,6 @@ sahi
sahiden
sana
sanki
sekiz
seksen
sen
senden
seni
@ -393,6 +488,27 @@ sonra
sonradan
sonraları
sonunda
şayet
şey
şeyden
şeyi
şeyler
şu
şuna
şuncacık
şunda
şundan
şunlar
şunları
şunların
şunu
şunun
şura
şuracık
şuracıkta
şurası
şöyle
şimdi
tabii
tam
tamam
@ -400,8 +516,8 @@ tamamen
tamamıyla
tarafından
tek
trilyon
tüm
üzere
var
vardı
vasıtasıyla
@ -429,84 +545,16 @@ yaptığını
yapılan
yapılması
yapıyor
yedi
yeniden
yenilerde
yerine
yetmiþ
yetmiş
yetmiţ
yine
yirmi
yok
yoksa
yoluyla
yüz
yüzünden
zarfında
zaten
zati
zira
çabuk
çabukça
çeşitli
çok
çokları
çoklarınca
çokluk
çoklukla
çokça
çoğu
çoğun
çoğunca
çoğunlukla
çünkü
öbür
öbürkü
öbürü
önce
önceden
önceleri
öncelikle
öteki
ötekisi
öyle
öylece
öylelikle
öylemesine
öz
üzere
üç
þey
þeyden
þeyi
þeyler
þu
þuna
þunda
þundan
þunu
şayet
şey
şeyden
şeyi
şeyler
şu
şuna
şuncacık
şunda
şundan
şunlar
şunları
şunu
şunun
şura
şuracık
şuracıkta
şurası
şöyle
ţayet
ţimdi
ţu
ţöyle
""".split())

View File

@ -3,11 +3,6 @@ from __future__ import unicode_literals
from ...symbols import ORTH, NORM
# These exceptions are mostly for example purposes hoping that Turkish
# speakers can contribute in the future! Source of copy-pasted examples:
# https://en.wiktionary.org/wiki/Category:Turkish_language
_exc = {
"sağol": [
{ORTH: "sağ"},
@ -16,11 +11,112 @@ _exc = {
for exc_data in [
{ORTH: "A.B.D.", NORM: "Amerika Birleşik Devletleri"}]:
{ORTH: "A.B.D.", NORM: "Amerika Birleşik Devletleri"},
{ORTH: "Alb.", NORM: "Albay"},
{ORTH: "Ar.Gör.", NORM: "Araştırma Görevlisi"},
{ORTH: "Arş.Gör.", NORM: "Araştırma Görevlisi"},
{ORTH: "Asb.", NORM: "Astsubay"},
{ORTH: "Astsb.", NORM: "Astsubay"},
{ORTH: "As.İz.", NORM: "Askeri İnzibat"},
{ORTH: "Atğm", NORM: "Asteğmen"},
{ORTH: "Av.", NORM: "Avukat"},
{ORTH: "Apt.", NORM: "Apartmanı"},
{ORTH: "Bçvş.", NORM: "Başçavuş"},
{ORTH: "bk.", NORM: "bakınız"},
{ORTH: "bknz.", NORM: "bakınız"},
{ORTH: "Bnb.", NORM: "Binbaşı"},
{ORTH: "bnb.", NORM: "binbaşı"},
{ORTH: "Böl.", NORM: "Bölümü"},
{ORTH: "Bşk.", NORM: "Başkanlığı"},
{ORTH: "Bştbp.", NORM: "Baştabip"},
{ORTH: "Bul.", NORM: "Bulvarı"},
{ORTH: "Cad.", NORM: "Caddesi"},
{ORTH: "çev.", NORM: "çeviren"},
{ORTH: "Çvş.", NORM: "Çavuş"},
{ORTH: "dak.", NORM: "dakika"},
{ORTH: "dk.", NORM: "dakika"},
{ORTH: "Doç.", NORM: "Doçent"},
{ORTH: "doğ.", NORM: "doğum tarihi"},
{ORTH: "drl.", NORM: "derleyen"},
{ORTH: "Dz.", NORM: "Deniz"},
{ORTH: "Dz.K.K.lığı", NORM: "Deniz Kuvvetleri Komutanlığı"},
{ORTH: "Dz.Kuv.", NORM: "Deniz Kuvvetleri"},
{ORTH: "Dz.Kuv.K.", NORM: "Deniz Kuvvetleri Komutanlığı"},
{ORTH: "dzl.", NORM: "düzenleyen"},
{ORTH: "Ecz.", NORM: "Eczanesi"},
{ORTH: "ekon.", NORM: "ekonomi"},
{ORTH: "Fak.", NORM: "Fakültesi"},
{ORTH: "Gn.", NORM: "Genel"},
{ORTH: "Gnkur.", NORM: "Genelkurmay"},
{ORTH: "Gn.Kur.", NORM: "Genelkurmay"},
{ORTH: "gr.", NORM: "gram"},
{ORTH: "Hst.", NORM: "Hastanesi"},
{ORTH: "Hs.Uzm.", NORM: "Hesap Uzmanı"},
{ORTH: "huk.", NORM: "hukuk"},
{ORTH: "Hv.", NORM: "Hava"},
{ORTH: "Hv.K.K.lığı", NORM: "Hava Kuvvetleri Komutanlığı"},
{ORTH: "Hv.Kuv.", NORM: "Hava Kuvvetleri"},
{ORTH: "Hv.Kuv.K.", NORM: "Hava Kuvvetleri Komutanlığı"},
{ORTH: "Hz.", NORM: "Hazreti"},
{ORTH: "Hz.Öz.", NORM: "Hizmete Özel"},
{ORTH: "İng.", NORM: "İngilizce"},
{ORTH: "Jeol.", NORM: "Jeoloji"},
{ORTH: "jeol.", NORM: "jeoloji"},
{ORTH: "Korg.", NORM: "Korgeneral"},
{ORTH: "Kur.", NORM: "Kurmay"},
{ORTH: "Kur.Bşk.", NORM: "Kurmay Başkanı"},
{ORTH: "Kuv.", NORM: "Kuvvetleri"},
{ORTH: "Ltd.", NORM: "Limited"},
{ORTH: "Mah.", NORM: "Mahallesi"},
{ORTH: "mah.", NORM: "mahallesi"},
{ORTH: "max.", NORM: "maksimum"},
{ORTH: "min.", NORM: "minimum"},
{ORTH: "Müh.", NORM: "Mühendisliği"},
{ORTH: "müh.", NORM: "mühendisliği"},
{ORTH: "MÖ.", NORM: "Milattan Önce"},
{ORTH: "Onb.", NORM: "Onbaşı"},
{ORTH: "Ord.", NORM: "Ordinaryüs"},
{ORTH: "Org.", NORM: "Orgeneral"},
{ORTH: "Ped.", NORM: "Pedagoji"},
{ORTH: "Prof.", NORM: "Profesör"},
{ORTH: "Sb.", NORM: "Subay"},
{ORTH: "Sn.", NORM: "Sayın"},
{ORTH: "sn.", NORM: "saniye"},
{ORTH: "Sok.", NORM: "Sokak"},
{ORTH: "Şb.", NORM: "Şube"},
{ORTH: "Şti.", NORM: "Şirketi"},
{ORTH: "Tbp.", NORM: "Tabip"},
{ORTH: "T.C.", NORM: "Türkiye Cumhuriyeti"},
{ORTH: "Tel.", NORM: "Telefon"},
{ORTH: "tel.", NORM: "telefon"},
{ORTH: "telg.", NORM: "telgraf"},
{ORTH: "Tğm.", NORM: "Teğmen"},
{ORTH: "tğm.", NORM: "teğmen"},
{ORTH: "tic.", NORM: "ticaret"},
{ORTH: "Tug.", NORM: "Tugay"},
{ORTH: "Tuğg.", NORM: "Tuğgeneral"},
{ORTH: "Tümg.", NORM: "Tümgeneral"},
{ORTH: "Uzm.", NORM: "Uzman"},
{ORTH: "Üçvş.", NORM: "Üstçavuş"},
{ORTH: "Üni.", NORM: "Üniversitesi"},
{ORTH: "Ütğm.", NORM: "Üsteğmen"},
{ORTH: "vb.", NORM: "ve benzeri"},
{ORTH: "vs.", NORM: "vesaire"},
{ORTH: "Yard.", NORM: "Yardımcı"},
{ORTH: "Yar.", NORM: "Yardımcı"},
{ORTH: "Yd.Sb.", NORM: "Yedek Subay"},
{ORTH: "Yard.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Yar.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Yb.", NORM: "Yarbay"},
{ORTH: "Yrd.", NORM: "Yardımcı"},
{ORTH: "Yrd.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Y.Müh.", NORM: "Yüksek mühendis"},
{ORTH: "Y.Mim.", NORM: "Yüksek mimar"}]:
_exc[exc_data[ORTH]] = [exc_data]
for orth in ["Dr."]:
for orth in [
"Dr.", "yy."]:
_exc[orth] = [{ORTH: orth}]
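
With these special cases registered, abbreviations such as 'Dr.', 'vb.' or 'Gnkur.' should come through as single tokens instead of being split at the period. A rough check, assuming spaCy is installed with this commit's Turkish data (the sample sentence is made up):

import spacy

nlp = spacy.blank('tr')
doc = nlp("Dr. Ayşe raporu vb. belgelerle birlikte sundu.")
print([t.text for t in doc])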

View File

@ -319,7 +319,7 @@ cdef class ArcEager(TransitionSystem):
(SHIFT, ['']),
(REDUCE, ['']),
(RIGHT, []),
(LEFT, []),
(LEFT, ['subtok']),
(BREAK, ['ROOT']))
))
seen_actions = set()

View File

@ -477,14 +477,15 @@ cdef class Parser:
free(vectors)
free(scores)
def beam_parse(self, docs, int beam_width=3, float beam_density=0.001):
def beam_parse(self, docs, int beam_width=3, float beam_density=0.001,
float drop=0.):
cdef Beam beam
cdef np.ndarray scores
cdef Doc doc
cdef int nr_class = self.moves.n_moves
cuda_stream = util.get_cuda_stream()
(tokvecs, bp_tokvecs), state2vec, vec2scores = self.get_batch_model(
docs, cuda_stream, 0.0)
docs, cuda_stream, drop)
cdef int offset = 0
cdef int j = 0
cdef int k
@ -523,8 +524,8 @@ cdef class Parser:
n_states += 1
if n_states == 0:
break
vectors = state2vec(token_ids[:n_states])
scores = vec2scores(vectors)
vectors, _ = state2vec.begin_update(token_ids[:n_states], drop)
scores, _ = vec2scores.begin_update(vectors, drop=drop)
c_scores = <float*>scores.data
for beam in todo:
for i in range(beam.size):
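
The new drop argument lets callers apply the same dropout rate during beam decoding as in the greedy path; it defaults to 0.0, so existing callers are unaffected. A hypothetical call pattern, assuming the small English model is installed:

import spacy

nlp = spacy.load('en_core_web_sm')
parser = nlp.get_pipe('parser')
docs = [nlp.make_doc(u'This is a sentence.')]
beams = parser.beam_parse(docs, beam_width=4, beam_density=0.001, drop=0.0)
print(len(beams))    # one beam of candidate parses per input doc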

View File

@ -191,9 +191,12 @@ def _filter_labels(gold_tuples, cutoff, freqs):
for raw_text, sents in gold_tuples:
filtered_sents = []
for (ids, words, tags, heads, labels, iob), ctnts in sents:
filtered_labels = [decompose(label)[0]
if freqs.get(label, cutoff) < cutoff
else label for label in labels]
filtered_labels = []
for label in labels:
if is_decorated(label) and freqs.get(label, 0) < cutoff:
filtered_labels.append(decompose(label)[0])
else:
filtered_labels.append(label)
filtered_sents.append(
((ids, words, tags, heads, filtered_labels, iob), ctnts))
filtered.append((raw_text, filtered_sents))
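
The old one-liner collapsed any label whose frequency fell below the cutoff; the new loop only collapses pseudo-projective ("decorated") labels, so rare plain labels survive. A pure-Python sketch of the rule, using '||' as the decoration delimiter (as in spacy.syntax.nonproj):

def filter_rare_decorated(labels, freqs, cutoff):
    # Collapse 'head||child'-style labels back to the base label when they
    # are too rare to keep; leave everything else untouched.
    out = []
    for label in labels:
        if '||' in label and freqs.get(label, 0) < cutoff:
            out.append(label.split('||')[0])
        else:
            out.append(label)
    return out

print(filter_rare_decorated(['amod', 'advmod||nsubj'], {'advmod||nsubj': 2}, cutoff=30))
# ['amod', 'advmod']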

View File

@ -0,0 +1,74 @@
from ...vocab import Vocab
from ...pipeline import DependencyParser
from ...tokens import Doc
from ...gold import GoldParse
from ...syntax.nonproj import projectivize
annot_tuples = [
(0, 'When', 'WRB', 11, 'advmod', 'O'),
(1, 'Walter', 'NNP', 2, 'compound', 'B-PERSON'),
(2, 'Rodgers', 'NNP', 11, 'nsubj', 'L-PERSON'),
(3, ',', ',', 2, 'punct', 'O'),
(4, 'our', 'PRP$', 6, 'poss', 'O'),
(5, 'embedded', 'VBN', 6, 'amod', 'O'),
(6, 'reporter', 'NN', 2, 'appos', 'O'),
(7, 'with', 'IN', 6, 'prep', 'O'),
(8, 'the', 'DT', 10, 'det', 'B-ORG'),
(9, '3rd', 'NNP', 10, 'compound', 'I-ORG'),
(10, 'Cavalry', 'NNP', 7, 'pobj', 'L-ORG'),
(11, 'says', 'VBZ', 44, 'advcl', 'O'),
(12, 'three', 'CD', 13, 'nummod', 'U-CARDINAL'),
(13, 'battalions', 'NNS', 16, 'nsubj', 'O'),
(14, 'of', 'IN', 13, 'prep', 'O'),
(15, 'troops', 'NNS', 14, 'pobj', 'O'),
(16, 'are', 'VBP', 11, 'ccomp', 'O'),
(17, 'on', 'IN', 16, 'prep', 'O'),
(18, 'the', 'DT', 19, 'det', 'O'),
(19, 'ground', 'NN', 17, 'pobj', 'O'),
(20, ',', ',', 17, 'punct', 'O'),
(21, 'inside', 'IN', 17, 'prep', 'O'),
(22, 'Baghdad', 'NNP', 21, 'pobj', 'U-GPE'),
(23, 'itself', 'PRP', 22, 'appos', 'O'),
(24, ',', ',', 16, 'punct', 'O'),
(25, 'have', 'VBP', 26, 'aux', 'O'),
(26, 'taken', 'VBN', 16, 'dep', 'O'),
(27, 'up', 'RP', 26, 'prt', 'O'),
(28, 'positions', 'NNS', 26, 'dobj', 'O'),
(29, 'they', 'PRP', 31, 'nsubj', 'O'),
(30, "'re", 'VBP', 31, 'aux', 'O'),
(31, 'going', 'VBG', 26, 'parataxis', 'O'),
(32, 'to', 'TO', 33, 'aux', 'O'),
(33, 'spend', 'VB', 31, 'xcomp', 'O'),
(34, 'the', 'DT', 35, 'det', 'B-TIME'),
(35, 'night', 'NN', 33, 'dobj', 'L-TIME'),
(36, 'there', 'RB', 33, 'advmod', 'O'),
(37, 'presumably', 'RB', 33, 'advmod', 'O'),
(38, ',', ',', 44, 'punct', 'O'),
(39, 'how', 'WRB', 40, 'advmod', 'O'),
(40, 'many', 'JJ', 41, 'amod', 'O'),
(41, 'soldiers', 'NNS', 44, 'pobj', 'O'),
(42, 'are', 'VBP', 44, 'aux', 'O'),
(43, 'we', 'PRP', 44, 'nsubj', 'O'),
(44, 'talking', 'VBG', 44, 'ROOT', 'O'),
(45, 'about', 'IN', 44, 'prep', 'O'),
(46, 'right', 'RB', 47, 'advmod', 'O'),
(47, 'now', 'RB', 44, 'advmod', 'O'),
(48, '?', '.', 44, 'punct', 'O')]
def test_get_oracle_actions():
doc = Doc(Vocab(), words=[t[1] for t in annot_tuples])
parser = DependencyParser(doc.vocab)
parser.moves.add_action(0, '')
parser.moves.add_action(1, '')
parser.moves.add_action(1, '')
parser.moves.add_action(4, 'ROOT')
for i, (id_, word, tag, head, dep, ent) in enumerate(annot_tuples):
if head > i:
parser.moves.add_action(2, dep)
elif head < i:
parser.moves.add_action(3, dep)
ids, words, tags, heads, deps, ents = zip(*annot_tuples)
heads, deps = projectivize(heads, deps)
gold = GoldParse(doc, words=words, tags=tags, heads=heads, deps=deps)
parser.moves.preprocess_gold(gold)
actions = parser.moves.get_oracle_sequence(doc, gold)

View File

@ -294,6 +294,7 @@ cdef class Span:
cdef int i
if self.doc.is_parsed:
root = &self.doc.c[self.start]
n = 0
while root.head != 0:
root += root.head
n += 1
@ -307,8 +308,10 @@ cdef class Span:
start += -1
# find end of the sentence
end = self.end
while self.doc.c[end].sent_start != 1:
n = 0
while end < self.doc.length and self.doc.c[end].sent_start != 1:
end += 1
n += 1
if n >= self.doc.length:
break
#
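
The added counters and bounds checks keep the sentence scan from running past the end of the document when a span sits at the very end. A quick way to exercise that path, assuming the small English model is installed:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'First sentence here. The span below covers the final words.')
span = doc[-3:]           # a span ending at the last token
print(span.sent.text)     # the sentence containing the span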

View File

@ -279,8 +279,8 @@ cdef class Token:
"""
def __get__(self):
if self.c.lemma == 0:
lemma = self.vocab.morphology.lemmatizer.lookup(self.orth_)
return lemma
lemma_ = self.vocab.morphology.lemmatizer.lookup(self.orth_)
return self.vocab.strings[lemma_]
else:
return self.c.lemma
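
Previously the lookup lemma (a string) was returned straight from the integer lemma attribute; converting it through the StringStore keeps token.lemma and token.lemma_ consistent. A small check with a tokenizer-only pipeline, where the lookup path is always taken:

import spacy

nlp = spacy.blank('en')
token = nlp(u'dogs')[0]
assert token.lemma == nlp.vocab.strings[token.lemma_]
print(token.lemma_, token.lemma)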

View File

@ -451,7 +451,7 @@ def itershuffle(iterable, bufsize=1000):
try:
while True:
for i in range(random.randint(1, bufsize-len(buf))):
buf.append(iterable.next())
buf.append(next(iterable))
random.shuffle(buf)
for i in range(random.randint(1, bufsize)):
if buf:

View File

@ -15,11 +15,8 @@ from .compat import basestring_, path2str
from . import util
def unpickle_vectors(keys_and_rows, data):
vectors = Vectors(data=data)
for key, row in keys_and_rows:
vectors.add(key, row=row)
return vectors
def unpickle_vectors(bytes_data):
return Vectors().from_bytes(bytes_data)
cdef class Vectors:
@ -86,8 +83,7 @@ cdef class Vectors:
return len(self.key2row)
def __reduce__(self):
keys_and_rows = tuple(self.key2row.items())
return (unpickle_vectors, (keys_and_rows, self.data))
return (unpickle_vectors, (self.to_bytes(),))
def __getitem__(self, key):
"""Get a vector by key. If the key is not found, a KeyError is raised.

Binary file not shown (image; 378 KiB before this change).

View File

@ -76,13 +76,15 @@
},
"MODEL_LICENSES": {
"CC BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
"CC BY-SA 3.0": "https://creativecommons.org/licenses/by-sa/3.0/",
"CC BY-SA 4.0": "https://creativecommons.org/licenses/by-sa/4.0/",
"CC BY-NC": "https://creativecommons.org/licenses/by-nc/3.0/",
"CC BY-NC 3.0": "https://creativecommons.org/licenses/by-nc/3.0/",
"GPL": "https://www.gnu.org/licenses/gpl.html",
"LGPL": "https://www.gnu.org/licenses/lgpl.html"
"CC BY 4.0": "https://creativecommons.org/licenses/by/4.0/",
"CC BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
"CC BY-SA 3.0": "https://creativecommons.org/licenses/by-sa/3.0/",
"CC BY-SA 4.0": "https://creativecommons.org/licenses/by-sa/4.0/",
"CC BY-NC": "https://creativecommons.org/licenses/by-nc/3.0/",
"CC BY-NC 3.0": "https://creativecommons.org/licenses/by-nc/3.0/",
"CC-BY-NC-SA 3.0": "https://creativecommons.org/licenses/by-nc-sa/3.0/",
"GPL": "https://www.gnu.org/licenses/gpl.html",
"LGPL": "https://www.gnu.org/licenses/lgpl.html"
},
"MODEL_BENCHMARKS": {

View File

@ -68,7 +68,7 @@ p
+item #[strong spaCy is not research software].
| It's built on the latest research, but it's designed to get
| things done. This leads to fairly different design decisions than
| #[+a("https://github./nltk/nltk") NLTK]
| #[+a("https://github.com/nltk/nltk") NLTK]
| or #[+a("https://stanfordnlp.github.io/CoreNLP/") CoreNLP], which were
| created as platforms for teaching and research. The main difference
| is that spaCy is integrated and opinionated. spaCy tries to avoid asking