Merge master

Matthew Honnibal 2018-03-14 19:03:24 +01:00
commit ab3d860686
32 changed files with 1910 additions and 199 deletions

11
.buildkite/train.yml Normal file

@ -0,0 +1,11 @@
steps:
-
command: "fab env clean make test wheel"
label: ":dizzy: :python:"
artifact_paths: "dist/*.whl"
- wait
- trigger: "spacy-train-from-wheel"
label: ":dizzy: :train:"
build:
env:
SPACY_VERSION: "{$SPACY_VERSION}"

106
.github/contributors/alldefector.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Feng Niu |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | Feb 21, 2018 |
| GitHub username | alldefector |
| Website (optional) | |

106
.github/contributors/willismonroe.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Willis Monroe |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-3-5 |
| GitHub username | willismonroe |
| Website (optional) | |


@ -182,7 +182,7 @@ If you've made a contribution to spaCy, you should fill in the
[spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that
your contribution can be used across the project. If you agree to be bound by
the terms of the agreement, fill in the [template](.github/CONTRIBUTOR_AGREEMENT.md)
and include it with your pull request, or sumit it separately to
and include it with your pull request, or submit it separately to
[`.github/contributors/`](/.github/contributors). The name of the file should be
your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.


@ -28,8 +28,10 @@ import cytoolz
import conll17_ud_eval
import spacy.lang.zh
import spacy.lang.ja
spacy.lang.zh.Chinese.Defaults.use_jieba = False
spacy.lang.ja.Japanese.Defaults.use_janome = False
random.seed(0)
numpy.random.seed(0)
@ -280,6 +282,30 @@ def print_progress(itn, losses, ud_scores):
))
print(tpl.format(itn, **fields))
#def get_sent_conllu(sent, sent_id):
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
def get_token_conllu(token, i):
if token._.begins_fused:
n = 1
while token.nbor(n)._.inside_fused:
n += 1
id_ = '%d-%d' % (i, i+n)
lines = [id_, token.text, '_', '_', '_', '_', '_', '_', '_', '_']
else:
lines = []
if token.head.i == token.i:
head = 0
else:
head = i + (token.head.i - token.i) + 1
fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
str(head), token.dep_.lower(), '_', '_']
lines.append('\t'.join(fields))
return '\n'.join(lines)
Token.set_extension('get_conllu_lines', method=get_token_conllu)
Token.set_extension('begins_fused', default=False)
Token.set_extension('inside_fused', default=False)
##################
# Initialization #

83
fabfile.py vendored

@ -1,49 +1,92 @@
# coding: utf-8
from __future__ import unicode_literals, print_function
import contextlib
from pathlib import Path
from fabric.api import local, lcd, env, settings, prefix
from fabtools.python import virtualenv
from os import path, environ
import shutil
PWD = path.dirname(__file__)
ENV = environ['VENV_DIR'] if 'VENV_DIR' in environ else '.env'
VENV_DIR = path.join(PWD, ENV)
VENV_DIR = Path(PWD) / ENV
def env(lang='python2.7'):
if path.exists(VENV_DIR):
@contextlib.contextmanager
def virtualenv(name, create=False, python='/usr/bin/python3.6'):
python = Path(python).resolve()
env_path = VENV_DIR
if create:
if env_path.exists():
shutil.rmtree(str(env_path))
local('{python} -m venv {env_path}'.format(python=python, env_path=VENV_DIR))
def wrapped_local(cmd, env_vars=[], capture=False, direct=False):
return local('source {}/bin/activate && {}'.format(env_path, cmd),
shell='/bin/bash', capture=False)
yield wrapped_local
def env(lang='python3.6'):
if VENV_DIR.exists():
local('rm -rf {env}'.format(env=VENV_DIR))
local('pip install virtualenv')
local('python -m virtualenv -p {lang} {env}'.format(lang=lang, env=VENV_DIR))
if lang.startswith('python3'):
local('{lang} -m venv {env}'.format(lang=lang, env=VENV_DIR))
else:
local('{lang} -m pip install virtualenv --no-cache-dir'.format(lang=lang))
local('{lang} -m virtualenv {env} --no-cache-dir'.format(lang=lang, env=VENV_DIR))
with virtualenv(VENV_DIR) as venv_local:
print(venv_local('python --version', capture=True))
venv_local('pip install --upgrade setuptools --no-cache-dir')
venv_local('pip install pytest --no-cache-dir')
venv_local('pip install wheel --no-cache-dir')
venv_local('pip install -r requirements.txt --no-cache-dir')
venv_local('pip install pex --no-cache-dir')
def install():
with virtualenv(VENV_DIR):
local('pip install --upgrade setuptools')
local('pip install dist/*.tar.gz')
local('pip install pytest')
with virtualenv(VENV_DIR) as venv_local:
venv_local('pip install dist/*.tar.gz')
def make():
with virtualenv(VENV_DIR):
with lcd(path.dirname(__file__)):
local('pip install cython')
local('pip install murmurhash')
local('pip install -r requirements.txt')
local('python setup.py build_ext --inplace')
with lcd(path.dirname(__file__)):
local('export PYTHONPATH=`pwd` && source .env/bin/activate && python setup.py build_ext --inplace',
shell='/bin/bash')
def sdist():
with virtualenv(VENV_DIR):
with virtualenv(VENV_DIR) as venv_local:
with lcd(path.dirname(__file__)):
local('python setup.py sdist')
def wheel():
with virtualenv(VENV_DIR) as venv_local:
with lcd(path.dirname(__file__)):
venv_local('python setup.py bdist_wheel')
def pex():
with virtualenv(VENV_DIR) as venv_local:
with lcd(path.dirname(__file__)):
sha = local('git rev-parse --short HEAD', capture=True)
venv_local('pex dist/*.whl -e spacy -o dist/spacy-%s.pex' % sha,
direct=True)
def clean():
with lcd(path.dirname(__file__)):
local('python setup.py clean --all')
local('rm -f dist/*.whl')
local('rm -f dist/*.pex')
with virtualenv(VENV_DIR) as venv_local:
venv_local('python setup.py clean --all')
def test():
with virtualenv(VENV_DIR):
with virtualenv(VENV_DIR) as venv_local:
with lcd(path.dirname(__file__)):
local('py.test -x spacy/tests')
venv_local('pytest -x spacy/tests')
def train():
args = environ.get('SPACY_TRAIN_ARGS', '')
with virtualenv(VENV_DIR) as venv_local:
venv_local('spacy train {args}'.format(args=args))


@ -8,6 +8,7 @@ if __name__ == '__main__':
import sys
from spacy.cli import download, link, info, package, train, convert
from spacy.cli import vocab, init_model, profile, evaluate, validate
from spacy.cli import ud_train, ud_evaluate
from spacy.util import prints
commands = {
@ -15,7 +16,9 @@ if __name__ == '__main__':
'link': link,
'info': info,
'train': train,
'ud-train': ud_train,
'evaluate': evaluate,
'ud-evaluate': ud_evaluate,
'convert': convert,
'package': package,
'vocab': vocab,


@ -3,7 +3,7 @@
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
__title__ = 'spacy'
__version__ = '2.1.0.dev1'
__version__ = '2.1.0.dev3'
__summary__ = 'Industrial-strength Natural Language Processing (NLP) with Python and Cython'
__uri__ = 'https://spacy.io'
__author__ = 'Explosion AI'


@ -9,3 +9,5 @@ from .convert import convert
from .vocab import make_vocab as vocab
from .init_model import init_model
from .validate import validate
from .ud_train import main as ud_train
from .conll17_ud_eval import main as ud_evaluate


@ -0,0 +1,570 @@
#!/usr/bin/env python
# CoNLL 2017 UD Parsing evaluation script.
#
# Compatible with Python 2.7 and 3.2+, can be used either as a module
# or a standalone executable.
#
# Copyright 2017 Institute of Formal and Applied Linguistics (UFAL),
# Faculty of Mathematics and Physics, Charles University, Czech Republic.
#
# Changelog:
# - [02 Jan 2017] Version 0.9: Initial release
# - [25 Jan 2017] Version 0.9.1: Fix bug in LCS alignment computation
# - [10 Mar 2017] Version 1.0: Add documentation and test
# Compare HEADs correctly using aligned words
# Allow evaluation with erroneous spaces in forms
# Compare forms in LCS case insensitively
# Detect cycles and multiple root nodes
# Compute AlignedAccuracy
# Command line usage
# ------------------
# conll17_ud_eval.py [-v] [-w weights_file] gold_conllu_file system_conllu_file
#
# - if no -v is given, only the CoNLL17 UD Shared Task evaluation LAS metric
# is printed
# - if -v is given, several metrics are printed (as precision, recall, F1 score,
# and in case the metric is computed on aligned words also accuracy on these):
# - Tokens: how well do the gold tokens match system tokens
# - Sentences: how well do the gold sentences match system sentences
# - Words: how well can the gold words be aligned to system words
# - UPOS: using aligned words, how well does UPOS match
# - XPOS: using aligned words, how well does XPOS match
# - Feats: using aligned words, how well does FEATS match
# - AllTags: using aligned words, how well does UPOS+XPOS+FEATS match
# - Lemmas: using aligned words, how well does LEMMA match
# - UAS: using aligned words, how well does HEAD match
# - LAS: using aligned words, how well does HEAD+DEPREL(ignoring subtypes) match
# - if weights_file is given (with lines containing deprel-weight pairs),
# one more metric is shown:
# - WeightedLAS: as LAS, but each deprel (ignoring subtypes) has different weight
# API usage
# ---------
# - load_conllu(file)
# - loads CoNLL-U file from given file object to an internal representation
# - the file object should return str on both Python 2 and Python 3
# - raises UDError exception if the given file cannot be loaded
# - evaluate(gold_ud, system_ud)
# - evaluate the given gold and system CoNLL-U files (loaded with load_conllu)
# - raises UDError if the concatenated tokens of gold and system file do not match
# - returns a dictionary with the metrics described above, each metric having
# three fields: precision, recall and f1
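# Example under Python 3 (an illustrative sketch, not part of the original
# script; the file names "gold.conllu" and "system.conllu" are placeholders):
#
#   import conll17_ud_eval
#   with open("gold.conllu", encoding="utf-8") as gold_file, \
#           open("system.conllu", encoding="utf-8") as system_file:
#       gold_ud = conll17_ud_eval.load_conllu(gold_file)
#       system_ud = conll17_ud_eval.load_conllu(system_file)
#   scores = conll17_ud_eval.evaluate(gold_ud, system_ud)
#   print("LAS F1 = {:.2f}".format(100 * scores["LAS"].f1))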
# Description of token matching
# -----------------------------
# In order to match tokens of gold file and system file, we consider the text
# resulting from concatenation of gold tokens and text resulting from
# concatenation of system tokens. These texts should match -- if they do not,
# the evaluation fails.
#
# If the texts do match, every token is represented as a range in this original
# text, and tokens are equal only if their range is the same.
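# For example (an illustrative case, not from the original text): if the gold
# tokens are "New", "York" and the system produces the single token "NewYork",
# both concatenated texts read "NewYork"; the gold tokens cover the ranges
# [0, 3) and [3, 7), the system token covers [0, 7), and no token is counted
# as matching because no ranges are equal.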
# Description of word matching
# ----------------------------
# When matching words of gold file and system file, we first match the tokens.
# The words which are also tokens are matched as tokens, but words in multi-word
# tokens have to be handled differently.
#
# To handle multi-word tokens, we start by finding "multi-word spans".
# Multi-word span is a span in the original text such that
# - it contains at least one multi-word token
# - all multi-word tokens in the span (considering both gold and system ones)
# are completely inside the span (i.e., they do not "stick out")
# - the multi-word span is as small as possible
#
# For every multi-word span, we align the gold and system words completely
# inside this span using LCS on their FORMs. The words not intersecting
# (even partially) any multi-word span are then aligned as tokens.
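# For example (an illustrative case, not from the original text): if the gold
# file analyses the token "cannot" as a multi-word token with the words "can"
# and "not", while the system file simply emits the two tokens "can" and
# "not", the multi-word span covers "cannot"; the LCS over the lower-cased
# FORMs then aligns gold "can"/"not" with system "can"/"not", so both words
# are aligned even though the token boundaries differ.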
from __future__ import division
from __future__ import print_function
import argparse
import io
import sys
import unittest
# CoNLL-U column names
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)
# UD Error is used when raising exceptions in this module
class UDError(Exception):
pass
# Load given CoNLL-U file into internal representation
def load_conllu(file):
# Internal representation classes
class UDRepresentation:
def __init__(self):
# Characters of all the tokens in the whole file.
# Whitespace between tokens is not included.
self.characters = []
# List of UDSpan instances with start&end indices into `characters`.
self.tokens = []
# List of UDWord instances.
self.words = []
# List of UDSpan instances with start&end indices into `characters`.
self.sentences = []
class UDSpan:
def __init__(self, start, end, characters):
self.start = start
# Note that self.end marks the first position **after the end** of span,
# so we can use characters[start:end] or range(start, end).
self.end = end
self.characters = characters
@property
def text(self):
return ''.join(self.characters[self.start:self.end])
def __str__(self):
return self.text
def __repr__(self):
return self.text
class UDWord:
def __init__(self, span, columns, is_multiword):
# Span of this word (or MWT, see below) within ud_representation.characters.
self.span = span
# 10 columns of the CoNLL-U file: ID, FORM, LEMMA,...
self.columns = columns
# is_multiword==True means that this word is part of a multi-word token.
# In that case, self.span marks the span of the whole multi-word token.
self.is_multiword = is_multiword
# Reference to the UDWord instance representing the HEAD (or None if root).
self.parent = None
# Let's ignore language-specific deprel subtypes.
self.columns[DEPREL] = columns[DEPREL].split(':')[0]
ud = UDRepresentation()
# Load the CoNLL-U file
index, sentence_start = 0, None
linenum = 0
while True:
line = file.readline()
linenum += 1
if not line:
break
line = line.rstrip("\r\n")
# Handle sentence start boundaries
if sentence_start is None:
# Skip comments
if line.startswith("#"):
continue
# Start a new sentence
ud.sentences.append(UDSpan(index, 0, ud.characters))
sentence_start = len(ud.words)
if not line:
# Add parent UDWord links and check there are no cycles
def process_word(word):
if word.parent == "remapping":
raise UDError("There is a cycle in a sentence")
if word.parent is None:
head = int(word.columns[HEAD])
if head > len(ud.words) - sentence_start:
raise UDError("HEAD '{}' points outside of the sentence".format(word.columns[HEAD]))
if head:
parent = ud.words[sentence_start + head - 1]
word.parent = "remapping"
process_word(parent)
word.parent = parent
for word in ud.words[sentence_start:]:
process_word(word)
# Check there is a single root node
if len([word for word in ud.words[sentence_start:] if word.parent is None]) != 1:
raise UDError("There are multiple roots in a sentence")
# End the sentence
ud.sentences[-1].end = index
sentence_start = None
continue
# Read next token/word
columns = line.split("\t")
if len(columns) != 10:
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, line))
# Skip empty nodes
if "." in columns[ID]:
continue
# Delete spaces from FORM so gold.characters == system.characters
# even if one of them tokenizes the space.
columns[FORM] = columns[FORM].replace(" ", "")
if not columns[FORM]:
raise UDError("There is an empty FORM in the CoNLL-U file -- line %d" % linenum)
# Save token
ud.characters.extend(columns[FORM])
ud.tokens.append(UDSpan(index, index + len(columns[FORM]), ud.characters))
index += len(columns[FORM])
# Handle multi-word tokens to save word(s)
if "-" in columns[ID]:
try:
start, end = map(int, columns[ID].split("-"))
except:
raise UDError("Cannot parse multi-word token ID '{}'".format(columns[ID]))
for _ in range(start, end + 1):
word_line = file.readline().rstrip("\r\n")
word_columns = word_line.split("\t")
if len(word_columns) != 10:
print(columns)
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, word_line))
ud.words.append(UDWord(ud.tokens[-1], word_columns, is_multiword=True))
# Basic tokens/words
else:
try:
word_id = int(columns[ID])
except:
raise UDError("Cannot parse word ID '{}'".format(columns[ID]))
if word_id != len(ud.words) - sentence_start + 1:
raise UDError("Incorrect word ID '{}' for word '{}', expected '{}'".format(columns[ID], columns[FORM], len(ud.words) - sentence_start + 1))
try:
head_id = int(columns[HEAD])
except:
raise UDError("Cannot parse HEAD '{}'".format(columns[HEAD]))
if head_id < 0:
raise UDError("HEAD cannot be negative")
ud.words.append(UDWord(ud.tokens[-1], columns, is_multiword=False))
if sentence_start is not None:
raise UDError("The CoNLL-U file does not end with empty line")
return ud
# Evaluate the gold and system treebanks (loaded using load_conllu).
def evaluate(gold_ud, system_ud, deprel_weights=None):
class Score:
def __init__(self, gold_total, system_total, correct, aligned_total=None):
self.precision = correct / system_total if system_total else 0.0
self.recall = correct / gold_total if gold_total else 0.0
self.f1 = 2 * correct / (system_total + gold_total) if system_total + gold_total else 0.0
self.aligned_accuracy = correct / aligned_total if aligned_total else aligned_total
class AlignmentWord:
def __init__(self, gold_word, system_word):
self.gold_word = gold_word
self.system_word = system_word
self.gold_parent = None
self.system_parent_gold_aligned = None
class Alignment:
def __init__(self, gold_words, system_words):
self.gold_words = gold_words
self.system_words = system_words
self.matched_words = []
self.matched_words_map = {}
def append_aligned_words(self, gold_word, system_word):
self.matched_words.append(AlignmentWord(gold_word, system_word))
self.matched_words_map[system_word] = gold_word
def fill_parents(self):
# We represent root parents in both gold and system data by '0'.
# For gold data, we represent a non-root parent by the corresponding gold word.
# For system data, we represent a non-root parent by the gold word aligned
# to the parent system node, or by None if no gold word is aligned to the parent.
for words in self.matched_words:
words.gold_parent = words.gold_word.parent if words.gold_word.parent is not None else 0
words.system_parent_gold_aligned = self.matched_words_map.get(words.system_word.parent, None) \
if words.system_word.parent is not None else 0
def lower(text):
if sys.version_info < (3, 0) and isinstance(text, str):
return text.decode("utf-8").lower()
return text.lower()
def spans_score(gold_spans, system_spans):
correct, gi, si = 0, 0, 0
while gi < len(gold_spans) and si < len(system_spans):
if system_spans[si].start < gold_spans[gi].start:
si += 1
elif gold_spans[gi].start < system_spans[si].start:
gi += 1
else:
correct += gold_spans[gi].end == system_spans[si].end
si += 1
gi += 1
return Score(len(gold_spans), len(system_spans), correct)
def alignment_score(alignment, key_fn, weight_fn=lambda w: 1):
gold, system, aligned, correct = 0, 0, 0, 0
for word in alignment.gold_words:
gold += weight_fn(word)
for word in alignment.system_words:
system += weight_fn(word)
for words in alignment.matched_words:
aligned += weight_fn(words.gold_word)
if key_fn is None:
# Return score for whole aligned words
return Score(gold, system, aligned)
for words in alignment.matched_words:
if key_fn(words.gold_word, words.gold_parent) == key_fn(words.system_word, words.system_parent_gold_aligned):
correct += weight_fn(words.gold_word)
return Score(gold, system, correct, aligned)
def beyond_end(words, i, multiword_span_end):
if i >= len(words):
return True
if words[i].is_multiword:
return words[i].span.start >= multiword_span_end
return words[i].span.end > multiword_span_end
def extend_end(word, multiword_span_end):
if word.is_multiword and word.span.end > multiword_span_end:
return word.span.end
return multiword_span_end
def find_multiword_span(gold_words, system_words, gi, si):
# We know gold_words[gi].is_multiword or system_words[si].is_multiword.
# Find the start of the multiword span (gs, ss), so the multiword span is minimal.
# Initialize multiword_span_end characters index.
if gold_words[gi].is_multiword:
multiword_span_end = gold_words[gi].span.end
if not system_words[si].is_multiword and system_words[si].span.start < gold_words[gi].span.start:
si += 1
else: # if system_words[si].is_multiword
multiword_span_end = system_words[si].span.end
if not gold_words[gi].is_multiword and gold_words[gi].span.start < system_words[si].span.start:
gi += 1
gs, ss = gi, si
# Find the end of the multiword span
# (so both gi and si are pointing to the word following the multiword span end).
while not beyond_end(gold_words, gi, multiword_span_end) or \
not beyond_end(system_words, si, multiword_span_end):
if gi < len(gold_words) and (si >= len(system_words) or
gold_words[gi].span.start <= system_words[si].span.start):
multiword_span_end = extend_end(gold_words[gi], multiword_span_end)
gi += 1
else:
multiword_span_end = extend_end(system_words[si], multiword_span_end)
si += 1
return gs, ss, gi, si
def compute_lcs(gold_words, system_words, gi, si, gs, ss):
lcs = [[0] * (si - ss) for i in range(gi - gs)]
for g in reversed(range(gi - gs)):
for s in reversed(range(si - ss)):
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
lcs[g][s] = 1 + (lcs[g+1][s+1] if g+1 < gi-gs and s+1 < si-ss else 0)
lcs[g][s] = max(lcs[g][s], lcs[g+1][s] if g+1 < gi-gs else 0)
lcs[g][s] = max(lcs[g][s], lcs[g][s+1] if s+1 < si-ss else 0)
return lcs
def align_words(gold_words, system_words):
alignment = Alignment(gold_words, system_words)
gi, si = 0, 0
while gi < len(gold_words) and si < len(system_words):
if gold_words[gi].is_multiword or system_words[si].is_multiword:
# A: Multi-word tokens => align via LCS within the whole "multiword span".
gs, ss, gi, si = find_multiword_span(gold_words, system_words, gi, si)
if si > ss and gi > gs:
lcs = compute_lcs(gold_words, system_words, gi, si, gs, ss)
# Store aligned words
s, g = 0, 0
while g < gi - gs and s < si - ss:
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
alignment.append_aligned_words(gold_words[gs+g], system_words[ss+s])
g += 1
s += 1
elif lcs[g][s] == (lcs[g+1][s] if g+1 < gi-gs else 0):
g += 1
else:
s += 1
else:
# B: No multi-word token => align according to spans.
if (gold_words[gi].span.start, gold_words[gi].span.end) == (system_words[si].span.start, system_words[si].span.end):
alignment.append_aligned_words(gold_words[gi], system_words[si])
gi += 1
si += 1
elif gold_words[gi].span.start <= system_words[si].span.start:
gi += 1
else:
si += 1
alignment.fill_parents()
return alignment
# Check that underlying character sequences do match
if gold_ud.characters != system_ud.characters:
index = 0
while gold_ud.characters[index] == system_ud.characters[index]:
index += 1
raise UDError(
"The concatenation of tokens in gold file and in system file differ!\n" +
"First 20 differing characters in gold file: '{}' and system file: '{}'".format(
"".join(gold_ud.characters[index:index + 20]),
"".join(system_ud.characters[index:index + 20])
)
)
# Align words
alignment = align_words(gold_ud.words, system_ud.words)
# Compute the F1-scores
result = {
"Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
"Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
"Words": alignment_score(alignment, None),
"UPOS": alignment_score(alignment, lambda w, parent: w.columns[UPOS]),
"XPOS": alignment_score(alignment, lambda w, parent: w.columns[XPOS]),
"Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
"AllTags": alignment_score(alignment, lambda w, parent: (w.columns[UPOS], w.columns[XPOS], w.columns[FEATS])),
"Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
"UAS": alignment_score(alignment, lambda w, parent: parent),
"LAS": alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL])),
}
# Add WeightedLAS if weights are given
if deprel_weights is not None:
def weighted_las(word):
return deprel_weights.get(word.columns[DEPREL], 1.0)
result["WeightedLAS"] = alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL]), weighted_las)
return result
def load_deprel_weights(weights_file):
if weights_file is None:
return None
deprel_weights = {}
for line in weights_file:
# Ignore comments and empty lines
if line.startswith("#") or not line.strip():
continue
columns = line.rstrip("\r\n").split()
if len(columns) != 2:
raise ValueError("Expected two columns in the UD Relations weights file on line '{}'".format(line))
deprel_weights[columns[0]] = float(columns[1])
return deprel_weights
def load_conllu_file(path):
_file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
return load_conllu(_file)
def evaluate_wrapper(args):
# Load CoNLL-U files
gold_ud = load_conllu_file(args.gold_file)
system_ud = load_conllu_file(args.system_file)
# Load weights if requested
deprel_weights = load_deprel_weights(args.weights)
return evaluate(gold_ud, system_ud, deprel_weights)
def main():
# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("gold_file", type=str,
help="Name of the CoNLL-U file with the gold data.")
parser.add_argument("system_file", type=str,
help="Name of the CoNLL-U file with the predicted data.")
parser.add_argument("--weights", "-w", type=argparse.FileType("r"), default=None,
metavar="deprel_weights_file",
help="Compute WeightedLAS using given weights for Universal Dependency Relations.")
parser.add_argument("--verbose", "-v", default=0, action="count",
help="Print all metrics.")
args = parser.parse_args()
# Use verbose if weights are supplied
if args.weights is not None and not args.verbose:
args.verbose = 1
# Evaluate
evaluation = evaluate_wrapper(args)
# Print the evaluation
if not args.verbose:
print("LAS F1 Score: {:.2f}".format(100 * evaluation["LAS"].f1))
else:
metrics = ["Tokens", "Sentences", "Words", "UPOS", "XPOS", "Feats", "AllTags", "Lemmas", "UAS", "LAS"]
if args.weights is not None:
metrics.append("WeightedLAS")
print("Metrics | Precision | Recall | F1 Score | AligndAcc")
print("-----------+-----------+-----------+-----------+-----------")
for metric in metrics:
print("{:11}|{:10.2f} |{:10.2f} |{:10.2f} |{}".format(
metric,
100 * evaluation[metric].precision,
100 * evaluation[metric].recall,
100 * evaluation[metric].f1,
"{:10.2f}".format(100 * evaluation[metric].aligned_accuracy) if evaluation[metric].aligned_accuracy is not None else ""
))
if __name__ == "__main__":
main()
# Tests, which can be executed with `python -m unittest conll17_ud_eval`.
class TestAlignment(unittest.TestCase):
@staticmethod
def _load_words(words):
"""Prepare fake CoNLL-U files with fake HEAD to prevent multiple roots errors."""
lines, num_words = [], 0
for w in words:
parts = w.split(" ")
if len(parts) == 1:
num_words += 1
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, parts[0], int(num_words>1)))
else:
lines.append("{}-{}\t{}\t_\t_\t_\t_\t_\t_\t_\t_".format(num_words + 1, num_words + len(parts) - 1, parts[0]))
for part in parts[1:]:
num_words += 1
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, part, int(num_words>1)))
return load_conllu((io.StringIO if sys.version_info >= (3, 0) else io.BytesIO)("\n".join(lines+["\n"])))
def _test_exception(self, gold, system):
self.assertRaises(UDError, evaluate, self._load_words(gold), self._load_words(system))
def _test_ok(self, gold, system, correct):
metrics = evaluate(self._load_words(gold), self._load_words(system))
gold_words = sum((max(1, len(word.split(" ")) - 1) for word in gold))
system_words = sum((max(1, len(word.split(" ")) - 1) for word in system))
self.assertEqual((metrics["Words"].precision, metrics["Words"].recall, metrics["Words"].f1),
(correct / system_words, correct / gold_words, 2 * correct / (gold_words + system_words)))
def test_exception(self):
self._test_exception(["a"], ["b"])
def test_equal(self):
self._test_ok(["a"], ["a"], 1)
self._test_ok(["a", "b", "c"], ["a", "b", "c"], 3)
def test_equal_with_multiword(self):
self._test_ok(["abc a b c"], ["a", "b", "c"], 3)
self._test_ok(["a", "bc b c", "d"], ["a", "b", "c", "d"], 4)
self._test_ok(["abcd a b c d"], ["ab a b", "cd c d"], 4)
self._test_ok(["abc a b c", "de d e"], ["a", "bcd b c d", "e"], 5)
def test_alignment(self):
self._test_ok(["abcd"], ["a", "b", "c", "d"], 0)
self._test_ok(["abc", "d"], ["a", "b", "c", "d"], 1)
self._test_ok(["a", "bc", "d"], ["a", "b", "c", "d"], 2)
self._test_ok(["a", "bc b c", "d"], ["a", "b", "cd"], 2)
self._test_ok(["abc a BX c", "def d EX f"], ["ab a b", "cd c d", "ef e f"], 4)
self._test_ok(["ab a b", "cd bc d"], ["a", "bc", "d"], 2)
self._test_ok(["a", "bc b c", "d"], ["ab AX BX", "cd CX a"], 1)


@ -116,10 +116,9 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %")
try:
train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
gold_preproc=gold_preproc, max_length=0)
train_docs = list(train_docs)
for i in range(n_iter):
train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
gold_preproc=gold_preproc, max_length=0)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
losses = {}
for batch in minibatch(train_docs, size=batch_sizes):

390
spacy/cli/ud_train.py Normal file

@ -0,0 +1,390 @@
'''Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
.conllu format for development data, allowing the official scorer to be used.
'''
from __future__ import unicode_literals
import plac
import tqdm
from pathlib import Path
import re
import sys
import json
import spacy
import spacy.util
from ..tokens import Token, Doc
from ..gold import GoldParse
from ..syntax.nonproj import projectivize
from ..matcher import Matcher
from collections import defaultdict, Counter
from timeit import default_timer as timer
import itertools
import random
import numpy.random
import cytoolz
from . import conll17_ud_eval
from .. import lang
from ..lang import zh
from ..lang import ja
lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False
random.seed(0)
numpy.random.seed(0)
def minibatch_by_words(items, size=5000):
random.shuffle(items)
if isinstance(size, int):
size_ = itertools.repeat(size)
else:
size_ = size
items = iter(items)
while True:
batch_size = next(size_)
batch = []
while batch_size >= 0:
try:
doc, gold = next(items)
except StopIteration:
if batch:
yield batch
return
batch_size -= len(doc)
batch.append((doc, gold))
if batch:
yield batch
else:
break
################
# Data reading #
################
space_re = re.compile('\s+')
def split_text(text):
return [space_re.sub(' ', par.strip()) for par in text.split('\n\n')]
def read_data(nlp, conllu_file, text_file, raw_text=True, oracle_segments=False,
max_doc_length=None, limit=None):
'''Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
include Doc objects created using nlp.make_doc and then aligned against
the gold-standard sequences. If oracle_segments=True, include Doc objects
created from the gold-standard segments. At least one must be True.'''
if not raw_text and not oracle_segments:
raise ValueError("At least one of raw_text or oracle_segments must be True")
paragraphs = split_text(text_file.read())
conllu = read_conllu(conllu_file)
# sd is spacy doc; cd is conllu doc
# cs is conllu sent, ct is conllu token
docs = []
golds = []
for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
sent_annots = []
for cs in cd:
sent = defaultdict(list)
for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
if '.' in id_:
continue
if '-' in id_:
continue
id_ = int(id_)-1
head = int(head)-1 if head != '0' else id_
sent['words'].append(word)
sent['tags'].append(tag)
sent['heads'].append(head)
sent['deps'].append('ROOT' if dep == 'root' else dep)
sent['spaces'].append(space_after == '_')
sent['entities'] = ['-'] * len(sent['words'])
sent['heads'], sent['deps'] = projectivize(sent['heads'],
sent['deps'])
if oracle_segments:
docs.append(Doc(nlp.vocab, words=sent['words'], spaces=sent['spaces']))
golds.append(GoldParse(docs[-1], **sent))
sent_annots.append(sent)
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
doc, gold = _make_gold(nlp, None, sent_annots)
sent_annots = []
docs.append(doc)
golds.append(gold)
if limit and len(docs) >= limit:
return docs, golds
if raw_text and sent_annots:
doc, gold = _make_gold(nlp, None, sent_annots)
docs.append(doc)
golds.append(gold)
if limit and len(docs) >= limit:
return docs, golds
return docs, golds
def read_conllu(file_):
docs = []
sent = []
doc = []
for line in file_:
if line.startswith('# newdoc'):
if doc:
docs.append(doc)
doc = []
elif line.startswith('#'):
continue
elif not line.strip():
if sent:
doc.append(sent)
sent = []
else:
sent.append(list(line.strip().split('\t')))
if len(sent[-1]) != 10:
print(repr(line))
raise ValueError
if sent:
doc.append(sent)
if doc:
docs.append(doc)
return docs
def _make_gold(nlp, text, sent_annots):
# Flatten the conll annotations, and adjust the head indices
flat = defaultdict(list)
for sent in sent_annots:
flat['heads'].extend(len(flat['words'])+head for head in sent['heads'])
for field in ['words', 'tags', 'deps', 'entities', 'spaces']:
flat[field].extend(sent[field])
# Construct text if necessary
assert len(flat['words']) == len(flat['spaces'])
if text is None:
text = ''.join(word+' '*space for word, space in zip(flat['words'], flat['spaces']))
doc = nlp.make_doc(text)
flat.pop('spaces')
gold = GoldParse(doc, **flat)
return doc, gold
#############################
# Data transforms for spaCy #
#############################
def golds_to_gold_tuples(docs, golds):
'''Get out the annoying 'tuples' format used by begin_training, given the
GoldParse objects.'''
tuples = []
for doc, gold in zip(docs, golds):
text = doc.text
ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
sents = [((ids, words, tags, heads, labels, iob), [])]
tuples.append((text, sents))
return tuples
##############
# Evaluation #
##############
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
with text_loc.open('r', encoding='utf8') as text_file:
texts = split_text(text_file.read())
docs = list(nlp.pipe(texts))
with sys_loc.open('w', encoding='utf8') as out_file:
write_conllu(docs, out_file)
with gold_loc.open('r', encoding='utf8') as gold_file:
gold_ud = conll17_ud_eval.load_conllu(gold_file)
with sys_loc.open('r', encoding='utf8') as sys_file:
sys_ud = conll17_ud_eval.load_conllu(sys_file)
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
return scores
def write_conllu(docs, file_):
merger = Matcher(docs[0].vocab)
merger.add('SUBTOK', None, [{'DEP': 'subtok', 'op': '+'}])
for i, doc in enumerate(docs):
matches = merger(doc)
spans = [doc[start:end+1] for _, start, end in matches]
offsets = [(span.start_char, span.end_char) for span in spans]
for start_char, end_char in offsets:
doc.merge(start_char, end_char)
file_.write("# newdoc id = {i}\n".format(i=i))
for j, sent in enumerate(doc.sents):
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
file_.write("# text = {text}\n".format(text=sent.text))
for k, token in enumerate(sent):
file_.write(token._.get_conllu_lines(k) + '\n')
file_.write('\n')
def print_progress(itn, losses, ud_scores):
fields = {
'dep_loss': losses.get('parser', 0.0),
'tag_loss': losses.get('tagger', 0.0),
'words': ud_scores['Words'].f1 * 100,
'sents': ud_scores['Sentences'].f1 * 100,
'tags': ud_scores['XPOS'].f1 * 100,
'uas': ud_scores['UAS'].f1 * 100,
'las': ud_scores['LAS'].f1 * 100,
}
header = ['Epoch', 'Loss', 'LAS', 'UAS', 'TAG', 'SENT', 'WORD']
if itn == 0:
print('\t'.join(header))
tpl = '\t'.join((
'{:d}',
'{dep_loss:.1f}',
'{las:.1f}',
'{uas:.1f}',
'{tags:.1f}',
'{sents:.1f}',
'{words:.1f}',
))
print(tpl.format(itn, **fields))
#def get_sent_conllu(sent, sent_id):
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
def get_token_conllu(token, i):
if token._.begins_fused:
n = 1
while token.nbor(n)._.inside_fused:
n += 1
id_ = '%d-%d' % (i, i+n)
lines = [id_, token.text, '_', '_', '_', '_', '_', '_', '_', '_']
else:
lines = []
if token.head.i == token.i:
head = 0
else:
head = i + (token.head.i - token.i) + 1
fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
str(head), token.dep_.lower(), '_', '_']
lines.append('\t'.join(fields))
return '\n'.join(lines)
Token.set_extension('get_conllu_lines', method=get_token_conllu)
Token.set_extension('begins_fused', default=False)
Token.set_extension('inside_fused', default=False)
##################
# Initialization #
##################
def load_nlp(corpus, config):
lang = corpus.split('_')[0]
nlp = spacy.blank(lang)
if config.vectors:
nlp.vocab.from_disk(config.vectors / 'vocab')
return nlp
def initialize_pipeline(nlp, docs, golds, config):
nlp.add_pipe(nlp.create_pipe('parser'))
if config.multitask_tag:
nlp.parser.add_multitask_objective('tag')
if config.multitask_sent:
nlp.parser.add_multitask_objective('sent_start')
nlp.parser.moves.add_action(2, 'subtok')
nlp.add_pipe(nlp.create_pipe('tagger'))
for gold in golds:
for tag in gold.tags:
if tag is not None:
nlp.tagger.add_label(tag)
# Replace labels that didn't make the frequency cutoff
actions = set(nlp.parser.labels)
label_set = set([act.split('-')[1] for act in actions if '-' in act])
for gold in golds:
for i, label in enumerate(gold.labels):
if label is not None and label not in label_set:
gold.labels[i] = label.split('||')[0]
return nlp.begin_training(lambda: golds_to_gold_tuples(docs, golds))
########################
# Command line helpers #
########################
class Config(object):
def __init__(self, vectors=None, max_doc_length=10, multitask_tag=True,
multitask_sent=True, nr_epoch=30, batch_size=1000, dropout=0.2):
for key, value in locals().items():
setattr(self, key, value)
@classmethod
def load(cls, loc):
with Path(loc).open('r', encoding='utf8') as file_:
cfg = json.load(file_)
return cls(**cfg)
class Dataset(object):
def __init__(self, path, section):
self.path = path
self.section = section
self.conllu = None
self.text = None
for file_path in self.path.iterdir():
name = file_path.parts[-1]
if section in name and name.endswith('conllu'):
self.conllu = file_path
elif section in name and name.endswith('txt'):
self.text = file_path
if self.conllu is None:
msg = "Could not find .txt file in {path} for {section}"
raise IOError(msg.format(section=section, path=path))
if self.text is None:
msg = "Could not find .txt file in {path} for {section}"
self.lang = self.conllu.parts[-1].split('-')[0].split('_')[0]
class TreebankPaths(object):
def __init__(self, ud_path, treebank, **cfg):
self.train = Dataset(ud_path / treebank, 'train')
self.dev = Dataset(ud_path / treebank, 'dev')
self.lang = self.train.lang
@plac.annotations(
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
corpus=("UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
"positional", None, str),
parses_dir=("Directory to write the development parses", "positional", None, Path),
config=("Path to json formatted config file", "positional"),
limit=("Size limit", "option", "n", int)
)
def main(ud_dir, parses_dir, config, corpus, limit=0):
config = Config.load(config)
paths = TreebankPaths(ud_dir, corpus)
if not (parses_dir / corpus).exists():
(parses_dir / corpus).mkdir()
print("Train and evaluate", corpus, "using lang", paths.lang)
nlp = load_nlp(paths.lang, config)
docs, golds = read_data(nlp, paths.train.conllu.open(), paths.train.text.open(),
max_doc_length=config.max_doc_length, limit=limit)
optimizer = initialize_pipeline(nlp, docs, golds, config)
for i in range(config.nr_epoch):
docs = [nlp.make_doc(doc.text) for doc in docs]
batches = minibatch_by_words(list(zip(docs, golds)), size=config.batch_size)
losses = {}
n_train_words = sum(len(doc) for doc in docs)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
for batch in batches:
batch_docs, batch_gold = zip(*batch)
pbar.update(sum(len(doc) for doc in batch_docs))
nlp.update(batch_docs, batch_gold, sgd=optimizer,
drop=config.dropout, losses=losses)
out_path = parses_dir / corpus / 'epoch-{i}.conllu'.format(i=i)
with nlp.use_params(optimizer.averages):
scores = evaluate(nlp, paths.dev.text, paths.dev.conllu, out_path)
print_progress(i, losses, scores)
if __name__ == '__main__':
plac.call(main)
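# Example invocation (an illustrative sketch, not part of the original file;
# the paths and the treebank name "en_ewt" are placeholders, and it assumes
# spaCy's __main__ forwards the remaining arguments to this plac-annotated
# main(), so the positionals follow the signature: ud_dir, parses_dir,
# config, corpus):
#
#   python -m spacy ud-train /data/ud-treebanks-v2.0 /tmp/parses config.json en_ewt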


@ -13,7 +13,7 @@ from . import _align
from .syntax import nonproj
from .tokens import Doc
from . import util
from .util import minibatch
from .util import minibatch, itershuffle
def tags_to_entities(tags):
@ -133,15 +133,14 @@ class GoldCorpus(object):
def train_docs(self, nlp, gold_preproc=False,
projectivize=False, max_length=None,
noise_level=0.0):
train_tuples = self.train_tuples
if projectivize:
train_tuples = nonproj.preprocess_training_data(
self.train_tuples, label_freq_cutoff=100)
random.shuffle(train_tuples)
self.train_tuples, label_freq_cutoff=30)
random.shuffle(self.train_locs)
gold_docs = self.iter_gold_docs(nlp, train_tuples, gold_preproc,
max_length=max_length,
noise_level=noise_level)
yield from gold_docs
yield from itershuffle(gold_docs, bufsize=100)
def dev_docs(self, nlp, gold_preproc=False):
gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc)


@ -21,7 +21,7 @@ class SpanishDefaults(Language.Defaults):
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS
sytax_iterators = SYNTAX_ITERATORS
syntax_iterators = SYNTAX_ITERATORS
lemma_lookup = LOOKUP


@ -6,17 +6,19 @@ from ...symbols import NOUN, PROPN, PRON, VERB, AUX
def noun_chunks(obj):
doc = obj.doc
np_label = doc.vocab.strings['NP']
if not len(doc):
return
np_label = doc.vocab.strings.add('NP')
left_labels = ['det', 'fixed', 'neg'] #['nunmod', 'det', 'appos', 'fixed']
right_labels = ['flat', 'fixed', 'compound', 'neg']
stop_labels = ['punct']
np_left_deps = [doc.vocab.strings[label] for label in left_labels]
np_right_deps = [doc.vocab.strings[label] for label in right_labels]
stop_deps = [doc.vocab.strings[label] for label in stop_labels]
np_left_deps = [doc.vocab.strings.add(label) for label in left_labels]
np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
token = doc[0]
while token and token.i < len(doc):
if token.pos in [PROPN, NOUN, PRON]:
left, right = noun_bounds(token)
left, right = noun_bounds(doc, token, np_left_deps, np_right_deps, stop_deps)
yield left.i, right.i+1, np_label
token = right
token = next_token(token)
@ -33,7 +35,7 @@ def next_token(token):
return None
def noun_bounds(root):
def noun_bounds(doc, root, np_left_deps, np_right_deps, stop_deps):
left_bound = root
for token in reversed(list(root.lefts)):
if token.dep in np_left_deps:
@ -41,7 +43,7 @@ def noun_bounds(root):
right_bound = root
for token in root.rights:
if (token.dep in np_right_deps):
left, right = noun_bounds(token)
left, right = noun_bounds(doc, token, np_left_deps, np_right_deps, stop_deps)
if list(filter(lambda t: is_verb_token(t) or t.dep in stop_deps,
doc[left_bound.i: right.i])):
break


@ -6,10 +6,25 @@ from __future__ import unicode_literals
STOP_WORDS = set("""
a
ah
aha
aj
ako
al
ali
arh
au
avaj
bar
baš
bez
bi
bih
bijah
bijahu
bijaše
bijasmo
bijaste
bila
bili
bilo
@ -17,25 +32,104 @@ bio
bismo
biste
biti
brr
buć
budavši
bude
budimo
budite
budu
budući
bum
bumo
će
ćemo
ćeš
ćete
čijem
čijim
čijima
ću
da
daj
dakle
de
deder
dem
djelomice
djelomično
do
doista
dok
dokle
donekle
dosad
doskoro
dotad
dotle
dovečer
drugamo
drugdje
duž
e
eh
ehe
ej
eno
eto
evo
ga
gdjekakav
gdjekoje
gic
god
halo
hej
hm
hoće
hoćemo
hoćete
hoćeš
hoćete
hoću
hop
htijahu
htijasmo
htijaste
htio
htjedoh
htjedoše
htjedoste
htjela
htjele
htjeli
hura
i
iako
ih
iju
ijuju
ikada
ikakav
ikakva
ikakve
ikakvi
ikakvih
ikakvim
ikakvima
ikakvo
ikakvog
ikakvoga
ikakvoj
ikakvom
ikakvome
ili
im
iz
ja
je
jedna
jedne
jedni
jedno
jer
jesam
@ -57,6 +151,7 @@ koji
kojima
koju
kroz
lani
li
me
mene
@ -66,6 +161,8 @@ mimo
moj
moja
moje
moji
moju
mu
na
nad
@ -77,24 +174,27 @@ naš
naša
naše
našeg
naši
ne
neće
nećemo
nećeš
nećete
neću
nego
neka
neke
neki
nekog
neku
nema
netko
neće
nećemo
nećete
nećeš
neću
nešto
netko
ni
nije
nikoga
nikoje
nikoji
nikoju
nisam
nisi
@ -123,33 +223,63 @@ od
odmah
on
ona
one
oni
ono
onu
onoj
onom
onim
onima
ova
ovaj
ovim
ovima
ovoj
pa
pak
pljus
po
pod
podalje
poimence
poizdalje
ponekad
pored
postrance
potajice
potrbuške
pouzdano
prije
s
sa
sam
samo
sasvim
sav
se
sebe
sebi
si
šic
smo
ste
što
šta
štogod
štagod
su
sva
sve
svi
svi
svog
svoj
svoja
svoje
svoju
svom
svu
ta
tada
taj
@ -158,6 +288,8 @@ te
tebe
tebi
ti
tim
tima
to
toj
tome
@ -165,23 +297,51 @@ tu
tvoj
tvoja
tvoje
tvoji
tvoju
u
usprkos
utaman
uvijek
uz
uza
uzagrapce
uzalud
uzduž
valjda
vam
vama
vas
vaš
vaša
vaše
vašim
vašima
već
vi
vjerojatno
vjerovatno
vrh
vrlo
za
zaista
zar
će
ćemo
ćete
ćeš
ću
što
zatim
zato
zbija
zbog
želeći
željah
željela
željele
željeli
željelo
željen
željena
željene
željeni
željenu
željeo
zimus
zum
""".split())


@ -35,14 +35,32 @@ class JapaneseTokenizer(object):
def from_disk(self, path, **exclude):
return self
class JapaneseCharacterSegmenter(object):
def __init__(self, vocab):
self.vocab = vocab
def __call__(self, text):
words = []
spaces = []
doc = self.tokenizer(text)
for token in self.tokenizer(text):
words.extend(list(token.text))
spaces.extend([False]*len(token.text))
spaces[-1] = bool(token.whitespace_)
return Doc(self.vocab, words=words, spaces=spaces)
class JapaneseDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'ja'
use_janome = True
@classmethod
def create_tokenizer(cls, nlp=None):
return JapaneseTokenizer(cls, nlp)
if cls.use_janome:
return JapaneseTokenizer(cls, nlp)
else:
return JapaneseCharacterSegmenter(cls, nlp.vocab)
class Japanese(Language):

22
spacy/lang/tr/examples.py Normal file

@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.tr.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Neredesin?",
"Neredesiniz?",
"Bu bir cümledir.",
"Sürücüsüz araçlar sigorta yükümlülüğünü üreticilere kaydırıyor.",
"San Francisco kaldırımda kurye robotları yasaklayabilir."
"Londra İngiltere'nin başkentidir.",
"Türkiye'nin başkenti neresi?",
"Bakanlar Kurulu 180 günlük eylem planınııkladı.",
"Merkez Bankası, beklentiler doğrultusunda faizlerde değişikliğe gitmedi."
]
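# A runnable variant of the docstring example above (an illustrative sketch;
# it assumes a blank Turkish pipeline rather than a trained model):
#
#   import spacy
#   from spacy.lang.tr.examples import sentences
#   nlp = spacy.blank('tr')
#   for doc in nlp.pipe(sentences):
#       print([token.text for token in doc])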


@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
# Thirteen, fifteen, etc. are written separately: on üç
_num_words = ['bir', 'iki', 'üç', 'dört', 'beş', 'altı', 'yedi', 'sekiz',
'dokuz', 'on', 'yirmi', 'otuz', 'kırk', 'elli', 'altmış',
'yetmiş', 'seksen', 'doksan', 'yüz', 'bin', 'milyon',
'milyar', 'katrilyon', 'kentilyon']
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}


@ -10,16 +10,12 @@ acep
adamakıllı
adeta
ait
altmýþ
altmış
altý
altı
ama
amma
anca
ancak
arada
artýk
artık
aslında
aynen
ayrıca
@ -29,46 +25,82 @@ açıkçası
bana
bari
bazen
bazý
bazı
bazısı
bazısına
bazısında
bazısından
bazısını
bazısının
başkası
baţka
başkasına
başkasında
başkasından
başkasını
başkasının
başka
belki
ben
bende
benden
beni
benim
beri
beriki
beþ
beş
beţ
berikinin
berikiyi
berisi
bilcümle
bile
bin
binaen
binaenaleyh
bir
biraz
birazdan
birbiri
birbirine
birbirini
birbirinin
birbirinde
birbirinden
birden
birdenbire
biri
birine
birini
birinin
birinde
birinden
birice
birileri
birilerinde
birilerinden
birilerine
birilerini
birilerinin
birisi
birisine
birisini
birisinin
birisinde
birisinden
birkaç
birkaçı
birkaçına
birkaçını
birkaçının
birkaçında
birkaçından
birkez
birlikte
birçok
birçoğu
birþey
birþeyi
birçoğuna
birçoğunda
birçoğundan
birçoğunu
birçoğunun
birşey
birşeyi
birţey
bitevi
biteviye
bittabi
@ -96,6 +128,11 @@ buracıkta
burada
buradan
burası
burasına
burasını
burasının
burasında
burasından
böyle
böylece
böylecene
@ -106,8 +143,34 @@ büsbütün
bütün
cuk
cümlesi
cümlesine
cümlesini
cümlesinin
cümlesinden
cümlemize
cümlemizi
cümlemizden
çabuk
çabukça
çeşitli
çok
çokları
çoklarınca
çokluk
çoklukla
çokça
çoğu
çoğun
çoğunca
çoğunda
çoğundan
çoğunlukla
çoğunu
çoğunun
çünkü
da
daha
dahası
dahi
dahil
dahilen
@ -124,19 +187,17 @@ denli
derakap
derhal
derken
deđil
değil
değin
diye
diđer
diğer
diğeri
doksan
dokuz
diğerine
diğerini
diğerinden
dolayı
dolayısıyla
doğru
dört
edecek
eden
ederek
@ -146,7 +207,6 @@ edilmesi
ediyor
elbet
elbette
elli
emme
en
enikonu
@ -168,10 +228,10 @@ evvelce
evvelden
evvelemirde
evveli
eđer
eğer
fakat
filanca
filancanın
gah
gayet
gayetle
@ -197,6 +257,10 @@ haliyle
handiyse
hangi
hangisi
hangisine
hangisine
hangisinde
hangisinden
hani
hariç
hasebiyle
@ -207,17 +271,27 @@ hem
henüz
hep
hepsi
hepsini
hepsinin
hepsinde
hepsinden
her
herhangi
herkes
herkesi
herkesin
herkesten
hiç
hiçbir
hiçbiri
hiçbirine
hiçbirini
hiçbirinin
hiçbirinde
hiçbirinden
hoş
hulasaten
iken
iki
ila
ile
ilen
@ -240,43 +314,55 @@ iyicene
için
işte
iţte
kadar
kaffesi
kah
kala
kanýmca
kanımca
karşın
katrilyon
kaynak
kaçı
kaçına
kaçında
kaçından
kaçını
kaçının
kelli
kendi
kendilerinde
kendilerinden
kendilerine
kendilerini
kendilerinin
kendini
kendisi
kendisinde
kendisinden
kendisine
kendisini
kendisinin
kere
kez
keza
kezalik
keşke
keţke
ki
kim
kimden
kime
kimi
kiminin
kimisi
kimisinde
kimisinden
kimisine
kimisinin
kimse
kimsecik
kimsecikler
külliyen
kýrk
kýsaca
kırk
kısaca
kısacası
lakin
leh
lütfen
@ -289,13 +375,10 @@ međer
meğer
meğerki
meğerse
milyar
milyon
mu
mı
nasýl
mi
nasıl
nasılsa
nazaran
@ -304,6 +387,8 @@ ne
neden
nedeniyle
nedenle
nedenler
nedenlerden
nedense
nerde
nerden
@ -332,32 +417,27 @@ olduklarını
oldukça
olduğu
olduğunu
olmadı
olmadığı
olmak
olması
olmayan
olmaz
olsa
olsun
olup
olur
olursa
oluyor
on
ona
onca
onculayın
onda
ondan
onlar
onlara
onlardan
onlari
onlarýn
onları
onların
onu
onun
ora
oracık
oracıkta
orada
@ -365,9 +445,26 @@ oradan
oranca
oranla
oraya
otuz
oysa
oysaki
öbür
öbürkü
öbürü
öbüründe
öbüründen
öbürüne
öbürünü
önce
önceden
önceleri
öncelikle
öteki
ötekisi
öyle
öylece
öylelikle
öylemesine
öz
pek
pekala
peki
@ -379,8 +476,6 @@ sahi
sahiden
sana
sanki
sekiz
seksen
sen
senden
seni
@ -393,6 +488,27 @@ sonra
sonradan
sonraları
sonunda
şayet
şey
şeyden
şeyi
şeyler
şu
şuna
şuncacık
şunda
şundan
şunlar
şunları
şunların
şunu
şunun
şura
şuracık
şuracıkta
şurası
şöyle
şimdi
tabii
tam
tamam
@ -400,8 +516,8 @@ tamamen
tamamıyla
tarafından
tek
trilyon
tüm
üzere
var
vardı
vasıtasıyla
@ -429,84 +545,16 @@ yaptığını
yapılan
yapılması
yapıyor
yedi
yeniden
yenilerde
yerine
yetmiþ
yetmiş
yetmiţ
yine
yirmi
yok
yoksa
yoluyla
yüz
yüzünden
zarfında
zaten
zati
zira
çabuk
çabukça
çeşitli
çok
çokları
çoklarınca
çokluk
çoklukla
çokça
çoğu
çoğun
çoğunca
çoğunlukla
çünkü
öbür
öbürkü
öbürü
önce
önceden
önceleri
öncelikle
öteki
ötekisi
öyle
öylece
öylelikle
öylemesine
öz
üzere
üç
þey
þeyden
þeyi
þeyler
þu
þuna
þunda
þundan
þunu
şayet
şey
şeyden
şeyi
şeyler
şu
şuna
şuncacık
şunda
şundan
şunlar
şunları
şunu
şunun
şura
şuracık
şuracıkta
şurası
şöyle
ţayet
ţimdi
ţu
ţöyle
""".split())

View File

@ -3,11 +3,6 @@ from __future__ import unicode_literals
from ...symbols import ORTH, NORM
# These exceptions are mostly for example purposes hoping that Turkish
# speakers can contribute in the future! Source of copy-pasted examples:
# https://en.wiktionary.org/wiki/Category:Turkish_language
_exc = {
"sağol": [
{ORTH: "sağ"},
@ -16,11 +11,112 @@ _exc = {
for exc_data in [
{ORTH: "A.B.D.", NORM: "Amerika Birleşik Devletleri"}]:
{ORTH: "A.B.D.", NORM: "Amerika Birleşik Devletleri"},
{ORTH: "Alb.", NORM: "Albay"},
{ORTH: "Ar.Gör.", NORM: "Araştırma Görevlisi"},
{ORTH: "Arş.Gör.", NORM: "Araştırma Görevlisi"},
{ORTH: "Asb.", NORM: "Astsubay"},
{ORTH: "Astsb.", NORM: "Astsubay"},
{ORTH: "As.İz.", NORM: "Askeri İnzibat"},
{ORTH: "Atğm", NORM: "Asteğmen"},
{ORTH: "Av.", NORM: "Avukat"},
{ORTH: "Apt.", NORM: "Apartmanı"},
{ORTH: "Bçvş.", NORM: "Başçavuş"},
{ORTH: "bk.", NORM: "bakınız"},
{ORTH: "bknz.", NORM: "bakınız"},
{ORTH: "Bnb.", NORM: "Binbaşı"},
{ORTH: "bnb.", NORM: "binbaşı"},
{ORTH: "Böl.", NORM: "Bölümü"},
{ORTH: "Bşk.", NORM: "Başkanlığı"},
{ORTH: "Bştbp.", NORM: "Baştabip"},
{ORTH: "Bul.", NORM: "Bulvarı"},
{ORTH: "Cad.", NORM: "Caddesi"},
{ORTH: "çev.", NORM: "çeviren"},
{ORTH: "Çvş.", NORM: "Çavuş"},
{ORTH: "dak.", NORM: "dakika"},
{ORTH: "dk.", NORM: "dakika"},
{ORTH: "Doç.", NORM: "Doçent"},
{ORTH: "doğ.", NORM: "doğum tarihi"},
{ORTH: "drl.", NORM: "derleyen"},
{ORTH: "Dz.", NORM: "Deniz"},
{ORTH: "Dz.K.K.lığı", NORM: "Deniz Kuvvetleri Komutanlığı"},
{ORTH: "Dz.Kuv.", NORM: "Deniz Kuvvetleri"},
{ORTH: "Dz.Kuv.K.", NORM: "Deniz Kuvvetleri Komutanlığı"},
{ORTH: "dzl.", NORM: "düzenleyen"},
{ORTH: "Ecz.", NORM: "Eczanesi"},
{ORTH: "ekon.", NORM: "ekonomi"},
{ORTH: "Fak.", NORM: "Fakültesi"},
{ORTH: "Gn.", NORM: "Genel"},
{ORTH: "Gnkur.", NORM: "Genelkurmay"},
{ORTH: "Gn.Kur.", NORM: "Genelkurmay"},
{ORTH: "gr.", NORM: "gram"},
{ORTH: "Hst.", NORM: "Hastanesi"},
{ORTH: "Hs.Uzm.", NORM: "Hesap Uzmanı"},
{ORTH: "huk.", NORM: "hukuk"},
{ORTH: "Hv.", NORM: "Hava"},
{ORTH: "Hv.K.K.lığı", NORM: "Hava Kuvvetleri Komutanlığı"},
{ORTH: "Hv.Kuv.", NORM: "Hava Kuvvetleri"},
{ORTH: "Hv.Kuv.K.", NORM: "Hava Kuvvetleri Komutanlığı"},
{ORTH: "Hz.", NORM: "Hazreti"},
{ORTH: "Hz.Öz.", NORM: "Hizmete Özel"},
{ORTH: "İng.", NORM: "İngilizce"},
{ORTH: "Jeol.", NORM: "Jeoloji"},
{ORTH: "jeol.", NORM: "jeoloji"},
{ORTH: "Korg.", NORM: "Korgeneral"},
{ORTH: "Kur.", NORM: "Kurmay"},
{ORTH: "Kur.Bşk.", NORM: "Kurmay Başkanı"},
{ORTH: "Kuv.", NORM: "Kuvvetleri"},
{ORTH: "Ltd.", NORM: "Limited"},
{ORTH: "Mah.", NORM: "Mahallesi"},
{ORTH: "mah.", NORM: "mahallesi"},
{ORTH: "max.", NORM: "maksimum"},
{ORTH: "min.", NORM: "minimum"},
{ORTH: "Müh.", NORM: "Mühendisliği"},
{ORTH: "müh.", NORM: "mühendisliği"},
{ORTH: "MÖ.", NORM: "Milattan Önce"},
{ORTH: "Onb.", NORM: "Onbaşı"},
{ORTH: "Ord.", NORM: "Ordinaryüs"},
{ORTH: "Org.", NORM: "Orgeneral"},
{ORTH: "Ped.", NORM: "Pedagoji"},
{ORTH: "Prof.", NORM: "Profesör"},
{ORTH: "Sb.", NORM: "Subay"},
{ORTH: "Sn.", NORM: "Sayın"},
{ORTH: "sn.", NORM: "saniye"},
{ORTH: "Sok.", NORM: "Sokak"},
{ORTH: "Şb.", NORM: "Şube"},
{ORTH: "Şti.", NORM: "Şirketi"},
{ORTH: "Tbp.", NORM: "Tabip"},
{ORTH: "T.C.", NORM: "Türkiye Cumhuriyeti"},
{ORTH: "Tel.", NORM: "Telefon"},
{ORTH: "tel.", NORM: "telefon"},
{ORTH: "telg.", NORM: "telgraf"},
{ORTH: "Tğm.", NORM: "Teğmen"},
{ORTH: "tğm.", NORM: "teğmen"},
{ORTH: "tic.", NORM: "ticaret"},
{ORTH: "Tug.", NORM: "Tugay"},
{ORTH: "Tuğg.", NORM: "Tuğgeneral"},
{ORTH: "Tümg.", NORM: "Tümgeneral"},
{ORTH: "Uzm.", NORM: "Uzman"},
{ORTH: "Üçvş.", NORM: "Üstçavuş"},
{ORTH: "Üni.", NORM: "Üniversitesi"},
{ORTH: "Ütğm.", NORM: "Üsteğmen"},
{ORTH: "vb.", NORM: "ve benzeri"},
{ORTH: "vs.", NORM: "vesaire"},
{ORTH: "Yard.", NORM: "Yardımcı"},
{ORTH: "Yar.", NORM: "Yardımcı"},
{ORTH: "Yd.Sb.", NORM: "Yedek Subay"},
{ORTH: "Yard.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Yar.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Yb.", NORM: "Yarbay"},
{ORTH: "Yrd.", NORM: "Yardımcı"},
{ORTH: "Yrd.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Y.Müh.", NORM: "Yüksek mühendis"},
{ORTH: "Y.Mim.", NORM: "Yüksek mimar"}]:
_exc[exc_data[ORTH]] = [exc_data]
for orth in ["Dr."]:
for orth in [
"Dr.", "yy."]:
_exc[orth] = [{ORTH: orth}]
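
With these special cases registered, abbreviations such as 'Dr.', 'vb.' or 'Gnkur.' should come through as single tokens instead of being split at the period. A rough check, assuming spaCy is installed with this commit's Turkish data (the sample sentence is made up):

import spacy

nlp = spacy.blank('tr')
doc = nlp("Dr. Ayşe raporu vb. belgelerle birlikte sundu.")
print([t.text for t in doc])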

View File

@ -319,7 +319,7 @@ cdef class ArcEager(TransitionSystem):
(SHIFT, ['']),
(REDUCE, ['']),
(RIGHT, []),
(LEFT, []),
(LEFT, ['subtok']),
(BREAK, ['ROOT']))
))
seen_actions = set()

View File

@ -477,14 +477,15 @@ cdef class Parser:
free(vectors)
free(scores)
def beam_parse(self, docs, int beam_width=3, float beam_density=0.001):
def beam_parse(self, docs, int beam_width=3, float beam_density=0.001,
float drop=0.):
cdef Beam beam
cdef np.ndarray scores
cdef Doc doc
cdef int nr_class = self.moves.n_moves
cuda_stream = util.get_cuda_stream()
(tokvecs, bp_tokvecs), state2vec, vec2scores = self.get_batch_model(
docs, cuda_stream, 0.0)
docs, cuda_stream, drop)
cdef int offset = 0
cdef int j = 0
cdef int k
@ -523,8 +524,8 @@ cdef class Parser:
n_states += 1
if n_states == 0:
break
vectors = state2vec(token_ids[:n_states])
scores = vec2scores(vectors)
vectors, _ = state2vec.begin_update(token_ids[:n_states], drop)
scores, _ = vec2scores.begin_update(vectors, drop=drop)
c_scores = <float*>scores.data
for beam in todo:
for i in range(beam.size):
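
The new drop argument lets callers apply the same dropout rate during beam decoding as in the greedy path; it defaults to 0.0, so existing callers are unaffected. A hypothetical call pattern, assuming the small English model is installed:

import spacy

nlp = spacy.load('en_core_web_sm')
parser = nlp.get_pipe('parser')
docs = [nlp.make_doc(u'This is a sentence.')]
beams = parser.beam_parse(docs, beam_width=4, beam_density=0.001, drop=0.0)
print(len(beams))    # one beam of candidate parses per input doc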

View File

@ -191,9 +191,12 @@ def _filter_labels(gold_tuples, cutoff, freqs):
for raw_text, sents in gold_tuples:
filtered_sents = []
for (ids, words, tags, heads, labels, iob), ctnts in sents:
filtered_labels = [decompose(label)[0]
if freqs.get(label, cutoff) < cutoff
else label for label in labels]
filtered_labels = []
for label in labels:
if is_decorated(label) and freqs.get(label, 0) < cutoff:
filtered_labels.append(decompose(label)[0])
else:
filtered_labels.append(label)
filtered_sents.append(
((ids, words, tags, heads, filtered_labels, iob), ctnts))
filtered.append((raw_text, filtered_sents))
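
The old one-liner collapsed any label whose frequency fell below the cutoff; the new loop only collapses pseudo-projective ("decorated") labels, so rare plain labels survive. A pure-Python sketch of the rule, using '||' as the decoration delimiter (as in spacy.syntax.nonproj):

def filter_rare_decorated(labels, freqs, cutoff):
    # Collapse 'head||child'-style labels back to the base label when they
    # are too rare to keep; leave everything else untouched.
    out = []
    for label in labels:
        if '||' in label and freqs.get(label, 0) < cutoff:
            out.append(label.split('||')[0])
        else:
            out.append(label)
    return out

print(filter_rare_decorated(['amod', 'advmod||nsubj'], {'advmod||nsubj': 2}, cutoff=30))
# ['amod', 'advmod']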

View File

@ -0,0 +1,74 @@
from ...vocab import Vocab
from ...pipeline import DependencyParser
from ...tokens import Doc
from ...gold import GoldParse
from ...syntax.nonproj import projectivize
annot_tuples = [
(0, 'When', 'WRB', 11, 'advmod', 'O'),
(1, 'Walter', 'NNP', 2, 'compound', 'B-PERSON'),
(2, 'Rodgers', 'NNP', 11, 'nsubj', 'L-PERSON'),
(3, ',', ',', 2, 'punct', 'O'),
(4, 'our', 'PRP$', 6, 'poss', 'O'),
(5, 'embedded', 'VBN', 6, 'amod', 'O'),
(6, 'reporter', 'NN', 2, 'appos', 'O'),
(7, 'with', 'IN', 6, 'prep', 'O'),
(8, 'the', 'DT', 10, 'det', 'B-ORG'),
(9, '3rd', 'NNP', 10, 'compound', 'I-ORG'),
(10, 'Cavalry', 'NNP', 7, 'pobj', 'L-ORG'),
(11, 'says', 'VBZ', 44, 'advcl', 'O'),
(12, 'three', 'CD', 13, 'nummod', 'U-CARDINAL'),
(13, 'battalions', 'NNS', 16, 'nsubj', 'O'),
(14, 'of', 'IN', 13, 'prep', 'O'),
(15, 'troops', 'NNS', 14, 'pobj', 'O'),
(16, 'are', 'VBP', 11, 'ccomp', 'O'),
(17, 'on', 'IN', 16, 'prep', 'O'),
(18, 'the', 'DT', 19, 'det', 'O'),
(19, 'ground', 'NN', 17, 'pobj', 'O'),
(20, ',', ',', 17, 'punct', 'O'),
(21, 'inside', 'IN', 17, 'prep', 'O'),
(22, 'Baghdad', 'NNP', 21, 'pobj', 'U-GPE'),
(23, 'itself', 'PRP', 22, 'appos', 'O'),
(24, ',', ',', 16, 'punct', 'O'),
(25, 'have', 'VBP', 26, 'aux', 'O'),
(26, 'taken', 'VBN', 16, 'dep', 'O'),
(27, 'up', 'RP', 26, 'prt', 'O'),
(28, 'positions', 'NNS', 26, 'dobj', 'O'),
(29, 'they', 'PRP', 31, 'nsubj', 'O'),
(30, "'re", 'VBP', 31, 'aux', 'O'),
(31, 'going', 'VBG', 26, 'parataxis', 'O'),
(32, 'to', 'TO', 33, 'aux', 'O'),
(33, 'spend', 'VB', 31, 'xcomp', 'O'),
(34, 'the', 'DT', 35, 'det', 'B-TIME'),
(35, 'night', 'NN', 33, 'dobj', 'L-TIME'),
(36, 'there', 'RB', 33, 'advmod', 'O'),
(37, 'presumably', 'RB', 33, 'advmod', 'O'),
(38, ',', ',', 44, 'punct', 'O'),
(39, 'how', 'WRB', 40, 'advmod', 'O'),
(40, 'many', 'JJ', 41, 'amod', 'O'),
(41, 'soldiers', 'NNS', 44, 'pobj', 'O'),
(42, 'are', 'VBP', 44, 'aux', 'O'),
(43, 'we', 'PRP', 44, 'nsubj', 'O'),
(44, 'talking', 'VBG', 44, 'ROOT', 'O'),
(45, 'about', 'IN', 44, 'prep', 'O'),
(46, 'right', 'RB', 47, 'advmod', 'O'),
(47, 'now', 'RB', 44, 'advmod', 'O'),
(48, '?', '.', 44, 'punct', 'O')]
def test_get_oracle_actions():
doc = Doc(Vocab(), words=[t[1] for t in annot_tuples])
parser = DependencyParser(doc.vocab)
parser.moves.add_action(0, '')
parser.moves.add_action(1, '')
parser.moves.add_action(1, '')
parser.moves.add_action(4, 'ROOT')
for i, (id_, word, tag, head, dep, ent) in enumerate(annot_tuples):
if head > i:
parser.moves.add_action(2, dep)
elif head < i:
parser.moves.add_action(3, dep)
ids, words, tags, heads, deps, ents = zip(*annot_tuples)
heads, deps = projectivize(heads, deps)
gold = GoldParse(doc, words=words, tags=tags, heads=heads, deps=deps)
parser.moves.preprocess_gold(gold)
actions = parser.moves.get_oracle_sequence(doc, gold)

View File

@ -294,6 +294,7 @@ cdef class Span:
cdef int i
if self.doc.is_parsed:
root = &self.doc.c[self.start]
n = 0
while root.head != 0:
root += root.head
n += 1
@ -307,8 +308,10 @@ cdef class Span:
start += -1
# find end of the sentence
end = self.end
while self.doc.c[end].sent_start != 1:
n = 0
while end < self.doc.length and self.doc.c[end].sent_start != 1:
end += 1
n += 1
if n >= self.doc.length:
break
#
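
The added counters and bounds checks keep the sentence scan from running past the end of the document when a span sits at the very end. A quick way to exercise that path, assuming the small English model is installed:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'First sentence here. The span below covers the final words.')
span = doc[-3:]           # a span ending at the last token
print(span.sent.text)     # the sentence containing the span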

View File

@ -279,8 +279,8 @@ cdef class Token:
"""
def __get__(self):
if self.c.lemma == 0:
lemma = self.vocab.morphology.lemmatizer.lookup(self.orth_)
return lemma
lemma_ = self.vocab.morphology.lemmatizer.lookup(self.orth_)
return self.vocab.strings[lemma_]
else:
return self.c.lemma
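
Previously the lookup lemma (a string) was returned straight from the integer lemma attribute; converting it through the StringStore keeps token.lemma and token.lemma_ consistent. A small check with a tokenizer-only pipeline, where the lookup path is always taken:

import spacy

nlp = spacy.blank('en')
token = nlp(u'dogs')[0]
assert token.lemma == nlp.vocab.strings[token.lemma_]
print(token.lemma_, token.lemma)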

View File

@ -451,7 +451,7 @@ def itershuffle(iterable, bufsize=1000):
try:
while True:
for i in range(random.randint(1, bufsize-len(buf))):
buf.append(iterable.next())
buf.append(next(iterable))
random.shuffle(buf)
for i in range(random.randint(1, bufsize)):
if buf:

View File

@ -15,11 +15,8 @@ from .compat import basestring_, path2str
from . import util
def unpickle_vectors(keys_and_rows, data):
vectors = Vectors(data=data)
for key, row in keys_and_rows:
vectors.add(key, row=row)
return vectors
def unpickle_vectors(bytes_data):
return Vectors().from_bytes(bytes_data)
cdef class Vectors:
@ -86,8 +83,7 @@ cdef class Vectors:
return len(self.key2row)
def __reduce__(self):
keys_and_rows = tuple(self.key2row.items())
return (unpickle_vectors, (keys_and_rows, self.data))
return (unpickle_vectors, (self.to_bytes(),))
def __getitem__(self, key):
"""Get a vector by key. If the key is not found, a KeyError is raised.

Binary file not shown (image; 378 KiB before this change).

View File

@ -76,13 +76,15 @@
},
"MODEL_LICENSES": {
"CC BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
"CC BY-SA 3.0": "https://creativecommons.org/licenses/by-sa/3.0/",
"CC BY-SA 4.0": "https://creativecommons.org/licenses/by-sa/4.0/",
"CC BY-NC": "https://creativecommons.org/licenses/by-nc/3.0/",
"CC BY-NC 3.0": "https://creativecommons.org/licenses/by-nc/3.0/",
"GPL": "https://www.gnu.org/licenses/gpl.html",
"LGPL": "https://www.gnu.org/licenses/lgpl.html"
"CC BY 4.0": "https://creativecommons.org/licenses/by/4.0/",
"CC BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
"CC BY-SA 3.0": "https://creativecommons.org/licenses/by-sa/3.0/",
"CC BY-SA 4.0": "https://creativecommons.org/licenses/by-sa/4.0/",
"CC BY-NC": "https://creativecommons.org/licenses/by-nc/3.0/",
"CC BY-NC 3.0": "https://creativecommons.org/licenses/by-nc/3.0/",
"CC-BY-NC-SA 3.0": "https://creativecommons.org/licenses/by-nc-sa/3.0/",
"GPL": "https://www.gnu.org/licenses/gpl.html",
"LGPL": "https://www.gnu.org/licenses/lgpl.html"
},
"MODEL_BENCHMARKS": {

View File

@ -68,7 +68,7 @@ p
+item #[strong spaCy is not research software].
| It's built on the latest research, but it's designed to get
| things done. This leads to fairly different design decisions than
| #[+a("https://github./nltk/nltk") NLTK]
| #[+a("https://github.com/nltk/nltk") NLTK]
| or #[+a("https://stanfordnlp.github.io/CoreNLP/") CoreNLP], which were
| created as platforms for teaching and research. The main difference
| is that spaCy is integrated and opinionated. spaCy tries to avoid asking