Merge master

Matthew Honnibal 2018-03-14 19:03:24 +01:00
commit ab3d860686
32 changed files with 1910 additions and 199 deletions

11 .buildkite/train.yml Normal file

@ -0,0 +1,11 @@
steps:
-
command: "fab env clean make test wheel"
label: ":dizzy: :python:"
artifact_paths: "dist/*.whl"
- wait
- trigger: "spacy-train-from-wheel"
label: ":dizzy: :train:"
build:
env:
SPACY_VERSION: "{$SPACY_VERSION}"

106 .github/contributors/alldefector.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Feng Niu |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | Feb 21, 2018 |
| GitHub username | alldefector |
| Website (optional) | |

106 .github/contributors/willismonroe.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Willis Monroe |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-3-5 |
| GitHub username | willismonroe |
| Website (optional) | |


@ -182,7 +182,7 @@ If you've made a contribution to spaCy, you should fill in the
[spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that
your contribution can be used across the project. If you agree to be bound by
the terms of the agreement, fill in the [template](.github/CONTRIBUTOR_AGREEMENT.md)
-and include it with your pull request, or sumit it separately to
and include it with your pull request, or submit it separately to
[`.github/contributors/`](/.github/contributors). The name of the file should be
your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.


@ -28,8 +28,10 @@ import cytoolz
import conll17_ud_eval
import spacy.lang.zh
import spacy.lang.ja
spacy.lang.zh.Chinese.Defaults.use_jieba = False
spacy.lang.ja.Japanese.Defaults.use_janome = False
random.seed(0)
numpy.random.seed(0)
@ -280,6 +282,30 @@ def print_progress(itn, losses, ud_scores):
    ))
    print(tpl.format(itn, **fields))
#def get_sent_conllu(sent, sent_id):
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
def get_token_conllu(token, i):
if token._.begins_fused:
n = 1
while token.nbor(n)._.inside_fused:
n += 1
id_ = '%d-%d' % (i, i+n)
lines = [id_, token.text, '_', '_', '_', '_', '_', '_', '_', '_']
else:
lines = []
if token.head.i == token.i:
head = 0
else:
head = i + (token.head.i - token.i) + 1
fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
str(head), token.dep_.lower(), '_', '_']
lines.append('\t'.join(fields))
return '\n'.join(lines)
Token.set_extension('get_conllu_lines', method=get_token_conllu)
Token.set_extension('begins_fused', default=False)
Token.set_extension('inside_fused', default=False)
##################
# Initialization #
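The `get_token_conllu` helper added here writes one tab-separated CoNLL-U row per token, using 1-based indices that are local to the sentence, and emits an extra `start-end` range line for fused surface tokens. A rough sketch of the head arithmetic, with hypothetical token positions and labels that are not taken from this diff:

# Suppose the caller passes i=3 for a token whose syntactic head sits two positions
# earlier in the same sentence, so token.head.i - token.i == -2.
# Then head = i + (token.head.i - token.i) + 1 = 3 - 2 + 1 = 2, the first field is
# str(i + 1) == '4', and the emitted (tab-separated) row looks roughly like:
#   4   sentence   sentence   NOUN   NN   _   2   nsubj   _   _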

81 fabfile.py vendored

@ -1,49 +1,92 @@
# coding: utf-8
from __future__ import unicode_literals, print_function

import contextlib
from pathlib import Path
from fabric.api import local, lcd, env, settings, prefix
-from fabtools.python import virtualenv
from os import path, environ
import shutil

PWD = path.dirname(__file__)
ENV = environ['VENV_DIR'] if 'VENV_DIR' in environ else '.env'
-VENV_DIR = path.join(PWD, ENV)
VENV_DIR = Path(PWD) / ENV

-def env(lang='python2.7'):
-    if path.exists(VENV_DIR):
-    local('pip install virtualenv')
-    local('python -m virtualenv -p {lang} {env}'.format(lang=lang, env=VENV_DIR))

@contextlib.contextmanager
def virtualenv(name, create=False, python='/usr/bin/python3.6'):
    python = Path(python).resolve()
    env_path = VENV_DIR
    if create:
        if env_path.exists():
            shutil.rmtree(str(env_path))
        local('{python} -m venv {env_path}'.format(python=python, env_path=VENV_DIR))
    def wrapped_local(cmd, env_vars=[], capture=False, direct=False):
        return local('source {}/bin/activate && {}'.format(env_path, cmd),
                     shell='/bin/bash', capture=False)
    yield wrapped_local


def env(lang='python3.6'):
    if VENV_DIR.exists():
        local('rm -rf {env}'.format(env=VENV_DIR))
    if lang.startswith('python3'):
        local('{lang} -m venv {env}'.format(lang=lang, env=VENV_DIR))
    else:
        local('{lang} -m pip install virtualenv --no-cache-dir'.format(lang=lang))
        local('{lang} -m virtualenv {env} --no-cache-dir'.format(lang=lang, env=VENV_DIR))
    with virtualenv(VENV_DIR) as venv_local:
        print(venv_local('python --version', capture=True))
        venv_local('pip install --upgrade setuptools --no-cache-dir')
        venv_local('pip install pytest --no-cache-dir')
        venv_local('pip install wheel --no-cache-dir')
        venv_local('pip install -r requirements.txt --no-cache-dir')
        venv_local('pip install pex --no-cache-dir')


def install():
-    with virtualenv(VENV_DIR):
-        local('pip install --upgrade setuptools')
-        local('pip install dist/*.tar.gz')
-        local('pip install pytest')
    with virtualenv(VENV_DIR) as venv_local:
        venv_local('pip install dist/*.tar.gz')


def make():
-    with virtualenv(VENV_DIR):
-        local('pip install cython')
-        local('pip install murmurhash')
-        local('pip install -r requirements.txt')
-        local('python setup.py build_ext --inplace')
    with lcd(path.dirname(__file__)):
        local('export PYTHONPATH=`pwd` && source .env/bin/activate && python setup.py build_ext --inplace',
              shell='/bin/bash')


def sdist():
-    with virtualenv(VENV_DIR):
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            local('python setup.py sdist')


def wheel():
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            venv_local('python setup.py bdist_wheel')


def pex():
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            sha = local('git rev-parse --short HEAD', capture=True)
            venv_local('pex dist/*.whl -e spacy -o dist/spacy-%s.pex' % sha,
                       direct=True)


def clean():
    with lcd(path.dirname(__file__)):
-        local('python setup.py clean --all')
        local('rm -f dist/*.whl')
        local('rm -f dist/*.pex')
    with virtualenv(VENV_DIR) as venv_local:
        venv_local('python setup.py clean --all')


def test():
-    with virtualenv(VENV_DIR):
-            local('py.test -x spacy/tests')
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            venv_local('pytest -x spacy/tests')


def train():
    args = environ.get('SPACY_TRAIN_ARGS', '')
    with virtualenv(VENV_DIR) as venv_local:
        venv_local('spacy train {args}'.format(args=args))
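All of the rewritten tasks share one pattern: enter the `virtualenv` context manager and run commands through the `venv_local` wrapper so they execute inside the `.env` environment that `env()` builds. A hypothetical extra task, not part of this commit, would follow the same shape:

def lint():
    # hypothetical task, shown only to illustrate how the venv_local wrapper is used
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            venv_local('flake8 spacy')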


@ -8,6 +8,7 @@ if __name__ == '__main__':
    import sys
    from spacy.cli import download, link, info, package, train, convert
    from spacy.cli import vocab, init_model, profile, evaluate, validate
    from spacy.cli import ud_train, ud_evaluate
    from spacy.util import prints

    commands = {
@ -15,7 +16,9 @@ if __name__ == '__main__':
        'link': link,
        'info': info,
        'train': train,
        'ud-train': ud_train,
        'evaluate': evaluate,
        'ud-evaluate': ud_evaluate,
        'convert': convert,
        'package': package,
        'vocab': vocab,


@ -3,7 +3,7 @@
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
__title__ = 'spacy'
-__version__ = '2.1.0.dev1'
__version__ = '2.1.0.dev3'
__summary__ = 'Industrial-strength Natural Language Processing (NLP) with Python and Cython'
__uri__ = 'https://spacy.io'
__author__ = 'Explosion AI'


@ -9,3 +9,5 @@ from .convert import convert
from .vocab import make_vocab as vocab
from .init_model import init_model
from .validate import validate
from .ud_train import main as ud_train
from .conll17_ud_eval import main as ud_evaluate


@ -0,0 +1,570 @@
#!/usr/bin/env python
# CoNLL 2017 UD Parsing evaluation script.
#
# Compatible with Python 2.7 and 3.2+, can be used either as a module
# or a standalone executable.
#
# Copyright 2017 Institute of Formal and Applied Linguistics (UFAL),
# Faculty of Mathematics and Physics, Charles University, Czech Republic.
#
# Changelog:
# - [02 Jan 2017] Version 0.9: Initial release
# - [25 Jan 2017] Version 0.9.1: Fix bug in LCS alignment computation
# - [10 Mar 2017] Version 1.0: Add documentation and test
# Compare HEADs correctly using aligned words
# Allow evaluation with errorneous spaces in forms
# Compare forms in LCS case insensitively
# Detect cycles and multiple root nodes
# Compute AlignedAccuracy
# Command line usage
# ------------------
# conll17_ud_eval.py [-v] [-w weights_file] gold_conllu_file system_conllu_file
#
# - if no -v is given, only the CoNLL17 UD Shared Task evaluation LAS metrics
# is printed
# - if -v is given, several metrics are printed (as precision, recall, F1 score,
# and in case the metric is computed on aligned words also accuracy on these):
# - Tokens: how well do the gold tokens match system tokens
# - Sentences: how well do the gold sentences match system sentences
# - Words: how well can the gold words be aligned to system words
# - UPOS: using aligned words, how well does UPOS match
# - XPOS: using aligned words, how well does XPOS match
# - Feats: using aligned words, how well does FEATS match
# - AllTags: using aligned words, how well does UPOS+XPOS+FEATS match
# - Lemmas: using aligned words, how well does LEMMA match
# - UAS: using aligned words, how well does HEAD match
# - LAS: using aligned words, how well does HEAD+DEPREL(ignoring subtypes) match
# - if weights_file is given (with lines containing deprel-weight pairs),
# one more metric is shown:
# - WeightedLAS: as LAS, but each deprel (ignoring subtypes) has different weight
# API usage
# ---------
# - load_conllu(file)
# - loads CoNLL-U file from given file object to an internal representation
# - the file object should return str on both Python 2 and Python 3
# - raises UDError exception if the given file cannot be loaded
# - evaluate(gold_ud, system_ud)
# - evaluate the given gold and system CoNLL-U files (loaded with load_conllu)
# - raises UDError if the concatenated tokens of gold and system file do not match
# - returns a dictionary with the metrics described above, each metrics having
# three fields: precision, recall and f1
# Description of token matching
# -----------------------------
# In order to match tokens of gold file and system file, we consider the text
# resulting from concatenation of gold tokens and text resulting from
# concatenation of system tokens. These texts should match -- if they do not,
# the evaluation fails.
#
# If the texts do match, every token is represented as a range in this original
# text, and tokens are equal only if their range is the same.
# Description of word matching
# ----------------------------
# When matching words of gold file and system file, we first match the tokens.
# The words which are also tokens are matched as tokens, but words in multi-word
# tokens have to be handled differently.
#
# To handle multi-word tokens, we start by finding "multi-word spans".
# Multi-word span is a span in the original text such that
# - it contains at least one multi-word token
# - all multi-word tokens in the span (considering both gold and system ones)
# are completely inside the span (i.e., they do not "stick out")
# - the multi-word span is as small as possible
#
# For every multi-word span, we align the gold and system words completely
# inside this span using LCS on their FORMs. The words not intersecting
# (even partially) any multi-word span are then aligned as tokens.
from __future__ import division
from __future__ import print_function
import argparse
import io
import sys
import unittest
# CoNLL-U column names
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)
# UD Error is used when raising exceptions in this module
class UDError(Exception):
pass
# Load given CoNLL-U file into internal representation
def load_conllu(file):
# Internal representation classes
class UDRepresentation:
def __init__(self):
# Characters of all the tokens in the whole file.
# Whitespace between tokens is not included.
self.characters = []
# List of UDSpan instances with start&end indices into `characters`.
self.tokens = []
# List of UDWord instances.
self.words = []
# List of UDSpan instances with start&end indices into `characters`.
self.sentences = []
class UDSpan:
def __init__(self, start, end, characters):
self.start = start
# Note that self.end marks the first position **after the end** of span,
# so we can use characters[start:end] or range(start, end).
self.end = end
self.characters = characters
@property
def text(self):
return ''.join(self.characters[self.start:self.end])
def __str__(self):
return self.text
def __repr__(self):
return self.text
class UDWord:
def __init__(self, span, columns, is_multiword):
# Span of this word (or MWT, see below) within ud_representation.characters.
self.span = span
# 10 columns of the CoNLL-U file: ID, FORM, LEMMA,...
self.columns = columns
# is_multiword==True means that this word is part of a multi-word token.
# In that case, self.span marks the span of the whole multi-word token.
self.is_multiword = is_multiword
# Reference to the UDWord instance representing the HEAD (or None if root).
self.parent = None
# Let's ignore language-specific deprel subtypes.
self.columns[DEPREL] = columns[DEPREL].split(':')[0]
ud = UDRepresentation()
# Load the CoNLL-U file
index, sentence_start = 0, None
linenum = 0
while True:
line = file.readline()
linenum += 1
if not line:
break
line = line.rstrip("\r\n")
# Handle sentence start boundaries
if sentence_start is None:
# Skip comments
if line.startswith("#"):
continue
# Start a new sentence
ud.sentences.append(UDSpan(index, 0, ud.characters))
sentence_start = len(ud.words)
if not line:
# Add parent UDWord links and check there are no cycles
def process_word(word):
if word.parent == "remapping":
raise UDError("There is a cycle in a sentence")
if word.parent is None:
head = int(word.columns[HEAD])
if head > len(ud.words) - sentence_start:
raise UDError("HEAD '{}' points outside of the sentence".format(word.columns[HEAD]))
if head:
parent = ud.words[sentence_start + head - 1]
word.parent = "remapping"
process_word(parent)
word.parent = parent
for word in ud.words[sentence_start:]:
process_word(word)
# Check there is a single root node
if len([word for word in ud.words[sentence_start:] if word.parent is None]) != 1:
raise UDError("There are multiple roots in a sentence")
# End the sentence
ud.sentences[-1].end = index
sentence_start = None
continue
# Read next token/word
columns = line.split("\t")
if len(columns) != 10:
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, line))
# Skip empty nodes
if "." in columns[ID]:
continue
# Delete spaces from FORM so gold.characters == system.characters
# even if one of them tokenizes the space.
columns[FORM] = columns[FORM].replace(" ", "")
if not columns[FORM]:
raise UDError("There is an empty FORM in the CoNLL-U file -- line %d" % linenum)
# Save token
ud.characters.extend(columns[FORM])
ud.tokens.append(UDSpan(index, index + len(columns[FORM]), ud.characters))
index += len(columns[FORM])
# Handle multi-word tokens to save word(s)
if "-" in columns[ID]:
try:
start, end = map(int, columns[ID].split("-"))
except:
raise UDError("Cannot parse multi-word token ID '{}'".format(columns[ID]))
for _ in range(start, end + 1):
word_line = file.readline().rstrip("\r\n")
word_columns = word_line.split("\t")
if len(word_columns) != 10:
print(columns)
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, word_line))
ud.words.append(UDWord(ud.tokens[-1], word_columns, is_multiword=True))
# Basic tokens/words
else:
try:
word_id = int(columns[ID])
except:
raise UDError("Cannot parse word ID '{}'".format(columns[ID]))
if word_id != len(ud.words) - sentence_start + 1:
raise UDError("Incorrect word ID '{}' for word '{}', expected '{}'".format(columns[ID], columns[FORM], len(ud.words) - sentence_start + 1))
try:
head_id = int(columns[HEAD])
except:
raise UDError("Cannot parse HEAD '{}'".format(columns[HEAD]))
if head_id < 0:
raise UDError("HEAD cannot be negative")
ud.words.append(UDWord(ud.tokens[-1], columns, is_multiword=False))
if sentence_start is not None:
raise UDError("The CoNLL-U file does not end with empty line")
return ud
# Evaluate the gold and system treebanks (loaded using load_conllu).
def evaluate(gold_ud, system_ud, deprel_weights=None):
class Score:
def __init__(self, gold_total, system_total, correct, aligned_total=None):
self.precision = correct / system_total if system_total else 0.0
self.recall = correct / gold_total if gold_total else 0.0
self.f1 = 2 * correct / (system_total + gold_total) if system_total + gold_total else 0.0
self.aligned_accuracy = correct / aligned_total if aligned_total else aligned_total
class AlignmentWord:
def __init__(self, gold_word, system_word):
self.gold_word = gold_word
self.system_word = system_word
self.gold_parent = None
self.system_parent_gold_aligned = None
class Alignment:
def __init__(self, gold_words, system_words):
self.gold_words = gold_words
self.system_words = system_words
self.matched_words = []
self.matched_words_map = {}
def append_aligned_words(self, gold_word, system_word):
self.matched_words.append(AlignmentWord(gold_word, system_word))
self.matched_words_map[system_word] = gold_word
def fill_parents(self):
# We represent root parents in both gold and system data by '0'.
# For gold data, we represent non-root parent by corresponding gold word.
# For system data, we represent non-root parent by either gold word aligned
# to parent system nodes, or by None if no gold words is aligned to the parent.
for words in self.matched_words:
words.gold_parent = words.gold_word.parent if words.gold_word.parent is not None else 0
words.system_parent_gold_aligned = self.matched_words_map.get(words.system_word.parent, None) \
if words.system_word.parent is not None else 0
def lower(text):
if sys.version_info < (3, 0) and isinstance(text, str):
return text.decode("utf-8").lower()
return text.lower()
def spans_score(gold_spans, system_spans):
correct, gi, si = 0, 0, 0
while gi < len(gold_spans) and si < len(system_spans):
if system_spans[si].start < gold_spans[gi].start:
si += 1
elif gold_spans[gi].start < system_spans[si].start:
gi += 1
else:
correct += gold_spans[gi].end == system_spans[si].end
si += 1
gi += 1
return Score(len(gold_spans), len(system_spans), correct)
def alignment_score(alignment, key_fn, weight_fn=lambda w: 1):
gold, system, aligned, correct = 0, 0, 0, 0
for word in alignment.gold_words:
gold += weight_fn(word)
for word in alignment.system_words:
system += weight_fn(word)
for words in alignment.matched_words:
aligned += weight_fn(words.gold_word)
if key_fn is None:
# Return score for whole aligned words
return Score(gold, system, aligned)
for words in alignment.matched_words:
if key_fn(words.gold_word, words.gold_parent) == key_fn(words.system_word, words.system_parent_gold_aligned):
correct += weight_fn(words.gold_word)
return Score(gold, system, correct, aligned)
def beyond_end(words, i, multiword_span_end):
if i >= len(words):
return True
if words[i].is_multiword:
return words[i].span.start >= multiword_span_end
return words[i].span.end > multiword_span_end
def extend_end(word, multiword_span_end):
if word.is_multiword and word.span.end > multiword_span_end:
return word.span.end
return multiword_span_end
def find_multiword_span(gold_words, system_words, gi, si):
# We know gold_words[gi].is_multiword or system_words[si].is_multiword.
# Find the start of the multiword span (gs, ss), so the multiword span is minimal.
# Initialize multiword_span_end characters index.
if gold_words[gi].is_multiword:
multiword_span_end = gold_words[gi].span.end
if not system_words[si].is_multiword and system_words[si].span.start < gold_words[gi].span.start:
si += 1
else: # if system_words[si].is_multiword
multiword_span_end = system_words[si].span.end
if not gold_words[gi].is_multiword and gold_words[gi].span.start < system_words[si].span.start:
gi += 1
gs, ss = gi, si
# Find the end of the multiword span
# (so both gi and si are pointing to the word following the multiword span end).
while not beyond_end(gold_words, gi, multiword_span_end) or \
not beyond_end(system_words, si, multiword_span_end):
if gi < len(gold_words) and (si >= len(system_words) or
gold_words[gi].span.start <= system_words[si].span.start):
multiword_span_end = extend_end(gold_words[gi], multiword_span_end)
gi += 1
else:
multiword_span_end = extend_end(system_words[si], multiword_span_end)
si += 1
return gs, ss, gi, si
def compute_lcs(gold_words, system_words, gi, si, gs, ss):
lcs = [[0] * (si - ss) for i in range(gi - gs)]
for g in reversed(range(gi - gs)):
for s in reversed(range(si - ss)):
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
lcs[g][s] = 1 + (lcs[g+1][s+1] if g+1 < gi-gs and s+1 < si-ss else 0)
lcs[g][s] = max(lcs[g][s], lcs[g+1][s] if g+1 < gi-gs else 0)
lcs[g][s] = max(lcs[g][s], lcs[g][s+1] if s+1 < si-ss else 0)
return lcs
def align_words(gold_words, system_words):
alignment = Alignment(gold_words, system_words)
gi, si = 0, 0
while gi < len(gold_words) and si < len(system_words):
if gold_words[gi].is_multiword or system_words[si].is_multiword:
# A: Multi-word tokens => align via LCS within the whole "multiword span".
gs, ss, gi, si = find_multiword_span(gold_words, system_words, gi, si)
if si > ss and gi > gs:
lcs = compute_lcs(gold_words, system_words, gi, si, gs, ss)
# Store aligned words
s, g = 0, 0
while g < gi - gs and s < si - ss:
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
alignment.append_aligned_words(gold_words[gs+g], system_words[ss+s])
g += 1
s += 1
elif lcs[g][s] == (lcs[g+1][s] if g+1 < gi-gs else 0):
g += 1
else:
s += 1
else:
# B: No multi-word token => align according to spans.
if (gold_words[gi].span.start, gold_words[gi].span.end) == (system_words[si].span.start, system_words[si].span.end):
alignment.append_aligned_words(gold_words[gi], system_words[si])
gi += 1
si += 1
elif gold_words[gi].span.start <= system_words[si].span.start:
gi += 1
else:
si += 1
alignment.fill_parents()
return alignment
# Check that underlying character sequences do match
if gold_ud.characters != system_ud.characters:
index = 0
while gold_ud.characters[index] == system_ud.characters[index]:
index += 1
raise UDError(
"The concatenation of tokens in gold file and in system file differ!\n" +
"First 20 differing characters in gold file: '{}' and system file: '{}'".format(
"".join(gold_ud.characters[index:index + 20]),
"".join(system_ud.characters[index:index + 20])
)
)
# Align words
alignment = align_words(gold_ud.words, system_ud.words)
# Compute the F1-scores
result = {
"Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
"Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
"Words": alignment_score(alignment, None),
"UPOS": alignment_score(alignment, lambda w, parent: w.columns[UPOS]),
"XPOS": alignment_score(alignment, lambda w, parent: w.columns[XPOS]),
"Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
"AllTags": alignment_score(alignment, lambda w, parent: (w.columns[UPOS], w.columns[XPOS], w.columns[FEATS])),
"Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
"UAS": alignment_score(alignment, lambda w, parent: parent),
"LAS": alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL])),
}
# Add WeightedLAS if weights are given
if deprel_weights is not None:
def weighted_las(word):
return deprel_weights.get(word.columns[DEPREL], 1.0)
result["WeightedLAS"] = alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL]), weighted_las)
return result
def load_deprel_weights(weights_file):
if weights_file is None:
return None
deprel_weights = {}
for line in weights_file:
# Ignore comments and empty lines
if line.startswith("#") or not line.strip():
continue
columns = line.rstrip("\r\n").split()
if len(columns) != 2:
raise ValueError("Expected two columns in the UD Relations weights file on line '{}'".format(line))
deprel_weights[columns[0]] = float(columns[1])
return deprel_weights
def load_conllu_file(path):
_file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
return load_conllu(_file)
def evaluate_wrapper(args):
# Load CoNLL-U files
gold_ud = load_conllu_file(args.gold_file)
system_ud = load_conllu_file(args.system_file)
# Load weights if requested
deprel_weights = load_deprel_weights(args.weights)
return evaluate(gold_ud, system_ud, deprel_weights)
def main():
# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("gold_file", type=str,
help="Name of the CoNLL-U file with the gold data.")
parser.add_argument("system_file", type=str,
help="Name of the CoNLL-U file with the predicted data.")
parser.add_argument("--weights", "-w", type=argparse.FileType("r"), default=None,
metavar="deprel_weights_file",
help="Compute WeightedLAS using given weights for Universal Dependency Relations.")
parser.add_argument("--verbose", "-v", default=0, action="count",
help="Print all metrics.")
args = parser.parse_args()
# Use verbose if weights are supplied
if args.weights is not None and not args.verbose:
args.verbose = 1
# Evaluate
evaluation = evaluate_wrapper(args)
# Print the evaluation
if not args.verbose:
print("LAS F1 Score: {:.2f}".format(100 * evaluation["LAS"].f1))
else:
metrics = ["Tokens", "Sentences", "Words", "UPOS", "XPOS", "Feats", "AllTags", "Lemmas", "UAS", "LAS"]
if args.weights is not None:
metrics.append("WeightedLAS")
print("Metrics | Precision | Recall | F1 Score | AligndAcc")
print("-----------+-----------+-----------+-----------+-----------")
for metric in metrics:
print("{:11}|{:10.2f} |{:10.2f} |{:10.2f} |{}".format(
metric,
100 * evaluation[metric].precision,
100 * evaluation[metric].recall,
100 * evaluation[metric].f1,
"{:10.2f}".format(100 * evaluation[metric].aligned_accuracy) if evaluation[metric].aligned_accuracy is not None else ""
))
if __name__ == "__main__":
main()
# Tests, which can be executed with `python -m unittest conll17_ud_eval`.
class TestAlignment(unittest.TestCase):
@staticmethod
def _load_words(words):
"""Prepare fake CoNLL-U files with fake HEAD to prevent multiple roots errors."""
lines, num_words = [], 0
for w in words:
parts = w.split(" ")
if len(parts) == 1:
num_words += 1
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, parts[0], int(num_words>1)))
else:
lines.append("{}-{}\t{}\t_\t_\t_\t_\t_\t_\t_\t_".format(num_words + 1, num_words + len(parts) - 1, parts[0]))
for part in parts[1:]:
num_words += 1
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, part, int(num_words>1)))
return load_conllu((io.StringIO if sys.version_info >= (3, 0) else io.BytesIO)("\n".join(lines+["\n"])))
def _test_exception(self, gold, system):
self.assertRaises(UDError, evaluate, self._load_words(gold), self._load_words(system))
def _test_ok(self, gold, system, correct):
metrics = evaluate(self._load_words(gold), self._load_words(system))
gold_words = sum((max(1, len(word.split(" ")) - 1) for word in gold))
system_words = sum((max(1, len(word.split(" ")) - 1) for word in system))
self.assertEqual((metrics["Words"].precision, metrics["Words"].recall, metrics["Words"].f1),
(correct / system_words, correct / gold_words, 2 * correct / (gold_words + system_words)))
def test_exception(self):
self._test_exception(["a"], ["b"])
def test_equal(self):
self._test_ok(["a"], ["a"], 1)
self._test_ok(["a", "b", "c"], ["a", "b", "c"], 3)
def test_equal_with_multiword(self):
self._test_ok(["abc a b c"], ["a", "b", "c"], 3)
self._test_ok(["a", "bc b c", "d"], ["a", "b", "c", "d"], 4)
self._test_ok(["abcd a b c d"], ["ab a b", "cd c d"], 4)
self._test_ok(["abc a b c", "de d e"], ["a", "bcd b c d", "e"], 5)
def test_alignment(self):
self._test_ok(["abcd"], ["a", "b", "c", "d"], 0)
self._test_ok(["abc", "d"], ["a", "b", "c", "d"], 1)
self._test_ok(["a", "bc", "d"], ["a", "b", "c", "d"], 2)
self._test_ok(["a", "bc b c", "d"], ["a", "b", "cd"], 2)
self._test_ok(["abc a BX c", "def d EX f"], ["ab a b", "cd c d", "ef e f"], 4)
self._test_ok(["ab a b", "cd bc d"], ["a", "bc", "d"], 2)
self._test_ok(["a", "bc b c", "d"], ["ab AX BX", "cd CX a"], 1)


@ -116,10 +116,9 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %") print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %")
try: try:
for i in range(n_iter):
train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0, train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
gold_preproc=gold_preproc, max_length=0) gold_preproc=gold_preproc, max_length=0)
train_docs = list(train_docs)
for i in range(n_iter):
with tqdm.tqdm(total=n_train_words, leave=False) as pbar: with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
losses = {} losses = {}
for batch in minibatch(train_docs, size=batch_sizes): for batch in minibatch(train_docs, size=batch_sizes):

390 spacy/cli/ud_train.py Normal file

@ -0,0 +1,390 @@
'''Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
.conllu format for development data, allowing the official scorer to be used.
'''
from __future__ import unicode_literals
import plac
import tqdm
from pathlib import Path
import re
import sys
import json
import spacy
import spacy.util
from ..tokens import Token, Doc
from ..gold import GoldParse
from ..syntax.nonproj import projectivize
from ..matcher import Matcher
from collections import defaultdict, Counter
from timeit import default_timer as timer
import itertools
import random
import numpy.random
import cytoolz
from . import conll17_ud_eval
from .. import lang
from .. import lang
from ..lang import zh
from ..lang import ja
lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False
random.seed(0)
numpy.random.seed(0)
def minibatch_by_words(items, size=5000):
random.shuffle(items)
if isinstance(size, int):
size_ = itertools.repeat(size)
else:
size_ = size
items = iter(items)
while True:
batch_size = next(size_)
batch = []
while batch_size >= 0:
try:
doc, gold = next(items)
except StopIteration:
if batch:
yield batch
return
batch_size -= len(doc)
batch.append((doc, gold))
if batch:
yield batch
else:
break
################
# Data reading #
################
space_re = re.compile('\s+')
def split_text(text):
return [space_re.sub(' ', par.strip()) for par in text.split('\n\n')]
def read_data(nlp, conllu_file, text_file, raw_text=True, oracle_segments=False,
max_doc_length=None, limit=None):
'''Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
include Doc objects created using nlp.make_doc and then aligned against
the gold-standard sequences. If oracle_segments=True, include Doc objects
created from the gold-standard segments. At least one must be True.'''
if not raw_text and not oracle_segments:
raise ValueError("At least one of raw_text or oracle_segments must be True")
paragraphs = split_text(text_file.read())
conllu = read_conllu(conllu_file)
# sd is spacy doc; cd is conllu doc
# cs is conllu sent, ct is conllu token
docs = []
golds = []
for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
sent_annots = []
for cs in cd:
sent = defaultdict(list)
for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
if '.' in id_:
continue
if '-' in id_:
continue
id_ = int(id_)-1
head = int(head)-1 if head != '0' else id_
sent['words'].append(word)
sent['tags'].append(tag)
sent['heads'].append(head)
sent['deps'].append('ROOT' if dep == 'root' else dep)
sent['spaces'].append(space_after == '_')
sent['entities'] = ['-'] * len(sent['words'])
sent['heads'], sent['deps'] = projectivize(sent['heads'],
sent['deps'])
if oracle_segments:
docs.append(Doc(nlp.vocab, words=sent['words'], spaces=sent['spaces']))
golds.append(GoldParse(docs[-1], **sent))
sent_annots.append(sent)
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
doc, gold = _make_gold(nlp, None, sent_annots)
sent_annots = []
docs.append(doc)
golds.append(gold)
if limit and len(docs) >= limit:
return docs, golds
if raw_text and sent_annots:
doc, gold = _make_gold(nlp, None, sent_annots)
docs.append(doc)
golds.append(gold)
if limit and len(docs) >= limit:
return docs, golds
return docs, golds
def read_conllu(file_):
docs = []
sent = []
doc = []
for line in file_:
if line.startswith('# newdoc'):
if doc:
docs.append(doc)
doc = []
elif line.startswith('#'):
continue
elif not line.strip():
if sent:
doc.append(sent)
sent = []
else:
sent.append(list(line.strip().split('\t')))
if len(sent[-1]) != 10:
print(repr(line))
raise ValueError
if sent:
doc.append(sent)
if doc:
docs.append(doc)
return docs
def _make_gold(nlp, text, sent_annots):
# Flatten the conll annotations, and adjust the head indices
flat = defaultdict(list)
for sent in sent_annots:
flat['heads'].extend(len(flat['words'])+head for head in sent['heads'])
for field in ['words', 'tags', 'deps', 'entities', 'spaces']:
flat[field].extend(sent[field])
# Construct text if necessary
assert len(flat['words']) == len(flat['spaces'])
if text is None:
text = ''.join(word+' '*space for word, space in zip(flat['words'], flat['spaces']))
doc = nlp.make_doc(text)
flat.pop('spaces')
gold = GoldParse(doc, **flat)
return doc, gold
#############################
# Data transforms for spaCy #
#############################
def golds_to_gold_tuples(docs, golds):
'''Get out the annoying 'tuples' format used by begin_training, given the
GoldParse objects.'''
tuples = []
for doc, gold in zip(docs, golds):
text = doc.text
ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
sents = [((ids, words, tags, heads, labels, iob), [])]
tuples.append((text, sents))
return tuples
##############
# Evaluation #
##############
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
with text_loc.open('r', encoding='utf8') as text_file:
texts = split_text(text_file.read())
docs = list(nlp.pipe(texts))
with sys_loc.open('w', encoding='utf8') as out_file:
write_conllu(docs, out_file)
with gold_loc.open('r', encoding='utf8') as gold_file:
gold_ud = conll17_ud_eval.load_conllu(gold_file)
with sys_loc.open('r', encoding='utf8') as sys_file:
sys_ud = conll17_ud_eval.load_conllu(sys_file)
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
return scores
def write_conllu(docs, file_):
merger = Matcher(docs[0].vocab)
merger.add('SUBTOK', None, [{'DEP': 'subtok', 'op': '+'}])
for i, doc in enumerate(docs):
matches = merger(doc)
spans = [doc[start:end+1] for _, start, end in matches]
offsets = [(span.start_char, span.end_char) for span in spans]
for start_char, end_char in offsets:
doc.merge(start_char, end_char)
file_.write("# newdoc id = {i}\n".format(i=i))
for j, sent in enumerate(doc.sents):
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
file_.write("# text = {text}\n".format(text=sent.text))
for k, token in enumerate(sent):
file_.write(token._.get_conllu_lines(k) + '\n')
file_.write('\n')
def print_progress(itn, losses, ud_scores):
fields = {
'dep_loss': losses.get('parser', 0.0),
'tag_loss': losses.get('tagger', 0.0),
'words': ud_scores['Words'].f1 * 100,
'sents': ud_scores['Sentences'].f1 * 100,
'tags': ud_scores['XPOS'].f1 * 100,
'uas': ud_scores['UAS'].f1 * 100,
'las': ud_scores['LAS'].f1 * 100,
}
header = ['Epoch', 'Loss', 'LAS', 'UAS', 'TAG', 'SENT', 'WORD']
if itn == 0:
print('\t'.join(header))
tpl = '\t'.join((
'{:d}',
'{dep_loss:.1f}',
'{las:.1f}',
'{uas:.1f}',
'{tags:.1f}',
'{sents:.1f}',
'{words:.1f}',
))
print(tpl.format(itn, **fields))
#def get_sent_conllu(sent, sent_id):
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
def get_token_conllu(token, i):
if token._.begins_fused:
n = 1
while token.nbor(n)._.inside_fused:
n += 1
id_ = '%d-%d' % (i, i+n)
lines = [id_, token.text, '_', '_', '_', '_', '_', '_', '_', '_']
else:
lines = []
if token.head.i == token.i:
head = 0
else:
head = i + (token.head.i - token.i) + 1
fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
str(head), token.dep_.lower(), '_', '_']
lines.append('\t'.join(fields))
return '\n'.join(lines)
Token.set_extension('get_conllu_lines', method=get_token_conllu)
Token.set_extension('begins_fused', default=False)
Token.set_extension('inside_fused', default=False)
##################
# Initialization #
##################
def load_nlp(corpus, config):
lang = corpus.split('_')[0]
nlp = spacy.blank(lang)
if config.vectors:
nlp.vocab.from_disk(config.vectors / 'vocab')
return nlp
def initialize_pipeline(nlp, docs, golds, config):
nlp.add_pipe(nlp.create_pipe('parser'))
if config.multitask_tag:
nlp.parser.add_multitask_objective('tag')
if config.multitask_sent:
nlp.parser.add_multitask_objective('sent_start')
nlp.parser.moves.add_action(2, 'subtok')
nlp.add_pipe(nlp.create_pipe('tagger'))
for gold in golds:
for tag in gold.tags:
if tag is not None:
nlp.tagger.add_label(tag)
# Replace labels that didn't make the frequency cutoff
actions = set(nlp.parser.labels)
label_set = set([act.split('-')[1] for act in actions if '-' in act])
for gold in golds:
for i, label in enumerate(gold.labels):
if label is not None and label not in label_set:
gold.labels[i] = label.split('||')[0]
return nlp.begin_training(lambda: golds_to_gold_tuples(docs, golds))
########################
# Command line helpers #
########################
class Config(object):
def __init__(self, vectors=None, max_doc_length=10, multitask_tag=True,
multitask_sent=True, nr_epoch=30, batch_size=1000, dropout=0.2):
for key, value in locals().items():
setattr(self, key, value)
@classmethod
def load(cls, loc):
with Path(loc).open('r', encoding='utf8') as file_:
cfg = json.load(file_)
return cls(**cfg)
class Dataset(object):
def __init__(self, path, section):
self.path = path
self.section = section
self.conllu = None
self.text = None
for file_path in self.path.iterdir():
name = file_path.parts[-1]
if section in name and name.endswith('conllu'):
self.conllu = file_path
elif section in name and name.endswith('txt'):
self.text = file_path
if self.conllu is None:
msg = "Could not find .txt file in {path} for {section}"
raise IOError(msg.format(section=section, path=path))
if self.text is None:
msg = "Could not find .txt file in {path} for {section}"
self.lang = self.conllu.parts[-1].split('-')[0].split('_')[0]
class TreebankPaths(object):
def __init__(self, ud_path, treebank, **cfg):
self.train = Dataset(ud_path / treebank, 'train')
self.dev = Dataset(ud_path / treebank, 'dev')
self.lang = self.train.lang
@plac.annotations(
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
corpus=("UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
"positional", None, str),
parses_dir=("Directory to write the development parses", "positional", None, Path),
config=("Path to json formatted config file", "positional"),
limit=("Size limit", "option", "n", int)
)
def main(ud_dir, parses_dir, config, corpus, limit=0):
config = Config.load(config)
paths = TreebankPaths(ud_dir, corpus)
if not (parses_dir / corpus).exists():
(parses_dir / corpus).mkdir()
print("Train and evaluate", corpus, "using lang", paths.lang)
nlp = load_nlp(paths.lang, config)
docs, golds = read_data(nlp, paths.train.conllu.open(), paths.train.text.open(),
max_doc_length=config.max_doc_length, limit=limit)
optimizer = initialize_pipeline(nlp, docs, golds, config)
for i in range(config.nr_epoch):
docs = [nlp.make_doc(doc.text) for doc in docs]
batches = minibatch_by_words(list(zip(docs, golds)), size=config.batch_size)
losses = {}
n_train_words = sum(len(doc) for doc in docs)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
for batch in batches:
batch_docs, batch_gold = zip(*batch)
pbar.update(sum(len(doc) for doc in batch_docs))
nlp.update(batch_docs, batch_gold, sgd=optimizer,
drop=config.dropout, losses=losses)
out_path = parses_dir / corpus / 'epoch-{i}.conllu'.format(i=i)
with nlp.use_params(optimizer.averages):
scores = evaluate(nlp, paths.dev.text, paths.dev.conllu, out_path)
print_progress(i, losses, scores)
if __name__ == '__main__':
plac.call(main)
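With the `ud-train` command registered in the `__main__` changes above, the trainer can be run from the command line or called directly from Python. A rough sketch with hypothetical paths, following the argument order of `main(ud_dir, parses_dir, config, corpus, limit=0)`:

from pathlib import Path
from spacy.cli.ud_train import main as ud_train

ud_train(Path('/data/ud-treebanks-v2.0'),  # hypothetical UD corpus directory
         Path('/tmp/parses'),              # where epoch-{i}.conllu dev parses are written
         'config.json',                    # hypothetical JSON config (keys as in Config above)
         'en',                             # treebank name, e.g. 'en' or 'es_ancora'
         limit=1000)                       # optional size limit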


@ -13,7 +13,7 @@ from . import _align
from .syntax import nonproj
from .tokens import Doc
from . import util
-from .util import minibatch
from .util import minibatch, itershuffle


def tags_to_entities(tags):
@ -133,15 +133,14 @@ class GoldCorpus(object):
    def train_docs(self, nlp, gold_preproc=False,
                   projectivize=False, max_length=None,
                   noise_level=0.0):
-        train_tuples = self.train_tuples
        if projectivize:
            train_tuples = nonproj.preprocess_training_data(
-                self.train_tuples, label_freq_cutoff=100)
                self.train_tuples, label_freq_cutoff=30)
-        random.shuffle(train_tuples)
        random.shuffle(self.train_locs)
        gold_docs = self.iter_gold_docs(nlp, train_tuples, gold_preproc,
                                        max_length=max_length,
                                        noise_level=noise_level)
-        yield from gold_docs
        yield from itershuffle(gold_docs, bufsize=100)

    def dev_docs(self, nlp, gold_preproc=False):
        gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc)


@ -21,7 +21,7 @@ class SpanishDefaults(Language.Defaults):
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    tag_map = TAG_MAP
    stop_words = STOP_WORDS
-    sytax_iterators = SYNTAX_ITERATORS
    syntax_iterators = SYNTAX_ITERATORS
    lemma_lookup = LOOKUP


@ -6,17 +6,19 @@ from ...symbols import NOUN, PROPN, PRON, VERB, AUX
def noun_chunks(obj):
    doc = obj.doc
-    np_label = doc.vocab.strings['NP']
    if not len(doc):
        return
    np_label = doc.vocab.strings.add('NP')
    left_labels = ['det', 'fixed', 'neg'] #['nunmod', 'det', 'appos', 'fixed']
    right_labels = ['flat', 'fixed', 'compound', 'neg']
    stop_labels = ['punct']
-    np_left_deps = [doc.vocab.strings[label] for label in left_labels]
    np_left_deps = [doc.vocab.strings.add(label) for label in left_labels]
-    np_right_deps = [doc.vocab.strings[label] for label in right_labels]
    np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
-    stop_deps = [doc.vocab.strings[label] for label in stop_labels]
    stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
    token = doc[0]
    while token and token.i < len(doc):
        if token.pos in [PROPN, NOUN, PRON]:
-            left, right = noun_bounds(token)
            left, right = noun_bounds(doc, token, np_left_deps, np_right_deps, stop_deps)
            yield left.i, right.i+1, np_label
            token = right
        token = next_token(token)
@ -33,7 +35,7 @@ def next_token(token):
    return None


-def noun_bounds(root):
def noun_bounds(doc, root, np_left_deps, np_right_deps, stop_deps):
    left_bound = root
    for token in reversed(list(root.lefts)):
        if token.dep in np_left_deps:
@ -41,7 +43,7 @@ def noun_bounds(root):
    right_bound = root
    for token in root.rights:
        if (token.dep in np_right_deps):
-            left, right = noun_bounds(token)
            left, right = noun_bounds(doc, token, np_left_deps, np_right_deps, stop_deps)
            if list(filter(lambda t: is_verb_token(t) or t.dep in stop_deps,
                           doc[left_bound.i: right.i])):
                break


@ -6,10 +6,25 @@ from __future__ import unicode_literals
STOP_WORDS = set("""
a
ah
aha
aj
ako
al
ali
arh
au
avaj
bar
baš
bez
bi
bih
bijah
bijahu
bijaše
bijasmo
bijaste
bila
bili
bilo
@ -17,25 +32,104 @@ bio
bismo
biste
biti
brr
buć
budavši
bude
budimo
budite
budu
budući
bum
bumo
će
ćemo
ćeš
ćete
čijem
čijim
čijima
ću
da
daj
dakle
de
deder
dem
djelomice
djelomično
do
doista
dok
dokle
donekle
dosad
doskoro
dotad
dotle
dovečer
drugamo
drugdje
duž
e
eh
ehe
ej
eno
eto
evo
ga
gdjekakav
gdjekoje
gic
god
halo
hej
hm
hoće
hoćemo
hoćeš
hoćete
hoću
hop
htijahu
htijasmo
htijaste
htio
htjedoh
htjedoše
htjedoste
htjela
htjele
htjeli
hura
i
iako
ih
iju
ijuju
ikada
ikakav
ikakva
ikakve
ikakvi
ikakvih
ikakvim
ikakvima
ikakvo
ikakvog
ikakvoga
ikakvoj
ikakvom
ikakvome
ili
im
iz
ja
je
jedna
jedne
jedni
jedno
jer
jesam
@ -57,6 +151,7 @@ koji
kojima
koju
kroz
lani
li
me
mene
@ -66,6 +161,8 @@ mimo
moj
moja
moje
moji
moju
mu
na
nad
@ -77,24 +174,27 @@ naš
naša
naše
našeg
naši
ne
neće
nećemo
nećeš
nećete
neću
nego
neka
neke
neki
nekog
neku
nema
nešto
netko
ni
nije
nikoga
nikoje
nikoji
nikoju
nisam
nisi
@ -123,33 +223,63 @@ od
odmah
on
ona
one
oni
ono
onu
onoj
onom
onim
onima
ova
ovaj
ovim
ovima
ovoj
pa
pak
pljus
po
pod
podalje
poimence
poizdalje
ponekad
pored
postrance
potajice
potrbuške
pouzdano
prije
s
sa
sam
samo
sasvim
sav
se
sebe
sebi
si
šic
smo
ste
što
šta
štogod
štagod
su
sva
sve
svi
svog
svoj
svoja
svoje
svoju
svom
svu
ta
tada
taj
@ -158,6 +288,8 @@ te
tebe tebe
tebi tebi
ti ti
tim
tima
to to
toj toj
tome tome
@ -165,23 +297,51 @@ tu
tvoj tvoj
tvoja tvoja
tvoje tvoje
tvoji
tvoju
u u
usprkos
utaman
uvijek
uz uz
uza
uzagrapce
uzalud
uzduž
valjda
vam vam
vama vama
vas vas
vaš vaš
vaša vaša
vaše vaše
vašim
vašima
već već
vi vi
vjerojatno
vjerovatno
vrh
vrlo vrlo
za za
zaista
zar zar
zatim
zato
zbija
zbog
želeći
željah
željela
željele
željeli
željelo
željen
željena
željene
željeni
željenu
željeo
zimus
zum
""".split())

@ -35,14 +35,32 @@ class JapaneseTokenizer(object):
    def from_disk(self, path, **exclude):
        return self


class JapaneseCharacterSegmenter(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = []
        spaces = []
        doc = self.tokenizer(text)
        for token in self.tokenizer(text):
            words.extend(list(token.text))
            spaces.extend([False]*len(token.text))
            spaces[-1] = bool(token.whitespace_)
        return Doc(self.vocab, words=words, spaces=spaces)


class JapaneseDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'ja'
    use_janome = True

    @classmethod
    def create_tokenizer(cls, nlp=None):
        if cls.use_janome:
            return JapaneseTokenizer(cls, nlp)
        else:
            return JapaneseCharacterSegmenter(cls, nlp.vocab)


class Japanese(Language):
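The character segmenter is a fallback for environments where the Janome tokenizer is unavailable; its intended output is simply one token per character. A standalone sketch of that output, built directly with Doc (independent of the class above, which still reaches for a word-level tokenizer internally):

from spacy.tokens import Doc
from spacy.vocab import Vocab

text = 'すもももももももものうち'
doc = Doc(Vocab(), words=list(text), spaces=[False] * len(text))
print([t.text for t in doc])  # one token per character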

22
spacy/lang/tr/examples.py Normal file
@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.tr.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
    "Neredesin?",
    "Neredesiniz?",
    "Bu bir cümledir.",
    "Sürücüsüz araçlar sigorta yükümlülüğünü üreticilere kaydırıyor.",
    "San Francisco kaldırımda kurye robotları yasaklayabilir.",
    "Londra İngiltere'nin başkentidir.",
    "Türkiye'nin başkenti neresi?",
    "Bakanlar Kurulu 180 günlük eylem planını açıkladı.",
    "Merkez Bankası, beklentiler doğrultusunda faizlerde değişikliğe gitmedi."
]
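A possible way to run these sentences, assuming a blank Turkish pipeline (no trained Turkish model is shipped, so this only exercises the tokenizer):

import spacy
from spacy.lang.tr.examples import sentences

nlp = spacy.blank('tr')
for doc in nlp.pipe(sentences):
    print([token.text for token in doc])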

@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
# Thirteen, fifteen etc. are written separately: on üç
_num_words = ['bir', 'iki', 'üç', 'dört', 'beş', 'altı', 'yedi', 'sekiz',
'dokuz', 'on', 'yirmi', 'otuz', 'kırk', 'elli', 'altmış',
'yetmiş', 'seksen', 'doksan', 'yüz', 'bin', 'milyon',
'milyar', 'katrilyon', 'kentilyon']
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
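A quick, standalone sanity check of the like_num hook defined above (no pipeline needed; the inputs are arbitrary examples):

print(like_num('12'))      # True - plain digits
print(like_num('3/4'))     # True - simple fractions
print(like_num('doksan'))  # True - a spelled-out number word
print(like_num('kitap'))   # False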

@ -10,16 +10,12 @@ acep
adamakıllı adamakıllı
adeta adeta
ait ait
altmýþ
altmış
altý
altı
ama ama
amma amma
anca anca
ancak ancak
arada arada
artık
aslında aslında
aynen aynen
ayrıca ayrıca
@ -29,46 +25,82 @@ açıkçası
bana bana
bari bari
bazen bazen
bazý
bazı bazı
bazısı
bazısına
bazısında
bazısından
bazısını
bazısının
başkası başkası
başkasına
başkasında
başkasından
başkasını
başkasının
başka
belki belki
ben ben
bende
benden benden
beni beni
benim benim
beri beri
beriki beriki
berikinin
berikiyi
berisi
bilcümle bilcümle
bile bile
bin
binaen binaen
binaenaleyh binaenaleyh
bir
biraz biraz
birazdan birazdan
birbiri birbiri
birbirine
birbirini
birbirinin
birbirinde
birbirinden
birden birden
birdenbire birdenbire
biri biri
birine
birini
birinin
birinde
birinden
birice birice
birileri birileri
birilerinde
birilerinden
birilerine
birilerini
birilerinin
birisi birisi
birisine
birisini
birisinin
birisinde
birisinden
birkaç birkaç
birkaçı birkaçı
birkaçına
birkaçını
birkaçının
birkaçında
birkaçından
birkez birkez
birlikte birlikte
birçok birçok
birçoğu birçoğu
birçoğuna
birçoğunda
birçoğundan
birçoğunu
birçoğunun
birşey birşey
birşeyi birşeyi
birţey
bitevi bitevi
biteviye biteviye
bittabi bittabi
@ -96,6 +128,11 @@ buracıkta
burada burada
buradan buradan
burası burası
burasına
burasını
burasının
burasında
burasından
böyle böyle
böylece böylece
böylecene böylecene
@ -106,8 +143,34 @@ büsbütün
bütün bütün
cuk cuk
cümlesi cümlesi
cümlesine
cümlesini
cümlesinin
cümlesinden
cümlemize
cümlemizi
cümlemizden
çabuk
çabukça
çeşitli
çok
çokları
çoklarınca
çokluk
çoklukla
çokça
çoğu
çoğun
çoğunca
çoğunda
çoğundan
çoğunlukla
çoğunu
çoğunun
çünkü
da da
daha daha
dahası
dahi dahi
dahil dahil
dahilen dahilen
@ -124,19 +187,17 @@ denli
derakap derakap
derhal derhal
derken derken
deđil
değil değil
değin değin
diye diye
diđer
diğer diğer
diğeri diğeri
diğerine
diğerini
diğerinden
dolayı dolayı
dolayısıyla dolayısıyla
doğru doğru
dört
edecek edecek
eden eden
ederek ederek
@ -146,7 +207,6 @@ edilmesi
ediyor ediyor
elbet elbet
elbette elbette
elli
emme emme
en en
enikonu enikonu
@ -168,10 +228,10 @@ evvelce
evvelden evvelden
evvelemirde evvelemirde
evveli evveli
eđer
eğer eğer
fakat fakat
filanca filanca
filancanın
gah gah
gayet gayet
gayetle gayetle
@ -197,6 +257,10 @@ haliyle
handiyse handiyse
hangi hangi
hangisi hangisi
hangisine
hangisine
hangisinde
hangisinden
hani hani
hariç hariç
hasebiyle hasebiyle
@ -207,17 +271,27 @@ hem
henüz henüz
hep hep
hepsi hepsi
hepsini
hepsinin
hepsinde
hepsinden
her her
herhangi herhangi
herkes herkes
herkesi
herkesin herkesin
herkesten
hiç hiç
hiçbir hiçbir
hiçbiri hiçbiri
hiçbirine
hiçbirini
hiçbirinin
hiçbirinde
hiçbirinden
hoş hoş
hulasaten hulasaten
iken iken
iki
ila ila
ile ile
ilen ilen
@ -240,43 +314,55 @@ iyicene
için için
işte işte
iţte
kadar kadar
kaffesi kaffesi
kah kah
kala kala
kanımca
karşın karşın
katrilyon
kaynak kaynak
kaçı kaçı
kaçına
kaçında
kaçından
kaçını
kaçının
kelli kelli
kendi kendi
kendilerinde
kendilerinden
kendilerine kendilerine
kendilerini
kendilerinin
kendini kendini
kendisi kendisi
kendisinde
kendisinden
kendisine kendisine
kendisini kendisini
kendisinin
kere kere
kez kez
keza keza
kezalik kezalik
keşke keşke
keţke
ki ki
kim kim
kimden kimden
kime kime
kimi kimi
kiminin
kimisi kimisi
kimisinde
kimisinden
kimisine
kimisinin
kimse kimse
kimsecik kimsecik
kimsecikler kimsecikler
külliyen külliyen
kýrk
kýsaca
kırk
kısaca kısaca
kısacası
lakin lakin
leh leh
lütfen lütfen
@ -289,13 +375,10 @@ međer
meğer meğer
meğerki meğerki
meğerse meğerse
milyar
milyon
mu mu
mı mı
mi
nasıl nasıl
nasılsa nasılsa
nazaran nazaran
@ -304,6 +387,8 @@ ne
neden neden
nedeniyle nedeniyle
nedenle nedenle
nedenler
nedenlerden
nedense nedense
nerde nerde
nerden nerden
@ -332,32 +417,27 @@ olduklarını
oldukça oldukça
olduğu olduğu
olduğunu olduğunu
olmadı
olmadığı
olmak olmak
olması olması
olmayan
olmaz
olsa olsa
olsun olsun
olup olup
olur olur
olursa olursa
oluyor oluyor
on
ona ona
onca onca
onculayın onculayın
onda onda
ondan ondan
onlar onlar
onlara
onlardan onlardan
onlari
onlarýn
onları onları
onların onların
onu onu
onun onun
ora
oracık oracık
oracıkta oracıkta
orada orada
@ -365,9 +445,26 @@ oradan
oranca oranca
oranla oranla
oraya oraya
otuz
oysa oysa
oysaki oysaki
öbür
öbürkü
öbürü
öbüründe
öbüründen
öbürüne
öbürünü
önce
önceden
önceleri
öncelikle
öteki
ötekisi
öyle
öylece
öylelikle
öylemesine
öz
pek pek
pekala pekala
peki peki
@ -379,8 +476,6 @@ sahi
sahiden sahiden
sana sana
sanki sanki
sekiz
seksen
sen sen
senden senden
seni seni
@ -393,6 +488,27 @@ sonra
sonradan sonradan
sonraları sonraları
sonunda sonunda
şayet
şey
şeyden
şeyi
şeyler
şu
şuna
şuncacık
şunda
şundan
şunlar
şunları
şunların
şunu
şunun
şura
şuracık
şuracıkta
şurası
şöyle
şimdi
tabii tabii
tam tam
tamam tamam
@ -400,8 +516,8 @@ tamamen
tamamıyla tamamıyla
tarafından tarafından
tek tek
trilyon
tüm tüm
üzere
var var
vardı vardı
vasıtasıyla vasıtasıyla
@ -429,84 +545,16 @@ yaptığını
yapılan yapılan
yapılması yapılması
yapıyor yapıyor
yedi
yeniden yeniden
yenilerde yenilerde
yerine yerine
yetmiþ
yetmiş
yetmiţ
yine yine
yirmi
yok yok
yoksa yoksa
yoluyla yoluyla
yüz
yüzünden yüzünden
zarfında zarfında
zaten zaten
zati zati
zira zira
çabuk
çabukça
çeşitli
çok
çokları
çoklarınca
çokluk
çoklukla
çokça
çoğu
çoğun
çoğunca
çoğunlukla
çünkü
öbür
öbürkü
öbürü
önce
önceden
önceleri
öncelikle
öteki
ötekisi
öyle
öylece
öylelikle
öylemesine
öz
üzere
üç
þey
þeyden
þeyi
þeyler
þu
þuna
þunda
þundan
þunu
şayet
şey
şeyden
şeyi
şeyler
şu
şuna
şuncacık
şunda
şundan
şunlar
şunları
şunu
şunun
şura
şuracık
şuracıkta
şurası
şöyle
ţayet
ţimdi
ţu
ţöyle
""".split())

@ -3,11 +3,6 @@ from __future__ import unicode_literals
from ...symbols import ORTH, NORM
# These exceptions are mostly for example purposes hoping that Turkish
# speakers can contribute in the future! Source of copy-pasted examples:
# https://en.wiktionary.org/wiki/Category:Turkish_language
_exc = {
    "sağol": [
        {ORTH: "sağ"},
@ -16,11 +11,112 @@ _exc = {
for exc_data in [
    {ORTH: "A.B.D.", NORM: "Amerika Birleşik Devletleri"},
{ORTH: "Alb.", NORM: "Albay"},
{ORTH: "Ar.Gör.", NORM: "Araştırma Görevlisi"},
{ORTH: "Arş.Gör.", NORM: "Araştırma Görevlisi"},
{ORTH: "Asb.", NORM: "Astsubay"},
{ORTH: "Astsb.", NORM: "Astsubay"},
{ORTH: "As.İz.", NORM: "Askeri İnzibat"},
{ORTH: "Atğm", NORM: "Asteğmen"},
{ORTH: "Av.", NORM: "Avukat"},
{ORTH: "Apt.", NORM: "Apartmanı"},
{ORTH: "Bçvş.", NORM: "Başçavuş"},
{ORTH: "bk.", NORM: "bakınız"},
{ORTH: "bknz.", NORM: "bakınız"},
{ORTH: "Bnb.", NORM: "Binbaşı"},
{ORTH: "bnb.", NORM: "binbaşı"},
{ORTH: "Böl.", NORM: "Bölümü"},
{ORTH: "Bşk.", NORM: "Başkanlığı"},
{ORTH: "Bştbp.", NORM: "Baştabip"},
{ORTH: "Bul.", NORM: "Bulvarı"},
{ORTH: "Cad.", NORM: "Caddesi"},
{ORTH: "çev.", NORM: "çeviren"},
{ORTH: "Çvş.", NORM: "Çavuş"},
{ORTH: "dak.", NORM: "dakika"},
{ORTH: "dk.", NORM: "dakika"},
{ORTH: "Doç.", NORM: "Doçent"},
{ORTH: "doğ.", NORM: "doğum tarihi"},
{ORTH: "drl.", NORM: "derleyen"},
{ORTH: "Dz.", NORM: "Deniz"},
{ORTH: "Dz.K.K.lığı", NORM: "Deniz Kuvvetleri Komutanlığı"},
{ORTH: "Dz.Kuv.", NORM: "Deniz Kuvvetleri"},
{ORTH: "Dz.Kuv.K.", NORM: "Deniz Kuvvetleri Komutanlığı"},
{ORTH: "dzl.", NORM: "düzenleyen"},
{ORTH: "Ecz.", NORM: "Eczanesi"},
{ORTH: "ekon.", NORM: "ekonomi"},
{ORTH: "Fak.", NORM: "Fakültesi"},
{ORTH: "Gn.", NORM: "Genel"},
{ORTH: "Gnkur.", NORM: "Genelkurmay"},
{ORTH: "Gn.Kur.", NORM: "Genelkurmay"},
{ORTH: "gr.", NORM: "gram"},
{ORTH: "Hst.", NORM: "Hastanesi"},
{ORTH: "Hs.Uzm.", NORM: "Hesap Uzmanı"},
{ORTH: "huk.", NORM: "hukuk"},
{ORTH: "Hv.", NORM: "Hava"},
{ORTH: "Hv.K.K.lığı", NORM: "Hava Kuvvetleri Komutanlığı"},
{ORTH: "Hv.Kuv.", NORM: "Hava Kuvvetleri"},
{ORTH: "Hv.Kuv.K.", NORM: "Hava Kuvvetleri Komutanlığı"},
{ORTH: "Hz.", NORM: "Hazreti"},
{ORTH: "Hz.Öz.", NORM: "Hizmete Özel"},
{ORTH: "İng.", NORM: "İngilizce"},
{ORTH: "Jeol.", NORM: "Jeoloji"},
{ORTH: "jeol.", NORM: "jeoloji"},
{ORTH: "Korg.", NORM: "Korgeneral"},
{ORTH: "Kur.", NORM: "Kurmay"},
{ORTH: "Kur.Bşk.", NORM: "Kurmay Başkanı"},
{ORTH: "Kuv.", NORM: "Kuvvetleri"},
{ORTH: "Ltd.", NORM: "Limited"},
{ORTH: "Mah.", NORM: "Mahallesi"},
{ORTH: "mah.", NORM: "mahallesi"},
{ORTH: "max.", NORM: "maksimum"},
{ORTH: "min.", NORM: "minimum"},
{ORTH: "Müh.", NORM: "Mühendisliği"},
{ORTH: "müh.", NORM: "mühendisliği"},
{ORTH: "MÖ.", NORM: "Milattan Önce"},
{ORTH: "Onb.", NORM: "Onbaşı"},
{ORTH: "Ord.", NORM: "Ordinaryüs"},
{ORTH: "Org.", NORM: "Orgeneral"},
{ORTH: "Ped.", NORM: "Pedagoji"},
{ORTH: "Prof.", NORM: "Profesör"},
{ORTH: "Sb.", NORM: "Subay"},
{ORTH: "Sn.", NORM: "Sayın"},
{ORTH: "sn.", NORM: "saniye"},
{ORTH: "Sok.", NORM: "Sokak"},
{ORTH: "Şb.", NORM: "Şube"},
{ORTH: "Şti.", NORM: "Şirketi"},
{ORTH: "Tbp.", NORM: "Tabip"},
{ORTH: "T.C.", NORM: "Türkiye Cumhuriyeti"},
{ORTH: "Tel.", NORM: "Telefon"},
{ORTH: "tel.", NORM: "telefon"},
{ORTH: "telg.", NORM: "telgraf"},
{ORTH: "Tğm.", NORM: "Teğmen"},
{ORTH: "tğm.", NORM: "teğmen"},
{ORTH: "tic.", NORM: "ticaret"},
{ORTH: "Tug.", NORM: "Tugay"},
{ORTH: "Tuğg.", NORM: "Tuğgeneral"},
{ORTH: "Tümg.", NORM: "Tümgeneral"},
{ORTH: "Uzm.", NORM: "Uzman"},
{ORTH: "Üçvş.", NORM: "Üstçavuş"},
{ORTH: "Üni.", NORM: "Üniversitesi"},
{ORTH: "Ütğm.", NORM: "Üsteğmen"},
{ORTH: "vb.", NORM: "ve benzeri"},
{ORTH: "vs.", NORM: "vesaire"},
{ORTH: "Yard.", NORM: "Yardımcı"},
{ORTH: "Yar.", NORM: "Yardımcı"},
{ORTH: "Yd.Sb.", NORM: "Yedek Subay"},
{ORTH: "Yard.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Yar.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Yb.", NORM: "Yarbay"},
{ORTH: "Yrd.", NORM: "Yardımcı"},
{ORTH: "Yrd.Doç.", NORM: "Yardımcı Doçent"},
{ORTH: "Y.Müh.", NORM: "Yüksek mühendis"},
{ORTH: "Y.Mim.", NORM: "Yüksek mimar"}]:
    _exc[exc_data[ORTH]] = [exc_data]


for orth in [
        "Dr.", "yy."]:
    _exc[orth] = [{ORTH: orth}]
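A rough behavioural check of the exceptions above, assuming a blank Turkish pipeline picks them up: abbreviations such as "Dr." and "A.B.D." should survive tokenization as single tokens instead of being split at the period.

import spacy

nlp = spacy.blank('tr')
doc = nlp('Dr. Ayşe A.B.D. merkezli bir şirkette çalışıyor.')
print([token.text for token in doc])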

@ -319,7 +319,7 @@ cdef class ArcEager(TransitionSystem):
            (SHIFT, ['']),
            (REDUCE, ['']),
            (RIGHT, []),
            (LEFT, ['subtok']),
            (BREAK, ['ROOT']))
        ))
seen_actions = set() seen_actions = set()

@ -477,14 +477,15 @@ cdef class Parser:
        free(vectors)
        free(scores)

    def beam_parse(self, docs, int beam_width=3, float beam_density=0.001,
                   float drop=0.):
        cdef Beam beam
        cdef np.ndarray scores
        cdef Doc doc
        cdef int nr_class = self.moves.n_moves
        cuda_stream = util.get_cuda_stream()
        (tokvecs, bp_tokvecs), state2vec, vec2scores = self.get_batch_model(
            docs, cuda_stream, drop)
        cdef int offset = 0
        cdef int j = 0
        cdef int k
@ -523,8 +524,8 @@ cdef class Parser:
                    n_states += 1
            if n_states == 0:
                break
            vectors, _ = state2vec.begin_update(token_ids[:n_states], drop)
            scores, _ = vec2scores.begin_update(vectors, drop=drop)
            c_scores = <float*>scores.data
            for beam in todo:
                for i in range(beam.size):
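The beam_parse change just threads a dropout rate through to the step model. A hedged sketch of calling it (beam_parse is an internal Parser method; the model name, texts and dropout value here are only placeholders):

import spacy

nlp = spacy.load('en_core_web_sm')   # assumption: any pipeline with a parser
parser = nlp.get_pipe('parser')
docs = list(nlp.pipe(['This is a test.', 'Beam search is slower but broader.']))
beams = parser.beam_parse(docs, beam_width=8, beam_density=0.001, drop=0.0)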

@ -191,9 +191,12 @@ def _filter_labels(gold_tuples, cutoff, freqs):
    for raw_text, sents in gold_tuples:
        filtered_sents = []
        for (ids, words, tags, heads, labels, iob), ctnts in sents:
            filtered_labels = []
            for label in labels:
                if is_decorated(label) and freqs.get(label, 0) < cutoff:
                    filtered_labels.append(decompose(label)[0])
                else:
                    filtered_labels.append(label)
            filtered_sents.append(
                ((ids, words, tags, heads, filtered_labels, iob), ctnts))
        filtered.append((raw_text, filtered_sents))
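Spelled out, the new rule only collapses decorated (pseudo-projective) labels whose frequency falls below the cutoff, instead of collapsing every rare label. A standalone sketch of the same rule, reusing the nonproj helpers (assumed importable here; the function name is made up for illustration):

from spacy.syntax.nonproj import decompose, is_decorated

def filter_rare_decorated(labels, freqs, cutoff):
    filtered = []
    for label in labels:
        if is_decorated(label) and freqs.get(label, 0) < cutoff:
            filtered.append(decompose(label)[0])   # e.g. 'dobj||conj' -> 'dobj'
        else:
            filtered.append(label)
    return filtered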

@ -0,0 +1,74 @@
from ...vocab import Vocab
from ...pipeline import DependencyParser
from ...tokens import Doc
from ...gold import GoldParse
from ...syntax.nonproj import projectivize
annot_tuples = [
(0, 'When', 'WRB', 11, 'advmod', 'O'),
(1, 'Walter', 'NNP', 2, 'compound', 'B-PERSON'),
(2, 'Rodgers', 'NNP', 11, 'nsubj', 'L-PERSON'),
(3, ',', ',', 2, 'punct', 'O'),
(4, 'our', 'PRP$', 6, 'poss', 'O'),
(5, 'embedded', 'VBN', 6, 'amod', 'O'),
(6, 'reporter', 'NN', 2, 'appos', 'O'),
(7, 'with', 'IN', 6, 'prep', 'O'),
(8, 'the', 'DT', 10, 'det', 'B-ORG'),
(9, '3rd', 'NNP', 10, 'compound', 'I-ORG'),
(10, 'Cavalry', 'NNP', 7, 'pobj', 'L-ORG'),
(11, 'says', 'VBZ', 44, 'advcl', 'O'),
(12, 'three', 'CD', 13, 'nummod', 'U-CARDINAL'),
(13, 'battalions', 'NNS', 16, 'nsubj', 'O'),
(14, 'of', 'IN', 13, 'prep', 'O'),
(15, 'troops', 'NNS', 14, 'pobj', 'O'),
(16, 'are', 'VBP', 11, 'ccomp', 'O'),
(17, 'on', 'IN', 16, 'prep', 'O'),
(18, 'the', 'DT', 19, 'det', 'O'),
(19, 'ground', 'NN', 17, 'pobj', 'O'),
(20, ',', ',', 17, 'punct', 'O'),
(21, 'inside', 'IN', 17, 'prep', 'O'),
(22, 'Baghdad', 'NNP', 21, 'pobj', 'U-GPE'),
(23, 'itself', 'PRP', 22, 'appos', 'O'),
(24, ',', ',', 16, 'punct', 'O'),
(25, 'have', 'VBP', 26, 'aux', 'O'),
(26, 'taken', 'VBN', 16, 'dep', 'O'),
(27, 'up', 'RP', 26, 'prt', 'O'),
(28, 'positions', 'NNS', 26, 'dobj', 'O'),
(29, 'they', 'PRP', 31, 'nsubj', 'O'),
(30, "'re", 'VBP', 31, 'aux', 'O'),
(31, 'going', 'VBG', 26, 'parataxis', 'O'),
(32, 'to', 'TO', 33, 'aux', 'O'),
(33, 'spend', 'VB', 31, 'xcomp', 'O'),
(34, 'the', 'DT', 35, 'det', 'B-TIME'),
(35, 'night', 'NN', 33, 'dobj', 'L-TIME'),
(36, 'there', 'RB', 33, 'advmod', 'O'),
(37, 'presumably', 'RB', 33, 'advmod', 'O'),
(38, ',', ',', 44, 'punct', 'O'),
(39, 'how', 'WRB', 40, 'advmod', 'O'),
(40, 'many', 'JJ', 41, 'amod', 'O'),
(41, 'soldiers', 'NNS', 44, 'pobj', 'O'),
(42, 'are', 'VBP', 44, 'aux', 'O'),
(43, 'we', 'PRP', 44, 'nsubj', 'O'),
(44, 'talking', 'VBG', 44, 'ROOT', 'O'),
(45, 'about', 'IN', 44, 'prep', 'O'),
(46, 'right', 'RB', 47, 'advmod', 'O'),
(47, 'now', 'RB', 44, 'advmod', 'O'),
(48, '?', '.', 44, 'punct', 'O')]
def test_get_oracle_actions():
doc = Doc(Vocab(), words=[t[1] for t in annot_tuples])
parser = DependencyParser(doc.vocab)
parser.moves.add_action(0, '')
parser.moves.add_action(1, '')
parser.moves.add_action(1, '')
parser.moves.add_action(4, 'ROOT')
for i, (id_, word, tag, head, dep, ent) in enumerate(annot_tuples):
if head > i:
parser.moves.add_action(2, dep)
elif head < i:
parser.moves.add_action(3, dep)
ids, words, tags, heads, deps, ents = zip(*annot_tuples)
heads, deps = projectivize(heads, deps)
gold = GoldParse(doc, words=words, tags=tags, heads=heads, deps=deps)
parser.moves.preprocess_gold(gold)
actions = parser.moves.get_oracle_sequence(doc, gold)

@ -294,6 +294,7 @@ cdef class Span:
        cdef int i
        if self.doc.is_parsed:
            root = &self.doc.c[self.start]
            n = 0
            while root.head != 0:
                root += root.head
                n += 1
@ -307,8 +308,10 @@ cdef class Span:
            start += -1
        # find end of the sentence
        end = self.end
        n = 0
        while end < self.doc.length and self.doc.c[end].sent_start != 1:
            end += 1
            n += 1
            if n >= self.doc.length:
                break
        #

@ -279,8 +279,8 @@ cdef class Token:
        """
        def __get__(self):
            if self.c.lemma == 0:
                lemma_ = self.vocab.morphology.lemmatizer.lookup(self.orth_)
                return self.vocab.strings[lemma_]
            else:
                return self.c.lemma
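The fix above keeps the lemma hash consistent with the lemma string, i.e. the two accessors round-trip through the shared StringStore. A small hedged check (blank English is just an assumed stand-in for any pipeline):

import spacy

nlp = spacy.blank('en')
doc = nlp('cats')
token = doc[0]
assert token.lemma == doc.vocab.strings[token.lemma_]  # lemma is the hash of lemma_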

@ -451,7 +451,7 @@ def itershuffle(iterable, bufsize=1000):
    try:
        while True:
            for i in range(random.randint(1, bufsize-len(buf))):
                buf.append(next(iterable))
            random.shuffle(buf)
            for i in range(random.randint(1, bufsize)):
                if buf:
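itershuffle buffers items from a stream and yields them in pseudo-random order; the change simply swaps Python 2's iterable.next() for the built-in next() so it also runs on Python 3. A small usage sketch (pulling only a few items; the values are arbitrary):

from spacy.util import itershuffle

stream = iter(range(100))
gen = itershuffle(stream, bufsize=10)
print([next(gen) for _ in range(5)])  # a handful of items, in shuffled order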

@ -15,11 +15,8 @@ from .compat import basestring_, path2str
from . import util


def unpickle_vectors(bytes_data):
    return Vectors().from_bytes(bytes_data)


cdef class Vectors:
@ -86,8 +83,7 @@ cdef class Vectors:
        return len(self.key2row)

    def __reduce__(self):
        return (unpickle_vectors, (self.to_bytes(),))

    def __getitem__(self, key):
        """Get a vector by key. If the key is not found, a KeyError is raised.

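With pickling now routed through to_bytes()/from_bytes(), a pickle round trip should reproduce the vectors table. A hedged sketch using an assumed pipeline's (possibly empty) vectors rather than hand-built data:

import pickle
import spacy

nlp = spacy.blank('en')              # assumption: any vocab will do
vectors = nlp.vocab.vectors
restored = pickle.loads(pickle.dumps(vectors))
print(len(restored) == len(vectors), restored.shape == vectors.shape)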
Binary file not shown
@ -76,11 +76,13 @@
    },
    "MODEL_LICENSES": {
        "CC BY 4.0": "https://creativecommons.org/licenses/by/4.0/",
        "CC BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
        "CC BY-SA 3.0": "https://creativecommons.org/licenses/by-sa/3.0/",
        "CC BY-SA 4.0": "https://creativecommons.org/licenses/by-sa/4.0/",
        "CC BY-NC": "https://creativecommons.org/licenses/by-nc/3.0/",
        "CC BY-NC 3.0": "https://creativecommons.org/licenses/by-nc/3.0/",
        "CC-BY-NC-SA 3.0": "https://creativecommons.org/licenses/by-nc-sa/3.0/",
        "GPL": "https://www.gnu.org/licenses/gpl.html",
        "LGPL": "https://www.gnu.org/licenses/lgpl.html"
    },

@ -68,7 +68,7 @@ p
    +item #[strong spaCy is not research software].
        | It's built on the latest research, but it's designed to get
        | things done. This leads to fairly different design decisions than
        | #[+a("https://github.com/nltk/nltk") NLTK]
        | or #[+a("https://stanfordnlp.github.io/CoreNLP/") CoreNLP], which were
        | created as platforms for teaching and research. The main difference
        | is that spaCy is integrated and opinionated. spaCy tries to avoid asking