mirror of https://github.com/explosion/spaCy.git
synced 2025-10-02 09:56:39 +03:00

Merge master

This commit is contained in: commit ab3d860686
11  .buildkite/train.yml  Normal file
@@ -0,0 +1,11 @@
steps:
  -
    command: "fab env clean make test wheel"
    label: ":dizzy: :python:"
    artifact_paths: "dist/*.whl"
  - wait
  - trigger: "spacy-train-from-wheel"
    label: ":dizzy: :train:"
    build:
      env:
        SPACY_VERSION: "{$SPACY_VERSION}"
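The final step forwards `SPACY_VERSION` into the triggered pipeline's build environment. A minimal sketch of how a downstream script could pick it up (the fallback value is an assumption for local runs, not part of the pipeline):

```python
import os

# SPACY_VERSION is injected by the triggering Buildkite step; the
# "unknown" fallback here is only for running this sketch locally.
spacy_version = os.environ.get("SPACY_VERSION", "unknown")
print("training against spaCy", spacy_version)
```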
106  .github/contributors/alldefector.md  vendored  Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” in one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [x] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Feng Niu             |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | Feb 21, 2018         |
| GitHub username                | alldefector          |
| Website (optional)             |                      |
106  .github/contributors/willismonroe.md  vendored  Normal file
@@ -0,0 +1,106 @@
(Same standard spaCy contributor agreement text as the file above, signed with
the following details.)

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Willis Monroe        |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2018-3-5             |
| GitHub username                | willismonroe         |
| Website (optional)             |                      |
@@ -182,7 +182,7 @@ If you've made a contribution to spaCy, you should fill in the
 [spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that
 your contribution can be used across the project. If you agree to be bound by
 the terms of the agreement, fill in the [template](.github/CONTRIBUTOR_AGREEMENT.md)
-and include it with your pull request, or sumit it separately to
+and include it with your pull request, or submit it separately to
 [`.github/contributors/`](/.github/contributors). The name of the file should be
 your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
@@ -28,8 +28,10 @@ import cytoolz
 import conll17_ud_eval
 
 import spacy.lang.zh
+import spacy.lang.ja
 
 spacy.lang.zh.Chinese.Defaults.use_jieba = False
+spacy.lang.ja.Japanese.Defaults.use_janome = False
 
 random.seed(0)
 numpy.random.seed(0)
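Both added lines flip a class-level flag on a language's `Defaults` class, so the external tokenizer stays disabled for every instance created afterwards. The pattern in isolation, with stand-in classes rather than spaCy's real ones:

```python
class Defaults:
    # stand-in for spacy.lang.ja.Japanese.Defaults
    use_janome = True

class Japanese:
    # stand-in for spacy.lang.ja.Japanese
    Defaults = Defaults

# Mutating the class attribute affects all future instances at once.
Japanese.Defaults.use_janome = False
print(Japanese.Defaults.use_janome)  # False
```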
@@ -280,6 +282,30 @@ def print_progress(itn, losses, ud_scores):
     ))
     print(tpl.format(itn, **fields))
 
+#def get_sent_conllu(sent, sent_id):
+#    lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
+
+def get_token_conllu(token, i):
+    if token._.begins_fused:
+        n = 1
+        while token.nbor(n)._.inside_fused:
+            n += 1
+        id_ = '%d-%d' % (i, i+n)
+        lines = [id_, token.text, '_', '_', '_', '_', '_', '_', '_', '_']
+    else:
+        lines = []
+    if token.head.i == token.i:
+        head = 0
+    else:
+        head = i + (token.head.i - token.i) + 1
+    fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
+              str(head), token.dep_.lower(), '_', '_']
+    lines.append('\t'.join(fields))
+    return '\n'.join(lines)
+
+Token.set_extension('get_conllu_lines', method=get_token_conllu)
+Token.set_extension('begins_fused', default=False)
+Token.set_extension('inside_fused', default=False)
+
 ##################
 # Initialization #
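`get_token_conllu` emits one tab-separated, ten-column CoNLL-U row per token (preceded by an `i-j` range row when the token begins a fused span). A spaCy-free sketch of the row layout, with invented example values:

```python
def conllu_row(i, text, lemma, pos, tag, head, dep):
    # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    fields = [str(i + 1), text, lemma, pos, tag, '_', str(head), dep.lower(), '_', '_']
    return '\t'.join(fields)

row = conllu_row(0, 'Dogs', 'dog', 'NOUN', 'NNS', 2, 'nsubj')
print(row)  # 1	Dogs	dog	NOUN	NNS	_	2	nsubj	_	_
```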
83  fabfile.py  vendored
@@ -1,49 +1,92 @@
 # coding: utf-8
 from __future__ import unicode_literals, print_function
 
+import contextlib
+from pathlib import Path
 from fabric.api import local, lcd, env, settings, prefix
-from fabtools.python import virtualenv
 from os import path, environ
+import shutil
 
 
 PWD = path.dirname(__file__)
 ENV = environ['VENV_DIR'] if 'VENV_DIR' in environ else '.env'
-VENV_DIR = path.join(PWD, ENV)
+VENV_DIR = Path(PWD) / ENV
 
 
-def env(lang='python2.7'):
-    if path.exists(VENV_DIR):
+@contextlib.contextmanager
+def virtualenv(name, create=False, python='/usr/bin/python3.6'):
+    python = Path(python).resolve()
+    env_path = VENV_DIR
+    if create:
+        if env_path.exists():
+            shutil.rmtree(str(env_path))
+        local('{python} -m venv {env_path}'.format(python=python, env_path=VENV_DIR))
+    def wrapped_local(cmd, env_vars=[], capture=False, direct=False):
+        return local('source {}/bin/activate && {}'.format(env_path, cmd),
+                     shell='/bin/bash', capture=False)
+    yield wrapped_local
+
+
+def env(lang='python3.6'):
+    if VENV_DIR.exists():
         local('rm -rf {env}'.format(env=VENV_DIR))
-    local('pip install virtualenv')
-    local('python -m virtualenv -p {lang} {env}'.format(lang=lang, env=VENV_DIR))
+    if lang.startswith('python3'):
+        local('{lang} -m venv {env}'.format(lang=lang, env=VENV_DIR))
+    else:
+        local('{lang} -m pip install virtualenv --no-cache-dir'.format(lang=lang))
+        local('{lang} -m virtualenv {env} --no-cache-dir'.format(lang=lang, env=VENV_DIR))
+    with virtualenv(VENV_DIR) as venv_local:
+        print(venv_local('python --version', capture=True))
+        venv_local('pip install --upgrade setuptools --no-cache-dir')
+        venv_local('pip install pytest --no-cache-dir')
+        venv_local('pip install wheel --no-cache-dir')
+        venv_local('pip install -r requirements.txt --no-cache-dir')
+        venv_local('pip install pex --no-cache-dir')
 
 
 def install():
-    with virtualenv(VENV_DIR):
-        local('pip install --upgrade setuptools')
-        local('pip install dist/*.tar.gz')
-        local('pip install pytest')
+    with virtualenv(VENV_DIR) as venv_local:
+        venv_local('pip install dist/*.tar.gz')
 
 
 def make():
-    with virtualenv(VENV_DIR):
-        with lcd(path.dirname(__file__)):
-            local('pip install cython')
-            local('pip install murmurhash')
-            local('pip install -r requirements.txt')
-            local('python setup.py build_ext --inplace')
+    with lcd(path.dirname(__file__)):
+        local('export PYTHONPATH=`pwd` && source .env/bin/activate && python setup.py build_ext --inplace',
+              shell='/bin/bash')
 
 
 def sdist():
-    with virtualenv(VENV_DIR):
+    with virtualenv(VENV_DIR) as venv_local:
         with lcd(path.dirname(__file__)):
             local('python setup.py sdist')
+
+
+def wheel():
+    with virtualenv(VENV_DIR) as venv_local:
+        with lcd(path.dirname(__file__)):
+            venv_local('python setup.py bdist_wheel')
+
+
+def pex():
+    with virtualenv(VENV_DIR) as venv_local:
+        with lcd(path.dirname(__file__)):
+            sha = local('git rev-parse --short HEAD', capture=True)
+            venv_local('pex dist/*.whl -e spacy -o dist/spacy-%s.pex' % sha,
+                       direct=True)
 
 
 def clean():
     with lcd(path.dirname(__file__)):
-        local('python setup.py clean --all')
+        local('rm -f dist/*.whl')
+        local('rm -f dist/*.pex')
+        with virtualenv(VENV_DIR) as venv_local:
+            venv_local('python setup.py clean --all')
 
 
 def test():
-    with virtualenv(VENV_DIR):
+    with virtualenv(VENV_DIR) as venv_local:
         with lcd(path.dirname(__file__)):
-            local('py.test -x spacy/tests')
+            venv_local('pytest -x spacy/tests')
+
+
+def train():
+    args = environ.get('SPACY_TRAIN_ARGS', '')
+    with virtualenv(VENV_DIR) as venv_local:
+        venv_local('spacy train {args}'.format(args=args))
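The rewritten fabfile replaces `fabtools.python.virtualenv` with a small `contextlib.contextmanager` that yields a command runner prefixed by the venv activation. The same idea with plain `subprocess` instead of Fabric's `local` (paths here are illustrative):

```python
import contextlib
import subprocess

@contextlib.contextmanager
def venv_shell(env_path):
    # Yield a runner that prefixes every command with the venv activation,
    # mirroring the wrapped_local helper in the fabfile above.
    def run(cmd):
        full = 'source {}/bin/activate && {}'.format(env_path, cmd)
        return subprocess.run(full, shell=True, executable='/bin/bash')
    yield run

# usage sketch:
# with venv_shell('.env') as run:
#     run('pytest -x spacy/tests')
```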
@@ -8,6 +8,7 @@ if __name__ == '__main__':
     import sys
     from spacy.cli import download, link, info, package, train, convert
     from spacy.cli import vocab, init_model, profile, evaluate, validate
+    from spacy.cli import ud_train, ud_evaluate
     from spacy.util import prints
 
     commands = {
@@ -15,7 +16,9 @@ if __name__ == '__main__':
         'link': link,
         'info': info,
         'train': train,
+        'ud-train': ud_train,
         'evaluate': evaluate,
+        'ud-evaluate': ud_evaluate,
         'convert': convert,
         'package': package,
         'vocab': vocab,
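The `commands` dict maps the first CLI argument onto a handler, so `python -m spacy ud-train …` reaches `ud_train`. A minimal sketch of that dispatch pattern, with placeholder handlers standing in for the real spaCy CLI functions:

```python
def ud_train(*args):
    # placeholder for spacy.cli.ud_train
    return 'train called with {}'.format(args)

def ud_evaluate(*args):
    # placeholder for spacy.cli.ud_evaluate
    return 'evaluate called with {}'.format(args)

commands = {'ud-train': ud_train, 'ud-evaluate': ud_evaluate}

def dispatch(argv):
    # First argument selects the command; the rest are passed through.
    name, rest = argv[0], argv[1:]
    if name not in commands:
        raise SystemExit("Unknown command: %s" % name)
    return commands[name](*rest)

print(dispatch(['ud-train', 'en']))  # train called with ('en',)
```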
@@ -3,7 +3,7 @@
 # https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
 
 __title__ = 'spacy'
-__version__ = '2.1.0.dev1'
+__version__ = '2.1.0.dev3'
 __summary__ = 'Industrial-strength Natural Language Processing (NLP) with Python and Cython'
 __uri__ = 'https://spacy.io'
 __author__ = 'Explosion AI'
@@ -9,3 +9,5 @@ from .convert import convert
 from .vocab import make_vocab as vocab
 from .init_model import init_model
 from .validate import validate
+from .ud_train import main as ud_train
+from .conll17_ud_eval import main as ud_evaluate
570  spacy/cli/conll17_ud_eval.py  Normal file
@@ -0,0 +1,570 @@
#!/usr/bin/env python

# CoNLL 2017 UD Parsing evaluation script.
#
# Compatible with Python 2.7 and 3.2+, can be used either as a module
# or a standalone executable.
#
# Copyright 2017 Institute of Formal and Applied Linguistics (UFAL),
# Faculty of Mathematics and Physics, Charles University, Czech Republic.
#
# Changelog:
# - [02 Jan 2017] Version 0.9: Initial release
# - [25 Jan 2017] Version 0.9.1: Fix bug in LCS alignment computation
# - [10 Mar 2017] Version 1.0: Add documentation and test
#     Compare HEADs correctly using aligned words
#     Allow evaluation with erroneous spaces in forms
#     Compare forms in LCS case insensitively
#     Detect cycles and multiple root nodes
#     Compute AlignedAccuracy

# Command line usage
# ------------------
# conll17_ud_eval.py [-v] [-w weights_file] gold_conllu_file system_conllu_file
#
# - if no -v is given, only the CoNLL17 UD Shared Task evaluation LAS metric
#   is printed
# - if -v is given, several metrics are printed (as precision, recall, F1 score,
#   and in case the metric is computed on aligned words also accuracy on these):
#   - Tokens: how well do the gold tokens match system tokens
#   - Sentences: how well do the gold sentences match system sentences
#   - Words: how well can the gold words be aligned to system words
#   - UPOS: using aligned words, how well does UPOS match
#   - XPOS: using aligned words, how well does XPOS match
#   - Feats: using aligned words, how well does FEATS match
#   - AllTags: using aligned words, how well does UPOS+XPOS+FEATS match
#   - Lemmas: using aligned words, how well does LEMMA match
#   - UAS: using aligned words, how well does HEAD match
#   - LAS: using aligned words, how well does HEAD+DEPREL(ignoring subtypes) match
# - if weights_file is given (with lines containing deprel-weight pairs),
#   one more metric is shown:
#   - WeightedLAS: as LAS, but each deprel (ignoring subtypes) has different weight

# API usage
# ---------
# - load_conllu(file)
#   - loads CoNLL-U file from given file object to an internal representation
#   - the file object should return str on both Python 2 and Python 3
#   - raises UDError exception if the given file cannot be loaded
# - evaluate(gold_ud, system_ud)
#   - evaluate the given gold and system CoNLL-U files (loaded with load_conllu)
#   - raises UDError if the concatenated tokens of gold and system file do not match
#   - returns a dictionary with the metrics described above, each metric having
#     three fields: precision, recall and f1

# Description of token matching
# -----------------------------
# In order to match tokens of gold file and system file, we consider the text
# resulting from concatenation of gold tokens and text resulting from
# concatenation of system tokens. These texts should match -- if they do not,
# the evaluation fails.
#
# If the texts do match, every token is represented as a range in this original
# text, and tokens are equal only if their range is the same.

# Description of word matching
# ----------------------------
# When matching words of gold file and system file, we first match the tokens.
# The words which are also tokens are matched as tokens, but words in multi-word
# tokens have to be handled differently.
#
# To handle multi-word tokens, we start by finding "multi-word spans".
# A multi-word span is a span in the original text such that
# - it contains at least one multi-word token
# - all multi-word tokens in the span (considering both gold and system ones)
#   are completely inside the span (i.e., they do not "stick out")
# - the multi-word span is as small as possible
#
# For every multi-word span, we align the gold and system words completely
# inside this span using LCS on their FORMs. The words not intersecting
# (even partially) any multi-word span are then aligned as tokens.


from __future__ import division
from __future__ import print_function

import argparse
import io
import sys
import unittest

# CoNLL-U column names
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)
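Unpacking `range(10)` gives each of the ten CoNLL-U columns a named index. For example, indexing a split line (the sample row is invented for illustration):

```python
# Same constants as in the script above.
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)

line = "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
columns = line.split("\t")
print(columns[FORM], columns[HEAD], columns[DEPREL])  # Dogs 2 nsubj
```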
|
# UD Error is used when raising exceptions in this module
|
||||||
|
class UDError(Exception):
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Load given CoNLL-U file into internal representation
|
||||||
|
def load_conllu(file):
|
||||||
|
# Internal representation classes
|
||||||
|
class UDRepresentation:
|
||||||
|
def __init__(self):
|
||||||
|
# Characters of all the tokens in the whole file.
|
||||||
|
# Whitespace between tokens is not included.
|
||||||
|
self.characters = []
|
||||||
|
# List of UDSpan instances with start&end indices into `characters`.
|
||||||
|
self.tokens = []
|
||||||
|
# List of UDWord instances.
|
||||||
|
self.words = []
|
||||||
|
# List of UDSpan instances with start&end indices into `characters`.
|
||||||
|
self.sentences = []
|
||||||
|
class UDSpan:
|
||||||
|
def __init__(self, start, end, characters):
|
||||||
|
self.start = start
|
||||||
|
# Note that self.end marks the first position **after the end** of span,
|
||||||
|
# so we can use characters[start:end] or range(start, end).
|
||||||
|
self.end = end
|
||||||
|
self.characters = characters
|
||||||
|
|
||||||
|
@property
|
||||||
|
def text(self):
|
||||||
|
return ''.join(self.characters[self.start:self.end])
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return self.text
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return self.text
|
||||||
|
class UDWord:
|
||||||
|
def __init__(self, span, columns, is_multiword):
|
||||||
|
# Span of this word (or MWT, see below) within ud_representation.characters.
|
||||||
|
self.span = span
|
||||||
|
# 10 columns of the CoNLL-U file: ID, FORM, LEMMA,...
|
||||||
|
self.columns = columns
|
||||||
|
# is_multiword==True means that this word is part of a multi-word token.
|
||||||
|
# In that case, self.span marks the span of the whole multi-word token.
|
||||||
|
self.is_multiword = is_multiword
|
||||||
|
# Reference to the UDWord instance representing the HEAD (or None if root).
|
||||||
|
self.parent = None
|
||||||
|
# Let's ignore language-specific deprel subtypes.
|
||||||
|
self.columns[DEPREL] = columns[DEPREL].split(':')[0]
|
||||||
|
|
||||||
|
ud = UDRepresentation()
|
||||||
|
|
||||||
|
# Load the CoNLL-U file
|
||||||
|
index, sentence_start = 0, None
|
||||||
|
linenum = 0
|
||||||
|
while True:
|
||||||
|
line = file.readline()
|
||||||
|
linenum += 1
|
||||||
|
if not line:
|
||||||
|
break
|
||||||
|
line = line.rstrip("\r\n")
|
||||||
|
|
||||||
|
# Handle sentence start boundaries
|
||||||
|
if sentence_start is None:
|
||||||
|
# Skip comments
|
||||||
|
if line.startswith("#"):
|
||||||
|
continue
|
||||||
|
# Start a new sentence
|
||||||
|
ud.sentences.append(UDSpan(index, 0, ud.characters))
|
||||||
|
sentence_start = len(ud.words)
|
||||||
|
if not line:
|
||||||
|
# Add parent UDWord links and check there are no cycles
|
||||||
|
def process_word(word):
|
||||||
|
if word.parent == "remapping":
|
||||||
|
raise UDError("There is a cycle in a sentence")
|
||||||
|
if word.parent is None:
|
||||||
|
head = int(word.columns[HEAD])
|
||||||
|
if head > len(ud.words) - sentence_start:
|
||||||
|
raise UDError("HEAD '{}' points outside of the sentence".format(word.columns[HEAD]))
|
||||||
|
if head:
|
||||||
|
parent = ud.words[sentence_start + head - 1]
|
||||||
|
word.parent = "remapping"
|
||||||
|
process_word(parent)
|
||||||
|
word.parent = parent
|
||||||
|
|
||||||
|
for word in ud.words[sentence_start:]:
|
||||||
|
process_word(word)
|
||||||
|
|
||||||
|
# Check there is a single root node
|
||||||
|
if len([word for word in ud.words[sentence_start:] if word.parent is None]) != 1:
|
||||||
|
raise UDError("There are multiple roots in a sentence")
|
||||||
|
|
||||||
|
# End the sentence
|
||||||
|
ud.sentences[-1].end = index
|
||||||
|
sentence_start = None
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Read next token/word
|
||||||
|
columns = line.split("\t")
|
||||||
|
if len(columns) != 10:
|
||||||
|
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, line))
|
||||||
|
|
||||||
|
# Skip empty nodes
|
||||||
|
if "." in columns[ID]:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Delete spaces from FORM so gold.characters == system.characters
|
||||||
|
# even if one of them tokenizes the space.
|
||||||
|
columns[FORM] = columns[FORM].replace(" ", "")
|
||||||
|
if not columns[FORM]:
|
||||||
|
raise UDError("There is an empty FORM in the CoNLL-U file -- line %d" % linenum)
|
||||||
|
|
||||||
|
# Save token
|
||||||
|
ud.characters.extend(columns[FORM])
|
||||||
|
ud.tokens.append(UDSpan(index, index + len(columns[FORM]), ud.characters))
|
||||||
|
index += len(columns[FORM])
|
||||||
|
|
||||||
|
# Handle multi-word tokens to save word(s)
|
||||||
|
if "-" in columns[ID]:
|
||||||
|
try:
|
||||||
|
start, end = map(int, columns[ID].split("-"))
|
||||||
|
except:
|
||||||
|
raise UDError("Cannot parse multi-word token ID '{}'".format(columns[ID]))
|
||||||
|
|
||||||
|
for _ in range(start, end + 1):
|
||||||
|
word_line = file.readline().rstrip("\r\n")
|
||||||
|
word_columns = word_line.split("\t")
|
||||||
|
if len(word_columns) != 10:
|
||||||
|
print(columns)
|
||||||
|
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, word_line))
|
||||||
|
ud.words.append(UDWord(ud.tokens[-1], word_columns, is_multiword=True))
|
||||||
|
# Basic tokens/words
|
||||||
|
else:
|
||||||
|
try:
|
||||||
|
word_id = int(columns[ID])
|
||||||
|
except:
|
||||||
|
raise UDError("Cannot parse word ID '{}'".format(columns[ID]))
|
||||||
|
if word_id != len(ud.words) - sentence_start + 1:
|
||||||
|
raise UDError("Incorrect word ID '{}' for word '{}', expected '{}'".format(columns[ID], columns[FORM], len(ud.words) - sentence_start + 1))
|
||||||
|
|
||||||
|
try:
|
||||||
|
head_id = int(columns[HEAD])
|
||||||
|
except:
|
||||||
|
raise UDError("Cannot parse HEAD '{}'".format(columns[HEAD]))
|
||||||
|
if head_id < 0:
|
||||||
|
raise UDError("HEAD cannot be negative")
|
||||||
|
|
||||||
|
ud.words.append(UDWord(ud.tokens[-1], columns, is_multiword=False))
|
||||||
|
|
||||||
|
if sentence_start is not None:
|
||||||
|
raise UDError("The CoNLL-U file does not end with empty line")
|
||||||
|
|
||||||
|
return ud
|
||||||
|
|
||||||
|
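The token and sentence metrics computed from these loaded spans reduce to precision, recall, and F1 over matched spans, exactly as in the `Score` class in the evaluation code. A minimal standalone sketch of that arithmetic (the function name `prf` is illustrative, not part of the script):

```python
def prf(gold_total, system_total, correct):
    # Precision: fraction of system spans that are correct.
    precision = correct / system_total if system_total else 0.0
    # Recall: fraction of gold spans that were found.
    recall = correct / gold_total if gold_total else 0.0
    # F1 is the harmonic mean, computed directly from the raw counts.
    f1 = 2 * correct / (system_total + gold_total) if system_total + gold_total else 0.0
    return precision, recall, f1

print(prf(4, 5, 3))
```

With 4 gold spans, 5 system spans, and 3 matches this gives precision 0.6 and recall 0.75; the zero-total guards mirror the script's behaviour on empty inputs.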
# Evaluate the gold and system treebanks (loaded using load_conllu).
def evaluate(gold_ud, system_ud, deprel_weights=None):
    class Score:
        def __init__(self, gold_total, system_total, correct, aligned_total=None):
            self.precision = correct / system_total if system_total else 0.0
            self.recall = correct / gold_total if gold_total else 0.0
            self.f1 = 2 * correct / (system_total + gold_total) if system_total + gold_total else 0.0
            self.aligned_accuracy = correct / aligned_total if aligned_total else aligned_total
    class AlignmentWord:
        def __init__(self, gold_word, system_word):
            self.gold_word = gold_word
            self.system_word = system_word
            self.gold_parent = None
            self.system_parent_gold_aligned = None
    class Alignment:
        def __init__(self, gold_words, system_words):
            self.gold_words = gold_words
            self.system_words = system_words
            self.matched_words = []
            self.matched_words_map = {}
        def append_aligned_words(self, gold_word, system_word):
            self.matched_words.append(AlignmentWord(gold_word, system_word))
            self.matched_words_map[system_word] = gold_word
        def fill_parents(self):
            # We represent root parents in both gold and system data by '0'.
            # For gold data, we represent non-root parent by corresponding gold word.
            # For system data, we represent non-root parent by either gold word aligned
            # to parent system nodes, or by None if no gold words is aligned to the parent.
            for words in self.matched_words:
                words.gold_parent = words.gold_word.parent if words.gold_word.parent is not None else 0
                words.system_parent_gold_aligned = self.matched_words_map.get(words.system_word.parent, None) \
                    if words.system_word.parent is not None else 0

    def lower(text):
        if sys.version_info < (3, 0) and isinstance(text, str):
            return text.decode("utf-8").lower()
        return text.lower()

    def spans_score(gold_spans, system_spans):
        correct, gi, si = 0, 0, 0
        while gi < len(gold_spans) and si < len(system_spans):
            if system_spans[si].start < gold_spans[gi].start:
                si += 1
            elif gold_spans[gi].start < system_spans[si].start:
                gi += 1
            else:
                correct += gold_spans[gi].end == system_spans[si].end
                si += 1
                gi += 1

        return Score(len(gold_spans), len(system_spans), correct)

    def alignment_score(alignment, key_fn, weight_fn=lambda w: 1):
        gold, system, aligned, correct = 0, 0, 0, 0

        for word in alignment.gold_words:
            gold += weight_fn(word)

        for word in alignment.system_words:
            system += weight_fn(word)

        for words in alignment.matched_words:
            aligned += weight_fn(words.gold_word)

        if key_fn is None:
            # Return score for whole aligned words
            return Score(gold, system, aligned)

        for words in alignment.matched_words:
            if key_fn(words.gold_word, words.gold_parent) == key_fn(words.system_word, words.system_parent_gold_aligned):
                correct += weight_fn(words.gold_word)

        return Score(gold, system, correct, aligned)

    def beyond_end(words, i, multiword_span_end):
        if i >= len(words):
            return True
        if words[i].is_multiword:
            return words[i].span.start >= multiword_span_end
        return words[i].span.end > multiword_span_end

    def extend_end(word, multiword_span_end):
        if word.is_multiword and word.span.end > multiword_span_end:
            return word.span.end
        return multiword_span_end

    def find_multiword_span(gold_words, system_words, gi, si):
        # We know gold_words[gi].is_multiword or system_words[si].is_multiword.
        # Find the start of the multiword span (gs, ss), so the multiword span is minimal.
        # Initialize multiword_span_end characters index.
        if gold_words[gi].is_multiword:
            multiword_span_end = gold_words[gi].span.end
            if not system_words[si].is_multiword and system_words[si].span.start < gold_words[gi].span.start:
                si += 1
        else: # if system_words[si].is_multiword
            multiword_span_end = system_words[si].span.end
            if not gold_words[gi].is_multiword and gold_words[gi].span.start < system_words[si].span.start:
                gi += 1
        gs, ss = gi, si

        # Find the end of the multiword span
        # (so both gi and si are pointing to the word following the multiword span end).
        while not beyond_end(gold_words, gi, multiword_span_end) or \
              not beyond_end(system_words, si, multiword_span_end):
            if gi < len(gold_words) and (si >= len(system_words) or
                                         gold_words[gi].span.start <= system_words[si].span.start):
                multiword_span_end = extend_end(gold_words[gi], multiword_span_end)
                gi += 1
            else:
                multiword_span_end = extend_end(system_words[si], multiword_span_end)
                si += 1
        return gs, ss, gi, si

    def compute_lcs(gold_words, system_words, gi, si, gs, ss):
        lcs = [[0] * (si - ss) for i in range(gi - gs)]
        for g in reversed(range(gi - gs)):
            for s in reversed(range(si - ss)):
                if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
                    lcs[g][s] = 1 + (lcs[g+1][s+1] if g+1 < gi-gs and s+1 < si-ss else 0)
                lcs[g][s] = max(lcs[g][s], lcs[g+1][s] if g+1 < gi-gs else 0)
                lcs[g][s] = max(lcs[g][s], lcs[g][s+1] if s+1 < si-ss else 0)
        return lcs

    def align_words(gold_words, system_words):
        alignment = Alignment(gold_words, system_words)

        gi, si = 0, 0
        while gi < len(gold_words) and si < len(system_words):
            if gold_words[gi].is_multiword or system_words[si].is_multiword:
                # A: Multi-word tokens => align via LCS within the whole "multiword span".
                gs, ss, gi, si = find_multiword_span(gold_words, system_words, gi, si)

                if si > ss and gi > gs:
                    lcs = compute_lcs(gold_words, system_words, gi, si, gs, ss)

                    # Store aligned words
                    s, g = 0, 0
                    while g < gi - gs and s < si - ss:
                        if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
                            alignment.append_aligned_words(gold_words[gs+g], system_words[ss+s])
                            g += 1
                            s += 1
                        elif lcs[g][s] == (lcs[g+1][s] if g+1 < gi-gs else 0):
                            g += 1
                        else:
                            s += 1
            else:
                # B: No multi-word token => align according to spans.
                if (gold_words[gi].span.start, gold_words[gi].span.end) == (system_words[si].span.start, system_words[si].span.end):
                    alignment.append_aligned_words(gold_words[gi], system_words[si])
                    gi += 1
                    si += 1
                elif gold_words[gi].span.start <= system_words[si].span.start:
                    gi += 1
                else:
                    si += 1

        alignment.fill_parents()

        return alignment

    # Check that underlying character sequences do match
    if gold_ud.characters != system_ud.characters:
        index = 0
        while gold_ud.characters[index] == system_ud.characters[index]:
            index += 1

        raise UDError(
            "The concatenation of tokens in gold file and in system file differ!\n" +
            "First 20 differing characters in gold file: '{}' and system file: '{}'".format(
                "".join(gold_ud.characters[index:index + 20]),
                "".join(system_ud.characters[index:index + 20])
            )
        )

    # Align words
    alignment = align_words(gold_ud.words, system_ud.words)

    # Compute the F1-scores
    result = {
        "Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
        "Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
        "Words": alignment_score(alignment, None),
        "UPOS": alignment_score(alignment, lambda w, parent: w.columns[UPOS]),
        "XPOS": alignment_score(alignment, lambda w, parent: w.columns[XPOS]),
        "Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
        "AllTags": alignment_score(alignment, lambda w, parent: (w.columns[UPOS], w.columns[XPOS], w.columns[FEATS])),
        "Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
        "UAS": alignment_score(alignment, lambda w, parent: parent),
        "LAS": alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL])),
    }

    # Add WeightedLAS if weights are given
    if deprel_weights is not None:
        def weighted_las(word):
            return deprel_weights.get(word.columns[DEPREL], 1.0)
        result["WeightedLAS"] = alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL]), weighted_las)

    return result


def load_deprel_weights(weights_file):
    if weights_file is None:
        return None

    deprel_weights = {}
    for line in weights_file:
        # Ignore comments and empty lines
        if line.startswith("#") or not line.strip():
            continue

        columns = line.rstrip("\r\n").split()
        if len(columns) != 2:
            raise ValueError("Expected two columns in the UD Relations weights file on line '{}'".format(line))

        deprel_weights[columns[0]] = float(columns[1])

    return deprel_weights

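`load_deprel_weights` expects a plain-text file of two-column `deprel weight` lines, with `#` comments and blank lines skipped. A hedged standalone sketch of parsing that format (the relation names and weights below are made-up examples, and `parse_weights` is an illustrative name):

```python
import io

def parse_weights(lines):
    # Same format as load_deprel_weights: skip '#' comments and blank lines,
    # otherwise expect exactly two whitespace-separated columns.
    weights = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        deprel, weight = line.split()
        weights[deprel] = float(weight)
    return weights

example = io.StringIO("# UD relation weights\nnsubj 1.0\n\npunct 0.5\n")
print(parse_weights(example))
```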
def load_conllu_file(path):
    _file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
    return load_conllu(_file)

def evaluate_wrapper(args):
    # Load CoNLL-U files
    gold_ud = load_conllu_file(args.gold_file)
    system_ud = load_conllu_file(args.system_file)

    # Load weights if requested
    deprel_weights = load_deprel_weights(args.weights)

    return evaluate(gold_ud, system_ud, deprel_weights)

def main():
    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("gold_file", type=str,
                        help="Name of the CoNLL-U file with the gold data.")
    parser.add_argument("system_file", type=str,
                        help="Name of the CoNLL-U file with the predicted data.")
    parser.add_argument("--weights", "-w", type=argparse.FileType("r"), default=None,
                        metavar="deprel_weights_file",
                        help="Compute WeightedLAS using given weights for Universal Dependency Relations.")
    parser.add_argument("--verbose", "-v", default=0, action="count",
                        help="Print all metrics.")
    args = parser.parse_args()

    # Use verbose if weights are supplied
    if args.weights is not None and not args.verbose:
        args.verbose = 1

    # Evaluate
    evaluation = evaluate_wrapper(args)

    # Print the evaluation
    if not args.verbose:
        print("LAS F1 Score: {:.2f}".format(100 * evaluation["LAS"].f1))
    else:
        metrics = ["Tokens", "Sentences", "Words", "UPOS", "XPOS", "Feats", "AllTags", "Lemmas", "UAS", "LAS"]
        if args.weights is not None:
            metrics.append("WeightedLAS")

        print("Metrics    | Precision |    Recall |  F1 Score | AligndAcc")
        print("-----------+-----------+-----------+-----------+-----------")
        for metric in metrics:
            print("{:11}|{:10.2f} |{:10.2f} |{:10.2f} |{}".format(
                metric,
                100 * evaluation[metric].precision,
                100 * evaluation[metric].recall,
                100 * evaluation[metric].f1,
                "{:10.2f}".format(100 * evaluation[metric].aligned_accuracy) if evaluation[metric].aligned_accuracy is not None else ""
            ))

if __name__ == "__main__":
    main()

# Tests, which can be executed with `python -m unittest conll17_ud_eval`.
class TestAlignment(unittest.TestCase):
    @staticmethod
    def _load_words(words):
        """Prepare fake CoNLL-U files with fake HEAD to prevent multiple roots errors."""
        lines, num_words = [], 0
        for w in words:
            parts = w.split(" ")
            if len(parts) == 1:
                num_words += 1
                lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, parts[0], int(num_words>1)))
            else:
                lines.append("{}-{}\t{}\t_\t_\t_\t_\t_\t_\t_\t_".format(num_words + 1, num_words + len(parts) - 1, parts[0]))
                for part in parts[1:]:
                    num_words += 1
                    lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, part, int(num_words>1)))
        return load_conllu((io.StringIO if sys.version_info >= (3, 0) else io.BytesIO)("\n".join(lines+["\n"])))

    def _test_exception(self, gold, system):
        self.assertRaises(UDError, evaluate, self._load_words(gold), self._load_words(system))

    def _test_ok(self, gold, system, correct):
        metrics = evaluate(self._load_words(gold), self._load_words(system))
        gold_words = sum((max(1, len(word.split(" ")) - 1) for word in gold))
        system_words = sum((max(1, len(word.split(" ")) - 1) for word in system))
        self.assertEqual((metrics["Words"].precision, metrics["Words"].recall, metrics["Words"].f1),
                         (correct / system_words, correct / gold_words, 2 * correct / (gold_words + system_words)))

    def test_exception(self):
        self._test_exception(["a"], ["b"])

    def test_equal(self):
        self._test_ok(["a"], ["a"], 1)
        self._test_ok(["a", "b", "c"], ["a", "b", "c"], 3)

    def test_equal_with_multiword(self):
        self._test_ok(["abc a b c"], ["a", "b", "c"], 3)
        self._test_ok(["a", "bc b c", "d"], ["a", "b", "c", "d"], 4)
        self._test_ok(["abcd a b c d"], ["ab a b", "cd c d"], 4)
        self._test_ok(["abc a b c", "de d e"], ["a", "bcd b c d", "e"], 5)

    def test_alignment(self):
        self._test_ok(["abcd"], ["a", "b", "c", "d"], 0)
        self._test_ok(["abc", "d"], ["a", "b", "c", "d"], 1)
        self._test_ok(["a", "bc", "d"], ["a", "b", "c", "d"], 2)
        self._test_ok(["a", "bc b c", "d"], ["a", "b", "cd"], 2)
        self._test_ok(["abc a BX c", "def d EX f"], ["ab a b", "cd c d", "ef e f"], 4)
        self._test_ok(["ab a b", "cd bc d"], ["a", "bc", "d"], 2)
        self._test_ok(["a", "bc b c", "d"], ["ab AX BX", "cd CX a"], 1)
@@ -116,10 +116,9 @@ def train(lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
     print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %")
     try:
+        train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
+                                       gold_preproc=gold_preproc, max_length=0)
+        train_docs = list(train_docs)
         for i in range(n_iter):
-            train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
-                                           gold_preproc=gold_preproc, max_length=0)
             with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
                 losses = {}
                 for batch in minibatch(train_docs, size=batch_sizes):

390  spacy/cli/ud_train.py  Normal file

@@ -0,0 +1,390 @@
'''Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
.conllu format for development data, allowing the official scorer to be used.
'''
from __future__ import unicode_literals
import plac
import tqdm
from pathlib import Path
import re
import sys
import json

import spacy
import spacy.util
from ..tokens import Token, Doc
from ..gold import GoldParse
from ..syntax.nonproj import projectivize
from ..matcher import Matcher
from collections import defaultdict, Counter
from timeit import default_timer as timer

import itertools
import random
import numpy.random
import cytoolz

from . import conll17_ud_eval

from .. import lang
from ..lang import zh
from ..lang import ja

lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False

random.seed(0)
numpy.random.seed(0)

def minibatch_by_words(items, size=5000):
    random.shuffle(items)
    if isinstance(size, int):
        size_ = itertools.repeat(size)
    else:
        size_ = size
    items = iter(items)
    while True:
        batch_size = next(size_)
        batch = []
        while batch_size >= 0:
            try:
                doc, gold = next(items)
            except StopIteration:
                if batch:
                    yield batch
                return
            batch_size -= len(doc)
            batch.append((doc, gold))
        if batch:
            yield batch
        else:
            break

################
# Data reading #
################

space_re = re.compile('\s+')
def split_text(text):
    return [space_re.sub(' ', par.strip()) for par in text.split('\n\n')]


def read_data(nlp, conllu_file, text_file, raw_text=True, oracle_segments=False,
              max_doc_length=None, limit=None):
    '''Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
    include Doc objects created using nlp.make_doc and then aligned against
    the gold-standard sequences. If oracle_segments=True, include Doc objects
    created from the gold-standard segments. At least one must be True.'''
    if not raw_text and not oracle_segments:
        raise ValueError("At least one of raw_text or oracle_segments must be True")
    paragraphs = split_text(text_file.read())
    conllu = read_conllu(conllu_file)
    # sd is spacy doc; cd is conllu doc
    # cs is conllu sent, ct is conllu token
    docs = []
    golds = []
    for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
        sent_annots = []
        for cs in cd:
            sent = defaultdict(list)
            for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
                if '.' in id_:
                    continue
                if '-' in id_:
                    continue
                id_ = int(id_)-1
                head = int(head)-1 if head != '0' else id_
                sent['words'].append(word)
                sent['tags'].append(tag)
                sent['heads'].append(head)
                sent['deps'].append('ROOT' if dep == 'root' else dep)
                sent['spaces'].append(space_after == '_')
            sent['entities'] = ['-'] * len(sent['words'])
            sent['heads'], sent['deps'] = projectivize(sent['heads'],
                                                       sent['deps'])
            if oracle_segments:
                docs.append(Doc(nlp.vocab, words=sent['words'], spaces=sent['spaces']))
                golds.append(GoldParse(docs[-1], **sent))

            sent_annots.append(sent)
            if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
                doc, gold = _make_gold(nlp, None, sent_annots)
                sent_annots = []
                docs.append(doc)
                golds.append(gold)
                if limit and len(docs) >= limit:
                    return docs, golds

        if raw_text and sent_annots:
            doc, gold = _make_gold(nlp, None, sent_annots)
            docs.append(doc)
            golds.append(gold)
        if limit and len(docs) >= limit:
            return docs, golds
    return docs, golds

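`minibatch_by_words` above fills each batch until a running word budget is spent, rather than counting documents. A simplified standalone sketch of the same idea, using plain word counts in place of `(doc, gold)` pairs (`batch_by_words` is an illustrative name, and unlike the original it does not shuffle and closes a batch as soon as the budget goes negative):

```python
import itertools

def batch_by_words(doc_lengths, size=10):
    # Greedily add docs to the current batch, subtracting each doc's length
    # from the remaining budget; start a new batch once the budget is spent.
    sizes = itertools.repeat(size)
    batches = []
    batch = []
    budget = next(sizes)
    for n in doc_lengths:
        batch.append(n)
        budget -= n
        if budget < 0:
            batches.append(batch)
            batch = []
            budget = next(sizes)
    if batch:
        batches.append(batch)
    return batches

print(batch_by_words([4, 4, 4, 4, 4], size=10))
```

Batching by words rather than documents keeps per-update compute roughly constant even when document lengths vary widely.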
def read_conllu(file_):
    docs = []
    sent = []
    doc = []
    for line in file_:
        if line.startswith('# newdoc'):
            if doc:
                docs.append(doc)
            doc = []
        elif line.startswith('#'):
            continue
        elif not line.strip():
            if sent:
                doc.append(sent)
                sent = []
        else:
            sent.append(list(line.strip().split('\t')))
            if len(sent[-1]) != 10:
                print(repr(line))
                raise ValueError
    if sent:
        doc.append(sent)
    if doc:
        docs.append(doc)
    return docs


def _make_gold(nlp, text, sent_annots):
    # Flatten the conll annotations, and adjust the head indices
    flat = defaultdict(list)
    for sent in sent_annots:
        flat['heads'].extend(len(flat['words'])+head for head in sent['heads'])
        for field in ['words', 'tags', 'deps', 'entities', 'spaces']:
            flat[field].extend(sent[field])
    # Construct text if necessary
    assert len(flat['words']) == len(flat['spaces'])
    if text is None:
        text = ''.join(word+' '*space for word, space in zip(flat['words'], flat['spaces']))
    doc = nlp.make_doc(text)
    flat.pop('spaces')
    gold = GoldParse(doc, **flat)
    return doc, gold

#############################
# Data transforms for spaCy #
#############################

def golds_to_gold_tuples(docs, golds):
    '''Get out the annoying 'tuples' format used by begin_training, given the
    GoldParse objects.'''
    tuples = []
    for doc, gold in zip(docs, golds):
        text = doc.text
        ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
        sents = [((ids, words, tags, heads, labels, iob), [])]
        tuples.append((text, sents))
    return tuples


##############
# Evaluation #
##############

def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
    with text_loc.open('r', encoding='utf8') as text_file:
        texts = split_text(text_file.read())
        docs = list(nlp.pipe(texts))
    with sys_loc.open('w', encoding='utf8') as out_file:
        write_conllu(docs, out_file)
    with gold_loc.open('r', encoding='utf8') as gold_file:
        gold_ud = conll17_ud_eval.load_conllu(gold_file)
        with sys_loc.open('r', encoding='utf8') as sys_file:
            sys_ud = conll17_ud_eval.load_conllu(sys_file)
        scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
    return scores


def write_conllu(docs, file_):
    merger = Matcher(docs[0].vocab)
    merger.add('SUBTOK', None, [{'DEP': 'subtok', 'op': '+'}])
    for i, doc in enumerate(docs):
        matches = merger(doc)
        spans = [doc[start:end+1] for _, start, end in matches]
        offsets = [(span.start_char, span.end_char) for span in spans]
        for start_char, end_char in offsets:
            doc.merge(start_char, end_char)
        file_.write("# newdoc id = {i}\n".format(i=i))
        for j, sent in enumerate(doc.sents):
            file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
            file_.write("# text = {text}\n".format(text=sent.text))
            for k, token in enumerate(sent):
                file_.write(token._.get_conllu_lines(k) + '\n')
            file_.write('\n')


def print_progress(itn, losses, ud_scores):
    fields = {
        'dep_loss': losses.get('parser', 0.0),
        'tag_loss': losses.get('tagger', 0.0),
        'words': ud_scores['Words'].f1 * 100,
        'sents': ud_scores['Sentences'].f1 * 100,
        'tags': ud_scores['XPOS'].f1 * 100,
        'uas': ud_scores['UAS'].f1 * 100,
        'las': ud_scores['LAS'].f1 * 100,
    }
    header = ['Epoch', 'Loss', 'LAS', 'UAS', 'TAG', 'SENT', 'WORD']
    if itn == 0:
        print('\t'.join(header))
    tpl = '\t'.join((
        '{:d}',
        '{dep_loss:.1f}',
        '{las:.1f}',
        '{uas:.1f}',
        '{tags:.1f}',
        '{sents:.1f}',
        '{words:.1f}',
    ))
    print(tpl.format(itn, **fields))

#def get_sent_conllu(sent, sent_id):
#    lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]

def get_token_conllu(token, i):
    if token._.begins_fused:
        n = 1
        while token.nbor(n)._.inside_fused:
            n += 1
        id_ = '%d-%d' % (i, i+n)
        lines = [id_, token.text, '_', '_', '_', '_', '_', '_', '_', '_']
    else:
        lines = []
    if token.head.i == token.i:
        head = 0
    else:
        head = i + (token.head.i - token.i) + 1
    fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, '_',
              str(head), token.dep_.lower(), '_', '_']
    lines.append('\t'.join(fields))
    return '\n'.join(lines)

Token.set_extension('get_conllu_lines', method=get_token_conllu)
Token.set_extension('begins_fused', default=False)
Token.set_extension('inside_fused', default=False)

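Each line emitted by `write_conllu`/`get_token_conllu` is one 10-column, tab-separated CoNLL-U row. A small sketch assembling such a row by hand (the token values are made up for illustration):

```python
# One CoNLL-U row: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
# Underscores mark columns that are left unset, as in get_token_conllu above.
fields = ["1", "They", "they", "PRON", "PRP", "_", "2", "nsubj", "_", "_"]
row = "\t".join(fields)

# The loaders in this module (read_conllu, conll17_ud_eval.load_conllu)
# reject any row that does not split into exactly 10 columns.
assert len(row.split("\t")) == 10
print(row)
```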
##################
|
||||||
|
# Initialization #
|
||||||
|
##################
|
||||||
|
|
||||||
|
|
||||||
|
def load_nlp(corpus, config):
|
||||||
|
lang = corpus.split('_')[0]
|
||||||
|
nlp = spacy.blank(lang)
|
||||||
|
if config.vectors:
|
||||||
|
nlp.vocab.from_disk(config.vectors / 'vocab')
|
||||||
|
return nlp
|
||||||
|
|
||||||
|
def initialize_pipeline(nlp, docs, golds, config):
|
||||||
|
nlp.add_pipe(nlp.create_pipe('parser'))
|
||||||
|
if config.multitask_tag:
|
||||||
|
nlp.parser.add_multitask_objective('tag')
|
||||||
|
if config.multitask_sent:
|
||||||
|
nlp.parser.add_multitask_objective('sent_start')
|
||||||
|
nlp.parser.moves.add_action(2, 'subtok')
|
||||||
|
nlp.add_pipe(nlp.create_pipe('tagger'))
|
||||||
|
for gold in golds:
|
||||||
|
for tag in gold.tags:
|
||||||
|
if tag is not None:
|
||||||
|
nlp.tagger.add_label(tag)
|
||||||
|
# Replace labels that didn't make the frequency cutoff
|
||||||
|
actions = set(nlp.parser.labels)
|
||||||
|
label_set = set([act.split('-')[1] for act in actions if '-' in act])
|
||||||
|
for gold in golds:
|
||||||
|
for i, label in enumerate(gold.labels):
|
||||||
|
if label is not None and label not in label_set:
|
||||||
|
gold.labels[i] = label.split('||')[0]
|
||||||
|
return nlp.begin_training(lambda: golds_to_gold_tuples(docs, golds))
|
||||||
|
|
||||||
|
|
||||||
|
########################
|
||||||
|
# Command line helpers #
|
||||||
|
########################
|
||||||
|
|
||||||
|
class Config(object):
|
||||||
|
def __init__(self, vectors=None, max_doc_length=10, multitask_tag=True,
|
||||||
|
multitask_sent=True, nr_epoch=30, batch_size=1000, dropout=0.2):
|
||||||
|
for key, value in locals().items():
|
||||||
|
setattr(self, key, value)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def load(cls, loc):
|
||||||
|
with Path(loc).open('r', encoding='utf8') as file_:
|
||||||
|
cfg = json.load(file_)
|
||||||
|
return cls(**cfg)
|
||||||
|
|
||||||
|
|
||||||
|
class Dataset(object):
|
||||||
|
def __init__(self, path, section):
|
||||||
|
self.path = path
|
||||||
|
self.section = section
|
||||||
|
self.conllu = None
|
||||||
|
self.text = None
|
||||||
|
for file_path in self.path.iterdir():
|
||||||
|
name = file_path.parts[-1]
|
||||||
|
if section in name and name.endswith('conllu'):
|
||||||
|
self.conllu = file_path
|
||||||
|
elif section in name and name.endswith('txt'):
|
||||||
|
self.text = file_path
|
||||||
|
        if self.conllu is None:
            msg = "Could not find .conllu file in {path} for {section}"
            raise IOError(msg.format(section=section, path=path))
        if self.text is None:
            msg = "Could not find .txt file in {path} for {section}"
            raise IOError(msg.format(section=section, path=path))
        self.lang = self.conllu.parts[-1].split('-')[0].split('_')[0]

class TreebankPaths(object):
    def __init__(self, ud_path, treebank, **cfg):
        self.train = Dataset(ud_path / treebank, 'train')
        self.dev = Dataset(ud_path / treebank, 'dev')
        self.lang = self.train.lang


@plac.annotations(
    ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
    corpus=("UD corpus to train and evaluate on, e.g. en, es_ancora, etc",
            "positional", None, str),
    parses_dir=("Directory to write the development parses", "positional", None, Path),
    config=("Path to json formatted config file", "positional"),
    limit=("Size limit", "option", "n", int)
)
def main(ud_dir, parses_dir, config, corpus, limit=0):
    config = Config.load(config)
    paths = TreebankPaths(ud_dir, corpus)
    if not (parses_dir / corpus).exists():
        (parses_dir / corpus).mkdir()
    print("Train and evaluate", corpus, "using lang", paths.lang)
    nlp = load_nlp(paths.lang, config)

    docs, golds = read_data(nlp, paths.train.conllu.open(), paths.train.text.open(),
                            max_doc_length=config.max_doc_length, limit=limit)

    optimizer = initialize_pipeline(nlp, docs, golds, config)

    for i in range(config.nr_epoch):
        docs = [nlp.make_doc(doc.text) for doc in docs]
        batches = minibatch_by_words(list(zip(docs, golds)), size=config.batch_size)
        losses = {}
        n_train_words = sum(len(doc) for doc in docs)
        with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
            for batch in batches:
                batch_docs, batch_gold = zip(*batch)
                pbar.update(sum(len(doc) for doc in batch_docs))
                nlp.update(batch_docs, batch_gold, sgd=optimizer,
                           drop=config.dropout, losses=losses)

        out_path = parses_dir / corpus / 'epoch-{i}.conllu'.format(i=i)
        with nlp.use_params(optimizer.averages):
            scores = evaluate(nlp, paths.dev.text, paths.dev.conllu, out_path)
        print_progress(i, losses, scores)


if __name__ == '__main__':
    plac.call(main)
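The loop above batches by word count rather than by number of documents, so each update does roughly the same amount of work even when document lengths vary widely. The real `minibatch_by_words` lives in `spacy.util`; the standalone sketch below only illustrates the idea and is not the library implementation:

```python
def minibatch_by_words(items, size=1000):
    """Group (doc, gold) pairs so each batch holds roughly `size` words.

    `items` is an iterable of (doc, gold) pairs where len(doc) gives the
    word count. A batch is emitted as soon as the running word count
    reaches `size`; any remainder is emitted at the end.
    """
    batch = []
    n_words = 0
    for doc, gold in items:
        batch.append((doc, gold))
        n_words += len(doc)
        if n_words >= size:
            yield batch
            batch = []
            n_words = 0
    if batch:
        yield batch
```

With `size=60` and docs of 30, 40, and 50 words, this yields two batches: the first closes once it reaches 70 words, and the 50-word doc forms the remainder batch.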
@@ -13,7 +13,7 @@ from . import _align
 from .syntax import nonproj
 from .tokens import Doc
 from . import util
-from .util import minibatch
+from .util import minibatch, itershuffle


 def tags_to_entities(tags):
@@ -133,15 +133,14 @@ class GoldCorpus(object):
     def train_docs(self, nlp, gold_preproc=False,
                    projectivize=False, max_length=None,
                    noise_level=0.0):
-        train_tuples = self.train_tuples
         if projectivize:
             train_tuples = nonproj.preprocess_training_data(
-                self.train_tuples, label_freq_cutoff=100)
-        random.shuffle(train_tuples)
+                self.train_tuples, label_freq_cutoff=30)
+        random.shuffle(self.train_locs)
         gold_docs = self.iter_gold_docs(nlp, train_tuples, gold_preproc,
                                         max_length=max_length,
                                         noise_level=noise_level)
-        yield from gold_docs
+        yield from itershuffle(gold_docs, bufsize=100)

     def dev_docs(self, nlp, gold_preproc=False):
         gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc)
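The `itershuffle(gold_docs, bufsize=100)` call introduced here shuffles a stream of gold documents without loading the whole corpus into memory. A rough sketch of buffered stream shuffling, under the assumption that the spaCy helper works along these lines (this is an illustration, not the `spacy.util` implementation):

```python
import random


def itershuffle(iterable, bufsize=100):
    """Approximately shuffle a stream with bounded memory.

    Fill a buffer of up to `bufsize` items; once full, emit a randomly
    chosen element for each new item consumed. Drain the remaining
    buffer in shuffled order at the end. Only `bufsize` items are held
    in memory at once, so very long streams stay cheap to shuffle.
    """
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) >= bufsize:
            i = random.randrange(len(buf))
            yield buf.pop(i)
    random.shuffle(buf)
    for item in buf:
        yield item
```

The shuffle is only approximate: items can move at most about `bufsize` positions, which is usually enough to decorrelate neighbouring training examples.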
@@ -21,7 +21,7 @@ class SpanishDefaults(Language.Defaults):
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     tag_map = TAG_MAP
     stop_words = STOP_WORDS
-    sytax_iterators = SYNTAX_ITERATORS
+    syntax_iterators = SYNTAX_ITERATORS
     lemma_lookup = LOOKUP
|
@ -6,17 +6,19 @@ from ...symbols import NOUN, PROPN, PRON, VERB, AUX
|
||||||
|
|
||||||
def noun_chunks(obj):
|
def noun_chunks(obj):
|
||||||
doc = obj.doc
|
doc = obj.doc
|
||||||
np_label = doc.vocab.strings['NP']
|
if not len(doc):
|
||||||
|
return
|
||||||
|
np_label = doc.vocab.strings.add('NP')
|
||||||
left_labels = ['det', 'fixed', 'neg'] #['nunmod', 'det', 'appos', 'fixed']
|
left_labels = ['det', 'fixed', 'neg'] #['nunmod', 'det', 'appos', 'fixed']
|
||||||
right_labels = ['flat', 'fixed', 'compound', 'neg']
|
right_labels = ['flat', 'fixed', 'compound', 'neg']
|
||||||
stop_labels = ['punct']
|
stop_labels = ['punct']
|
||||||
np_left_deps = [doc.vocab.strings[label] for label in left_labels]
|
np_left_deps = [doc.vocab.strings.add(label) for label in left_labels]
|
||||||
np_right_deps = [doc.vocab.strings[label] for label in right_labels]
|
np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
|
||||||
stop_deps = [doc.vocab.strings[label] for label in stop_labels]
|
stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
|
||||||
token = doc[0]
|
token = doc[0]
|
||||||
while token and token.i < len(doc):
|
while token and token.i < len(doc):
|
||||||
if token.pos in [PROPN, NOUN, PRON]:
|
if token.pos in [PROPN, NOUN, PRON]:
|
||||||
left, right = noun_bounds(token)
|
left, right = noun_bounds(doc, token, np_left_deps, np_right_deps, stop_deps)
|
||||||
yield left.i, right.i+1, np_label
|
yield left.i, right.i+1, np_label
|
||||||
token = right
|
token = right
|
||||||
token = next_token(token)
|
token = next_token(token)
|
||||||
|
@ -33,7 +35,7 @@ def next_token(token):
|
||||||
return None
|
return None
|
||||||
|
|
||||||
|
|
||||||
def noun_bounds(root):
|
def noun_bounds(doc, root, np_left_deps, np_right_deps, stop_deps):
|
||||||
left_bound = root
|
left_bound = root
|
||||||
for token in reversed(list(root.lefts)):
|
for token in reversed(list(root.lefts)):
|
||||||
if token.dep in np_left_deps:
|
if token.dep in np_left_deps:
|
||||||
|
@ -41,7 +43,7 @@ def noun_bounds(root):
|
||||||
right_bound = root
|
right_bound = root
|
||||||
for token in root.rights:
|
for token in root.rights:
|
||||||
if (token.dep in np_right_deps):
|
if (token.dep in np_right_deps):
|
||||||
left, right = noun_bounds(token)
|
left, right = noun_bounds(doc, token, np_left_deps, np_right_deps, stop_deps)
|
||||||
if list(filter(lambda t: is_verb_token(t) or t.dep in stop_deps,
|
if list(filter(lambda t: is_verb_token(t) or t.dep in stop_deps,
|
||||||
doc[left_bound.i: right.i])):
|
doc[left_bound.i: right.i])):
|
||||||
break
|
break
|
||||||
|
|
|
@@ -6,10 +6,25 @@ from __future__ import unicode_literals

 STOP_WORDS = set("""
 a
+ah
+aha
+aj
 ako
+al
 ali
+arh
+au
+avaj
+bar
+baš
+bez
 bi
 bih
+bijah
+bijahu
+bijaše
+bijasmo
+bijaste
 bila
 bili
 bilo
@@ -17,25 +32,104 @@ bio
 bismo
 biste
 biti
+brr
+buć
+budavši
+bude
+budimo
+budite
+budu
+budući
+bum
 bumo
+će
+ćemo
+ćeš
+ćete
+čijem
+čijim
+čijima
+ću
 da
+daj
+dakle
+de
+deder
+dem
+djelomice
+djelomično
 do
+doista
+dok
+dokle
+donekle
+dosad
+doskoro
+dotad
+dotle
+dovečer
+drugamo
+drugdje
 duž
+e
+eh
+ehe
+ej
+eno
+eto
+evo
 ga
+gdjekakav
+gdjekoje
+gic
+god
+halo
+hej
+hm
 hoće
 hoćemo
-hoćete
 hoćeš
+hoćete
 hoću
+hop
+htijahu
+htijasmo
+htijaste
+htio
+htjedoh
+htjedoše
+htjedoste
+htjela
+htjele
+htjeli
+hura
 i
 iako
 ih
+iju
+ijuju
+ikada
+ikakav
+ikakva
+ikakve
+ikakvi
+ikakvih
+ikakvim
+ikakvima
+ikakvo
+ikakvog
+ikakvoga
+ikakvoj
+ikakvom
+ikakvome
 ili
+im
 iz
 ja
 je
 jedna
 jedne
+jedni
 jedno
 jer
 jesam
@@ -57,6 +151,7 @@ koji
 kojima
 koju
 kroz
+lani
 li
 me
 mene
@@ -66,6 +161,8 @@ mimo
 moj
 moja
 moje
+moji
+moju
 mu
 na
 nad
@@ -77,24 +174,27 @@ naš
 naša
 naše
 našeg
+naši
 ne
+neće
+nećemo
+nećeš
+nećete
+neću
 nego
 neka
+neke
 neki
 nekog
 neku
 nema
-netko
-neće
-nećemo
-nećete
-nećeš
-neću
 nešto
+netko
 ni
 nije
 nikoga
 nikoje
+nikoji
 nikoju
 nisam
 nisi
@@ -123,33 +223,63 @@ od
 odmah
 on
 ona
+one
 oni
 ono
+onu
+onoj
+onom
+onim
+onima
 ova
+ovaj
+ovim
+ovima
+ovoj
 pa
 pak
+pljus
 po
 pod
+podalje
+poimence
+poizdalje
+ponekad
 pored
+postrance
+potajice
+potrbuške
+pouzdano
 prije
 s
 sa
 sam
 samo
+sasvim
+sav
 se
 sebe
 sebi
 si
+šic
 smo
 ste
+što
+šta
+štogod
+štagod
 su
+sva
 sve
 svi
+svi
 svog
 svoj
 svoja
 svoje
+svoju
 svom
+svu
 ta
 tada
 taj
@@ -158,6 +288,8 @@ te
 tebe
 tebi
 ti
+tim
+tima
 to
 toj
 tome
@@ -165,23 +297,51 @@ tu
 tvoj
 tvoja
 tvoje
+tvoji
+tvoju
 u
+usprkos
+utaman
+uvijek
 uz
+uza
+uzagrapce
+uzalud
+uzduž
+valjda
 vam
 vama
 vas
 vaš
 vaša
 vaše
+vašim
+vašima
 već
 vi
+vjerojatno
+vjerovatno
+vrh
 vrlo
 za
+zaista
 zar
-će
-ćemo
-ćete
-ćeš
-ću
-što
+zatim
+zato
+zbija
+zbog
+želeći
+željah
+željela
+željele
+željeli
+željelo
+željen
+željena
+željene
+željeni
+željenu
+željeo
+zimus
+zum
 """.split())
@@ -35,14 +35,32 @@ class JapaneseTokenizer(object):
     def from_disk(self, path, **exclude):
         return self

+
+class JapaneseCharacterSegmenter(object):
+    def __init__(self, vocab):
+        self.vocab = vocab
+
+    def __call__(self, text):
+        words = []
+        spaces = []
+        doc = self.tokenizer(text)
+        for token in self.tokenizer(text):
+            words.extend(list(token.text))
+            spaces.extend([False]*len(token.text))
+            spaces[-1] = bool(token.whitespace_)
+        return Doc(self.vocab, words=words, spaces=spaces)
+
+
 class JapaneseDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda text: 'ja'
+    use_janome = True

     @classmethod
     def create_tokenizer(cls, nlp=None):
-        return JapaneseTokenizer(cls, nlp)
+        if cls.use_janome:
+            return JapaneseTokenizer(cls, nlp)
+        else:
+            return JapaneseCharacterSegmenter(cls, nlp.vocab)


 class Japanese(Language):
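The fallback segmenter in the hunk above splits every token into single characters and marks trailing whitespace on the last character of each token. A standalone sketch of that per-character expansion, without the spaCy `Doc`/`Vocab` machinery (note that the committed class still references an undefined `self.tokenizer`, so this captures the intended behaviour rather than the literal code):

```python
def segment_characters(tokens):
    """Expand (text, trailing_space) token pairs into per-character units.

    Each character becomes its own word; every character is marked as
    having no trailing space except the final one of a token, which
    inherits the token's trailing-whitespace flag. This mirrors a
    character-level fallback for languages without a word tokenizer.
    """
    words = []
    spaces = []
    for text, has_space in tokens:
        words.extend(list(text))
        spaces.extend([False] * len(text))
        if text:
            spaces[-1] = bool(has_space)
    return words, spaces
```

For example, the tokens `("ab", True)` and `("c", False)` become the words `a`, `b`, `c` with only `b` carrying a trailing space.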
 22  spacy/lang/tr/examples.py  Normal file

@@ -0,0 +1,22 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+"""
+Example sentences to test spaCy and its language models.
+>>> from spacy.lang.tr.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+    "Neredesin?",
+    "Neredesiniz?",
+    "Bu bir cümledir.",
+    "Sürücüsüz araçlar sigorta yükümlülüğünü üreticilere kaydırıyor.",
+    "San Francisco kaldırımda kurye robotları yasaklayabilir.",
+    "Londra İngiltere'nin başkentidir.",
+    "Türkiye'nin başkenti neresi?",
+    "Bakanlar Kurulu 180 günlük eylem planını açıkladı.",
+    "Merkez Bankası, beklentiler doğrultusunda faizlerde değişikliğe gitmedi."
+]
 31  spacy/lang/tr/lex_attrs.py  Normal file

@@ -0,0 +1,31 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...attrs import LIKE_NUM
+
+
+# Thirteen, fifteen etc. are written separate: on üç
+
+_num_words = ['bir', 'iki', 'üç', 'dört', 'beş', 'altı', 'yedi', 'sekiz',
+              'dokuz', 'on', 'yirmi', 'otuz', 'kırk', 'elli', 'altmış',
+              'yetmiş', 'seksen', 'doksan', 'yüz', 'bin', 'milyon',
+              'milyar', 'katrilyon', 'kentilyon']
+
+
+def like_num(text):
+    text = text.replace(',', '').replace('.', '')
+    if text.isdigit():
+        return True
+    if text.count('/') == 1:
+        num, denom = text.split('/')
+        if num.isdigit() and denom.isdigit():
+            return True
+    if text.lower() in _num_words:
+        return True
+    return False
+
+
+LEX_ATTRS = {
+    LIKE_NUM: like_num
+}
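The `like_num` check in the new file strips thousand separators, then accepts plain digits, simple fractions, and the listed Turkish number words. A quick standalone check of the same logic, copied out of the file so it runs without spaCy:

```python
# Turkish number words, as in the new lex_attrs.py.
_num_words = ['bir', 'iki', 'üç', 'dört', 'beş', 'altı', 'yedi', 'sekiz',
              'dokuz', 'on', 'yirmi', 'otuz', 'kırk', 'elli', 'altmış',
              'yetmiş', 'seksen', 'doksan', 'yüz', 'bin', 'milyon',
              'milyar', 'katrilyon', 'kentilyon']


def like_num(text):
    # Drop separators so "1.000.000" and "1,5" count as numeric.
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # Accept simple fractions like "3/4".
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to the number-word list, case-insensitively.
    return text.lower() in _num_words
```

This also shows why compounds such as "on üç" (thirteen) need no special casing here: they tokenize into two words, each matched by the list individually.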
|
@ -10,16 +10,12 @@ acep
|
||||||
adamakıllı
|
adamakıllı
|
||||||
adeta
|
adeta
|
||||||
ait
|
ait
|
||||||
altmýþ
|
|
||||||
altmış
|
|
||||||
altý
|
|
||||||
altı
|
|
||||||
ama
|
ama
|
||||||
amma
|
amma
|
||||||
anca
|
anca
|
||||||
ancak
|
ancak
|
||||||
arada
|
arada
|
||||||
artýk
|
artık
|
||||||
aslında
|
aslında
|
||||||
aynen
|
aynen
|
||||||
ayrıca
|
ayrıca
|
||||||
|
@ -29,46 +25,82 @@ açıkçası
|
||||||
bana
|
bana
|
||||||
bari
|
bari
|
||||||
bazen
|
bazen
|
||||||
bazý
|
|
||||||
bazı
|
bazı
|
||||||
|
bazısı
|
||||||
|
bazısına
|
||||||
|
bazısında
|
||||||
|
bazısından
|
||||||
|
bazısını
|
||||||
|
bazısının
|
||||||
başkası
|
başkası
|
||||||
baţka
|
başkasına
|
||||||
|
başkasında
|
||||||
|
başkasından
|
||||||
|
başkasını
|
||||||
|
başkasının
|
||||||
|
başka
|
||||||
belki
|
belki
|
||||||
ben
|
ben
|
||||||
|
bende
|
||||||
benden
|
benden
|
||||||
beni
|
beni
|
||||||
benim
|
benim
|
||||||
beri
|
beri
|
||||||
beriki
|
beriki
|
||||||
beþ
|
berikinin
|
||||||
beş
|
berikiyi
|
||||||
beţ
|
berisi
|
||||||
bilcümle
|
bilcümle
|
||||||
bile
|
bile
|
||||||
bin
|
|
||||||
binaen
|
binaen
|
||||||
binaenaleyh
|
binaenaleyh
|
||||||
bir
|
|
||||||
biraz
|
biraz
|
||||||
birazdan
|
birazdan
|
||||||
birbiri
|
birbiri
|
||||||
|
birbirine
|
||||||
|
birbirini
|
||||||
|
birbirinin
|
||||||
|
birbirinde
|
||||||
|
birbirinden
|
||||||
birden
|
birden
|
||||||
birdenbire
|
birdenbire
|
||||||
biri
|
biri
|
||||||
|
birine
|
||||||
|
birini
|
||||||
|
birinin
|
||||||
|
birinde
|
||||||
|
birinden
|
||||||
birice
|
birice
|
||||||
birileri
|
birileri
|
||||||
|
birilerinde
|
||||||
|
birilerinden
|
||||||
|
birilerine
|
||||||
|
birilerini
|
||||||
|
birilerinin
|
||||||
birisi
|
birisi
|
||||||
|
birisine
|
||||||
|
birisini
|
||||||
|
birisinin
|
||||||
|
birisinde
|
||||||
|
birisinden
|
||||||
birkaç
|
birkaç
|
||||||
birkaçı
|
birkaçı
|
||||||
|
birkaçına
|
||||||
|
birkaçını
|
||||||
|
birkaçının
|
||||||
|
birkaçında
|
||||||
|
birkaçından
|
||||||
birkez
|
birkez
|
||||||
birlikte
|
birlikte
|
||||||
birçok
|
birçok
|
||||||
birçoğu
|
birçoğu
|
||||||
birþey
|
birçoğuna
|
||||||
birþeyi
|
birçoğunda
|
||||||
|
birçoğundan
|
||||||
|
birçoğunu
|
||||||
|
birçoğunun
|
||||||
birşey
|
birşey
|
||||||
birşeyi
|
birşeyi
|
||||||
birţey
|
|
||||||
bitevi
|
bitevi
|
||||||
biteviye
|
biteviye
|
||||||
bittabi
|
bittabi
|
||||||
|
@ -96,6 +128,11 @@ buracıkta
|
||||||
burada
|
burada
|
||||||
buradan
|
buradan
|
||||||
burası
|
burası
|
||||||
|
burasına
|
||||||
|
burasını
|
||||||
|
burasının
|
||||||
|
burasında
|
||||||
|
burasından
|
||||||
böyle
|
böyle
|
||||||
böylece
|
böylece
|
||||||
böylecene
|
böylecene
|
||||||
|
@ -106,8 +143,34 @@ büsbütün
|
||||||
bütün
|
bütün
|
||||||
cuk
|
cuk
|
||||||
cümlesi
|
cümlesi
|
||||||
|
cümlesine
|
||||||
|
cümlesini
|
||||||
|
cümlesinin
|
||||||
|
cümlesinden
|
||||||
|
cümlemize
|
||||||
|
cümlemizi
|
||||||
|
cümlemizden
|
||||||
|
çabuk
|
||||||
|
çabukça
|
||||||
|
çeşitli
|
||||||
|
çok
|
||||||
|
çokları
|
||||||
|
çoklarınca
|
||||||
|
çokluk
|
||||||
|
çoklukla
|
||||||
|
çokça
|
||||||
|
çoğu
|
||||||
|
çoğun
|
||||||
|
çoğunca
|
||||||
|
çoğunda
|
||||||
|
çoğundan
|
||||||
|
çoğunlukla
|
||||||
|
çoğunu
|
||||||
|
çoğunun
|
||||||
|
çünkü
|
||||||
da
|
da
|
||||||
daha
|
daha
|
||||||
|
dahası
|
||||||
dahi
|
dahi
|
||||||
dahil
|
dahil
|
||||||
dahilen
|
dahilen
|
||||||
|
@ -124,19 +187,17 @@ denli
|
||||||
derakap
|
derakap
|
||||||
derhal
|
derhal
|
||||||
derken
|
derken
|
||||||
deđil
|
|
||||||
değil
|
değil
|
||||||
değin
|
değin
|
||||||
diye
|
diye
|
||||||
diđer
|
|
||||||
diğer
|
diğer
|
||||||
diğeri
|
diğeri
|
||||||
doksan
|
diğerine
|
||||||
dokuz
|
diğerini
|
||||||
|
diğerinden
|
||||||
dolayı
|
dolayı
|
||||||
dolayısıyla
|
dolayısıyla
|
||||||
doğru
|
doğru
|
||||||
dört
|
|
||||||
edecek
|
edecek
|
||||||
eden
|
eden
|
||||||
ederek
|
ederek
|
||||||
|
@ -146,7 +207,6 @@ edilmesi
|
||||||
ediyor
|
ediyor
|
||||||
elbet
|
elbet
|
||||||
elbette
|
elbette
|
||||||
elli
|
|
||||||
emme
|
emme
|
||||||
en
|
en
|
||||||
enikonu
|
enikonu
|
||||||
|
@ -168,10 +228,10 @@ evvelce
|
||||||
evvelden
|
evvelden
|
||||||
evvelemirde
|
evvelemirde
|
||||||
evveli
|
evveli
|
||||||
eđer
|
|
||||||
eğer
|
eğer
|
||||||
fakat
|
fakat
|
||||||
filanca
|
filanca
|
||||||
|
filancanın
|
||||||
gah
|
gah
|
||||||
gayet
|
gayet
|
||||||
gayetle
|
gayetle
|
||||||
|
@ -197,6 +257,10 @@ haliyle
|
||||||
handiyse
|
handiyse
|
||||||
hangi
|
hangi
|
||||||
hangisi
|
hangisi
|
||||||
|
hangisine
|
||||||
|
hangisine
|
||||||
|
hangisinde
|
||||||
|
hangisinden
|
||||||
hani
|
hani
|
||||||
hariç
|
hariç
|
||||||
hasebiyle
|
hasebiyle
|
||||||
|
@ -207,17 +271,27 @@ hem
|
||||||
henüz
|
henüz
|
||||||
hep
|
hep
|
||||||
hepsi
|
hepsi
|
||||||
|
hepsini
|
||||||
|
hepsinin
|
||||||
|
hepsinde
|
||||||
|
hepsinden
|
||||||
her
|
her
|
||||||
herhangi
|
herhangi
|
||||||
herkes
|
herkes
|
||||||
|
herkesi
|
||||||
herkesin
|
herkesin
|
||||||
|
herkesten
|
||||||
hiç
|
hiç
|
||||||
hiçbir
|
hiçbir
|
||||||
hiçbiri
|
hiçbiri
|
||||||
|
hiçbirine
|
||||||
|
hiçbirini
|
||||||
|
hiçbirinin
|
||||||
|
hiçbirinde
|
||||||
|
hiçbirinden
|
||||||
hoş
|
hoş
|
||||||
hulasaten
|
hulasaten
|
||||||
iken
|
iken
|
||||||
iki
|
|
||||||
ila
|
ila
|
||||||
ile
|
ile
|
||||||
ilen
|
ilen
|
||||||
|
@ -240,43 +314,55 @@ iyicene
|
||||||
için
|
için
|
||||||
iş
|
iş
|
||||||
işte
|
işte
|
||||||
iţte
|
|
||||||
kadar
|
kadar
|
||||||
kaffesi
|
kaffesi
|
||||||
kah
|
kah
|
||||||
kala
|
kala
|
||||||
kanýmca
|
kanımca
|
||||||
karşın
|
karşın
|
||||||
katrilyon
|
|
||||||
kaynak
|
kaynak
|
||||||
kaçı
|
kaçı
|
||||||
|
kaçına
|
||||||
|
kaçında
|
||||||
|
kaçından
|
||||||
|
kaçını
|
||||||
|
kaçının
|
||||||
kelli
|
kelli
|
||||||
kendi
|
kendi
|
||||||
|
kendilerinde
|
||||||
|
kendilerinden
|
||||||
kendilerine
|
kendilerine
|
||||||
|
kendilerini
|
||||||
|
kendilerinin
|
||||||
kendini
|
kendini
|
||||||
kendisi
|
kendisi
|
||||||
|
kendisinde
|
||||||
|
kendisinden
|
||||||
kendisine
|
kendisine
|
||||||
kendisini
|
kendisini
|
||||||
|
kendisinin
|
||||||
kere
|
kere
|
||||||
kez
|
kez
|
||||||
keza
|
keza
|
||||||
kezalik
|
kezalik
|
||||||
keşke
|
keşke
|
||||||
keţke
|
|
||||||
ki
|
ki
|
||||||
kim
|
kim
|
||||||
kimden
|
kimden
|
||||||
kime
|
kime
|
||||||
kimi
|
kimi
|
||||||
|
kiminin
|
||||||
kimisi
|
kimisi
|
||||||
|
kimisinde
|
||||||
|
kimisinden
|
||||||
|
kimisine
|
||||||
|
kimisinin
|
||||||
kimse
|
kimse
|
||||||
kimsecik
|
kimsecik
|
||||||
kimsecikler
|
kimsecikler
|
||||||
külliyen
|
külliyen
|
||||||
kýrk
|
|
||||||
kýsaca
|
|
||||||
kırk
|
|
||||||
kısaca
|
kısaca
|
||||||
|
kısacası
|
||||||
lakin
|
lakin
|
||||||
leh
|
leh
|
||||||
lütfen
|
lütfen
|
||||||
|
@ -289,13 +375,10 @@ međer
|
||||||
meğer
|
meğer
|
||||||
meğerki
|
meğerki
|
||||||
meğerse
|
meğerse
|
||||||
milyar
|
|
||||||
milyon
|
|
||||||
mu
|
mu
|
||||||
mü
|
mü
|
||||||
mý
|
|
||||||
mı
|
mı
|
||||||
nasýl
|
mi
|
||||||
nasıl
|
nasıl
|
||||||
nasılsa
|
nasılsa
|
||||||
nazaran
|
nazaran
|
||||||
|
@ -304,6 +387,8 @@ ne
|
||||||
neden
|
neden
|
||||||
nedeniyle
|
nedeniyle
|
||||||
nedenle
|
nedenle
|
||||||
|
nedenler
|
||||||
|
nedenlerden
|
||||||
nedense
|
nedense
|
||||||
nerde
|
nerde
|
||||||
nerden
|
nerden
|
||||||
|
@ -332,32 +417,27 @@ olduklarını
|
||||||
oldukça
|
oldukça
|
||||||
olduğu
|
olduğu
|
||||||
olduğunu
|
olduğunu
|
||||||
olmadı
|
|
||||||
olmadığı
|
|
||||||
olmak
|
olmak
|
||||||
olması
|
olması
|
||||||
olmayan
|
|
||||||
olmaz
|
|
||||||
olsa
|
olsa
|
||||||
olsun
|
olsun
|
||||||
olup
|
olup
|
||||||
olur
|
olur
|
||||||
olursa
|
olursa
|
||||||
oluyor
|
oluyor
|
||||||
on
|
|
||||||
ona
|
ona
|
||||||
onca
|
onca
|
||||||
onculayın
|
onculayın
|
||||||
onda
|
onda
|
||||||
ondan
|
ondan
|
||||||
onlar
|
onlar
|
||||||
|
onlara
|
||||||
onlardan
|
onlardan
|
||||||
onlari
|
|
||||||
onlarýn
|
|
||||||
onları
|
onları
|
||||||
onların
|
onların
|
||||||
onu
|
onu
|
||||||
onun
|
onun
|
||||||
|
ora
|
||||||
oracık
|
oracık
|
||||||
oracıkta
|
oracıkta
|
||||||
orada
|
orada
|
||||||
|
@ -365,9 +445,26 @@ oradan
|
||||||
oranca
|
oranca
|
||||||
oranla
|
oranla
|
||||||
oraya
|
oraya
|
||||||
otuz
|
|
||||||
oysa
|
oysa
|
||||||
oysaki
|
oysaki
|
||||||
|
öbür
|
||||||
|
öbürkü
|
||||||
|
öbürü
|
||||||
|
öbüründe
|
||||||
|
öbüründen
|
||||||
|
öbürüne
|
||||||
|
öbürünü
|
||||||
|
önce
|
||||||
|
önceden
|
||||||
|
önceleri
|
||||||
|
öncelikle
|
||||||
|
öteki
|
||||||
|
ötekisi
|
||||||
|
öyle
|
||||||
|
öylece
|
||||||
|
öylelikle
|
||||||
|
öylemesine
|
||||||
|
öz
|
||||||
pek
|
pek
|
||||||
pekala
|
pekala
|
||||||
peki
|
peki
|
||||||
|
@ -379,8 +476,6 @@ sahi
|
||||||
sahiden
|
sahiden
|
||||||
sana
|
sana
|
||||||
sanki
|
sanki
|
||||||
sekiz
|
|
||||||
seksen
|
|
||||||
sen
|
sen
|
||||||
senden
|
senden
|
||||||
seni
|
seni
|
||||||
|
@ -393,6 +488,27 @@ sonra
|
||||||
sonradan
|
sonradan
|
||||||
sonraları
|
sonraları
|
||||||
sonunda
|
sonunda
|
||||||
|
şayet
|
||||||
|
şey
|
||||||
|
şeyden
|
||||||
|
şeyi
|
||||||
|
şeyler
|
||||||
|
şu
|
||||||
|
şuna
|
||||||
|
şuncacık
|
||||||
|
şunda
|
||||||
|
şundan
|
||||||
|
şunlar
|
||||||
|
şunları
|
||||||
|
şunların
|
||||||
|
şunu
|
||||||
|
şunun
|
||||||
|
şura
|
||||||
|
şuracık
|
||||||
|
şuracıkta
|
||||||
|
şurası
|
||||||
|
şöyle
|
||||||
|
şimdi
|
||||||
tabii
|
tabii
|
||||||
tam
|
tam
|
||||||
tamam
|
tamam
|
||||||
|
@ -400,8 +516,8 @@ tamamen
|
||||||
tamamıyla
|
tamamıyla
|
||||||
tarafından
|
tarafından
|
||||||
tek
|
tek
|
||||||
trilyon
|
|
||||||
tüm
|
tüm
|
||||||
|
üzere
|
||||||
var
|
var
|
||||||
vardı
|
vardı
|
||||||
vasıtasıyla
|
vasıtasıyla
|
||||||
|
@ -429,84 +545,16 @@ yaptığını
|
||||||
yapılan
|
yapılan
|
||||||
yapılması
|
yapılması
|
||||||
yapıyor
|
yapıyor
|
||||||
yedi
|
|
||||||
yeniden
|
yeniden
|
||||||
yenilerde
|
yenilerde
|
||||||
yerine
|
yerine
|
||||||
yetmiþ
|
|
||||||
yetmiş
|
|
||||||
yetmiţ
|
|
||||||
yine
|
yine
|
||||||
yirmi
|
|
||||||
yok
|
yok
|
||||||
yoksa
|
yoksa
|
||||||
yoluyla
|
yoluyla
|
||||||
yüz
|
|
||||||
yüzünden
|
yüzünden
|
||||||
zarfında
|
zarfında
|
||||||
zaten
|
zaten
|
||||||
zati
|
zati
|
||||||
zira
|
zira
|
||||||
çabuk
|
|
||||||
çabukça
|
|
||||||
çeşitli
|
|
||||||
çok
|
|
||||||
çokları
|
|
||||||
çoklarınca
|
|
||||||
çokluk
|
|
||||||
çoklukla
|
|
||||||
çokça
|
|
||||||
çoğu
|
|
||||||
çoğun
|
|
||||||
çoğunca
|
|
||||||
çoğunlukla
|
|
||||||
çünkü
|
|
||||||
öbür
|
|
||||||
öbürkü
|
|
||||||
öbürü
|
|
||||||
önce
|
|
||||||
önceden
|
|
||||||
önceleri
|
|
||||||
öncelikle
|
|
||||||
öteki
|
|
||||||
ötekisi
|
|
||||||
öyle
|
|
||||||
öylece
|
|
||||||
öylelikle
|
|
||||||
öylemesine
|
|
||||||
öz
|
|
||||||
üzere
|
|
||||||
üç
|
|
||||||
þey
|
|
||||||
þeyden
|
|
||||||
þeyi
|
|
||||||
þeyler
|
|
||||||
þu
|
|
||||||
þuna
|
|
||||||
þunda
|
|
||||||
þundan
|
|
||||||
þunu
|
|
||||||
şayet
|
|
||||||
şey
|
|
||||||
şeyden
|
|
||||||
şeyi
|
|
||||||
şeyler
|
|
||||||
şu
|
|
||||||
şuna
|
|
||||||
şuncacık
|
|
||||||
şunda
|
|
||||||
şundan
|
|
||||||
şunlar
|
|
||||||
şunları
|
|
||||||
şunu
|
|
||||||
şunun
|
|
||||||
şura
|
|
||||||
şuracık
|
|
||||||
şuracıkta
|
|
||||||
şurası
|
|
||||||
şöyle
|
|
||||||
ţayet
|
|
||||||
ţimdi
|
|
||||||
ţu
|
|
||||||
ţöyle
|
|
||||||
""".split())
|
""".split())
|
||||||
|
|
|
@ -3,11 +3,6 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
from ...symbols import ORTH, NORM
|
from ...symbols import ORTH, NORM
|
||||||
|
|
||||||
|
|
||||||
# These exceptions are mostly for example purposes – hoping that Turkish
|
|
||||||
# speakers can contribute in the future! Source of copy-pasted examples:
|
|
||||||
# https://en.wiktionary.org/wiki/Category:Turkish_language
|
|
||||||
|
|
||||||
_exc = {
|
_exc = {
|
||||||
"sağol": [
|
"sağol": [
|
||||||
{ORTH: "sağ"},
|
{ORTH: "sağ"},
|
||||||
|
@ -16,11 +11,112 @@ _exc = {
|
||||||
|
|
||||||
|
|
||||||
for exc_data in [
|
for exc_data in [
|
||||||
{ORTH: "A.B.D.", NORM: "Amerika Birleşik Devletleri"}]:
|
{ORTH: "A.B.D.", NORM: "Amerika Birleşik Devletleri"},
|
||||||
|
{ORTH: "Alb.", NORM: "Albay"},
|
||||||
|
{ORTH: "Ar.Gör.", NORM: "Araştırma Görevlisi"},
|
||||||
|
{ORTH: "Arş.Gör.", NORM: "Araştırma Görevlisi"},
|
||||||
|
{ORTH: "Asb.", NORM: "Astsubay"},
|
||||||
|
{ORTH: "Astsb.", NORM: "Astsubay"},
|
||||||
|
{ORTH: "As.İz.", NORM: "Askeri İnzibat"},
|
||||||
|
{ORTH: "Atğm", NORM: "Asteğmen"},
|
||||||
|
{ORTH: "Av.", NORM: "Avukat"},
|
||||||
|
{ORTH: "Apt.", NORM: "Apartmanı"},
|
||||||
|
{ORTH: "Bçvş.", NORM: "Başçavuş"},
|
||||||
|
{ORTH: "bk.", NORM: "bakınız"},
|
||||||
|
{ORTH: "bknz.", NORM: "bakınız"},
|
||||||
|
{ORTH: "Bnb.", NORM: "Binbaşı"},
|
||||||
|
{ORTH: "bnb.", NORM: "binbaşı"},
|
||||||
|
{ORTH: "Böl.", NORM: "Bölümü"},
|
||||||
|
{ORTH: "Bşk.", NORM: "Başkanlığı"},
|
||||||
|
{ORTH: "Bştbp.", NORM: "Baştabip"},
|
||||||
|
{ORTH: "Bul.", NORM: "Bulvarı"},
|
||||||
|
{ORTH: "Cad.", NORM: "Caddesi"},
|
||||||
|
{ORTH: "çev.", NORM: "çeviren"},
|
||||||
|
{ORTH: "Çvş.", NORM: "Çavuş"},
|
||||||
|
{ORTH: "dak.", NORM: "dakika"},
|
||||||
|
{ORTH: "dk.", NORM: "dakika"},
|
||||||
|
{ORTH: "Doç.", NORM: "Doçent"},
|
||||||
|
{ORTH: "doğ.", NORM: "doğum tarihi"},
    {ORTH: "drl.", NORM: "derleyen"},
    {ORTH: "Dz.", NORM: "Deniz"},
    {ORTH: "Dz.K.K.lığı", NORM: "Deniz Kuvvetleri Komutanlığı"},
    {ORTH: "Dz.Kuv.", NORM: "Deniz Kuvvetleri"},
    {ORTH: "Dz.Kuv.K.", NORM: "Deniz Kuvvetleri Komutanlığı"},
    {ORTH: "dzl.", NORM: "düzenleyen"},
    {ORTH: "Ecz.", NORM: "Eczanesi"},
    {ORTH: "ekon.", NORM: "ekonomi"},
    {ORTH: "Fak.", NORM: "Fakültesi"},
    {ORTH: "Gn.", NORM: "Genel"},
    {ORTH: "Gnkur.", NORM: "Genelkurmay"},
    {ORTH: "Gn.Kur.", NORM: "Genelkurmay"},
    {ORTH: "gr.", NORM: "gram"},
    {ORTH: "Hst.", NORM: "Hastanesi"},
    {ORTH: "Hs.Uzm.", NORM: "Hesap Uzmanı"},
    {ORTH: "huk.", NORM: "hukuk"},
    {ORTH: "Hv.", NORM: "Hava"},
    {ORTH: "Hv.K.K.lığı", NORM: "Hava Kuvvetleri Komutanlığı"},
    {ORTH: "Hv.Kuv.", NORM: "Hava Kuvvetleri"},
    {ORTH: "Hv.Kuv.K.", NORM: "Hava Kuvvetleri Komutanlığı"},
    {ORTH: "Hz.", NORM: "Hazreti"},
    {ORTH: "Hz.Öz.", NORM: "Hizmete Özel"},
    {ORTH: "İng.", NORM: "İngilizce"},
    {ORTH: "Jeol.", NORM: "Jeoloji"},
    {ORTH: "jeol.", NORM: "jeoloji"},
    {ORTH: "Korg.", NORM: "Korgeneral"},
    {ORTH: "Kur.", NORM: "Kurmay"},
    {ORTH: "Kur.Bşk.", NORM: "Kurmay Başkanı"},
    {ORTH: "Kuv.", NORM: "Kuvvetleri"},
    {ORTH: "Ltd.", NORM: "Limited"},
    {ORTH: "Mah.", NORM: "Mahallesi"},
    {ORTH: "mah.", NORM: "mahallesi"},
    {ORTH: "max.", NORM: "maksimum"},
    {ORTH: "min.", NORM: "minimum"},
    {ORTH: "Müh.", NORM: "Mühendisliği"},
    {ORTH: "müh.", NORM: "mühendisliği"},
    {ORTH: "MÖ.", NORM: "Milattan Önce"},
    {ORTH: "Onb.", NORM: "Onbaşı"},
    {ORTH: "Ord.", NORM: "Ordinaryüs"},
    {ORTH: "Org.", NORM: "Orgeneral"},
    {ORTH: "Ped.", NORM: "Pedagoji"},
    {ORTH: "Prof.", NORM: "Profesör"},
    {ORTH: "Sb.", NORM: "Subay"},
    {ORTH: "Sn.", NORM: "Sayın"},
    {ORTH: "sn.", NORM: "saniye"},
    {ORTH: "Sok.", NORM: "Sokak"},
    {ORTH: "Şb.", NORM: "Şube"},
    {ORTH: "Şti.", NORM: "Şirketi"},
    {ORTH: "Tbp.", NORM: "Tabip"},
    {ORTH: "T.C.", NORM: "Türkiye Cumhuriyeti"},
    {ORTH: "Tel.", NORM: "Telefon"},
    {ORTH: "tel.", NORM: "telefon"},
    {ORTH: "telg.", NORM: "telgraf"},
    {ORTH: "Tğm.", NORM: "Teğmen"},
    {ORTH: "tğm.", NORM: "teğmen"},
    {ORTH: "tic.", NORM: "ticaret"},
    {ORTH: "Tug.", NORM: "Tugay"},
    {ORTH: "Tuğg.", NORM: "Tuğgeneral"},
    {ORTH: "Tümg.", NORM: "Tümgeneral"},
    {ORTH: "Uzm.", NORM: "Uzman"},
    {ORTH: "Üçvş.", NORM: "Üstçavuş"},
    {ORTH: "Üni.", NORM: "Üniversitesi"},
    {ORTH: "Ütğm.", NORM: "Üsteğmen"},
    {ORTH: "vb.", NORM: "ve benzeri"},
    {ORTH: "vs.", NORM: "vesaire"},
    {ORTH: "Yard.", NORM: "Yardımcı"},
    {ORTH: "Yar.", NORM: "Yardımcı"},
    {ORTH: "Yd.Sb.", NORM: "Yedek Subay"},
    {ORTH: "Yard.Doç.", NORM: "Yardımcı Doçent"},
    {ORTH: "Yar.Doç.", NORM: "Yardımcı Doçent"},
    {ORTH: "Yb.", NORM: "Yarbay"},
    {ORTH: "Yrd.", NORM: "Yardımcı"},
    {ORTH: "Yrd.Doç.", NORM: "Yardımcı Doçent"},
    {ORTH: "Y.Müh.", NORM: "Yüksek mühendis"},
    {ORTH: "Y.Mim.", NORM: "Yüksek mimar"}]:
    _exc[exc_data[ORTH]] = [exc_data]

-for orth in ["Dr."]:
+for orth in ["Dr.", "yy."]:
     _exc[orth] = [{ORTH: orth}]
@@ -319,7 +319,7 @@ cdef class ArcEager(TransitionSystem):
             (SHIFT, ['']),
             (REDUCE, ['']),
             (RIGHT, []),
-            (LEFT, []),
+            (LEFT, ['subtok']),
             (BREAK, ['ROOT']))
         ))
         seen_actions = set()
@@ -477,14 +477,15 @@ cdef class Parser:
         free(vectors)
         free(scores)

-    def beam_parse(self, docs, int beam_width=3, float beam_density=0.001):
+    def beam_parse(self, docs, int beam_width=3, float beam_density=0.001,
+                   float drop=0.):
         cdef Beam beam
         cdef np.ndarray scores
         cdef Doc doc
         cdef int nr_class = self.moves.n_moves
         cuda_stream = util.get_cuda_stream()
         (tokvecs, bp_tokvecs), state2vec, vec2scores = self.get_batch_model(
-            docs, cuda_stream, 0.0)
+            docs, cuda_stream, drop)
         cdef int offset = 0
         cdef int j = 0
         cdef int k
@@ -523,8 +524,8 @@ cdef class Parser:
                 n_states += 1
             if n_states == 0:
                 break
-            vectors = state2vec(token_ids[:n_states])
-            scores = vec2scores(vectors)
+            vectors, _ = state2vec.begin_update(token_ids[:n_states], drop)
+            scores, _ = vec2scores.begin_update(vectors, drop=drop)
             c_scores = <float*>scores.data
             for beam in todo:
                 for i in range(beam.size):
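The switch from plain calls to `begin_update(..., drop=...)` threads dropout through beam parsing. A toy stand-in for the model interface (not thinc's real API) shows the shape of that contract:

```python
# Toy illustration of the begin_update(X, drop) pattern used above:
# a model returns (output, backprop) and applies dropout internally,
# so callers like beam_parse just forward a `drop` rate. Hypothetical
# TinyModel, not spaCy/thinc code.
import random

class TinyModel:
    def __init__(self, scale=2.0):
        self.scale = scale

    def begin_update(self, inputs, drop=0.0):
        # Zero each input with probability `drop`, remember the mask so
        # the backward pass drops the same units.
        mask = [0.0 if random.random() < drop else 1.0 for _ in inputs]
        out = [x * m * self.scale for x, m in zip(inputs, mask)]
        def backprop(d_out):
            return [d * m * self.scale for d, m in zip(d_out, mask)]
        return out, backprop

model = TinyModel()
scores, _ = model.begin_update([1.0, 2.0], drop=0.0)
print(scores)
```

With `drop=0.0` the mask is all ones, so inference behaves exactly like the old direct call.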
@@ -191,9 +191,12 @@ def _filter_labels(gold_tuples, cutoff, freqs):
     for raw_text, sents in gold_tuples:
         filtered_sents = []
         for (ids, words, tags, heads, labels, iob), ctnts in sents:
-            filtered_labels = [decompose(label)[0]
-                               if freqs.get(label, cutoff) < cutoff
-                               else label for label in labels]
+            filtered_labels = []
+            for label in labels:
+                if is_decorated(label) and freqs.get(label, 0) < cutoff:
+                    filtered_labels.append(decompose(label)[0])
+                else:
+                    filtered_labels.append(label)
             filtered_sents.append(
                 ((ids, words, tags, heads, filtered_labels, iob), ctnts))
         filtered.append((raw_text, filtered_sents))
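The rewritten loop only strips rare *decorated* labels (the `head||dep` labels the pseudo-projective transform produces). A self-contained sketch with simplified stand-ins for `is_decorated`/`decompose` — not spaCy's real helpers:

```python
# Sketch of the rare-label filter above. Decorated labels like
# 'dobj||xcomp' come from pseudo-projective parsing; if such a label is
# seen fewer than `cutoff` times, keep only its base dependency.
def is_decorated(label):
    return "||" in label

def decompose(label):
    return label.split("||")

def filter_labels(labels, cutoff, freqs):
    filtered = []
    for label in labels:
        # Only rare decorated labels are reduced; everything else is kept.
        if is_decorated(label) and freqs.get(label, 0) < cutoff:
            filtered.append(decompose(label)[0])
        else:
            filtered.append(label)
    return filtered

print(filter_labels(["nsubj", "dobj||xcomp"], cutoff=5,
                    freqs={"dobj||xcomp": 2}))
```

Note the behavioral fix versus the old comprehension: `freqs.get(label, 0)` treats unseen labels as rare, and plain labels are never decomposed.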
74  spacy/tests/parser/test_arc_eager_oracle.py  Normal file
@@ -0,0 +1,74 @@
+from ...vocab import Vocab
+from ...pipeline import DependencyParser
+from ...tokens import Doc
+from ...gold import GoldParse
+from ...syntax.nonproj import projectivize
+
+annot_tuples = [
+    (0, 'When', 'WRB', 11, 'advmod', 'O'),
+    (1, 'Walter', 'NNP', 2, 'compound', 'B-PERSON'),
+    (2, 'Rodgers', 'NNP', 11, 'nsubj', 'L-PERSON'),
+    (3, ',', ',', 2, 'punct', 'O'),
+    (4, 'our', 'PRP$', 6, 'poss', 'O'),
+    (5, 'embedded', 'VBN', 6, 'amod', 'O'),
+    (6, 'reporter', 'NN', 2, 'appos', 'O'),
+    (7, 'with', 'IN', 6, 'prep', 'O'),
+    (8, 'the', 'DT', 10, 'det', 'B-ORG'),
+    (9, '3rd', 'NNP', 10, 'compound', 'I-ORG'),
+    (10, 'Cavalry', 'NNP', 7, 'pobj', 'L-ORG'),
+    (11, 'says', 'VBZ', 44, 'advcl', 'O'),
+    (12, 'three', 'CD', 13, 'nummod', 'U-CARDINAL'),
+    (13, 'battalions', 'NNS', 16, 'nsubj', 'O'),
+    (14, 'of', 'IN', 13, 'prep', 'O'),
+    (15, 'troops', 'NNS', 14, 'pobj', 'O'),
+    (16, 'are', 'VBP', 11, 'ccomp', 'O'),
+    (17, 'on', 'IN', 16, 'prep', 'O'),
+    (18, 'the', 'DT', 19, 'det', 'O'),
+    (19, 'ground', 'NN', 17, 'pobj', 'O'),
+    (20, ',', ',', 17, 'punct', 'O'),
+    (21, 'inside', 'IN', 17, 'prep', 'O'),
+    (22, 'Baghdad', 'NNP', 21, 'pobj', 'U-GPE'),
+    (23, 'itself', 'PRP', 22, 'appos', 'O'),
+    (24, ',', ',', 16, 'punct', 'O'),
+    (25, 'have', 'VBP', 26, 'aux', 'O'),
+    (26, 'taken', 'VBN', 16, 'dep', 'O'),
+    (27, 'up', 'RP', 26, 'prt', 'O'),
+    (28, 'positions', 'NNS', 26, 'dobj', 'O'),
+    (29, 'they', 'PRP', 31, 'nsubj', 'O'),
+    (30, "'re", 'VBP', 31, 'aux', 'O'),
+    (31, 'going', 'VBG', 26, 'parataxis', 'O'),
+    (32, 'to', 'TO', 33, 'aux', 'O'),
+    (33, 'spend', 'VB', 31, 'xcomp', 'O'),
+    (34, 'the', 'DT', 35, 'det', 'B-TIME'),
+    (35, 'night', 'NN', 33, 'dobj', 'L-TIME'),
+    (36, 'there', 'RB', 33, 'advmod', 'O'),
+    (37, 'presumably', 'RB', 33, 'advmod', 'O'),
+    (38, ',', ',', 44, 'punct', 'O'),
+    (39, 'how', 'WRB', 40, 'advmod', 'O'),
+    (40, 'many', 'JJ', 41, 'amod', 'O'),
+    (41, 'soldiers', 'NNS', 44, 'pobj', 'O'),
+    (42, 'are', 'VBP', 44, 'aux', 'O'),
+    (43, 'we', 'PRP', 44, 'nsubj', 'O'),
+    (44, 'talking', 'VBG', 44, 'ROOT', 'O'),
+    (45, 'about', 'IN', 44, 'prep', 'O'),
+    (46, 'right', 'RB', 47, 'advmod', 'O'),
+    (47, 'now', 'RB', 44, 'advmod', 'O'),
+    (48, '?', '.', 44, 'punct', 'O')]
+
+
+def test_get_oracle_actions():
+    doc = Doc(Vocab(), words=[t[1] for t in annot_tuples])
+    parser = DependencyParser(doc.vocab)
+    parser.moves.add_action(0, '')
+    parser.moves.add_action(1, '')
+    parser.moves.add_action(1, '')
+    parser.moves.add_action(4, 'ROOT')
+    for i, (id_, word, tag, head, dep, ent) in enumerate(annot_tuples):
+        if head > i:
+            parser.moves.add_action(2, dep)
+        elif head < i:
+            parser.moves.add_action(3, dep)
+    ids, words, tags, heads, deps, ents = zip(*annot_tuples)
+    heads, deps = projectivize(heads, deps)
+    gold = GoldParse(doc, words=words, tags=tags, heads=heads, deps=deps)
+    parser.moves.preprocess_gold(gold)
+    actions = parser.moves.get_oracle_sequence(doc, gold)
@@ -294,6 +294,7 @@ cdef class Span:
         cdef int i
         if self.doc.is_parsed:
             root = &self.doc.c[self.start]
+            n = 0
             while root.head != 0:
                 root += root.head
                 n += 1
@@ -307,8 +308,10 @@ cdef class Span:
             start += -1
         # find end of the sentence
         end = self.end
-        while self.doc.c[end].sent_start != 1:
+        n = 0
+        while end < self.doc.length and self.doc.c[end].sent_start != 1:
             end += 1
+            n += 1
             if n >= self.doc.length:
                 break
         #
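Both `Span` hunks add the same defense: initialize a counter and bound the scan by the document length so malformed `sent_start` flags can never loop forever or walk off the end. A plain-Python stand-in for the Cython code:

```python
# Sketch of the bounded sentence-boundary scan introduced above.
# `sent_starts` plays the role of doc.c[i].sent_start; the walk stops
# at the next boundary flag or at the end of the sequence.
def find_sentence_end(sent_starts, start):
    """Return the index of the next position whose sent_start flag is 1,
    scanning from `start`, without running past the sequence."""
    end = start
    n = 0
    while end < len(sent_starts) and sent_starts[end] != 1:
        end += 1
        n += 1
        if n >= len(sent_starts):
            break  # safety valve: never scan more steps than tokens
    return end

print(find_sentence_end([1, 0, 0, 1, 0], 1))
```

Without the length check, a document whose remaining tokens never set `sent_start` would read past the buffer in the Cython version.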
@@ -279,8 +279,8 @@ cdef class Token:
         """
         def __get__(self):
            if self.c.lemma == 0:
-                lemma = self.vocab.morphology.lemmatizer.lookup(self.orth_)
-                return lemma
+                lemma_ = self.vocab.morphology.lemmatizer.lookup(self.orth_)
+                return self.vocab.strings[lemma_]
            else:
                return self.c.lemma
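The `Token.lemma` fix interns the lemmatizer's string through the vocab string store so the property returns an integer ID, not a string. A toy interning store illustrating the idea (spaCy's real `StringStore` hashes strings; this is a simplified assumption):

```python
# Toy string store: maps strings to stable integer IDs, interning on
# first sight. Stands in for vocab.strings in the hunk above.
class StringStore:
    def __init__(self):
        self._to_id = {}
        self._to_str = []

    def __getitem__(self, string):
        # Intern the string the first time it is seen.
        if string not in self._to_id:
            self._to_id[string] = len(self._to_str)
            self._to_str.append(string)
        return self._to_id[string]

strings = StringStore()
lemma_ = "run"              # what a lookup lemmatizer might return
lemma_id = strings[lemma_]  # what Token.lemma should return: an ID
print(lemma_id, strings[lemma_] == lemma_id)
```

Returning the raw string, as the old code did, would break callers that expect `lemma` to be an integer attribute like the cached `self.c.lemma`.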
@@ -451,7 +451,7 @@ def itershuffle(iterable, bufsize=1000):
     try:
         while True:
             for i in range(random.randint(1, bufsize-len(buf))):
-                buf.append(iterable.next())
+                buf.append(next(iterable))
             random.shuffle(buf)
             for i in range(random.randint(1, bufsize)):
                 if buf:
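The one-line change swaps Python-2-only `iterable.next()` for the builtin `next(iterable)`, which works on both 2 and 3. A self-contained, simplified version of the buffer-shuffle generator for context:

```python
# Simplified itershuffle: yield items in roughly shuffled order while
# holding at most `bufsize` items in memory. Modeled on the function
# patched above, trimmed for illustration.
import random

def itershuffle(iterable, bufsize=1000):
    iterable = iter(iterable)
    buf = []
    try:
        while True:
            for _ in range(random.randint(1, bufsize - len(buf))):
                buf.append(next(iterable))  # raises StopIteration when drained
            random.shuffle(buf)
            for _ in range(random.randint(1, bufsize)):
                if buf:
                    yield buf.pop()
    except StopIteration:
        # Source exhausted: shuffle and flush whatever is buffered.
        random.shuffle(buf)
        while buf:
            yield buf.pop()

items = sorted(itershuffle(range(10), bufsize=4))
print(items)
```

Every input item is yielded exactly once; only the order is randomized, which is why sorting the output recovers the input.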
@@ -15,11 +15,8 @@ from .compat import basestring_, path2str
 from . import util


-def unpickle_vectors(keys_and_rows, data):
-    vectors = Vectors(data=data)
-    for key, row in keys_and_rows:
-        vectors.add(key, row=row)
-    return vectors
+def unpickle_vectors(bytes_data):
+    return Vectors().from_bytes(bytes_data)


 cdef class Vectors:
@@ -86,8 +83,7 @@ cdef class Vectors:
         return len(self.key2row)

     def __reduce__(self):
-        keys_and_rows = tuple(self.key2row.items())
-        return (unpickle_vectors, (keys_and_rows, self.data))
+        return (unpickle_vectors, (self.to_bytes(),))

     def __getitem__(self, key):
         """Get a vector by key. If the key is not found, a KeyError is raised.
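The `Vectors` change collapses pickling to a single serialized blob plus a module-level rebuild function. A toy class demonstrating the `__reduce__`-via-serialization pattern (hypothetical `Store`, not spaCy's `Vectors`):

```python
# Sketch of the pickle pattern adopted above: __reduce__ returns
# (rebuild_function, (args,)), so pickle stores one byte string and
# rebuilds the object through the class's own serialization.
import json
import pickle

def unpickle_store(bytes_data):
    return Store().from_bytes(bytes_data)

class Store:
    def __init__(self, data=None):
        self.data = dict(data or {})

    def to_bytes(self):
        return json.dumps(self.data).encode("utf8")

    def from_bytes(self, bytes_data):
        self.data = json.loads(bytes_data.decode("utf8"))
        return self

    def __reduce__(self):
        # One blob round-trips the whole object; no per-field pickling.
        return (unpickle_store, (self.to_bytes(),))

s = Store({"cat": 1, "dog": 2})
s2 = pickle.loads(pickle.dumps(s))
print(s2.data["dog"])
```

Delegating to `to_bytes`/`from_bytes` keeps the pickle format in lockstep with the class's normal serialization, so there is only one wire format to maintain.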
Binary file not shown. (Before: 378 KiB)
@@ -76,13 +76,15 @@
     },

     "MODEL_LICENSES": {
+        "CC BY 4.0": "https://creativecommons.org/licenses/by/4.0/",
         "CC BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
         "CC BY-SA 3.0": "https://creativecommons.org/licenses/by-sa/3.0/",
         "CC BY-SA 4.0": "https://creativecommons.org/licenses/by-sa/4.0/",
         "CC BY-NC": "https://creativecommons.org/licenses/by-nc/3.0/",
         "CC BY-NC 3.0": "https://creativecommons.org/licenses/by-nc/3.0/",
+        "CC-BY-NC-SA 3.0": "https://creativecommons.org/licenses/by-nc-sa/3.0/",
         "GPL": "https://www.gnu.org/licenses/gpl.html",
         "LGPL": "https://www.gnu.org/licenses/lgpl.html"
     },

     "MODEL_BENCHMARKS": {
@@ -68,7 +68,7 @@ p
     +item #[strong spaCy is not research software].
         | It's built on the latest research, but it's designed to get
         | things done. This leads to fairly different design decisions than
-        | #[+a("https://github./nltk/nltk") NLTK]
+        | #[+a("https://github.com/nltk/nltk") NLTK]
         | or #[+a("https://stanfordnlp.github.io/CoreNLP/") CoreNLP], which were
         | created as platforms for teaching and research. The main difference
         | is that spaCy is integrated and opinionated. spaCy tries to avoid asking