mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-13 10:46:29 +03:00
Merge branch 'develop' into pr/5008
This commit is contained in:
commit
5d21d3e8b9
106
.github/contributors/Jan-711.md
vendored
Normal file
106
.github/contributors/Jan-711.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Jan Jessewitsch |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 16.02.2020 |
|
||||
| GitHub username | Jan-711 |
|
||||
| Website (optional) | |
|
4
.gitignore
vendored
4
.gitignore
vendored
|
@ -39,6 +39,7 @@ __pycache__/
|
|||
.env*
|
||||
.~env/
|
||||
.venv
|
||||
env3.6/
|
||||
venv/
|
||||
.dev
|
||||
.denv
|
||||
|
@ -111,3 +112,6 @@ Desktop.ini
|
|||
|
||||
# Pycharm project files
|
||||
*.idea
|
||||
|
||||
# IPython
|
||||
.ipynb_checkpoints/
|
||||
|
|
23
.travis.yml
23
.travis.yml
|
@ -1,23 +0,0 @@
|
|||
language: python
|
||||
sudo: false
|
||||
cache: pip
|
||||
dist: trusty
|
||||
group: edge
|
||||
python:
|
||||
- "2.7"
|
||||
os:
|
||||
- linux
|
||||
install:
|
||||
- "pip install -r requirements.txt"
|
||||
- "python setup.py build_ext --inplace"
|
||||
- "pip install -e ."
|
||||
script:
|
||||
- "cat /proc/cpuinfo | grep flags | head -n 1"
|
||||
- "python -m pytest --tb=native spacy"
|
||||
branches:
|
||||
except:
|
||||
- spacy.io
|
||||
notifications:
|
||||
slack:
|
||||
secure: F8GvqnweSdzImuLL64TpfG0i5rYl89liyr9tmFVsHl4c0DNiDuGhZivUz0M1broS8svE3OPOllLfQbACG/4KxD890qfF9MoHzvRDlp7U+RtwMV/YAkYn8MGWjPIbRbX0HpGdY7O2Rc9Qy4Kk0T8ZgiqXYIqAz2Eva9/9BlSmsJQ=
|
||||
email: false
|
|
@ -280,23 +280,7 @@ except: # noqa: E722
|
|||
|
||||
### Python conventions
|
||||
|
||||
All Python code must be written in an **intersection of Python 2 and Python 3**.
|
||||
This is easy in Cython, but somewhat ugly in Python. Logic that deals with
|
||||
Python or platform compatibility should only live in
|
||||
[`spacy.compat`](spacy/compat.py). To distinguish them from the builtin
|
||||
functions, replacement functions are suffixed with an underscore, for example
|
||||
`unicode_`. If you need to access the user's version or platform information,
|
||||
for example to show more specific error messages, you can use the `is_config()`
|
||||
helper function.
|
||||
|
||||
```python
|
||||
from .compat import unicode_, is_config
|
||||
|
||||
compatible_unicode = unicode_('hello world')
|
||||
if is_config(windows=True, python2=True):
|
||||
print("You are using Python 2 on Windows.")
|
||||
```
|
||||
|
||||
All Python code must be written **compatible with Python 3.6+**.
|
||||
Code that interacts with the file-system should accept objects that follow the
|
||||
`pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`.
|
||||
If the function is user-facing and takes a path as an argument, it should check
|
||||
|
|
14
README.md
14
README.md
|
@ -15,7 +15,6 @@ It's commercial open-source software, released under the MIT license.
|
|||
[Check out the release notes here.](https://github.com/explosion/spaCy/releases)
|
||||
|
||||
[![Azure Pipelines](<https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build+(3.x)>)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
|
||||
[![Travis Build Status](<https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis-ci&logoColor=white&label=build+(2.7)>)](https://travis-ci.org/explosion/spaCy)
|
||||
[![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square&logo=github)](https://github.com/explosion/spaCy/releases)
|
||||
[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy/)
|
||||
[![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/spacy)
|
||||
|
@ -98,12 +97,19 @@ For detailed installation instructions, see the
|
|||
|
||||
- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual
|
||||
Studio)
|
||||
- **Python version**: Python 2.7, 3.5+ (only 64 bit)
|
||||
- **Python version**: Python 3.6+ (only 64 bit)
|
||||
- **Package managers**: [pip] · [conda] (via `conda-forge`)
|
||||
|
||||
[pip]: https://pypi.org/project/spacy/
|
||||
[conda]: https://anaconda.org/conda-forge/spacy
|
||||
|
||||
> ⚠️ **Important note for Python 3.8:** We can't yet ship pre-compiled binary
|
||||
> wheels for spaCy that work on Python 3.8, as we're still waiting for our CI
|
||||
> providers and other tooling to support it. This means that in order to run
|
||||
> spaCy on Python 3.8, you'll need [a compiler installed](#source) and compile
|
||||
> the library and its Cython dependencies locally. If this is causing problems
|
||||
> for you, the easiest solution is to **use Python 3.7** in the meantime.
|
||||
|
||||
### pip
|
||||
|
||||
Using pip, spaCy releases are available as source packages and binary wheels (as
|
||||
|
@ -262,9 +268,7 @@ and git preinstalled.
|
|||
Install a version of the
|
||||
[Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
|
||||
or [Visual Studio Express](https://visualstudio.microsoft.com/vs/express/) that
|
||||
matches the version that was used to compile your Python interpreter. For
|
||||
official distributions these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and
|
||||
VS 2015 (Python 3.5).
|
||||
matches the version that was used to compile your Python interpreter.
|
||||
|
||||
## Run tests
|
||||
|
||||
|
|
|
@ -35,12 +35,6 @@ jobs:
|
|||
dependsOn: 'Validate'
|
||||
strategy:
|
||||
matrix:
|
||||
Python35Linux:
|
||||
imageName: 'ubuntu-16.04'
|
||||
python.version: '3.5'
|
||||
Python35Windows:
|
||||
imageName: 'vs2017-win2016'
|
||||
python.version: '3.5'
|
||||
Python36Linux:
|
||||
imageName: 'ubuntu-16.04'
|
||||
python.version: '3.6'
|
||||
|
|
169
bin/cythonize.py
169
bin/cythonize.py
|
@ -1,169 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
""" cythonize.py
|
||||
|
||||
Cythonize pyx files into C++ files as needed.
|
||||
|
||||
Usage: cythonize.py [root]
|
||||
|
||||
Checks pyx files to see if they have been changed relative to their
|
||||
corresponding C++ files. If they have, then runs cython on these files to
|
||||
recreate the C++ files.
|
||||
|
||||
Additionally, checks pxd files and setup.py if they have been changed. If
|
||||
they have, rebuilds everything.
|
||||
|
||||
Change detection based on file hashes stored in JSON format.
|
||||
|
||||
For now, this script should be run by developers when changing Cython files
|
||||
and the resulting C++ files checked in, so that end-users (and Python-only
|
||||
developers) do not get the Cython dependencies.
|
||||
|
||||
Based upon:
|
||||
|
||||
https://raw.github.com/dagss/private-scipy-refactor/cythonize/cythonize.py
|
||||
https://raw.githubusercontent.com/numpy/numpy/master/tools/cythonize.py
|
||||
|
||||
Note: this script does not check any of the dependent C++ libraries.
|
||||
"""
|
||||
from __future__ import print_function
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import hashlib
|
||||
import subprocess
|
||||
import argparse
|
||||
|
||||
|
||||
HASH_FILE = "cythonize.json"
|
||||
|
||||
|
||||
def process_pyx(fromfile, tofile, language_level="-2"):
|
||||
print("Processing %s" % fromfile)
|
||||
try:
|
||||
from Cython.Compiler.Version import version as cython_version
|
||||
from distutils.version import LooseVersion
|
||||
|
||||
if LooseVersion(cython_version) < LooseVersion("0.19"):
|
||||
raise Exception("Require Cython >= 0.19")
|
||||
|
||||
except ImportError:
|
||||
pass
|
||||
|
||||
flags = ["--fast-fail", language_level]
|
||||
if tofile.endswith(".cpp"):
|
||||
flags += ["--cplus"]
|
||||
|
||||
try:
|
||||
try:
|
||||
r = subprocess.call(
|
||||
["cython"] + flags + ["-o", tofile, fromfile], env=os.environ
|
||||
) # See Issue #791
|
||||
if r != 0:
|
||||
raise Exception("Cython failed")
|
||||
except OSError:
|
||||
# There are ways of installing Cython that don't result in a cython
|
||||
# executable on the path, see gh-2397.
|
||||
r = subprocess.call(
|
||||
[
|
||||
sys.executable,
|
||||
"-c",
|
||||
"import sys; from Cython.Compiler.Main import "
|
||||
"setuptools_main as main; sys.exit(main())",
|
||||
]
|
||||
+ flags
|
||||
+ ["-o", tofile, fromfile]
|
||||
)
|
||||
if r != 0:
|
||||
raise Exception("Cython failed")
|
||||
except OSError:
|
||||
raise OSError("Cython needs to be installed")
|
||||
|
||||
|
||||
def preserve_cwd(path, func, *args):
|
||||
orig_cwd = os.getcwd()
|
||||
try:
|
||||
os.chdir(path)
|
||||
func(*args)
|
||||
finally:
|
||||
os.chdir(orig_cwd)
|
||||
|
||||
|
||||
def load_hashes(filename):
|
||||
try:
|
||||
return json.load(open(filename))
|
||||
except (ValueError, IOError):
|
||||
return {}
|
||||
|
||||
|
||||
def save_hashes(hash_db, filename):
|
||||
with open(filename, "w") as f:
|
||||
f.write(json.dumps(hash_db))
|
||||
|
||||
|
||||
def get_hash(path):
|
||||
return hashlib.md5(open(path, "rb").read()).hexdigest()
|
||||
|
||||
|
||||
def hash_changed(base, path, db):
|
||||
full_path = os.path.normpath(os.path.join(base, path))
|
||||
return not get_hash(full_path) == db.get(full_path)
|
||||
|
||||
|
||||
def hash_add(base, path, db):
|
||||
full_path = os.path.normpath(os.path.join(base, path))
|
||||
db[full_path] = get_hash(full_path)
|
||||
|
||||
|
||||
def process(base, filename, db):
|
||||
root, ext = os.path.splitext(filename)
|
||||
if ext in [".pyx", ".cpp"]:
|
||||
if hash_changed(base, filename, db) or not os.path.isfile(
|
||||
os.path.join(base, root + ".cpp")
|
||||
):
|
||||
preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
|
||||
hash_add(base, root + ".cpp", db)
|
||||
hash_add(base, root + ".pyx", db)
|
||||
|
||||
|
||||
def check_changes(root, db):
|
||||
res = False
|
||||
new_db = {}
|
||||
|
||||
setup_filename = "setup.py"
|
||||
hash_add(".", setup_filename, new_db)
|
||||
if hash_changed(".", setup_filename, db):
|
||||
res = True
|
||||
|
||||
for base, _, files in os.walk(root):
|
||||
for filename in files:
|
||||
if filename.endswith(".pxd"):
|
||||
hash_add(base, filename, new_db)
|
||||
if hash_changed(base, filename, db):
|
||||
res = True
|
||||
|
||||
if res:
|
||||
db.clear()
|
||||
db.update(new_db)
|
||||
return res
|
||||
|
||||
|
||||
def run(root):
|
||||
db = load_hashes(HASH_FILE)
|
||||
|
||||
try:
|
||||
check_changes(root, db)
|
||||
for base, _, files in os.walk(root):
|
||||
for filename in files:
|
||||
process(base, filename, db)
|
||||
finally:
|
||||
save_hashes(db, HASH_FILE)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Cythonize pyx files into C++ files as needed"
|
||||
)
|
||||
parser.add_argument("root", help="root directory")
|
||||
args = parser.parse_args()
|
||||
run(args.root)
|
|
@ -13,23 +13,12 @@ import srsly
|
|||
import spacy
|
||||
import spacy.util
|
||||
from spacy.tokens import Token, Doc
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.util import compounding, minibatch_by_words
|
||||
from spacy.syntax.nonproj import projectivize
|
||||
from spacy.matcher import Matcher
|
||||
|
||||
# from spacy.morphology import Fused_begin, Fused_inside
|
||||
from spacy import displacy
|
||||
from collections import defaultdict, Counter
|
||||
from timeit import default_timer as timer
|
||||
|
||||
Fused_begin = None
|
||||
Fused_inside = None
|
||||
|
||||
import itertools
|
||||
import random
|
||||
import numpy.random
|
||||
|
||||
from . import conll17_ud_eval
|
||||
|
||||
from spacy import lang
|
||||
|
@ -268,7 +257,7 @@ def load_nlp(experiments_dir, corpus):
|
|||
return nlp
|
||||
|
||||
|
||||
def initialize_pipeline(nlp, docs, golds, config, device):
|
||||
def initialize_pipeline(nlp, examples, config, device):
|
||||
nlp.add_pipe(nlp.create_pipe("parser"))
|
||||
return nlp
|
||||
|
||||
|
|
|
@ -14,7 +14,7 @@ import spacy
|
|||
import spacy.util
|
||||
from bin.ud import conll17_ud_eval
|
||||
from spacy.tokens import Token, Doc
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.gold import GoldParse, Example
|
||||
from spacy.util import compounding, minibatch, minibatch_by_words
|
||||
from spacy.syntax.nonproj import projectivize
|
||||
from spacy.matcher import Matcher
|
||||
|
@ -53,7 +53,7 @@ def read_data(
|
|||
max_doc_length=None,
|
||||
limit=None,
|
||||
):
|
||||
"""Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
|
||||
"""Read the CONLLU format into Example objects. If raw_text=True,
|
||||
include Doc objects created using nlp.make_doc and then aligned against
|
||||
the gold-standard sequences. If oracle_segments=True, include Doc objects
|
||||
created from the gold-standard segments. At least one must be True."""
|
||||
|
@ -98,15 +98,16 @@ def read_data(
|
|||
docs.append(doc)
|
||||
golds.append(gold)
|
||||
if limit and len(docs) >= limit:
|
||||
return docs, golds
|
||||
return golds_to_gold_data(docs, golds)
|
||||
|
||||
if raw_text and sent_annots:
|
||||
doc, gold = _make_gold(nlp, None, sent_annots)
|
||||
docs.append(doc)
|
||||
golds.append(gold)
|
||||
if limit and len(docs) >= limit:
|
||||
return docs, golds
|
||||
return docs, golds
|
||||
return golds_to_gold_data(docs, golds)
|
||||
return golds_to_gold_data(docs, golds)
|
||||
|
||||
|
||||
def _parse_morph_string(morph_string):
|
||||
if morph_string == '_':
|
||||
|
@ -120,6 +121,7 @@ def _parse_morph_string(morph_string):
|
|||
output.append('%s_%s' % (key, value.lower()))
|
||||
return set(output)
|
||||
|
||||
|
||||
def read_conllu(file_):
|
||||
docs = []
|
||||
sent = []
|
||||
|
@ -180,16 +182,18 @@ def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
|
|||
#############################
|
||||
|
||||
|
||||
def golds_to_gold_tuples(docs, golds):
|
||||
"""Get out the annoying 'tuples' format used by begin_training, given the
|
||||
def golds_to_gold_data(docs, golds):
|
||||
"""Get out the training data format used by begin_training, given the
|
||||
GoldParse objects."""
|
||||
tuples = []
|
||||
data = []
|
||||
for doc, gold in zip(docs, golds):
|
||||
text = doc.text
|
||||
ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
|
||||
sents = [((ids, words, tags, heads, labels, iob), [])]
|
||||
tuples.append((text, sents))
|
||||
return tuples
|
||||
example = Example(doc=doc)
|
||||
example.add_doc_annotation(cats=gold.cats)
|
||||
token_annotation_dict = gold.orig.to_dict()
|
||||
example.add_token_annotation(**token_annotation_dict)
|
||||
example.goldparse = gold
|
||||
data.append(example)
|
||||
return data
|
||||
|
||||
|
||||
##############
|
||||
|
@ -327,7 +331,6 @@ def get_token_conllu(token, i):
|
|||
return "\n".join(lines)
|
||||
|
||||
|
||||
|
||||
##################
|
||||
# Initialization #
|
||||
##################
|
||||
|
@ -348,7 +351,7 @@ def load_nlp(corpus, config, vectors=None):
|
|||
return nlp
|
||||
|
||||
|
||||
def initialize_pipeline(nlp, docs, golds, config, device):
|
||||
def initialize_pipeline(nlp, examples, config, device):
|
||||
nlp.add_pipe(nlp.create_pipe("tagger", config={"set_morphology": False}))
|
||||
nlp.add_pipe(nlp.create_pipe("morphologizer"))
|
||||
nlp.add_pipe(nlp.create_pipe("parser"))
|
||||
|
@ -356,14 +359,15 @@ def initialize_pipeline(nlp, docs, golds, config, device):
|
|||
nlp.parser.add_multitask_objective("tag")
|
||||
if config.multitask_sent:
|
||||
nlp.parser.add_multitask_objective("sent_start")
|
||||
for gold in golds:
|
||||
for ex in examples:
|
||||
gold = ex.gold
|
||||
for tag in gold.tags:
|
||||
if tag is not None:
|
||||
nlp.tagger.add_label(tag)
|
||||
if torch is not None and device != -1:
|
||||
torch.set_default_tensor_type("torch.cuda.FloatTensor")
|
||||
optimizer = nlp.begin_training(
|
||||
lambda: golds_to_gold_tuples(docs, golds),
|
||||
lambda: examples,
|
||||
device=device,
|
||||
subword_features=config.subword_features,
|
||||
conv_depth=config.conv_depth,
|
||||
|
@ -491,6 +495,10 @@ def main(
|
|||
Token.set_extension("begins_fused", default=False)
|
||||
Token.set_extension("inside_fused", default=False)
|
||||
|
||||
Token.set_extension("get_conllu_lines", method=get_token_conllu)
|
||||
Token.set_extension("begins_fused", default=False)
|
||||
Token.set_extension("inside_fused", default=False)
|
||||
|
||||
spacy.util.fix_random_seed()
|
||||
lang.zh.Chinese.Defaults.use_jieba = False
|
||||
lang.ja.Japanese.Defaults.use_janome = False
|
||||
|
@ -505,7 +513,7 @@ def main(
|
|||
print("Train and evaluate", corpus, "using lang", paths.lang)
|
||||
nlp = load_nlp(paths.lang, config, vectors=vectors_dir)
|
||||
|
||||
docs, golds = read_data(
|
||||
examples = read_data(
|
||||
nlp,
|
||||
paths.train.conllu.open(encoding="utf8"),
|
||||
paths.train.text.open(encoding="utf8"),
|
||||
|
@ -513,12 +521,12 @@ def main(
|
|||
limit=limit,
|
||||
)
|
||||
|
||||
optimizer = initialize_pipeline(nlp, docs, golds, config, gpu_device)
|
||||
optimizer = initialize_pipeline(nlp, examples, config, gpu_device)
|
||||
|
||||
batch_sizes = compounding(config.min_batch_size, config.max_batch_size, 1.001)
|
||||
beam_prob = compounding(0.2, 0.8, 1.001)
|
||||
for i in range(config.nr_epoch):
|
||||
docs, golds = read_data(
|
||||
examples = read_data(
|
||||
nlp,
|
||||
paths.train.conllu.open(encoding="utf8"),
|
||||
paths.train.text.open(encoding="utf8"),
|
||||
|
@ -527,22 +535,19 @@ def main(
|
|||
oracle_segments=use_oracle_segments,
|
||||
raw_text=not use_oracle_segments,
|
||||
)
|
||||
Xs = list(zip(docs, golds))
|
||||
random.shuffle(Xs)
|
||||
random.shuffle(examples)
|
||||
if config.batch_by_words:
|
||||
batches = minibatch_by_words(Xs, size=batch_sizes)
|
||||
batches = minibatch_by_words(examples, size=batch_sizes)
|
||||
else:
|
||||
batches = minibatch(Xs, size=batch_sizes)
|
||||
batches = minibatch(examples, size=batch_sizes)
|
||||
losses = {}
|
||||
n_train_words = sum(len(doc) for doc in docs)
|
||||
n_train_words = sum(len(ex.doc) for ex in examples)
|
||||
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
||||
for batch in batches:
|
||||
batch_docs, batch_gold = zip(*batch)
|
||||
pbar.update(sum(len(doc) for doc in batch_docs))
|
||||
pbar.update(sum(len(ex.doc) for ex in batch))
|
||||
nlp.parser.cfg["beam_update_prob"] = next(beam_prob)
|
||||
nlp.update(
|
||||
batch_docs,
|
||||
batch_gold,
|
||||
batch,
|
||||
sgd=optimizer,
|
||||
drop=config.dropout,
|
||||
losses=losses,
|
||||
|
|
|
@ -46,7 +46,7 @@ def _define_entities(nlp, kb, entity_def_path, entity_descr_path, min_entity_fre
|
|||
" cf. https://spacy.io/usage/models#languages."
|
||||
)
|
||||
|
||||
logger.info("Filtering entities with fewer than {} mentions".format(min_entity_freq))
|
||||
logger.info("Filtering entities with fewer than {} mentions or no description".format(min_entity_freq))
|
||||
entity_frequencies = io.read_entity_to_count(entity_freq_path)
|
||||
# filter the entities for in the KB by frequency, because there's just too much data (8M entities) otherwise
|
||||
filtered_title_to_id, entity_list, description_list, frequency_list = get_filtered_entities(
|
||||
|
|
|
@ -4,12 +4,12 @@ from random import shuffle
|
|||
import logging
|
||||
import numpy as np
|
||||
|
||||
from spacy._ml import zero_init, create_default_optimizer
|
||||
from spacy.cli.pretrain import get_cossim_loss
|
||||
|
||||
from thinc.v2v import Model
|
||||
from thinc.model import Model
|
||||
from thinc.api import chain
|
||||
from thinc.neural._classes.affine import Affine
|
||||
from thinc.loss import CosineDistance
|
||||
from thinc.layers import Linear
|
||||
|
||||
from spacy.util import create_default_optimizer
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
@ -34,6 +34,7 @@ class EntityEncoder:
|
|||
self.input_dim = input_dim
|
||||
self.desc_width = desc_width
|
||||
self.epochs = epochs
|
||||
self.distance = CosineDistance(ignore_zeros=True, normalize=False)
|
||||
|
||||
def apply_encoder(self, description_list):
|
||||
if self.encoder is None:
|
||||
|
@ -132,21 +133,17 @@ class EntityEncoder:
|
|||
def _build_network(self, orig_width, hidden_with):
|
||||
with Model.define_operators({">>": chain}):
|
||||
# very simple encoder-decoder model
|
||||
self.encoder = Affine(hidden_with, orig_width)
|
||||
self.model = self.encoder >> zero_init(
|
||||
Affine(orig_width, hidden_with, drop_factor=0.0)
|
||||
)
|
||||
self.sgd = create_default_optimizer(self.model.ops)
|
||||
self.encoder = Linear(hidden_with, orig_width)
|
||||
# TODO: removed the zero_init here - is oK?
|
||||
self.model = self.encoder >> Linear(orig_width, hidden_with)
|
||||
self.sgd = create_default_optimizer()
|
||||
|
||||
def _update(self, vectors):
|
||||
truths = self.model.ops.asarray(vectors)
|
||||
predictions, bp_model = self.model.begin_update(
|
||||
np.asarray(vectors), drop=self.DROP
|
||||
truths, drop=self.DROP
|
||||
)
|
||||
loss, d_scores = self._get_loss(scores=predictions, golds=np.asarray(vectors))
|
||||
d_scores, loss = self.distance(predictions, truths)
|
||||
bp_model(d_scores, sgd=self.sgd)
|
||||
return loss / len(vectors)
|
||||
|
||||
@staticmethod
|
||||
def _get_loss(golds, scores):
|
||||
loss, gradients = get_cossim_loss(scores, golds)
|
||||
return loss, gradients
|
||||
|
|
|
@ -17,7 +17,13 @@ import plac
|
|||
from tqdm import tqdm
|
||||
|
||||
from bin.wiki_entity_linking import wikipedia_processor
|
||||
from bin.wiki_entity_linking import TRAINING_DATA_FILE, KB_MODEL_DIR, KB_FILE, LOG_FORMAT, OUTPUT_MODEL_DIR
|
||||
from bin.wiki_entity_linking import (
|
||||
TRAINING_DATA_FILE,
|
||||
KB_MODEL_DIR,
|
||||
KB_FILE,
|
||||
LOG_FORMAT,
|
||||
OUTPUT_MODEL_DIR,
|
||||
)
|
||||
from bin.wiki_entity_linking.entity_linker_evaluation import measure_performance
|
||||
from bin.wiki_entity_linking.kb_creator import read_kb
|
||||
|
||||
|
@ -48,10 +54,12 @@ def main(
|
|||
l2=1e-6,
|
||||
train_articles=None,
|
||||
dev_articles=None,
|
||||
labels_discard=None
|
||||
labels_discard=None,
|
||||
):
|
||||
if not output_dir:
|
||||
logger.warning("No output dir specified so no results will be written, are you sure about this ?")
|
||||
logger.warning(
|
||||
"No output dir specified so no results will be written, are you sure about this ?"
|
||||
)
|
||||
|
||||
logger.info("Creating Entity Linker with Wikipedia and WikiData")
|
||||
|
||||
|
@ -68,7 +76,11 @@ def main(
|
|||
# STEP 1 : load the NLP object
|
||||
logger.info("STEP 1a: Loading model from {}".format(nlp_dir))
|
||||
nlp = spacy.load(nlp_dir)
|
||||
logger.info("Original NLP pipeline has following pipeline components: {}".format(nlp.pipe_names))
|
||||
logger.info(
|
||||
"Original NLP pipeline has following pipeline components: {}".format(
|
||||
nlp.pipe_names
|
||||
)
|
||||
)
|
||||
|
||||
# check that there is a NER component in the pipeline
|
||||
if "ner" not in nlp.pipe_names:
|
||||
|
@ -79,25 +91,42 @@ def main(
|
|||
|
||||
# STEP 2: read the training dataset previously created from WP
|
||||
logger.info("STEP 2: Reading training & dev dataset from {}".format(training_path))
|
||||
train_indices, dev_indices = wikipedia_processor.read_training_indices(training_path)
|
||||
logger.info("Training set has {} articles, limit set to roughly {} articles per epoch"
|
||||
.format(len(train_indices), train_articles if train_articles else "all"))
|
||||
logger.info("Dev set has {} articles, limit set to rougly {} articles for evaluation"
|
||||
.format(len(dev_indices), dev_articles if dev_articles else "all"))
|
||||
train_indices, dev_indices = wikipedia_processor.read_training_indices(
|
||||
training_path
|
||||
)
|
||||
logger.info(
|
||||
"Training set has {} articles, limit set to roughly {} articles per epoch".format(
|
||||
len(train_indices), train_articles if train_articles else "all"
|
||||
)
|
||||
)
|
||||
logger.info(
|
||||
"Dev set has {} articles, limit set to rougly {} articles for evaluation".format(
|
||||
len(dev_indices), dev_articles if dev_articles else "all"
|
||||
)
|
||||
)
|
||||
if dev_articles:
|
||||
dev_indices = dev_indices[0:dev_articles]
|
||||
|
||||
# STEP 3: create and train an entity linking pipe
|
||||
logger.info("STEP 3: Creating and training an Entity Linking pipe for {} epochs".format(epochs))
|
||||
logger.info(
|
||||
"STEP 3: Creating and training an Entity Linking pipe for {} epochs".format(
|
||||
epochs
|
||||
)
|
||||
)
|
||||
if labels_discard:
|
||||
labels_discard = [x.strip() for x in labels_discard.split(",")]
|
||||
logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard))
|
||||
logger.info(
|
||||
"Discarding {} NER types: {}".format(len(labels_discard), labels_discard)
|
||||
)
|
||||
else:
|
||||
labels_discard = []
|
||||
|
||||
el_pipe = nlp.create_pipe(
|
||||
name="entity_linker", config={"pretrained_vectors": nlp.vocab.vectors.name,
|
||||
"labels_discard": labels_discard}
|
||||
name="entity_linker",
|
||||
config={
|
||||
"pretrained_vectors": nlp.vocab.vectors,
|
||||
"labels_discard": labels_discard,
|
||||
},
|
||||
)
|
||||
el_pipe.set_kb(kb)
|
||||
nlp.add_pipe(el_pipe, last=True)
|
||||
|
@ -109,11 +138,18 @@ def main(
|
|||
optimizer.L2 = l2
|
||||
|
||||
logger.info("Dev Baseline Accuracies:")
|
||||
dev_data = wikipedia_processor.read_el_docs_golds(nlp=nlp, entity_file_path=training_path,
|
||||
dev=True, line_ids=dev_indices,
|
||||
kb=kb, labels_discard=labels_discard)
|
||||
dev_data = wikipedia_processor.read_el_docs_golds(
|
||||
nlp=nlp,
|
||||
entity_file_path=training_path,
|
||||
dev=True,
|
||||
line_ids=dev_indices,
|
||||
kb=kb,
|
||||
labels_discard=labels_discard,
|
||||
)
|
||||
|
||||
measure_performance(dev_data, kb, el_pipe, baseline=True, context=False, dev_limit=len(dev_indices))
|
||||
measure_performance(
|
||||
dev_data, kb, el_pipe, baseline=True, context=False, dev_limit=len(dev_indices)
|
||||
)
|
||||
|
||||
for itn in range(epochs):
|
||||
random.shuffle(train_indices)
|
||||
|
@ -127,13 +163,18 @@ def main(
|
|||
if train_articles:
|
||||
bar_total = train_articles
|
||||
|
||||
with tqdm(total=bar_total, leave=False, desc='Epoch ' + str(itn)) as pbar:
|
||||
with tqdm(total=bar_total, leave=False, desc=f"Epoch {itn}") as pbar:
|
||||
for batch in batches:
|
||||
if not train_articles or articles_processed < train_articles:
|
||||
with nlp.disable_pipes("entity_linker"):
|
||||
train_batch = wikipedia_processor.read_el_docs_golds(nlp=nlp, entity_file_path=training_path,
|
||||
dev=False, line_ids=batch,
|
||||
kb=kb, labels_discard=labels_discard)
|
||||
train_batch = wikipedia_processor.read_el_docs_golds(
|
||||
nlp=nlp,
|
||||
entity_file_path=training_path,
|
||||
dev=False,
|
||||
line_ids=batch,
|
||||
kb=kb,
|
||||
labels_discard=labels_discard,
|
||||
)
|
||||
docs, golds = zip(*train_batch)
|
||||
try:
|
||||
with nlp.disable_pipes(*other_pipes):
|
||||
|
@ -150,17 +191,36 @@ def main(
|
|||
except Exception as e:
|
||||
logger.error("Error updating batch:" + str(e))
|
||||
if batchnr > 0:
|
||||
logging.info("Epoch {} trained on {} articles, train loss {}"
|
||||
.format(itn, articles_processed, round(losses["entity_linker"] / batchnr, 2)))
|
||||
logging.info(
|
||||
"Epoch {} trained on {} articles, train loss {}".format(
|
||||
itn, articles_processed, round(losses["entity_linker"] / batchnr, 2)
|
||||
)
|
||||
)
|
||||
# re-read the dev_data (data is returned as a generator)
|
||||
dev_data = wikipedia_processor.read_el_docs_golds(nlp=nlp, entity_file_path=training_path,
|
||||
dev=True, line_ids=dev_indices,
|
||||
kb=kb, labels_discard=labels_discard)
|
||||
measure_performance(dev_data, kb, el_pipe, baseline=False, context=True, dev_limit=len(dev_indices))
|
||||
dev_data = wikipedia_processor.read_el_docs_golds(
|
||||
nlp=nlp,
|
||||
entity_file_path=training_path,
|
||||
dev=True,
|
||||
line_ids=dev_indices,
|
||||
kb=kb,
|
||||
labels_discard=labels_discard,
|
||||
)
|
||||
measure_performance(
|
||||
dev_data,
|
||||
kb,
|
||||
el_pipe,
|
||||
baseline=False,
|
||||
context=True,
|
||||
dev_limit=len(dev_indices),
|
||||
)
|
||||
|
||||
if output_dir:
|
||||
# STEP 4: write the NLP pipeline (now including an EL model) to file
|
||||
logger.info("Final NLP pipeline has following pipeline components: {}".format(nlp.pipe_names))
|
||||
logger.info(
|
||||
"Final NLP pipeline has following pipeline components: {}".format(
|
||||
nlp.pipe_names
|
||||
)
|
||||
)
|
||||
logger.info("STEP 4: Writing trained NLP to {}".format(nlp_output_dir))
|
||||
nlp.to_disk(nlp_output_dir)
|
||||
|
||||
|
|
|
@ -14,7 +14,7 @@ pip install keras==2.0.9
|
|||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
|
||||
import ml_datasets
|
||||
import plac
|
||||
import random
|
||||
import pathlib
|
||||
|
@ -24,7 +24,6 @@ from keras.models import Sequential, model_from_json
|
|||
from keras.layers import LSTM, Dense, Embedding, Bidirectional
|
||||
from keras.layers import TimeDistributed
|
||||
from keras.optimizers import Adam
|
||||
import thinc.extra.datasets
|
||||
from spacy.compat import pickle
|
||||
import spacy
|
||||
|
||||
|
@ -224,7 +223,7 @@ def main(
|
|||
if model_dir is not None:
|
||||
model_dir = pathlib.Path(model_dir)
|
||||
if train_dir is None or dev_dir is None:
|
||||
imdb_data = thinc.extra.datasets.imdb()
|
||||
imdb_data = ml_datasets.imdb()
|
||||
if is_runtime:
|
||||
if dev_dir is None:
|
||||
dev_texts, dev_labels = zip(*imdb_data[1])
|
||||
|
|
63
examples/experiments/ptb-joint-pos-dep/bilstm_tok2vec.cfg
Normal file
63
examples/experiments/ptb-joint-pos-dep/bilstm_tok2vec.cfg
Normal file
|
@ -0,0 +1,63 @@
|
|||
[training]
|
||||
patience = 10000
|
||||
eval_frequency = 200
|
||||
dropout = 0.2
|
||||
init_tok2vec = null
|
||||
vectors = null
|
||||
max_epochs = 100
|
||||
orth_variant_level = 0.0
|
||||
gold_preproc = true
|
||||
max_length = 0
|
||||
use_gpu = 0
|
||||
scores = ["tags_acc", "uas", "las"]
|
||||
score_weights = {"las": 0.8, "tags_acc": 0.2}
|
||||
limit = 0
|
||||
|
||||
[training.batch_size]
|
||||
@schedules = "compounding.v1"
|
||||
start = 100
|
||||
stop = 1000
|
||||
compound = 1.001
|
||||
|
||||
[optimizer]
|
||||
@optimizers = "Adam.v1"
|
||||
learn_rate = 0.001
|
||||
beta1 = 0.9
|
||||
beta2 = 0.999
|
||||
|
||||
[nlp]
|
||||
lang = "en"
|
||||
vectors = ${training:vectors}
|
||||
|
||||
[nlp.pipeline.tok2vec]
|
||||
factory = "tok2vec"
|
||||
|
||||
[nlp.pipeline.tagger]
|
||||
factory = "tagger"
|
||||
|
||||
[nlp.pipeline.parser]
|
||||
factory = "parser"
|
||||
|
||||
[nlp.pipeline.tagger.model]
|
||||
@architectures = "tagger_model.v1"
|
||||
|
||||
[nlp.pipeline.tagger.model.tok2vec]
|
||||
@architectures = "tok2vec_tensors.v1"
|
||||
width = ${nlp.pipeline.tok2vec.model:width}
|
||||
|
||||
[nlp.pipeline.parser.model]
|
||||
@architectures = "transition_based_parser.v1"
|
||||
nr_feature_tokens = 8
|
||||
hidden_width = 64
|
||||
maxout_pieces = 3
|
||||
|
||||
[nlp.pipeline.parser.model.tok2vec]
|
||||
@architectures = "tok2vec_tensors.v1"
|
||||
width = ${nlp.pipeline.tok2vec.model:width}
|
||||
|
||||
[nlp.pipeline.tok2vec.model]
|
||||
@architectures = "hash_embed_bilstm.v1"
|
||||
pretrained_vectors = ${nlp:vectors}
|
||||
width = 96
|
||||
depth = 4
|
||||
embed_size = 2000
|
65
examples/experiments/ptb-joint-pos-dep/defaults.cfg
Normal file
65
examples/experiments/ptb-joint-pos-dep/defaults.cfg
Normal file
|
@ -0,0 +1,65 @@
|
|||
[training]
|
||||
patience = 10000
|
||||
eval_frequency = 200
|
||||
dropout = 0.2
|
||||
init_tok2vec = null
|
||||
vectors = null
|
||||
max_epochs = 100
|
||||
orth_variant_level = 0.0
|
||||
gold_preproc = true
|
||||
max_length = 0
|
||||
use_gpu = -1
|
||||
scores = ["tags_acc", "uas", "las"]
|
||||
score_weights = {"las": 0.8, "tags_acc": 0.2}
|
||||
limit = 0
|
||||
|
||||
[training.batch_size]
|
||||
@schedules = "compounding.v1"
|
||||
start = 100
|
||||
stop = 1000
|
||||
compound = 1.001
|
||||
|
||||
[optimizer]
|
||||
@optimizers = "Adam.v1"
|
||||
learn_rate = 0.001
|
||||
beta1 = 0.9
|
||||
beta2 = 0.999
|
||||
|
||||
[nlp]
|
||||
lang = "en"
|
||||
vectors = ${training:vectors}
|
||||
|
||||
[nlp.pipeline.tok2vec]
|
||||
factory = "tok2vec"
|
||||
|
||||
[nlp.pipeline.tagger]
|
||||
factory = "tagger"
|
||||
|
||||
[nlp.pipeline.parser]
|
||||
factory = "parser"
|
||||
|
||||
[nlp.pipeline.tagger.model]
|
||||
@architectures = "tagger_model.v1"
|
||||
|
||||
[nlp.pipeline.tagger.model.tok2vec]
|
||||
@architectures = "tok2vec_tensors.v1"
|
||||
width = ${nlp.pipeline.tok2vec.model:width}
|
||||
|
||||
[nlp.pipeline.parser.model]
|
||||
@architectures = "transition_based_parser.v1"
|
||||
nr_feature_tokens = 8
|
||||
hidden_width = 64
|
||||
maxout_pieces = 3
|
||||
|
||||
[nlp.pipeline.parser.model.tok2vec]
|
||||
@architectures = "tok2vec_tensors.v1"
|
||||
width = ${nlp.pipeline.tok2vec.model:width}
|
||||
|
||||
[nlp.pipeline.tok2vec.model]
|
||||
@architectures = "hash_embed_cnn.v1"
|
||||
pretrained_vectors = ${nlp:vectors}
|
||||
width = 96
|
||||
depth = 4
|
||||
window_size = 1
|
||||
embed_size = 2000
|
||||
maxout_pieces = 3
|
|
@ -13,9 +13,10 @@ Prerequisites: pip install joblib
|
|||
from __future__ import print_function, unicode_literals
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import ml_datasets
|
||||
from joblib import Parallel, delayed
|
||||
from functools import partial
|
||||
import thinc.extra.datasets
|
||||
import plac
|
||||
import spacy
|
||||
from spacy.util import minibatch
|
||||
|
@ -35,7 +36,7 @@ def main(output_dir, model="en_core_web_sm", n_jobs=4, batch_size=1000, limit=10
|
|||
output_dir.mkdir()
|
||||
# load and pre-process the IMBD dataset
|
||||
print("Loading IMDB data...")
|
||||
data, _ = thinc.extra.datasets.imdb()
|
||||
data, _ = ml_datasets.imdb()
|
||||
texts, _ = zip(*data[-limit:])
|
||||
print("Processing texts...")
|
||||
partitions = minibatch(texts, size=batch_size)
|
||||
|
|
|
@ -26,12 +26,12 @@ DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
|
|||
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
|
||||
|
||||
|
||||
@st.cache(ignore_hash=True)
|
||||
@st.cache(allow_output_mutation=True)
|
||||
def load_model(name):
|
||||
return spacy.load(name)
|
||||
|
||||
|
||||
@st.cache(ignore_hash=True)
|
||||
@st.cache(allow_output_mutation=True)
|
||||
def process_text(model_name, text):
|
||||
nlp = load_model(model_name)
|
||||
return nlp(text)
|
||||
|
@ -79,7 +79,9 @@ if "ner" in nlp.pipe_names:
|
|||
st.header("Named Entities")
|
||||
st.sidebar.header("Named Entities")
|
||||
label_set = nlp.get_pipe("ner").labels
|
||||
labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
|
||||
labels = st.sidebar.multiselect(
|
||||
"Entity labels", options=label_set, default=list(label_set)
|
||||
)
|
||||
html = displacy.render(doc, style="ent", options={"ents": labels})
|
||||
# Newlines seem to mess with the rendering
|
||||
html = html.replace("\n", " ")
|
||||
|
|
|
@ -12,7 +12,7 @@ import tqdm
|
|||
import spacy
|
||||
import spacy.util
|
||||
from spacy.tokens import Token, Doc
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.gold import GoldParse, Example
|
||||
from spacy.syntax.nonproj import projectivize
|
||||
from collections import defaultdict
|
||||
from spacy.matcher import Matcher
|
||||
|
@ -33,25 +33,25 @@ random.seed(0)
|
|||
numpy.random.seed(0)
|
||||
|
||||
|
||||
def minibatch_by_words(items, size=5000):
|
||||
random.shuffle(items)
|
||||
def minibatch_by_words(examples, size=5000):
|
||||
random.shuffle(examples)
|
||||
if isinstance(size, int):
|
||||
size_ = itertools.repeat(size)
|
||||
else:
|
||||
size_ = size
|
||||
items = iter(items)
|
||||
examples = iter(examples)
|
||||
while True:
|
||||
batch_size = next(size_)
|
||||
batch = []
|
||||
while batch_size >= 0:
|
||||
try:
|
||||
doc, gold = next(items)
|
||||
example = next(examples)
|
||||
except StopIteration:
|
||||
if batch:
|
||||
yield batch
|
||||
return
|
||||
batch_size -= len(doc)
|
||||
batch.append((doc, gold))
|
||||
batch_size -= len(example.doc)
|
||||
batch.append(example)
|
||||
if batch:
|
||||
yield batch
|
||||
else:
|
||||
|
@ -78,7 +78,7 @@ def read_data(
|
|||
max_doc_length=None,
|
||||
limit=None,
|
||||
):
|
||||
"""Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
|
||||
"""Read the CONLLU format into Example objects. If raw_text=True,
|
||||
include Doc objects created using nlp.make_doc and then aligned against
|
||||
the gold-standard sequences. If oracle_segments=True, include Doc objects
|
||||
created from the gold-standard segments. At least one must be True."""
|
||||
|
@ -119,15 +119,15 @@ def read_data(
|
|||
docs.append(doc)
|
||||
golds.append(gold)
|
||||
if limit and len(docs) >= limit:
|
||||
return docs, golds
|
||||
return golds_to_gold_data(docs, golds)
|
||||
|
||||
if raw_text and sent_annots:
|
||||
doc, gold = _make_gold(nlp, None, sent_annots)
|
||||
docs.append(doc)
|
||||
golds.append(gold)
|
||||
if limit and len(docs) >= limit:
|
||||
return docs, golds
|
||||
return docs, golds
|
||||
return golds_to_gold_data(docs, golds)
|
||||
return golds_to_gold_data(docs, golds)
|
||||
|
||||
|
||||
def read_conllu(file_):
|
||||
|
@ -181,16 +181,18 @@ def _make_gold(nlp, text, sent_annots):
|
|||
#############################
|
||||
|
||||
|
||||
def golds_to_gold_tuples(docs, golds):
|
||||
"""Get out the annoying 'tuples' format used by begin_training, given the
|
||||
def golds_to_gold_data(docs, golds):
|
||||
"""Get out the training data format used by begin_training, given the
|
||||
GoldParse objects."""
|
||||
tuples = []
|
||||
data = []
|
||||
for doc, gold in zip(docs, golds):
|
||||
text = doc.text
|
||||
ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
|
||||
sents = [((ids, words, tags, heads, labels, iob), [])]
|
||||
tuples.append((text, sents))
|
||||
return tuples
|
||||
example = Example(doc=doc)
|
||||
example.add_doc_annotation(cats=gold.cats)
|
||||
token_annotation_dict = gold.orig.to_dict()
|
||||
example.add_token_annotation(**token_annotation_dict)
|
||||
example.goldparse = gold
|
||||
data.append(example)
|
||||
return data
|
||||
|
||||
|
||||
##############
|
||||
|
@ -303,7 +305,7 @@ def load_nlp(corpus, config):
|
|||
return nlp
|
||||
|
||||
|
||||
def initialize_pipeline(nlp, docs, golds, config):
|
||||
def initialize_pipeline(nlp, examples, config):
|
||||
nlp.add_pipe(nlp.create_pipe("parser"))
|
||||
if config.multitask_tag:
|
||||
nlp.parser.add_multitask_objective("tag")
|
||||
|
@ -311,18 +313,19 @@ def initialize_pipeline(nlp, docs, golds, config):
|
|||
nlp.parser.add_multitask_objective("sent_start")
|
||||
nlp.parser.moves.add_action(2, "subtok")
|
||||
nlp.add_pipe(nlp.create_pipe("tagger"))
|
||||
for gold in golds:
|
||||
for tag in gold.tags:
|
||||
for ex in examples:
|
||||
for tag in ex.gold.tags:
|
||||
if tag is not None:
|
||||
nlp.tagger.add_label(tag)
|
||||
# Replace labels that didn't make the frequency cutoff
|
||||
actions = set(nlp.parser.labels)
|
||||
label_set = set([act.split("-")[1] for act in actions if "-" in act])
|
||||
for gold in golds:
|
||||
for ex in examples:
|
||||
gold = ex.gold
|
||||
for i, label in enumerate(gold.labels):
|
||||
if label is not None and label not in label_set:
|
||||
gold.labels[i] = label.split("||")[0]
|
||||
return nlp.begin_training(lambda: golds_to_gold_tuples(docs, golds))
|
||||
return nlp.begin_training(lambda: examples)
|
||||
|
||||
|
||||
########################
|
||||
|
@ -391,13 +394,17 @@ def main(ud_dir, parses_dir, config, corpus, limit=0):
|
|||
Token.set_extension("begins_fused", default=False)
|
||||
Token.set_extension("inside_fused", default=False)
|
||||
|
||||
Token.set_extension("get_conllu_lines", method=get_token_conllu)
|
||||
Token.set_extension("begins_fused", default=False)
|
||||
Token.set_extension("inside_fused", default=False)
|
||||
|
||||
paths = TreebankPaths(ud_dir, corpus)
|
||||
if not (parses_dir / corpus).exists():
|
||||
(parses_dir / corpus).mkdir()
|
||||
print("Train and evaluate", corpus, "using lang", paths.lang)
|
||||
nlp = load_nlp(paths.lang, config)
|
||||
|
||||
docs, golds = read_data(
|
||||
examples = read_data(
|
||||
nlp,
|
||||
paths.train.conllu.open(encoding="utf8"),
|
||||
paths.train.text.open(encoding="utf8"),
|
||||
|
@ -405,23 +412,18 @@ def main(ud_dir, parses_dir, config, corpus, limit=0):
|
|||
limit=limit,
|
||||
)
|
||||
|
||||
optimizer = initialize_pipeline(nlp, docs, golds, config)
|
||||
optimizer = initialize_pipeline(nlp, examples, config)
|
||||
|
||||
for i in range(config.nr_epoch):
|
||||
docs = [nlp.make_doc(doc.text) for doc in docs]
|
||||
batches = minibatch_by_words(list(zip(docs, golds)), size=config.batch_size)
|
||||
docs = [nlp.make_doc(example.doc.text) for example in examples]
|
||||
batches = minibatch_by_words(examples, size=config.batch_size)
|
||||
losses = {}
|
||||
n_train_words = sum(len(doc) for doc in docs)
|
||||
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
||||
for batch in batches:
|
||||
batch_docs, batch_gold = zip(*batch)
|
||||
pbar.update(sum(len(doc) for doc in batch_docs))
|
||||
pbar.update(sum(len(ex.doc) for ex in batch))
|
||||
nlp.update(
|
||||
batch_docs,
|
||||
batch_gold,
|
||||
sgd=optimizer,
|
||||
drop=config.dropout,
|
||||
losses=losses,
|
||||
examples=batch, sgd=optimizer, drop=config.dropout, losses=losses,
|
||||
)
|
||||
|
||||
out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i)
|
||||
|
|
|
@ -31,14 +31,13 @@ random.seed(0)
|
|||
|
||||
PWD = os.path.dirname(__file__)
|
||||
|
||||
TRAIN_DATA = list(read_json_file(
|
||||
os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json")))
|
||||
TRAIN_DATA = list(read_json_file(os.path.join(PWD, "training-data.json")))
|
||||
|
||||
|
||||
def get_position_label(i, words, tags, heads, labels, ents):
|
||||
def get_position_label(i, token_annotation):
|
||||
"""Return labels indicating the position of the word in the document.
|
||||
"""
|
||||
if len(words) < 20:
|
||||
if len(token_annotation.words) < 20:
|
||||
return "short-doc"
|
||||
elif i == 0:
|
||||
return "first-word"
|
||||
|
@ -46,7 +45,7 @@ def get_position_label(i, words, tags, heads, labels, ents):
|
|||
return "early-word"
|
||||
elif i < 20:
|
||||
return "mid-word"
|
||||
elif i == len(words) - 1:
|
||||
elif i == len(token_annotation.words) - 1:
|
||||
return "last-word"
|
||||
else:
|
||||
return "late-word"
|
||||
|
@ -60,17 +59,17 @@ def main(n_iter=10):
|
|||
print(nlp.pipeline)
|
||||
|
||||
print("Create data", len(TRAIN_DATA))
|
||||
optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA)
|
||||
optimizer = nlp.begin_training(get_examples=lambda: TRAIN_DATA)
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
for text, annot_brackets in TRAIN_DATA:
|
||||
for annotations, _ in annot_brackets:
|
||||
doc = Doc(nlp.vocab, words=annotations[1])
|
||||
gold = GoldParse.from_annot_tuples(doc, annotations)
|
||||
for example in TRAIN_DATA:
|
||||
for token_annotation in example.token_annotations:
|
||||
doc = Doc(nlp.vocab, words=token_annotation.words)
|
||||
gold = GoldParse.from_annotation(doc, example.doc_annotation, token_annotation)
|
||||
|
||||
nlp.update(
|
||||
[doc], # batch of texts
|
||||
[gold], # batch of annotations
|
||||
examples=[(doc, gold)], # 1 example
|
||||
drop=0.2, # dropout - make it harder to memorise data
|
||||
sgd=optimizer, # callable to update weights
|
||||
losses=losses,
|
||||
|
@ -78,9 +77,9 @@ def main(n_iter=10):
|
|||
print(losses.get("nn_labeller", 0.0), losses["ner"])
|
||||
|
||||
# test the trained model
|
||||
for text, _ in TRAIN_DATA:
|
||||
if text is not None:
|
||||
doc = nlp(text)
|
||||
for example in TRAIN_DATA:
|
||||
if example.text is not None:
|
||||
doc = nlp(example.text)
|
||||
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
|
||||
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
|
||||
|
||||
|
|
|
@ -16,16 +16,18 @@ the development labels, after all --- only the unlabelled text.
|
|||
import plac
|
||||
import tqdm
|
||||
import random
|
||||
|
||||
import ml_datasets
|
||||
|
||||
import spacy
|
||||
import thinc.extra.datasets
|
||||
from spacy.util import minibatch, use_gpu, compounding
|
||||
from spacy._ml import Tok2Vec
|
||||
from spacy.pipeline import TextCategorizer
|
||||
from spacy.ml.tok2vec import Tok2Vec
|
||||
import numpy
|
||||
|
||||
|
||||
def load_texts(limit=0):
|
||||
train, dev = thinc.extra.datasets.imdb()
|
||||
train, dev = ml_datasets.imdb()
|
||||
train_texts, train_labels = zip(*train)
|
||||
dev_texts, dev_labels = zip(*train)
|
||||
train_texts = list(train_texts)
|
||||
|
@ -41,7 +43,7 @@ def load_texts(limit=0):
|
|||
def load_textcat_data(limit=0):
|
||||
"""Load data from the IMDB dataset."""
|
||||
# Partition off part of the train data for evaluation
|
||||
train_data, eval_data = thinc.extra.datasets.imdb()
|
||||
train_data, eval_data = ml_datasets.imdb()
|
||||
random.shuffle(train_data)
|
||||
train_data = train_data[-limit:]
|
||||
texts, labels = zip(*train_data)
|
||||
|
@ -63,17 +65,15 @@ def prefer_gpu():
|
|||
|
||||
|
||||
def build_textcat_model(tok2vec, nr_class, width):
|
||||
from thinc.v2v import Model, Softmax, Maxout
|
||||
from thinc.api import flatten_add_lengths, chain
|
||||
from thinc.t2v import Pooling, sum_pool, mean_pool, max_pool
|
||||
from thinc.misc import Residual, LayerNorm
|
||||
from spacy._ml import logistic, zero_init
|
||||
from thinc.model import Model
|
||||
from thinc.layers import Softmax, chain, reduce_mean
|
||||
from thinc.layers import list2ragged
|
||||
|
||||
with Model.define_operators({">>": chain}):
|
||||
model = (
|
||||
tok2vec
|
||||
>> flatten_add_lengths
|
||||
>> Pooling(mean_pool)
|
||||
>> list2ragged()
|
||||
>> reduce_mean()
|
||||
>> Softmax(nr_class, width)
|
||||
)
|
||||
model.tok2vec = tok2vec
|
||||
|
@ -81,7 +81,7 @@ def build_textcat_model(tok2vec, nr_class, width):
|
|||
|
||||
|
||||
def block_gradients(model):
|
||||
from thinc.api import wrap
|
||||
from thinc.api import wrap # TODO FIX
|
||||
|
||||
def forward(X, drop=0.0):
|
||||
Y, _ = model.begin_update(X, drop=drop)
|
||||
|
@ -114,7 +114,7 @@ def train_tensorizer(nlp, texts, dropout, n_iter):
|
|||
losses = {}
|
||||
for i, batch in enumerate(minibatch(tqdm.tqdm(texts))):
|
||||
docs = [nlp.make_doc(text) for text in batch]
|
||||
tensorizer.update(docs, None, losses=losses, sgd=optimizer, drop=dropout)
|
||||
tensorizer.update((docs, None), losses=losses, sgd=optimizer, drop=dropout)
|
||||
print(losses)
|
||||
return optimizer
|
||||
|
||||
|
@ -143,8 +143,7 @@ def train_textcat(nlp, n_texts, n_iter=10):
|
|||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(tqdm.tqdm(train_data), size=2)
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
|
||||
nlp.update(batch, sgd=optimizer, drop=0.2, losses=losses)
|
||||
with textcat.model.use_params(optimizer.averages):
|
||||
# evaluate on the dev data split off in load_data()
|
||||
scores = evaluate_textcat(nlp.tokenizer, textcat, dev_texts, dev_cats)
|
||||
|
|
|
@ -58,7 +58,7 @@ def main(model_name, unlabelled_loc):
|
|||
# yet, but I'm getting weird results from Adam. Try commenting out the
|
||||
# nlp.update(), and using Adam -- you'll find the models drift apart.
|
||||
# I guess Adam is losing precision, introducing gradient noise?
|
||||
optimizer.alpha = 0.1
|
||||
optimizer.learn_rate = 0.1
|
||||
optimizer.b1 = 0.0
|
||||
optimizer.b2 = 0.0
|
||||
|
||||
|
@ -75,8 +75,7 @@ def main(model_name, unlabelled_loc):
|
|||
# batch up the examples using spaCy's minibatch
|
||||
raw_batches = minibatch(raw_docs, size=4)
|
||||
for batch in minibatch(TRAIN_DATA, size=sizes):
|
||||
docs, golds = zip(*batch)
|
||||
nlp.update(docs, golds, sgd=optimizer, drop=dropout, losses=losses)
|
||||
nlp.update(batch, sgd=optimizer, drop=dropout, losses=losses)
|
||||
raw_batch = list(next(raw_batches))
|
||||
nlp.rehearse(raw_batch, sgd=optimizer, losses=r_losses)
|
||||
print("Losses", losses)
|
||||
|
|
|
@ -17,7 +17,7 @@ import plac
|
|||
import random
|
||||
from pathlib import Path
|
||||
|
||||
from spacy.symbols import PERSON
|
||||
import srsly
|
||||
from spacy.vocab import Vocab
|
||||
|
||||
import spacy
|
||||
|
@ -68,7 +68,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
|
|||
vocab = Vocab().from_disk(vocab_path)
|
||||
# create blank Language class with correct vocab
|
||||
nlp = spacy.blank("en", vocab=vocab)
|
||||
nlp.vocab.vectors.name = "spacy_pretrained_vectors"
|
||||
nlp.vocab.vectors.name = "nel_vectors"
|
||||
print("Created blank 'en' model with vocab from '%s'" % vocab_path)
|
||||
|
||||
# Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy.
|
||||
|
@ -93,7 +93,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
|
|||
nlp.add_pipe(entity_linker, last=True)
|
||||
|
||||
# Convert the texts to docs to make sure we have doc.ents set for the training examples.
|
||||
# Also ensure that the annotated examples correspond to known identifiers in the knowlege base.
|
||||
# Also ensure that the annotated examples correspond to known identifiers in the knowledge base.
|
||||
kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
|
||||
TRAIN_DOCS = []
|
||||
for text, annotation in TRAIN_DATA:
|
||||
|
@ -118,16 +118,15 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
|
|||
with nlp.disable_pipes(*other_pipes): # only train entity linker
|
||||
# reset and initialize the weights randomly
|
||||
optimizer = nlp.begin_training()
|
||||
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DOCS)
|
||||
losses = {}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(
|
||||
texts, # batch of texts
|
||||
annotations, # batch of annotations
|
||||
batch,
|
||||
drop=0.2, # dropout - make it harder to memorise data
|
||||
losses=losses,
|
||||
sgd=optimizer,
|
||||
|
|
|
@ -134,8 +134,7 @@ def main(model=None, output_dir=None, n_iter=15):
|
|||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
|
||||
nlp.update(batch, sgd=optimizer, losses=losses)
|
||||
print("Losses", losses)
|
||||
|
||||
# test the trained model
|
||||
|
|
|
@ -68,10 +68,8 @@ def main(model=None, output_dir=None, n_iter=100):
|
|||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(
|
||||
texts, # batch of texts
|
||||
annotations, # batch of annotations
|
||||
batch,
|
||||
drop=0.5, # dropout - make it harder to memorise data
|
||||
losses=losses,
|
||||
)
|
||||
|
|
|
@ -105,8 +105,7 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
|
|||
batches = minibatch(TRAIN_DATA, size=sizes)
|
||||
losses = {}
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
|
||||
nlp.update(batch, sgd=optimizer, drop=0.35, losses=losses)
|
||||
print("Losses", losses)
|
||||
|
||||
# test the trained model
|
||||
|
|
|
@ -75,8 +75,7 @@ def main(model=None, output_dir=None, n_iter=15):
|
|||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
|
||||
nlp.update(batch, sgd=optimizer, losses=losses)
|
||||
print("Losses", losses)
|
||||
|
||||
# test the trained model
|
||||
|
|
|
@ -65,8 +65,7 @@ def main(lang="en", output_dir=None, n_iter=25):
|
|||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
|
||||
nlp.update(batch, sgd=optimizer, losses=losses)
|
||||
print("Losses", losses)
|
||||
|
||||
# test the trained model
|
||||
|
|
|
@ -10,10 +10,11 @@ see the documentation:
|
|||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import ml_datasets
|
||||
import plac
|
||||
import random
|
||||
from pathlib import Path
|
||||
import thinc.extra.datasets
|
||||
|
||||
import spacy
|
||||
from spacy.util import minibatch, compounding
|
||||
|
@ -83,8 +84,7 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None
|
|||
random.shuffle(train_data)
|
||||
batches = minibatch(train_data, size=batch_sizes)
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
|
||||
nlp.update(batch, sgd=optimizer, drop=0.2, losses=losses)
|
||||
with textcat.model.use_params(optimizer.averages):
|
||||
# evaluate on the dev data split off in load_data()
|
||||
scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
|
||||
|
@ -117,7 +117,7 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None
|
|||
def load_data(limit=0, split=0.8):
|
||||
"""Load data from the IMDB dataset."""
|
||||
# Partition off part of the train data for evaluation
|
||||
train_data, _ = thinc.extra.datasets.imdb()
|
||||
train_data, _ = ml_datasets.imdb()
|
||||
random.shuffle(train_data)
|
||||
train_data = train_data[-limit:]
|
||||
texts, labels = zip(*train_data)
|
||||
|
|
9
fabfile.py
vendored
9
fabfile.py
vendored
|
@ -1,9 +1,6 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import contextlib
|
||||
from pathlib import Path
|
||||
from fabric.api import local, lcd, env, settings, prefix
|
||||
from fabric.api import local, lcd
|
||||
from os import path, environ
|
||||
import shutil
|
||||
import sys
|
||||
|
@ -82,9 +79,7 @@ def pex():
|
|||
with virtualenv(VENV_DIR) as venv_local:
|
||||
with lcd(path.dirname(__file__)):
|
||||
sha = local("git rev-parse --short HEAD", capture=True)
|
||||
venv_local(
|
||||
"pex dist/*.whl -e spacy -o dist/spacy-%s.pex" % sha, direct=True
|
||||
)
|
||||
venv_local(f"pex dist/*.whl -e spacy -o dist/spacy-{sha}.pex", direct=True)
|
||||
|
||||
|
||||
def clean():
|
||||
|
|
|
@ -1,20 +1,21 @@
|
|||
# Our libraries
|
||||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
thinc==7.4.0.dev0
|
||||
thinc==8.0.0a0
|
||||
blis>=0.4.0,<0.5.0
|
||||
ml_datasets>=0.1.1
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
wasabi>=0.4.0,<1.1.0
|
||||
srsly>=1.0.1,<1.1.0
|
||||
srsly>=2.0.0,<3.0.0
|
||||
catalogue>=0.0.7,<1.1.0
|
||||
# Third party dependencies
|
||||
numpy>=1.15.0
|
||||
requests>=2.13.0,<3.0.0
|
||||
plac>=0.9.6,<1.2.0
|
||||
pathlib==1.0.1; python_version < "3.4"
|
||||
tqdm>=4.38.0,<5.0.0
|
||||
# Optional dependencies
|
||||
jsonschema>=2.6.0,<3.1.0
|
||||
pydantic>=1.0.0,<2.0.0
|
||||
# Development dependencies
|
||||
cython>=0.25
|
||||
pytest>=4.6.5
|
||||
|
|
16
setup.cfg
16
setup.cfg
|
@ -16,10 +16,7 @@ classifiers =
|
|||
Operating System :: MacOS :: MacOS X
|
||||
Operating System :: Microsoft :: Windows
|
||||
Programming Language :: Cython
|
||||
Programming Language :: Python :: 2
|
||||
Programming Language :: Python :: 2.7
|
||||
Programming Language :: Python :: 3
|
||||
Programming Language :: Python :: 3.5
|
||||
Programming Language :: Python :: 3.6
|
||||
Programming Language :: Python :: 3.7
|
||||
Programming Language :: Python :: 3.8
|
||||
|
@ -30,32 +27,35 @@ zip_safe = false
|
|||
include_package_data = true
|
||||
scripts =
|
||||
bin/spacy
|
||||
python_requires = >=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*
|
||||
python_requires = >=3.6
|
||||
setup_requires =
|
||||
wheel
|
||||
cython>=0.25
|
||||
numpy>=1.15.0
|
||||
# We also need our Cython packages here to compile against
|
||||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
thinc==7.4.0.dev0
|
||||
thinc==8.0.0a0
|
||||
install_requires =
|
||||
# Our libraries
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
thinc==7.4.0.dev0
|
||||
thinc==8.0.0a0
|
||||
blis>=0.4.0,<0.5.0
|
||||
wasabi>=0.4.0,<1.1.0
|
||||
srsly>=1.0.1,<1.1.0
|
||||
srsly>=2.0.0,<3.0.0
|
||||
catalogue>=0.0.7,<1.1.0
|
||||
ml_datasets
|
||||
# Third-party dependencies
|
||||
tqdm>=4.38.0,<5.0.0
|
||||
setuptools
|
||||
numpy>=1.15.0
|
||||
plac>=0.9.6,<1.2.0
|
||||
requests>=2.13.0,<3.0.0
|
||||
pathlib==1.0.1; python_version < "3.4"
|
||||
pydantic>=1.3.0,<2.0.0
|
||||
tqdm>=4.38.0,<5.0.0
|
||||
|
||||
[options.extras_require]
|
||||
lookups =
|
||||
|
|
167
setup.py
167
setup.py
|
@ -1,37 +1,23 @@
|
|||
#!/usr/bin/env python
|
||||
from __future__ import print_function
|
||||
import io
|
||||
import os
|
||||
import subprocess
|
||||
import sys
|
||||
import contextlib
|
||||
from distutils.command.build_ext import build_ext
|
||||
from distutils.sysconfig import get_python_inc
|
||||
import distutils.util
|
||||
from distutils import ccompiler, msvccompiler
|
||||
from setuptools import Extension, setup, find_packages
|
||||
import numpy
|
||||
from pathlib import Path
|
||||
from Cython.Build import cythonize
|
||||
from Cython.Compiler import Options
|
||||
|
||||
|
||||
def is_new_osx():
|
||||
"""Check whether we're on OSX >= 10.10"""
|
||||
name = distutils.util.get_platform()
|
||||
if sys.platform != "darwin":
|
||||
return False
|
||||
elif name.startswith("macosx-10"):
|
||||
minor_version = int(name.split("-")[1].split(".")[1])
|
||||
if minor_version >= 7:
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
else:
|
||||
return False
|
||||
# Preserve `__doc__` on functions and classes
|
||||
# http://docs.cython.org/en/latest/src/userguide/source_files_and_compilation.html#compiler-options
|
||||
Options.docstrings = True
|
||||
|
||||
|
||||
PACKAGES = find_packages()
|
||||
|
||||
|
||||
MOD_NAMES = [
|
||||
"spacy._align",
|
||||
"spacy.parts_of_speech",
|
||||
"spacy.strings",
|
||||
"spacy.lexeme",
|
||||
|
@ -63,16 +49,32 @@ MOD_NAMES = [
|
|||
"spacy.symbols",
|
||||
"spacy.vectors",
|
||||
]
|
||||
|
||||
|
||||
COMPILE_OPTIONS = {
|
||||
"msvc": ["/Ox", "/EHsc"],
|
||||
"mingw32": ["-O2", "-Wno-strict-prototypes", "-Wno-unused-function"],
|
||||
"other": ["-O2", "-Wno-strict-prototypes", "-Wno-unused-function"],
|
||||
}
|
||||
|
||||
|
||||
LINK_OPTIONS = {"msvc": [], "mingw32": [], "other": []}
|
||||
COMPILER_DIRECTIVES = {
|
||||
"language_level": -3,
|
||||
"embedsignature": True,
|
||||
"annotation_typing": False,
|
||||
}
|
||||
|
||||
|
||||
def is_new_osx():
|
||||
"""Check whether we're on OSX >= 10.10"""
|
||||
name = distutils.util.get_platform()
|
||||
if sys.platform != "darwin":
|
||||
return False
|
||||
elif name.startswith("macosx-10"):
|
||||
minor_version = int(name.split("-")[1].split(".")[1])
|
||||
if minor_version >= 7:
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
else:
|
||||
return False
|
||||
|
||||
|
||||
if is_new_osx():
|
||||
|
@ -105,95 +107,50 @@ class build_ext_subclass(build_ext, build_ext_options):
|
|||
build_ext.build_extensions(self)
|
||||
|
||||
|
||||
def generate_cython(root, source):
|
||||
print("Cythonizing sources")
|
||||
p = subprocess.call(
|
||||
[sys.executable, os.path.join(root, "bin", "cythonize.py"), source],
|
||||
env=os.environ,
|
||||
)
|
||||
if p != 0:
|
||||
raise RuntimeError("Running cythonize failed")
|
||||
|
||||
|
||||
def is_source_release(path):
|
||||
return os.path.exists(os.path.join(path, "PKG-INFO"))
|
||||
|
||||
|
||||
def clean(path):
|
||||
for name in MOD_NAMES:
|
||||
name = name.replace(".", "/")
|
||||
for ext in [".so", ".html", ".cpp", ".c"]:
|
||||
file_path = os.path.join(path, name + ext)
|
||||
if os.path.exists(file_path):
|
||||
os.unlink(file_path)
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
def chdir(new_dir):
|
||||
old_dir = os.getcwd()
|
||||
try:
|
||||
os.chdir(new_dir)
|
||||
sys.path.insert(0, new_dir)
|
||||
yield
|
||||
finally:
|
||||
del sys.path[0]
|
||||
os.chdir(old_dir)
|
||||
for path in path.glob("**/*"):
|
||||
if path.is_file() and path.suffix in (".so", ".cpp"):
|
||||
print(f"Deleting {path.name}")
|
||||
path.unlink()
|
||||
|
||||
|
||||
def setup_package():
|
||||
root = os.path.abspath(os.path.dirname(__file__))
|
||||
root = Path(__file__).parent
|
||||
|
||||
if len(sys.argv) > 1 and sys.argv[1] == "clean":
|
||||
return clean(root)
|
||||
return clean(root / "spacy")
|
||||
|
||||
with chdir(root):
|
||||
with io.open(os.path.join(root, "spacy", "about.py"), encoding="utf8") as f:
|
||||
about = {}
|
||||
exec(f.read(), about)
|
||||
with (root / "spacy" / "about.py").open("r") as f:
|
||||
about = {}
|
||||
exec(f.read(), about)
|
||||
|
||||
include_dirs = [
|
||||
get_python_inc(plat_specific=True),
|
||||
os.path.join(root, "include"),
|
||||
]
|
||||
include_dirs = [
|
||||
get_python_inc(plat_specific=True),
|
||||
numpy.get_include(),
|
||||
str(root / "include"),
|
||||
]
|
||||
if (
|
||||
ccompiler.new_compiler().compiler_type == "msvc"
|
||||
and msvccompiler.get_build_version() == 9
|
||||
):
|
||||
include_dirs.append(str(root / "include" / "msvc9"))
|
||||
ext_modules = []
|
||||
for name in MOD_NAMES:
|
||||
mod_path = name.replace(".", "/") + ".pyx"
|
||||
ext = Extension(name, [mod_path], language="c++")
|
||||
ext_modules.append(ext)
|
||||
print("Cythonizing sources")
|
||||
ext_modules = cythonize(ext_modules, compiler_directives=COMPILER_DIRECTIVES)
|
||||
|
||||
if (
|
||||
ccompiler.new_compiler().compiler_type == "msvc"
|
||||
and msvccompiler.get_build_version() == 9
|
||||
):
|
||||
include_dirs.append(os.path.join(root, "include", "msvc9"))
|
||||
|
||||
ext_modules = []
|
||||
for mod_name in MOD_NAMES:
|
||||
mod_path = mod_name.replace(".", "/") + ".cpp"
|
||||
extra_link_args = []
|
||||
# ???
|
||||
# Imported from patch from @mikepb
|
||||
# See Issue #267. Running blind here...
|
||||
if sys.platform == "darwin":
|
||||
dylib_path = [".." for _ in range(mod_name.count("."))]
|
||||
dylib_path = "/".join(dylib_path)
|
||||
dylib_path = "@loader_path/%s/spacy/platform/darwin/lib" % dylib_path
|
||||
extra_link_args.append("-Wl,-rpath,%s" % dylib_path)
|
||||
ext_modules.append(
|
||||
Extension(
|
||||
mod_name,
|
||||
[mod_path],
|
||||
language="c++",
|
||||
include_dirs=include_dirs,
|
||||
extra_link_args=extra_link_args,
|
||||
)
|
||||
)
|
||||
|
||||
if not is_source_release(root):
|
||||
generate_cython(root, "spacy")
|
||||
|
||||
setup(
|
||||
name="spacy",
|
||||
packages=PACKAGES,
|
||||
version=about["__version__"],
|
||||
ext_modules=ext_modules,
|
||||
cmdclass={"build_ext": build_ext_subclass},
|
||||
)
|
||||
setup(
|
||||
name="spacy",
|
||||
packages=PACKAGES,
|
||||
version=about["__version__"],
|
||||
ext_modules=ext_modules,
|
||||
cmdclass={"build_ext": build_ext_subclass},
|
||||
include_dirs=include_dirs,
|
||||
package_data={"": ["*.pyx", "*.pxd", "*.pxi", "*.cpp"]},
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
|
|
@ -1,5 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
import warnings
|
||||
import sys
|
||||
|
||||
|
@ -7,7 +5,7 @@ warnings.filterwarnings("ignore", message="numpy.dtype size changed")
|
|||
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
|
||||
|
||||
# These are imported as part of the API
|
||||
from thinc.neural.util import prefer_gpu, require_gpu
|
||||
from thinc.api import prefer_gpu, require_gpu
|
||||
|
||||
from . import pipeline
|
||||
from .cli.info import info as cli_info
|
||||
|
@ -23,6 +21,9 @@ if sys.maxunicode == 65535:
|
|||
raise SystemError(Errors.E130)
|
||||
|
||||
|
||||
config = registry
|
||||
|
||||
|
||||
def load(name, **overrides):
|
||||
depr_path = overrides.get("path")
|
||||
if depr_path not in (True, False, None):
|
||||
|
|
|
@ -1,21 +1,17 @@
|
|||
# coding: utf8
|
||||
from __future__ import print_function
|
||||
|
||||
# NB! This breaks in plac on Python 2!!
|
||||
# from __future__ import unicode_literals
|
||||
|
||||
if __name__ == "__main__":
|
||||
import plac
|
||||
import sys
|
||||
from wasabi import msg
|
||||
from spacy.cli import download, link, info, package, train, pretrain, convert
|
||||
from spacy.cli import init_model, profile, evaluate, validate, debug_data
|
||||
from spacy.cli import train_from_config_cli
|
||||
|
||||
commands = {
|
||||
"download": download,
|
||||
"link": link,
|
||||
"info": info,
|
||||
"train": train,
|
||||
"train-from-config": train_from_config_cli,
|
||||
"pretrain": pretrain,
|
||||
"debug-data": debug_data,
|
||||
"evaluate": evaluate,
|
||||
|
@ -28,9 +24,9 @@ if __name__ == "__main__":
|
|||
if len(sys.argv) == 1:
|
||||
msg.info("Available commands", ", ".join(commands), exits=1)
|
||||
command = sys.argv.pop(1)
|
||||
sys.argv[0] = "spacy %s" % command
|
||||
sys.argv[0] = f"spacy {command}"
|
||||
if command in commands:
|
||||
plac.call(commands[command], sys.argv[1:])
|
||||
else:
|
||||
available = "Available: {}".format(", ".join(commands))
|
||||
msg.fail("Unknown command: {}".format(command), available, exits=1)
|
||||
available = f"Available: {', '.join(commands)}"
|
||||
msg.fail(f"Unknown command: {command}", available, exits=1)
|
||||
|
|
255
spacy/_align.pyx
255
spacy/_align.pyx
|
@ -1,255 +0,0 @@
|
|||
# cython: infer_types=True
|
||||
'''Do Levenshtein alignment, for evaluation of tokenized input.
|
||||
|
||||
Random notes:
|
||||
|
||||
r i n g
|
||||
0 1 2 3 4
|
||||
r 1 0 1 2 3
|
||||
a 2 1 1 2 3
|
||||
n 3 2 2 1 2
|
||||
g 4 3 3 2 1
|
||||
|
||||
0,0: (1,1)=min(0+0,1+1,1+1)=0 S
|
||||
1,0: (2,1)=min(1+1,0+1,2+1)=1 D
|
||||
2,0: (3,1)=min(2+1,3+1,1+1)=2 D
|
||||
3,0: (4,1)=min(3+1,4+1,2+1)=3 D
|
||||
0,1: (1,2)=min(1+1,2+1,0+1)=1 D
|
||||
1,1: (2,2)=min(0+1,1+1,1+1)=1 S
|
||||
2,1: (3,2)=min(1+1,1+1,2+1)=2 S or I
|
||||
3,1: (4,2)=min(2+1,2+1,3+1)=3 S or I
|
||||
0,2: (1,3)=min(2+1,3+1,1+1)=2 I
|
||||
1,2: (2,3)=min(1+1,2+1,1+1)=2 S or I
|
||||
2,2: (3,3)
|
||||
3,2: (4,3)
|
||||
At state (i, j) we're asking "How do I transform S[:i+1] to T[:j+1]?"
|
||||
|
||||
We know the costs to transition:
|
||||
|
||||
S[:i] -> T[:j] (at D[i,j])
|
||||
S[:i+1] -> T[:j] (at D[i+1,j])
|
||||
S[:i] -> T[:j+1] (at D[i,j+1])
|
||||
|
||||
Further, now we can transform:
|
||||
S[:i+1] -> S[:i] (DEL) for 1,
|
||||
T[:j+1] -> T[:j] (INS) for 1.
|
||||
S[i+1] -> T[j+1] (SUB) for 0 or 1
|
||||
|
||||
Therefore we have the costs:
|
||||
SUB: Cost(S[:i]->T[:j]) + Cost(S[i]->S[j])
|
||||
i.e. D[i, j] + S[i+1] != T[j+1]
|
||||
INS: Cost(S[:i+1]->T[:j]) + Cost(T[:j+1]->T[:j])
|
||||
i.e. D[i+1,j] + 1
|
||||
DEL: Cost(S[:i]->T[:j+1]) + Cost(S[:i+1]->S[:i])
|
||||
i.e. D[i,j+1] + 1
|
||||
|
||||
Source string S has length m, with index i
|
||||
Target string T has length n, with index j
|
||||
|
||||
Output two alignment vectors: i2j (length m) and j2i (length n)
|
||||
# function LevenshteinDistance(char s[1..m], char t[1..n]):
|
||||
# for all i and j, d[i,j] will hold the Levenshtein distance between
|
||||
# the first i characters of s and the first j characters of t
|
||||
# note that d has (m+1)*(n+1) values
|
||||
# set each element in d to zero
|
||||
ring rang
|
||||
- r i n g
|
||||
- 0 0 0 0 0
|
||||
r 0 0 0 0 0
|
||||
a 0 0 0 0 0
|
||||
n 0 0 0 0 0
|
||||
g 0 0 0 0 0
|
||||
|
||||
# source prefixes can be transformed into empty string by
|
||||
# dropping all characters
|
||||
# d[i, 0] := i
|
||||
ring rang
|
||||
- r i n g
|
||||
- 0 0 0 0 0
|
||||
r 1 0 0 0 0
|
||||
a 2 0 0 0 0
|
||||
n 3 0 0 0 0
|
||||
g 4 0 0 0 0
|
||||
|
||||
# target prefixes can be reached from empty source prefix
|
||||
# by inserting every character
|
||||
# d[0, j] := j
|
||||
- r i n g
|
||||
- 0 1 2 3 4
|
||||
r 1 0 0 0 0
|
||||
a 2 0 0 0 0
|
||||
n 3 0 0 0 0
|
||||
g 4 0 0 0 0
|
||||
|
||||
'''
|
||||
from __future__ import unicode_literals
|
||||
from libc.stdint cimport uint32_t
|
||||
import numpy
|
||||
cimport numpy as np
|
||||
from .compat import unicode_
|
||||
from murmurhash.mrmr cimport hash32
|
||||
|
||||
|
||||
def align(S, T):
|
||||
cdef int m = len(S)
|
||||
cdef int n = len(T)
|
||||
cdef np.ndarray matrix = numpy.zeros((m+1, n+1), dtype='int32')
|
||||
cdef np.ndarray i2j = numpy.zeros((m,), dtype='i')
|
||||
cdef np.ndarray j2i = numpy.zeros((n,), dtype='i')
|
||||
|
||||
cdef np.ndarray S_arr = _convert_sequence(S)
|
||||
cdef np.ndarray T_arr = _convert_sequence(T)
|
||||
|
||||
fill_matrix(<int*>matrix.data,
|
||||
<const int*>S_arr.data, m, <const int*>T_arr.data, n)
|
||||
fill_i2j(i2j, matrix)
|
||||
fill_j2i(j2i, matrix)
|
||||
for i in range(i2j.shape[0]):
|
||||
if i2j[i] >= 0 and len(S[i]) != len(T[i2j[i]]):
|
||||
i2j[i] = -1
|
||||
for j in range(j2i.shape[0]):
|
||||
if j2i[j] >= 0 and len(T[j]) != len(S[j2i[j]]):
|
||||
j2i[j] = -1
|
||||
return matrix[-1,-1], i2j, j2i, matrix
|
||||
|
||||
|
||||
def multi_align(np.ndarray i2j, np.ndarray j2i, i_lengths, j_lengths):
|
||||
'''Let's say we had:
|
||||
|
||||
Guess: [aa bb cc dd]
|
||||
Truth: [aa bbcc dd]
|
||||
i2j: [0, None, -2, 2]
|
||||
j2i: [0, -2, 3]
|
||||
|
||||
We want:
|
||||
|
||||
i2j_multi: {1: 1, 2: 1}
|
||||
j2i_multi: {}
|
||||
'''
|
||||
i2j_miss = _get_regions(i2j, i_lengths)
|
||||
j2i_miss = _get_regions(j2i, j_lengths)
|
||||
|
||||
i2j_multi, j2i_multi = _get_mapping(i2j_miss, j2i_miss, i_lengths, j_lengths)
|
||||
return i2j_multi, j2i_multi
|
||||
|
||||
|
||||
def _get_regions(alignment, lengths):
|
||||
regions = {}
|
||||
start = None
|
||||
offset = 0
|
||||
for i in range(len(alignment)):
|
||||
if alignment[i] < 0:
|
||||
if start is None:
|
||||
start = offset
|
||||
regions.setdefault(start, [])
|
||||
regions[start].append(i)
|
||||
else:
|
||||
start = None
|
||||
offset += lengths[i]
|
||||
return regions
|
||||
|
||||
|
||||
def _get_mapping(miss1, miss2, lengths1, lengths2):
|
||||
i2j = {}
|
||||
j2i = {}
|
||||
for start, region1 in miss1.items():
|
||||
if not region1 or start not in miss2:
|
||||
continue
|
||||
region2 = miss2[start]
|
||||
if sum(lengths1[i] for i in region1) == sum(lengths2[i] for i in region2):
|
||||
j = region2.pop(0)
|
||||
buff = []
|
||||
# Consume tokens from region 1, until we meet the length of the
|
||||
# first token in region2. If we do, align the tokens. If
|
||||
# we exceed the length, break.
|
||||
while region1:
|
||||
buff.append(region1.pop(0))
|
||||
if sum(lengths1[i] for i in buff) == lengths2[j]:
|
||||
for i in buff:
|
||||
i2j[i] = j
|
||||
j2i[j] = buff[-1]
|
||||
j += 1
|
||||
buff = []
|
||||
elif sum(lengths1[i] for i in buff) > lengths2[j]:
|
||||
break
|
||||
else:
|
||||
if buff and sum(lengths1[i] for i in buff) == lengths2[j]:
|
||||
for i in buff:
|
||||
i2j[i] = j
|
||||
j2i[j] = buff[-1]
|
||||
return i2j, j2i
|
||||
|
||||
|
||||
def _convert_sequence(seq):
|
||||
if isinstance(seq, numpy.ndarray):
|
||||
return numpy.ascontiguousarray(seq, dtype='uint32_t')
|
||||
cdef np.ndarray output = numpy.zeros((len(seq),), dtype='uint32')
|
||||
cdef bytes item_bytes
|
||||
for i, item in enumerate(seq):
|
||||
if item == "``":
|
||||
item = '"'
|
||||
elif item == "''":
|
||||
item = '"'
|
||||
if isinstance(item, unicode):
|
||||
item_bytes = item.encode('utf8')
|
||||
else:
|
||||
item_bytes = item
|
||||
output[i] = hash32(<void*><char*>item_bytes, len(item_bytes), 0)
|
||||
return output
|
||||
|
||||
|
||||
cdef void fill_matrix(int* D,
|
||||
const int* S, int m, const int* T, int n) nogil:
|
||||
m1 = m+1
|
||||
n1 = n+1
|
||||
for i in range(m1*n1):
|
||||
D[i] = 0
|
||||
|
||||
for i in range(m1):
|
||||
D[i*n1] = i
|
||||
|
||||
for j in range(n1):
|
||||
D[j] = j
|
||||
|
||||
cdef int sub_cost, ins_cost, del_cost
|
||||
for j in range(n):
|
||||
for i in range(m):
|
||||
i_j = i*n1 + j
|
||||
i1_j1 = (i+1)*n1 + j+1
|
||||
i1_j = (i+1)*n1 + j
|
||||
i_j1 = i*n1 + j+1
|
||||
if S[i] != T[j]:
|
||||
sub_cost = D[i_j] + 1
|
||||
else:
|
||||
sub_cost = D[i_j]
|
||||
del_cost = D[i_j1] + 1
|
||||
ins_cost = D[i1_j] + 1
|
||||
best = min(min(sub_cost, ins_cost), del_cost)
|
||||
D[i1_j1] = best
|
||||
|
||||
|
||||
cdef void fill_i2j(np.ndarray i2j, np.ndarray D) except *:
|
||||
j = D.shape[1]-2
|
||||
cdef int i = D.shape[0]-2
|
||||
while i >= 0:
|
||||
while D[i+1, j] < D[i+1, j+1]:
|
||||
j -= 1
|
||||
if D[i, j+1] < D[i+1, j+1]:
|
||||
i2j[i] = -1
|
||||
else:
|
||||
i2j[i] = j
|
||||
j -= 1
|
||||
i -= 1
|
||||
|
||||
cdef void fill_j2i(np.ndarray j2i, np.ndarray D) except *:
|
||||
i = D.shape[0]-2
|
||||
cdef int j = D.shape[1]-2
|
||||
while j >= 0:
|
||||
while D[i, j+1] < D[i+1, j+1]:
|
||||
i -= 1
|
||||
if D[i+1, j] < D[i+1, j+1]:
|
||||
j2i[j] = -1
|
||||
else:
|
||||
j2i[j] = i
|
||||
i -= 1
|
||||
j -= 1
|
985
spacy/_ml.py
985
spacy/_ml.py
|
@ -1,985 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy
|
||||
from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu
|
||||
from thinc.t2t import ExtractWindow, ParametricAttention
|
||||
from thinc.t2v import Pooling, sum_pool, mean_pool
|
||||
from thinc.i2v import HashEmbed
|
||||
from thinc.misc import Residual, FeatureExtracter
|
||||
from thinc.misc import LayerNorm as LN
|
||||
from thinc.api import add, layerize, chain, clone, concatenate, with_flatten
|
||||
from thinc.api import with_getitem, flatten_add_lengths
|
||||
from thinc.api import uniqued, wrap, noop
|
||||
from thinc.linear.linear import LinearModel
|
||||
from thinc.neural.ops import NumpyOps, CupyOps
|
||||
from thinc.neural.util import get_array_module, copy_array
|
||||
from thinc.neural.optimizers import Adam
|
||||
|
||||
from thinc import describe
|
||||
from thinc.describe import Dimension, Synapses, Biases, Gradient
|
||||
from thinc.neural._classes.affine import _set_dimensions_if_needed
|
||||
import thinc.extra.load_nlp
|
||||
|
||||
from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
|
||||
from .errors import Errors, user_warning, Warnings
|
||||
from . import util
|
||||
from . import ml as new_ml
|
||||
from .ml import _legacy_tok2vec
|
||||
|
||||
|
||||
VECTORS_KEY = "spacy_pretrained_vectors"
|
||||
# Backwards compatibility with <2.2.2
|
||||
USE_MODEL_REGISTRY_TOK2VEC = False
|
||||
|
||||
|
||||
def cosine(vec1, vec2):
|
||||
xp = get_array_module(vec1)
|
||||
norm1 = xp.linalg.norm(vec1)
|
||||
norm2 = xp.linalg.norm(vec2)
|
||||
if norm1 == 0.0 or norm2 == 0.0:
|
||||
return 0
|
||||
else:
|
||||
return vec1.dot(vec2) / (norm1 * norm2)
|
||||
|
||||
|
||||
def create_default_optimizer(ops, **cfg):
|
||||
learn_rate = util.env_opt("learn_rate", 0.001)
|
||||
beta1 = util.env_opt("optimizer_B1", 0.9)
|
||||
beta2 = util.env_opt("optimizer_B2", 0.999)
|
||||
eps = util.env_opt("optimizer_eps", 1e-8)
|
||||
L2 = util.env_opt("L2_penalty", 1e-6)
|
||||
max_grad_norm = util.env_opt("grad_norm_clip", 1.0)
|
||||
optimizer = Adam(ops, learn_rate, L2=L2, beta1=beta1, beta2=beta2, eps=eps)
|
||||
optimizer.max_grad_norm = max_grad_norm
|
||||
optimizer.device = ops.device
|
||||
return optimizer
|
||||
|
||||
|
||||
@layerize
|
||||
def _flatten_add_lengths(seqs, pad=0, drop=0.0):
|
||||
ops = Model.ops
|
||||
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
|
||||
|
||||
def finish_update(d_X, sgd=None):
|
||||
return ops.unflatten(d_X, lengths, pad=pad)
|
||||
|
||||
X = ops.flatten(seqs, pad=pad)
|
||||
return (X, lengths), finish_update
|
||||
|
||||
|
||||
def _zero_init(model):
|
||||
def _zero_init_impl(self, *args, **kwargs):
|
||||
self.W.fill(0)
|
||||
|
||||
model.on_init_hooks.append(_zero_init_impl)
|
||||
if model.W is not None:
|
||||
model.W.fill(0.0)
|
||||
return model
|
||||
|
||||
|
||||
def with_cpu(ops, model):
|
||||
"""Wrap a model that should run on CPU, transferring inputs and outputs
|
||||
as necessary."""
|
||||
model.to_cpu()
|
||||
|
||||
def with_cpu_forward(inputs, drop=0.0):
|
||||
cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop)
|
||||
gpu_outputs = _to_device(ops, cpu_outputs)
|
||||
|
||||
def with_cpu_backprop(d_outputs, sgd=None):
|
||||
cpu_d_outputs = _to_cpu(d_outputs)
|
||||
return backprop(cpu_d_outputs, sgd=sgd)
|
||||
|
||||
return gpu_outputs, with_cpu_backprop
|
||||
|
||||
return wrap(with_cpu_forward, model)
|
||||
|
||||
|
||||
def _to_cpu(X):
|
||||
if isinstance(X, numpy.ndarray):
|
||||
return X
|
||||
elif isinstance(X, tuple):
|
||||
return tuple([_to_cpu(x) for x in X])
|
||||
elif isinstance(X, list):
|
||||
return [_to_cpu(x) for x in X]
|
||||
elif hasattr(X, "get"):
|
||||
return X.get()
|
||||
else:
|
||||
return X
|
||||
|
||||
|
||||
def _to_device(ops, X):
|
||||
if isinstance(X, tuple):
|
||||
return tuple([_to_device(ops, x) for x in X])
|
||||
elif isinstance(X, list):
|
||||
return [_to_device(ops, x) for x in X]
|
||||
else:
|
||||
return ops.asarray(X)
|
||||
|
||||
|
||||
class extract_ngrams(Model):
|
||||
def __init__(self, ngram_size, attr=LOWER):
|
||||
Model.__init__(self)
|
||||
self.ngram_size = ngram_size
|
||||
self.attr = attr
|
||||
|
||||
def begin_update(self, docs, drop=0.0):
|
||||
batch_keys = []
|
||||
batch_vals = []
|
||||
for doc in docs:
|
||||
unigrams = doc.to_array([self.attr])
|
||||
ngrams = [unigrams]
|
||||
for n in range(2, self.ngram_size + 1):
|
||||
ngrams.append(self.ops.ngrams(n, unigrams))
|
||||
keys = self.ops.xp.concatenate(ngrams)
|
||||
keys, vals = self.ops.xp.unique(keys, return_counts=True)
|
||||
batch_keys.append(keys)
|
||||
batch_vals.append(vals)
|
||||
# The dtype here matches what thinc is expecting -- which differs per
|
||||
# platform (by int definition). This should be fixed once the problem
|
||||
# is fixed on Thinc's side.
|
||||
lengths = self.ops.asarray(
|
||||
[arr.shape[0] for arr in batch_keys], dtype=numpy.int_
|
||||
)
|
||||
batch_keys = self.ops.xp.concatenate(batch_keys)
|
||||
batch_vals = self.ops.asarray(self.ops.xp.concatenate(batch_vals), dtype="f")
|
||||
return (batch_keys, batch_vals, lengths), None
|
||||
|
||||
|
||||
@describe.on_data(
|
||||
_set_dimensions_if_needed, lambda model, X, y: model.init_weights(model)
|
||||
)
|
||||
@describe.attributes(
|
||||
nI=Dimension("Input size"),
|
||||
nF=Dimension("Number of features"),
|
||||
nO=Dimension("Output size"),
|
||||
nP=Dimension("Maxout pieces"),
|
||||
W=Synapses("Weights matrix", lambda obj: (obj.nF, obj.nO, obj.nP, obj.nI)),
|
||||
b=Biases("Bias vector", lambda obj: (obj.nO, obj.nP)),
|
||||
pad=Synapses(
|
||||
"Pad",
|
||||
lambda obj: (1, obj.nF, obj.nO, obj.nP),
|
||||
lambda M, ops: ops.normal_init(M, 1.0),
|
||||
),
|
||||
d_W=Gradient("W"),
|
||||
d_pad=Gradient("pad"),
|
||||
d_b=Gradient("b"),
|
||||
)
|
||||
class PrecomputableAffine(Model):
|
||||
def __init__(self, nO=None, nI=None, nF=None, nP=None, **kwargs):
|
||||
Model.__init__(self, **kwargs)
|
||||
self.nO = nO
|
||||
self.nP = nP
|
||||
self.nI = nI
|
||||
self.nF = nF
|
||||
|
||||
def begin_update(self, X, drop=0.0):
|
||||
Yf = self.ops.gemm(
|
||||
X, self.W.reshape((self.nF * self.nO * self.nP, self.nI)), trans2=True
|
||||
)
|
||||
Yf = Yf.reshape((Yf.shape[0], self.nF, self.nO, self.nP))
|
||||
Yf = self._add_padding(Yf)
|
||||
|
||||
def backward(dY_ids, sgd=None):
|
||||
dY, ids = dY_ids
|
||||
dY, ids = self._backprop_padding(dY, ids)
|
||||
Xf = X[ids]
|
||||
Xf = Xf.reshape((Xf.shape[0], self.nF * self.nI))
|
||||
|
||||
self.d_b += dY.sum(axis=0)
|
||||
dY = dY.reshape((dY.shape[0], self.nO * self.nP))
|
||||
|
||||
Wopfi = self.W.transpose((1, 2, 0, 3))
|
||||
Wopfi = self.ops.xp.ascontiguousarray(Wopfi)
|
||||
Wopfi = Wopfi.reshape((self.nO * self.nP, self.nF * self.nI))
|
||||
dXf = self.ops.gemm(dY.reshape((dY.shape[0], self.nO * self.nP)), Wopfi)
|
||||
|
||||
# Reuse the buffer
|
||||
dWopfi = Wopfi
|
||||
dWopfi.fill(0.0)
|
||||
self.ops.gemm(dY, Xf, out=dWopfi, trans1=True)
|
||||
dWopfi = dWopfi.reshape((self.nO, self.nP, self.nF, self.nI))
|
||||
# (o, p, f, i) --> (f, o, p, i)
|
||||
self.d_W += dWopfi.transpose((2, 0, 1, 3))
|
||||
|
||||
if sgd is not None:
|
||||
sgd(self._mem.weights, self._mem.gradient, key=self.id)
|
||||
return dXf.reshape((dXf.shape[0], self.nF, self.nI))
|
||||
|
||||
return Yf, backward
|
||||
|
||||
def _add_padding(self, Yf):
|
||||
Yf_padded = self.ops.xp.vstack((self.pad, Yf))
|
||||
return Yf_padded
|
||||
|
||||
def _backprop_padding(self, dY, ids):
|
||||
# (1, nF, nO, nP) += (nN, nF, nO, nP) where IDs (nN, nF) < 0
|
||||
mask = ids < 0.0
|
||||
mask = mask.sum(axis=1)
|
||||
d_pad = dY * mask.reshape((ids.shape[0], 1, 1))
|
||||
self.d_pad += d_pad.sum(axis=0)
|
||||
return dY, ids
|
||||
|
||||
@staticmethod
|
||||
def init_weights(model):
|
||||
"""This is like the 'layer sequential unit variance', but instead
|
||||
of taking the actual inputs, we randomly generate whitened data.
|
||||
|
||||
Why's this all so complicated? We have a huge number of inputs,
|
||||
and the maxout unit makes guessing the dynamics tricky. Instead
|
||||
we set the maxout weights to values that empirically result in
|
||||
whitened outputs given whitened inputs.
|
||||
"""
|
||||
if (model.W ** 2).sum() != 0.0:
|
||||
return
|
||||
ops = model.ops
|
||||
xp = ops.xp
|
||||
ops.normal_init(model.W, model.nF * model.nI, inplace=True)
|
||||
|
||||
ids = ops.allocate((5000, model.nF), dtype="f")
|
||||
ids += xp.random.uniform(0, 1000, ids.shape)
|
||||
ids = ops.asarray(ids, dtype="i")
|
||||
tokvecs = ops.allocate((5000, model.nI), dtype="f")
|
||||
tokvecs += xp.random.normal(loc=0.0, scale=1.0, size=tokvecs.size).reshape(
|
||||
tokvecs.shape
|
||||
)
|
||||
|
||||
def predict(ids, tokvecs):
|
||||
# nS ids. nW tokvecs. Exclude the padding array.
|
||||
hiddens = model(tokvecs[:-1]) # (nW, f, o, p)
|
||||
vectors = model.ops.allocate((ids.shape[0], model.nO * model.nP), dtype="f")
|
||||
# need nS vectors
|
||||
hiddens = hiddens.reshape(
|
||||
(hiddens.shape[0] * model.nF, model.nO * model.nP)
|
||||
)
|
||||
model.ops.scatter_add(vectors, ids.flatten(), hiddens)
|
||||
vectors = vectors.reshape((vectors.shape[0], model.nO, model.nP))
|
||||
vectors += model.b
|
||||
vectors = model.ops.asarray(vectors)
|
||||
if model.nP >= 2:
|
||||
return model.ops.maxout(vectors)[0]
|
||||
else:
|
||||
return vectors * (vectors >= 0)
|
||||
|
||||
tol_var = 0.01
|
||||
tol_mean = 0.01
|
||||
t_max = 10
|
||||
t_i = 0
|
||||
for t_i in range(t_max):
|
||||
acts1 = predict(ids, tokvecs)
|
||||
var = model.ops.xp.var(acts1)
|
||||
mean = model.ops.xp.mean(acts1)
|
||||
if abs(var - 1.0) >= tol_var:
|
||||
model.W /= model.ops.xp.sqrt(var)
|
||||
elif abs(mean) >= tol_mean:
|
||||
model.b -= mean
|
||||
else:
|
||||
break
|
||||
|
||||
|
||||
def link_vectors_to_models(vocab):
|
||||
vectors = vocab.vectors
|
||||
if vectors.name is None:
|
||||
vectors.name = VECTORS_KEY
|
||||
if vectors.data.size != 0:
|
||||
user_warning(Warnings.W020.format(shape=vectors.data.shape))
|
||||
ops = Model.ops
|
||||
for word in vocab:
|
||||
if word.orth in vectors.key2row:
|
||||
word.rank = vectors.key2row[word.orth]
|
||||
else:
|
||||
word.rank = 0
|
||||
data = ops.asarray(vectors.data)
|
||||
# Set an entry here, so that vectors are accessed by StaticVectors
|
||||
# (unideal, I know)
|
||||
key = (ops.device, vectors.name)
|
||||
if key in thinc.extra.load_nlp.VECTORS:
|
||||
if thinc.extra.load_nlp.VECTORS[key].shape != data.shape:
|
||||
# This is a hack to avoid the problem in #3853. Maybe we should
|
||||
# print a warning as well?
|
||||
old_name = vectors.name
|
||||
new_name = vectors.name + "_%d" % data.shape[0]
|
||||
user_warning(Warnings.W019.format(old=old_name, new=new_name))
|
||||
vectors.name = new_name
|
||||
key = (ops.device, vectors.name)
|
||||
thinc.extra.load_nlp.VECTORS[key] = data
|
||||
|
||||
|
||||
def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
|
||||
import torch.nn
|
||||
from thinc.api import with_square_sequences
|
||||
from thinc.extra.wrappers import PyTorchWrapperRNN
|
||||
|
||||
if depth == 0:
|
||||
return layerize(noop())
|
||||
model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout)
|
||||
return with_square_sequences(PyTorchWrapperRNN(model))
|
||||
|
||||
|
||||
def Tok2Vec(width, embed_size, **kwargs):
|
||||
if not USE_MODEL_REGISTRY_TOK2VEC:
|
||||
# Preserve prior tok2vec for backwards compat, in v2.2.2
|
||||
return _legacy_tok2vec.Tok2Vec(width, embed_size, **kwargs)
|
||||
pretrained_vectors = kwargs.get("pretrained_vectors", None)
|
||||
cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
|
||||
subword_features = kwargs.get("subword_features", True)
|
||||
char_embed = kwargs.get("char_embed", False)
|
||||
conv_depth = kwargs.get("conv_depth", 4)
|
||||
bilstm_depth = kwargs.get("bilstm_depth", 0)
|
||||
conv_window = kwargs.get("conv_window", 1)
|
||||
|
||||
cols = ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]
|
||||
|
||||
doc2feats_cfg = {"arch": "spacy.Doc2Feats.v1", "config": {"columns": cols}}
|
||||
if char_embed:
|
||||
embed_cfg = {
|
||||
"arch": "spacy.CharacterEmbed.v1",
|
||||
"config": {
|
||||
"width": 64,
|
||||
"chars": 6,
|
||||
"@mix": {
|
||||
"arch": "spacy.LayerNormalizedMaxout.v1",
|
||||
"config": {"width": width, "pieces": 3},
|
||||
},
|
||||
"@embed_features": None,
|
||||
},
|
||||
}
|
||||
else:
|
||||
embed_cfg = {
|
||||
"arch": "spacy.MultiHashEmbed.v1",
|
||||
"config": {
|
||||
"width": width,
|
||||
"rows": embed_size,
|
||||
"columns": cols,
|
||||
"use_subwords": subword_features,
|
||||
"@pretrained_vectors": None,
|
||||
"@mix": {
|
||||
"arch": "spacy.LayerNormalizedMaxout.v1",
|
||||
"config": {"width": width, "pieces": 3},
|
||||
},
|
||||
},
|
||||
}
|
||||
if pretrained_vectors:
|
||||
embed_cfg["config"]["@pretrained_vectors"] = {
|
||||
"arch": "spacy.PretrainedVectors.v1",
|
||||
"config": {
|
||||
"vectors_name": pretrained_vectors,
|
||||
"width": width,
|
||||
"column": cols.index("ID"),
|
||||
},
|
||||
}
|
||||
if cnn_maxout_pieces >= 2:
|
||||
cnn_cfg = {
|
||||
"arch": "spacy.MaxoutWindowEncoder.v1",
|
||||
"config": {
|
||||
"width": width,
|
||||
"window_size": conv_window,
|
||||
"pieces": cnn_maxout_pieces,
|
||||
"depth": conv_depth,
|
||||
},
|
||||
}
|
||||
else:
|
||||
cnn_cfg = {
|
||||
"arch": "spacy.MishWindowEncoder.v1",
|
||||
"config": {"width": width, "window_size": conv_window, "depth": conv_depth},
|
||||
}
|
||||
bilstm_cfg = {
|
||||
"arch": "spacy.TorchBiLSTMEncoder.v1",
|
||||
"config": {"width": width, "depth": bilstm_depth},
|
||||
}
|
||||
if conv_depth == 0 and bilstm_depth == 0:
|
||||
encode_cfg = {}
|
||||
elif conv_depth >= 1 and bilstm_depth >= 1:
|
||||
encode_cfg = {
|
||||
"arch": "thinc.FeedForward.v1",
|
||||
"config": {"children": [cnn_cfg, bilstm_cfg]},
|
||||
}
|
||||
elif conv_depth >= 1:
|
||||
encode_cfg = cnn_cfg
|
||||
else:
|
||||
encode_cfg = bilstm_cfg
|
||||
config = {"@doc2feats": doc2feats_cfg, "@embed": embed_cfg, "@encode": encode_cfg}
|
||||
return new_ml.Tok2Vec(config)
|
||||
|
||||
|
||||
def reapply(layer, n_times):
|
||||
def reapply_fwd(X, drop=0.0):
|
||||
backprops = []
|
||||
for i in range(n_times):
|
||||
Y, backprop = layer.begin_update(X, drop=drop)
|
||||
X = Y
|
||||
backprops.append(backprop)
|
||||
|
||||
def reapply_bwd(dY, sgd=None):
|
||||
dX = None
|
||||
for backprop in reversed(backprops):
|
||||
dY = backprop(dY, sgd=sgd)
|
||||
if dX is None:
|
||||
dX = dY
|
||||
else:
|
||||
dX += dY
|
||||
return dX
|
||||
|
||||
return Y, reapply_bwd
|
||||
|
||||
return wrap(reapply_fwd, layer)
|
||||
|
||||
|
||||
def asarray(ops, dtype):
|
||||
def forward(X, drop=0.0):
|
||||
return ops.asarray(X, dtype=dtype), None
|
||||
|
||||
return layerize(forward)
|
||||
|
||||
|
||||
def _divide_array(X, size):
|
||||
parts = []
|
||||
index = 0
|
||||
while index < len(X):
|
||||
parts.append(X[index : index + size])
|
||||
index += size
|
||||
return parts
|
||||
|
||||
|
||||
def get_col(idx):
|
||||
if idx < 0:
|
||||
raise IndexError(Errors.E066.format(value=idx))
|
||||
|
||||
def forward(X, drop=0.0):
|
||||
if isinstance(X, numpy.ndarray):
|
||||
ops = NumpyOps()
|
||||
else:
|
||||
ops = CupyOps()
|
||||
output = ops.xp.ascontiguousarray(X[:, idx], dtype=X.dtype)
|
||||
|
||||
def backward(y, sgd=None):
|
||||
dX = ops.allocate(X.shape)
|
||||
dX[:, idx] += y
|
||||
return dX
|
||||
|
||||
return output, backward
|
||||
|
||||
return layerize(forward)
|
||||
|
||||
|
||||
def doc2feats(cols=None):
|
||||
if cols is None:
|
||||
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
|
||||
|
||||
def forward(docs, drop=0.0):
|
||||
feats = []
|
||||
for doc in docs:
|
||||
feats.append(doc.to_array(cols))
|
||||
return feats, None
|
||||
|
||||
model = layerize(forward)
|
||||
model.cols = cols
|
||||
return model
|
||||
|
||||
|
||||
def print_shape(prefix):
|
||||
def forward(X, drop=0.0):
|
||||
return X, lambda dX, **kwargs: dX
|
||||
|
||||
return layerize(forward)
|
||||
|
||||
|
||||
@layerize
|
||||
def get_token_vectors(tokens_attrs_vectors, drop=0.0):
|
||||
tokens, attrs, vectors = tokens_attrs_vectors
|
||||
|
||||
def backward(d_output, sgd=None):
|
||||
return (tokens, d_output)
|
||||
|
||||
return vectors, backward
|
||||
|
||||
|
||||
@layerize
|
||||
def logistic(X, drop=0.0):
|
||||
xp = get_array_module(X)
|
||||
if not isinstance(X, xp.ndarray):
|
||||
X = xp.asarray(X)
|
||||
# Clip to range (-10, 10)
|
||||
X = xp.minimum(X, 10.0, X)
|
||||
X = xp.maximum(X, -10.0, X)
|
||||
Y = 1.0 / (1.0 + xp.exp(-X))
|
||||
|
||||
def logistic_bwd(dY, sgd=None):
|
||||
dX = dY * (Y * (1 - Y))
|
||||
return dX
|
||||
|
||||
return Y, logistic_bwd
|
||||
|
||||
|
||||
def zero_init(model):
|
||||
def _zero_init_impl(self, X, y):
|
||||
self.W.fill(0)
|
||||
|
||||
model.on_data_hooks.append(_zero_init_impl)
|
||||
return model
|
||||
|
||||
|
||||
def getitem(i):
|
||||
def getitem_fwd(X, drop=0.0):
|
||||
return X[i], None
|
||||
|
||||
return layerize(getitem_fwd)
|
||||
|
||||
|
||||
@describe.attributes(
|
||||
W=Synapses("Weights matrix", lambda obj: (obj.nO, obj.nI), lambda W, ops: None)
|
||||
)
|
||||
class MultiSoftmax(Affine):
|
||||
"""Neural network layer that predicts several multi-class attributes at once.
|
||||
For instance, we might predict one class with 6 variables, and another with 5.
|
||||
We predict the 11 neurons required for this, and then softmax them such
|
||||
that columns 0-6 make a probability distribution and coumns 6-11 make another.
|
||||
"""
|
||||
|
||||
name = "multisoftmax"
|
||||
|
||||
def __init__(self, out_sizes, nI=None, **kwargs):
|
||||
Model.__init__(self, **kwargs)
|
||||
self.out_sizes = out_sizes
|
||||
self.nO = sum(out_sizes)
|
||||
self.nI = nI
|
||||
|
||||
def predict(self, input__BI):
|
||||
output__BO = self.ops.affine(self.W, self.b, input__BI)
|
||||
i = 0
|
||||
for out_size in self.out_sizes:
|
||||
self.ops.softmax(output__BO[:, i : i + out_size], inplace=True)
|
||||
i += out_size
|
||||
return output__BO
|
||||
|
||||
def begin_update(self, input__BI, drop=0.0):
|
||||
output__BO = self.predict(input__BI)
|
||||
|
||||
def finish_update(grad__BO, sgd=None):
|
||||
self.d_W += self.ops.gemm(grad__BO, input__BI, trans1=True)
|
||||
self.d_b += grad__BO.sum(axis=0)
|
||||
grad__BI = self.ops.gemm(grad__BO, self.W)
|
||||
if sgd is not None:
|
||||
sgd(self._mem.weights, self._mem.gradient, key=self.id)
|
||||
return grad__BI
|
||||
|
||||
return output__BO, finish_update
|
||||
|
||||
|
||||
def build_tagger_model(nr_class, **cfg):
|
||||
embed_size = util.env_opt("embed_size", 2000)
|
||||
if "token_vector_width" in cfg:
|
||||
token_vector_width = cfg["token_vector_width"]
|
||||
else:
|
||||
token_vector_width = util.env_opt("token_vector_width", 96)
|
||||
pretrained_vectors = cfg.get("pretrained_vectors")
|
||||
subword_features = cfg.get("subword_features", True)
|
||||
with Model.define_operators({">>": chain, "+": add}):
|
||||
if "tok2vec" in cfg:
|
||||
tok2vec = cfg["tok2vec"]
|
||||
else:
|
||||
tok2vec = Tok2Vec(
|
||||
token_vector_width,
|
||||
embed_size,
|
||||
subword_features=subword_features,
|
||||
pretrained_vectors=pretrained_vectors,
|
||||
)
|
||||
softmax = with_flatten(Softmax(nr_class, token_vector_width))
|
||||
model = tok2vec >> softmax
|
||||
model.nI = None
|
||||
model.tok2vec = tok2vec
|
||||
model.softmax = softmax
|
||||
return model
|
||||
|
||||
|
||||
def build_morphologizer_model(class_nums, **cfg):
|
||||
embed_size = util.env_opt("embed_size", 7000)
|
||||
if "token_vector_width" in cfg:
|
||||
token_vector_width = cfg["token_vector_width"]
|
||||
else:
|
||||
token_vector_width = util.env_opt("token_vector_width", 128)
|
||||
pretrained_vectors = cfg.get("pretrained_vectors")
|
||||
char_embed = cfg.get("char_embed", True)
|
||||
with Model.define_operators({">>": chain, "+": add, "**": clone}):
|
||||
if "tok2vec" in cfg:
|
||||
tok2vec = cfg["tok2vec"]
|
||||
else:
|
||||
tok2vec = Tok2Vec(
|
||||
token_vector_width,
|
||||
embed_size,
|
||||
char_embed=char_embed,
|
||||
pretrained_vectors=pretrained_vectors,
|
||||
)
|
||||
softmax = with_flatten(MultiSoftmax(class_nums, token_vector_width))
|
||||
softmax.out_sizes = class_nums
|
||||
model = tok2vec >> softmax
|
||||
model.nI = None
|
||||
model.tok2vec = tok2vec
|
||||
model.softmax = softmax
|
||||
return model
|
||||
|
||||
|
||||
@layerize
|
||||
def SpacyVectors(docs, drop=0.0):
|
||||
batch = []
|
||||
for doc in docs:
|
||||
indices = numpy.zeros((len(doc),), dtype="i")
|
||||
for i, word in enumerate(doc):
|
||||
if word.orth in doc.vocab.vectors.key2row:
|
||||
indices[i] = doc.vocab.vectors.key2row[word.orth]
|
||||
else:
|
||||
indices[i] = 0
|
||||
vectors = doc.vocab.vectors.data[indices]
|
||||
batch.append(vectors)
|
||||
return batch, None
|
||||
|
||||
|
||||
def build_text_classifier(nr_class, width=64, **cfg):
|
||||
depth = cfg.get("depth", 2)
|
||||
nr_vector = cfg.get("nr_vector", 5000)
|
||||
pretrained_dims = cfg.get("pretrained_dims", 0)
|
||||
with Model.define_operators({">>": chain, "+": add, "|": concatenate, "**": clone}):
|
||||
if cfg.get("low_data") and pretrained_dims:
|
||||
model = (
|
||||
SpacyVectors
|
||||
>> flatten_add_lengths
|
||||
>> with_getitem(0, Affine(width, pretrained_dims))
|
||||
>> ParametricAttention(width)
|
||||
>> Pooling(sum_pool)
|
||||
>> Residual(ReLu(width, width)) ** 2
|
||||
>> zero_init(Affine(nr_class, width, drop_factor=0.0))
|
||||
>> logistic
|
||||
)
|
||||
return model
|
||||
|
||||
lower = HashEmbed(width, nr_vector, column=1)
|
||||
prefix = HashEmbed(width // 2, nr_vector, column=2)
|
||||
suffix = HashEmbed(width // 2, nr_vector, column=3)
|
||||
shape = HashEmbed(width // 2, nr_vector, column=4)
|
||||
|
||||
trained_vectors = FeatureExtracter(
|
||||
[ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID]
|
||||
) >> with_flatten(
|
||||
uniqued(
|
||||
(lower | prefix | suffix | shape)
|
||||
>> LN(Maxout(width, width + (width // 2) * 3)),
|
||||
column=0,
|
||||
)
|
||||
)
|
||||
|
||||
if pretrained_dims:
|
||||
static_vectors = SpacyVectors >> with_flatten(
|
||||
Affine(width, pretrained_dims)
|
||||
)
|
||||
# TODO Make concatenate support lists
|
||||
vectors = concatenate_lists(trained_vectors, static_vectors)
|
||||
vectors_width = width * 2
|
||||
else:
|
||||
vectors = trained_vectors
|
||||
vectors_width = width
|
||||
static_vectors = None
|
||||
tok2vec = vectors >> with_flatten(
|
||||
LN(Maxout(width, vectors_width))
|
||||
>> Residual((ExtractWindow(nW=1) >> LN(Maxout(width, width * 3)))) ** depth,
|
||||
pad=depth,
|
||||
)
|
||||
cnn_model = (
|
||||
tok2vec
|
||||
>> flatten_add_lengths
|
||||
>> ParametricAttention(width)
|
||||
>> Pooling(sum_pool)
|
||||
>> Residual(zero_init(Maxout(width, width)))
|
||||
>> zero_init(Affine(nr_class, width, drop_factor=0.0))
|
||||
)
|
||||
|
||||
linear_model = build_bow_text_classifier(
|
||||
nr_class, ngram_size=cfg.get("ngram_size", 1), exclusive_classes=False
|
||||
)
|
||||
if cfg.get("exclusive_classes"):
|
||||
output_layer = Softmax(nr_class, nr_class * 2)
|
||||
else:
|
||||
output_layer = (
|
||||
zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0)) >> logistic
|
||||
)
|
||||
model = (linear_model | cnn_model) >> output_layer
|
||||
model.tok2vec = chain(tok2vec, flatten)
|
||||
model.nO = nr_class
|
||||
model.lsuv = False
|
||||
return model
|
||||
|
||||
|
||||
def build_bow_text_classifier(
|
||||
nr_class, ngram_size=1, exclusive_classes=False, no_output_layer=False, **cfg
|
||||
):
|
||||
with Model.define_operators({">>": chain}):
|
||||
model = with_cpu(
|
||||
Model.ops, extract_ngrams(ngram_size, attr=ORTH) >> LinearModel(nr_class)
|
||||
)
|
||||
if not no_output_layer:
|
||||
model = model >> (cpu_softmax if exclusive_classes else logistic)
|
||||
model.nO = nr_class
|
||||
return model
|
||||
|
||||
|
||||
@layerize
|
||||
def cpu_softmax(X, drop=0.0):
|
||||
ops = NumpyOps()
|
||||
|
||||
def cpu_softmax_backward(dY, sgd=None):
|
||||
return dY
|
||||
|
||||
return ops.softmax(X), cpu_softmax_backward
|
||||
|
||||
|
||||
def build_simple_cnn_text_classifier(tok2vec, nr_class, exclusive_classes=False, **cfg):
|
||||
"""
|
||||
Build a simple CNN text classifier, given a token-to-vector model as inputs.
|
||||
If exclusive_classes=True, a softmax non-linearity is applied, so that the
|
||||
outputs sum to 1. If exclusive_classes=False, a logistic non-linearity
|
||||
is applied instead, so that outputs are in the range [0, 1].
|
||||
"""
|
||||
with Model.define_operators({">>": chain}):
|
||||
if exclusive_classes:
|
||||
output_layer = Softmax(nr_class, tok2vec.nO)
|
||||
else:
|
||||
output_layer = (
|
||||
zero_init(Affine(nr_class, tok2vec.nO, drop_factor=0.0)) >> logistic
|
||||
)
|
||||
model = tok2vec >> flatten_add_lengths >> Pooling(mean_pool) >> output_layer
|
||||
model.tok2vec = chain(tok2vec, flatten)
|
||||
model.nO = nr_class
|
||||
return model
|
||||
|
||||
|
||||
def build_nel_encoder(embed_width, hidden_width, ner_types, **cfg):
|
||||
if "entity_width" not in cfg:
|
||||
raise ValueError(Errors.E144.format(param="entity_width"))
|
||||
|
||||
conv_depth = cfg.get("conv_depth", 2)
|
||||
cnn_maxout_pieces = cfg.get("cnn_maxout_pieces", 3)
|
||||
pretrained_vectors = cfg.get("pretrained_vectors", None)
|
||||
context_width = cfg.get("entity_width")
|
||||
|
||||
with Model.define_operators({">>": chain, "**": clone}):
|
||||
# context encoder
|
||||
tok2vec = Tok2Vec(
|
||||
width=hidden_width,
|
||||
embed_size=embed_width,
|
||||
pretrained_vectors=pretrained_vectors,
|
||||
cnn_maxout_pieces=cnn_maxout_pieces,
|
||||
subword_features=True,
|
||||
conv_depth=conv_depth,
|
||||
bilstm_depth=0,
|
||||
)
|
||||
|
||||
model = (
|
||||
tok2vec
|
||||
>> flatten_add_lengths
|
||||
>> Pooling(mean_pool)
|
||||
>> Residual(zero_init(Maxout(hidden_width, hidden_width)))
|
||||
>> zero_init(Affine(context_width, hidden_width, drop_factor=0.0))
|
||||
)
|
||||
|
||||
model.tok2vec = tok2vec
|
||||
model.nO = context_width
|
||||
return model
|
||||
|
||||
|
||||
@layerize
|
||||
def flatten(seqs, drop=0.0):
|
||||
ops = Model.ops
|
||||
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
|
||||
|
||||
def finish_update(d_X, sgd=None):
|
||||
return ops.unflatten(d_X, lengths, pad=0)
|
||||
|
||||
X = ops.flatten(seqs, pad=0)
|
||||
return X, finish_update
|
||||
|
||||
|
||||
def concatenate_lists(*layers, **kwargs): # pragma: no cover
|
||||
"""Compose two or more models `f`, `g`, etc, such that their outputs are
|
||||
concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
|
||||
"""
|
||||
if not layers:
|
||||
return noop()
|
||||
drop_factor = kwargs.get("drop_factor", 1.0)
|
||||
ops = layers[0].ops
|
||||
layers = [chain(layer, flatten) for layer in layers]
|
||||
concat = concatenate(*layers)
|
||||
|
||||
def concatenate_lists_fwd(Xs, drop=0.0):
|
||||
if drop is not None:
|
||||
drop *= drop_factor
|
||||
lengths = ops.asarray([len(X) for X in Xs], dtype="i")
|
||||
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
|
||||
ys = ops.unflatten(flat_y, lengths)
|
||||
|
||||
def concatenate_lists_bwd(d_ys, sgd=None):
|
||||
return bp_flat_y(ops.flatten(d_ys), sgd=sgd)
|
||||
|
||||
return ys, concatenate_lists_bwd
|
||||
|
||||
model = wrap(concatenate_lists_fwd, concat)
|
||||
return model
|
||||
|
||||
|
||||
def masked_language_model(vocab, model, mask_prob=0.15):
|
||||
"""Convert a model into a BERT-style masked language model"""
|
||||
|
||||
random_words = _RandomWords(vocab)
|
||||
|
||||
def mlm_forward(docs, drop=0.0):
|
||||
mask, docs = _apply_mask(docs, random_words, mask_prob=mask_prob)
|
||||
mask = model.ops.asarray(mask).reshape((mask.shape[0], 1))
|
||||
output, backprop = model.begin_update(docs, drop=drop)
|
||||
|
||||
def mlm_backward(d_output, sgd=None):
|
||||
d_output *= 1 - mask
|
||||
return backprop(d_output, sgd=sgd)
|
||||
|
||||
return output, mlm_backward
|
||||
|
||||
return wrap(mlm_forward, model)
|
||||
|
||||
|
||||
class _RandomWords(object):
|
||||
def __init__(self, vocab):
|
||||
self.words = [lex.text for lex in vocab if lex.prob != 0.0]
|
||||
self.probs = [lex.prob for lex in vocab if lex.prob != 0.0]
|
||||
self.words = self.words[:10000]
|
||||
self.probs = self.probs[:10000]
|
||||
self.probs = numpy.exp(numpy.array(self.probs, dtype="f"))
|
||||
self.probs /= self.probs.sum()
|
||||
self._cache = []
|
||||
|
||||
def next(self):
|
||||
if not self._cache:
|
||||
self._cache.extend(
|
||||
numpy.random.choice(len(self.words), 10000, p=self.probs)
|
||||
)
|
||||
index = self._cache.pop()
|
||||
return self.words[index]
|
||||
|
||||
|
||||
def _apply_mask(docs, random_words, mask_prob=0.15):
|
||||
# This needs to be here to avoid circular imports
|
||||
from .tokens.doc import Doc
|
||||
|
||||
N = sum(len(doc) for doc in docs)
|
||||
mask = numpy.random.uniform(0.0, 1.0, (N,))
|
||||
mask = mask >= mask_prob
|
||||
i = 0
|
||||
masked_docs = []
|
||||
for doc in docs:
|
||||
words = []
|
||||
for token in doc:
|
||||
if not mask[i]:
|
||||
word = _replace_word(token.text, random_words)
|
||||
else:
|
||||
word = token.text
|
||||
words.append(word)
|
||||
i += 1
|
||||
spaces = [bool(w.whitespace_) for w in doc]
|
||||
# NB: If you change this implementation to instead modify
|
||||
# the docs in place, take care that the IDs reflect the original
|
||||
# words. Currently we use the original docs to make the vectors
|
||||
# for the target, so we don't lose the original tokens. But if
|
||||
# you modified the docs in place here, you would.
|
||||
masked_docs.append(Doc(doc.vocab, words=words, spaces=spaces))
|
||||
return mask, masked_docs
|
||||
|
||||
|
||||
def _replace_word(word, random_words, mask="[MASK]"):
|
||||
roll = numpy.random.random()
|
||||
if roll < 0.8:
|
||||
return mask
|
||||
elif roll < 0.9:
|
||||
return random_words.next()
|
||||
else:
|
||||
return word
|
||||
|
||||
|
||||
def _uniform_init(lo, hi):
|
||||
def wrapped(W, ops):
|
||||
copy_array(W, ops.xp.random.uniform(lo, hi, W.shape))
|
||||
|
||||
return wrapped
|
||||
|
||||
|
||||
@describe.attributes(
|
||||
nM=Dimension("Vector dimensions"),
|
||||
nC=Dimension("Number of characters per word"),
|
||||
vectors=Synapses(
|
||||
"Embed matrix", lambda obj: (obj.nC, obj.nV, obj.nM), _uniform_init(-0.1, 0.1)
|
||||
),
|
||||
d_vectors=Gradient("vectors"),
|
||||
)
|
||||
class CharacterEmbed(Model):
|
||||
def __init__(self, nM=None, nC=None, **kwargs):
|
||||
Model.__init__(self, **kwargs)
|
||||
self.nM = nM
|
||||
self.nC = nC
|
||||
|
||||
@property
|
||||
def nO(self):
|
||||
return self.nM * self.nC
|
||||
|
||||
@property
|
||||
def nV(self):
|
||||
return 256
|
||||
|
||||
def begin_update(self, docs, drop=0.0):
|
||||
if not docs:
|
||||
return []
|
||||
ids = []
|
||||
output = []
|
||||
weights = self.vectors
|
||||
# This assists in indexing; it's like looping over this dimension.
|
||||
# Still consider this weird witch craft...But thanks to Mark Neumann
|
||||
# for the tip.
|
||||
nCv = self.ops.xp.arange(self.nC)
|
||||
for doc in docs:
|
||||
doc_ids = doc.to_utf8_array(nr_char=self.nC)
|
||||
doc_vectors = self.ops.allocate((len(doc), self.nC, self.nM))
|
||||
# Let's say I have a 2d array of indices, and a 3d table of data. What numpy
|
||||
# incantation do I chant to get
|
||||
# output[i, j, k] == data[j, ids[i, j], k]?
|
||||
doc_vectors[:, nCv] = weights[nCv, doc_ids[:, nCv]]
|
||||
output.append(doc_vectors.reshape((len(doc), self.nO)))
|
||||
ids.append(doc_ids)
|
||||
|
||||
def backprop_character_embed(d_vectors, sgd=None):
|
||||
gradient = self.d_vectors
|
||||
for doc_ids, d_doc_vectors in zip(ids, d_vectors):
|
||||
d_doc_vectors = d_doc_vectors.reshape((len(doc_ids), self.nC, self.nM))
|
||||
gradient[nCv, doc_ids[:, nCv]] += d_doc_vectors[:, nCv]
|
||||
if sgd is not None:
|
||||
sgd(self._mem.weights, self._mem.gradient, key=self.id)
|
||||
return None
|
||||
|
||||
return output, backprop_character_embed
|
||||
|
||||
|
||||
def get_cossim_loss(yh, y, ignore_zeros=False):
|
||||
xp = get_array_module(yh)
|
||||
# Find the zero vectors
|
||||
if ignore_zeros:
|
||||
zero_indices = xp.abs(y).sum(axis=1) == 0
|
||||
# Add a small constant to avoid 0 vectors
|
||||
yh = yh + 1e-8
|
||||
y = y + 1e-8
|
||||
# https://math.stackexchange.com/questions/1923613/partial-derivative-of-cosine-similarity
|
||||
norm_yh = xp.linalg.norm(yh, axis=1, keepdims=True)
|
||||
norm_y = xp.linalg.norm(y, axis=1, keepdims=True)
|
||||
mul_norms = norm_yh * norm_y
|
||||
cosine = (yh * y).sum(axis=1, keepdims=True) / mul_norms
|
||||
d_yh = (y / mul_norms) - (cosine * (yh / norm_yh ** 2))
|
||||
losses = xp.abs(cosine - 1)
|
||||
if ignore_zeros:
|
||||
# If the target was a zero vector, don't count it in the loss.
|
||||
d_yh[zero_indices] = 0
|
||||
losses[zero_indices] = 0
|
||||
loss = losses.sum()
|
||||
return loss, -d_yh
|
|
@ -1,6 +1,6 @@
|
|||
# fmt: off
|
||||
__title__ = "spacy"
|
||||
__version__ = "2.2.3"
|
||||
__version__ = "3.0.0.dev3"
|
||||
__release__ = True
|
||||
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
|
||||
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from collections import OrderedDict
|
||||
from wasabi import Printer
|
||||
|
||||
from .tokens import Doc, Token, Span
|
||||
|
@ -23,7 +19,7 @@ def analyze_pipes(pipeline, name, pipe, index, warn=True):
|
|||
assert pipeline[index][0] == name
|
||||
prev_pipes = pipeline[:index]
|
||||
pipe_requires = getattr(pipe, "requires", [])
|
||||
requires = OrderedDict([(annot, False) for annot in pipe_requires])
|
||||
requires = {annot: False for annot in pipe_requires}
|
||||
if requires:
|
||||
for prev_name, prev_pipe in prev_pipes:
|
||||
prev_assigns = getattr(prev_pipe, "assigns", [])
|
||||
|
@ -98,15 +94,15 @@ def validate_attrs(values):
|
|||
for ext_attr, ext_value in value.items():
|
||||
# We don't check whether the attribute actually exists
|
||||
if ext_value is not True: # attr is something like doc._.x.y
|
||||
good = "{}._.{}".format(obj_key, ext_attr)
|
||||
bad = "{}.{}".format(good, ".".join(ext_value))
|
||||
good = f"{obj_key}._.{ext_attr}"
|
||||
bad = f"{good}.{'.'.join(ext_value)}"
|
||||
raise ValueError(Errors.E183.format(attr=bad, solution=good))
|
||||
continue # we can't validate those further
|
||||
if attr.endswith("_"): # attr is something like "token.pos_"
|
||||
raise ValueError(Errors.E184.format(attr=attr, solution=attr[:-1]))
|
||||
if value is not True: # attr is something like doc.x.y
|
||||
good = "{}.{}".format(obj_key, attr)
|
||||
bad = "{}.{}".format(good, ".".join(value))
|
||||
good = f"{obj_key}.{attr}"
|
||||
bad = f"{good}.{'.'.join(value)}"
|
||||
raise ValueError(Errors.E183.format(attr=bad, solution=good))
|
||||
obj = objs[obj_key]
|
||||
if not hasattr(obj, attr):
|
||||
|
@ -168,11 +164,10 @@ def print_summary(nlp, pretty=True, no_print=False):
|
|||
msg.table(overview, header=header, divider=True, multiline=True)
|
||||
n_problems = sum(len(p) for p in problems.values())
|
||||
if any(p for p in problems.values()):
|
||||
msg.divider("Problems ({})".format(n_problems))
|
||||
msg.divider(f"Problems ({n_problems})")
|
||||
for name, problem in problems.items():
|
||||
if problem:
|
||||
problem = ", ".join(problem)
|
||||
msg.warn("'{}' requirements not met: {}".format(name, problem))
|
||||
msg.warn(f"'{name}' requirements not met: {', '.join(problem)}")
|
||||
else:
|
||||
msg.good("No problems found.")
|
||||
if no_print:
|
||||
|
|
|
@ -91,4 +91,5 @@ cdef enum attr_id_t:
|
|||
|
||||
LANG
|
||||
ENT_KB_ID = symbols.ENT_KB_ID
|
||||
MORPH
|
||||
ENT_ID = symbols.ENT_ID
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
IDS = {
|
||||
"": NULL_ATTR,
|
||||
|
@ -91,6 +88,7 @@ IDS = {
|
|||
"SPACY": SPACY,
|
||||
"PROB": PROB,
|
||||
"LANG": LANG,
|
||||
"MORPH": MORPH,
|
||||
}
|
||||
|
||||
|
||||
|
|
|
@ -1,12 +1,21 @@
|
|||
from wasabi import msg
|
||||
|
||||
from .download import download # noqa: F401
|
||||
from .info import info # noqa: F401
|
||||
from .link import link # noqa: F401
|
||||
from .package import package # noqa: F401
|
||||
from .profile import profile # noqa: F401
|
||||
from .train import train # noqa: F401
|
||||
from .train_from_config import train_from_config_cli # noqa: F401
|
||||
from .pretrain import pretrain # noqa: F401
|
||||
from .debug_data import debug_data # noqa: F401
|
||||
from .evaluate import evaluate # noqa: F401
|
||||
from .convert import convert # noqa: F401
|
||||
from .init_model import init_model # noqa: F401
|
||||
from .validate import validate # noqa: F401
|
||||
|
||||
|
||||
def link(*args, **kwargs):
|
||||
msg.warn(
|
||||
"As of spaCy v3.0, model symlinks are deprecated. You can load models "
|
||||
"using their full names or from a directory path."
|
||||
)
|
||||
|
|
|
@ -1,220 +0,0 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
# NB: This schema describes the new format of the training data, see #2928
|
||||
TRAINING_SCHEMA = {
|
||||
"$schema": "http://json-schema.org/draft-06/schema",
|
||||
"title": "Training data for spaCy models",
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"text": {
|
||||
"title": "The text of the training example",
|
||||
"type": "string",
|
||||
"minLength": 1,
|
||||
},
|
||||
"ents": {
|
||||
"title": "Named entity spans in the text",
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"start": {
|
||||
"title": "Start character offset of the span",
|
||||
"type": "integer",
|
||||
"minimum": 0,
|
||||
},
|
||||
"end": {
|
||||
"title": "End character offset of the span",
|
||||
"type": "integer",
|
||||
"minimum": 0,
|
||||
},
|
||||
"label": {
|
||||
"title": "Entity label",
|
||||
"type": "string",
|
||||
"minLength": 1,
|
||||
"pattern": "^[A-Z0-9]*$",
|
||||
},
|
||||
},
|
||||
"required": ["start", "end", "label"],
|
||||
},
|
||||
},
|
||||
"sents": {
|
||||
"title": "Sentence spans in the text",
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"start": {
|
||||
"title": "Start character offset of the span",
|
||||
"type": "integer",
|
||||
"minimum": 0,
|
||||
},
|
||||
"end": {
|
||||
"title": "End character offset of the span",
|
||||
"type": "integer",
|
||||
"minimum": 0,
|
||||
},
|
||||
},
|
||||
"required": ["start", "end"],
|
||||
},
|
||||
},
|
||||
"cats": {
|
||||
"title": "Text categories for the text classifier",
|
||||
"type": "object",
|
||||
"patternProperties": {
|
||||
"*": {
|
||||
"title": "A text category",
|
||||
"oneOf": [
|
||||
{"type": "boolean"},
|
||||
{"type": "number", "minimum": 0},
|
||||
],
|
||||
}
|
||||
},
|
||||
"propertyNames": {"pattern": "^[A-Z0-9]*$", "minLength": 1},
|
||||
},
|
||||
"tokens": {
|
||||
"title": "The tokens in the text",
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"minProperties": 1,
|
||||
"properties": {
|
||||
"id": {
|
||||
"title": "Token ID, usually token index",
|
||||
"type": "integer",
|
||||
"minimum": 0,
|
||||
},
|
||||
"start": {
|
||||
"title": "Start character offset of the token",
|
||||
"type": "integer",
|
||||
"minimum": 0,
|
||||
},
|
||||
"end": {
|
||||
"title": "End character offset of the token",
|
||||
"type": "integer",
|
||||
"minimum": 0,
|
||||
},
|
||||
"pos": {
|
||||
"title": "Coarse-grained part-of-speech tag",
|
||||
"type": "string",
|
||||
"minLength": 1,
|
||||
},
|
||||
"tag": {
|
||||
"title": "Fine-grained part-of-speech tag",
|
||||
"type": "string",
|
||||
"minLength": 1,
|
||||
},
|
||||
"dep": {
|
||||
"title": "Dependency label",
|
||||
"type": "string",
|
||||
"minLength": 1,
|
||||
},
|
||||
"head": {
|
||||
"title": "Index of the token's head",
|
||||
"type": "integer",
|
||||
"minimum": 0,
|
||||
},
|
||||
},
|
||||
"required": ["start", "end"],
|
||||
},
|
||||
},
|
||||
"_": {"title": "Custom user space", "type": "object"},
|
||||
},
|
||||
"required": ["text"],
|
||||
},
|
||||
}
|
||||
|
||||
META_SCHEMA = {
|
||||
"$schema": "http://json-schema.org/draft-06/schema",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"lang": {
|
||||
"title": "Two-letter language code, e.g. 'en'",
|
||||
"type": "string",
|
||||
"minLength": 2,
|
||||
"maxLength": 2,
|
||||
"pattern": "^[a-z]*$",
|
||||
},
|
||||
"name": {
|
||||
"title": "Model name",
|
||||
"type": "string",
|
||||
"minLength": 1,
|
||||
"pattern": "^[a-z_]*$",
|
||||
},
|
||||
"version": {
|
||||
"title": "Model version",
|
||||
"type": "string",
|
||||
"minLength": 1,
|
||||
"pattern": "^[0-9a-z.-]*$",
|
||||
},
|
||||
"spacy_version": {
|
||||
"title": "Compatible spaCy version identifier",
|
||||
"type": "string",
|
||||
"minLength": 1,
|
||||
"pattern": "^[0-9a-z.-><=]*$",
|
||||
},
|
||||
"parent_package": {
|
||||
"title": "Name of parent spaCy package, e.g. spacy or spacy-nightly",
|
||||
"type": "string",
|
||||
"minLength": 1,
|
||||
"default": "spacy",
|
||||
},
|
||||
"pipeline": {
|
||||
"title": "Names of pipeline components",
|
||||
"type": "array",
|
||||
"items": {"type": "string", "minLength": 1},
|
||||
},
|
||||
"description": {"title": "Model description", "type": "string"},
|
||||
"license": {"title": "Model license", "type": "string"},
|
||||
"author": {"title": "Model author name", "type": "string"},
|
||||
"email": {"title": "Model author email", "type": "string", "format": "email"},
|
||||
"url": {"title": "Model author URL", "type": "string", "format": "uri"},
|
||||
"sources": {
|
||||
"title": "Training data sources",
|
||||
"type": "array",
|
||||
"items": {"type": "string"},
|
||||
},
|
||||
"vectors": {
|
||||
"title": "Included word vectors",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"keys": {
|
||||
"title": "Number of unique keys",
|
||||
"type": "integer",
|
||||
"minimum": 0,
|
||||
},
|
||||
"vectors": {
|
||||
"title": "Number of unique vectors",
|
||||
"type": "integer",
|
||||
"minimum": 0,
|
||||
},
|
||||
"width": {
|
||||
"title": "Number of dimensions",
|
||||
"type": "integer",
|
||||
"minimum": 0,
|
||||
},
|
||||
},
|
||||
},
|
||||
"accuracy": {
|
||||
"title": "Accuracy numbers",
|
||||
"type": "object",
|
||||
"patternProperties": {"*": {"type": "number", "minimum": 0.0}},
|
||||
},
|
||||
"speed": {
|
||||
"title": "Speed evaluation numbers",
|
||||
"type": "object",
|
||||
"patternProperties": {
|
||||
"*": {
|
||||
"oneOf": [
|
||||
{"type": "number", "minimum": 0.0},
|
||||
{"type": "integer", "minimum": 0},
|
||||
]
|
||||
}
|
||||
},
|
||||
},
|
||||
},
|
||||
"required": ["lang", "name", "version"],
|
||||
}
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
from pathlib import Path
|
||||
from wasabi import Printer
|
||||
import srsly
|
||||
|
@ -29,27 +25,20 @@ FILE_TYPES = ("json", "jsonl", "msg")
|
|||
FILE_TYPES_STDOUT = ("json", "jsonl")
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
input_file=("Input file", "positional", None, str),
|
||||
output_dir=("Output directory. '-' for stdout.", "positional", None, str),
|
||||
file_type=("Type of data to produce: {}".format(FILE_TYPES), "option", "t", str),
|
||||
n_sents=("Number of sentences per doc (0 to disable)", "option", "n", int),
|
||||
seg_sents=("Segment sentences (for -c ner)", "flag", "s"),
|
||||
model=("Model for sentence segmentation (for -s)", "option", "b", str),
|
||||
converter=("Converter: {}".format(tuple(CONVERTERS.keys())), "option", "c", str),
|
||||
lang=("Language (if tokenizer required)", "option", "l", str),
|
||||
morphology=("Enable appending morphology to tags", "flag", "m", bool),
|
||||
)
|
||||
def convert(
|
||||
input_file,
|
||||
output_dir="-",
|
||||
file_type="json",
|
||||
n_sents=1,
|
||||
seg_sents=False,
|
||||
model=None,
|
||||
morphology=False,
|
||||
converter="auto",
|
||||
lang=None,
|
||||
# fmt: off
|
||||
input_file: ("Input file", "positional", None, str),
|
||||
output_dir: ("Output directory. '-' for stdout.", "positional", None, str) = "-",
|
||||
file_type: (f"Type of data to produce: {FILE_TYPES}", "option", "t", str, FILE_TYPES) = "json",
|
||||
n_sents: ("Number of sentences per doc (0 to disable)", "option", "n", int) = 1,
|
||||
seg_sents: ("Segment sentences (for -c ner)", "flag", "s") = False,
|
||||
model: ("Model for sentence segmentation (for -s)", "option", "b", str) = None,
|
||||
morphology: ("Enable appending morphology to tags", "flag", "m", bool) = False,
|
||||
merge_subtokens: ("Merge CoNLL-U subtokens", "flag", "T", bool) = False,
|
||||
converter: (f"Converter: {tuple(CONVERTERS.keys())}", "option", "c", str) = "auto",
|
||||
ner_map_path: ("NER tag mapping (as JSON-encoded dict of entity types)", "option", "N", Path) = None,
|
||||
lang: ("Language (if tokenizer required)", "option", "l", str) = None,
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Convert files into JSON format for use with train command and other
|
||||
|
@ -60,16 +49,10 @@ def convert(
|
|||
no_print = output_dir == "-"
|
||||
msg = Printer(no_print=no_print)
|
||||
input_path = Path(input_file)
|
||||
if file_type not in FILE_TYPES:
|
||||
msg.fail(
|
||||
"Unknown file type: '{}'".format(file_type),
|
||||
"Supported file types: '{}'".format(", ".join(FILE_TYPES)),
|
||||
exits=1,
|
||||
)
|
||||
if file_type not in FILE_TYPES_STDOUT and output_dir == "-":
|
||||
# TODO: support msgpack via stdout in srsly?
|
||||
msg.fail(
|
||||
"Can't write .{} data to stdout.".format(file_type),
|
||||
f"Can't write .{file_type} data to stdout",
|
||||
"Please specify an output directory.",
|
||||
exits=1,
|
||||
)
|
||||
|
@ -93,21 +76,26 @@ def convert(
|
|||
"Can't automatically detect NER format. Conversion may not succeed. See https://spacy.io/api/cli#convert"
|
||||
)
|
||||
if converter not in CONVERTERS:
|
||||
msg.fail("Can't find converter for {}".format(converter), exits=1)
|
||||
msg.fail(f"Can't find converter for {converter}", exits=1)
|
||||
ner_map = None
|
||||
if ner_map_path is not None:
|
||||
ner_map = srsly.read_json(ner_map_path)
|
||||
# Use converter function to convert data
|
||||
func = CONVERTERS[converter]
|
||||
data = func(
|
||||
input_data,
|
||||
n_sents=n_sents,
|
||||
seg_sents=seg_sents,
|
||||
use_morphology=morphology,
|
||||
append_morphology=morphology,
|
||||
merge_subtokens=merge_subtokens,
|
||||
lang=lang,
|
||||
model=model,
|
||||
no_print=no_print,
|
||||
ner_map=ner_map,
|
||||
)
|
||||
if output_dir != "-":
|
||||
# Export data to a file
|
||||
suffix = ".{}".format(file_type)
|
||||
suffix = f".{file_type}"
|
||||
output_file = Path(output_dir) / Path(input_path.parts[-1]).with_suffix(suffix)
|
||||
if file_type == "json":
|
||||
srsly.write_json(output_file, data)
|
||||
|
@ -115,9 +103,7 @@ def convert(
|
|||
srsly.write_jsonl(output_file, data)
|
||||
elif file_type == "msg":
|
||||
srsly.write_msgpack(output_file, data)
|
||||
msg.good(
|
||||
"Generated output file ({} documents): {}".format(len(data), output_file)
|
||||
)
|
||||
msg.good(f"Generated output file ({len(data)} documents): {output_file}")
|
||||
else:
|
||||
# Print to stdout
|
||||
if file_type == "json":
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from wasabi import Printer
|
||||
|
||||
from ...gold import iob_to_biluo
|
||||
|
@ -64,9 +61,9 @@ def conll_ner2json(
|
|||
# sentence segmentation required for document segmentation
|
||||
if n_sents > 0 and not seg_sents:
|
||||
msg.warn(
|
||||
"No sentence boundaries found to use with option `-n {}`. "
|
||||
"Use `-s` to automatically segment sentences or `-n 0` "
|
||||
"to disable.".format(n_sents)
|
||||
f"No sentence boundaries found to use with option `-n {n_sents}`. "
|
||||
f"Use `-s` to automatically segment sentences or `-n 0` "
|
||||
f"to disable."
|
||||
)
|
||||
else:
|
||||
n_sents_info(msg, n_sents)
|
||||
|
@ -129,7 +126,7 @@ def segment_sents_and_docs(doc, n_sents, doc_delimiter, model=None, msg=None):
|
|||
if model:
|
||||
nlp = load_model(model)
|
||||
if "parser" in nlp.pipe_names:
|
||||
msg.info("Segmenting sentences with parser from model '{}'.".format(model))
|
||||
msg.info(f"Segmenting sentences with parser from model '{model}'.")
|
||||
sentencizer = nlp.get_pipe("parser")
|
||||
if not sentencizer:
|
||||
msg.info(
|
||||
|
@ -166,7 +163,7 @@ def segment_docs(input_data, n_sents, doc_delimiter):
|
|||
|
||||
|
||||
def n_sents_info(msg, n_sents):
|
||||
msg.info("Grouping every {} sentences into a document.".format(n_sents))
|
||||
msg.info(f"Grouping every {n_sents} sentences into a document.")
|
||||
if n_sents == 1:
|
||||
msg.warn(
|
||||
"To generate better training data, you may want to group "
|
||||
|
|
|
@ -1,141 +1,348 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import re
|
||||
|
||||
from ...gold import iob_to_biluo
|
||||
from ...gold import Example
|
||||
from ...gold import iob_to_biluo, spans_from_biluo_tags, biluo_tags_from_offsets
|
||||
from ...language import Language
|
||||
from ...tokens import Doc, Token
|
||||
from .conll_ner2json import n_sents_info
|
||||
from wasabi import Printer
|
||||
|
||||
|
||||
def conllu2json(input_data, n_sents=10, use_morphology=False, lang=None, **_):
|
||||
def conllu2json(
|
||||
input_data,
|
||||
n_sents=10,
|
||||
append_morphology=False,
|
||||
lang=None,
|
||||
ner_map=None,
|
||||
merge_subtokens=False,
|
||||
no_print=False,
|
||||
**_
|
||||
):
|
||||
"""
|
||||
Convert conllu files into JSON format for use with train cli.
|
||||
use_morphology parameter enables appending morphology to tags, which is
|
||||
append_morphology parameter enables appending morphology to tags, which is
|
||||
useful for languages such as Spanish, where UD tags are not so rich.
|
||||
|
||||
Extract NER tags if available and convert them so that they follow
|
||||
BILUO and the Wikipedia scheme
|
||||
"""
|
||||
# by @dvsrepo, via #11 explosion/spacy-dev-resources
|
||||
# by @katarkor
|
||||
MISC_NER_PATTERN = "\|?(?:name=)?(([A-Z_]+)-([A-Z_]+)|O)\|?"
|
||||
msg = Printer(no_print=no_print)
|
||||
n_sents_info(msg, n_sents)
|
||||
docs = []
|
||||
raw = ""
|
||||
sentences = []
|
||||
conll_tuples = read_conllx(input_data, use_morphology=use_morphology)
|
||||
checked_for_ner = False
|
||||
has_ner_tags = False
|
||||
for i, (raw_text, tokens) in enumerate(conll_tuples):
|
||||
sentence, brackets = tokens[0]
|
||||
if not checked_for_ner:
|
||||
has_ner_tags = is_ner(sentence[5][0])
|
||||
checked_for_ner = True
|
||||
sentences.append(generate_sentence(sentence, has_ner_tags))
|
||||
conll_data = read_conllx(
|
||||
input_data,
|
||||
append_morphology=append_morphology,
|
||||
ner_tag_pattern=MISC_NER_PATTERN,
|
||||
ner_map=ner_map,
|
||||
merge_subtokens=merge_subtokens,
|
||||
)
|
||||
has_ner_tags = has_ner(input_data, ner_tag_pattern=MISC_NER_PATTERN)
|
||||
for i, example in enumerate(conll_data):
|
||||
raw += example.text
|
||||
sentences.append(
|
||||
generate_sentence(
|
||||
example.token_annotation,
|
||||
has_ner_tags,
|
||||
MISC_NER_PATTERN,
|
||||
ner_map=ner_map,
|
||||
)
|
||||
)
|
||||
# Real-sized documents could be extracted using the comments on the
|
||||
# conluu document
|
||||
# conllu document
|
||||
if len(sentences) % n_sents == 0:
|
||||
doc = create_doc(sentences, i)
|
||||
doc = create_json_doc(raw, sentences, i)
|
||||
docs.append(doc)
|
||||
raw = ""
|
||||
sentences = []
|
||||
if sentences:
|
||||
doc = create_doc(sentences, i)
|
||||
doc = create_json_doc(raw, sentences, i)
|
||||
docs.append(doc)
|
||||
return docs
|
||||
|
||||
|
||||
def is_ner(tag):
|
||||
def has_ner(input_data, ner_tag_pattern):
|
||||
"""
|
||||
Check the 10th column of the first token to determine if the file contains
|
||||
NER tags
|
||||
"""
|
||||
tag_match = re.match("([A-Z_]+)-([A-Z_]+)", tag)
|
||||
if tag_match:
|
||||
return True
|
||||
elif tag == "O":
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
|
||||
def read_conllx(input_data, use_morphology=False, n=0):
|
||||
i = 0
|
||||
for sent in input_data.strip().split("\n\n"):
|
||||
lines = sent.strip().split("\n")
|
||||
if lines:
|
||||
while lines[0].startswith("#"):
|
||||
lines.pop(0)
|
||||
tokens = []
|
||||
for line in lines:
|
||||
|
||||
parts = line.split("\t")
|
||||
id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
|
||||
if "-" in id_ or "." in id_:
|
||||
continue
|
||||
try:
|
||||
id_ = int(id_) - 1
|
||||
head = (int(head) - 1) if head not in ["0", "_"] else id_
|
||||
dep = "ROOT" if dep == "root" else dep
|
||||
tag = pos if tag == "_" else tag
|
||||
tag = tag + "__" + morph if use_morphology else tag
|
||||
iob = iob if iob else "O"
|
||||
tokens.append((id_, word, tag, head, dep, iob))
|
||||
except: # noqa: E722
|
||||
print(line)
|
||||
raise
|
||||
tuples = [list(t) for t in zip(*tokens)]
|
||||
yield (None, [[tuples, []]])
|
||||
i += 1
|
||||
if n >= 1 and i >= n:
|
||||
break
|
||||
if lines:
|
||||
parts = lines[0].split("\t")
|
||||
id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
|
||||
if re.search(ner_tag_pattern, misc):
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
|
||||
def simplify_tags(iob):
|
||||
def read_conllx(
|
||||
input_data,
|
||||
append_morphology=False,
|
||||
merge_subtokens=False,
|
||||
ner_tag_pattern="",
|
||||
ner_map=None,
|
||||
):
|
||||
""" Yield examples, one for each sentence """
|
||||
vocab = Language.Defaults.create_vocab() # need vocab to make a minimal Doc
|
||||
for sent in input_data.strip().split("\n\n"):
|
||||
lines = sent.strip().split("\n")
|
||||
if lines:
|
||||
while lines[0].startswith("#"):
|
||||
lines.pop(0)
|
||||
example = example_from_conllu_sentence(
|
||||
vocab,
|
||||
lines,
|
||||
ner_tag_pattern,
|
||||
merge_subtokens=merge_subtokens,
|
||||
append_morphology=append_morphology,
|
||||
ner_map=ner_map,
|
||||
)
|
||||
yield example
|
||||
|
||||
|
||||
def get_entities(lines, tag_pattern, ner_map=None):
|
||||
"""Find entities in the MISC column according to the pattern and map to
|
||||
final entity type with `ner_map` if mapping present. Entity tag is 'O' if
|
||||
the pattern is not matched.
|
||||
|
||||
lines (unicode): CONLL-U lines for one sentences
|
||||
tag_pattern (unicode): Regex pattern for entity tag
|
||||
ner_map (dict): Map old NER tag names to new ones, '' maps to O.
|
||||
RETURNS (list): List of BILUO entity tags
|
||||
"""
|
||||
Simplify tags obtained from the dataset in order to follow Wikipedia
|
||||
scheme (PER, LOC, ORG, MISC). 'PER', 'LOC' and 'ORG' keep their tags, while
|
||||
'GPE_LOC' is simplified to 'LOC', 'GPE_ORG' to 'ORG' and all remaining tags to
|
||||
'MISC'.
|
||||
"""
|
||||
new_iob = []
|
||||
for tag in iob:
|
||||
tag_match = re.match("([A-Z_]+)-([A-Z_]+)", tag)
|
||||
miscs = []
|
||||
for line in lines:
|
||||
parts = line.split("\t")
|
||||
id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
|
||||
if "-" in id_ or "." in id_:
|
||||
continue
|
||||
miscs.append(misc)
|
||||
|
||||
iob = []
|
||||
for misc in miscs:
|
||||
tag_match = re.search(tag_pattern, misc)
|
||||
iob_tag = "O"
|
||||
if tag_match:
|
||||
prefix = tag_match.group(1)
|
||||
suffix = tag_match.group(2)
|
||||
if suffix == "GPE_LOC":
|
||||
suffix = "LOC"
|
||||
elif suffix == "GPE_ORG":
|
||||
suffix = "ORG"
|
||||
elif suffix != "PER" and suffix != "LOC" and suffix != "ORG":
|
||||
suffix = "MISC"
|
||||
tag = prefix + "-" + suffix
|
||||
new_iob.append(tag)
|
||||
return new_iob
|
||||
prefix = tag_match.group(2)
|
||||
suffix = tag_match.group(3)
|
||||
if prefix and suffix:
|
||||
iob_tag = prefix + "-" + suffix
|
||||
if ner_map:
|
||||
suffix = ner_map.get(suffix, suffix)
|
||||
if suffix == "":
|
||||
iob_tag = "O"
|
||||
else:
|
||||
iob_tag = prefix + "-" + suffix
|
||||
iob.append(iob_tag)
|
||||
return iob_to_biluo(iob)
|
||||
|
||||
|
||||
def generate_sentence(sent, has_ner_tags):
|
||||
(id_, word, tag, head, dep, iob) = sent
|
||||
def generate_sentence(token_annotation, has_ner_tags, tag_pattern, ner_map=None):
|
||||
sentence = {}
|
||||
tokens = []
|
||||
if has_ner_tags:
|
||||
iob = simplify_tags(iob)
|
||||
biluo = iob_to_biluo(iob)
|
||||
for i, id in enumerate(id_):
|
||||
for i, id_ in enumerate(token_annotation.ids):
|
||||
token = {}
|
||||
token["id"] = id
|
||||
token["orth"] = word[i]
|
||||
token["tag"] = tag[i]
|
||||
token["head"] = head[i] - id
|
||||
token["dep"] = dep[i]
|
||||
token["id"] = id_
|
||||
token["orth"] = token_annotation.get_word(i)
|
||||
token["tag"] = token_annotation.get_tag(i)
|
||||
token["pos"] = token_annotation.get_pos(i)
|
||||
token["lemma"] = token_annotation.get_lemma(i)
|
||||
token["morph"] = token_annotation.get_morph(i)
|
||||
token["head"] = token_annotation.get_head(i) - id_
|
||||
token["dep"] = token_annotation.get_dep(i)
|
||||
if has_ner_tags:
|
||||
token["ner"] = biluo[i]
|
||||
token["ner"] = token_annotation.get_entity(i)
|
||||
tokens.append(token)
|
||||
sentence["tokens"] = tokens
|
||||
return sentence
|
||||
|
||||
|
||||
def create_doc(sentences, id):
|
||||
def create_json_doc(raw, sentences, id_):
|
||||
doc = {}
|
||||
paragraph = {}
|
||||
doc["id"] = id
|
||||
doc["id"] = id_
|
||||
doc["paragraphs"] = []
|
||||
paragraph["raw"] = raw.strip()
|
||||
paragraph["sentences"] = sentences
|
||||
doc["paragraphs"].append(paragraph)
|
||||
return doc
|
||||
|
||||
|
||||
def example_from_conllu_sentence(
|
||||
vocab,
|
||||
lines,
|
||||
ner_tag_pattern,
|
||||
merge_subtokens=False,
|
||||
append_morphology=False,
|
||||
ner_map=None,
|
||||
):
|
||||
"""Create an Example from the lines for one CoNLL-U sentence, merging
|
||||
subtokens and appending morphology to tags if required.
|
||||
|
||||
lines (unicode): The non-comment lines for a CoNLL-U sentence
|
||||
ner_tag_pattern (unicode): The regex pattern for matching NER in MISC col
|
||||
RETURNS (Example): An example containing the annotation
|
||||
"""
|
||||
# create a Doc with each subtoken as its own token
|
||||
# if merging subtokens, each subtoken orth is the merged subtoken form
|
||||
if not Token.has_extension("merged_orth"):
|
||||
Token.set_extension("merged_orth", default="")
|
||||
if not Token.has_extension("merged_lemma"):
|
||||
Token.set_extension("merged_lemma", default="")
|
||||
if not Token.has_extension("merged_morph"):
|
||||
Token.set_extension("merged_morph", default="")
|
||||
if not Token.has_extension("merged_spaceafter"):
|
||||
Token.set_extension("merged_spaceafter", default="")
|
||||
words, spaces, tags, poses, morphs, lemmas = [], [], [], [], [], []
|
||||
heads, deps = [], []
|
||||
subtok_word = ""
|
||||
in_subtok = False
|
||||
for i in range(len(lines)):
|
||||
line = lines[i]
|
||||
parts = line.split("\t")
|
||||
id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
|
||||
if "." in id_:
|
||||
continue
|
||||
if "-" in id_:
|
||||
in_subtok = True
|
||||
if "-" in id_:
|
||||
in_subtok = True
|
||||
subtok_word = word
|
||||
subtok_start, subtok_end = id_.split("-")
|
||||
subtok_spaceafter = "SpaceAfter=No" not in misc
|
||||
continue
|
||||
if merge_subtokens and in_subtok:
|
||||
words.append(subtok_word)
|
||||
else:
|
||||
words.append(word)
|
||||
if in_subtok:
|
||||
if id_ == subtok_end:
|
||||
spaces.append(subtok_spaceafter)
|
||||
else:
|
||||
spaces.append(False)
|
||||
elif "SpaceAfter=No" in misc:
|
||||
spaces.append(False)
|
||||
else:
|
||||
spaces.append(True)
|
||||
if in_subtok and id_ == subtok_end:
|
||||
subtok_word = ""
|
||||
in_subtok = False
|
||||
id_ = int(id_) - 1
|
||||
head = (int(head) - 1) if head not in ("0", "_") else id_
|
||||
tag = pos if tag == "_" else tag
|
||||
morph = morph if morph != "_" else ""
|
||||
dep = "ROOT" if dep == "root" else dep
|
||||
lemmas.append(lemma)
|
||||
poses.append(pos)
|
||||
tags.append(tag)
|
||||
morphs.append(morph)
|
||||
heads.append(head)
|
||||
deps.append(dep)
|
||||
|
||||
doc = Doc(vocab, words=words, spaces=spaces)
|
||||
for i in range(len(doc)):
|
||||
doc[i].tag_ = tags[i]
|
||||
doc[i].pos_ = poses[i]
|
||||
doc[i].dep_ = deps[i]
|
||||
doc[i].lemma_ = lemmas[i]
|
||||
doc[i].head = doc[heads[i]]
|
||||
doc[i]._.merged_orth = words[i]
|
||||
doc[i]._.merged_morph = morphs[i]
|
||||
doc[i]._.merged_lemma = lemmas[i]
|
||||
doc[i]._.merged_spaceafter = spaces[i]
|
||||
ents = get_entities(lines, ner_tag_pattern, ner_map)
|
||||
doc.ents = spans_from_biluo_tags(doc, ents)
|
||||
doc.is_parsed = True
|
||||
doc.is_tagged = True
|
||||
|
||||
if merge_subtokens:
|
||||
doc = merge_conllu_subtokens(lines, doc)
|
||||
|
||||
# create Example from custom Doc annotation
|
||||
ids, words, tags, heads, deps = [], [], [], [], []
|
||||
pos, lemmas, morphs, spaces = [], [], [], []
|
||||
for i, t in enumerate(doc):
|
||||
ids.append(i)
|
||||
words.append(t._.merged_orth)
|
||||
if append_morphology and t._.merged_morph:
|
||||
tags.append(t.tag_ + "__" + t._.merged_morph)
|
||||
else:
|
||||
tags.append(t.tag_)
|
||||
pos.append(t.pos_)
|
||||
morphs.append(t._.merged_morph)
|
||||
lemmas.append(t._.merged_lemma)
|
||||
heads.append(t.head.i)
|
||||
deps.append(t.dep_)
|
||||
spaces.append(t._.merged_spaceafter)
|
||||
ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
|
||||
ents = biluo_tags_from_offsets(doc, ent_offsets)
|
||||
raw = ""
|
||||
for word, space in zip(words, spaces):
|
||||
raw += word
|
||||
if space:
|
||||
raw += " "
|
||||
example = Example(doc=raw)
|
||||
example.set_token_annotation(
|
||||
ids=ids,
|
||||
words=words,
|
||||
tags=tags,
|
||||
pos=pos,
|
||||
morphs=morphs,
|
||||
lemmas=lemmas,
|
||||
heads=heads,
|
||||
deps=deps,
|
||||
entities=ents,
|
||||
)
|
||||
return example
|
||||
|
||||
|
||||
def merge_conllu_subtokens(lines, doc):
|
||||
# identify and process all subtoken spans to prepare attrs for merging
|
||||
subtok_spans = []
|
||||
for line in lines:
|
||||
parts = line.split("\t")
|
||||
id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
|
||||
if "-" in id_:
|
||||
subtok_start, subtok_end = id_.split("-")
|
||||
subtok_span = doc[int(subtok_start) - 1 : int(subtok_end)]
|
||||
subtok_spans.append(subtok_span)
|
||||
# create merged tag, morph, and lemma values
|
||||
tags = []
|
||||
morphs = {}
|
||||
lemmas = []
|
||||
for token in subtok_span:
|
||||
tags.append(token.tag_)
|
||||
lemmas.append(token.lemma_)
|
||||
if token._.merged_morph:
|
||||
for feature in token._.merged_morph.split("|"):
|
||||
field, values = feature.split("=", 1)
|
||||
if field not in morphs:
|
||||
morphs[field] = set()
|
||||
for value in values.split(","):
|
||||
morphs[field].add(value)
|
||||
# create merged features for each morph field
|
||||
for field, values in morphs.items():
|
||||
morphs[field] = field + "=" + ",".join(sorted(values))
|
||||
# set the same attrs on all subtok tokens so that whatever head the
|
||||
# retokenizer chooses, the final attrs are available on that token
|
||||
for token in subtok_span:
|
||||
token._.merged_orth = token.orth_
|
||||
token._.merged_lemma = " ".join(lemmas)
|
||||
token.tag_ = "_".join(tags)
|
||||
token._.merged_morph = "|".join(sorted(morphs.values()))
|
||||
token._.merged_spaceafter = (
|
||||
True if subtok_span[-1].whitespace_ else False
|
||||
)
|
||||
|
||||
with doc.retokenize() as retokenizer:
|
||||
for span in subtok_spans:
|
||||
retokenizer.merge(span)
|
||||
|
||||
return doc
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from wasabi import Printer
|
||||
|
||||
from ...gold import iob_to_biluo
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import srsly
|
||||
|
||||
from ...gold import docs_to_json
|
||||
|
|
|
@ -1,9 +1,5 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
from pathlib import Path
|
||||
from collections import Counter
|
||||
import plac
|
||||
import sys
|
||||
import srsly
|
||||
from wasabi import Printer, MESSAGES
|
||||
|
@ -22,30 +18,18 @@ BLANK_MODEL_MIN_THRESHOLD = 100
|
|||
BLANK_MODEL_THRESHOLD = 2000
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
lang=("model language", "positional", None, str),
|
||||
train_path=("location of JSON-formatted training data", "positional", None, Path),
|
||||
dev_path=("location of JSON-formatted development data", "positional", None, Path),
|
||||
base_model=("name of model to update (optional)", "option", "b", str),
|
||||
pipeline=(
|
||||
"Comma-separated names of pipeline components to train",
|
||||
"option",
|
||||
"p",
|
||||
str,
|
||||
),
|
||||
ignore_warnings=("Ignore warnings, only show stats and errors", "flag", "IW", bool),
|
||||
verbose=("Print additional information and explanations", "flag", "V", bool),
|
||||
no_format=("Don't pretty-print the results", "flag", "NF", bool),
|
||||
)
|
||||
def debug_data(
|
||||
lang,
|
||||
train_path,
|
||||
dev_path,
|
||||
base_model=None,
|
||||
pipeline="tagger,parser,ner",
|
||||
ignore_warnings=False,
|
||||
verbose=False,
|
||||
no_format=False,
|
||||
# fmt: off
|
||||
lang: ("Model language", "positional", None, str),
|
||||
train_path: ("Location of JSON-formatted training data", "positional", None, Path),
|
||||
dev_path: ("Location of JSON-formatted development data", "positional", None, Path),
|
||||
tag_map_path: ("Location of JSON-formatted tag map", "option", "tm", Path) = None,
|
||||
base_model: ("Name of model to update (optional)", "option", "b", str) = None,
|
||||
pipeline: ("Comma-separated names of pipeline components to train", "option", "p", str) = "tagger,parser,ner",
|
||||
ignore_warnings: ("Ignore warnings, only show stats and errors", "flag", "IW", bool) = False,
|
||||
verbose: ("Print additional information and explanations", "flag", "V", bool) = False,
|
||||
no_format: ("Don't pretty-print the results", "flag", "NF", bool) = False,
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Analyze, debug and validate your training and development data, get useful
|
||||
|
@ -60,6 +44,10 @@ def debug_data(
|
|||
if not dev_path.exists():
|
||||
msg.fail("Development data not found", dev_path, exits=1)
|
||||
|
||||
tag_map = {}
|
||||
if tag_map_path is not None:
|
||||
tag_map = srsly.read_json(tag_map_path)
|
||||
|
||||
# Initialize the model and pipeline
|
||||
pipeline = [p.strip() for p in pipeline.split(",")]
|
||||
if base_model:
|
||||
|
@ -67,6 +55,8 @@ def debug_data(
|
|||
else:
|
||||
lang_cls = get_lang_class(lang)
|
||||
nlp = lang_cls()
|
||||
# Update tag map with provided mapping
|
||||
nlp.vocab.morphology.tag_map.update(tag_map)
|
||||
|
||||
msg.divider("Data format validation")
|
||||
|
||||
|
@ -80,20 +70,16 @@ def debug_data(
|
|||
with msg.loading("Loading corpus..."):
|
||||
corpus = GoldCorpus(train_path, dev_path)
|
||||
try:
|
||||
train_docs = list(corpus.train_docs(nlp))
|
||||
train_docs_unpreprocessed = list(
|
||||
corpus.train_docs_without_preprocessing(nlp)
|
||||
train_dataset = list(corpus.train_dataset(nlp))
|
||||
train_dataset_unpreprocessed = list(
|
||||
corpus.train_dataset_without_preprocessing(nlp)
|
||||
)
|
||||
except ValueError as e:
|
||||
loading_train_error_message = "Training data cannot be loaded: {}".format(
|
||||
str(e)
|
||||
)
|
||||
loading_train_error_message = f"Training data cannot be loaded: {e}"
|
||||
try:
|
||||
dev_docs = list(corpus.dev_docs(nlp))
|
||||
dev_dataset = list(corpus.dev_dataset(nlp))
|
||||
except ValueError as e:
|
||||
loading_dev_error_message = "Development data cannot be loaded: {}".format(
|
||||
str(e)
|
||||
)
|
||||
loading_dev_error_message = f"Development data cannot be loaded: {e}"
|
||||
if loading_train_error_message or loading_dev_error_message:
|
||||
if loading_train_error_message:
|
||||
msg.fail(loading_train_error_message)
|
||||
|
@ -102,80 +88,68 @@ def debug_data(
|
|||
sys.exit(1)
|
||||
msg.good("Corpus is loadable")
|
||||
|
||||
# Create all gold data here to avoid iterating over the train_docs constantly
|
||||
gold_train_data = _compile_gold(train_docs, pipeline)
|
||||
gold_train_unpreprocessed_data = _compile_gold(train_docs_unpreprocessed, pipeline)
|
||||
gold_dev_data = _compile_gold(dev_docs, pipeline)
|
||||
# Create all gold data here to avoid iterating over the train_dataset constantly
|
||||
gold_train_data = _compile_gold(train_dataset, pipeline)
|
||||
gold_train_unpreprocessed_data = _compile_gold(
|
||||
train_dataset_unpreprocessed, pipeline
|
||||
)
|
||||
gold_dev_data = _compile_gold(dev_dataset, pipeline)
|
||||
|
||||
train_texts = gold_train_data["texts"]
|
||||
dev_texts = gold_dev_data["texts"]
|
||||
|
||||
msg.divider("Training stats")
|
||||
msg.text("Training pipeline: {}".format(", ".join(pipeline)))
|
||||
msg.text(f"Training pipeline: {', '.join(pipeline)}")
|
||||
for pipe in [p for p in pipeline if p not in nlp.factories]:
|
||||
msg.fail("Pipeline component '{}' not available in factories".format(pipe))
|
||||
msg.fail(f"Pipeline component '{pipe}' not available in factories")
|
||||
if base_model:
|
||||
msg.text("Starting with base model '{}'".format(base_model))
|
||||
msg.text(f"Starting with base model '{base_model}'")
|
||||
else:
|
||||
msg.text("Starting with blank model '{}'".format(lang))
|
||||
msg.text("{} training docs".format(len(train_docs)))
|
||||
msg.text("{} evaluation docs".format(len(dev_docs)))
|
||||
msg.text(f"Starting with blank model '{lang}'")
|
||||
msg.text(f"{len(train_dataset)} training docs")
|
||||
msg.text(f"{len(dev_dataset)} evaluation docs")
|
||||
|
||||
if not len(dev_docs):
|
||||
if not len(gold_dev_data):
|
||||
msg.fail("No evaluation docs")
|
||||
overlap = len(train_texts.intersection(dev_texts))
|
||||
if overlap:
|
||||
msg.warn("{} training examples also in evaluation data".format(overlap))
|
||||
msg.warn(f"{overlap} training examples also in evaluation data")
|
||||
else:
|
||||
msg.good("No overlap between training and evaluation data")
|
||||
if not base_model and len(train_docs) < BLANK_MODEL_THRESHOLD:
|
||||
text = "Low number of examples to train from a blank model ({})".format(
|
||||
len(train_docs)
|
||||
if not base_model and len(train_dataset) < BLANK_MODEL_THRESHOLD:
|
||||
text = (
|
||||
f"Low number of examples to train from a blank model ({len(train_dataset)})"
|
||||
)
|
||||
if len(train_docs) < BLANK_MODEL_MIN_THRESHOLD:
|
||||
if len(train_dataset) < BLANK_MODEL_MIN_THRESHOLD:
|
||||
msg.fail(text)
|
||||
else:
|
||||
msg.warn(text)
|
||||
msg.text(
|
||||
"It's recommended to use at least {} examples (minimum {})".format(
|
||||
BLANK_MODEL_THRESHOLD, BLANK_MODEL_MIN_THRESHOLD
|
||||
),
|
||||
f"It's recommended to use at least {BLANK_MODEL_THRESHOLD} examples "
|
||||
f"(minimum {BLANK_MODEL_MIN_THRESHOLD})",
|
||||
show=verbose,
|
||||
)
|
||||
|
||||
msg.divider("Vocab & Vectors")
|
||||
n_words = gold_train_data["n_words"]
|
||||
msg.info(
|
||||
"{} total {} in the data ({} unique)".format(
|
||||
n_words, "word" if n_words == 1 else "words", len(gold_train_data["words"])
|
||||
)
|
||||
f"{n_words} total word(s) in the data ({len(gold_train_data['words'])} unique)"
|
||||
)
|
||||
if gold_train_data["n_misaligned_words"] > 0:
|
||||
msg.warn(
|
||||
"{} misaligned tokens in the training data".format(
|
||||
gold_train_data["n_misaligned_words"]
|
||||
)
|
||||
)
|
||||
n_misaligned = gold_train_data["n_misaligned_words"]
|
||||
msg.warn(f"{n_misaligned} misaligned tokens in the training data")
|
||||
if gold_dev_data["n_misaligned_words"] > 0:
|
||||
msg.warn(
|
||||
"{} misaligned tokens in the dev data".format(
|
||||
gold_dev_data["n_misaligned_words"]
|
||||
)
|
||||
)
|
||||
n_misaligned = gold_dev_data["n_misaligned_words"]
|
||||
msg.warn(f"{n_misaligned} misaligned tokens in the dev data")
|
||||
most_common_words = gold_train_data["words"].most_common(10)
|
||||
msg.text(
|
||||
"10 most common words: {}".format(
|
||||
_format_labels(most_common_words, counts=True)
|
||||
),
|
||||
f"10 most common words: {_format_labels(most_common_words, counts=True)}",
|
||||
show=verbose,
|
||||
)
|
||||
if len(nlp.vocab.vectors):
|
||||
msg.info(
|
||||
"{} vectors ({} unique keys, {} dimensions)".format(
|
||||
len(nlp.vocab.vectors),
|
||||
nlp.vocab.vectors.n_keys,
|
||||
nlp.vocab.vectors_length,
|
||||
)
|
||||
f"{len(nlp.vocab.vectors)} vectors ({nlp.vocab.vectors.n_keys} "
|
||||
f"unique keys, {nlp.vocab.vectors_length} dimensions)"
|
||||
)
|
||||
else:
|
||||
msg.info("No word vectors present in the model")
|
||||
|
@ -183,7 +157,7 @@ def debug_data(
|
|||
if "ner" in pipeline:
|
||||
# Get all unique NER labels present in the data
|
||||
labels = set(
|
||||
label for label in gold_train_data["ner"] if label not in ("O", "-")
|
||||
label for label in gold_train_data["ner"] if label not in ("O", "-", None)
|
||||
)
|
||||
label_counts = gold_train_data["ner"]
|
||||
model_labels = _get_labels_from_model(nlp, "ner")
|
||||
|
@ -196,19 +170,10 @@ def debug_data(
|
|||
|
||||
msg.divider("Named Entity Recognition")
|
||||
msg.info(
|
||||
"{} new {}, {} existing {}".format(
|
||||
len(new_labels),
|
||||
"label" if len(new_labels) == 1 else "labels",
|
||||
len(existing_labels),
|
||||
"label" if len(existing_labels) == 1 else "labels",
|
||||
)
|
||||
f"{len(new_labels)} new label(s), {len(existing_labels)} existing label(s)"
|
||||
)
|
||||
missing_values = label_counts["-"]
|
||||
msg.text(
|
||||
"{} missing {} (tokens with '-' label)".format(
|
||||
missing_values, "value" if missing_values == 1 else "values"
|
||||
)
|
||||
)
|
||||
msg.text(f"{missing_values} missing value(s) (tokens with '-' label)")
|
||||
for label in new_labels:
|
||||
if len(label) == 0:
|
||||
msg.fail("Empty label found in new labels")
|
||||
|
@ -219,39 +184,28 @@ def debug_data(
|
|||
if label != "-"
|
||||
]
|
||||
labels_with_counts = _format_labels(labels_with_counts, counts=True)
|
||||
msg.text("New: {}".format(labels_with_counts), show=verbose)
|
||||
msg.text(f"New: {labels_with_counts}", show=verbose)
|
||||
if existing_labels:
|
||||
msg.text(
|
||||
"Existing: {}".format(_format_labels(existing_labels)), show=verbose
|
||||
)
|
||||
|
||||
msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose)
|
||||
if gold_train_data["ws_ents"]:
|
||||
msg.fail(
|
||||
"{} invalid whitespace entity span(s)".format(gold_train_data["ws_ents"])
|
||||
)
|
||||
msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans")
|
||||
has_ws_ents_error = True
|
||||
|
||||
if gold_train_data["punct_ents"]:
|
||||
msg.warn(
|
||||
"{} entity span(s) with punctuation".format(gold_train_data["punct_ents"])
|
||||
)
|
||||
msg.warn(f"{gold_train_data['punct_ents']} entity span(s) with punctuation")
|
||||
has_punct_ents_warning = True
|
||||
|
||||
for label in new_labels:
|
||||
if label_counts[label] <= NEW_LABEL_THRESHOLD:
|
||||
msg.warn(
|
||||
"Low number of examples for new label '{}' ({})".format(
|
||||
label, label_counts[label]
|
||||
)
|
||||
f"Low number of examples for new label '{label}' ({label_counts[label]})"
|
||||
)
|
||||
has_low_data_warning = True
|
||||
|
||||
with msg.loading("Analyzing label distribution..."):
|
||||
neg_docs = _get_examples_without_label(train_docs, label)
|
||||
neg_docs = _get_examples_without_label(train_dataset, label)
|
||||
if neg_docs == 0:
|
||||
msg.warn(
|
||||
"No examples for texts WITHOUT new label '{}'".format(label)
|
||||
)
|
||||
msg.warn(f"No examples for texts WITHOUT new label '{label}'")
|
||||
has_no_neg_warning = True
|
||||
|
||||
if not has_low_data_warning:
|
||||
|
@ -265,8 +219,8 @@ def debug_data(
|
|||
|
||||
if has_low_data_warning:
|
||||
msg.text(
|
||||
"To train a new entity type, your data should include at "
|
||||
"least {} instances of the new label".format(NEW_LABEL_THRESHOLD),
|
||||
f"To train a new entity type, your data should include at "
|
||||
f"least {NEW_LABEL_THRESHOLD} instances of the new label",
|
||||
show=verbose,
|
||||
)
|
||||
if has_no_neg_warning:
|
||||
|
@ -295,27 +249,21 @@ def debug_data(
|
|||
new_labels = [l for l in labels if l not in model_labels]
|
||||
existing_labels = [l for l in labels if l in model_labels]
|
||||
msg.info(
|
||||
"Text Classification: {} new label(s), {} existing label(s)".format(
|
||||
len(new_labels), len(existing_labels)
|
||||
)
|
||||
f"Text Classification: {len(new_labels)} new label(s), "
|
||||
f"{len(existing_labels)} existing label(s)"
|
||||
)
|
||||
if new_labels:
|
||||
labels_with_counts = _format_labels(
|
||||
gold_train_data["cats"].most_common(), counts=True
|
||||
)
|
||||
msg.text("New: {}".format(labels_with_counts), show=verbose)
|
||||
msg.text(f"New: {labels_with_counts}", show=verbose)
|
||||
if existing_labels:
|
||||
msg.text(
|
||||
"Existing: {}".format(_format_labels(existing_labels)), show=verbose
|
||||
)
|
||||
msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose)
|
||||
if set(gold_train_data["cats"]) != set(gold_dev_data["cats"]):
|
||||
msg.fail(
|
||||
"The train and dev labels are not the same. "
|
||||
"Train labels: {}. "
|
||||
"Dev labels: {}.".format(
|
||||
_format_labels(gold_train_data["cats"]),
|
||||
_format_labels(gold_dev_data["cats"]),
|
||||
)
|
||||
f"The train and dev labels are not the same. "
|
||||
f"Train labels: {_format_labels(gold_train_data['cats'])}. "
|
||||
f"Dev labels: {_format_labels(gold_dev_data['cats'])}."
|
||||
)
|
||||
if gold_train_data["n_cats_multilabel"] > 0:
|
||||
msg.info(
|
||||
|
@ -344,28 +292,17 @@ def debug_data(
|
|||
if "tagger" in pipeline:
|
||||
msg.divider("Part-of-speech Tagging")
|
||||
labels = [label for label in gold_train_data["tags"]]
|
||||
tag_map = nlp.Defaults.tag_map
|
||||
msg.info(
|
||||
"{} {} in data ({} {} in tag map)".format(
|
||||
len(labels),
|
||||
"label" if len(labels) == 1 else "labels",
|
||||
len(tag_map),
|
||||
"label" if len(tag_map) == 1 else "labels",
|
||||
)
|
||||
)
|
||||
tag_map = nlp.vocab.morphology.tag_map
|
||||
msg.info(f"{len(labels)} label(s) in data ({len(tag_map)} label(s) in tag map)")
|
||||
labels_with_counts = _format_labels(
|
||||
gold_train_data["tags"].most_common(), counts=True
|
||||
)
|
||||
msg.text(labels_with_counts, show=verbose)
|
||||
non_tagmap = [l for l in labels if l not in tag_map]
|
||||
if not non_tagmap:
|
||||
msg.good("All labels present in tag map for language '{}'".format(nlp.lang))
|
||||
msg.good(f"All labels present in tag map for language '{nlp.lang}'")
|
||||
for label in non_tagmap:
|
||||
msg.fail(
|
||||
"Label '{}' not found in tag map for language '{}'".format(
|
||||
label, nlp.lang
|
||||
)
|
||||
)
|
||||
msg.fail(f"Label '{label}' not found in tag map for language '{nlp.lang}'")
|
||||
|
||||
if "parser" in pipeline:
|
||||
has_low_data_warning = False
|
||||
|
@ -373,21 +310,18 @@ def debug_data(
|
|||
|
||||
# profile sentence length
|
||||
msg.info(
|
||||
"Found {} sentence{} with an average length of {:.1f} words.".format(
|
||||
gold_train_data["n_sents"],
|
||||
"s" if len(train_docs) > 1 else "",
|
||||
gold_train_data["n_words"] / gold_train_data["n_sents"],
|
||||
)
|
||||
f"Found {gold_train_data['n_sents']} sentence(s) with an average "
|
||||
f"length of {gold_train_data['n_words'] / gold_train_data['n_sents']:.1f} words."
|
||||
)
|
||||
|
||||
# check for documents with multiple sentences
|
||||
sents_per_doc = gold_train_data["n_sents"] / len(gold_train_data["texts"])
|
||||
if sents_per_doc < 1.1:
|
||||
msg.warn(
|
||||
"The training data contains {:.2f} sentences per "
|
||||
"document. When there are very few documents containing more "
|
||||
"than one sentence, the parser will not learn how to segment "
|
||||
"longer texts into sentences.".format(sents_per_doc)
|
||||
f"The training data contains {sents_per_doc:.2f} sentences per "
|
||||
f"document. When there are very few documents containing more "
|
||||
f"than one sentence, the parser will not learn how to segment "
|
||||
f"longer texts into sentences."
|
||||
)
|
||||
|
||||
# profile labels
|
||||
|
@ -398,32 +332,13 @@ def debug_data(
|
|||
labels_dev = [label for label in gold_dev_data["deps"]]
|
||||
|
||||
if gold_train_unpreprocessed_data["n_nonproj"] > 0:
|
||||
msg.info(
|
||||
"Found {} nonprojective train sentence{}".format(
|
||||
gold_train_unpreprocessed_data["n_nonproj"],
|
||||
"s" if gold_train_unpreprocessed_data["n_nonproj"] > 1 else "",
|
||||
)
|
||||
)
|
||||
n_nonproj = gold_train_unpreprocessed_data["n_nonproj"]
|
||||
msg.info(f"Found {n_nonproj} nonprojective train sentence(s)")
|
||||
if gold_dev_data["n_nonproj"] > 0:
|
||||
msg.info(
|
||||
"Found {} nonprojective dev sentence{}".format(
|
||||
gold_dev_data["n_nonproj"],
|
||||
"s" if gold_dev_data["n_nonproj"] > 1 else "",
|
||||
)
|
||||
)
|
||||
|
||||
msg.info(
|
||||
"{} {} in train data".format(
|
||||
len(labels_train_unpreprocessed),
|
||||
"label" if len(labels_train) == 1 else "labels",
|
||||
)
|
||||
)
|
||||
msg.info(
|
||||
"{} {} in projectivized train data".format(
|
||||
len(labels_train), "label" if len(labels_train) == 1 else "labels"
|
||||
)
|
||||
)
|
||||
|
||||
n_nonproj = gold_dev_data["n_nonproj"]
|
||||
msg.info(f"Found {n_nonproj} nonprojective dev sentence(s)")
|
||||
msg.info(f"{labels_train_unpreprocessed} label(s) in train data")
|
||||
msg.info(f"{len(labels_train)} label(s) in projectivized train data")
|
||||
labels_with_counts = _format_labels(
|
||||
gold_train_unpreprocessed_data["deps"].most_common(), counts=True
|
||||
)
|
||||
|
@ -433,9 +348,8 @@ def debug_data(
|
|||
for label in gold_train_unpreprocessed_data["deps"]:
|
||||
if gold_train_unpreprocessed_data["deps"][label] <= DEP_LABEL_THRESHOLD:
|
||||
msg.warn(
|
||||
"Low number of examples for label '{}' ({})".format(
|
||||
label, gold_train_unpreprocessed_data["deps"][label]
|
||||
)
|
||||
f"Low number of examples for label '{label}' "
|
||||
f"({gold_train_unpreprocessed_data['deps'][label]})"
|
||||
)
|
||||
has_low_data_warning = True
|
||||
|
||||
|
@ -444,22 +358,19 @@ def debug_data(
|
|||
for label in gold_train_data["deps"]:
|
||||
if gold_train_data["deps"][label] <= DEP_LABEL_THRESHOLD and "||" in label:
|
||||
rare_projectivized_labels.append(
|
||||
"{}: {}".format(label, str(gold_train_data["deps"][label]))
|
||||
f"{label}: {gold_train_data['deps'][label]}"
|
||||
)
|
||||
|
||||
if len(rare_projectivized_labels) > 0:
|
||||
msg.warn(
|
||||
"Low number of examples for {} label{} in the "
|
||||
"projectivized dependency trees used for training. You may "
|
||||
"want to projectivize labels such as punct before "
|
||||
"training in order to improve parser performance.".format(
|
||||
len(rare_projectivized_labels),
|
||||
"s" if len(rare_projectivized_labels) > 1 else "",
|
||||
)
|
||||
f"Low number of examples for {len(rare_projectivized_labels)} "
|
||||
"label(s) in the projectivized dependency trees used for "
|
||||
"training. You may want to projectivize labels such as punct "
|
||||
"before training in order to improve parser performance."
|
||||
)
|
||||
msg.warn(
|
||||
"Projectivized labels with low numbers of examples: "
|
||||
"{}".format("\n".join(rare_projectivized_labels)),
|
||||
f"Projectivized labels with low numbers of examples: ",
|
||||
", ".join(rare_projectivized_labels),
|
||||
show=verbose,
|
||||
)
|
||||
has_low_data_warning = True
|
||||
|
@ -467,50 +378,44 @@ def debug_data(
|
|||
# labels only in train
|
||||
if set(labels_train) - set(labels_dev):
|
||||
msg.warn(
|
||||
"The following labels were found only in the train data: "
|
||||
"{}".format(", ".join(set(labels_train) - set(labels_dev))),
|
||||
"The following labels were found only in the train data:",
|
||||
", ".join(set(labels_train) - set(labels_dev)),
|
||||
show=verbose,
|
||||
)
|
||||
|
||||
# labels only in dev
|
||||
if set(labels_dev) - set(labels_train):
|
||||
msg.warn(
|
||||
"The following labels were found only in the dev data: "
|
||||
+ ", ".join(set(labels_dev) - set(labels_train)),
|
||||
"The following labels were found only in the dev data:",
|
||||
", ".join(set(labels_dev) - set(labels_train)),
|
||||
show=verbose,
|
||||
)
|
||||
|
||||
if has_low_data_warning:
|
||||
msg.text(
|
||||
"To train a parser, your data should include at "
|
||||
"least {} instances of each label.".format(DEP_LABEL_THRESHOLD),
|
||||
f"To train a parser, your data should include at "
|
||||
f"least {DEP_LABEL_THRESHOLD} instances of each label.",
|
||||
show=verbose,
|
||||
)
|
||||
|
||||
# multiple root labels
|
||||
if len(gold_train_unpreprocessed_data["roots"]) > 1:
|
||||
msg.warn(
|
||||
"Multiple root labels ({}) ".format(
|
||||
", ".join(gold_train_unpreprocessed_data["roots"])
|
||||
)
|
||||
+ "found in training data. spaCy's parser uses a single root "
|
||||
"label ROOT so this distinction will not be available."
|
||||
f"Multiple root labels "
|
||||
f"({', '.join(gold_train_unpreprocessed_data['roots'])}) "
|
||||
f"found in training data. spaCy's parser uses a single root "
|
||||
f"label ROOT so this distinction will not be available."
|
||||
)
|
||||
|
||||
# these should not happen, but just in case
|
||||
if gold_train_data["n_nonproj"] > 0:
|
||||
msg.fail(
|
||||
"Found {} nonprojective projectivized train sentence{}".format(
|
||||
gold_train_data["n_nonproj"],
|
||||
"s" if gold_train_data["n_nonproj"] > 1 else "",
|
||||
)
|
||||
f"Found {gold_train_data['n_nonproj']} nonprojective "
|
||||
f"projectivized train sentence(s)"
|
||||
)
|
||||
if gold_train_data["n_cycles"] > 0:
|
||||
msg.fail(
|
||||
"Found {} projectivized train sentence{} with cycles".format(
|
||||
gold_train_data["n_cycles"],
|
||||
"s" if gold_train_data["n_cycles"] > 1 else "",
|
||||
)
|
||||
f"Found {gold_train_data['n_cycles']} projectivized train sentence(s) with cycles"
|
||||
)
|
||||
|
||||
msg.divider("Summary")
|
||||
|
@ -518,42 +423,34 @@ def debug_data(
|
|||
warn_counts = msg.counts[MESSAGES.WARN]
|
||||
fail_counts = msg.counts[MESSAGES.FAIL]
|
||||
if good_counts:
|
||||
msg.good(
|
||||
"{} {} passed".format(
|
||||
good_counts, "check" if good_counts == 1 else "checks"
|
||||
)
|
||||
)
|
||||
msg.good(f"{good_counts} {'check' if good_counts == 1 else 'checks'} passed")
|
||||
if warn_counts:
|
||||
msg.warn(
|
||||
"{} {}".format(warn_counts, "warning" if warn_counts == 1 else "warnings")
|
||||
)
|
||||
if fail_counts:
|
||||
msg.fail("{} {}".format(fail_counts, "error" if fail_counts == 1 else "errors"))
|
||||
|
||||
msg.warn(f"{warn_counts} {'warning' if warn_counts == 1 else 'warnings'}")
|
||||
if fail_counts:
|
||||
msg.fail(f"{fail_counts} {'error' if fail_counts == 1 else 'errors'}")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def _load_file(file_path, msg):
|
||||
file_name = file_path.parts[-1]
|
||||
if file_path.suffix == ".json":
|
||||
with msg.loading("Loading {}...".format(file_name)):
|
||||
with msg.loading(f"Loading {file_name}..."):
|
||||
data = srsly.read_json(file_path)
|
||||
msg.good("Loaded {}".format(file_name))
|
||||
msg.good(f"Loaded {file_name}")
|
||||
return data
|
||||
elif file_path.suffix == ".jsonl":
|
||||
with msg.loading("Loading {}...".format(file_name)):
|
||||
with msg.loading(f"Loading {file_name}..."):
|
||||
data = srsly.read_jsonl(file_path)
|
||||
msg.good("Loaded {}".format(file_name))
|
||||
msg.good(f"Loaded {file_name}")
|
||||
return data
|
||||
msg.fail(
|
||||
"Can't load file extension {}".format(file_path.suffix),
|
||||
f"Can't load file extension {file_path.suffix}",
|
||||
"Expected .json or .jsonl",
|
||||
exits=1,
|
||||
)
|
||||
|
||||
|
||||
def _compile_gold(train_docs, pipeline):
|
||||
def _compile_gold(examples, pipeline):
|
||||
data = {
|
||||
"ner": Counter(),
|
||||
"cats": Counter(),
|
||||
|
@ -571,7 +468,9 @@ def _compile_gold(train_docs, pipeline):
|
|||
"n_cats_multilabel": 0,
|
||||
"texts": set(),
|
||||
}
|
||||
for doc, gold in train_docs:
|
||||
for example in examples:
|
||||
gold = example.gold
|
||||
doc = example.doc
|
||||
valid_words = [x for x in gold.words if x is not None]
|
||||
data["words"].update(valid_words)
|
||||
data["n_words"] += len(valid_words)
|
||||
|
@ -584,7 +483,13 @@ def _compile_gold(train_docs, pipeline):
|
|||
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
|
||||
# "Illegal" whitespace entity
|
||||
data["ws_ents"] += 1
|
||||
if label.startswith(("B-", "U-", "L-")) and doc[i].text in [".", "'", "!", "?", ","]:
|
||||
if label.startswith(("B-", "U-", "L-")) and doc[i].text in [
|
||||
".",
|
||||
"'",
|
||||
"!",
|
||||
"?",
|
||||
",",
|
||||
]:
|
||||
# punctuation entity: could be replaced by whitespace when training with noise,
|
||||
# so add a warning to alert the user to this unexpected side effect.
|
||||
data["punct_ents"] += 1
|
||||
|
@ -614,14 +519,18 @@ def _compile_gold(train_docs, pipeline):
|
|||
|
||||
def _format_labels(labels, counts=False):
|
||||
if counts:
|
||||
return ", ".join(["'{}' ({})".format(l, c) for l, c in labels])
|
||||
return ", ".join(["'{}'".format(l) for l in labels])
|
||||
return ", ".join([f"'{l}' ({c})" for l, c in labels])
|
||||
return ", ".join([f"'{l}'" for l in labels])
|
||||
|
||||
|
||||
def _get_examples_without_label(data, label):
|
||||
count = 0
|
||||
for doc, gold in data:
|
||||
labels = [label.split("-")[1] for label in gold.ner if label not in ("O", "-")]
|
||||
for ex in data:
|
||||
labels = [
|
||||
label.split("-")[1]
|
||||
for label in ex.gold.ner
|
||||
if label not in ("O", "-", None)
|
||||
]
|
||||
if label not in labels:
|
||||
count += 1
|
||||
return count
|
||||
|
|
|
@ -1,28 +1,21 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
import requests
|
||||
import os
|
||||
import subprocess
|
||||
import sys
|
||||
from wasabi import msg
|
||||
|
||||
from .link import link
|
||||
from ..util import get_package_path
|
||||
from .. import about
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model to download (shortcut or name)", "positional", None, str),
|
||||
direct=("Force direct download of name + version", "flag", "d", bool),
|
||||
pip_args=("Additional arguments to be passed to `pip install` on model install"),
|
||||
)
|
||||
def download(model, direct=False, *pip_args):
|
||||
def download(
|
||||
model: ("Model to download (shortcut or name)", "positional", None, str),
|
||||
direct: ("Force direct download of name + version", "flag", "d", bool) = False,
|
||||
*pip_args: ("Additional arguments to be passed to `pip install` on model install"),
|
||||
):
|
||||
"""
|
||||
Download compatible model from default download path using pip. Model
|
||||
can be shortcut, model name or, if --direct flag is set, full model name
|
||||
with version. For direct downloads, the compatibility check will be skipped.
|
||||
Download compatible model from default download path using pip. If --direct
|
||||
flag is set, the command expects the full model name with version.
|
||||
For direct downloads, the compatibility check will be skipped.
|
||||
"""
|
||||
if not require_package("spacy") and "--no-deps" not in pip_args:
|
||||
msg.warn(
|
||||
|
@ -50,30 +43,8 @@ def download(model, direct=False, *pip_args):
|
|||
sys.exit(dl)
|
||||
msg.good(
|
||||
"Download and installation successful",
|
||||
"You can now load the model via spacy.load('{}')".format(model_name),
|
||||
f"You can now load the model via spacy.load('{model_name}')",
|
||||
)
|
||||
# Only create symlink if the model is installed via a shortcut like 'en'.
|
||||
# There's no real advantage over an additional symlink for en_core_web_sm
|
||||
# and if anything, it's more error prone and causes more confusion.
|
||||
if model in shortcuts:
|
||||
try:
|
||||
# Get package path here because link uses
|
||||
# pip.get_installed_distributions() to check if model is a
|
||||
# package, which fails if model was just installed via
|
||||
# subprocess
|
||||
package_path = get_package_path(model_name)
|
||||
link(model_name, model, force=True, model_path=package_path)
|
||||
except: # noqa: E722
|
||||
# Dirty, but since spacy.download and the auto-linking is
|
||||
# mostly a convenience wrapper, it's best to show a success
|
||||
# message and loading instructions, even if linking fails.
|
||||
msg.warn(
|
||||
"Download successful but linking failed",
|
||||
"Creating a shortcut link for '{}' didn't work (maybe you "
|
||||
"don't have admin permissions?), but you can still load "
|
||||
"the model via its full package name: "
|
||||
"nlp = spacy.load('{}')".format(model, model_name),
|
||||
)
|
||||
# If a model is downloaded and then loaded within the same process, our
|
||||
# is_package check currently fails, because pkg_resources.working_set
|
||||
# is not refreshed automatically (see #3923). We're trying to work
|
||||
|
@ -95,11 +66,11 @@ def get_json(url, desc):
|
|||
r = requests.get(url)
|
||||
if r.status_code != 200:
|
||||
msg.fail(
|
||||
"Server error ({})".format(r.status_code),
|
||||
"Couldn't fetch {}. Please find a model for your spaCy "
|
||||
"installation (v{}), and download it manually. For more "
|
||||
"details, see the documentation: "
|
||||
"https://spacy.io/usage/models".format(desc, about.__version__),
|
||||
f"Server error ({r.status_code})",
|
||||
f"Couldn't fetch {desc}. Please find a model for your spaCy "
|
||||
f"installation (v{about.__version__}), and download it manually. "
|
||||
f"For more details, see the documentation: "
|
||||
f"https://spacy.io/usage/models",
|
||||
exits=1,
|
||||
)
|
||||
return r.json()
|
||||
|
@ -111,7 +82,7 @@ def get_compatibility():
|
|||
comp_table = get_json(about.__compatibility__, "compatibility table")
|
||||
comp = comp_table["spacy"]
|
||||
if version not in comp:
|
||||
msg.fail("No compatible models found for v{} of spaCy".format(version), exits=1)
|
||||
msg.fail(f"No compatible models found for v{version} of spaCy", exits=1)
|
||||
return comp[version]
|
||||
|
||||
|
||||
|
@ -119,8 +90,7 @@ def get_version(model, comp):
|
|||
model = model.rsplit(".dev", 1)[0]
|
||||
if model not in comp:
|
||||
msg.fail(
|
||||
"No compatible model found for '{}' "
|
||||
"(spaCy v{}).".format(model, about.__version__),
|
||||
f"No compatible model found for '{model}' (spaCy v{about.__version__})",
|
||||
exits=1,
|
||||
)
|
||||
return comp[model][0]
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals, division, print_function
|
||||
|
||||
import plac
|
||||
from timeit import default_timer as timer
|
||||
from wasabi import msg
|
||||
|
||||
|
@ -10,23 +6,16 @@ from .. import util
|
|||
from .. import displacy
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model name or path", "positional", None, str),
|
||||
data_path=("Location of JSON-formatted evaluation data", "positional", None, str),
|
||||
gold_preproc=("Use gold preprocessing", "flag", "G", bool),
|
||||
gpu_id=("Use GPU", "option", "g", int),
|
||||
displacy_path=("Directory to output rendered parses as HTML", "option", "dp", str),
|
||||
displacy_limit=("Limit of parses to render as HTML", "option", "dl", int),
|
||||
return_scores=("Return dict containing model scores", "flag", "R", bool),
|
||||
)
|
||||
def evaluate(
|
||||
model,
|
||||
data_path,
|
||||
gpu_id=-1,
|
||||
gold_preproc=False,
|
||||
displacy_path=None,
|
||||
displacy_limit=25,
|
||||
return_scores=False,
|
||||
# fmt: off
|
||||
model: ("Model name or path", "positional", None, str),
|
||||
data_path: ("Location of JSON-formatted evaluation data", "positional", None, str),
|
||||
gpu_id: ("Use GPU", "option", "g", int) = -1,
|
||||
gold_preproc: ("Use gold preprocessing", "flag", "G", bool) = False,
|
||||
displacy_path: ("Directory to output rendered parses as HTML", "option", "dp", str) = None,
|
||||
displacy_limit: ("Limit of parses to render as HTML", "option", "dl", int) = 25,
|
||||
return_scores: ("Return dict containing model scores", "flag", "R", bool) = False,
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Evaluate a model. To render a sample of parses in a HTML file, set an
|
||||
|
@ -44,28 +33,31 @@ def evaluate(
|
|||
msg.fail("Visualization output directory not found", displacy_path, exits=1)
|
||||
corpus = GoldCorpus(data_path, data_path)
|
||||
nlp = util.load_model(model)
|
||||
dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
|
||||
dev_dataset = list(corpus.dev_dataset(nlp, gold_preproc=gold_preproc))
|
||||
begin = timer()
|
||||
scorer = nlp.evaluate(dev_docs, verbose=False)
|
||||
scorer = nlp.evaluate(dev_dataset, verbose=False)
|
||||
end = timer()
|
||||
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
|
||||
nwords = sum(len(ex.doc) for ex in dev_dataset)
|
||||
results = {
|
||||
"Time": "%.2f s" % (end - begin),
|
||||
"Time": f"{end - begin:.2f} s",
|
||||
"Words": nwords,
|
||||
"Words/s": "%.0f" % (nwords / (end - begin)),
|
||||
"TOK": "%.2f" % scorer.token_acc,
|
||||
"POS": "%.2f" % scorer.tags_acc,
|
||||
"UAS": "%.2f" % scorer.uas,
|
||||
"LAS": "%.2f" % scorer.las,
|
||||
"NER P": "%.2f" % scorer.ents_p,
|
||||
"NER R": "%.2f" % scorer.ents_r,
|
||||
"NER F": "%.2f" % scorer.ents_f,
|
||||
"Textcat": "%.2f" % scorer.textcat_score,
|
||||
"Words/s": f"{nwords / (end - begin):.0f}",
|
||||
"TOK": f"{scorer.token_acc:.2f}",
|
||||
"POS": f"{scorer.tags_acc:.2f}",
|
||||
"UAS": f"{scorer.uas:.2f}",
|
||||
"LAS": f"{scorer.las:.2f}",
|
||||
"NER P": f"{scorer.ents_p:.2f}",
|
||||
"NER R": f"{scorer.ents_r:.2f}",
|
||||
"NER F": f"{scorer.ents_f:.2f}",
|
||||
"Textcat": f"{scorer.textcat_score:.2f}",
|
||||
"Sent P": f"{scorer.sent_p:.2f}",
|
||||
"Sent R": f"{scorer.sent_r:.2f}",
|
||||
"Sent F": f"{scorer.sent_f:.2f}",
|
||||
}
|
||||
msg.table(results, title="Results")
|
||||
|
||||
if displacy_path:
|
||||
docs, golds = zip(*dev_docs)
|
||||
docs = [ex.doc for ex in dev_dataset]
|
||||
render_deps = "parser" in nlp.meta.get("pipeline", [])
|
||||
render_ents = "ner" in nlp.meta.get("pipeline", [])
|
||||
render_parses(
|
||||
|
@ -76,7 +68,7 @@ def evaluate(
|
|||
deps=render_deps,
|
||||
ents=render_ents,
|
||||
)
|
||||
msg.good("Generated {} parses as HTML".format(displacy_limit), displacy_path)
|
||||
msg.good(f"Generated {displacy_limit} parses as HTML", displacy_path)
|
||||
if return_scores:
|
||||
return scorer.scores
|
||||
|
||||
|
|
|
@ -1,44 +1,39 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
import platform
|
||||
from pathlib import Path
|
||||
from wasabi import msg
|
||||
import srsly
|
||||
|
||||
from ..compat import path2str, basestring_, unicode_
|
||||
from .validate import get_model_pkgs
|
||||
from .. import util
|
||||
from .. import about
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Optional shortcut link of model", "positional", None, str),
|
||||
markdown=("Generate Markdown for GitHub issues", "flag", "md", str),
|
||||
silent=("Don't print anything (just return)", "flag", "s"),
|
||||
)
|
||||
def info(model=None, markdown=False, silent=False):
|
||||
def info(
|
||||
model: ("Optional model name", "positional", None, str) = None,
|
||||
markdown: ("Generate Markdown for GitHub issues", "flag", "md", str) = False,
|
||||
silent: ("Don't print anything (just return)", "flag", "s") = False,
|
||||
):
|
||||
"""
|
||||
Print info about spaCy installation. If a model shortcut link is
|
||||
speficied as an argument, print model information. Flag --markdown
|
||||
prints details in Markdown for easy copy-pasting to GitHub issues.
|
||||
Print info about spaCy installation. If a model is speficied as an argument,
|
||||
print model information. Flag --markdown prints details in Markdown for easy
|
||||
copy-pasting to GitHub issues.
|
||||
"""
|
||||
if model:
|
||||
if util.is_package(model):
|
||||
model_path = util.get_package_path(model)
|
||||
else:
|
||||
model_path = util.get_data_path() / model
|
||||
model_path = model
|
||||
meta_path = model_path / "meta.json"
|
||||
if not meta_path.is_file():
|
||||
msg.fail("Can't find model meta.json", meta_path, exits=1)
|
||||
meta = srsly.read_json(meta_path)
|
||||
if model_path.resolve() != model_path:
|
||||
meta["link"] = path2str(model_path)
|
||||
meta["source"] = path2str(model_path.resolve())
|
||||
meta["link"] = str(model_path)
|
||||
meta["source"] = str(model_path.resolve())
|
||||
else:
|
||||
meta["source"] = path2str(model_path)
|
||||
meta["source"] = str(model_path)
|
||||
if not silent:
|
||||
title = "Info about model '{}'".format(model)
|
||||
title = f"Info about model '{model}'"
|
||||
model_meta = {
|
||||
k: v for k, v in meta.items() if k not in ("accuracy", "speed")
|
||||
}
|
||||
|
@ -47,12 +42,13 @@ def info(model=None, markdown=False, silent=False):
|
|||
else:
|
||||
msg.table(model_meta, title=title)
|
||||
return meta
|
||||
all_models, _ = get_model_pkgs()
|
||||
data = {
|
||||
"spaCy version": about.__version__,
|
||||
"Location": path2str(Path(__file__).parent.parent),
|
||||
"Location": str(Path(__file__).parent.parent),
|
||||
"Platform": platform.platform(),
|
||||
"Python version": platform.python_version(),
|
||||
"Models": list_models(),
|
||||
"Models": ", ".join(model["name"] for model in all_models.values()),
|
||||
}
|
||||
if not silent:
|
||||
title = "Info about spaCy"
|
||||
|
@ -63,19 +59,6 @@ def info(model=None, markdown=False, silent=False):
|
|||
return data
|
||||
|
||||
|
||||
def list_models():
|
||||
def exclude_dir(dir_name):
|
||||
# exclude common cache directories and hidden directories
|
||||
exclude = ("cache", "pycache", "__pycache__")
|
||||
return dir_name in exclude or dir_name.startswith(".")
|
||||
|
||||
data_path = util.get_data_path()
|
||||
if data_path:
|
||||
models = [f.parts[-1] for f in data_path.iterdir() if f.is_dir()]
|
||||
return ", ".join([m for m in models if not exclude_dir(m)])
|
||||
return "-"
|
||||
|
||||
|
||||
def print_markdown(data, title=None):
|
||||
"""Print data in GitHub-flavoured Markdown format for issues etc.
|
||||
|
||||
|
@ -84,9 +67,9 @@ def print_markdown(data, title=None):
|
|||
"""
|
||||
markdown = []
|
||||
for key, value in data.items():
|
||||
if isinstance(value, basestring_) and Path(value).exists():
|
||||
if isinstance(value, str) and Path(value).exists():
|
||||
continue
|
||||
markdown.append("* **{}:** {}".format(key, unicode_(value)))
|
||||
markdown.append(f"* **{key}:** {value}")
|
||||
if title:
|
||||
print("\n## {}".format(title))
|
||||
print(f"\n## {title}")
|
||||
print("\n{}\n".format("\n".join(markdown)))
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
import math
|
||||
from tqdm import tqdm
|
||||
import numpy
|
||||
|
@ -27,32 +23,18 @@ except ImportError:
|
|||
DEFAULT_OOV_PROB = -20
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
lang=("Model language", "positional", None, str),
|
||||
output_dir=("Model output directory", "positional", None, Path),
|
||||
freqs_loc=("Location of words frequencies file", "option", "f", Path),
|
||||
jsonl_loc=("Location of JSONL-formatted attributes file", "option", "j", Path),
|
||||
clusters_loc=("Optional location of brown clusters data", "option", "c", str),
|
||||
vectors_loc=("Optional vectors file in Word2Vec format", "option", "v", str),
|
||||
prune_vectors=("Optional number of vectors to prune to", "option", "V", int),
|
||||
vectors_name=(
|
||||
"Optional name for the word vectors, e.g. en_core_web_lg.vectors",
|
||||
"option",
|
||||
"vn",
|
||||
str,
|
||||
),
|
||||
model_name=("Optional name for the model meta", "option", "mn", str),
|
||||
)
|
||||
def init_model(
|
||||
lang,
|
||||
output_dir,
|
||||
freqs_loc=None,
|
||||
clusters_loc=None,
|
||||
jsonl_loc=None,
|
||||
vectors_loc=None,
|
||||
prune_vectors=-1,
|
||||
vectors_name=None,
|
||||
model_name=None,
|
||||
# fmt: off
|
||||
lang: ("Model language", "positional", None, str),
|
||||
output_dir: ("Model output directory", "positional", None, Path),
|
||||
freqs_loc: ("Location of words frequencies file", "option", "f", Path) = None,
|
||||
clusters_loc: ("Optional location of brown clusters data", "option", "c", str) = None,
|
||||
jsonl_loc: ("Location of JSONL-formatted attributes file", "option", "j", Path) = None,
|
||||
vectors_loc: ("Optional vectors file in Word2Vec format", "option", "v", str) = None,
|
||||
prune_vectors: ("Optional number of vectors to prune to", "option", "V", int) = -1,
|
||||
vectors_name: ("Optional name for the word vectors, e.g. en_core_web_lg.vectors", "option", "vn", str) = None,
|
||||
model_name: ("Optional name for the model meta", "option", "mn", str) = None,
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Create a new model from raw data, like word frequencies, Brown clusters
|
||||
|
@ -91,8 +73,7 @@ def init_model(
|
|||
vec_added = len(nlp.vocab.vectors)
|
||||
lex_added = len(nlp.vocab)
|
||||
msg.good(
|
||||
"Sucessfully compiled vocab",
|
||||
"{} entries, {} vectors".format(lex_added, vec_added),
|
||||
"Sucessfully compiled vocab", f"{lex_added} entries, {vec_added} vectors",
|
||||
)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
|
@ -177,9 +158,9 @@ def add_vectors(nlp, vectors_loc, prune_vectors, name=None):
|
|||
nlp.vocab.vectors.add(lex.orth, row=lex.rank)
|
||||
else:
|
||||
if vectors_loc:
|
||||
with msg.loading("Reading vectors from {}".format(vectors_loc)):
|
||||
with msg.loading(f"Reading vectors from {vectors_loc}"):
|
||||
vectors_data, vector_keys = read_vectors(vectors_loc)
|
||||
msg.good("Loaded vectors from {}".format(vectors_loc))
|
||||
msg.good(f"Loaded vectors from {vectors_loc}")
|
||||
else:
|
||||
vectors_data, vector_keys = (None, None)
|
||||
if vector_keys is not None:
|
||||
|
@ -190,7 +171,7 @@ def add_vectors(nlp, vectors_loc, prune_vectors, name=None):
|
|||
if vectors_data is not None:
|
||||
nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys)
|
||||
if name is None:
|
||||
nlp.vocab.vectors.name = "%s_model.vectors" % nlp.meta["lang"]
|
||||
nlp.vocab.vectors.name = f"{nlp.meta['lang']}_model.vectors"
|
||||
else:
|
||||
nlp.vocab.vectors.name = name
|
||||
nlp.meta["vectors"]["name"] = nlp.vocab.vectors.name
|
||||
|
@ -236,7 +217,7 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50):
|
|||
word = literal_eval(key)
|
||||
except SyntaxError:
|
||||
# Take odd strings literally.
|
||||
word = literal_eval("'%s'" % key)
|
||||
word = literal_eval(f"'{key}'")
|
||||
smooth_count = counts.smoother(int(freq))
|
||||
probs[word] = math.log(smooth_count) - log_total
|
||||
oov_prob = math.log(counts.smoother(0)) - log_total
|
||||
|
|
|
@ -1,77 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
from pathlib import Path
|
||||
from wasabi import msg
|
||||
|
||||
from ..compat import symlink_to, path2str
|
||||
from .. import util
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
origin=("package name or local path to model", "positional", None, str),
|
||||
link_name=("name of shortuct link to create", "positional", None, str),
|
||||
force=("force overwriting of existing link", "flag", "f", bool),
|
||||
)
|
||||
def link(origin, link_name, force=False, model_path=None):
|
||||
"""
|
||||
Create a symlink for models within the spacy/data directory. Accepts
|
||||
either the name of a pip package, or the local path to the model data
|
||||
directory. Linking models allows loading them via spacy.load(link_name).
|
||||
"""
|
||||
if util.is_package(origin):
|
||||
model_path = util.get_package_path(origin)
|
||||
else:
|
||||
model_path = Path(origin) if model_path is None else Path(model_path)
|
||||
if not model_path.exists():
|
||||
msg.fail(
|
||||
"Can't locate model data",
|
||||
"The data should be located in {}".format(path2str(model_path)),
|
||||
exits=1,
|
||||
)
|
||||
data_path = util.get_data_path()
|
||||
if not data_path or not data_path.exists():
|
||||
spacy_loc = Path(__file__).parent.parent
|
||||
msg.fail(
|
||||
"Can't find the spaCy data path to create model symlink",
|
||||
"Make sure a directory `/data` exists within your spaCy "
|
||||
"installation and try again. The data directory should be located "
|
||||
"here:".format(path=spacy_loc),
|
||||
exits=1,
|
||||
)
|
||||
link_path = util.get_data_path() / link_name
|
||||
if link_path.is_symlink() and not force:
|
||||
msg.fail(
|
||||
"Link '{}' already exists".format(link_name),
|
||||
"To overwrite an existing link, use the --force flag",
|
||||
exits=1,
|
||||
)
|
||||
elif link_path.is_symlink(): # does a symlink exist?
|
||||
# NB: It's important to check for is_symlink here and not for exists,
|
||||
# because invalid/outdated symlinks would return False otherwise.
|
||||
link_path.unlink()
|
||||
elif link_path.exists(): # does it exist otherwise?
|
||||
# NB: Check this last because valid symlinks also "exist".
|
||||
msg.fail(
|
||||
"Can't overwrite symlink '{}'".format(link_name),
|
||||
"This can happen if your data directory contains a directory or "
|
||||
"file of the same name.",
|
||||
exits=1,
|
||||
)
|
||||
details = "%s --> %s" % (path2str(model_path), path2str(link_path))
|
||||
try:
|
||||
symlink_to(link_path, model_path)
|
||||
except: # noqa: E722
|
||||
# This is quite dirty, but just making sure other errors are caught.
|
||||
msg.fail(
|
||||
"Couldn't link model to '{}'".format(link_name),
|
||||
"Creating a symlink in spacy/data failed. Make sure you have the "
|
||||
"required permissions and try re-running the command as admin, or "
|
||||
"use a virtualenv. You can still import the model as a module and "
|
||||
"call its load() method, or create the symlink manually.",
|
||||
)
|
||||
msg.text(details)
|
||||
raise
|
||||
msg.good("Linking successful", details)
|
||||
msg.text("You can now load the model via spacy.load('{}')".format(link_name))
|
|
@ -1,25 +1,21 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
import shutil
|
||||
from pathlib import Path
|
||||
from wasabi import msg, get_raw_input
|
||||
import srsly
|
||||
|
||||
from ..compat import path2str
|
||||
from .. import util
|
||||
from .. import about
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
input_dir=("Directory with model data", "positional", None, str),
|
||||
output_dir=("Output parent directory", "positional", None, str),
|
||||
meta_path=("Path to meta.json", "option", "m", str),
|
||||
create_meta=("Create meta.json, even if one exists", "flag", "c", bool),
|
||||
force=("Force overwriting existing model in output directory", "flag", "f", bool),
|
||||
)
|
||||
def package(input_dir, output_dir, meta_path=None, create_meta=False, force=False):
|
||||
def package(
|
||||
# fmt: off
|
||||
input_dir: ("Directory with model data", "positional", None, str),
|
||||
output_dir: ("Output parent directory", "positional", None, str),
|
||||
meta_path: ("Path to meta.json", "option", "m", str) = None,
|
||||
create_meta: ("Create meta.json, even if one exists", "flag", "c", bool) = False,
|
||||
force: ("Force overwriting existing model in output directory", "flag", "f", bool) = False,
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Generate Python package for model data, including meta and required
|
||||
installation files. A new directory will be created in the specified
|
||||
|
@ -47,7 +43,7 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False, force=Fals
|
|||
for key in ("lang", "name", "version"):
|
||||
if key not in meta or meta[key] == "":
|
||||
msg.fail(
|
||||
"No '{}' setting found in meta.json".format(key),
|
||||
f"No '{key}' setting found in meta.json",
|
||||
"This setting is required to build your package.",
|
||||
exits=1,
|
||||
)
|
||||
|
@ -58,22 +54,21 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False, force=Fals
|
|||
|
||||
if package_path.exists():
|
||||
if force:
|
||||
shutil.rmtree(path2str(package_path))
|
||||
shutil.rmtree(str(package_path))
|
||||
else:
|
||||
msg.fail(
|
||||
"Package directory already exists",
|
||||
"Please delete the directory and try again, or use the "
|
||||
"`--force` flag to overwrite existing "
|
||||
"directories.".format(path=path2str(package_path)),
|
||||
"`--force` flag to overwrite existing directories.",
|
||||
exits=1,
|
||||
)
|
||||
Path.mkdir(package_path, parents=True)
|
||||
shutil.copytree(path2str(input_path), path2str(package_path / model_name_v))
|
||||
shutil.copytree(str(input_path), str(package_path / model_name_v))
|
||||
create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2))
|
||||
create_file(main_path / "setup.py", TEMPLATE_SETUP)
|
||||
create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST)
|
||||
create_file(package_path / "__init__.py", TEMPLATE_INIT)
|
||||
msg.good("Successfully created package '{}'".format(model_name_v), main_path)
|
||||
msg.good(f"Successfully created package '{model_name_v}'", main_path)
|
||||
msg.text("To build the package, run `python setup.py sdist` in this directory.")
|
||||
|
||||
|
||||
|
@ -88,7 +83,7 @@ def generate_meta(model_path, existing_meta, msg):
|
|||
("lang", "Model language", meta.get("lang", "en")),
|
||||
("name", "Model name", meta.get("name", "model")),
|
||||
("version", "Model version", meta.get("version", "0.0.0")),
|
||||
("spacy_version", "Required spaCy version", ">=%s,<3.0.0" % about.__version__),
|
||||
("spacy_version", "Required spaCy version", f">={about.__version__},<3.0.0"),
|
||||
("description", "Model description", meta.get("description", False)),
|
||||
("author", "Author", meta.get("author", False)),
|
||||
("email", "Author email", meta.get("email", False)),
|
||||
|
@ -118,9 +113,6 @@ def generate_meta(model_path, existing_meta, msg):
|
|||
|
||||
TEMPLATE_SETUP = """
|
||||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import io
|
||||
import json
|
||||
from os import path, walk
|
||||
|
@ -190,9 +182,6 @@ include meta.json
|
|||
|
||||
|
||||
TEMPLATE_INIT = """
|
||||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from pathlib import Path
|
||||
from spacy.util import load_model_from_init_py, get_model_meta
|
||||
|
||||
|
|
|
@ -1,107 +1,50 @@
|
|||
# coding: utf8
|
||||
from __future__ import print_function, unicode_literals
|
||||
|
||||
import plac
|
||||
import random
|
||||
import numpy
|
||||
import time
|
||||
import re
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
from thinc.v2v import Affine, Maxout
|
||||
from thinc.misc import LayerNorm as LN
|
||||
from thinc.neural.util import prefer_gpu
|
||||
from thinc.api import Linear, Maxout, chain, list2array, prefer_gpu
|
||||
from thinc.api import CosineDistance, L2Distance
|
||||
from wasabi import msg
|
||||
import srsly
|
||||
|
||||
from ..gold import Example
|
||||
from ..errors import Errors
|
||||
from ..tokens import Doc
|
||||
from ..attrs import ID, HEAD
|
||||
from .._ml import Tok2Vec, flatten, chain, create_default_optimizer
|
||||
from .._ml import masked_language_model, get_cossim_loss
|
||||
from ..ml.component_models import Tok2Vec
|
||||
from ..ml.component_models import masked_language_model
|
||||
from .. import util
|
||||
from ..util import create_default_optimizer
|
||||
from .train import _load_pretrained_tok2vec
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
texts_loc=(
|
||||
"Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the "
|
||||
"key 'tokens'",
|
||||
"positional",
|
||||
None,
|
||||
str,
|
||||
),
|
||||
vectors_model=("Name or path to spaCy model with vectors to learn from"),
|
||||
output_dir=("Directory to write models to on each epoch", "positional", None, str),
|
||||
width=("Width of CNN layers", "option", "cw", int),
|
||||
depth=("Depth of CNN layers", "option", "cd", int),
|
||||
cnn_window=("Window size for CNN layers", "option", "cW", int),
|
||||
cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int),
|
||||
use_chars=("Whether to use character-based embedding", "flag", "chr", bool),
|
||||
sa_depth=("Depth of self-attention layers", "option", "sa", int),
|
||||
bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int),
|
||||
embed_rows=("Number of embedding rows", "option", "er", int),
|
||||
loss_func=(
|
||||
"Loss function to use for the objective. Either 'L2' or 'cosine'",
|
||||
"option",
|
||||
"L",
|
||||
str,
|
||||
),
|
||||
use_vectors=("Whether to use the static vectors as input features", "flag", "uv"),
|
||||
dropout=("Dropout rate", "option", "d", float),
|
||||
batch_size=("Number of words per training batch", "option", "bs", int),
|
||||
max_length=(
|
||||
"Max words per example. Longer examples are discarded",
|
||||
"option",
|
||||
"xw",
|
||||
int,
|
||||
),
|
||||
min_length=(
|
||||
"Min words per example. Shorter examples are discarded",
|
||||
"option",
|
||||
"nw",
|
||||
int,
|
||||
),
|
||||
seed=("Seed for random number generators", "option", "s", int),
|
||||
n_iter=("Number of iterations to pretrain", "option", "i", int),
|
||||
n_save_every=("Save model every X batches.", "option", "se", int),
|
||||
init_tok2vec=(
|
||||
"Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.",
|
||||
"option",
|
||||
"t2v",
|
||||
Path,
|
||||
),
|
||||
epoch_start=(
|
||||
"The epoch to start counting at. Only relevant when using '--init-tok2vec' and the given weight file has been "
|
||||
"renamed. Prevents unintended overwriting of existing weight files.",
|
||||
"option",
|
||||
"es",
|
||||
int,
|
||||
),
|
||||
)
|
||||
def pretrain(
|
||||
texts_loc,
|
||||
vectors_model,
|
||||
output_dir,
|
||||
width=96,
|
||||
depth=4,
|
||||
bilstm_depth=0,
|
||||
cnn_pieces=3,
|
||||
sa_depth=0,
|
||||
use_chars=False,
|
||||
cnn_window=1,
|
||||
embed_rows=2000,
|
||||
loss_func="cosine",
|
||||
use_vectors=False,
|
||||
dropout=0.2,
|
||||
n_iter=1000,
|
||||
batch_size=3000,
|
||||
max_length=500,
|
||||
min_length=5,
|
||||
seed=0,
|
||||
n_save_every=None,
|
||||
init_tok2vec=None,
|
||||
epoch_start=None,
|
||||
# fmt: off
|
||||
texts_loc: ("Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the key 'tokens'", "positional", None, str),
|
||||
vectors_model: ("Name or path to spaCy model with vectors to learn from", "positional", None, str),
|
||||
output_dir: ("Directory to write models to on each epoch", "positional", None, str),
|
||||
width: ("Width of CNN layers", "option", "cw", int) = 96,
|
||||
conv_depth: ("Depth of CNN layers", "option", "cd", int) = 4,
|
||||
bilstm_depth: ("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int) = 0,
|
||||
cnn_pieces: ("Maxout size for CNN layers. 1 for Mish", "option", "cP", int) = 3,
|
||||
sa_depth: ("Depth of self-attention layers", "option", "sa", int) = 0,
|
||||
use_chars: ("Whether to use character-based embedding", "flag", "chr", bool) = False,
|
||||
cnn_window: ("Window size for CNN layers", "option", "cW", int) = 1,
|
||||
embed_rows: ("Number of embedding rows", "option", "er", int) = 2000,
|
||||
loss_func: ("Loss function to use for the objective. Either 'L2' or 'cosine'", "option", "L", str) = "cosine",
|
||||
use_vectors: ("Whether to use the static vectors as input features", "flag", "uv") = False,
|
||||
dropout: ("Dropout rate", "option", "d", float) = 0.2,
|
||||
n_iter: ("Number of iterations to pretrain", "option", "i", int) = 1000,
|
||||
batch_size: ("Number of words per training batch", "option", "bs", int) = 3000,
|
||||
max_length: ("Max words per example. Longer examples are discarded", "option", "xw", int) = 500,
|
||||
min_length: ("Min words per example. Shorter examples are discarded", "option", "nw", int) = 5,
|
||||
seed: ("Seed for random number generators", "option", "s", int) = 0,
|
||||
n_save_every: ("Save model every X batches.", "option", "se", int) = None,
|
||||
init_tok2vec: ("Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.", "option", "t2v", Path) = None,
|
||||
epoch_start: ("The epoch to start counting at. Only relevant when using '--init-tok2vec' and the given weight file has been renamed. Prevents unintended overwriting of existing weight files.", "option", "es", int) = None,
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
|
||||
|
@ -132,9 +75,15 @@ def pretrain(
|
|||
msg.info("Using GPU" if has_gpu else "Not using GPU")
|
||||
|
||||
output_dir = Path(output_dir)
|
||||
if output_dir.exists() and [p for p in output_dir.iterdir()]:
|
||||
msg.warn(
|
||||
"Output directory is not empty",
|
||||
"It is better to use an empty directory or refer to a new output path, "
|
||||
"then the new directory will be created for you.",
|
||||
)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
msg.good("Created output directory")
|
||||
msg.good(f"Created output directory: {output_dir}")
|
||||
srsly.write_json(output_dir / "config.json", config)
|
||||
msg.good("Saved settings to config.json")
|
||||
|
||||
|
@ -153,16 +102,16 @@ def pretrain(
|
|||
msg.text("Reading input text from stdin...")
|
||||
texts = srsly.read_jsonl("-")
|
||||
|
||||
with msg.loading("Loading model '{}'...".format(vectors_model)):
|
||||
with msg.loading(f"Loading model '{vectors_model}'..."):
|
||||
nlp = util.load_model(vectors_model)
|
||||
msg.good("Loaded model '{}'".format(vectors_model))
|
||||
pretrained_vectors = None if not use_vectors else nlp.vocab.vectors.name
|
||||
msg.good(f"Loaded model '{vectors_model}'")
|
||||
pretrained_vectors = None if not use_vectors else nlp.vocab.vectors
|
||||
model = create_pretraining_model(
|
||||
nlp,
|
||||
Tok2Vec(
|
||||
width,
|
||||
embed_rows,
|
||||
conv_depth=depth,
|
||||
conv_depth=conv_depth,
|
||||
pretrained_vectors=pretrained_vectors,
|
||||
bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental.
|
||||
subword_features=not use_chars, # Set to False for Chinese etc
|
||||
|
@ -172,7 +121,7 @@ def pretrain(
|
|||
# Load in pretrained weights
|
||||
if init_tok2vec is not None:
|
||||
components = _load_pretrained_tok2vec(nlp, init_tok2vec)
|
||||
msg.text("Loaded pretrained tok2vec for: {}".format(components))
|
||||
msg.text(f"Loaded pretrained tok2vec for: {components}")
|
||||
# Parse the epoch number from the given weight file
|
||||
model_name = re.search(r"model\d+\.bin", str(init_tok2vec))
|
||||
if model_name:
|
||||
|
@ -181,32 +130,28 @@ def pretrain(
|
|||
else:
|
||||
if not epoch_start:
|
||||
msg.fail(
|
||||
"You have to use the '--epoch-start' argument when using a renamed weight file for "
|
||||
"'--init-tok2vec'",
|
||||
"You have to use the --epoch-start argument when using a renamed weight file for --init-tok2vec",
|
||||
exits=True,
|
||||
)
|
||||
elif epoch_start < 0:
|
||||
msg.fail(
|
||||
"The argument '--epoch-start' has to be greater or equal to 0. '%d' is invalid"
|
||||
% epoch_start,
|
||||
f"The argument --epoch-start has to be greater or equal to 0. {epoch_start} is invalid",
|
||||
exits=True,
|
||||
)
|
||||
else:
|
||||
# Without '--init-tok2vec' the '--epoch-start' argument is ignored
|
||||
epoch_start = 0
|
||||
|
||||
optimizer = create_default_optimizer(model.ops)
|
||||
optimizer = create_default_optimizer()
|
||||
tracker = ProgressTracker(frequency=10000)
|
||||
msg.divider("Pre-training tok2vec layer - starting at epoch %d" % epoch_start)
|
||||
msg.divider(f"Pre-training tok2vec layer - starting at epoch {epoch_start}")
|
||||
row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")}
|
||||
msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings)
|
||||
|
||||
def _save_model(epoch, is_temp=False):
|
||||
is_temp_str = ".temp" if is_temp else ""
|
||||
with model.use_params(optimizer.averages):
|
||||
with (output_dir / ("model%d%s.bin" % (epoch, is_temp_str))).open(
|
||||
"wb"
|
||||
) as file_:
|
||||
with (output_dir / f"model{epoch}{is_temp_str}.bin").open("wb") as file_:
|
||||
file_.write(model.tok2vec.to_bytes())
|
||||
log = {
|
||||
"nr_word": tracker.nr_word,
|
||||
|
@ -220,7 +165,9 @@ def pretrain(
|
|||
skip_counter = 0
|
||||
for epoch in range(epoch_start, n_iter + epoch_start):
|
||||
for batch_id, batch in enumerate(
|
||||
util.minibatch_by_words(((text, None) for text in texts), size=batch_size)
|
||||
util.minibatch_by_words(
|
||||
(Example(doc=text) for text in texts), size=batch_size
|
||||
)
|
||||
):
|
||||
docs, count = make_docs(
|
||||
nlp,
|
||||
|
@ -245,7 +192,7 @@ def pretrain(
|
|||
# Reshuffle the texts if texts were loaded from a file
|
||||
random.shuffle(texts)
|
||||
if skip_counter > 0:
|
||||
msg.warn("Skipped {count} empty values".format(count=str(skip_counter)))
|
||||
msg.warn(f"Skipped {skip_counter} empty values")
|
||||
msg.good("Successfully finished pretrain")
|
||||
|
||||
|
||||
|
@ -310,13 +257,14 @@ def get_vectors_loss(ops, docs, prediction, objective="L2"):
|
|||
# and look them up all at once. This prevents data copying.
|
||||
ids = ops.flatten([doc.to_array(ID).ravel() for doc in docs])
|
||||
target = docs[0].vocab.vectors.data[ids]
|
||||
# TODO: this code originally didn't normalize, but shouldn't normalize=True ?
|
||||
if objective == "L2":
|
||||
d_target = prediction - target
|
||||
loss = (d_target ** 2).sum()
|
||||
distance = L2Distance(normalize=False)
|
||||
elif objective == "cosine":
|
||||
loss, d_target = get_cossim_loss(prediction, target)
|
||||
distance = CosineDistance(normalize=False)
|
||||
else:
|
||||
raise ValueError(Errors.E142.format(loss_func=objective))
|
||||
d_target, loss = distance(prediction, target)
|
||||
return loss, d_target
|
||||
|
||||
|
||||
|
@ -328,18 +276,18 @@ def create_pretraining_model(nlp, tok2vec):
|
|||
"""
|
||||
output_size = nlp.vocab.vectors.data.shape[1]
|
||||
output_layer = chain(
|
||||
LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0)
|
||||
Maxout(300, pieces=3, normalize=True, dropout=0.0), Linear(output_size)
|
||||
)
|
||||
# This is annoying, but the parser etc have the flatten step after
|
||||
# the tok2vec. To load the weights in cleanly, we need to match
|
||||
# the shape of the models' components exactly. So what we cann
|
||||
# "tok2vec" has to be the same set of processes as what the components do.
|
||||
tok2vec = chain(tok2vec, flatten)
|
||||
tok2vec = chain(tok2vec, list2array())
|
||||
model = chain(tok2vec, output_layer)
|
||||
model = masked_language_model(nlp.vocab, model)
|
||||
model.tok2vec = tok2vec
|
||||
model.output_layer = output_layer
|
||||
model.begin_training([nlp.make_doc("Give it a doc to infer shapes")])
|
||||
model.set_ref("tok2vec", tok2vec)
|
||||
model.set_ref("output_layer", output_layer)
|
||||
model.initialize(X=[nlp.make_doc("Give it a doc to infer shapes")])
|
||||
return model
|
||||
|
||||
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals, division, print_function
|
||||
|
||||
import plac
|
||||
import tqdm
|
||||
from pathlib import Path
|
||||
import srsly
|
||||
|
@ -9,18 +5,19 @@ import cProfile
|
|||
import pstats
|
||||
import sys
|
||||
import itertools
|
||||
import thinc.extra.datasets
|
||||
import ml_datasets
|
||||
from wasabi import msg
|
||||
|
||||
from ..util import load_model
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model to load", "positional", None, str),
|
||||
inputs=("Location of input file. '-' for stdin.", "positional", None, str),
|
||||
n_texts=("Maximum number of texts to use if available", "option", "n", int),
|
||||
)
|
||||
def profile(model, inputs=None, n_texts=10000):
|
||||
def profile(
|
||||
# fmt: off
|
||||
model: ("Model to load", "positional", None, str),
|
||||
inputs: ("Location of input file. '-' for stdin.", "positional", None, str) = None,
|
||||
n_texts: ("Maximum number of texts to use if available", "option", "n", int) = 10000,
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Profile a spaCy pipeline, to find out which functions take the most time.
|
||||
Input should be formatted as one JSON object per line with a key "text".
|
||||
|
@ -32,13 +29,13 @@ def profile(model, inputs=None, n_texts=10000):
|
|||
if inputs is None:
|
||||
n_inputs = 25000
|
||||
with msg.loading("Loading IMDB dataset via Thinc..."):
|
||||
imdb_train, _ = thinc.extra.datasets.imdb()
|
||||
imdb_train, _ = ml_datasets.imdb()
|
||||
inputs, _ = zip(*imdb_train)
|
||||
msg.info("Loaded IMDB dataset and using {} examples".format(n_inputs))
|
||||
msg.info(f"Loaded IMDB dataset and using {n_inputs} examples")
|
||||
inputs = inputs[:n_inputs]
|
||||
with msg.loading("Loading model '{}'...".format(model)):
|
||||
with msg.loading(f"Loading model '{model}'..."):
|
||||
nlp = load_model(model)
|
||||
msg.good("Loaded model '{}'".format(model))
|
||||
msg.good(f"Loaded model '{model}'")
|
||||
texts = list(itertools.islice(inputs, n_texts))
|
||||
cProfile.runctx("parse_texts(nlp, texts)", globals(), locals(), "Profile.prof")
|
||||
s = pstats.Stats("Profile.prof")
|
||||
|
@ -60,7 +57,7 @@ def _read_inputs(loc, msg):
|
|||
input_path = Path(loc)
|
||||
if not input_path.exists() or not input_path.is_file():
|
||||
msg.fail("Not a valid input data file", loc, exits=1)
|
||||
msg.info("Using data from {}".format(input_path.parts[-1]))
|
||||
msg.info(f"Using data from {input_path.parts[-1]}")
|
||||
file_ = input_path.open()
|
||||
for line in file_:
|
||||
data = srsly.json_loads(line)
|
||||
|
|
|
@ -1,11 +1,7 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals, division, print_function
|
||||
|
||||
import plac
|
||||
import os
|
||||
import tqdm
|
||||
from pathlib import Path
|
||||
from thinc.neural._classes.model import Model
|
||||
from thinc.api import use_ops
|
||||
from timeit import default_timer as timer
|
||||
import shutil
|
||||
import srsly
|
||||
|
@ -13,76 +9,53 @@ from wasabi import msg
|
|||
import contextlib
|
||||
import random
|
||||
|
||||
from .._ml import create_default_optimizer
|
||||
from ..util import create_default_optimizer
|
||||
from ..util import use_gpu as set_gpu
|
||||
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
|
||||
from ..gold import GoldCorpus
|
||||
from ..compat import path2str
|
||||
from .. import util
|
||||
from .. import about
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
# fmt: off
|
||||
lang=("Model language", "positional", None, str),
|
||||
output_path=("Output directory to store model in", "positional", None, Path),
|
||||
train_path=("Location of JSON-formatted training data", "positional", None, Path),
|
||||
dev_path=("Location of JSON-formatted development data", "positional", None, Path),
|
||||
raw_text=("Path to jsonl file with unlabelled text documents.", "option", "rt", Path),
|
||||
base_model=("Name of model to update (optional)", "option", "b", str),
|
||||
pipeline=("Comma-separated names of pipeline components", "option", "p", str),
|
||||
replace_components=("Replace components from base model", "flag", "R", bool),
|
||||
vectors=("Model to load vectors from", "option", "v", str),
|
||||
n_iter=("Number of iterations", "option", "n", int),
|
||||
n_early_stopping=("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int),
|
||||
n_examples=("Number of examples", "option", "ns", int),
|
||||
use_gpu=("Use GPU", "option", "g", int),
|
||||
version=("Model version", "option", "V", str),
|
||||
meta_path=("Optional path to meta.json to use as base.", "option", "m", Path),
|
||||
init_tok2vec=("Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.", "option", "t2v", Path),
|
||||
parser_multitasks=("Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'", "option", "pt", str),
|
||||
entity_multitasks=("Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'", "option", "et", str),
|
||||
noise_level=("Amount of corruption for data augmentation", "option", "nl", float),
|
||||
orth_variant_level=("Amount of orthography variation for data augmentation", "option", "ovl", float),
|
||||
eval_beam_widths=("Beam widths to evaluate, e.g. 4,8", "option", "bw", str),
|
||||
gold_preproc=("Use gold preprocessing", "flag", "G", bool),
|
||||
learn_tokens=("Make parser learn gold-standard tokenization", "flag", "T", bool),
|
||||
textcat_multilabel=("Textcat classes aren't mutually exclusive (multilabel)", "flag", "TML", bool),
|
||||
textcat_arch=("Textcat model architecture", "option", "ta", str),
|
||||
textcat_positive_label=("Textcat positive label for binary classes with two labels", "option", "tpl", str),
|
||||
verbose=("Display more information for debug", "flag", "VV", bool),
|
||||
debug=("Run data diagnostics before training", "flag", "D", bool),
|
||||
# fmt: on
|
||||
)
|
||||
def train(
|
||||
lang,
|
||||
output_path,
|
||||
train_path,
|
||||
dev_path,
|
||||
raw_text=None,
|
||||
base_model=None,
|
||||
pipeline="tagger,parser,ner",
|
||||
replace_components=False,
|
||||
vectors=None,
|
||||
n_iter=30,
|
||||
n_early_stopping=None,
|
||||
n_examples=0,
|
||||
use_gpu=-1,
|
||||
version="0.0.0",
|
||||
meta_path=None,
|
||||
init_tok2vec=None,
|
||||
parser_multitasks="",
|
||||
entity_multitasks="",
|
||||
noise_level=0.0,
|
||||
orth_variant_level=0.0,
|
||||
eval_beam_widths="",
|
||||
gold_preproc=False,
|
||||
learn_tokens=False,
|
||||
textcat_multilabel=False,
|
||||
textcat_arch="bow",
|
||||
textcat_positive_label=None,
|
||||
verbose=False,
|
||||
debug=False,
|
||||
# fmt: off
|
||||
lang: ("Model language", "positional", None, str),
|
||||
output_path: ("Output directory to store model in", "positional", None, Path),
|
||||
train_path: ("Location of JSON-formatted training data", "positional", None, Path),
|
||||
dev_path: ("Location of JSON-formatted development data", "positional", None, Path),
|
||||
raw_text: ("Path to jsonl file with unlabelled text documents.", "option", "rt", Path) = None,
|
||||
base_model: ("Name of model to update (optional)", "option", "b", str) = None,
|
||||
pipeline: ("Comma-separated names of pipeline components", "option", "p", str) = "tagger,parser,ner",
|
||||
vectors: ("Model to load vectors from", "option", "v", str) = None,
|
||||
replace_components: ("Replace components from base model", "flag", "R", bool) = False,
|
||||
width: ("Width of CNN layers of Tok2Vec component", "option", "cw", int) = 96,
|
||||
conv_depth: ("Depth of CNN layers of Tok2Vec component", "option", "cd", int) = 4,
|
||||
cnn_window: ("Window size for CNN layers of Tok2Vec component", "option", "cW", int) = 1,
|
||||
cnn_pieces: ("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int) = 3,
|
||||
use_chars: ("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool) = False,
|
||||
bilstm_depth: ("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int) = 0,
|
||||
embed_rows: ("Number of embedding rows of Tok2Vec component", "option", "er", int) = 2000,
|
||||
n_iter: ("Number of iterations", "option", "n", int) = 30,
|
||||
n_early_stopping: ("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int) = None,
|
||||
n_examples: ("Number of examples", "option", "ns", int) = 0,
|
||||
use_gpu: ("Use GPU", "option", "g", int) = -1,
|
||||
version: ("Model version", "option", "V", str) = "0.0.0",
|
||||
meta_path: ("Optional path to meta.json to use as base.", "option", "m", Path) = None,
|
||||
init_tok2vec: ("Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.", "option", "t2v", Path) = None,
|
||||
parser_multitasks: ("Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'", "option", "pt", str) = "",
|
||||
entity_multitasks: ("Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'", "option", "et", str) = "",
|
||||
noise_level: ("Amount of corruption for data augmentation", "option", "nl", float) = 0.0,
|
||||
orth_variant_level: ("Amount of orthography variation for data augmentation", "option", "ovl", float) = 0.0,
|
||||
eval_beam_widths: ("Beam widths to evaluate, e.g. 4,8", "option", "bw", str) = "",
|
||||
gold_preproc: ("Use gold preprocessing", "flag", "G", bool) = False,
|
||||
learn_tokens: ("Make parser learn gold-standard tokenization", "flag", "T", bool) = False,
|
||||
textcat_multilabel: ("Textcat classes aren't mutually exclusive (multilabel)", "flag", "TML", bool) = False,
|
||||
textcat_arch: ("Textcat model architecture", "option", "ta", str) = "bow",
|
||||
textcat_positive_label: ("Textcat positive label for binary classes with two labels", "option", "tpl", str) = None,
|
||||
tag_map_path: ("Location of JSON-formatted tag map", "option", "tm", Path) = None,
|
||||
verbose: ("Display more information for debug", "flag", "VV", bool) = False,
|
||||
debug: ("Run data diagnostics before training", "flag", "D", bool) = False,
|
||||
# fmt: on
|
||||
):
|
||||
"""
|
||||
Train or update a spaCy model. Requires data to be formatted in spaCy's
|
||||
|
@ -116,7 +89,11 @@ def train(
|
|||
)
|
||||
if not output_path.exists():
|
||||
output_path.mkdir()
|
||||
msg.good(f"Created output directory: {output_path}")
|
||||
|
||||
tag_map = {}
|
||||
if tag_map_path is not None:
|
||||
tag_map = srsly.read_json(tag_map_path)
|
||||
# Take dropout and batch size as generators of values -- dropout
|
||||
# starts high and decays sharply, to force the optimizer to explore.
|
||||
# Batch size starts at 1 and grows, so that we make updates quickly
|
||||
|
@ -145,28 +122,29 @@ def train(
|
|||
# the model and make sure the pipeline matches the pipeline setting. If
|
||||
# training starts from a blank model, intitalize the language class.
|
||||
pipeline = [p.strip() for p in pipeline.split(",")]
|
||||
msg.text(f"Training pipeline: {pipeline}")
|
||||
disabled_pipes = None
|
||||
pipes_added = False
|
||||
msg.text("Training pipeline: {}".format(pipeline))
|
||||
msg.text(f"Training pipeline: {pipeline}")
|
||||
if use_gpu >= 0:
|
||||
activated_gpu = None
|
||||
try:
|
||||
activated_gpu = set_gpu(use_gpu)
|
||||
except Exception as e:
|
||||
msg.warn("Exception: {}".format(e))
|
||||
msg.warn(f"Exception: {e}")
|
||||
if activated_gpu is not None:
|
||||
msg.text("Using GPU: {}".format(use_gpu))
|
||||
msg.text(f"Using GPU: {use_gpu}")
|
||||
else:
|
||||
msg.warn("Unable to activate GPU: {}".format(use_gpu))
|
||||
msg.warn(f"Unable to activate GPU: {use_gpu}")
|
||||
msg.text("Using CPU only")
|
||||
use_gpu = -1
|
||||
if base_model:
|
||||
msg.text("Starting with base model '{}'".format(base_model))
|
||||
msg.text(f"Starting with base model '{base_model}'")
|
||||
nlp = util.load_model(base_model)
|
||||
if nlp.lang != lang:
|
||||
msg.fail(
|
||||
"Model language ('{}') doesn't match language specified as "
|
||||
"`lang` argument ('{}') ".format(nlp.lang, lang),
|
||||
f"Model language ('{nlp.lang}') doesn't match language "
|
||||
f"specified as `lang` argument ('{lang}') ",
|
||||
exits=1,
|
||||
)
|
||||
for pipe in pipeline:
|
||||
|
@ -180,11 +158,11 @@ def train(
|
|||
"positive_label": textcat_positive_label,
|
||||
}
|
||||
if pipe not in nlp.pipe_names:
|
||||
msg.text("Adding component to base model '{}'".format(pipe))
|
||||
msg.text(f"Adding component to base model '{pipe}'")
|
||||
nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg))
|
||||
pipes_added = True
|
||||
elif replace_components:
|
||||
msg.text("Replacing component from base model '{}'".format(pipe))
|
||||
msg.text(f"Replacing component from base model '{pipe}'")
|
||||
nlp.replace_pipe(pipe, nlp.create_pipe(pipe, config=pipe_cfg))
|
||||
pipes_added = True
|
||||
else:
|
||||
|
@ -197,17 +175,17 @@ def train(
|
|||
}
|
||||
if base_cfg != pipe_cfg:
|
||||
msg.fail(
|
||||
"The base textcat model configuration does"
|
||||
"not match the provided training options. "
|
||||
"Existing cfg: {}, provided cfg: {}".format(
|
||||
base_cfg, pipe_cfg
|
||||
),
|
||||
f"The base textcat model configuration does"
|
||||
f"not match the provided training options. "
|
||||
f"Existing cfg: {base_cfg}, provided cfg: {pipe_cfg}",
|
||||
exits=1,
|
||||
)
|
||||
msg.text("Extending component from base model '{}'".format(pipe))
|
||||
disabled_pipes = nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline])
|
||||
msg.text(f"Extending component from base model '{pipe}'")
|
||||
disabled_pipes = nlp.disable_pipes(
|
||||
[p for p in nlp.pipe_names if p not in pipeline]
|
||||
)
|
||||
else:
|
||||
msg.text("Starting with blank model '{}'".format(lang))
|
||||
msg.text(f"Starting with blank model '{lang}'")
|
||||
lang_cls = util.get_lang_class(lang)
|
||||
nlp = lang_cls()
|
||||
for pipe in pipeline:
|
||||
|
@ -223,8 +201,11 @@ def train(
|
|||
pipe_cfg = {}
|
||||
nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg))
|
||||
|
||||
# Update tag map with provided mapping
|
||||
nlp.vocab.morphology.tag_map.update(tag_map)
|
||||
|
||||
if vectors:
|
||||
msg.text("Loading vector from model '{}'".format(vectors))
|
||||
msg.text(f"Loading vector from model '{vectors}'")
|
||||
_load_vectors(nlp, vectors)
|
||||
|
||||
# Multitask objectives
|
||||
|
@ -233,49 +214,56 @@ def train(
|
|||
if multitasks:
|
||||
if pipe_name not in pipeline:
|
||||
msg.fail(
|
||||
"Can't use multitask objective without '{}' in the "
|
||||
"pipeline".format(pipe_name)
|
||||
f"Can't use multitask objective without '{pipe_name}' in "
|
||||
f"the pipeline"
|
||||
)
|
||||
pipe = nlp.get_pipe(pipe_name)
|
||||
for objective in multitasks.split(","):
|
||||
pipe.add_multitask_objective(objective)
|
||||
|
||||
# Prepare training corpus
|
||||
msg.text("Counting training words (limit={})".format(n_examples))
|
||||
msg.text(f"Counting training words (limit={n_examples})")
|
||||
corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
|
||||
n_train_words = corpus.count_train()
|
||||
|
||||
if base_model and not pipes_added:
|
||||
# Start with an existing model, use default optimizer
|
||||
optimizer = create_default_optimizer(Model.ops)
|
||||
optimizer = create_default_optimizer()
|
||||
else:
|
||||
# Start with a blank model, call begin_training
|
||||
optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
|
||||
|
||||
cfg = {"device": use_gpu}
|
||||
cfg["conv_depth"] = conv_depth
|
||||
cfg["token_vector_width"] = width
|
||||
cfg["bilstm_depth"] = bilstm_depth
|
||||
cfg["cnn_maxout_pieces"] = cnn_pieces
|
||||
cfg["embed_size"] = embed_rows
|
||||
cfg["conv_window"] = cnn_window
|
||||
cfg["subword_features"] = not use_chars
|
||||
optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg)
|
||||
nlp._optimizer = None
|
||||
|
||||
# Load in pretrained weights
|
||||
if init_tok2vec is not None:
|
||||
components = _load_pretrained_tok2vec(nlp, init_tok2vec)
|
||||
msg.text("Loaded pretrained tok2vec for: {}".format(components))
|
||||
msg.text(f"Loaded pretrained tok2vec for: {components}")
|
||||
|
||||
# Verify textcat config
|
||||
if "textcat" in pipeline:
|
||||
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
|
||||
if textcat_positive_label and textcat_positive_label not in textcat_labels:
|
||||
msg.fail(
|
||||
"The textcat_positive_label (tpl) '{}' does not match any "
|
||||
"label in the training data.".format(textcat_positive_label),
|
||||
f"The textcat_positive_label (tpl) '{textcat_positive_label}' "
|
||||
f"does not match any label in the training data.",
|
||||
exits=1,
|
||||
)
|
||||
if textcat_positive_label and len(textcat_labels) != 2:
|
||||
msg.fail(
|
||||
"A textcat_positive_label (tpl) '{}' was provided for training "
|
||||
"data that does not appear to be a binary classification "
|
||||
"problem with two labels.".format(textcat_positive_label),
|
||||
"A textcat_positive_label (tpl) '{textcat_positive_label}' was "
|
||||
"provided for training data that does not appear to be a "
|
||||
"binary classification problem with two labels.",
|
||||
exits=1,
|
||||
)
|
||||
train_docs = corpus.train_docs(
|
||||
train_data = corpus.train_data(
|
||||
nlp,
|
||||
noise_level=noise_level,
|
||||
gold_preproc=gold_preproc,
|
||||
|
@ -285,9 +273,9 @@ def train(
|
|||
train_labels = set()
|
||||
if textcat_multilabel:
|
||||
multilabel_found = False
|
||||
for text, gold in train_docs:
|
||||
train_labels.update(gold.cats.keys())
|
||||
if list(gold.cats.values()).count(1.0) != 1:
|
||||
for ex in train_data:
|
||||
train_labels.update(ex.gold.cats.keys())
|
||||
if list(ex.gold.cats.values()).count(1.0) != 1:
|
||||
multilabel_found = True
|
||||
if not multilabel_found and not base_model:
|
||||
msg.warn(
|
||||
|
@ -297,9 +285,9 @@ def train(
|
|||
"mutually-exclusive classes."
|
||||
)
|
||||
if not textcat_multilabel:
|
||||
for text, gold in train_docs:
|
||||
train_labels.update(gold.cats.keys())
|
||||
if list(gold.cats.values()).count(1.0) != 1 and not base_model:
|
||||
for ex in train_data:
|
||||
train_labels.update(ex.gold.cats.keys())
|
||||
if list(ex.gold.cats.values()).count(1.0) != 1 and not base_model:
|
||||
msg.warn(
|
||||
"Some textcat training instances do not have exactly "
|
||||
"one positive label. Modifying training options to "
|
||||
|
@ -311,20 +299,20 @@ def train(
|
|||
break
|
||||
if base_model and set(textcat_labels) != train_labels:
|
||||
msg.fail(
|
||||
"Cannot extend textcat model using data with different "
|
||||
"labels. Base model labels: {}, training data labels: "
|
||||
"{}.".format(textcat_labels, list(train_labels)),
|
||||
f"Cannot extend textcat model using data with different "
|
||||
f"labels. Base model labels: {textcat_labels}, training data "
|
||||
f"labels: {list(train_labels)}",
|
||||
exits=1,
|
||||
)
|
||||
if textcat_multilabel:
|
||||
msg.text(
|
||||
"Textcat evaluation score: ROC AUC score macro-averaged across "
|
||||
"the labels '{}'".format(", ".join(textcat_labels))
|
||||
f"Textcat evaluation score: ROC AUC score macro-averaged across "
|
||||
f"the labels '{', '.join(textcat_labels)}'"
|
||||
)
|
||||
elif textcat_positive_label and len(textcat_labels) == 2:
|
||||
msg.text(
|
||||
"Textcat evaluation score: F1-score for the "
|
||||
"label '{}'".format(textcat_positive_label)
|
||||
f"Textcat evaluation score: F1-score for the "
|
||||
f"label '{textcat_positive_label}'"
|
||||
)
|
||||
elif len(textcat_labels) > 1:
|
||||
if len(textcat_labels) == 2:
|
||||
|
@ -334,8 +322,8 @@ def train(
|
|||
"an evaluation on the positive class."
|
||||
)
|
||||
msg.text(
|
||||
"Textcat evaluation score: F1-score macro-averaged across "
|
||||
"the labels '{}'".format(", ".join(textcat_labels))
|
||||
f"Textcat evaluation score: F1-score macro-averaged across "
|
||||
f"the labels '{', '.join(textcat_labels)}'"
|
||||
)
|
||||
else:
|
||||
msg.fail(
|
||||
|
@ -355,7 +343,7 @@ def train(
|
|||
iter_since_best = 0
|
||||
best_score = 0.0
|
||||
for i in range(n_iter):
|
||||
train_docs = corpus.train_docs(
|
||||
train_data = corpus.train_dataset(
|
||||
nlp,
|
||||
noise_level=noise_level,
|
||||
orth_variant_level=orth_variant_level,
|
||||
|
@ -371,73 +359,82 @@ def train(
|
|||
words_seen = 0
|
||||
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
||||
losses = {}
|
||||
for batch in util.minibatch_by_words(train_docs, size=batch_sizes):
|
||||
for batch in util.minibatch_by_words(train_data, size=batch_sizes):
|
||||
if not batch:
|
||||
continue
|
||||
docs, golds = zip(*batch)
|
||||
nlp.update(
|
||||
docs,
|
||||
golds,
|
||||
sgd=optimizer,
|
||||
drop=next(dropout_rates),
|
||||
losses=losses,
|
||||
)
|
||||
try:
|
||||
nlp.update(
|
||||
docs,
|
||||
golds,
|
||||
sgd=optimizer,
|
||||
drop=next(dropout_rates),
|
||||
losses=losses,
|
||||
)
|
||||
except ValueError as e:
|
||||
msg.warn("Error during training")
|
||||
if init_tok2vec:
|
||||
msg.warn(
|
||||
"Did you provide the same parameters during 'train' as during 'pretrain'?"
|
||||
)
|
||||
msg.fail(f"Original error message: {e}", exits=1)
|
||||
if raw_text:
|
||||
# If raw text is available, perform 'rehearsal' updates,
|
||||
# which use unlabelled data to reduce overfitting.
|
||||
raw_batch = list(next(raw_batches))
|
||||
nlp.rehearse(raw_batch, sgd=optimizer, losses=losses)
|
||||
docs = [ex.doc for ex in batch]
|
||||
if not int(os.environ.get("LOG_FRIENDLY", 0)):
|
||||
pbar.update(sum(len(doc) for doc in docs))
|
||||
words_seen += sum(len(doc) for doc in docs)
|
||||
with nlp.use_params(optimizer.averages):
|
||||
util.set_env_log(False)
|
||||
epoch_model_path = output_path / ("model%d" % i)
|
||||
epoch_model_path = output_path / f"model{i}"
|
||||
nlp.to_disk(epoch_model_path)
|
||||
nlp_loaded = util.load_model_from_path(epoch_model_path)
|
||||
for beam_width in eval_beam_widths:
|
||||
for name, component in nlp_loaded.pipeline:
|
||||
if hasattr(component, "cfg"):
|
||||
component.cfg["beam_width"] = beam_width
|
||||
dev_docs = list(
|
||||
corpus.dev_docs(
|
||||
dev_dataset = list(
|
||||
corpus.dev_dataset(
|
||||
nlp_loaded,
|
||||
gold_preproc=gold_preproc,
|
||||
ignore_misaligned=True,
|
||||
)
|
||||
)
|
||||
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
|
||||
nwords = sum(len(ex.doc) for ex in dev_dataset)
|
||||
start_time = timer()
|
||||
scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose)
|
||||
scorer = nlp_loaded.evaluate(dev_dataset, verbose=verbose)
|
||||
end_time = timer()
|
||||
if use_gpu < 0:
|
||||
gpu_wps = None
|
||||
cpu_wps = nwords / (end_time - start_time)
|
||||
else:
|
||||
gpu_wps = nwords / (end_time - start_time)
|
||||
with Model.use_device("cpu"):
|
||||
with use_ops("numpy"):
|
||||
nlp_loaded = util.load_model_from_path(epoch_model_path)
|
||||
for name, component in nlp_loaded.pipeline:
|
||||
if hasattr(component, "cfg"):
|
||||
component.cfg["beam_width"] = beam_width
|
||||
dev_docs = list(
|
||||
corpus.dev_docs(
|
||||
dev_dataset = list(
|
||||
corpus.dev_dataset(
|
||||
nlp_loaded,
|
||||
gold_preproc=gold_preproc,
|
||||
ignore_misaligned=True,
|
||||
)
|
||||
)
|
||||
start_time = timer()
|
||||
scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose)
|
||||
scorer = nlp_loaded.evaluate(dev_dataset, verbose=verbose)
|
||||
end_time = timer()
|
||||
cpu_wps = nwords / (end_time - start_time)
|
||||
acc_loc = output_path / ("model%d" % i) / "accuracy.json"
|
||||
acc_loc = output_path / f"model{i}" / "accuracy.json"
|
||||
srsly.write_json(acc_loc, scorer.scores)
|
||||
|
||||
# Update model meta.json
|
||||
meta["lang"] = nlp.lang
|
||||
meta["pipeline"] = nlp.pipe_names
|
||||
meta["spacy_version"] = ">=%s" % about.__version__
|
||||
meta["spacy_version"] = f">={about.__version__}"
|
||||
if beam_width == 1:
|
||||
meta["speed"] = {
|
||||
"nwords": nwords,
|
||||
|
@ -465,10 +462,10 @@ def train(
|
|||
"keys": nlp.vocab.vectors.n_keys,
|
||||
"name": nlp.vocab.vectors.name,
|
||||
}
|
||||
meta.setdefault("name", "model%d" % i)
|
||||
meta.setdefault("name", f"model{i}")
|
||||
meta.setdefault("version", version)
|
||||
meta["labels"] = nlp.meta["labels"]
|
||||
meta_loc = output_path / ("model%d" % i) / "meta.json"
|
||||
meta_loc = output_path / f"model{i}" / "meta.json"
|
||||
srsly.write_json(meta_loc, meta)
|
||||
util.set_env_log(verbose)
|
||||
|
||||
|
@ -486,8 +483,8 @@ def train(
|
|||
for cat, cat_score in textcats_per_cat.items():
|
||||
if cat_score.get("roc_auc_score", 0) < 0:
|
||||
msg.warn(
|
||||
"Textcat ROC AUC score is undefined due to "
|
||||
"only one value in label '{}'.".format(cat)
|
||||
f"Textcat ROC AUC score is undefined due to "
|
||||
f"only one value in label '{cat}'."
|
||||
)
|
||||
msg.row(progress, **row_settings)
|
||||
# Early stopping
|
||||
|
@ -500,14 +497,14 @@ def train(
|
|||
best_score = current_score
|
||||
if iter_since_best >= n_early_stopping:
|
||||
msg.text(
|
||||
"Early stopping, best iteration "
|
||||
"is: {}".format(i - iter_since_best)
|
||||
f"Early stopping, best iteration is: {i - iter_since_best}"
|
||||
)
|
||||
msg.text(
|
||||
"Best score = {}; Final iteration "
|
||||
"score = {}".format(best_score, current_score)
|
||||
f"Best score = {best_score}; Final iteration score = {current_score}"
|
||||
)
|
||||
break
|
||||
except Exception as e:
|
||||
msg.warn(f"Aborting and saving final best model. Encountered exception: {e}")
|
||||
finally:
|
||||
best_pipes = nlp.pipe_names
|
||||
if disabled_pipes:
|
||||
|
@ -535,6 +532,8 @@ def _score_for_model(meta):
|
|||
mean_acc.append((acc["ents_p"] + acc["ents_r"] + acc["ents_f"]) / 3)
|
||||
if "textcat" in pipes:
|
||||
mean_acc.append(acc["textcat_score"])
|
||||
if "sentrec" in pipes:
|
||||
mean_acc.append((acc["sent_p"] + acc["sent_r"] + acc["sent_f"]) / 3)
|
||||
return sum(mean_acc) / len(mean_acc)
|
||||
|
||||
|
||||
|
@ -580,12 +579,10 @@ def _collate_best_model(meta, output_path, components):
|
|||
for component in components:
|
||||
bests[component] = _find_best(output_path, component)
|
||||
best_dest = output_path / "model-best"
|
||||
shutil.copytree(path2str(output_path / "model-final"), path2str(best_dest))
|
||||
shutil.copytree(str(output_path / "model-final"), str(best_dest))
|
||||
for component, best_component_src in bests.items():
|
||||
shutil.rmtree(path2str(best_dest / component))
|
||||
shutil.copytree(
|
||||
path2str(best_component_src / component), path2str(best_dest / component)
|
||||
)
|
||||
shutil.rmtree(str(best_dest / component))
|
||||
shutil.copytree(str(best_component_src / component), str(best_dest / component))
|
||||
accs = srsly.read_json(best_component_src / "accuracy.json")
|
||||
for metric in _get_metrics(component):
|
||||
meta["accuracy"][metric] = accs[metric]
|
||||
|
@ -608,11 +605,13 @@ def _find_best(experiment_dir, component):
|
|||
|
||||
def _get_metrics(component):
|
||||
if component == "parser":
|
||||
return ("las", "uas", "las_per_type", "token_acc")
|
||||
return ("las", "uas", "las_per_type", "token_acc", "sent_f")
|
||||
elif component == "tagger":
|
||||
return ("tags_acc",)
|
||||
elif component == "ner":
|
||||
return ("ents_f", "ents_p", "ents_r", "ents_per_type")
|
||||
return ("ents_f", "ents_p", "ents_r", "enty_per_type")
|
||||
elif component == "sentrec":
|
||||
return ("sent_f", "sent_p", "sent_r")
|
||||
elif component == "textcat":
|
||||
return ("textcat_score",)
|
||||
return ("token_acc",)
|
||||
|
@ -626,14 +625,21 @@ def _configure_training_output(pipeline, use_gpu, has_beam_widths):
|
|||
row_head.extend(["Tag Loss ", " Tag % "])
|
||||
output_stats.extend(["tag_loss", "tags_acc"])
|
||||
elif pipe == "parser":
|
||||
row_head.extend(["Dep Loss ", " UAS ", " LAS "])
|
||||
output_stats.extend(["dep_loss", "uas", "las"])
|
||||
row_head.extend(
|
||||
["Dep Loss ", " UAS ", " LAS ", "Sent P", "Sent R", "Sent F"]
|
||||
)
|
||||
output_stats.extend(
|
||||
["dep_loss", "uas", "las", "sent_p", "sent_r", "sent_f"]
|
||||
)
|
||||
elif pipe == "ner":
|
||||
row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "])
|
||||
output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"])
|
||||
elif pipe == "textcat":
|
||||
row_head.extend(["Textcat Loss", "Textcat"])
|
||||
output_stats.extend(["textcat_loss", "textcat_score"])
|
||||
elif pipe == "sentrec":
|
||||
row_head.extend(["Sentrec Loss", "Sent P", "Sent R", "Sent F"])
|
||||
output_stats.extend(["sentrec_loss", "sent_p", "sent_r", "sent_f"])
|
||||
row_head.extend(["Token %", "CPU WPS"])
|
||||
output_stats.extend(["token_acc", "cpu_wps"])
|
||||
|
||||
|
@ -643,7 +649,10 @@ def _configure_training_output(pipeline, use_gpu, has_beam_widths):
|
|||
|
||||
if has_beam_widths:
|
||||
row_head.insert(1, "Beam W.")
|
||||
return row_head, output_stats
|
||||
# remove duplicates
|
||||
row_head_dict = {k: 1 for k in row_head}
|
||||
output_stats_dict = {k: 1 for k in output_stats}
|
||||
return row_head_dict.keys(), output_stats_dict.keys()
|
||||
|
||||
|
||||
def _get_progress(
|
||||
|
@ -656,6 +665,7 @@ def _get_progress(
|
|||
scores["ner_loss"] = losses.get("ner", 0.0)
|
||||
scores["tag_loss"] = losses.get("tagger", 0.0)
|
||||
scores["textcat_loss"] = losses.get("textcat", 0.0)
|
||||
scores["sentrec_loss"] = losses.get("sentrec", 0.0)
|
||||
scores["cpu_wps"] = cpu_wps
|
||||
scores["gpu_wps"] = gpu_wps or 0.0
|
||||
scores.update(dev_scores)
|
||||
|
|
439
spacy/cli/train_from_config.py
Normal file
439
spacy/cli/train_from_config.py
Normal file
|
@ -0,0 +1,439 @@
|
|||
from typing import Optional, Dict, List, Union, Sequence
|
||||
import plac
|
||||
from wasabi import msg
|
||||
from pathlib import Path
|
||||
import thinc
|
||||
import thinc.schedules
|
||||
from thinc.api import Model
|
||||
from pydantic import BaseModel, FilePath, StrictInt
|
||||
import tqdm
|
||||
|
||||
# TODO: relative imports?
|
||||
import spacy
|
||||
from spacy.gold import GoldCorpus
|
||||
from spacy.pipeline.tok2vec import Tok2VecListener
|
||||
from spacy.ml import component_models
|
||||
from spacy import util
|
||||
|
||||
|
||||
registry = util.registry
|
||||
|
||||
CONFIG_STR = """
|
||||
[training]
|
||||
patience = 10
|
||||
eval_frequency = 10
|
||||
dropout = 0.2
|
||||
init_tok2vec = null
|
||||
vectors = null
|
||||
max_epochs = 100
|
||||
orth_variant_level = 0.0
|
||||
gold_preproc = false
|
||||
max_length = 0
|
||||
use_gpu = 0
|
||||
scores = ["ents_p", "ents_r", "ents_f"]
|
||||
score_weights = {"ents_f": 1.0}
|
||||
limit = 0
|
||||
|
||||
[training.batch_size]
|
||||
@schedules = "compounding.v1"
|
||||
start = 100
|
||||
stop = 1000
|
||||
compound = 1.001
|
||||
|
||||
[optimizer]
|
||||
@optimizers = "Adam.v1"
|
||||
learn_rate = 0.001
|
||||
beta1 = 0.9
|
||||
beta2 = 0.999
|
||||
|
||||
[nlp]
|
||||
lang = "en"
|
||||
vectors = ${training:vectors}
|
||||
|
||||
[nlp.pipeline.tok2vec]
|
||||
factory = "tok2vec"
|
||||
|
||||
[nlp.pipeline.ner]
|
||||
factory = "ner"
|
||||
|
||||
[nlp.pipeline.ner.model]
|
||||
@architectures = "transition_based_ner.v1"
|
||||
nr_feature_tokens = 3
|
||||
hidden_width = 64
|
||||
maxout_pieces = 3
|
||||
|
||||
[nlp.pipeline.ner.model.tok2vec]
|
||||
@architectures = "tok2vec_tensors.v1"
|
||||
width = ${nlp.pipeline.tok2vec.model:width}
|
||||
|
||||
[nlp.pipeline.tok2vec.model]
|
||||
@architectures = "hash_embed_cnn.v1"
|
||||
pretrained_vectors = ${nlp:vectors}
|
||||
width = 128
|
||||
depth = 4
|
||||
window_size = 1
|
||||
embed_size = 10000
|
||||
maxout_pieces = 3
|
||||
"""
|
||||
|
||||
|
||||
class PipelineComponent(BaseModel):
|
||||
factory: str
|
||||
model: Model
|
||||
|
||||
class Config:
|
||||
arbitrary_types_allowed = True
|
||||
|
||||
|
||||
class ConfigSchema(BaseModel):
|
||||
optimizer: Optional["Optimizer"]
|
||||
|
||||
class training(BaseModel):
|
||||
patience: int = 10
|
||||
eval_frequency: int = 100
|
||||
dropout: float = 0.2
|
||||
init_tok2vec: Optional[FilePath] = None
|
||||
vectors: Optional[str] = None
|
||||
max_epochs: int = 100
|
||||
orth_variant_level: float = 0.0
|
||||
gold_preproc: bool = False
|
||||
max_length: int = 0
|
||||
use_gpu: int = 0
|
||||
scores: List[str] = ["ents_p", "ents_r", "ents_f"]
|
||||
score_weights: Dict[str, Union[int, float]] = {"ents_f": 1.0}
|
||||
limit: int = 0
|
||||
batch_size: Union[Sequence[int], int]
|
||||
|
||||
class nlp(BaseModel):
|
||||
lang: str
|
||||
vectors: Optional[str]
|
||||
pipeline: Optional[Dict[str, PipelineComponent]]
|
||||
|
||||
class Config:
|
||||
extra = "allow"
|
||||
|
||||
|
||||
# Of course, these would normally decorate the functions where they're defined.
|
||||
# But for now...
|
||||
@registry.architectures.register("hash_embed_cnn.v1")
|
||||
def hash_embed_cnn(
|
||||
pretrained_vectors, width, depth, embed_size, maxout_pieces, window_size
|
||||
):
|
||||
return component_models.Tok2Vec(
|
||||
width=width,
|
||||
embed_size=embed_size,
|
||||
pretrained_vectors=pretrained_vectors,
|
||||
conv_depth=depth,
|
||||
cnn_maxout_pieces=maxout_pieces,
|
||||
bilstm_depth=0,
|
||||
window_size=window_size,
|
||||
)
|
||||
|
||||
|
||||
@registry.architectures.register("hash_embed_bilstm.v1")
|
||||
def hash_embed_bilstm_v1(pretrained_vectors, width, depth, embed_size):
|
||||
return component_models.Tok2Vec(
|
||||
width=width,
|
||||
embed_size=embed_size,
|
||||
pretrained_vectors=pretrained_vectors,
|
||||
bilstm_depth=depth,
|
||||
conv_depth=0,
|
||||
cnn_maxout_pieces=0,
|
||||
)
|
||||
|
||||
|
||||
@registry.architectures.register("tagger_model.v1")
|
||||
def build_tagger_model_v1(tok2vec):
|
||||
return component_models.build_tagger_model(nr_class=None, tok2vec=tok2vec)
|
||||
|
||||
|
||||
@registry.architectures.register("transition_based_parser.v1")
|
||||
def create_tb_parser_model(
|
||||
tok2vec: Model,
|
||||
nr_feature_tokens: StrictInt = 3,
|
||||
hidden_width: StrictInt = 64,
|
||||
maxout_pieces: StrictInt = 3,
|
||||
):
|
||||
from thinc.api import Linear, chain, list2array, use_ops, zero_init
|
||||
from spacy.ml._layers import PrecomputableAffine
|
||||
from spacy.syntax._parser_model import ParserModel
|
||||
|
||||
token_vector_width = tok2vec.get_dim("nO")
|
||||
tok2vec = chain(tok2vec, list2array())
|
||||
tok2vec.set_dim("nO", token_vector_width)
|
||||
|
||||
lower = PrecomputableAffine(
|
||||
hidden_width, nF=nr_feature_tokens, nI=tok2vec.get_dim("nO"), nP=maxout_pieces
|
||||
)
|
||||
lower.set_dim("nP", maxout_pieces)
|
||||
with use_ops("numpy"):
|
||||
# Initialize weights at zero, as it's a classification layer.
|
||||
upper = Linear(init_W=zero_init)
|
||||
return ParserModel(tok2vec, lower, upper)
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
# fmt: off
|
||||
train_path=("Location of JSON-formatted training data", "positional", None, Path),
|
||||
dev_path=("Location of JSON-formatted development data", "positional", None, Path),
|
||||
config_path=("Path to config file", "positional", None, Path),
|
||||
output_path=("Output directory to store model in", "option", "o", Path),
|
||||
meta_path=("Optional path to meta.json to use as base.", "option", "m", Path),
|
||||
raw_text=("Path to jsonl file with unlabelled text documents.", "option", "rt", Path),
|
||||
# fmt: on
|
||||
)
|
||||
def train_from_config_cli(
|
||||
train_path,
|
||||
dev_path,
|
||||
config_path,
|
||||
output_path=None,
|
||||
meta_path=None,
|
||||
raw_text=None,
|
||||
debug=False,
|
||||
verbose=False,
|
||||
):
|
||||
"""
|
||||
Train or update a spaCy model. Requires data to be formatted in spaCy's
|
||||
JSON format. To convert data from other formats, use the `spacy convert`
|
||||
command.
|
||||
"""
|
||||
if not config_path or not config_path.exists():
|
||||
msg.fail("Config file not found", config_path, exits=1)
|
||||
if not train_path or not train_path.exists():
|
||||
msg.fail("Training data not found", train_path, exits=1)
|
||||
if not dev_path or not dev_path.exists():
|
||||
msg.fail("Development data not found", dev_path, exits=1)
|
||||
if meta_path is not None and not meta_path.exists():
|
||||
msg.fail("Can't find model meta.json", meta_path, exits=1)
|
||||
if output_path is not None and not output_path.exists():
|
||||
output_path.mkdir()
|
||||
|
||||
try:
|
||||
train_from_config(
|
||||
config_path,
|
||||
{"train": train_path, "dev": dev_path},
|
||||
output_path=output_path,
|
||||
meta_path=meta_path,
|
||||
raw_text=raw_text,
|
||||
)
|
||||
except KeyboardInterrupt:
|
||||
msg.warn("Cancelled.")
|
||||
|
||||
|
||||
def train_from_config(
|
||||
config_path, data_paths, raw_text=None, meta_path=None, output_path=None,
|
||||
):
|
||||
msg.info(f"Loading config from: {config_path}")
|
||||
config = util.load_from_config(config_path, create_objects=True)
|
||||
use_gpu = config["training"]["use_gpu"]
|
||||
if use_gpu >= 0:
|
||||
msg.info("Using GPU")
|
||||
else:
|
||||
msg.info("Using CPU")
|
||||
msg.info("Creating nlp from config")
|
||||
nlp = create_nlp_from_config(**config["nlp"])
|
||||
optimizer = config["optimizer"]
|
||||
limit = config["training"]["limit"]
|
||||
msg.info("Loading training corpus")
|
||||
corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit)
|
||||
msg.info("Initializing the nlp pipeline")
|
||||
nlp.begin_training(lambda: corpus.train_examples, device=use_gpu)
|
||||
|
||||
train_batches = create_train_batches(nlp, corpus, config["training"])
|
||||
evaluate = create_evaluation_callback(nlp, optimizer, corpus, config["training"])
|
||||
|
||||
# Create iterator, which yields out info after each optimization step.
|
||||
msg.info("Start training")
|
||||
training_step_iterator = train_while_improving(
|
||||
nlp,
|
||||
optimizer,
|
||||
train_batches,
|
||||
evaluate,
|
||||
config["training"]["dropout"],
|
||||
config["training"]["patience"],
|
||||
config["training"]["eval_frequency"],
|
||||
)
|
||||
|
||||
msg.info(f"Training. Initial learn rate: {optimizer.learn_rate}")
|
||||
print_row = setup_printer(config)
|
||||
|
||||
try:
|
||||
progress = tqdm.tqdm(total=config["training"]["eval_frequency"], leave=False)
|
||||
for batch, info, is_best_checkpoint in training_step_iterator:
|
||||
progress.update(1)
|
||||
if is_best_checkpoint is not None:
|
||||
progress.close()
|
||||
print_row(info)
|
||||
if is_best_checkpoint and output_path is not None:
|
||||
nlp.to_disk(output_path)
|
||||
progress = tqdm.tqdm(
|
||||
total=config["training"]["eval_frequency"], leave=False
|
||||
)
|
||||
finally:
|
||||
if output_path is not None:
|
||||
with nlp.use_params(optimizer.averages):
|
||||
final_model_path = output_path / "model-final"
|
||||
nlp.to_disk(final_model_path)
|
||||
msg.good("Saved model to output directory", final_model_path)
|
||||
# with msg.loading("Creating best model..."):
|
||||
# best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
|
||||
# msg.good("Created best model", best_model_path)
|
||||
|
||||
|
||||
def create_nlp_from_config(lang, vectors, pipeline):
|
||||
lang_class = spacy.util.get_lang_class(lang)
|
||||
nlp = lang_class()
|
||||
if vectors is not None:
|
||||
spacy.cli.train._load_vectors(nlp, vectors)
|
||||
for name, component_cfg in pipeline.items():
|
||||
factory = component_cfg.pop("factory")
|
||||
component = nlp.create_pipe(factory, config=component_cfg)
|
||||
nlp.add_pipe(component, name=name)
|
||||
return nlp
|
||||
|
||||
|
||||
def create_train_batches(nlp, corpus, cfg):
|
||||
while True:
|
||||
train_examples = corpus.train_dataset(
|
||||
nlp,
|
||||
noise_level=0.0,
|
||||
orth_variant_level=cfg["orth_variant_level"],
|
||||
gold_preproc=cfg["gold_preproc"],
|
||||
max_length=cfg["max_length"],
|
||||
ignore_misaligned=True,
|
||||
)
|
||||
for batch in util.minibatch_by_words(train_examples, size=cfg["batch_size"]):
|
||||
yield batch
|
||||
|
||||
|
||||
def create_evaluation_callback(nlp, optimizer, corpus, cfg):
|
||||
def evaluate():
|
||||
with nlp.use_params(optimizer.averages):
|
||||
dev_examples = list(
|
||||
corpus.dev_dataset(
|
||||
nlp, gold_preproc=cfg["gold_preproc"], ignore_misaligned=True
|
||||
)
|
||||
)
|
||||
scorer = nlp.evaluate(dev_examples)
|
||||
scores = scorer.scores
|
||||
# Calculate a weighted sum based on score_weights for the main score
|
||||
weights = cfg["score_weights"]
|
||||
weighted_score = sum(scores[s] * weights.get(s, 0.0) for s in weights)
|
||||
return weighted_score, scorer.scores
|
||||
|
||||
return evaluate
|
||||
|
||||
|
||||
def train_while_improving(
|
||||
nlp, optimizer, train_data, evaluate, dropout, patience, eval_frequency
|
||||
):
|
||||
"""Train until an evaluation stops improving. Works as a generator,
|
||||
with each iteration yielding a tuple `(batch, info, is_best_checkpoint)`,
|
||||
where info is a dict, and is_best_checkpoint is in [True, False, None] --
|
||||
None indicating that the iteration was not evaluated as a checkpoint.
|
||||
The evaluation is conducted by calling the evaluate callback, which should
|
||||
|
||||
Positional arguments:
|
||||
nlp: The spaCy pipeline to evaluate.
|
||||
train_data (Iterable[Batch]): A generator of batches, with the training
|
||||
data. Each batch should be a Sized[Tuple[Input, Annot]]. The training
|
||||
data iterable needs to take care of iterating over the epochs and
|
||||
shuffling.
|
||||
evaluate (Callable[[], Tuple[float, Any]]): A callback to perform evaluation.
|
||||
The callback should take no arguments and return a tuple
|
||||
`(main_score, other_scores)`. The main_score should be a float where
|
||||
higher is better. other_scores can be any object.
|
||||
|
||||
Every iteration, the function yields out a tuple with:
|
||||
|
||||
* batch: A zipped sequence of Tuple[Doc, GoldParse] pairs.
|
||||
* info: A dict with various information about the last update (see below).
|
||||
* is_best_checkpoint: A value in None, False, True, indicating whether this
|
||||
was the best evaluation so far. You should use this to save the model
|
||||
checkpoints during training. If None, evaluation was not conducted on
|
||||
that iteration. False means evaluation was conducted, but a previous
|
||||
evaluation was better.
|
||||
|
||||
The info dict provides the following information:
|
||||
|
||||
epoch (int): How many passes over the data have been completed.
|
||||
step (int): How many steps have been completed.
|
||||
score (float): The main score form the last evaluation.
|
||||
other_scores: : The other scores from the last evaluation.
|
||||
loss: The accumulated losses throughout training.
|
||||
checkpoints: A list of previous results, where each result is a
|
||||
(score, step, epoch) tuple.
|
||||
"""
|
||||
if isinstance(dropout, float):
|
||||
dropouts = thinc.schedules.constant(dropout)
|
||||
else:
|
||||
dropouts = dropout
|
||||
results = []
|
||||
losses = {}
|
||||
for step, batch in enumerate(train_data):
|
||||
dropout = next(dropouts)
|
||||
for subbatch in subdivide_batch(batch):
|
||||
nlp.update(subbatch, drop=dropout, losses=losses, sgd=False)
|
||||
for name, proc in nlp.pipeline:
|
||||
if hasattr(proc, "model"):
|
||||
proc.model.finish_update(optimizer)
|
||||
optimizer.step_schedules()
|
||||
if not (step % eval_frequency):
|
||||
score, other_scores = evaluate()
|
||||
results.append((score, step))
|
||||
is_best_checkpoint = score == max(results)[0]
|
||||
else:
|
||||
score, other_scores = (None, None)
|
||||
is_best_checkpoint = None
|
||||
info = {
|
||||
"step": step,
|
||||
"score": score,
|
||||
"other_scores": other_scores,
|
||||
"losses": losses,
|
||||
"checkpoints": results,
|
||||
}
|
||||
yield batch, info, is_best_checkpoint
|
||||
if is_best_checkpoint is not None:
|
||||
losses = {}
|
||||
# Stop if no improvement in `patience` updates
|
||||
best_score, best_step = max(results)
|
||||
if (step - best_step) >= patience:
|
||||
break
|
||||
|
||||
|
||||
def subdivide_batch(batch):
|
||||
return [batch]
|
||||
|
||||
|
||||
def setup_printer(config):
|
||||
score_cols = config["training"]["scores"]
|
||||
score_widths = [max(len(col), 6) for col in score_cols]
|
||||
loss_cols = [f"Loss {pipe}" for pipe in config["nlp"]["pipeline"]]
|
||||
loss_widths = [max(len(col), 8) for col in loss_cols]
|
||||
table_header = ["#"] + loss_cols + score_cols + ["Score"]
|
||||
table_header = [col.upper() for col in table_header]
|
||||
table_widths = [6] + loss_widths + score_widths + [6]
|
||||
table_aligns = ["r" for _ in table_widths]
|
||||
|
||||
msg.row(table_header, widths=table_widths)
|
||||
msg.row(["-" * width for width in table_widths])
|
||||
|
||||
def print_row(info):
|
||||
losses = [
|
||||
"{0:.2f}".format(info["losses"].get(col, 0.0))
|
||||
for col in config["nlp"]["pipeline"]
|
||||
]
|
||||
scores = [
|
||||
"{0:.2f}".format(info["other_scores"].get(col, 0.0))
|
||||
for col in config["training"]["scores"]
|
||||
]
|
||||
data = [info["step"]] + losses + scores + ["{0:.2f}".format(info["score"])]
|
||||
msg.row(data, widths=table_widths, aligns=table_aligns)
|
||||
|
||||
return print_row
|
||||
|
||||
|
||||
@registry.architectures.register("tok2vec_tensors.v1")
|
||||
def tok2vec_tensors_v1(width):
|
||||
tok2vec = Tok2VecListener("tok2vec", width=width)
|
||||
return tok2vec
|
|
@ -1,14 +1,8 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
from pathlib import Path
|
||||
import sys
|
||||
import requests
|
||||
import srsly
|
||||
from wasabi import msg
|
||||
|
||||
from ..compat import path2str
|
||||
from ..util import get_data_path
|
||||
from .. import about
|
||||
|
||||
|
||||
|
@ -17,51 +11,30 @@ def validate():
|
|||
Validate that the currently installed version of spaCy is compatible
|
||||
with the installed models. Should be run after `pip install -U spacy`.
|
||||
"""
|
||||
with msg.loading("Loading compatibility table..."):
|
||||
r = requests.get(about.__compatibility__)
|
||||
if r.status_code != 200:
|
||||
msg.fail(
|
||||
"Server error ({})".format(r.status_code),
|
||||
"Couldn't fetch compatibility table.",
|
||||
exits=1,
|
||||
)
|
||||
msg.good("Loaded compatibility table")
|
||||
compat = r.json()["spacy"]
|
||||
version = about.__version__
|
||||
version = version.rsplit(".dev", 1)[0]
|
||||
current_compat = compat.get(version)
|
||||
model_pkgs, compat = get_model_pkgs()
|
||||
spacy_version = about.__version__.rsplit(".dev", 1)[0]
|
||||
current_compat = compat.get(spacy_version, {})
|
||||
if not current_compat:
|
||||
msg.fail(
|
||||
"Can't find spaCy v{} in compatibility table".format(version),
|
||||
about.__compatibility__,
|
||||
exits=1,
|
||||
)
|
||||
all_models = set()
|
||||
for spacy_v, models in dict(compat).items():
|
||||
all_models.update(models.keys())
|
||||
for model, model_vs in models.items():
|
||||
compat[spacy_v][model] = [reformat_version(v) for v in model_vs]
|
||||
model_links = get_model_links(current_compat)
|
||||
model_pkgs = get_model_pkgs(current_compat, all_models)
|
||||
incompat_links = {l for l, d in model_links.items() if not d["compat"]}
|
||||
msg.warn(f"No compatible models found for v{spacy_version} of spaCy")
|
||||
incompat_models = {d["name"] for _, d in model_pkgs.items() if not d["compat"]}
|
||||
incompat_models.update(
|
||||
[d["name"] for _, d in model_links.items() if not d["compat"]]
|
||||
)
|
||||
na_models = [m for m in incompat_models if m not in current_compat]
|
||||
update_models = [m for m in incompat_models if m in current_compat]
|
||||
spacy_dir = Path(__file__).parent.parent
|
||||
|
||||
msg.divider("Installed models (spaCy v{})".format(about.__version__))
|
||||
msg.info("spaCy installation: {}".format(path2str(spacy_dir)))
|
||||
msg.divider(f"Installed models (spaCy v{about.__version__})")
|
||||
msg.info(f"spaCy installation: {spacy_dir}")
|
||||
|
||||
if model_links or model_pkgs:
|
||||
header = ("TYPE", "NAME", "MODEL", "VERSION", "")
|
||||
if model_pkgs:
|
||||
header = ("NAME", "VERSION", "")
|
||||
rows = []
|
||||
for name, data in model_pkgs.items():
|
||||
rows.append(get_model_row(current_compat, name, data, msg))
|
||||
for name, data in model_links.items():
|
||||
rows.append(get_model_row(current_compat, name, data, msg, "link"))
|
||||
if data["compat"]:
|
||||
comp = msg.text("", color="green", icon="good", no_print=True)
|
||||
version = msg.text(data["version"], color="green", no_print=True)
|
||||
else:
|
||||
version = msg.text(data["version"], color="red", no_print=True)
|
||||
comp = f"--> {compat.get(data['name'], ['n/a'])[0]}"
|
||||
rows.append((data["name"], version, comp))
|
||||
msg.table(rows, header=header)
|
||||
else:
|
||||
msg.text("No models found in your current environment.", exits=0)
|
||||
|
@ -71,44 +44,32 @@ def validate():
|
|||
cmd = "python -m spacy download {}"
|
||||
print("\n".join([cmd.format(pkg) for pkg in update_models]) + "\n")
|
||||
if na_models:
|
||||
msg.text(
|
||||
"The following models are not available for spaCy "
|
||||
"v{}: {}".format(about.__version__, ", ".join(na_models))
|
||||
msg.warn(
|
||||
f"The following models are not available for spaCy v{about.__version__}:",
|
||||
", ".join(na_models),
|
||||
)
|
||||
if incompat_links:
|
||||
msg.text(
|
||||
"You may also want to overwrite the incompatible links using the "
|
||||
"`python -m spacy link` command with `--force`, or remove them "
|
||||
"from the data directory. "
|
||||
"Data path: {path}".format(path=path2str(get_data_path()))
|
||||
)
|
||||
if incompat_models or incompat_links:
|
||||
if incompat_models:
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def get_model_links(compat):
|
||||
links = {}
|
||||
data_path = get_data_path()
|
||||
if data_path:
|
||||
models = [p for p in data_path.iterdir() if is_model_path(p)]
|
||||
for model in models:
|
||||
meta_path = Path(model) / "meta.json"
|
||||
if not meta_path.exists():
|
||||
continue
|
||||
meta = srsly.read_json(meta_path)
|
||||
link = model.parts[-1]
|
||||
name = meta["lang"] + "_" + meta["name"]
|
||||
links[link] = {
|
||||
"name": name,
|
||||
"version": meta["version"],
|
||||
"compat": is_compat(compat, name, meta["version"]),
|
||||
}
|
||||
return links
|
||||
|
||||
|
||||
def get_model_pkgs(compat, all_models):
|
||||
def get_model_pkgs():
|
||||
import pkg_resources
|
||||
|
||||
with msg.loading("Loading compatibility table..."):
|
||||
r = requests.get(about.__compatibility__)
|
||||
if r.status_code != 200:
|
||||
msg.fail(
|
||||
f"Server error ({r.status_code})",
|
||||
"Couldn't fetch compatibility table.",
|
||||
exits=1,
|
||||
)
|
||||
msg.good("Loaded compatibility table")
|
||||
compat = r.json()["spacy"]
|
||||
all_models = set()
|
||||
for spacy_v, models in dict(compat).items():
|
||||
all_models.update(models.keys())
|
||||
for model, model_vs in models.items():
|
||||
compat[spacy_v][model] = [reformat_version(v) for v in model_vs]
|
||||
pkgs = {}
|
||||
for pkg_name, pkg_data in pkg_resources.working_set.by_key.items():
|
||||
package = pkg_name.replace("-", "_")
|
||||
|
@ -117,29 +78,9 @@ def get_model_pkgs(compat, all_models):
|
|||
pkgs[pkg_name] = {
|
||||
"name": package,
|
||||
"version": version,
|
||||
"compat": is_compat(compat, package, version),
|
||||
"compat": package in compat and version in compat[package],
|
||||
}
|
||||
return pkgs
|
||||
|
||||
|
||||
def get_model_row(compat, name, data, msg, model_type="package"):
|
||||
if data["compat"]:
|
||||
comp = msg.text("", color="green", icon="good", no_print=True)
|
||||
version = msg.text(data["version"], color="green", no_print=True)
|
||||
else:
|
||||
version = msg.text(data["version"], color="red", no_print=True)
|
||||
comp = "--> {}".format(compat.get(data["name"], ["n/a"])[0])
|
||||
return (model_type, name, data["name"], version, comp)
|
||||
|
||||
|
||||
def is_model_path(model_path):
|
||||
exclude = ["cache", "pycache", "__pycache__"]
|
||||
name = model_path.parts[-1]
|
||||
return model_path.is_dir() and name not in exclude and not name.startswith(".")
|
||||
|
||||
|
||||
def is_compat(compat, name, version):
|
||||
return name in compat and version in compat[name]
|
||||
return pkgs, compat
|
||||
|
||||
|
||||
def reformat_version(version):
|
||||
|
|
129
spacy/compat.py
129
spacy/compat.py
|
@ -1,4 +1,3 @@
|
|||
# coding: utf8
|
||||
"""
|
||||
Helpers for Python and platform compatibility. To distinguish them from
|
||||
the builtin functions, replacement functions are suffixed with an underscore,
|
||||
|
@ -6,15 +5,9 @@ e.g. `unicode_`.
|
|||
|
||||
DOCS: https://spacy.io/api/top-level#compat
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import os
|
||||
import sys
|
||||
import itertools
|
||||
import ast
|
||||
import types
|
||||
|
||||
from thinc.neural.util import copy_array
|
||||
from thinc.util import copy_array
|
||||
|
||||
try:
|
||||
import cPickle as pickle
|
||||
|
@ -36,91 +29,23 @@ try:
|
|||
except ImportError:
|
||||
cupy = None
|
||||
|
||||
try:
|
||||
from thinc.neural.optimizers import Optimizer # noqa: F401
|
||||
except ImportError:
|
||||
from thinc.neural.optimizers import Adam as Optimizer # noqa: F401
|
||||
from thinc.api import Optimizer # noqa: F401
|
||||
|
||||
pickle = pickle
|
||||
copy_reg = copy_reg
|
||||
CudaStream = CudaStream
|
||||
cupy = cupy
|
||||
copy_array = copy_array
|
||||
izip = getattr(itertools, "izip", zip)
|
||||
|
||||
is_windows = sys.platform.startswith("win")
|
||||
is_linux = sys.platform.startswith("linux")
|
||||
is_osx = sys.platform == "darwin"
|
||||
|
||||
# See: https://github.com/benjaminp/six/blob/master/six.py
|
||||
is_python2 = sys.version_info[0] == 2
|
||||
is_python3 = sys.version_info[0] == 3
|
||||
is_python_pre_3_5 = is_python2 or (is_python3 and sys.version_info[1] < 5)
|
||||
|
||||
if is_python2:
|
||||
bytes_ = str
|
||||
unicode_ = unicode # noqa: F821
|
||||
basestring_ = basestring # noqa: F821
|
||||
input_ = raw_input # noqa: F821
|
||||
path2str = lambda path: str(path).decode("utf8")
|
||||
class_types = (type, types.ClassType)
|
||||
|
||||
elif is_python3:
|
||||
bytes_ = bytes
|
||||
unicode_ = str
|
||||
basestring_ = str
|
||||
input_ = input
|
||||
path2str = lambda path: str(path)
|
||||
class_types = (type, types.ClassType) if is_python_pre_3_5 else type
|
||||
|
||||
|
||||
def b_to_str(b_str):
|
||||
"""Convert a bytes object to a string.
|
||||
|
||||
b_str (bytes): The object to convert.
|
||||
RETURNS (unicode): The converted string.
|
||||
"""
|
||||
if is_python2:
|
||||
return b_str
|
||||
# Important: if no encoding is set, string becomes "b'...'"
|
||||
return str(b_str, encoding="utf8")
|
||||
|
||||
|
||||
def symlink_to(orig, dest):
|
||||
"""Create a symlink. Used for model shortcut links.
|
||||
|
||||
orig (unicode / Path): The origin path.
|
||||
dest (unicode / Path): The destination path of the symlink.
|
||||
"""
|
||||
if is_windows:
|
||||
import subprocess
|
||||
|
||||
subprocess.check_call(
|
||||
["mklink", "/d", path2str(orig), path2str(dest)], shell=True
|
||||
)
|
||||
else:
|
||||
orig.symlink_to(dest)
|
||||
|
||||
|
||||
def symlink_remove(link):
|
||||
"""Remove a symlink. Used for model shortcut links.
|
||||
|
||||
link (unicode / Path): The path to the symlink.
|
||||
"""
|
||||
# https://stackoverflow.com/q/26554135/6400719
|
||||
if os.path.isdir(path2str(link)) and is_windows:
|
||||
# this should only be on Py2.7 and windows
|
||||
os.rmdir(path2str(link))
|
||||
else:
|
||||
os.unlink(path2str(link))
|
||||
|
||||
|
||||
def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
|
||||
def is_config(windows=None, linux=None, osx=None, **kwargs):
|
||||
"""Check if a specific configuration of Python version and operating system
|
||||
matches the user's setup. Mostly used to display targeted error messages.
|
||||
|
||||
python2 (bool): spaCy is executed with Python 2.x.
|
||||
python3 (bool): spaCy is executed with Python 3.x.
|
||||
windows (bool): spaCy is executed on Windows.
|
||||
linux (bool): spaCy is executed on Linux.
|
||||
osx (bool): spaCy is executed on OS X or macOS.
|
||||
|
@ -129,53 +54,7 @@ def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
|
|||
DOCS: https://spacy.io/api/top-level#compat.is_config
|
||||
"""
|
||||
return (
|
||||
python2 in (None, is_python2)
|
||||
and python3 in (None, is_python3)
|
||||
and windows in (None, is_windows)
|
||||
windows in (None, is_windows)
|
||||
and linux in (None, is_linux)
|
||||
and osx in (None, is_osx)
|
||||
)
|
||||
|
||||
|
||||
def import_file(name, loc):
|
||||
"""Import module from a file. Used to load models from a directory.
|
||||
|
||||
name (unicode): Name of module to load.
|
||||
loc (unicode / Path): Path to the file.
|
||||
RETURNS: The loaded module.
|
||||
"""
|
||||
loc = path2str(loc)
|
||||
if is_python_pre_3_5:
|
||||
import imp
|
||||
|
||||
return imp.load_source(name, loc)
|
||||
else:
|
||||
import importlib.util
|
||||
|
||||
spec = importlib.util.spec_from_file_location(name, str(loc))
|
||||
module = importlib.util.module_from_spec(spec)
|
||||
spec.loader.exec_module(module)
|
||||
return module
|
||||
|
||||
|
||||
def unescape_unicode(string):
|
||||
"""Python2.7's re module chokes when compiling patterns that have ranges
|
||||
between escaped unicode codepoints if the two codepoints are unrecognised
|
||||
in the unicode database. For instance:
|
||||
|
||||
re.compile('[\\uAA77-\\uAA79]').findall("hello")
|
||||
|
||||
Ends up matching every character (on Python 2). This problem doesn't occur
|
||||
if we're dealing with unicode literals.
|
||||
"""
|
||||
if string is None:
|
||||
return string
|
||||
# We only want to unescape the unicode, so we first must protect the other
|
||||
# backslashes.
|
||||
string = string.replace("\\", "\\\\")
|
||||
# Now we remove that protection for the unicode.
|
||||
string = string.replace("\\\\u", "\\u")
|
||||
string = string.replace("\\\\U", "\\U")
|
||||
# Now we unescape by evaling the string with the AST. This can't execute
|
||||
# code -- it only does the representational level.
|
||||
return ast.literal_eval("u'''" + string + "'''")
|
||||
|
|
|
@ -1,15 +1,11 @@
|
|||
# coding: utf8
|
||||
"""
|
||||
spaCy's built in visualization suite for dependencies and named entities.
|
||||
|
||||
DOCS: https://spacy.io/api/top-level#displacy
|
||||
USAGE: https://spacy.io/usage/visualizers
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .render import DependencyRenderer, EntityRenderer
|
||||
from ..tokens import Doc, Span
|
||||
from ..compat import b_to_str
|
||||
from ..errors import Errors, Warnings, user_warning
|
||||
from ..util import is_in_jupyter
|
||||
|
||||
|
@ -93,20 +89,20 @@ def serve(
|
|||
|
||||
render(docs, style=style, page=page, minify=minify, options=options, manual=manual)
|
||||
httpd = simple_server.make_server(host, port, app)
|
||||
print("\nUsing the '{}' visualizer".format(style))
|
||||
print("Serving on http://{}:{} ...\n".format(host, port))
|
||||
print(f"\nUsing the '{style}' visualizer")
|
||||
print(f"Serving on http://{host}:{port} ...\n")
|
||||
try:
|
||||
httpd.serve_forever()
|
||||
except KeyboardInterrupt:
|
||||
print("Shutting down server on port {}.".format(port))
|
||||
print(f"Shutting down server on port {port}.")
|
||||
finally:
|
||||
httpd.server_close()
|
||||
|
||||
|
||||
def app(environ, start_response):
|
||||
# Headers and status need to be bytes in Python 2, see #1227
|
||||
headers = [(b_to_str(b"Content-type"), b_to_str(b"text/html; charset=utf-8"))]
|
||||
start_response(b_to_str(b"200 OK"), headers)
|
||||
headers = [("Content-type", "text/html; charset=utf-8")]
|
||||
start_response("200 OK", headers)
|
||||
res = _html["parsed"].encode(encoding="utf-8")
|
||||
return [res]
|
||||
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import uuid
|
||||
|
||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
|
||||
|
@ -55,7 +52,7 @@ class DependencyRenderer(object):
|
|||
settings = p.get("settings", {})
|
||||
self.direction = settings.get("direction", DEFAULT_DIR)
|
||||
self.lang = settings.get("lang", DEFAULT_LANG)
|
||||
render_id = "{}-{}".format(id_prefix, i)
|
||||
render_id = f"{id_prefix}-{i}"
|
||||
svg = self.render_svg(render_id, p["words"], p["arcs"])
|
||||
rendered.append(svg)
|
||||
if page:
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
# Setting explicit height and max-width: none on the SVG is required for
|
||||
# Jupyter to render it properly in a cell
|
||||
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import os
|
||||
import warnings
|
||||
import inspect
|
||||
|
@ -12,7 +9,7 @@ def add_codes(err_cls):
|
|||
class ErrorsWithCodes(object):
|
||||
def __getattribute__(self, code):
|
||||
msg = getattr(err_cls, code)
|
||||
return "[{code}] {msg}".format(code=code, msg=msg)
|
||||
return f"[{code}] {msg}"
|
||||
|
||||
return ErrorsWithCodes()
|
||||
|
||||
|
@ -97,8 +94,6 @@ class Warnings(object):
|
|||
"you can ignore this warning by setting SPACY_WARNING_IGNORE=W022. "
|
||||
"If this is surprising, make sure you have the spacy-lookups-data "
|
||||
"package installed.")
|
||||
W023 = ("Multiprocessing of Language.pipe is not supported in Python 2. "
|
||||
"'n_process' will be set to 1.")
|
||||
W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in "
|
||||
"the Knowledge Base.")
|
||||
W025 = ("'{name}' requires '{attr}' to be assigned, but none of the "
|
||||
|
@ -107,7 +102,9 @@ class Warnings(object):
|
|||
W027 = ("Found a large training file of {size} bytes. Note that it may "
|
||||
"be more efficient to split your training data into multiple "
|
||||
"smaller JSON files instead.")
|
||||
|
||||
W028 = ("Skipping unsupported morphological feature(s): {feature}. "
|
||||
"Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
|
||||
"string \"Field1=Value1,Value2|Field2=Value3\".")
|
||||
|
||||
|
||||
@add_codes
|
||||
|
@ -227,13 +224,8 @@ class Errors(object):
|
|||
E047 = ("Can't assign a value to unregistered extension attribute "
|
||||
"'{name}'. Did you forget to call the `set_extension` method?")
|
||||
E048 = ("Can't import language {lang} from spacy.lang: {err}")
|
||||
E049 = ("Can't find spaCy data directory: '{path}'. Check your "
|
||||
"installation and permissions, or use spacy.util.set_data_path "
|
||||
"to customise the location if necessary.")
|
||||
E050 = ("Can't find model '{name}'. It doesn't seem to be a shortcut "
|
||||
"link, a Python package or a valid path to a data directory.")
|
||||
E051 = ("Cant' load '{name}'. If you're using a shortcut link, make sure "
|
||||
"it points to a valid package (not just a data directory).")
|
||||
E050 = ("Can't find model '{name}'. It doesn't seem to be a Python "
|
||||
"package or a valid path to a data directory.")
|
||||
E052 = ("Can't find model directory: {path}")
|
||||
E053 = ("Could not read meta.json from {path}")
|
||||
E054 = ("No valid '{setting}' setting found in model meta.json.")
|
||||
|
@ -424,8 +416,6 @@ class Errors(object):
|
|||
E134 = ("Entity '{entity}' is not defined in the Knowledge Base.")
|
||||
E135 = ("If you meant to replace a built-in component, use `create_pipe`: "
|
||||
"`nlp.replace_pipe('{name}', nlp.create_pipe('{name}'))`")
|
||||
E136 = ("This additional feature requires the jsonschema library to be "
|
||||
"installed:\npip install jsonschema")
|
||||
E137 = ("Expected 'dict' type, but got '{type}' from '{line}'. Make sure "
|
||||
"to provide a valid JSON object as input with either the `text` "
|
||||
"or `tokens` key. For more info, see the docs:\n"
|
||||
|
@ -541,6 +531,15 @@ class Errors(object):
|
|||
E188 = ("Could not match the gold entity links to entities in the doc - "
|
||||
"make sure the gold EL data refers to valid results of the "
|
||||
"named entity recognizer in the `nlp` pipeline.")
|
||||
# TODO: fix numbering after merging develop into master
|
||||
E996 = ("Could not parse {file}: {msg}")
|
||||
E997 = ("Tokenizer special cases are not allowed to modify the text. "
|
||||
"This would map '{chunk}' to '{orth}' given token attributes "
|
||||
"'{token_attrs}'.")
|
||||
E998 = ("Can only create GoldParse objects from Example objects without a "
|
||||
"Doc if get_gold_parses() is called with a Vocab object.")
|
||||
E999 = ("Encountered an unexpected format for the dictionary holding "
|
||||
"gold annotations: {gold_dict}")
|
||||
|
||||
|
||||
@add_codes
|
||||
|
@ -566,10 +565,10 @@ class MatchPatternError(ValueError):
|
|||
errors (dict): Validation errors (sequence of strings) mapped to pattern
|
||||
ID, i.e. the index of the added pattern.
|
||||
"""
|
||||
msg = "Invalid token patterns for matcher rule '{}'\n".format(key)
|
||||
msg = f"Invalid token patterns for matcher rule '{key}'\n"
|
||||
for pattern_idx, error_msgs in errors.items():
|
||||
pattern_errors = "\n".join(["- {}".format(e) for e in error_msgs])
|
||||
msg += "\nPattern {}:\n{}\n".format(pattern_idx, pattern_errors)
|
||||
pattern_errors = "\n".join([f"- {e}" for e in error_msgs])
|
||||
msg += f"\nPattern {pattern_idx}:\n{pattern_errors}\n"
|
||||
ValueError.__init__(self, msg)
|
||||
|
||||
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
def explain(term):
|
||||
"""Get a description for a given POS tag, dependency label or entity type.
|
||||
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
from cymem.cymem cimport Pool
|
||||
|
||||
from .structs cimport TokenC
|
||||
from .tokens import Doc
|
||||
from .typedefs cimport attr_t
|
||||
from .syntax.transition_system cimport Transition
|
||||
|
||||
|
@ -19,23 +19,49 @@ cdef class GoldParse:
|
|||
cdef Pool mem
|
||||
|
||||
cdef GoldParseC c
|
||||
cdef readonly TokenAnnotation orig
|
||||
|
||||
cdef int length
|
||||
cdef public int loss
|
||||
cdef public list words
|
||||
cdef public list tags
|
||||
cdef public list morphology
|
||||
cdef public list pos
|
||||
cdef public list morphs
|
||||
cdef public list lemmas
|
||||
cdef public list sent_starts
|
||||
cdef public list heads
|
||||
cdef public list labels
|
||||
cdef public dict orths
|
||||
cdef public list ner
|
||||
cdef public list ents
|
||||
cdef public dict brackets
|
||||
cdef public object cats
|
||||
cdef public dict cats
|
||||
cdef public dict links
|
||||
|
||||
cdef readonly list cand_to_gold
|
||||
cdef readonly list gold_to_cand
|
||||
cdef readonly list orig_annot
|
||||
|
||||
|
||||
cdef class TokenAnnotation:
|
||||
cdef public list ids
|
||||
cdef public list words
|
||||
cdef public list tags
|
||||
cdef public list pos
|
||||
cdef public list morphs
|
||||
cdef public list lemmas
|
||||
cdef public list heads
|
||||
cdef public list deps
|
||||
cdef public list entities
|
||||
cdef public list sent_starts
|
||||
cdef public list brackets
|
||||
|
||||
|
||||
cdef class DocAnnotation:
|
||||
cdef public object cats
|
||||
cdef public object links
|
||||
|
||||
|
||||
cdef class Example:
|
||||
cdef public object doc
|
||||
cdef public TokenAnnotation token_annotation
|
||||
cdef public DocAnnotation doc_annotation
|
||||
cdef public object goldparse
|
||||
|
|
830
spacy/gold.pyx
830
spacy/gold.pyx
File diff suppressed because it is too large
Load Diff
|
@ -6,7 +6,7 @@ from libcpp.vector cimport vector
|
|||
from libc.stdint cimport int32_t, int64_t
|
||||
from libc.stdio cimport FILE
|
||||
|
||||
from spacy.vocab cimport Vocab
|
||||
from .vocab cimport Vocab
|
||||
from .typedefs cimport hash_t
|
||||
|
||||
from .structs cimport KBEntryC, AliasC
|
||||
|
@ -113,7 +113,7 @@ cdef class KnowledgeBase:
|
|||
return new_index
|
||||
|
||||
cdef inline void _create_empty_vectors(self, hash_t dummy_hash) nogil:
|
||||
"""
|
||||
"""
|
||||
Initializing the vectors and making sure the first element of each vector is a dummy,
|
||||
because the PreshMap maps pointing to indices in these vectors can not contain 0 as value
|
||||
cf. https://github.com/explosion/preshed/issues/17
|
||||
|
@ -169,4 +169,3 @@ cdef class Reader:
|
|||
cdef int read_alias(self, int64_t* entry_index, float* prob) except -1
|
||||
|
||||
cdef int _read(self, void* value, size_t size) except -1
|
||||
|
||||
|
|
15
spacy/kb.pyx
15
spacy/kb.pyx
|
@ -1,22 +1,17 @@
|
|||
# cython: infer_types=True
|
||||
# cython: profile=True
|
||||
# coding: utf8
|
||||
from spacy.errors import Errors, Warnings, user_warning
|
||||
|
||||
from pathlib import Path
|
||||
from cymem.cymem cimport Pool
|
||||
from preshed.maps cimport PreshMap
|
||||
|
||||
from cpython.exc cimport PyErr_SetFromErrno
|
||||
|
||||
from libc.stdio cimport fopen, fclose, fread, fwrite, feof, fseek
|
||||
from libc.stdint cimport int32_t, int64_t
|
||||
|
||||
from .typedefs cimport hash_t
|
||||
|
||||
from os import path
|
||||
from libcpp.vector cimport vector
|
||||
|
||||
from .typedefs cimport hash_t
|
||||
from .errors import Errors, Warnings, user_warning
|
||||
|
||||
|
||||
cdef class Candidate:
|
||||
"""A `Candidate` object refers to a textual mention (`alias`) that may or may not be resolved
|
||||
|
@ -447,7 +442,7 @@ cdef class KnowledgeBase:
|
|||
cdef class Writer:
|
||||
def __init__(self, object loc):
|
||||
if path.exists(loc):
|
||||
assert not path.isdir(loc), "%s is directory." % loc
|
||||
assert not path.isdir(loc), f"{loc} is directory"
|
||||
if isinstance(loc, Path):
|
||||
loc = bytes(loc)
|
||||
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
|
||||
|
@ -584,5 +579,3 @@ cdef class Reader:
|
|||
cdef int _read(self, void* value, size_t size) except -1:
|
||||
status = fread(value, size, 1, self._fp)
|
||||
return status
|
||||
|
||||
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
# Source: https://github.com/stopwords-iso/stopwords-af
|
||||
|
||||
STOP_WORDS = set(
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -1,5 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
from ...attrs import LIKE_NUM
|
||||
|
||||
_num_words = set(
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
|
||||
from ..char_classes import UNITS, ALPHA_UPPER
|
||||
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
من
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import ORTH, LEMMA
|
||||
|
||||
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
# Source: https://github.com/Alir3z4/stop-words
|
||||
|
||||
STOP_WORDS = set(
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
|
||||
from .tag_map import TAG_MAP
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import LEMMA, PRON_LEMMA
|
||||
|
||||
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_ICONS
|
||||
from ..char_classes import ALPHA_LOWER, ALPHA, HYPHENS, CONCAT_QUOTES, UNITS
|
||||
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
অতএব অথচ অথবা অনুযায়ী অনেক অনেকে অনেকেই অন্তত অবধি অবশ্য অর্থাৎ অন্য অনুযায়ী অর্ধভাগে
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, PUNCT, ADJ, CONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB
|
||||
from ...symbols import CCONJ, NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, SYM
|
||||
|
||||
|
@ -14,8 +11,8 @@ TAG_MAP = {
|
|||
'""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||
":": {POS: PUNCT},
|
||||
"৳": {POS: SYM, "Other": {"SymType": "currency"}},
|
||||
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
|
||||
"৳": {POS: SYM, "SymType": "currency"},
|
||||
"#": {POS: SYM, "SymType": "numbersign"},
|
||||
"AFX": {POS: ADJ, "Hyph": "yes"},
|
||||
"CC": {POS: CONJ, "ConjType": "coor"},
|
||||
"CD": {POS: NUM, "NumType": "card"},
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding=utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import ORTH, LEMMA
|
||||
|
||||
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import LIKE_NUM
|
||||
|
||||
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..punctuation import TOKENIZER_INFIXES
|
||||
from ..char_classes import ALPHA
|
||||
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
a abans ací ah així això al aleshores algun alguna algunes alguns alhora allà allí allò
|
||||
|
|
|
@ -1,28 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
|
||||
from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ
|
||||
|
||||
|
||||
TAG_MAP = {
|
||||
"ADV": {POS: ADV},
|
||||
"NOUN": {POS: NOUN},
|
||||
"ADP": {POS: ADP},
|
||||
"PRON": {POS: PRON},
|
||||
"SCONJ": {POS: SCONJ},
|
||||
"PROPN": {POS: PROPN},
|
||||
"DET": {POS: DET},
|
||||
"SYM": {POS: SYM},
|
||||
"INTJ": {POS: INTJ},
|
||||
"PUNCT": {POS: PUNCT},
|
||||
"NUM": {POS: NUM},
|
||||
"AUX": {POS: AUX},
|
||||
"X": {POS: X},
|
||||
"CONJ": {POS: CONJ},
|
||||
"CCONJ": {POS: CCONJ},
|
||||
"ADJ": {POS: ADJ},
|
||||
"VERB": {POS: VERB},
|
||||
"PART": {POS: PART},
|
||||
"SP": {POS: SPACE},
|
||||
}
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import ORTH, LEMMA
|
||||
|
||||
|
||||
|
@ -33,9 +30,9 @@ _exc["12m."] = [{ORTH: "12"}, {ORTH: "m.", LEMMA: "p.m."}]
|
|||
|
||||
for h in range(1, 12 + 1):
|
||||
for period in ["a.m.", "am"]:
|
||||
_exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "a.m."}]
|
||||
_exc[f"{h}{period}"] = [{ORTH: f"{h}"}, {ORTH: period, LEMMA: "a.m."}]
|
||||
for period in ["p.m.", "pm"]:
|
||||
_exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "p.m."}]
|
||||
_exc[f"{h}{period}"] = [{ORTH: f"{h}"}, {ORTH: period, LEMMA: "p.m."}]
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
split_chars = lambda char: list(char.strip().split(" "))
|
||||
merge_chars = lambda char: char.strip().replace(" ", "|")
|
||||
group_chars = lambda char: char.strip().replace(" ", "")
|
||||
|
|
|
@ -1,6 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
# Source: https://github.com/Alir3z4/stop-words
|
||||
|
||||
STOP_WORDS = set(
|
||||
|
|
|
@ -1,13 +1,9 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .norm_exceptions import NORM_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .morph_rules import MORPH_RULES
|
||||
from ..tag_map import TAG_MAP
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
|
@ -27,7 +23,6 @@ class DanishDefaults(Language.Defaults):
|
|||
morph_rules = MORPH_RULES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
|
||||
|
||||
|
|
|
@ -1,7 +1,3 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
|
|
Some files were not shown because too many files have changed in this diff Show More
Loading…
Reference in New Issue
Block a user