diff --git a/.gitignore b/.gitignore index edcbba4d5..98ca1f2fb 100644 --- a/.gitignore +++ b/.gitignore @@ -44,6 +44,7 @@ __pycache__/ .env* .~env/ .venv +env3.6/ venv/ env3.*/ .dev @@ -118,3 +119,6 @@ Desktop.ini # Pycharm project files *.idea + +# IPython +.ipynb_checkpoints/ diff --git a/.travis.yml b/.travis.yml deleted file mode 100644 index e3ce53024..000000000 --- a/.travis.yml +++ /dev/null @@ -1,23 +0,0 @@ -language: python -sudo: false -cache: pip -dist: trusty -group: edge -python: - - "2.7" -os: - - linux -install: - - "pip install -r requirements.txt" - - "python setup.py build_ext --inplace" - - "pip install -e ." -script: - - "cat /proc/cpuinfo | grep flags | head -n 1" - - "python -m pytest --tb=native spacy" -branches: - except: - - spacy.io -notifications: - slack: - secure: F8GvqnweSdzImuLL64TpfG0i5rYl89liyr9tmFVsHl4c0DNiDuGhZivUz0M1broS8svE3OPOllLfQbACG/4KxD890qfF9MoHzvRDlp7U+RtwMV/YAkYn8MGWjPIbRbX0HpGdY7O2Rc9Qy4Kk0T8ZgiqXYIqAz2Eva9/9BlSmsJQ= - email: false diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 3c2b56cd3..6b7881dd2 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -280,23 +280,7 @@ except: # noqa: E722 ### Python conventions -All Python code must be written in an **intersection of Python 2 and Python 3**. -This is easy in Cython, but somewhat ugly in Python. Logic that deals with -Python or platform compatibility should only live in -[`spacy.compat`](spacy/compat.py). To distinguish them from the builtin -functions, replacement functions are suffixed with an underscore, for example -`unicode_`. If you need to access the user's version or platform information, -for example to show more specific error messages, you can use the `is_config()` -helper function. - -```python -from .compat import unicode_, is_config - -compatible_unicode = unicode_('hello world') -if is_config(windows=True, python2=True): - print("You are using Python 2 on Windows.") -``` - +All Python code must be written **compatible with Python 3.6+**. Code that interacts with the file-system should accept objects that follow the `pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`. If the function is user-facing and takes a path as an argument, it should check diff --git a/MANIFEST.in b/MANIFEST.in index 1947b9140..e6d25284f 100644 --- a/MANIFEST.in +++ b/MANIFEST.in @@ -1,5 +1,5 @@ recursive-include include *.h -recursive-include spacy *.txt *.pyx *.pxd +recursive-include spacy *.pyx *.pxd *.txt *.cfg include LICENSE include README.md include bin/spacy diff --git a/README.md b/README.md index 31dc78d63..500431b9f 100644 --- a/README.md +++ b/README.md @@ -15,7 +15,6 @@ It's commercial open-source software, released under the MIT license. 
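The CONTRIBUTING.md section above asks that code interacting with the file-system accept any object following the `pathlib.Path` API rather than requiring `pathlib.Path` itself, and that user-facing functions convert string path arguments. A minimal sketch of that convention (the `read_model_meta` helper is illustrative and not part of this patch; spaCy's own `spacy.util.ensure_path` follows the same idea):

```python
from pathlib import Path


def ensure_path(path):
    """Convert plain strings to Path objects; leave Path-like objects untouched."""
    return Path(path) if isinstance(path, str) else path


def read_model_meta(model_dir):
    """Illustrative user-facing function: accepts a str or any Path-like object."""
    model_dir = ensure_path(model_dir)
    # From here on, rely only on the pathlib.Path API, without isinstance checks.
    meta_path = model_dir / "meta.json"
    with meta_path.open("r", encoding="utf8") as file_:
        return file_.read()
```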
[Check out the release notes here.](https://github.com/explosion/spaCy/releases) [![Azure Pipelines]()](https://dev.azure.com/explosion-ai/public/_build?definitionId=8) -[![Travis Build Status]()](https://travis-ci.org/explosion/spaCy) [![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square&logo=github)](https://github.com/explosion/spaCy/releases) [![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy/) [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/spacy) @@ -98,12 +97,19 @@ For detailed installation instructions, see the - **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio) -- **Python version**: Python 2.7, 3.5+ (only 64 bit) +- **Python version**: Python 3.6+ (only 64 bit) - **Package managers**: [pip] · [conda] (via `conda-forge`) [pip]: https://pypi.org/project/spacy/ [conda]: https://anaconda.org/conda-forge/spacy +> ⚠️ **Important note for Python 3.8:** We can't yet ship pre-compiled binary +> wheels for spaCy that work on Python 3.8, as we're still waiting for our CI +> providers and other tooling to support it. This means that in order to run +> spaCy on Python 3.8, you'll need [a compiler installed](#source) and compile +> the library and its Cython dependencies locally. If this is causing problems +> for you, the easiest solution is to **use Python 3.7** in the meantime. + ### pip Using pip, spaCy releases are available as source packages and binary wheels (as @@ -262,9 +268,7 @@ and git preinstalled. Install a version of the [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) or [Visual Studio Express](https://visualstudio.microsoft.com/vs/express/) that -matches the version that was used to compile your Python interpreter. For -official distributions these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and -VS 2015 (Python 3.5). +matches the version that was used to compile your Python interpreter. ## Run tests diff --git a/azure-pipelines.yml b/azure-pipelines.yml index 147d2e903..4dfb51296 100644 --- a/azure-pipelines.yml +++ b/azure-pipelines.yml @@ -27,7 +27,7 @@ jobs: inputs: versionSpec: '3.7' - script: | - pip install flake8 + pip install flake8==3.5.0 python -m flake8 spacy --count --select=E901,E999,F821,F822,F823 --show-source --statistics displayName: 'flake8' @@ -35,12 +35,6 @@ jobs: dependsOn: 'Validate' strategy: matrix: - Python35Linux: - imageName: 'ubuntu-16.04' - python.version: '3.5' - Python35Windows: - imageName: 'vs2017-win2016' - python.version: '3.5' Python36Linux: imageName: 'ubuntu-16.04' python.version: '3.6' @@ -58,7 +52,7 @@ jobs: # imageName: 'vs2017-win2016' # python.version: '3.7' # Python37Mac: - # imageName: 'macos-10.13' + # imageName: 'macos-10.14' # python.version: '3.7' Python38Linux: imageName: 'ubuntu-16.04' diff --git a/bin/cythonize.py b/bin/cythonize.py deleted file mode 100755 index 4814f8df0..000000000 --- a/bin/cythonize.py +++ /dev/null @@ -1,169 +0,0 @@ -#!/usr/bin/env python -""" cythonize.py - -Cythonize pyx files into C++ files as needed. - -Usage: cythonize.py [root] - -Checks pyx files to see if they have been changed relative to their -corresponding C++ files. If they have, then runs cython on these files to -recreate the C++ files. - -Additionally, checks pxd files and setup.py if they have been changed. 
If -they have, rebuilds everything. - -Change detection based on file hashes stored in JSON format. - -For now, this script should be run by developers when changing Cython files -and the resulting C++ files checked in, so that end-users (and Python-only -developers) do not get the Cython dependencies. - -Based upon: - -https://raw.github.com/dagss/private-scipy-refactor/cythonize/cythonize.py -https://raw.githubusercontent.com/numpy/numpy/master/tools/cythonize.py - -Note: this script does not check any of the dependent C++ libraries. -""" -from __future__ import print_function - -import os -import sys -import json -import hashlib -import subprocess -import argparse - - -HASH_FILE = "cythonize.json" - - -def process_pyx(fromfile, tofile, language_level="-2"): - print("Processing %s" % fromfile) - try: - from Cython.Compiler.Version import version as cython_version - from distutils.version import LooseVersion - - if LooseVersion(cython_version) < LooseVersion("0.19"): - raise Exception("Require Cython >= 0.19") - - except ImportError: - pass - - flags = ["--fast-fail", language_level] - if tofile.endswith(".cpp"): - flags += ["--cplus"] - - try: - try: - r = subprocess.call( - ["cython"] + flags + ["-o", tofile, fromfile], env=os.environ - ) # See Issue #791 - if r != 0: - raise Exception("Cython failed") - except OSError: - # There are ways of installing Cython that don't result in a cython - # executable on the path, see gh-2397. - r = subprocess.call( - [ - sys.executable, - "-c", - "import sys; from Cython.Compiler.Main import " - "setuptools_main as main; sys.exit(main())", - ] - + flags - + ["-o", tofile, fromfile] - ) - if r != 0: - raise Exception("Cython failed") - except OSError: - raise OSError("Cython needs to be installed") - - -def preserve_cwd(path, func, *args): - orig_cwd = os.getcwd() - try: - os.chdir(path) - func(*args) - finally: - os.chdir(orig_cwd) - - -def load_hashes(filename): - try: - return json.load(open(filename)) - except (ValueError, IOError): - return {} - - -def save_hashes(hash_db, filename): - with open(filename, "w") as f: - f.write(json.dumps(hash_db)) - - -def get_hash(path): - return hashlib.md5(open(path, "rb").read()).hexdigest() - - -def hash_changed(base, path, db): - full_path = os.path.normpath(os.path.join(base, path)) - return not get_hash(full_path) == db.get(full_path) - - -def hash_add(base, path, db): - full_path = os.path.normpath(os.path.join(base, path)) - db[full_path] = get_hash(full_path) - - -def process(base, filename, db): - root, ext = os.path.splitext(filename) - if ext in [".pyx", ".cpp"]: - if hash_changed(base, filename, db) or not os.path.isfile( - os.path.join(base, root + ".cpp") - ): - preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp") - hash_add(base, root + ".cpp", db) - hash_add(base, root + ".pyx", db) - - -def check_changes(root, db): - res = False - new_db = {} - - setup_filename = "setup.py" - hash_add(".", setup_filename, new_db) - if hash_changed(".", setup_filename, db): - res = True - - for base, _, files in os.walk(root): - for filename in files: - if filename.endswith(".pxd"): - hash_add(base, filename, new_db) - if hash_changed(base, filename, db): - res = True - - if res: - db.clear() - db.update(new_db) - return res - - -def run(root): - db = load_hashes(HASH_FILE) - - try: - check_changes(root, db) - for base, _, files in os.walk(root): - for filename in files: - process(base, filename, db) - finally: - save_hashes(db, HASH_FILE) - - -if __name__ == "__main__": - parser = 
argparse.ArgumentParser( - description="Cythonize pyx files into C++ files as needed" - ) - parser.add_argument("root", help="root directory") - args = parser.parse_args() - run(args.root) diff --git a/bin/ud/ud_run_test.py b/bin/ud/ud_run_test.py index 7cb270d84..70c6be0d0 100644 --- a/bin/ud/ud_run_test.py +++ b/bin/ud/ud_run_test.py @@ -13,23 +13,12 @@ import srsly import spacy import spacy.util from spacy.tokens import Token, Doc -from spacy.gold import GoldParse -from spacy.util import compounding, minibatch_by_words -from spacy.syntax.nonproj import projectivize from spacy.matcher import Matcher -# from spacy.morphology import Fused_begin, Fused_inside -from spacy import displacy -from collections import defaultdict, Counter -from timeit import default_timer as timer Fused_begin = None Fused_inside = None -import itertools -import random -import numpy.random - from . import conll17_ud_eval from spacy import lang @@ -268,7 +257,7 @@ def load_nlp(experiments_dir, corpus): return nlp -def initialize_pipeline(nlp, docs, golds, config, device): +def initialize_pipeline(nlp, examples, config, device): nlp.add_pipe(nlp.create_pipe("parser")) return nlp diff --git a/bin/ud/ud_train.py b/bin/ud/ud_train.py index 6353bd6e7..aa5050f3a 100644 --- a/bin/ud/ud_train.py +++ b/bin/ud/ud_train.py @@ -14,7 +14,7 @@ import spacy import spacy.util from bin.ud import conll17_ud_eval from spacy.tokens import Token, Doc -from spacy.gold import GoldParse +from spacy.gold import GoldParse, Example from spacy.util import compounding, minibatch, minibatch_by_words from spacy.syntax.nonproj import projectivize from spacy.matcher import Matcher @@ -53,7 +53,7 @@ def read_data( max_doc_length=None, limit=None, ): - """Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True, + """Read the CONLLU format into Example objects. If raw_text=True, include Doc objects created using nlp.make_doc and then aligned against the gold-standard sequences. If oracle_segments=True, include Doc objects created from the gold-standard segments. 
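As the updated docstring above notes, `read_data` now produces `Example` objects rather than `(Doc, GoldParse)` tuples, so each training instance travels as a single container. A minimal sketch of consuming that format, using only the attributes this patch itself relies on (`ex.doc`, `ex.gold`); the variable names are illustrative:

```python
# Assumes `examples` is the list returned by read_data(...) above.
for ex in examples:
    doc = ex.doc    # the Doc (raw text or gold segments)
    gold = ex.gold  # the aligned GoldParse with tags, heads, deps, ...
    n_tagged = sum(1 for tag in gold.tags if tag is not None)
    print(len(doc), n_tagged)
```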
At least one must be True.""" @@ -98,15 +98,16 @@ def read_data( docs.append(doc) golds.append(gold) if limit and len(docs) >= limit: - return docs, golds + return golds_to_gold_data(docs, golds) if raw_text and sent_annots: doc, gold = _make_gold(nlp, None, sent_annots) docs.append(doc) golds.append(gold) if limit and len(docs) >= limit: - return docs, golds - return docs, golds + return golds_to_gold_data(docs, golds) + return golds_to_gold_data(docs, golds) + def _parse_morph_string(morph_string): if morph_string == '_': @@ -120,6 +121,7 @@ def _parse_morph_string(morph_string): output.append('%s_%s' % (key, value.lower())) return set(output) + def read_conllu(file_): docs = [] sent = [] @@ -180,16 +182,18 @@ def _make_gold(nlp, text, sent_annots, drop_deps=0.0): ############################# -def golds_to_gold_tuples(docs, golds): - """Get out the annoying 'tuples' format used by begin_training, given the +def golds_to_gold_data(docs, golds): + """Get out the training data format used by begin_training, given the GoldParse objects.""" - tuples = [] + data = [] for doc, gold in zip(docs, golds): - text = doc.text - ids, words, tags, heads, labels, iob = zip(*gold.orig_annot) - sents = [((ids, words, tags, heads, labels, iob), [])] - tuples.append((text, sents)) - return tuples + example = Example(doc=doc) + example.add_doc_annotation(cats=gold.cats) + token_annotation_dict = gold.orig.to_dict() + example.add_token_annotation(**token_annotation_dict) + example.goldparse = gold + data.append(example) + return data ############## @@ -327,7 +331,6 @@ def get_token_conllu(token, i): return "\n".join(lines) - ################## # Initialization # ################## @@ -348,7 +351,7 @@ def load_nlp(corpus, config, vectors=None): return nlp -def initialize_pipeline(nlp, docs, golds, config, device): +def initialize_pipeline(nlp, examples, config, device): nlp.add_pipe(nlp.create_pipe("tagger", config={"set_morphology": False})) nlp.add_pipe(nlp.create_pipe("morphologizer")) nlp.add_pipe(nlp.create_pipe("parser")) @@ -356,14 +359,15 @@ def initialize_pipeline(nlp, docs, golds, config, device): nlp.parser.add_multitask_objective("tag") if config.multitask_sent: nlp.parser.add_multitask_objective("sent_start") - for gold in golds: + for ex in examples: + gold = ex.gold for tag in gold.tags: if tag is not None: nlp.tagger.add_label(tag) if torch is not None and device != -1: torch.set_default_tensor_type("torch.cuda.FloatTensor") optimizer = nlp.begin_training( - lambda: golds_to_gold_tuples(docs, golds), + lambda: examples, device=device, subword_features=config.subword_features, conv_depth=config.conv_depth, @@ -382,8 +386,8 @@ def _load_pretrained_tok2vec(nlp, loc): weights_data = file_.read() loaded = [] for name, component in nlp.pipeline: - if hasattr(component, "model") and hasattr(component.model, "tok2vec"): - component.tok2vec.from_bytes(weights_data) + if hasattr(component, "model") and component.model.has_ref("tok2vec"): + component.get_ref("tok2vec").from_bytes(weights_data) loaded.append(name) return loaded @@ -491,6 +495,10 @@ def main( Token.set_extension("begins_fused", default=False) Token.set_extension("inside_fused", default=False) + Token.set_extension("get_conllu_lines", method=get_token_conllu) + Token.set_extension("begins_fused", default=False) + Token.set_extension("inside_fused", default=False) + spacy.util.fix_random_seed() lang.zh.Chinese.Defaults.use_jieba = False lang.ja.Japanese.Defaults.use_janome = False @@ -505,7 +513,7 @@ def main( print("Train and evaluate", 
corpus, "using lang", paths.lang) nlp = load_nlp(paths.lang, config, vectors=vectors_dir) - docs, golds = read_data( + examples = read_data( nlp, paths.train.conllu.open(encoding="utf8"), paths.train.text.open(encoding="utf8"), @@ -513,12 +521,12 @@ def main( limit=limit, ) - optimizer = initialize_pipeline(nlp, docs, golds, config, gpu_device) + optimizer = initialize_pipeline(nlp, examples, config, gpu_device) batch_sizes = compounding(config.min_batch_size, config.max_batch_size, 1.001) beam_prob = compounding(0.2, 0.8, 1.001) for i in range(config.nr_epoch): - docs, golds = read_data( + examples = read_data( nlp, paths.train.conllu.open(encoding="utf8"), paths.train.text.open(encoding="utf8"), @@ -527,22 +535,19 @@ def main( oracle_segments=use_oracle_segments, raw_text=not use_oracle_segments, ) - Xs = list(zip(docs, golds)) - random.shuffle(Xs) + random.shuffle(examples) if config.batch_by_words: - batches = minibatch_by_words(Xs, size=batch_sizes) + batches = minibatch_by_words(examples, size=batch_sizes) else: - batches = minibatch(Xs, size=batch_sizes) + batches = minibatch(examples, size=batch_sizes) losses = {} - n_train_words = sum(len(doc) for doc in docs) + n_train_words = sum(len(ex.doc) for ex in examples) with tqdm.tqdm(total=n_train_words, leave=False) as pbar: for batch in batches: - batch_docs, batch_gold = zip(*batch) - pbar.update(sum(len(doc) for doc in batch_docs)) + pbar.update(sum(len(ex.doc) for ex in batch)) nlp.parser.cfg["beam_update_prob"] = next(beam_prob) nlp.update( - batch_docs, - batch_gold, + batch, sgd=optimizer, drop=config.dropout, losses=losses, diff --git a/examples/deep_learning_keras.py b/examples/deep_learning_keras.py index 049cc0be4..bf857b8b7 100644 --- a/examples/deep_learning_keras.py +++ b/examples/deep_learning_keras.py @@ -14,7 +14,7 @@ pip install keras==2.0.9 Compatible with: spaCy v2.0.0+ """ - +import ml_datasets import plac import random import pathlib @@ -24,7 +24,6 @@ from keras.models import Sequential, model_from_json from keras.layers import LSTM, Dense, Embedding, Bidirectional from keras.layers import TimeDistributed from keras.optimizers import Adam -import thinc.extra.datasets from spacy.compat import pickle import spacy @@ -224,7 +223,7 @@ def main( if model_dir is not None: model_dir = pathlib.Path(model_dir) if train_dir is None or dev_dir is None: - imdb_data = thinc.extra.datasets.imdb() + imdb_data = ml_datasets.imdb() if is_runtime: if dev_dir is None: dev_texts, dev_labels = zip(*imdb_data[1]) diff --git a/examples/experiments/ptb-joint-pos-dep/bilstm_tok2vec.cfg b/examples/experiments/ptb-joint-pos-dep/bilstm_tok2vec.cfg new file mode 100644 index 000000000..e152fa5e0 --- /dev/null +++ b/examples/experiments/ptb-joint-pos-dep/bilstm_tok2vec.cfg @@ -0,0 +1,67 @@ +[training] +patience = 10000 +eval_frequency = 200 +dropout = 0.2 +init_tok2vec = null +vectors = null +max_epochs = 100 +orth_variant_level = 0.0 +gold_preproc = true +max_length = 0 +use_gpu = 0 +scores = ["tags_acc", "uas", "las"] +score_weights = {"las": 0.8, "tags_acc": 0.2} +limit = 0 +seed = 0 +accumulate_gradient = 2 + +[training.batch_size] +@schedules = "compounding.v1" +start = 100 +stop = 1000 +compound = 1.001 + +[optimizer] +@optimizers = "Adam.v1" +learn_rate = 0.001 +beta1 = 0.9 +beta2 = 0.999 + +[nlp] +lang = "en" +vectors = ${training:vectors} + +[nlp.pipeline.tok2vec] +factory = "tok2vec" + +[nlp.pipeline.tagger] +factory = "tagger" + +[nlp.pipeline.parser] +factory = "parser" + +[nlp.pipeline.tagger.model] +@architectures = 
"spacy.Tagger.v1" + +[nlp.pipeline.tagger.model.tok2vec] +@architectures = "spacy.Tok2VecTensors.v1" +width = ${nlp.pipeline.tok2vec.model:width} + +[nlp.pipeline.parser.model] +@architectures = "spacy.TransitionBasedParser.v1" +nr_feature_tokens = 8 +hidden_width = 64 +maxout_pieces = 3 + +[nlp.pipeline.parser.model.tok2vec] +@architectures = "spacy.Tok2VecTensors.v1" +width = ${nlp.pipeline.tok2vec.model:width} + +[nlp.pipeline.tok2vec.model] +@architectures = "spacy.HashEmbedBiLSTM.v1" +pretrained_vectors = ${nlp:vectors} +width = 96 +depth = 4 +embed_size = 2000 +subword_features = true +maxout_pieces = 3 diff --git a/examples/experiments/ptb-joint-pos-dep/defaults.cfg b/examples/experiments/ptb-joint-pos-dep/defaults.cfg new file mode 100644 index 000000000..9a10c45f0 --- /dev/null +++ b/examples/experiments/ptb-joint-pos-dep/defaults.cfg @@ -0,0 +1,68 @@ +[training] +patience = 10000 +eval_frequency = 200 +dropout = 0.2 +init_tok2vec = null +vectors = null +max_epochs = 100 +orth_variant_level = 0.0 +gold_preproc = true +max_length = 0 +use_gpu = -1 +scores = ["tags_acc", "uas", "las"] +score_weights = {"las": 0.8, "tags_acc": 0.2} +limit = 0 +seed = 0 +accumulate_gradient = 2 + +[training.batch_size] +@schedules = "compounding.v1" +start = 100 +stop = 1000 +compound = 1.001 + +[optimizer] +@optimizers = "Adam.v1" +learn_rate = 0.001 +beta1 = 0.9 +beta2 = 0.999 + +[nlp] +lang = "en" +vectors = ${training:vectors} + +[nlp.pipeline.tok2vec] +factory = "tok2vec" + +[nlp.pipeline.tagger] +factory = "tagger" + +[nlp.pipeline.parser] +factory = "parser" + +[nlp.pipeline.tagger.model] +@architectures = "spacy.Tagger.v1" + +[nlp.pipeline.tagger.model.tok2vec] +@architectures = "spacy.Tok2VecTensors.v1" +width = ${nlp.pipeline.tok2vec.model:width} + +[nlp.pipeline.parser.model] +@architectures = "spacy.TransitionBasedParser.v1" +nr_feature_tokens = 8 +hidden_width = 64 +maxout_pieces = 3 + +[nlp.pipeline.parser.model.tok2vec] +@architectures = "spacy.Tok2VecTensors.v1" +width = ${nlp.pipeline.tok2vec.model:width} + +[nlp.pipeline.tok2vec.model] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = ${nlp:vectors} +width = 96 +depth = 4 +window_size = 1 +embed_size = 2000 +maxout_pieces = 3 +subword_features = true diff --git a/examples/experiments/tok2vec-ner/charembed_tok2vec.cfg b/examples/experiments/tok2vec-ner/charembed_tok2vec.cfg new file mode 100644 index 000000000..796c8670f --- /dev/null +++ b/examples/experiments/tok2vec-ner/charembed_tok2vec.cfg @@ -0,0 +1,67 @@ +[training] +use_gpu = -1 +limit = 0 +dropout = 0.2 +patience = 10000 +eval_frequency = 200 +scores = ["ents_f"] +score_weights = {"ents_f": 1} +orth_variant_level = 0.0 +gold_preproc = true +max_length = 0 +batch_size = 25 +seed = 0 +accumulate_gradient = 2 + +[optimizer] +@optimizers = "Adam.v1" +learn_rate = 0.001 +beta1 = 0.9 +beta2 = 0.999 + +[nlp] +lang = "en" +vectors = null + +[nlp.pipeline.tok2vec] +factory = "tok2vec" + +[nlp.pipeline.tok2vec.model] +@architectures = "spacy.Tok2Vec.v1" + +[nlp.pipeline.tok2vec.model.extract] +@architectures = "spacy.CharacterEmbed.v1" +width = 96 +nM = 64 +nC = 8 +rows = 2000 +columns = ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"] + +[nlp.pipeline.tok2vec.model.extract.features] +@architectures = "spacy.Doc2Feats.v1" +columns = ${nlp.pipeline.tok2vec.model.extract:columns} + +[nlp.pipeline.tok2vec.model.embed] +@architectures = "spacy.LayerNormalizedMaxout.v1" +width = ${nlp.pipeline.tok2vec.model.extract:width} +maxout_pieces = 4 + 
+[nlp.pipeline.tok2vec.model.encode] +@architectures = "spacy.MaxoutWindowEncoder.v1" +width = ${nlp.pipeline.tok2vec.model.extract:width} +window_size = 1 +maxout_pieces = 2 +depth = 2 + +[nlp.pipeline.ner] +factory = "ner" + +[nlp.pipeline.ner.model] +@architectures = "spacy.TransitionBasedParser.v1" +nr_feature_tokens = 6 +hidden_width = 64 +maxout_pieces = 2 + +[nlp.pipeline.ner.model.tok2vec] +@architectures = "spacy.Tok2VecTensors.v1" +width = ${nlp.pipeline.tok2vec.model.extract:width} diff --git a/examples/experiments/tok2vec-ner/multihashembed_tok2vec.cfg b/examples/experiments/tok2vec-ner/multihashembed_tok2vec.cfg new file mode 100644 index 000000000..3ac70675b --- /dev/null +++ b/examples/experiments/tok2vec-ner/multihashembed_tok2vec.cfg @@ -0,0 +1,46 @@ +[training] +use_gpu = -1 +limit = 0 +dropout = 0.2 +patience = 10000 +eval_frequency = 200 +scores = ["ents_p", "ents_r", "ents_f"] +score_weights = {"ents_f": 1} +orth_variant_level = 0.0 +gold_preproc = true +max_length = 0 +seed = 0 +accumulate_gradient = 2 + +[training.batch_size] +@schedules = "compounding.v1" +start = 3000 +stop = 3000 +compound = 1.001 + + +[optimizer] +@optimizers = "Adam.v1" +learn_rate = 0.001 +beta1 = 0.9 +beta2 = 0.999 + +[nlp] +lang = "en" +vectors = null + +[nlp.pipeline.ner] +factory = "simple_ner" + +[nlp.pipeline.ner.model] +@architectures = "spacy.BiluoTagger.v1" + +[nlp.pipeline.ner.model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +width = 128 +depth = 4 +embed_size = 7000 +maxout_pieces = 3 +window_size = 1 +subword_features = true +pretrained_vectors = null diff --git a/examples/pipeline/multi_processing.py b/examples/pipeline/multi_processing.py index f0e437acf..e4aca7912 100644 --- a/examples/pipeline/multi_processing.py +++ b/examples/pipeline/multi_processing.py @@ -13,9 +13,10 @@ Prerequisites: pip install joblib from __future__ import print_function, unicode_literals from pathlib import Path + +import ml_datasets from joblib import Parallel, delayed from functools import partial -import thinc.extra.datasets import plac import spacy from spacy.util import minibatch @@ -35,7 +36,7 @@ def main(output_dir, model="en_core_web_sm", n_jobs=4, batch_size=1000, limit=10 output_dir.mkdir() # load and pre-process the IMBD dataset print("Loading IMDB data...") - data, _ = thinc.extra.datasets.imdb() + data, _ = ml_datasets.imdb() texts, _ = zip(*data[-limit:]) print("Processing texts...") partitions = minibatch(texts, size=batch_size) diff --git a/examples/streamlit_spacy.py b/examples/streamlit_spacy.py index a2da123c2..2b527b3df 100644 --- a/examples/streamlit_spacy.py +++ b/examples/streamlit_spacy.py @@ -1,7 +1,7 @@ # coding: utf-8 """ Example of a Streamlit app for an interactive spaCy model visualizer. You can -either download the script, or point streamlit run to the raw URL of this +either download the script, or point `streamlit run` to the raw URL of this file. For more details, see https://streamlit.io. 
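The `.cfg` files added earlier in this patch describe pipelines declaratively: each component's model is resolved through the `@architectures` registry, and values can be shared via interpolation such as `${nlp.pipeline.tok2vec.model:width}`. A minimal sketch of turning one of these files into an `nlp` object, using the same `util.load_config` / `util.load_model_from_config` calls that the updated `train_textcat.py` later in this patch relies on (the path is illustrative):

```python
from pathlib import Path
from spacy import util

config_path = Path("examples/experiments/ptb-joint-pos-dep/defaults.cfg")

# Read the config without instantiating the registered functions yet.
nlp_config = util.load_config(config_path, create_objects=False)["nlp"]
# Build the Language object and its pipeline from the [nlp] block.
nlp = util.load_model_from_config(nlp_config)
print(nlp.pipe_names)  # expected to include tok2vec, tagger and parser
```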
Installation: @@ -15,6 +15,8 @@ streamlit run streamlit_spacy.py """ from __future__ import unicode_literals +import base64 + import streamlit as st import spacy from spacy import displacy @@ -54,6 +56,14 @@ model_load_state.empty() text = st.text_area("Text to analyze", DEFAULT_TEXT) doc = process_text(spacy_model, text) + +def render_svg(svg): + """Renders the given svg string.""" + b64 = base64.b64encode(svg.encode('utf-8')).decode("utf-8") + html = r'' % b64 + st.write(html, unsafe_allow_html=True) + + if "parser" in nlp.pipe_names: st.header("Dependency Parse & Part-of-speech tags") st.sidebar.header("Dependency Parse") @@ -68,12 +78,14 @@ if "parser" in nlp.pipe_names: } docs = [span.as_doc() for span in doc.sents] if split_sents else [doc] for sent in docs: - html = displacy.render(sent, options=options) + html = displacy.render(sent, options=options, style="dep") # Double newlines seem to mess with the rendering html = html.replace("\n\n", "\n") if split_sents and len(docs) > 1: st.markdown(f"> {sent.text}") - st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True) + render_svg(html) + # this didn't show the dep arc labels properly, cf #5089 + # st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True) if "ner" in nlp.pipe_names: st.header("Named Entities") diff --git a/examples/training/conllu.py b/examples/training/conllu.py index 1c65f4a72..bf47be72a 100644 --- a/examples/training/conllu.py +++ b/examples/training/conllu.py @@ -12,7 +12,7 @@ import tqdm import spacy import spacy.util from spacy.tokens import Token, Doc -from spacy.gold import GoldParse +from spacy.gold import GoldParse, Example from spacy.syntax.nonproj import projectivize from collections import defaultdict from spacy.matcher import Matcher @@ -33,25 +33,25 @@ random.seed(0) numpy.random.seed(0) -def minibatch_by_words(items, size=5000): - random.shuffle(items) +def minibatch_by_words(examples, size=5000): + random.shuffle(examples) if isinstance(size, int): size_ = itertools.repeat(size) else: size_ = size - items = iter(items) + examples = iter(examples) while True: batch_size = next(size_) batch = [] while batch_size >= 0: try: - doc, gold = next(items) + example = next(examples) except StopIteration: if batch: yield batch return - batch_size -= len(doc) - batch.append((doc, gold)) + batch_size -= len(example.doc) + batch.append(example) if batch: yield batch else: @@ -78,7 +78,7 @@ def read_data( max_doc_length=None, limit=None, ): - """Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True, + """Read the CONLLU format into Example objects. If raw_text=True, include Doc objects created using nlp.make_doc and then aligned against the gold-standard sequences. If oracle_segments=True, include Doc objects created from the gold-standard segments. 
At least one must be True.""" @@ -119,15 +119,15 @@ def read_data( docs.append(doc) golds.append(gold) if limit and len(docs) >= limit: - return docs, golds + return golds_to_gold_data(docs, golds) if raw_text and sent_annots: doc, gold = _make_gold(nlp, None, sent_annots) docs.append(doc) golds.append(gold) if limit and len(docs) >= limit: - return docs, golds - return docs, golds + return golds_to_gold_data(docs, golds) + return golds_to_gold_data(docs, golds) def read_conllu(file_): @@ -181,16 +181,18 @@ def _make_gold(nlp, text, sent_annots): ############################# -def golds_to_gold_tuples(docs, golds): - """Get out the annoying 'tuples' format used by begin_training, given the +def golds_to_gold_data(docs, golds): + """Get out the training data format used by begin_training, given the GoldParse objects.""" - tuples = [] + data = [] for doc, gold in zip(docs, golds): - text = doc.text - ids, words, tags, heads, labels, iob = zip(*gold.orig_annot) - sents = [((ids, words, tags, heads, labels, iob), [])] - tuples.append((text, sents)) - return tuples + example = Example(doc=doc) + example.add_doc_annotation(cats=gold.cats) + token_annotation_dict = gold.orig.to_dict() + example.add_token_annotation(**token_annotation_dict) + example.goldparse = gold + data.append(example) + return data ############## @@ -303,7 +305,7 @@ def load_nlp(corpus, config): return nlp -def initialize_pipeline(nlp, docs, golds, config): +def initialize_pipeline(nlp, examples, config): nlp.add_pipe(nlp.create_pipe("parser")) if config.multitask_tag: nlp.parser.add_multitask_objective("tag") @@ -311,18 +313,19 @@ def initialize_pipeline(nlp, docs, golds, config): nlp.parser.add_multitask_objective("sent_start") nlp.parser.moves.add_action(2, "subtok") nlp.add_pipe(nlp.create_pipe("tagger")) - for gold in golds: - for tag in gold.tags: + for ex in examples: + for tag in ex.gold.tags: if tag is not None: nlp.tagger.add_label(tag) # Replace labels that didn't make the frequency cutoff actions = set(nlp.parser.labels) label_set = set([act.split("-")[1] for act in actions if "-" in act]) - for gold in golds: + for ex in examples: + gold = ex.gold for i, label in enumerate(gold.labels): if label is not None and label not in label_set: gold.labels[i] = label.split("||")[0] - return nlp.begin_training(lambda: golds_to_gold_tuples(docs, golds)) + return nlp.begin_training(lambda: examples) ######################## @@ -391,13 +394,17 @@ def main(ud_dir, parses_dir, config, corpus, limit=0): Token.set_extension("begins_fused", default=False) Token.set_extension("inside_fused", default=False) + Token.set_extension("get_conllu_lines", method=get_token_conllu) + Token.set_extension("begins_fused", default=False) + Token.set_extension("inside_fused", default=False) + paths = TreebankPaths(ud_dir, corpus) if not (parses_dir / corpus).exists(): (parses_dir / corpus).mkdir() print("Train and evaluate", corpus, "using lang", paths.lang) nlp = load_nlp(paths.lang, config) - docs, golds = read_data( + examples = read_data( nlp, paths.train.conllu.open(encoding="utf8"), paths.train.text.open(encoding="utf8"), @@ -405,23 +412,18 @@ def main(ud_dir, parses_dir, config, corpus, limit=0): limit=limit, ) - optimizer = initialize_pipeline(nlp, docs, golds, config) + optimizer = initialize_pipeline(nlp, examples, config) for i in range(config.nr_epoch): - docs = [nlp.make_doc(doc.text) for doc in docs] - batches = minibatch_by_words(list(zip(docs, golds)), size=config.batch_size) + docs = [nlp.make_doc(example.doc.text) for example in 
examples] + batches = minibatch_by_words(examples, size=config.batch_size) losses = {} n_train_words = sum(len(doc) for doc in docs) with tqdm.tqdm(total=n_train_words, leave=False) as pbar: for batch in batches: - batch_docs, batch_gold = zip(*batch) - pbar.update(sum(len(doc) for doc in batch_docs)) + pbar.update(sum(len(ex.doc) for ex in batch)) nlp.update( - batch_docs, - batch_gold, - sgd=optimizer, - drop=config.dropout, - losses=losses, + examples=batch, sgd=optimizer, drop=config.dropout, losses=losses, ) out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i) diff --git a/examples/training/ner_multitask_objective.py b/examples/training/ner_multitask_objective.py index 4bf7a008f..7561d4877 100644 --- a/examples/training/ner_multitask_objective.py +++ b/examples/training/ner_multitask_objective.py @@ -31,14 +31,13 @@ random.seed(0) PWD = os.path.dirname(__file__) -TRAIN_DATA = list(read_json_file( - os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json"))) +TRAIN_DATA = list(read_json_file(os.path.join(PWD, "training-data.json"))) -def get_position_label(i, words, tags, heads, labels, ents): +def get_position_label(i, token_annotation): """Return labels indicating the position of the word in the document. """ - if len(words) < 20: + if len(token_annotation.words) < 20: return "short-doc" elif i == 0: return "first-word" @@ -46,7 +45,7 @@ def get_position_label(i, words, tags, heads, labels, ents): return "early-word" elif i < 20: return "mid-word" - elif i == len(words) - 1: + elif i == len(token_annotation.words) - 1: return "last-word" else: return "late-word" @@ -60,17 +59,17 @@ def main(n_iter=10): print(nlp.pipeline) print("Create data", len(TRAIN_DATA)) - optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA) + optimizer = nlp.begin_training(get_examples=lambda: TRAIN_DATA) for itn in range(n_iter): random.shuffle(TRAIN_DATA) losses = {} - for text, annot_brackets in TRAIN_DATA: - for annotations, _ in annot_brackets: - doc = Doc(nlp.vocab, words=annotations[1]) - gold = GoldParse.from_annot_tuples(doc, annotations) + for example in TRAIN_DATA: + for token_annotation in example.token_annotations: + doc = Doc(nlp.vocab, words=token_annotation.words) + gold = GoldParse.from_annotation(doc, example.doc_annotation, token_annotation) + nlp.update( - [doc], # batch of texts - [gold], # batch of annotations + examples=[(doc, gold)], # 1 example drop=0.2, # dropout - make it harder to memorise data sgd=optimizer, # callable to update weights losses=losses, @@ -78,9 +77,9 @@ def main(n_iter=10): print(losses.get("nn_labeller", 0.0), losses["ner"]) # test the trained model - for text, _ in TRAIN_DATA: - if text is not None: - doc = nlp(text) + for example in TRAIN_DATA: + if example.text is not None: + doc = nlp(example.text) print("Entities", [(ent.text, ent.label_) for ent in doc.ents]) print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc]) diff --git a/examples/training/pretrain_textcat.py b/examples/training/pretrain_textcat.py index f3e493f6a..5c41c0e92 100644 --- a/examples/training/pretrain_textcat.py +++ b/examples/training/pretrain_textcat.py @@ -16,16 +16,18 @@ the development labels, after all --- only the unlabelled text. 
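A change that recurs throughout this patch: `nlp.update` now takes a single batch of examples (tuples or `Example` objects) instead of separate `docs` and `golds` lists, so the `texts, annotations = zip(*batch)` boilerplate disappears. A minimal sketch of the new call, assuming an `nlp` object, an `optimizer` from `nlp.begin_training()`, and `TRAIN_DATA` in the `(text, annotations)` format used by the training examples below:

```python
import random
from spacy.util import minibatch, compounding

# Assumed to exist: nlp, optimizer, TRAIN_DATA = [(text, annotations), ...]
losses = {}
random.shuffle(TRAIN_DATA)
for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
    # Old API: texts, annotations = zip(*batch); nlp.update(texts, annotations, ...)
    nlp.update(batch, sgd=optimizer, drop=0.2, losses=losses)
print("Losses", losses)
```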
import plac import tqdm import random + +import ml_datasets + import spacy -import thinc.extra.datasets -from spacy.util import minibatch, use_gpu, compounding -from spacy._ml import Tok2Vec +from spacy.util import minibatch from spacy.pipeline import TextCategorizer +from spacy.ml.models.tok2vec import build_Tok2Vec_model import numpy def load_texts(limit=0): - train, dev = thinc.extra.datasets.imdb() + train, dev = ml_datasets.imdb() train_texts, train_labels = zip(*train) dev_texts, dev_labels = zip(*train) train_texts = list(train_texts) @@ -41,7 +43,7 @@ def load_texts(limit=0): def load_textcat_data(limit=0): """Load data from the IMDB dataset.""" # Partition off part of the train data for evaluation - train_data, eval_data = thinc.extra.datasets.imdb() + train_data, eval_data = ml_datasets.imdb() random.shuffle(train_data) train_data = train_data[-limit:] texts, labels = zip(*train_data) @@ -63,25 +65,21 @@ def prefer_gpu(): def build_textcat_model(tok2vec, nr_class, width): - from thinc.v2v import Model, Softmax, Maxout - from thinc.api import flatten_add_lengths, chain - from thinc.t2v import Pooling, sum_pool, mean_pool, max_pool - from thinc.misc import Residual, LayerNorm - from spacy._ml import logistic, zero_init + from thinc.api import Model, Softmax, chain, reduce_mean, list2ragged with Model.define_operators({">>": chain}): model = ( tok2vec - >> flatten_add_lengths - >> Pooling(mean_pool) + >> list2ragged() + >> reduce_mean() >> Softmax(nr_class, width) ) - model.tok2vec = tok2vec + model.set_ref("tok2vec", tok2vec) return model def block_gradients(model): - from thinc.api import wrap + from thinc.api import wrap # TODO FIX def forward(X, drop=0.0): Y, _ = model.begin_update(X, drop=drop) @@ -97,8 +95,9 @@ def create_pipeline(width, embed_size, vectors_model): textcat = TextCategorizer( nlp.vocab, labels=["POSITIVE", "NEGATIVE"], + # TODO: replace with config version model=build_textcat_model( - Tok2Vec(width=width, embed_size=embed_size), 2, width + build_Tok2Vec_model(width=width, embed_size=embed_size), 2, width ), ) @@ -114,14 +113,14 @@ def train_tensorizer(nlp, texts, dropout, n_iter): losses = {} for i, batch in enumerate(minibatch(tqdm.tqdm(texts))): docs = [nlp.make_doc(text) for text in batch] - tensorizer.update(docs, None, losses=losses, sgd=optimizer, drop=dropout) + tensorizer.update((docs, None), losses=losses, sgd=optimizer, drop=dropout) print(losses) return optimizer def train_textcat(nlp, n_texts, n_iter=10): textcat = nlp.get_pipe("textcat") - tok2vec_weights = textcat.model.tok2vec.to_bytes() + tok2vec_weights = textcat.model.get_ref("tok2vec").to_bytes() (train_texts, train_cats), (dev_texts, dev_cats) = load_textcat_data(limit=n_texts) print( "Using {} examples ({} training, {} evaluation)".format( @@ -130,12 +129,9 @@ def train_textcat(nlp, n_texts, n_iter=10): ) train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats])) - # get names of other pipes to disable them during training - pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - with nlp.disable_pipes(*other_pipes): # only train textcat + with nlp.select_pipes(enable="textcat"): # only train textcat optimizer = nlp.begin_training() - textcat.model.tok2vec.from_bytes(tok2vec_weights) + textcat.model.get_ref("tok2vec").from_bytes(tok2vec_weights) print("Training the model...") print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F")) for i in range(n_iter): @@ -143,8 +139,7 @@ 
def train_textcat(nlp, n_texts, n_iter=10): # batch up the examples using spaCy's minibatch batches = minibatch(tqdm.tqdm(train_data), size=2) for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses) + nlp.update(batch, sgd=optimizer, drop=0.2, losses=losses) with textcat.model.use_params(optimizer.averages): # evaluate on the dev data split off in load_data() scores = evaluate_textcat(nlp.tokenizer, textcat, dev_texts, dev_cats) diff --git a/examples/training/rehearsal.py b/examples/training/rehearsal.py index 24b1cea00..98a96643b 100644 --- a/examples/training/rehearsal.py +++ b/examples/training/rehearsal.py @@ -59,17 +59,14 @@ def main(model_name, unlabelled_loc): # yet, but I'm getting weird results from Adam. Try commenting out the # nlp.update(), and using Adam -- you'll find the models drift apart. # I guess Adam is losing precision, introducing gradient noise? - optimizer.alpha = 0.1 + optimizer.learn_rate = 0.1 optimizer.b1 = 0.0 optimizer.b2 = 0.0 - # get names of other pipes to disable them during training - pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] sizes = compounding(1.0, 4.0, 1.001) - with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings(): + with nlp.select_pipes(enable="ner") and warnings.catch_warnings(): # show warnings for misaligned entity spans once - warnings.filterwarnings("once", category=UserWarning, module='spacy') + warnings.filterwarnings("once", category=UserWarning, module="spacy") for itn in range(n_iter): random.shuffle(TRAIN_DATA) @@ -79,8 +76,7 @@ def main(model_name, unlabelled_loc): # batch up the examples using spaCy's minibatch raw_batches = minibatch(raw_docs, size=4) for batch in minibatch(TRAIN_DATA, size=sizes): - docs, golds = zip(*batch) - nlp.update(docs, golds, sgd=optimizer, drop=dropout, losses=losses) + nlp.update(batch, sgd=optimizer, drop=dropout, losses=losses) raw_batch = list(next(raw_batches)) nlp.rehearse(raw_batch, sgd=optimizer, losses=r_losses) print("Losses", losses) diff --git a/examples/training/textcat_example_data/textcatjsonl_to_trainjson.py b/examples/training/textcat_example_data/textcatjsonl_to_trainjson.py index 339ce39be..66d96ff68 100644 --- a/examples/training/textcat_example_data/textcatjsonl_to_trainjson.py +++ b/examples/training/textcat_example_data/textcatjsonl_to_trainjson.py @@ -5,16 +5,17 @@ from spacy.gold import docs_to_json import srsly import sys + @plac.annotations( model=("Model name. 
Defaults to 'en'.", "option", "m", str), input_file=("Input file (jsonl)", "positional", None, Path), output_dir=("Output directory", "positional", None, Path), n_texts=("Number of texts to convert", "option", "t", int), ) -def convert(model='en', input_file=None, output_dir=None, n_texts=0): +def convert(model="en", input_file=None, output_dir=None, n_texts=0): # Load model with tokenizer + sentencizer only nlp = spacy.load(model) - nlp.disable_pipes(*nlp.pipe_names) + nlp.select_pipes(disable=nlp.pipe_names) sentencizer = nlp.create_pipe("sentencizer") nlp.add_pipe(sentencizer, first=True) @@ -49,5 +50,6 @@ def convert(model='en', input_file=None, output_dir=None, n_texts=0): srsly.write_json(output_dir / input_file.with_suffix(".json"), [docs_to_json(docs)]) + if __name__ == "__main__": plac.call(convert) diff --git a/examples/training/train_entity_linker.py b/examples/training/train_entity_linker.py index 3a8deb7a0..b82ff5bb4 100644 --- a/examples/training/train_entity_linker.py +++ b/examples/training/train_entity_linker.py @@ -18,7 +18,6 @@ import random from pathlib import Path from spacy.vocab import Vocab - import spacy from spacy.kb import KnowledgeBase from spacy.pipeline import EntityRuler @@ -66,36 +65,38 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50): vocab = Vocab().from_disk(vocab_path) # create blank English model with correct vocab nlp = spacy.blank("en", vocab=vocab) - nlp.vocab.vectors.name = "spacy_pretrained_vectors" + nlp.vocab.vectors.name = "nel_vectors" print("Created blank 'en' model with vocab from '%s'" % vocab_path) # Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy. - nlp.add_pipe(nlp.create_pipe('sentencizer')) + nlp.add_pipe(nlp.create_pipe("sentencizer")) # Add a custom component to recognize "Russ Cochran" as an entity for the example training data. # Note that in a realistic application, an actual NER algorithm should be used instead. ruler = EntityRuler(nlp) - patterns = [{"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]}] + patterns = [ + {"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]} + ] ruler.add_patterns(patterns) nlp.add_pipe(ruler) # Create the Entity Linker component and add it to the pipeline. if "entity_linker" not in nlp.pipe_names: - # use only the predicted EL score and not the prior probability (for demo purposes) - cfg = {"incl_prior": False} - entity_linker = nlp.create_pipe("entity_linker", cfg) kb = KnowledgeBase(vocab=nlp.vocab) kb.load_bulk(kb_path) print("Loaded Knowledge Base from '%s'" % kb_path) - entity_linker.set_kb(kb) + + # use only the predicted EL score and not the prior probability (for demo purposes) + cfg = {"kb": kb, "incl_prior": False} + entity_linker = nlp.create_pipe("entity_linker", cfg) nlp.add_pipe(entity_linker, last=True) # Convert the texts to docs to make sure we have doc.ents set for the training examples. - # Also ensure that the annotated examples correspond to known identifiers in the knowlege base. + # Also ensure that the annotated examples correspond to known identifiers in the knowledge base. 
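Another pattern replaced throughout these example scripts: instead of collecting `pipe_exceptions` and calling `nlp.disable_pipes(*other_pipes)`, the scripts now use `nlp.select_pipes` with either `enable` or `disable`. A short sketch of both forms as they appear in this patch, assuming an `nlp` object that has the named components:

```python
# Train only the NER component; everything else is temporarily disabled.
with nlp.select_pipes(enable="ner"):
    optimizer = nlp.begin_training()

# Or disable selected components by name while pre-processing texts.
with nlp.select_pipes(disable="entity_linker"):
    doc = nlp("Some text to pre-process without entity linking.")
```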
kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings() TRAIN_DOCS = [] for text, annotation in TRAIN_DATA: - with nlp.disable_pipes("entity_linker"): + with nlp.select_pipes(disable="entity_linker"): doc = nlp(text) annotation_clean = annotation for offset, kb_id_dict in annotation["links"].items(): @@ -110,22 +111,18 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50): annotation_clean["links"][offset] = new_dict TRAIN_DOCS.append((doc, annotation_clean)) - # get names of other pipes to disable them during training - pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - with nlp.disable_pipes(*other_pipes): # only train entity linker + with nlp.select_pipes(enable="entity_linker"): # only train entity linker # reset and initialize the weights randomly optimizer = nlp.begin_training() + for itn in range(n_iter): random.shuffle(TRAIN_DOCS) losses = {} # batch up the examples using spaCy's minibatch batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001)) for batch in batches: - texts, annotations = zip(*batch) nlp.update( - texts, # batch of texts - annotations, # batch of annotations + batch, drop=0.2, # dropout - make it harder to memorise data losses=losses, sgd=optimizer, diff --git a/examples/training/train_intent_parser.py b/examples/training/train_intent_parser.py index d2472b6b9..c3d5a279b 100644 --- a/examples/training/train_intent_parser.py +++ b/examples/training/train_intent_parser.py @@ -124,9 +124,7 @@ def main(model=None, output_dir=None, n_iter=15): for dep in annotations.get("deps", []): parser.add_label(dep) - pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - with nlp.disable_pipes(*other_pipes): # only train parser + with nlp.select_pipes(enable="parser"): # only train parser optimizer = nlp.begin_training() for itn in range(n_iter): random.shuffle(TRAIN_DATA) @@ -134,8 +132,7 @@ def main(model=None, output_dir=None, n_iter=15): # batch up the examples using spaCy's minibatch batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)) for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, losses=losses) + nlp.update(batch, sgd=optimizer, losses=losses) print("Losses", losses) # test the trained model diff --git a/examples/training/train_morphologizer.py b/examples/training/train_morphologizer.py new file mode 100644 index 000000000..aec114de7 --- /dev/null +++ b/examples/training/train_morphologizer.py @@ -0,0 +1,133 @@ +#!/usr/bin/env python +# coding: utf8 +""" +A simple example for training a morphologizer. For more details, see +the documentation: +* Training: https://spacy.io/usage/training + +Compatible with: spaCy v3.0.0+ +Last tested with: v3.0.0 +""" +from __future__ import unicode_literals, print_function + +import plac +import random +from pathlib import Path +import spacy +from spacy.util import minibatch, compounding +from spacy.morphology import Morphology + + +# Usually you'll read this in, of course. Data formats vary. Ensure your +# strings are unicode and that the number of tags assigned matches spaCy's +# tokenization. 
If not, you can always add a 'words' key to the annotations +# that specifies the gold-standard tokenization, e.g.: +# ("Eatblueham", {'words': ['Eat', 'blue', 'ham'], 'tags': ['V', 'J', 'N']}) +TRAIN_DATA = [ + ( + "I like green eggs", + { + "morphs": [ + "PronType=Prs|Person=1", + "VerbForm=Fin", + "Degree=Pos", + "Number=Plur", + ], + "pos": ["PRON", "VERB", "ADJ", "NOUN"], + }, + ), + ( + "Eat blue ham", + { + "morphs": ["VerbForm=Inf", "Degree=Pos", "Number=Sing"], + "pos": ["VERB", "ADJ", "NOUN"], + }, + ), + ( + "She was blue", + { + "morphs": ["PronType=Prs|Person=3", "VerbForm=Fin", "Degree=Pos"], + "pos": ["PRON", "VERB", "ADJ"], + }, + ), + ( + "He was blue today", + { + "morphs": ["PronType=Prs|Person=3", "VerbForm=Fin", "Degree=Pos", ""], + "pos": ["PRON", "VERB", "ADJ", "ADV"], + }, + ), +] + +# The POS tags are optional, set `with_pos_tags = False` to omit them for +# this example: +with_pos_tags = True + +if not with_pos_tags: + for i in range(len(TRAIN_DATA)): + del TRAIN_DATA[i][1]["pos"] + + +@plac.annotations( + lang=("ISO Code of language to use", "option", "l", str), + output_dir=("Optional output directory", "option", "o", Path), + n_iter=("Number of training iterations", "option", "n", int), +) +def main(lang="en", output_dir=None, n_iter=25): + """Create a new model, set up the pipeline and train the tagger. In order to + train the tagger with a custom tag map, we're creating a new Language + instance with a custom vocab. + """ + nlp = spacy.blank(lang) + # add the tagger to the pipeline + # nlp.create_pipe works for built-ins that are registered with spaCy + morphologizer = nlp.create_pipe("morphologizer") + nlp.add_pipe(morphologizer) + + # add labels + for _, annotations in TRAIN_DATA: + morph_labels = annotations.get("morphs") + pos_labels = annotations.get("pos", [""] * len(annotations.get("morphs"))) + assert len(morph_labels) == len(pos_labels) + for morph, pos in zip(morph_labels, pos_labels): + morph_dict = Morphology.feats_to_dict(morph) + if pos: + morph_dict["POS"] = pos + morph = Morphology.dict_to_feats(morph_dict) + morphologizer.add_label(morph) + + optimizer = nlp.begin_training() + for i in range(n_iter): + random.shuffle(TRAIN_DATA) + losses = {} + # batch up the examples using spaCy's minibatch + batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)) + for batch in batches: + nlp.update(batch, sgd=optimizer, losses=losses) + print("Losses", losses) + + # test the trained model + test_text = "I like blue eggs" + doc = nlp(test_text) + print("Morphs", [(t.text, t.morph) for t in doc]) + + # save model to output directory + if output_dir is not None: + output_dir = Path(output_dir) + if not output_dir.exists(): + output_dir.mkdir() + nlp.to_disk(output_dir) + print("Saved model to", output_dir) + + # test the save model + print("Loading from", output_dir) + nlp2 = spacy.load(output_dir) + doc = nlp2(test_text) + print("Morphs", [(t.text, t.morph) for t in doc]) + + +if __name__ == "__main__": + plac.call(main) + +# Expected output: +# Morphs [('I', POS=PRON|Person=1|PronType=Prs), ('like', POS=VERB|VerbForm=Fin), ('blue', Degree=Pos|POS=ADJ), ('eggs', Number=Plur|POS=NOUN)] diff --git a/examples/training/train_ner.py b/examples/training/train_ner.py index ff6029567..f439fda23 100644 --- a/examples/training/train_ner.py +++ b/examples/training/train_ner.py @@ -43,41 +43,39 @@ def main(model=None, output_dir=None, n_iter=100): # create the built-in pipeline components and add them to the pipeline # nlp.create_pipe works for built-ins 
that are registered with spaCy - if "ner" not in nlp.pipe_names: - ner = nlp.create_pipe("ner") + if "simple_ner" not in nlp.pipe_names: + ner = nlp.create_pipe("simple_ner") nlp.add_pipe(ner, last=True) # otherwise, get it so we can add labels else: - ner = nlp.get_pipe("ner") + ner = nlp.get_pipe("simple_ner") # add labels for _, annotations in TRAIN_DATA: for ent in annotations.get("entities"): + print("Add label", ent[2]) ner.add_label(ent[2]) - # get names of other pipes to disable them during training - pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - # only train NER - with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings(): + with nlp.select_pipes(enable="ner") and warnings.catch_warnings(): # show warnings for misaligned entity spans once - warnings.filterwarnings("once", category=UserWarning, module='spacy') + warnings.filterwarnings("once", category=UserWarning, module="spacy") # reset and initialize the weights randomly – but only if we're # training a new model if model is None: nlp.begin_training() + print( + "Transitions", list(enumerate(nlp.get_pipe("simple_ner").get_tag_names())) + ) for itn in range(n_iter): random.shuffle(TRAIN_DATA) losses = {} # batch up the examples using spaCy's minibatch batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)) for batch in batches: - texts, annotations = zip(*batch) nlp.update( - texts, # batch of texts - annotations, # batch of annotations - drop=0.5, # dropout - make it harder to memorise data + batch, + drop=0.0, # dropout - make it harder to memorise data losses=losses, ) print("Losses", losses) diff --git a/examples/training/train_new_entity_type.py b/examples/training/train_new_entity_type.py index e8ff6802a..5124d0a2c 100644 --- a/examples/training/train_new_entity_type.py +++ b/examples/training/train_new_entity_type.py @@ -95,13 +95,9 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30): else: optimizer = nlp.resume_training() move_names = list(ner.move_names) - # get names of other pipes to disable them during training - pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - # only train NER - with nlp.disable_pipes(*other_pipes) and warnings.catch_warnings(): + with nlp.select_pipes(enable="ner") and warnings.catch_warnings(): # show warnings for misaligned entity spans once - warnings.filterwarnings("once", category=UserWarning, module='spacy') + warnings.filterwarnings("once", category=UserWarning, module="spacy") sizes = compounding(1.0, 4.0, 1.001) # batch up the examples using spaCy's minibatch @@ -110,8 +106,7 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30): batches = minibatch(TRAIN_DATA, size=sizes) losses = {} for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses) + nlp.update(batch, sgd=optimizer, drop=0.35, losses=losses) print("Losses", losses) # test the trained model diff --git a/examples/training/train_parser.py b/examples/training/train_parser.py index c5adb0dec..4f4409e31 100644 --- a/examples/training/train_parser.py +++ b/examples/training/train_parser.py @@ -64,10 +64,7 @@ def main(model=None, output_dir=None, n_iter=15): for dep in annotations.get("deps", []): parser.add_label(dep) - # get names of other pipes to disable them during training - pipe_exceptions = ["parser", 
"trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - with nlp.disable_pipes(*other_pipes): # only train parser + with nlp.select_pipes(enable="parser"): # only train parser optimizer = nlp.begin_training() for itn in range(n_iter): random.shuffle(TRAIN_DATA) @@ -75,8 +72,7 @@ def main(model=None, output_dir=None, n_iter=15): # batch up the examples using spaCy's minibatch batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)) for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, losses=losses) + nlp.update(batch, sgd=optimizer, losses=losses) print("Losses", losses) # test the trained model diff --git a/examples/training/train_tagger.py b/examples/training/train_tagger.py index 7136273b3..06e05f6cd 100644 --- a/examples/training/train_tagger.py +++ b/examples/training/train_tagger.py @@ -65,8 +65,7 @@ def main(lang="en", output_dir=None, n_iter=25): # batch up the examples using spaCy's minibatch batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)) for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, losses=losses) + nlp.update(batch, sgd=optimizer, losses=losses) print("Losses", losses) # test the trained model diff --git a/examples/training/train_textcat.py b/examples/training/train_textcat.py index 456ef098c..65acadb07 100644 --- a/examples/training/train_textcat.py +++ b/examples/training/train_textcat.py @@ -2,89 +2,87 @@ # coding: utf8 """Train a convolutional neural network text classifier on the IMDB dataset, using the TextCategorizer component. The dataset will be loaded -automatically via Thinc's built-in dataset loader. The model is added to +automatically via the package `ml_datasets`. The model is added to spacy.pipeline, and predictions are available via `doc.cats`. For more details, see the documentation: * Training: https://spacy.io/usage/training -Compatible with: spaCy v2.0.0+ +Compatible with: spaCy v3.0.0+ """ from __future__ import unicode_literals, print_function + import plac import random from pathlib import Path -import thinc.extra.datasets +from ml_datasets import loaders import spacy +from spacy import util from spacy.util import minibatch, compounding +from spacy.gold import Example, GoldParse @plac.annotations( - model=("Model name. Defaults to blank 'en' model.", "option", "m", str), + config_path=("Path to config file", "positional", None, Path), output_dir=("Optional output directory", "option", "o", Path), n_texts=("Number of texts to train from", "option", "t", int), n_iter=("Number of training iterations", "option", "n", int), init_tok2vec=("Pretrained tok2vec weights", "option", "t2v", Path), + dataset=("Dataset to train on (default: imdb)", "option", "d", str), + threshold=("Min. 
number of instances for a given label (default 20)", "option", "m", int) ) -def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None): +def main(config_path, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None, dataset="imdb", threshold=20): + if not config_path or not config_path.exists(): + raise ValueError(f"Config file not found at {config_path}") + + spacy.util.fix_random_seed() if output_dir is not None: output_dir = Path(output_dir) if not output_dir.exists(): output_dir.mkdir() - if model is not None: - nlp = spacy.load(model) # load existing spaCy model - print("Loaded model '%s'" % model) - else: - nlp = spacy.blank("en") # create blank Language class - print("Created blank 'en' model") + print(f"Loading nlp model from {config_path}") + nlp_config = util.load_config(config_path, create_objects=False)["nlp"] + nlp = util.load_model_from_config(nlp_config) - # add the text classifier to the pipeline if it doesn't exist - # nlp.create_pipe works for built-ins that are registered with spaCy + # ensure the nlp object was defined with a textcat component if "textcat" not in nlp.pipe_names: - textcat = nlp.create_pipe( - "textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"} - ) - nlp.add_pipe(textcat, last=True) - # otherwise, get it, so we can add labels to it - else: - textcat = nlp.get_pipe("textcat") + raise ValueError(f"The nlp definition in the config does not contain a textcat component") - # add label to text classifier - textcat.add_label("POSITIVE") - textcat.add_label("NEGATIVE") + textcat = nlp.get_pipe("textcat") - # load the IMDB dataset - print("Loading IMDB data...") - (train_texts, train_cats), (dev_texts, dev_cats) = load_data() - train_texts = train_texts[:n_texts] - train_cats = train_cats[:n_texts] + # load the dataset + print(f"Loading dataset {dataset} ...") + (train_texts, train_cats), (dev_texts, dev_cats) = load_data(dataset=dataset, threshold=threshold, limit=n_texts) print( "Using {} examples ({} training, {} evaluation)".format( n_texts, len(train_texts), len(dev_texts) ) ) - train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats])) + train_examples = [] + for text, cats in zip(train_texts, train_cats): + doc = nlp.make_doc(text) + gold = GoldParse(doc, cats=cats) + for cat in cats: + textcat.add_label(cat) + ex = Example.from_gold(gold, doc=doc) + train_examples.append(ex) - # get names of other pipes to disable them during training - pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - with nlp.disable_pipes(*other_pipes): # only train textcat + with nlp.select_pipes(enable="textcat"): # only train textcat optimizer = nlp.begin_training() if init_tok2vec is not None: with init_tok2vec.open("rb") as file_: - textcat.model.tok2vec.from_bytes(file_.read()) + textcat.model.get_ref("tok2vec").from_bytes(file_.read()) print("Training the model...") print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F")) batch_sizes = compounding(4.0, 32.0, 1.001) for i in range(n_iter): losses = {} # batch up the examples using spaCy's minibatch - random.shuffle(train_data) - batches = minibatch(train_data, size=batch_sizes) + random.shuffle(train_examples) + batches = minibatch(train_examples, size=batch_sizes) for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses) + nlp.update(batch, sgd=optimizer, drop=0.2, losses=losses) with 
textcat.model.use_params(optimizer.averages): # evaluate on the dev data split off in load_data() scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats) @@ -97,7 +95,7 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None ) ) - # test the trained model + # test the trained model (only makes sense for sentiment analysis) test_text = "This movie sucked" doc = nlp(test_text) print(test_text, doc.cats) @@ -114,14 +112,39 @@ def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None print(test_text, doc2.cats) -def load_data(limit=0, split=0.8): - """Load data from the IMDB dataset.""" +def load_data(dataset, threshold, limit=0, split=0.8): + """Load data from the provided dataset.""" # Partition off part of the train data for evaluation - train_data, _ = thinc.extra.datasets.imdb() + data_loader = loaders.get(dataset) + train_data, _ = data_loader(limit=int(limit/split)) random.shuffle(train_data) - train_data = train_data[-limit:] texts, labels = zip(*train_data) - cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels] + + unique_labels = sorted(set([l for label_set in labels for l in label_set])) + print(f"# of unique_labels: {len(unique_labels)}") + + count_values_train = dict() + for text, annot_list in train_data: + for annot in annot_list: + count_values_train[annot] = count_values_train.get(annot, 0) + 1 + for value, count in sorted(count_values_train.items(), key=lambda item: item[1]): + if count < threshold: + unique_labels.remove(value) + + print(f"# of unique_labels after filtering with threshold {threshold}: {len(unique_labels)}") + + if unique_labels == {0, 1}: + cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels] + else: + cats = [] + for y in labels: + if isinstance(y, str): + cats.append({str(label): (label == y) for label in unique_labels}) + elif isinstance(y, set): + cats.append({str(label): (label in y) for label in unique_labels}) + else: + raise ValueError(f"Unrecognised type of labels: {type(y)}") + split = int(len(train_data) * split) return (texts[:split], cats[:split]), (texts[split:], cats[split:]) diff --git a/examples/training/train_textcat_config.cfg b/examples/training/train_textcat_config.cfg new file mode 100644 index 000000000..7c0f36b57 --- /dev/null +++ b/examples/training/train_textcat_config.cfg @@ -0,0 +1,19 @@ +[nlp] +lang = "en" + +[nlp.pipeline.textcat] +factory = "textcat" + +[nlp.pipeline.textcat.model] +@architectures = "spacy.TextCatCNN.v1" +exclusive_classes = false + +[nlp.pipeline.textcat.model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 4 +embed_size = 2000 +window_size = 1 +maxout_pieces = 3 +subword_features = true diff --git a/fabfile.py b/fabfile.py index fcab493f5..760c2c0e2 100644 --- a/fabfile.py +++ b/fabfile.py @@ -1,9 +1,6 @@ -# coding: utf-8 -from __future__ import unicode_literals, print_function - import contextlib from pathlib import Path -from fabric.api import local, lcd, env, settings, prefix +from fabric.api import local, lcd from os import path, environ import shutil import sys @@ -82,9 +79,7 @@ def pex(): with virtualenv(VENV_DIR) as venv_local: with lcd(path.dirname(__file__)): sha = local("git rev-parse --short HEAD", capture=True) - venv_local( - "pex dist/*.whl -e spacy -o dist/spacy-%s.pex" % sha, direct=True - ) + venv_local(f"pex dist/*.whl -e spacy -o dist/spacy-{sha}.pex", direct=True) def clean(): diff --git a/pyproject.toml b/pyproject.toml index 
827e2a797..66a06c1d9 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -6,6 +6,7 @@ requires = [ "cymem>=2.0.2,<2.1.0", "preshed>=3.0.2,<3.1.0", "murmurhash>=0.28.0,<1.1.0", - "thinc==7.4.0", + "thinc==8.0.0a9", + "blis>=0.4.0,<0.5.0" ] build-backend = "setuptools.build_meta" diff --git a/requirements.txt b/requirements.txt index ec30efc16..e5f1ae10b 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,20 +1,21 @@ # Our libraries cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 -thinc==7.4.0 +thinc==8.0.0a9 blis>=0.4.0,<0.5.0 +ml_datasets>=0.1.1 murmurhash>=0.28.0,<1.1.0 wasabi>=0.4.0,<1.1.0 -srsly>=1.0.2,<1.1.0 +srsly>=2.0.0,<3.0.0 catalogue>=0.0.7,<1.1.0 # Third party dependencies numpy>=1.15.0 requests>=2.13.0,<3.0.0 plac>=0.9.6,<1.2.0 -pathlib==1.0.1; python_version < "3.4" tqdm>=4.38.0,<5.0.0 # Optional dependencies jsonschema>=2.6.0,<3.1.0 +pydantic>=1.3.0,<2.0.0 # Development dependencies cython>=0.25 pytest>=4.6.5 diff --git a/setup.cfg b/setup.cfg index af3579f88..f0895bbbb 100644 --- a/setup.cfg +++ b/setup.cfg @@ -16,10 +16,7 @@ classifiers = Operating System :: MacOS :: MacOS X Operating System :: Microsoft :: Windows Programming Language :: Cython - Programming Language :: Python :: 2 - Programming Language :: Python :: 2.7 Programming Language :: Python :: 3 - Programming Language :: Python :: 3.5 Programming Language :: Python :: 3.6 Programming Language :: Python :: 3.7 Programming Language :: Python :: 3.8 @@ -30,32 +27,35 @@ zip_safe = false include_package_data = true scripts = bin/spacy -python_requires = >=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.* +python_requires = >=3.6 setup_requires = wheel cython>=0.25 + numpy>=1.15.0 # We also need our Cython packages here to compile against cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 murmurhash>=0.28.0,<1.1.0 - thinc==7.4.0 + thinc==8.0.0a9 install_requires = # Our libraries murmurhash>=0.28.0,<1.1.0 cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 - thinc==7.4.0 + thinc==8.0.0a9 blis>=0.4.0,<0.5.0 wasabi>=0.4.0,<1.1.0 - srsly>=1.0.2,<1.1.0 + srsly>=2.0.0,<3.0.0 catalogue>=0.0.7,<1.1.0 + ml_datasets # Third-party dependencies tqdm>=4.38.0,<5.0.0 setuptools numpy>=1.15.0 plac>=0.9.6,<1.2.0 requests>=2.13.0,<3.0.0 - pathlib==1.0.1; python_version < "3.4" + pydantic>=1.3.0,<2.0.0 + tqdm>=4.38.0,<5.0.0 [options.extras_require] lookups = diff --git a/setup.py b/setup.py index 62a09aa73..d16615f5f 100755 --- a/setup.py +++ b/setup.py @@ -1,35 +1,27 @@ #!/usr/bin/env python -from __future__ import print_function -import io -import os -import subprocess import sys -import contextlib +import platform from distutils.command.build_ext import build_ext from distutils.sysconfig import get_python_inc import distutils.util from distutils import ccompiler, msvccompiler from setuptools import Extension, setup, find_packages +import numpy +from pathlib import Path +import shutil +from Cython.Build import cythonize +from Cython.Compiler import Options -def is_new_osx(): - """Check whether we're on OSX >= 10.10""" - name = distutils.util.get_platform() - if sys.platform != "darwin": - return False - elif name.startswith("macosx-10"): - minor_version = int(name.split("-")[1].split(".")[1]) - if minor_version >= 7: - return True - else: - return False - else: - return False +ROOT = Path(__file__).parent +PACKAGE_ROOT = ROOT / "spacy" +# Preserve `__doc__` on functions and classes +# http://docs.cython.org/en/latest/src/userguide/source_files_and_compilation.html#compiler-options +Options.docstrings = True + PACKAGES = find_packages() - - MOD_NAMES = [ 
"spacy.parts_of_speech", "spacy.strings", @@ -62,16 +54,38 @@ MOD_NAMES = [ "spacy.symbols", "spacy.vectors", ] - - COMPILE_OPTIONS = { "msvc": ["/Ox", "/EHsc"], "mingw32": ["-O2", "-Wno-strict-prototypes", "-Wno-unused-function"], "other": ["-O2", "-Wno-strict-prototypes", "-Wno-unused-function"], } - - LINK_OPTIONS = {"msvc": [], "mingw32": [], "other": []} +COMPILER_DIRECTIVES = { + "language_level": -3, + "embedsignature": True, + "annotation_typing": False, +} +# Files to copy into the package that are otherwise not included +COPY_FILES = { + ROOT / "setup.cfg": PACKAGE_ROOT / "tests" / "package", + ROOT / "pyproject.toml": PACKAGE_ROOT / "tests" / "package", + ROOT / "requirements.txt": PACKAGE_ROOT / "tests" / "package", +} + + +def is_new_osx(): + """Check whether we're on OSX >= 10.7""" + name = distutils.util.get_platform() + if sys.platform != "darwin": + return False + mac_ver = platform.mac_ver()[0] + if mac_ver.startswith("10"): + minor_version = int(mac_ver.split('.')[1]) + if minor_version >= 7: + return True + else: + return False + return False if is_new_osx(): @@ -104,95 +118,53 @@ class build_ext_subclass(build_ext, build_ext_options): build_ext.build_extensions(self) -def generate_cython(root, source): - print("Cythonizing sources") - p = subprocess.call( - [sys.executable, os.path.join(root, "bin", "cythonize.py"), source], - env=os.environ, - ) - if p != 0: - raise RuntimeError("Running cythonize failed") - - -def is_source_release(path): - return os.path.exists(os.path.join(path, "PKG-INFO")) - - def clean(path): - for name in MOD_NAMES: - name = name.replace(".", "/") - for ext in [".so", ".html", ".cpp", ".c"]: - file_path = os.path.join(path, name + ext) - if os.path.exists(file_path): - os.unlink(file_path) - - -@contextlib.contextmanager -def chdir(new_dir): - old_dir = os.getcwd() - try: - os.chdir(new_dir) - sys.path.insert(0, new_dir) - yield - finally: - del sys.path[0] - os.chdir(old_dir) + for path in path.glob("**/*"): + if path.is_file() and path.suffix in (".so", ".cpp"): + print(f"Deleting {path.name}") + path.unlink() def setup_package(): - root = os.path.abspath(os.path.dirname(__file__)) - if len(sys.argv) > 1 and sys.argv[1] == "clean": - return clean(root) + return clean(PACKAGE_ROOT) - with chdir(root): - with io.open(os.path.join(root, "spacy", "about.py"), encoding="utf8") as f: - about = {} - exec(f.read(), about) + with (PACKAGE_ROOT / "about.py").open("r") as f: + about = {} + exec(f.read(), about) - include_dirs = [ - get_python_inc(plat_specific=True), - os.path.join(root, "include"), - ] + for copy_file, target_dir in COPY_FILES.items(): + if copy_file.exists(): + shutil.copy(str(copy_file), str(target_dir)) + print(f"Copied {copy_file} -> {target_dir}") - if ( - ccompiler.new_compiler().compiler_type == "msvc" - and msvccompiler.get_build_version() == 9 - ): - include_dirs.append(os.path.join(root, "include", "msvc9")) + include_dirs = [ + get_python_inc(plat_specific=True), + numpy.get_include(), + str(ROOT / "include"), + ] + if ( + ccompiler.new_compiler().compiler_type == "msvc" + and msvccompiler.get_build_version() == 9 + ): + include_dirs.append(str(ROOT / "include" / "msvc9")) + ext_modules = [] + for name in MOD_NAMES: + mod_path = name.replace(".", "/") + ".pyx" + ext = Extension(name, [mod_path], language="c++") + ext_modules.append(ext) + print("Cythonizing sources") + ext_modules = cythonize(ext_modules, compiler_directives=COMPILER_DIRECTIVES) - ext_modules = [] - for mod_name in MOD_NAMES: - mod_path = 
mod_name.replace(".", "/") + ".cpp" - extra_link_args = [] - # ??? - # Imported from patch from @mikepb - # See Issue #267. Running blind here... - if sys.platform == "darwin": - dylib_path = [".." for _ in range(mod_name.count("."))] - dylib_path = "/".join(dylib_path) - dylib_path = "@loader_path/%s/spacy/platform/darwin/lib" % dylib_path - extra_link_args.append("-Wl,-rpath,%s" % dylib_path) - ext_modules.append( - Extension( - mod_name, - [mod_path], - language="c++", - include_dirs=include_dirs, - extra_link_args=extra_link_args, - ) - ) - - if not is_source_release(root): - generate_cython(root, "spacy") - - setup( - name="spacy", - packages=PACKAGES, - version=about["__version__"], - ext_modules=ext_modules, - cmdclass={"build_ext": build_ext_subclass}, - ) + setup( + name="spacy", + packages=PACKAGES, + version=about["__version__"], + ext_modules=ext_modules, + cmdclass={"build_ext": build_ext_subclass}, + include_dirs=include_dirs, + package_data={"": ["*.pyx", "*.pxd", "*.pxi", "*.cpp"]}, + ) if __name__ == "__main__": diff --git a/spacy/__init__.py b/spacy/__init__.py index 6aa7b7c16..e4e1f6c8e 100644 --- a/spacy/__init__.py +++ b/spacy/__init__.py @@ -1,5 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals import warnings import sys @@ -7,7 +5,7 @@ warnings.filterwarnings("ignore", message="numpy.dtype size changed") warnings.filterwarnings("ignore", message="numpy.ufunc size changed") # These are imported as part of the API -from thinc.neural.util import prefer_gpu, require_gpu +from thinc.api import prefer_gpu, require_gpu from . import pipeline from .cli.info import info as cli_info @@ -23,6 +21,9 @@ if sys.maxunicode == 65535: raise SystemError(Errors.E130) +config = registry + + def load(name, **overrides): depr_path = overrides.get("path") if depr_path not in (True, False, None): diff --git a/spacy/__main__.py b/spacy/__main__.py index 2c285095e..71ab1a91a 100644 --- a/spacy/__main__.py +++ b/spacy/__main__.py @@ -1,21 +1,17 @@ -# coding: utf8 -from __future__ import print_function - -# NB! This breaks in plac on Python 2!! 
-# from __future__ import unicode_literals - if __name__ == "__main__": import plac import sys from wasabi import msg from spacy.cli import download, link, info, package, train, pretrain, convert from spacy.cli import init_model, profile, evaluate, validate, debug_data + from spacy.cli import train_from_config_cli commands = { "download": download, "link": link, "info": info, "train": train, + "train-from-config": train_from_config_cli, "pretrain": pretrain, "debug-data": debug_data, "evaluate": evaluate, @@ -28,9 +24,9 @@ if __name__ == "__main__": if len(sys.argv) == 1: msg.info("Available commands", ", ".join(commands), exits=1) command = sys.argv.pop(1) - sys.argv[0] = "spacy %s" % command + sys.argv[0] = f"spacy {command}" if command in commands: plac.call(commands[command], sys.argv[1:]) else: - available = "Available: {}".format(", ".join(commands)) - msg.fail("Unknown command: {}".format(command), available, exits=1) + available = f"Available: {', '.join(commands)}" + msg.fail(f"Unknown command: {command}", available, exits=1) diff --git a/spacy/_ml.py b/spacy/_ml.py index 60a0bbee0..e69de29bb 100644 --- a/spacy/_ml.py +++ b/spacy/_ml.py @@ -1,988 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import numpy -import warnings -from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu -from thinc.t2t import ExtractWindow, ParametricAttention -from thinc.t2v import Pooling, sum_pool, mean_pool -from thinc.i2v import HashEmbed -from thinc.misc import Residual, FeatureExtracter -from thinc.misc import LayerNorm as LN -from thinc.api import add, layerize, chain, clone, concatenate, with_flatten -from thinc.api import with_getitem, flatten_add_lengths -from thinc.api import uniqued, wrap, noop -from thinc.linear.linear import LinearModel -from thinc.neural.ops import NumpyOps, CupyOps -from thinc.neural.util import get_array_module, copy_array -from thinc.neural.optimizers import Adam - -from thinc import describe -from thinc.describe import Dimension, Synapses, Biases, Gradient -from thinc.neural._classes.affine import _set_dimensions_if_needed -import thinc.extra.load_nlp - -from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE -from .errors import Errors, Warnings -from . import util -from . 
import ml as new_ml -from .ml import _legacy_tok2vec - - -VECTORS_KEY = "spacy_pretrained_vectors" -# Backwards compatibility with <2.2.2 -USE_MODEL_REGISTRY_TOK2VEC = False - - -def cosine(vec1, vec2): - xp = get_array_module(vec1) - norm1 = xp.linalg.norm(vec1) - norm2 = xp.linalg.norm(vec2) - if norm1 == 0.0 or norm2 == 0.0: - return 0 - else: - return vec1.dot(vec2) / (norm1 * norm2) - - -def create_default_optimizer(ops, **cfg): - learn_rate = util.env_opt("learn_rate", 0.001) - beta1 = util.env_opt("optimizer_B1", 0.9) - beta2 = util.env_opt("optimizer_B2", 0.999) - eps = util.env_opt("optimizer_eps", 1e-8) - L2 = util.env_opt("L2_penalty", 1e-6) - max_grad_norm = util.env_opt("grad_norm_clip", 1.0) - optimizer = Adam(ops, learn_rate, L2=L2, beta1=beta1, beta2=beta2, eps=eps) - optimizer.max_grad_norm = max_grad_norm - optimizer.device = ops.device - return optimizer - - -@layerize -def _flatten_add_lengths(seqs, pad=0, drop=0.0): - ops = Model.ops - lengths = ops.asarray([len(seq) for seq in seqs], dtype="i") - - def finish_update(d_X, sgd=None): - return ops.unflatten(d_X, lengths, pad=pad) - - X = ops.flatten(seqs, pad=pad) - return (X, lengths), finish_update - - -def _zero_init(model): - def _zero_init_impl(self, *args, **kwargs): - self.W.fill(0) - - model.on_init_hooks.append(_zero_init_impl) - if model.W is not None: - model.W.fill(0.0) - return model - - -def with_cpu(ops, model): - """Wrap a model that should run on CPU, transferring inputs and outputs - as necessary.""" - model.to_cpu() - - def with_cpu_forward(inputs, drop=0.0): - cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop) - gpu_outputs = _to_device(ops, cpu_outputs) - - def with_cpu_backprop(d_outputs, sgd=None): - cpu_d_outputs = _to_cpu(d_outputs) - return backprop(cpu_d_outputs, sgd=sgd) - - return gpu_outputs, with_cpu_backprop - - return wrap(with_cpu_forward, model) - - -def _to_cpu(X): - if isinstance(X, numpy.ndarray): - return X - elif isinstance(X, tuple): - return tuple([_to_cpu(x) for x in X]) - elif isinstance(X, list): - return [_to_cpu(x) for x in X] - elif hasattr(X, "get"): - return X.get() - else: - return X - - -def _to_device(ops, X): - if isinstance(X, tuple): - return tuple([_to_device(ops, x) for x in X]) - elif isinstance(X, list): - return [_to_device(ops, x) for x in X] - else: - return ops.asarray(X) - - -class extract_ngrams(Model): - def __init__(self, ngram_size, attr=LOWER): - Model.__init__(self) - self.ngram_size = ngram_size - self.attr = attr - - def begin_update(self, docs, drop=0.0): - batch_keys = [] - batch_vals = [] - for doc in docs: - unigrams = doc.to_array([self.attr]) - ngrams = [unigrams] - for n in range(2, self.ngram_size + 1): - ngrams.append(self.ops.ngrams(n, unigrams)) - keys = self.ops.xp.concatenate(ngrams) - keys, vals = self.ops.xp.unique(keys, return_counts=True) - batch_keys.append(keys) - batch_vals.append(vals) - # The dtype here matches what thinc is expecting -- which differs per - # platform (by int definition). This should be fixed once the problem - # is fixed on Thinc's side. 
- lengths = self.ops.asarray( - [arr.shape[0] for arr in batch_keys], dtype=numpy.int_ - ) - batch_keys = self.ops.xp.concatenate(batch_keys) - batch_vals = self.ops.asarray(self.ops.xp.concatenate(batch_vals), dtype="f") - return (batch_keys, batch_vals, lengths), None - - -@describe.on_data( - _set_dimensions_if_needed, lambda model, X, y: model.init_weights(model) -) -@describe.attributes( - nI=Dimension("Input size"), - nF=Dimension("Number of features"), - nO=Dimension("Output size"), - nP=Dimension("Maxout pieces"), - W=Synapses("Weights matrix", lambda obj: (obj.nF, obj.nO, obj.nP, obj.nI)), - b=Biases("Bias vector", lambda obj: (obj.nO, obj.nP)), - pad=Synapses( - "Pad", - lambda obj: (1, obj.nF, obj.nO, obj.nP), - lambda M, ops: ops.normal_init(M, 1.0), - ), - d_W=Gradient("W"), - d_pad=Gradient("pad"), - d_b=Gradient("b"), -) -class PrecomputableAffine(Model): - def __init__(self, nO=None, nI=None, nF=None, nP=None, **kwargs): - Model.__init__(self, **kwargs) - self.nO = nO - self.nP = nP - self.nI = nI - self.nF = nF - - def begin_update(self, X, drop=0.0): - Yf = self.ops.gemm( - X, self.W.reshape((self.nF * self.nO * self.nP, self.nI)), trans2=True - ) - Yf = Yf.reshape((Yf.shape[0], self.nF, self.nO, self.nP)) - Yf = self._add_padding(Yf) - - def backward(dY_ids, sgd=None): - dY, ids = dY_ids - dY, ids = self._backprop_padding(dY, ids) - Xf = X[ids] - Xf = Xf.reshape((Xf.shape[0], self.nF * self.nI)) - - self.d_b += dY.sum(axis=0) - dY = dY.reshape((dY.shape[0], self.nO * self.nP)) - - Wopfi = self.W.transpose((1, 2, 0, 3)) - Wopfi = self.ops.xp.ascontiguousarray(Wopfi) - Wopfi = Wopfi.reshape((self.nO * self.nP, self.nF * self.nI)) - dXf = self.ops.gemm(dY.reshape((dY.shape[0], self.nO * self.nP)), Wopfi) - - # Reuse the buffer - dWopfi = Wopfi - dWopfi.fill(0.0) - self.ops.gemm(dY, Xf, out=dWopfi, trans1=True) - dWopfi = dWopfi.reshape((self.nO, self.nP, self.nF, self.nI)) - # (o, p, f, i) --> (f, o, p, i) - self.d_W += dWopfi.transpose((2, 0, 1, 3)) - - if sgd is not None: - sgd(self._mem.weights, self._mem.gradient, key=self.id) - return dXf.reshape((dXf.shape[0], self.nF, self.nI)) - - return Yf, backward - - def _add_padding(self, Yf): - Yf_padded = self.ops.xp.vstack((self.pad, Yf)) - return Yf_padded - - def _backprop_padding(self, dY, ids): - # (1, nF, nO, nP) += (nN, nF, nO, nP) where IDs (nN, nF) < 0 - mask = ids < 0.0 - mask = mask.sum(axis=1) - d_pad = dY * mask.reshape((ids.shape[0], 1, 1)) - self.d_pad += d_pad.sum(axis=0) - return dY, ids - - @staticmethod - def init_weights(model): - """This is like the 'layer sequential unit variance', but instead - of taking the actual inputs, we randomly generate whitened data. - - Why's this all so complicated? We have a huge number of inputs, - and the maxout unit makes guessing the dynamics tricky. Instead - we set the maxout weights to values that empirically result in - whitened outputs given whitened inputs. - """ - if (model.W ** 2).sum() != 0.0: - return - ops = model.ops - xp = ops.xp - ops.normal_init(model.W, model.nF * model.nI, inplace=True) - - ids = ops.allocate((5000, model.nF), dtype="f") - ids += xp.random.uniform(0, 1000, ids.shape) - ids = ops.asarray(ids, dtype="i") - tokvecs = ops.allocate((5000, model.nI), dtype="f") - tokvecs += xp.random.normal(loc=0.0, scale=1.0, size=tokvecs.size).reshape( - tokvecs.shape - ) - - def predict(ids, tokvecs): - # nS ids. nW tokvecs. Exclude the padding array. 
- hiddens = model(tokvecs[:-1]) # (nW, f, o, p) - vectors = model.ops.allocate((ids.shape[0], model.nO * model.nP), dtype="f") - # need nS vectors - hiddens = hiddens.reshape( - (hiddens.shape[0] * model.nF, model.nO * model.nP) - ) - model.ops.scatter_add(vectors, ids.flatten(), hiddens) - vectors = vectors.reshape((vectors.shape[0], model.nO, model.nP)) - vectors += model.b - vectors = model.ops.asarray(vectors) - if model.nP >= 2: - return model.ops.maxout(vectors)[0] - else: - return vectors * (vectors >= 0) - - tol_var = 0.01 - tol_mean = 0.01 - t_max = 10 - t_i = 0 - for t_i in range(t_max): - acts1 = predict(ids, tokvecs) - var = model.ops.xp.var(acts1) - mean = model.ops.xp.mean(acts1) - if abs(var - 1.0) >= tol_var: - model.W /= model.ops.xp.sqrt(var) - elif abs(mean) >= tol_mean: - model.b -= mean - else: - break - - -def link_vectors_to_models(vocab, skip_rank=False): - vectors = vocab.vectors - if vectors.name is None: - vectors.name = VECTORS_KEY - if vectors.data.size != 0: - warnings.warn(Warnings.W020.format(shape=vectors.data.shape)) - ops = Model.ops - if not skip_rank: - for word in vocab: - if word.orth in vectors.key2row: - word.rank = vectors.key2row[word.orth] - else: - word.rank = util.OOV_RANK - data = ops.asarray(vectors.data) - # Set an entry here, so that vectors are accessed by StaticVectors - # (unideal, I know) - key = (ops.device, vectors.name) - if key in thinc.extra.load_nlp.VECTORS: - if thinc.extra.load_nlp.VECTORS[key].shape != data.shape: - # This is a hack to avoid the problem in #3853. - old_name = vectors.name - new_name = vectors.name + "_%d" % data.shape[0] - warnings.warn(Warnings.W019.format(old=old_name, new=new_name)) - vectors.name = new_name - key = (ops.device, vectors.name) - thinc.extra.load_nlp.VECTORS[key] = data - - -def PyTorchBiLSTM(nO, nI, depth, dropout=0.2): - import torch.nn - from thinc.api import with_square_sequences - from thinc.extra.wrappers import PyTorchWrapperRNN - - if depth == 0: - return layerize(noop()) - model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout) - return with_square_sequences(PyTorchWrapperRNN(model)) - - -def Tok2Vec(width, embed_size, **kwargs): - if not USE_MODEL_REGISTRY_TOK2VEC: - # Preserve prior tok2vec for backwards compat, in v2.2.2 - return _legacy_tok2vec.Tok2Vec(width, embed_size, **kwargs) - pretrained_vectors = kwargs.get("pretrained_vectors", None) - cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3) - subword_features = kwargs.get("subword_features", True) - char_embed = kwargs.get("char_embed", False) - conv_depth = kwargs.get("conv_depth", 4) - bilstm_depth = kwargs.get("bilstm_depth", 0) - conv_window = kwargs.get("conv_window", 1) - - cols = ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"] - - doc2feats_cfg = {"arch": "spacy.Doc2Feats.v1", "config": {"columns": cols}} - if char_embed: - embed_cfg = { - "arch": "spacy.CharacterEmbed.v1", - "config": { - "width": 64, - "chars": 6, - "@mix": { - "arch": "spacy.LayerNormalizedMaxout.v1", - "config": {"width": width, "pieces": 3}, - }, - "@embed_features": None, - }, - } - else: - embed_cfg = { - "arch": "spacy.MultiHashEmbed.v1", - "config": { - "width": width, - "rows": embed_size, - "columns": cols, - "use_subwords": subword_features, - "@pretrained_vectors": None, - "@mix": { - "arch": "spacy.LayerNormalizedMaxout.v1", - "config": {"width": width, "pieces": 3}, - }, - }, - } - if pretrained_vectors: - embed_cfg["config"]["@pretrained_vectors"] = { - "arch": "spacy.PretrainedVectors.v1", - "config": { - 
"vectors_name": pretrained_vectors, - "width": width, - "column": cols.index("ID"), - }, - } - if cnn_maxout_pieces >= 2: - cnn_cfg = { - "arch": "spacy.MaxoutWindowEncoder.v1", - "config": { - "width": width, - "window_size": conv_window, - "pieces": cnn_maxout_pieces, - "depth": conv_depth, - }, - } - else: - cnn_cfg = { - "arch": "spacy.MishWindowEncoder.v1", - "config": {"width": width, "window_size": conv_window, "depth": conv_depth}, - } - bilstm_cfg = { - "arch": "spacy.TorchBiLSTMEncoder.v1", - "config": {"width": width, "depth": bilstm_depth}, - } - if conv_depth == 0 and bilstm_depth == 0: - encode_cfg = {} - elif conv_depth >= 1 and bilstm_depth >= 1: - encode_cfg = { - "arch": "thinc.FeedForward.v1", - "config": {"children": [cnn_cfg, bilstm_cfg]}, - } - elif conv_depth >= 1: - encode_cfg = cnn_cfg - else: - encode_cfg = bilstm_cfg - config = {"@doc2feats": doc2feats_cfg, "@embed": embed_cfg, "@encode": encode_cfg} - return new_ml.Tok2Vec(config) - - -def reapply(layer, n_times): - def reapply_fwd(X, drop=0.0): - backprops = [] - for i in range(n_times): - Y, backprop = layer.begin_update(X, drop=drop) - X = Y - backprops.append(backprop) - - def reapply_bwd(dY, sgd=None): - dX = None - for backprop in reversed(backprops): - dY = backprop(dY, sgd=sgd) - if dX is None: - dX = dY - else: - dX += dY - return dX - - return Y, reapply_bwd - - return wrap(reapply_fwd, layer) - - -def asarray(ops, dtype): - def forward(X, drop=0.0): - return ops.asarray(X, dtype=dtype), None - - return layerize(forward) - - -def _divide_array(X, size): - parts = [] - index = 0 - while index < len(X): - parts.append(X[index : index + size]) - index += size - return parts - - -def get_col(idx): - if idx < 0: - raise IndexError(Errors.E066.format(value=idx)) - - def forward(X, drop=0.0): - if isinstance(X, numpy.ndarray): - ops = NumpyOps() - else: - ops = CupyOps() - output = ops.xp.ascontiguousarray(X[:, idx], dtype=X.dtype) - - def backward(y, sgd=None): - dX = ops.allocate(X.shape) - dX[:, idx] += y - return dX - - return output, backward - - return layerize(forward) - - -def doc2feats(cols=None): - if cols is None: - cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH] - - def forward(docs, drop=0.0): - feats = [] - for doc in docs: - feats.append(doc.to_array(cols)) - return feats, None - - model = layerize(forward) - model.cols = cols - return model - - -def print_shape(prefix): - def forward(X, drop=0.0): - return X, lambda dX, **kwargs: dX - - return layerize(forward) - - -@layerize -def get_token_vectors(tokens_attrs_vectors, drop=0.0): - tokens, attrs, vectors = tokens_attrs_vectors - - def backward(d_output, sgd=None): - return (tokens, d_output) - - return vectors, backward - - -@layerize -def logistic(X, drop=0.0): - xp = get_array_module(X) - if not isinstance(X, xp.ndarray): - X = xp.asarray(X) - # Clip to range (-10, 10) - X = xp.minimum(X, 10.0, X) - X = xp.maximum(X, -10.0, X) - Y = 1.0 / (1.0 + xp.exp(-X)) - - def logistic_bwd(dY, sgd=None): - dX = dY * (Y * (1 - Y)) - return dX - - return Y, logistic_bwd - - -def zero_init(model): - def _zero_init_impl(self, X, y): - self.W.fill(0) - - model.on_data_hooks.append(_zero_init_impl) - return model - - -def getitem(i): - def getitem_fwd(X, drop=0.0): - return X[i], None - - return layerize(getitem_fwd) - - -@describe.attributes( - W=Synapses("Weights matrix", lambda obj: (obj.nO, obj.nI), lambda W, ops: None) -) -class MultiSoftmax(Affine): - """Neural network layer that predicts several multi-class attributes at once. 
- For instance, we might predict one class with 6 variables, and another with 5. - We predict the 11 neurons required for this, and then softmax them such - that columns 0-6 make a probability distribution and coumns 6-11 make another. - """ - - name = "multisoftmax" - - def __init__(self, out_sizes, nI=None, **kwargs): - Model.__init__(self, **kwargs) - self.out_sizes = out_sizes - self.nO = sum(out_sizes) - self.nI = nI - - def predict(self, input__BI): - output__BO = self.ops.affine(self.W, self.b, input__BI) - i = 0 - for out_size in self.out_sizes: - self.ops.softmax(output__BO[:, i : i + out_size], inplace=True) - i += out_size - return output__BO - - def begin_update(self, input__BI, drop=0.0): - output__BO = self.predict(input__BI) - - def finish_update(grad__BO, sgd=None): - self.d_W += self.ops.gemm(grad__BO, input__BI, trans1=True) - self.d_b += grad__BO.sum(axis=0) - grad__BI = self.ops.gemm(grad__BO, self.W) - if sgd is not None: - sgd(self._mem.weights, self._mem.gradient, key=self.id) - return grad__BI - - return output__BO, finish_update - - -def build_tagger_model(nr_class, **cfg): - embed_size = util.env_opt("embed_size", 2000) - if "token_vector_width" in cfg: - token_vector_width = cfg["token_vector_width"] - else: - token_vector_width = util.env_opt("token_vector_width", 96) - pretrained_vectors = cfg.get("pretrained_vectors") - subword_features = cfg.get("subword_features", True) - with Model.define_operators({">>": chain, "+": add}): - if "tok2vec" in cfg: - tok2vec = cfg["tok2vec"] - else: - tok2vec = Tok2Vec( - token_vector_width, - embed_size, - subword_features=subword_features, - pretrained_vectors=pretrained_vectors, - ) - softmax = with_flatten(Softmax(nr_class, token_vector_width)) - model = tok2vec >> softmax - model.nI = None - model.tok2vec = tok2vec - model.softmax = softmax - return model - - -def build_morphologizer_model(class_nums, **cfg): - embed_size = util.env_opt("embed_size", 7000) - if "token_vector_width" in cfg: - token_vector_width = cfg["token_vector_width"] - else: - token_vector_width = util.env_opt("token_vector_width", 128) - pretrained_vectors = cfg.get("pretrained_vectors") - char_embed = cfg.get("char_embed", True) - with Model.define_operators({">>": chain, "+": add, "**": clone}): - if "tok2vec" in cfg: - tok2vec = cfg["tok2vec"] - else: - tok2vec = Tok2Vec( - token_vector_width, - embed_size, - char_embed=char_embed, - pretrained_vectors=pretrained_vectors, - ) - softmax = with_flatten(MultiSoftmax(class_nums, token_vector_width)) - softmax.out_sizes = class_nums - model = tok2vec >> softmax - model.nI = None - model.tok2vec = tok2vec - model.softmax = softmax - return model - - -@layerize -def SpacyVectors(docs, drop=0.0): - batch = [] - for doc in docs: - indices = numpy.zeros((len(doc),), dtype="i") - for i, word in enumerate(doc): - if word.orth in doc.vocab.vectors.key2row: - indices[i] = doc.vocab.vectors.key2row[word.orth] - else: - indices[i] = 0 - vectors = doc.vocab.vectors.data[indices] - batch.append(vectors) - return batch, None - - -def build_text_classifier(nr_class, width=64, **cfg): - depth = cfg.get("depth", 2) - nr_vector = cfg.get("nr_vector", 5000) - pretrained_dims = cfg.get("pretrained_dims", 0) - with Model.define_operators({">>": chain, "+": add, "|": concatenate, "**": clone}): - if cfg.get("low_data") and pretrained_dims: - model = ( - SpacyVectors - >> flatten_add_lengths - >> with_getitem(0, Affine(width, pretrained_dims)) - >> ParametricAttention(width) - >> Pooling(sum_pool) - >> Residual(ReLu(width, 
width)) ** 2 - >> zero_init(Affine(nr_class, width, drop_factor=0.0)) - >> logistic - ) - return model - - lower = HashEmbed(width, nr_vector, column=1) - prefix = HashEmbed(width // 2, nr_vector, column=2) - suffix = HashEmbed(width // 2, nr_vector, column=3) - shape = HashEmbed(width // 2, nr_vector, column=4) - - trained_vectors = FeatureExtracter( - [ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID] - ) >> with_flatten( - uniqued( - (lower | prefix | suffix | shape) - >> LN(Maxout(width, width + (width // 2) * 3)), - column=0, - ) - ) - - if pretrained_dims: - static_vectors = SpacyVectors >> with_flatten( - Affine(width, pretrained_dims) - ) - # TODO Make concatenate support lists - vectors = concatenate_lists(trained_vectors, static_vectors) - vectors_width = width * 2 - else: - vectors = trained_vectors - vectors_width = width - static_vectors = None - tok2vec = vectors >> with_flatten( - LN(Maxout(width, vectors_width)) - >> Residual((ExtractWindow(nW=1) >> LN(Maxout(width, width * 3)))) ** depth, - pad=depth, - ) - cnn_model = ( - tok2vec - >> flatten_add_lengths - >> ParametricAttention(width) - >> Pooling(sum_pool) - >> Residual(zero_init(Maxout(width, width))) - >> zero_init(Affine(nr_class, width, drop_factor=0.0)) - ) - - linear_model = build_bow_text_classifier( - nr_class, - ngram_size=cfg.get("ngram_size", 1), - exclusive_classes=cfg.get("exclusive_classes", False), - ) - if cfg.get("exclusive_classes", False): - output_layer = Softmax(nr_class, nr_class * 2) - else: - output_layer = ( - zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0)) >> logistic - ) - model = (linear_model | cnn_model) >> output_layer - model.tok2vec = chain(tok2vec, flatten) - model.nO = nr_class - model.lsuv = False - return model - - -def build_bow_text_classifier( - nr_class, ngram_size=1, exclusive_classes=False, no_output_layer=False, **cfg -): - with Model.define_operators({">>": chain}): - model = with_cpu( - Model.ops, extract_ngrams(ngram_size, attr=ORTH) >> LinearModel(nr_class) - ) - if not no_output_layer: - model = model >> (cpu_softmax if exclusive_classes else logistic) - model.nO = nr_class - return model - - -@layerize -def cpu_softmax(X, drop=0.0): - ops = NumpyOps() - - def cpu_softmax_backward(dY, sgd=None): - return dY - - return ops.softmax(X), cpu_softmax_backward - - -def build_simple_cnn_text_classifier(tok2vec, nr_class, exclusive_classes=False, **cfg): - """ - Build a simple CNN text classifier, given a token-to-vector model as inputs. - If exclusive_classes=True, a softmax non-linearity is applied, so that the - outputs sum to 1. If exclusive_classes=False, a logistic non-linearity - is applied instead, so that outputs are in the range [0, 1]. 
- """ - with Model.define_operators({">>": chain}): - if exclusive_classes: - output_layer = Softmax(nr_class, tok2vec.nO) - else: - output_layer = ( - zero_init(Affine(nr_class, tok2vec.nO, drop_factor=0.0)) >> logistic - ) - model = tok2vec >> flatten_add_lengths >> Pooling(mean_pool) >> output_layer - model.tok2vec = chain(tok2vec, flatten) - model.nO = nr_class - return model - - -def build_nel_encoder(embed_width, hidden_width, ner_types, **cfg): - if "entity_width" not in cfg: - raise ValueError(Errors.E144.format(param="entity_width")) - - conv_depth = cfg.get("conv_depth", 2) - cnn_maxout_pieces = cfg.get("cnn_maxout_pieces", 3) - pretrained_vectors = cfg.get("pretrained_vectors", None) - context_width = cfg.get("entity_width") - - with Model.define_operators({">>": chain, "**": clone}): - # context encoder - tok2vec = Tok2Vec( - width=hidden_width, - embed_size=embed_width, - pretrained_vectors=pretrained_vectors, - cnn_maxout_pieces=cnn_maxout_pieces, - subword_features=True, - conv_depth=conv_depth, - bilstm_depth=0, - ) - - model = ( - tok2vec - >> flatten_add_lengths - >> Pooling(mean_pool) - >> Residual(zero_init(Maxout(hidden_width, hidden_width))) - >> zero_init(Affine(context_width, hidden_width, drop_factor=0.0)) - ) - - model.tok2vec = tok2vec - model.nO = context_width - return model - - -@layerize -def flatten(seqs, drop=0.0): - ops = Model.ops - lengths = ops.asarray([len(seq) for seq in seqs], dtype="i") - - def finish_update(d_X, sgd=None): - return ops.unflatten(d_X, lengths, pad=0) - - X = ops.flatten(seqs, pad=0) - return X, finish_update - - -def concatenate_lists(*layers, **kwargs): # pragma: no cover - """Compose two or more models `f`, `g`, etc, such that their outputs are - concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))` - """ - if not layers: - return noop() - drop_factor = kwargs.get("drop_factor", 1.0) - ops = layers[0].ops - layers = [chain(layer, flatten) for layer in layers] - concat = concatenate(*layers) - - def concatenate_lists_fwd(Xs, drop=0.0): - if drop is not None: - drop *= drop_factor - lengths = ops.asarray([len(X) for X in Xs], dtype="i") - flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop) - ys = ops.unflatten(flat_y, lengths) - - def concatenate_lists_bwd(d_ys, sgd=None): - return bp_flat_y(ops.flatten(d_ys), sgd=sgd) - - return ys, concatenate_lists_bwd - - model = wrap(concatenate_lists_fwd, concat) - return model - - -def masked_language_model(vocab, model, mask_prob=0.15): - """Convert a model into a BERT-style masked language model""" - - random_words = _RandomWords(vocab) - - def mlm_forward(docs, drop=0.0): - mask, docs = _apply_mask(docs, random_words, mask_prob=mask_prob) - mask = model.ops.asarray(mask).reshape((mask.shape[0], 1)) - output, backprop = model.begin_update(docs, drop=drop) - - def mlm_backward(d_output, sgd=None): - d_output *= 1 - mask - return backprop(d_output, sgd=sgd) - - return output, mlm_backward - - return wrap(mlm_forward, model) - - -class _RandomWords(object): - def __init__(self, vocab): - self.words = [lex.text for lex in vocab if lex.prob != 0.0] - self.probs = [lex.prob for lex in vocab if lex.prob != 0.0] - self.words = self.words[:10000] - self.probs = self.probs[:10000] - self.probs = numpy.exp(numpy.array(self.probs, dtype="f")) - self.probs /= self.probs.sum() - self._cache = [] - - def next(self): - if not self._cache: - self._cache.extend( - numpy.random.choice(len(self.words), 10000, p=self.probs) - ) - index = self._cache.pop() - return self.words[index] - - -def 
_apply_mask(docs, random_words, mask_prob=0.15): - # This needs to be here to avoid circular imports - from .tokens.doc import Doc - - N = sum(len(doc) for doc in docs) - mask = numpy.random.uniform(0.0, 1.0, (N,)) - mask = mask >= mask_prob - i = 0 - masked_docs = [] - for doc in docs: - words = [] - for token in doc: - if not mask[i]: - word = _replace_word(token.text, random_words) - else: - word = token.text - words.append(word) - i += 1 - spaces = [bool(w.whitespace_) for w in doc] - # NB: If you change this implementation to instead modify - # the docs in place, take care that the IDs reflect the original - # words. Currently we use the original docs to make the vectors - # for the target, so we don't lose the original tokens. But if - # you modified the docs in place here, you would. - masked_docs.append(Doc(doc.vocab, words=words, spaces=spaces)) - return mask, masked_docs - - -def _replace_word(word, random_words, mask="[MASK]"): - roll = numpy.random.random() - if roll < 0.8: - return mask - elif roll < 0.9: - return random_words.next() - else: - return word - - -def _uniform_init(lo, hi): - def wrapped(W, ops): - copy_array(W, ops.xp.random.uniform(lo, hi, W.shape)) - - return wrapped - - -@describe.attributes( - nM=Dimension("Vector dimensions"), - nC=Dimension("Number of characters per word"), - vectors=Synapses( - "Embed matrix", lambda obj: (obj.nC, obj.nV, obj.nM), _uniform_init(-0.1, 0.1) - ), - d_vectors=Gradient("vectors"), -) -class CharacterEmbed(Model): - def __init__(self, nM=None, nC=None, **kwargs): - Model.__init__(self, **kwargs) - self.nM = nM - self.nC = nC - - @property - def nO(self): - return self.nM * self.nC - - @property - def nV(self): - return 256 - - def begin_update(self, docs, drop=0.0): - if not docs: - return [] - ids = [] - output = [] - weights = self.vectors - # This assists in indexing; it's like looping over this dimension. - # Still consider this weird witch craft...But thanks to Mark Neumann - # for the tip. - nCv = self.ops.xp.arange(self.nC) - for doc in docs: - doc_ids = doc.to_utf8_array(nr_char=self.nC) - doc_vectors = self.ops.allocate((len(doc), self.nC, self.nM)) - # Let's say I have a 2d array of indices, and a 3d table of data. What numpy - # incantation do I chant to get - # output[i, j, k] == data[j, ids[i, j], k]? 
- doc_vectors[:, nCv] = weights[nCv, doc_ids[:, nCv]] - output.append(doc_vectors.reshape((len(doc), self.nO))) - ids.append(doc_ids) - - def backprop_character_embed(d_vectors, sgd=None): - gradient = self.d_vectors - for doc_ids, d_doc_vectors in zip(ids, d_vectors): - d_doc_vectors = d_doc_vectors.reshape((len(doc_ids), self.nC, self.nM)) - gradient[nCv, doc_ids[:, nCv]] += d_doc_vectors[:, nCv] - if sgd is not None: - sgd(self._mem.weights, self._mem.gradient, key=self.id) - return None - - return output, backprop_character_embed - - -def get_cossim_loss(yh, y, ignore_zeros=False): - xp = get_array_module(yh) - # Find the zero vectors - if ignore_zeros: - zero_indices = xp.abs(y).sum(axis=1) == 0 - # Add a small constant to avoid 0 vectors - yh = yh + 1e-8 - y = y + 1e-8 - # https://math.stackexchange.com/questions/1923613/partial-derivative-of-cosine-similarity - norm_yh = xp.linalg.norm(yh, axis=1, keepdims=True) - norm_y = xp.linalg.norm(y, axis=1, keepdims=True) - mul_norms = norm_yh * norm_y - cosine = (yh * y).sum(axis=1, keepdims=True) / mul_norms - d_yh = (y / mul_norms) - (cosine * (yh / norm_yh ** 2)) - losses = xp.abs(cosine - 1) - if ignore_zeros: - # If the target was a zero vector, don't count it in the loss. - d_yh[zero_indices] = 0 - losses[zero_indices] = 0 - loss = losses.sum() - return loss, -d_yh diff --git a/spacy/about.py b/spacy/about.py index 84dc86aa8..3af1b77a0 100644 --- a/spacy/about.py +++ b/spacy/about.py @@ -1,6 +1,6 @@ # fmt: off __title__ = "spacy" -__version__ = "2.2.4" +__version__ = "3.0.0.dev8" __release__ = True __download_url__ = "https://github.com/explosion/spacy-models/releases/download" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" diff --git a/spacy/analysis.py b/spacy/analysis.py index 960ce6c0f..c2600048f 100644 --- a/spacy/analysis.py +++ b/spacy/analysis.py @@ -1,10 +1,5 @@ -# coding: utf8 -from __future__ import unicode_literals - -import warnings - -from collections import OrderedDict from wasabi import Printer +import warnings from .tokens import Doc, Token, Span from .errors import Errors, Warnings @@ -25,7 +20,7 @@ def analyze_pipes(pipeline, name, pipe, index, warn=True): assert pipeline[index][0] == name prev_pipes = pipeline[:index] pipe_requires = getattr(pipe, "requires", []) - requires = OrderedDict([(annot, False) for annot in pipe_requires]) + requires = {annot: False for annot in pipe_requires} if requires: for prev_name, prev_pipe in prev_pipes: prev_assigns = getattr(prev_pipe, "assigns", []) @@ -100,15 +95,15 @@ def validate_attrs(values): for ext_attr, ext_value in value.items(): # We don't check whether the attribute actually exists if ext_value is not True: # attr is something like doc._.x.y - good = "{}._.{}".format(obj_key, ext_attr) - bad = "{}.{}".format(good, ".".join(ext_value)) + good = f"{obj_key}._.{ext_attr}" + bad = f"{good}.{'.'.join(ext_value)}" raise ValueError(Errors.E183.format(attr=bad, solution=good)) continue # we can't validate those further if attr.endswith("_"): # attr is something like "token.pos_" raise ValueError(Errors.E184.format(attr=attr, solution=attr[:-1])) if value is not True: # attr is something like doc.x.y - good = "{}.{}".format(obj_key, attr) - bad = "{}.{}".format(good, ".".join(value)) + good = f"{obj_key}.{attr}" + bad = f"{good}.{'.'.join(value)}" raise ValueError(Errors.E183.format(attr=bad, solution=good)) obj = objs[obj_key] if not hasattr(obj, attr): @@ -170,11 +165,10 @@ def print_summary(nlp, pretty=True, 
no_print=False): msg.table(overview, header=header, divider=True, multiline=True) n_problems = sum(len(p) for p in problems.values()) if any(p for p in problems.values()): - msg.divider("Problems ({})".format(n_problems)) + msg.divider(f"Problems ({n_problems})") for name, problem in problems.items(): if problem: - problem = ", ".join(problem) - msg.warn("'{}' requirements not met: {}".format(name, problem)) + msg.warn(f"'{name}' requirements not met: {', '.join(problem)}") else: msg.good("No problems found.") if no_print: diff --git a/spacy/attrs.pxd b/spacy/attrs.pxd index 805dc2950..33d5372de 100644 --- a/spacy/attrs.pxd +++ b/spacy/attrs.pxd @@ -91,6 +91,7 @@ cdef enum attr_id_t: LANG ENT_KB_ID = symbols.ENT_KB_ID + MORPH ENT_ID = symbols.ENT_ID IDX diff --git a/spacy/attrs.pyx b/spacy/attrs.pyx index fe9895d06..b15db7599 100644 --- a/spacy/attrs.pyx +++ b/spacy/attrs.pyx @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - IDS = { "": NULL_ATTR, @@ -92,6 +89,7 @@ IDS = { "SPACY": SPACY, "PROB": PROB, "LANG": LANG, + "MORPH": MORPH, "IDX": IDX } diff --git a/spacy/cli/__init__.py b/spacy/cli/__init__.py index 778453711..5f83b26c1 100644 --- a/spacy/cli/__init__.py +++ b/spacy/cli/__init__.py @@ -1,12 +1,21 @@ +from wasabi import msg + from .download import download # noqa: F401 from .info import info # noqa: F401 -from .link import link # noqa: F401 from .package import package # noqa: F401 from .profile import profile # noqa: F401 from .train import train # noqa: F401 +from .train_from_config import train_from_config_cli # noqa: F401 from .pretrain import pretrain # noqa: F401 from .debug_data import debug_data # noqa: F401 from .evaluate import evaluate # noqa: F401 from .convert import convert # noqa: F401 from .init_model import init_model # noqa: F401 from .validate import validate # noqa: F401 + + +def link(*args, **kwargs): + msg.warn( + "As of spaCy v3.0, model symlinks are deprecated. You can load models " + "using their full names or from a directory path." 
+ ) diff --git a/spacy/cli/_schemas.py b/spacy/cli/_schemas.py deleted file mode 100644 index 3fb2c8979..000000000 --- a/spacy/cli/_schemas.py +++ /dev/null @@ -1,220 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - - -# NB: This schema describes the new format of the training data, see #2928 -TRAINING_SCHEMA = { - "$schema": "http://json-schema.org/draft-06/schema", - "title": "Training data for spaCy models", - "type": "array", - "items": { - "type": "object", - "properties": { - "text": { - "title": "The text of the training example", - "type": "string", - "minLength": 1, - }, - "ents": { - "title": "Named entity spans in the text", - "type": "array", - "items": { - "type": "object", - "properties": { - "start": { - "title": "Start character offset of the span", - "type": "integer", - "minimum": 0, - }, - "end": { - "title": "End character offset of the span", - "type": "integer", - "minimum": 0, - }, - "label": { - "title": "Entity label", - "type": "string", - "minLength": 1, - "pattern": "^[A-Z0-9]*$", - }, - }, - "required": ["start", "end", "label"], - }, - }, - "sents": { - "title": "Sentence spans in the text", - "type": "array", - "items": { - "type": "object", - "properties": { - "start": { - "title": "Start character offset of the span", - "type": "integer", - "minimum": 0, - }, - "end": { - "title": "End character offset of the span", - "type": "integer", - "minimum": 0, - }, - }, - "required": ["start", "end"], - }, - }, - "cats": { - "title": "Text categories for the text classifier", - "type": "object", - "patternProperties": { - "*": { - "title": "A text category", - "oneOf": [ - {"type": "boolean"}, - {"type": "number", "minimum": 0}, - ], - } - }, - "propertyNames": {"pattern": "^[A-Z0-9]*$", "minLength": 1}, - }, - "tokens": { - "title": "The tokens in the text", - "type": "array", - "items": { - "type": "object", - "minProperties": 1, - "properties": { - "id": { - "title": "Token ID, usually token index", - "type": "integer", - "minimum": 0, - }, - "start": { - "title": "Start character offset of the token", - "type": "integer", - "minimum": 0, - }, - "end": { - "title": "End character offset of the token", - "type": "integer", - "minimum": 0, - }, - "pos": { - "title": "Coarse-grained part-of-speech tag", - "type": "string", - "minLength": 1, - }, - "tag": { - "title": "Fine-grained part-of-speech tag", - "type": "string", - "minLength": 1, - }, - "dep": { - "title": "Dependency label", - "type": "string", - "minLength": 1, - }, - "head": { - "title": "Index of the token's head", - "type": "integer", - "minimum": 0, - }, - }, - "required": ["start", "end"], - }, - }, - "_": {"title": "Custom user space", "type": "object"}, - }, - "required": ["text"], - }, -} - -META_SCHEMA = { - "$schema": "http://json-schema.org/draft-06/schema", - "type": "object", - "properties": { - "lang": { - "title": "Two-letter language code, e.g. 'en'", - "type": "string", - "minLength": 2, - "maxLength": 2, - "pattern": "^[a-z]*$", - }, - "name": { - "title": "Model name", - "type": "string", - "minLength": 1, - "pattern": "^[a-z_]*$", - }, - "version": { - "title": "Model version", - "type": "string", - "minLength": 1, - "pattern": "^[0-9a-z.-]*$", - }, - "spacy_version": { - "title": "Compatible spaCy version identifier", - "type": "string", - "minLength": 1, - "pattern": "^[0-9a-z.-><=]*$", - }, - "parent_package": { - "title": "Name of parent spaCy package, e.g. 
spacy or spacy-nightly", - "type": "string", - "minLength": 1, - "default": "spacy", - }, - "pipeline": { - "title": "Names of pipeline components", - "type": "array", - "items": {"type": "string", "minLength": 1}, - }, - "description": {"title": "Model description", "type": "string"}, - "license": {"title": "Model license", "type": "string"}, - "author": {"title": "Model author name", "type": "string"}, - "email": {"title": "Model author email", "type": "string", "format": "email"}, - "url": {"title": "Model author URL", "type": "string", "format": "uri"}, - "sources": { - "title": "Training data sources", - "type": "array", - "items": {"type": "string"}, - }, - "vectors": { - "title": "Included word vectors", - "type": "object", - "properties": { - "keys": { - "title": "Number of unique keys", - "type": "integer", - "minimum": 0, - }, - "vectors": { - "title": "Number of unique vectors", - "type": "integer", - "minimum": 0, - }, - "width": { - "title": "Number of dimensions", - "type": "integer", - "minimum": 0, - }, - }, - }, - "accuracy": { - "title": "Accuracy numbers", - "type": "object", - "patternProperties": {"*": {"type": "number", "minimum": 0.0}}, - }, - "speed": { - "title": "Speed evaluation numbers", - "type": "object", - "patternProperties": { - "*": { - "oneOf": [ - {"type": "number", "minimum": 0.0}, - {"type": "integer", "minimum": 0}, - ] - } - }, - }, - }, - "required": ["lang", "name", "version"], -} diff --git a/spacy/cli/convert.py b/spacy/cli/convert.py index fa867fa04..2ffbeb458 100644 --- a/spacy/cli/convert.py +++ b/spacy/cli/convert.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - -import plac from pathlib import Path from wasabi import Printer import srsly @@ -29,27 +25,20 @@ FILE_TYPES = ("json", "jsonl", "msg") FILE_TYPES_STDOUT = ("json", "jsonl") -@plac.annotations( - input_file=("Input file", "positional", None, str), - output_dir=("Output directory. '-' for stdout.", "positional", None, str), - file_type=("Type of data to produce: {}".format(FILE_TYPES), "option", "t", str), - n_sents=("Number of sentences per doc (0 to disable)", "option", "n", int), - seg_sents=("Segment sentences (for -c ner)", "flag", "s"), - model=("Model for sentence segmentation (for -s)", "option", "b", str), - converter=("Converter: {}".format(tuple(CONVERTERS.keys())), "option", "c", str), - lang=("Language (if tokenizer required)", "option", "l", str), - morphology=("Enable appending morphology to tags", "flag", "m", bool), -) def convert( - input_file, - output_dir="-", - file_type="json", - n_sents=1, - seg_sents=False, - model=None, - morphology=False, - converter="auto", - lang=None, + # fmt: off + input_file: ("Input file", "positional", None, str), + output_dir: ("Output directory. 
'-' for stdout.", "positional", None, str) = "-", + file_type: (f"Type of data to produce: {FILE_TYPES}", "option", "t", str, FILE_TYPES) = "json", + n_sents: ("Number of sentences per doc (0 to disable)", "option", "n", int) = 1, + seg_sents: ("Segment sentences (for -c ner)", "flag", "s") = False, + model: ("Model for sentence segmentation (for -s)", "option", "b", str) = None, + morphology: ("Enable appending morphology to tags", "flag", "m", bool) = False, + merge_subtokens: ("Merge CoNLL-U subtokens", "flag", "T", bool) = False, + converter: (f"Converter: {tuple(CONVERTERS.keys())}", "option", "c", str) = "auto", + ner_map_path: ("NER tag mapping (as JSON-encoded dict of entity types)", "option", "N", Path) = None, + lang: ("Language (if tokenizer required)", "option", "l", str) = None, + # fmt: on ): """ Convert files into JSON format for use with train command and other @@ -60,16 +49,10 @@ def convert( no_print = output_dir == "-" msg = Printer(no_print=no_print) input_path = Path(input_file) - if file_type not in FILE_TYPES: - msg.fail( - "Unknown file type: '{}'".format(file_type), - "Supported file types: '{}'".format(", ".join(FILE_TYPES)), - exits=1, - ) if file_type not in FILE_TYPES_STDOUT and output_dir == "-": # TODO: support msgpack via stdout in srsly? msg.fail( - "Can't write .{} data to stdout.".format(file_type), + f"Can't write .{file_type} data to stdout", "Please specify an output directory.", exits=1, ) @@ -93,21 +76,26 @@ def convert( "Can't automatically detect NER format. Conversion may not succeed. See https://spacy.io/api/cli#convert" ) if converter not in CONVERTERS: - msg.fail("Can't find converter for {}".format(converter), exits=1) + msg.fail(f"Can't find converter for {converter}", exits=1) + ner_map = None + if ner_map_path is not None: + ner_map = srsly.read_json(ner_map_path) # Use converter function to convert data func = CONVERTERS[converter] data = func( input_data, n_sents=n_sents, seg_sents=seg_sents, - use_morphology=morphology, + append_morphology=morphology, + merge_subtokens=merge_subtokens, lang=lang, model=model, no_print=no_print, + ner_map=ner_map, ) if output_dir != "-": # Export data to a file - suffix = ".{}".format(file_type) + suffix = f".{file_type}" output_file = Path(output_dir) / Path(input_path.parts[-1]).with_suffix(suffix) if file_type == "json": srsly.write_json(output_file, data) @@ -115,9 +103,7 @@ def convert( srsly.write_jsonl(output_file, data) elif file_type == "msg": srsly.write_msgpack(output_file, data) - msg.good( - "Generated output file ({} documents): {}".format(len(data), output_file) - ) + msg.good(f"Generated output file ({len(data)} documents): {output_file}") else: # Print to stdout if file_type == "json": diff --git a/spacy/cli/converters/conll_ner2json.py b/spacy/cli/converters/conll_ner2json.py index 46489ad7c..b607d5913 100644 --- a/spacy/cli/converters/conll_ner2json.py +++ b/spacy/cli/converters/conll_ner2json.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from wasabi import Printer from ...gold import iob_to_biluo @@ -64,9 +61,9 @@ def conll_ner2json( # sentence segmentation required for document segmentation if n_sents > 0 and not seg_sents: msg.warn( - "No sentence boundaries found to use with option `-n {}`. " - "Use `-s` to automatically segment sentences or `-n 0` " - "to disable.".format(n_sents) + f"No sentence boundaries found to use with option `-n {n_sents}`. " + f"Use `-s` to automatically segment sentences or `-n 0` " + f"to disable." 
) else: n_sents_info(msg, n_sents) @@ -129,7 +126,7 @@ def segment_sents_and_docs(doc, n_sents, doc_delimiter, model=None, msg=None): if model: nlp = load_model(model) if "parser" in nlp.pipe_names: - msg.info("Segmenting sentences with parser from model '{}'.".format(model)) + msg.info(f"Segmenting sentences with parser from model '{model}'.") sentencizer = nlp.get_pipe("parser") if not sentencizer: msg.info( @@ -166,7 +163,7 @@ def segment_docs(input_data, n_sents, doc_delimiter): def n_sents_info(msg, n_sents): - msg.info("Grouping every {} sentences into a document.".format(n_sents)) + msg.info(f"Grouping every {n_sents} sentences into a document.") if n_sents == 1: msg.warn( "To generate better training data, you may want to group " diff --git a/spacy/cli/converters/conllu2json.py b/spacy/cli/converters/conllu2json.py index 3de4dcc30..0b2920802 100644 --- a/spacy/cli/converters/conllu2json.py +++ b/spacy/cli/converters/conllu2json.py @@ -1,141 +1,349 @@ -# coding: utf8 -from __future__ import unicode_literals - import re -from ...gold import iob_to_biluo +from ...gold import Example +from ...gold import iob_to_biluo, spans_from_biluo_tags, biluo_tags_from_offsets +from ...language import Language +from ...tokens import Doc, Token +from .conll_ner2json import n_sents_info +from wasabi import Printer -def conllu2json(input_data, n_sents=10, use_morphology=False, lang=None, **_): +def conllu2json( + input_data, + n_sents=10, + append_morphology=False, + lang=None, + ner_map=None, + merge_subtokens=False, + no_print=False, + **_ +): """ Convert conllu files into JSON format for use with train cli. - use_morphology parameter enables appending morphology to tags, which is + append_morphology parameter enables appending morphology to tags, which is useful for languages such as Spanish, where UD tags are not so rich. Extract NER tags if available and convert them so that they follow BILUO and the Wikipedia scheme """ - # by @dvsrepo, via #11 explosion/spacy-dev-resources - # by @katarkor + MISC_NER_PATTERN = "^((?:name|NE)=)?([BILU])-([A-Z_]+)|O$" + msg = Printer(no_print=no_print) + n_sents_info(msg, n_sents) docs = [] + raw = "" sentences = [] - conll_tuples = read_conllx(input_data, use_morphology=use_morphology) - checked_for_ner = False - has_ner_tags = False - for i, (raw_text, tokens) in enumerate(conll_tuples): - sentence, brackets = tokens[0] - if not checked_for_ner: - has_ner_tags = is_ner(sentence[5][0]) - checked_for_ner = True - sentences.append(generate_sentence(sentence, has_ner_tags)) + conll_data = read_conllx( + input_data, + append_morphology=append_morphology, + ner_tag_pattern=MISC_NER_PATTERN, + ner_map=ner_map, + merge_subtokens=merge_subtokens, + ) + has_ner_tags = has_ner(input_data, MISC_NER_PATTERN) + for i, example in enumerate(conll_data): + raw += example.text + sentences.append( + generate_sentence( + example.token_annotation, + has_ner_tags, + MISC_NER_PATTERN, + ner_map=ner_map, + ) + ) # Real-sized documents could be extracted using the comments on the - # conluu document + # conllu document if len(sentences) % n_sents == 0: - doc = create_doc(sentences, i) + doc = create_json_doc(raw, sentences, i) docs.append(doc) + raw = "" sentences = [] if sentences: - doc = create_doc(sentences, i) + doc = create_json_doc(raw, sentences, i) docs.append(doc) return docs -def is_ner(tag): +def has_ner(input_data, ner_tag_pattern): """ - Check the 10th column of the first token to determine if the file contains - NER tags + Check the MISC column for NER tags. 
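# Illustrative only (not from the patch): how the MISC_NER_PATTERN regex defined
# above picks NER tags out of the CoNLL-U MISC column. "name=B-PER", "NE=U-LOC"
# and a bare "O" match, while unrelated MISC features such as "SpaceAfter=No"
# do not. The sample values are invented for the example.
import re

MISC_NER_PATTERN = "^((?:name|NE)=)?([BILU])-([A-Z_]+)|O$"

for misc_part in ["name=B-PER", "NE=U-LOC", "O", "SpaceAfter=No"]:
    match = re.match(MISC_NER_PATTERN, misc_part)
    if match and match.group(2) and match.group(3):
        print(misc_part, "->", match.group(2) + "-" + match.group(3))  # e.g. B-PER
    elif match:
        print(misc_part, "-> O")
    else:
        print(misc_part, "-> no NER tag")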
""" - tag_match = re.match("([A-Z_]+)-([A-Z_]+)", tag) - if tag_match: - return True - elif tag == "O": - return True - else: - return False - - -def read_conllx(input_data, use_morphology=False, n=0): - i = 0 for sent in input_data.strip().split("\n\n"): lines = sent.strip().split("\n") if lines: while lines[0].startswith("#"): lines.pop(0) - tokens = [] for line in lines: - parts = line.split("\t") - id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts - if "-" in id_ or "." in id_: - continue - try: - id_ = int(id_) - 1 - head = (int(head) - 1) if head not in ["0", "_"] else id_ - dep = "ROOT" if dep == "root" else dep - tag = pos if tag == "_" else tag - tag = tag + "__" + morph if use_morphology else tag - iob = iob if iob else "O" - tokens.append((id_, word, tag, head, dep, iob)) - except: # noqa: E722 - print(line) - raise - tuples = [list(t) for t in zip(*tokens)] - yield (None, [[tuples, []]]) - i += 1 - if n >= 1 and i >= n: + id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts + for misc_part in misc.split("|"): + if re.match(ner_tag_pattern, misc_part): + return True + return False + + +def read_conllx( + input_data, + append_morphology=False, + merge_subtokens=False, + ner_tag_pattern="", + ner_map=None, +): + """ Yield examples, one for each sentence """ + vocab = Language.Defaults.create_vocab() # need vocab to make a minimal Doc + for sent in input_data.strip().split("\n\n"): + lines = sent.strip().split("\n") + if lines: + while lines[0].startswith("#"): + lines.pop(0) + example = example_from_conllu_sentence( + vocab, + lines, + ner_tag_pattern, + merge_subtokens=merge_subtokens, + append_morphology=append_morphology, + ner_map=ner_map, + ) + yield example + + +def get_entities(lines, tag_pattern, ner_map=None): + """Find entities in the MISC column according to the pattern and map to + final entity type with `ner_map` if mapping present. Entity tag is 'O' if + the pattern is not matched. + + lines (unicode): CONLL-U lines for one sentences + tag_pattern (unicode): Regex pattern for entity tag + ner_map (dict): Map old NER tag names to new ones, '' maps to O. + RETURNS (list): List of BILUO entity tags + """ + miscs = [] + for line in lines: + parts = line.split("\t") + id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts + if "-" in id_ or "." in id_: + continue + miscs.append(misc) + + iob = [] + for misc in miscs: + iob_tag = "O" + for misc_part in misc.split("|"): + tag_match = re.match(tag_pattern, misc_part) + if tag_match: + prefix = tag_match.group(2) + suffix = tag_match.group(3) + if prefix and suffix: + iob_tag = prefix + "-" + suffix + if ner_map: + suffix = ner_map.get(suffix, suffix) + if suffix == "": + iob_tag = "O" + else: + iob_tag = prefix + "-" + suffix break + iob.append(iob_tag) + return iob_to_biluo(iob) -def simplify_tags(iob): - """ - Simplify tags obtained from the dataset in order to follow Wikipedia - scheme (PER, LOC, ORG, MISC). 'PER', 'LOC' and 'ORG' keep their tags, while - 'GPE_LOC' is simplified to 'LOC', 'GPE_ORG' to 'ORG' and all remaining tags to - 'MISC'. 
- """ - new_iob = [] - for tag in iob: - tag_match = re.match("([A-Z_]+)-([A-Z_]+)", tag) - if tag_match: - prefix = tag_match.group(1) - suffix = tag_match.group(2) - if suffix == "GPE_LOC": - suffix = "LOC" - elif suffix == "GPE_ORG": - suffix = "ORG" - elif suffix != "PER" and suffix != "LOC" and suffix != "ORG": - suffix = "MISC" - tag = prefix + "-" + suffix - new_iob.append(tag) - return new_iob - - -def generate_sentence(sent, has_ner_tags): - (id_, word, tag, head, dep, iob) = sent +def generate_sentence(token_annotation, has_ner_tags, tag_pattern, ner_map=None): sentence = {} tokens = [] - if has_ner_tags: - iob = simplify_tags(iob) - biluo = iob_to_biluo(iob) - for i, id in enumerate(id_): + for i, id_ in enumerate(token_annotation.ids): token = {} - token["id"] = id - token["orth"] = word[i] - token["tag"] = tag[i] - token["head"] = head[i] - id - token["dep"] = dep[i] + token["id"] = id_ + token["orth"] = token_annotation.get_word(i) + token["tag"] = token_annotation.get_tag(i) + token["pos"] = token_annotation.get_pos(i) + token["lemma"] = token_annotation.get_lemma(i) + token["morph"] = token_annotation.get_morph(i) + token["head"] = token_annotation.get_head(i) - id_ + token["dep"] = token_annotation.get_dep(i) if has_ner_tags: - token["ner"] = biluo[i] + token["ner"] = token_annotation.get_entity(i) tokens.append(token) sentence["tokens"] = tokens return sentence -def create_doc(sentences, id): +def create_json_doc(raw, sentences, id_): doc = {} paragraph = {} - doc["id"] = id + doc["id"] = id_ doc["paragraphs"] = [] + paragraph["raw"] = raw.strip() paragraph["sentences"] = sentences doc["paragraphs"].append(paragraph) return doc + + +def example_from_conllu_sentence( + vocab, + lines, + ner_tag_pattern, + merge_subtokens=False, + append_morphology=False, + ner_map=None, +): + """Create an Example from the lines for one CoNLL-U sentence, merging + subtokens and appending morphology to tags if required. + + lines (unicode): The non-comment lines for a CoNLL-U sentence + ner_tag_pattern (unicode): The regex pattern for matching NER in MISC col + RETURNS (Example): An example containing the annotation + """ + # create a Doc with each subtoken as its own token + # if merging subtokens, each subtoken orth is the merged subtoken form + if not Token.has_extension("merged_orth"): + Token.set_extension("merged_orth", default="") + if not Token.has_extension("merged_lemma"): + Token.set_extension("merged_lemma", default="") + if not Token.has_extension("merged_morph"): + Token.set_extension("merged_morph", default="") + if not Token.has_extension("merged_spaceafter"): + Token.set_extension("merged_spaceafter", default="") + words, spaces, tags, poses, morphs, lemmas = [], [], [], [], [], [] + heads, deps = [], [] + subtok_word = "" + in_subtok = False + for i in range(len(lines)): + line = lines[i] + parts = line.split("\t") + id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts + if "." 
in id_: + continue + if "-" in id_: + in_subtok = True + if "-" in id_: + in_subtok = True + subtok_word = word + subtok_start, subtok_end = id_.split("-") + subtok_spaceafter = "SpaceAfter=No" not in misc + continue + if merge_subtokens and in_subtok: + words.append(subtok_word) + else: + words.append(word) + if in_subtok: + if id_ == subtok_end: + spaces.append(subtok_spaceafter) + else: + spaces.append(False) + elif "SpaceAfter=No" in misc: + spaces.append(False) + else: + spaces.append(True) + if in_subtok and id_ == subtok_end: + subtok_word = "" + in_subtok = False + id_ = int(id_) - 1 + head = (int(head) - 1) if head not in ("0", "_") else id_ + tag = pos if tag == "_" else tag + morph = morph if morph != "_" else "" + dep = "ROOT" if dep == "root" else dep + lemmas.append(lemma) + poses.append(pos) + tags.append(tag) + morphs.append(morph) + heads.append(head) + deps.append(dep) + + doc = Doc(vocab, words=words, spaces=spaces) + for i in range(len(doc)): + doc[i].tag_ = tags[i] + doc[i].pos_ = poses[i] + doc[i].dep_ = deps[i] + doc[i].lemma_ = lemmas[i] + doc[i].head = doc[heads[i]] + doc[i]._.merged_orth = words[i] + doc[i]._.merged_morph = morphs[i] + doc[i]._.merged_lemma = lemmas[i] + doc[i]._.merged_spaceafter = spaces[i] + ents = get_entities(lines, ner_tag_pattern, ner_map) + doc.ents = spans_from_biluo_tags(doc, ents) + doc.is_parsed = True + doc.is_tagged = True + + if merge_subtokens: + doc = merge_conllu_subtokens(lines, doc) + + # create Example from custom Doc annotation + ids, words, tags, heads, deps = [], [], [], [], [] + pos, lemmas, morphs, spaces = [], [], [], [] + for i, t in enumerate(doc): + ids.append(i) + words.append(t._.merged_orth) + if append_morphology and t._.merged_morph: + tags.append(t.tag_ + "__" + t._.merged_morph) + else: + tags.append(t.tag_) + pos.append(t.pos_) + morphs.append(t._.merged_morph) + lemmas.append(t._.merged_lemma) + heads.append(t.head.i) + deps.append(t.dep_) + spaces.append(t._.merged_spaceafter) + ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents] + ents = biluo_tags_from_offsets(doc, ent_offsets) + raw = "" + for word, space in zip(words, spaces): + raw += word + if space: + raw += " " + example = Example(doc=raw) + example.set_token_annotation( + ids=ids, + words=words, + tags=tags, + pos=pos, + morphs=morphs, + lemmas=lemmas, + heads=heads, + deps=deps, + entities=ents, + ) + return example + + +def merge_conllu_subtokens(lines, doc): + # identify and process all subtoken spans to prepare attrs for merging + subtok_spans = [] + for line in lines: + parts = line.split("\t") + id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts + if "-" in id_: + subtok_start, subtok_end = id_.split("-") + subtok_span = doc[int(subtok_start) - 1 : int(subtok_end)] + subtok_spans.append(subtok_span) + # create merged tag, morph, and lemma values + tags = [] + morphs = {} + lemmas = [] + for token in subtok_span: + tags.append(token.tag_) + lemmas.append(token.lemma_) + if token._.merged_morph: + for feature in token._.merged_morph.split("|"): + field, values = feature.split("=", 1) + if field not in morphs: + morphs[field] = set() + for value in values.split(","): + morphs[field].add(value) + # create merged features for each morph field + for field, values in morphs.items(): + morphs[field] = field + "=" + ",".join(sorted(values)) + # set the same attrs on all subtok tokens so that whatever head the + # retokenizer chooses, the final attrs are available on that token + for token in subtok_span: + 
token._.merged_orth = token.orth_ + token._.merged_lemma = " ".join(lemmas) + token.tag_ = "_".join(tags) + token._.merged_morph = "|".join(sorted(morphs.values())) + token._.merged_spaceafter = ( + True if subtok_span[-1].whitespace_ else False + ) + + with doc.retokenize() as retokenizer: + for span in subtok_spans: + retokenizer.merge(span) + + return doc diff --git a/spacy/cli/converters/iob2json.py b/spacy/cli/converters/iob2json.py index 61c398f8d..b6ac234fc 100644 --- a/spacy/cli/converters/iob2json.py +++ b/spacy/cli/converters/iob2json.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from wasabi import Printer from ...gold import iob_to_biluo diff --git a/spacy/cli/converters/jsonl2json.py b/spacy/cli/converters/jsonl2json.py index 1c1bc45c7..525063b22 100644 --- a/spacy/cli/converters/jsonl2json.py +++ b/spacy/cli/converters/jsonl2json.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import srsly from ...gold import docs_to_json diff --git a/spacy/cli/debug_data.py b/spacy/cli/debug_data.py index 7a4a093e2..21f49956d 100644 --- a/spacy/cli/debug_data.py +++ b/spacy/cli/debug_data.py @@ -1,9 +1,5 @@ -# coding: utf8 -from __future__ import unicode_literals, print_function - from pathlib import Path from collections import Counter -import plac import sys import srsly from wasabi import Printer, MESSAGES @@ -22,29 +18,18 @@ BLANK_MODEL_MIN_THRESHOLD = 100 BLANK_MODEL_THRESHOLD = 2000 -@plac.annotations( - # fmt: off - lang=("model language", "positional", None, str), - train_path=("location of JSON-formatted training data", "positional", None, Path), - dev_path=("location of JSON-formatted development data", "positional", None, Path), - tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path), - base_model=("name of model to update (optional)", "option", "b", str), - pipeline=("Comma-separated names of pipeline components to train", "option", "p", str), - ignore_warnings=("Ignore warnings, only show stats and errors", "flag", "IW", bool), - verbose=("Print additional information and explanations", "flag", "V", bool), - no_format=("Don't pretty-print the results", "flag", "NF", bool), - # fmt: on -) def debug_data( - lang, - train_path, - dev_path, - tag_map_path=None, - base_model=None, - pipeline="tagger,parser,ner", - ignore_warnings=False, - verbose=False, - no_format=False, + # fmt: off + lang: ("Model language", "positional", None, str), + train_path: ("Location of JSON-formatted training data", "positional", None, Path), + dev_path: ("Location of JSON-formatted development data", "positional", None, Path), + tag_map_path: ("Location of JSON-formatted tag map", "option", "tm", Path) = None, + base_model: ("Name of model to update (optional)", "option", "b", str) = None, + pipeline: ("Comma-separated names of pipeline components to train", "option", "p", str) = "tagger,parser,ner", + ignore_warnings: ("Ignore warnings, only show stats and errors", "flag", "IW", bool) = False, + verbose: ("Print additional information and explanations", "flag", "V", bool) = False, + no_format: ("Don't pretty-print the results", "flag", "NF", bool) = False, + # fmt: on ): """ Analyze, debug and validate your training and development data, get useful @@ -85,20 +70,16 @@ def debug_data( with msg.loading("Loading corpus..."): corpus = GoldCorpus(train_path, dev_path) try: - train_docs = list(corpus.train_docs(nlp)) - train_docs_unpreprocessed = list( - corpus.train_docs_without_preprocessing(nlp) + train_dataset = 
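# Hypothetical, self-contained sketch of the retokenizer pattern used by
# merge_conllu_subtokens() earlier in this patch: candidate spans are collected
# first and then merged inside a single doc.retokenize() block so token indices
# stay valid while merging. The text and span are invented for the example.
import spacy

nlp = spacy.blank("en")
doc = nlp("can not stop")
spans_to_merge = [doc[0:2]]  # pretend "can not" is a CoNLL-U subtoken range
with doc.retokenize() as retokenizer:
    for span in spans_to_merge:
        retokenizer.merge(span)
print([token.text for token in doc])  # ['can not', 'stop']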
list(corpus.train_dataset(nlp)) + train_dataset_unpreprocessed = list( + corpus.train_dataset_without_preprocessing(nlp) ) except ValueError as e: - loading_train_error_message = "Training data cannot be loaded: {}".format( - str(e) - ) + loading_train_error_message = f"Training data cannot be loaded: {e}" try: - dev_docs = list(corpus.dev_docs(nlp)) + dev_dataset = list(corpus.dev_dataset(nlp)) except ValueError as e: - loading_dev_error_message = "Development data cannot be loaded: {}".format( - str(e) - ) + loading_dev_error_message = f"Development data cannot be loaded: {e}" if loading_train_error_message or loading_dev_error_message: if loading_train_error_message: msg.fail(loading_train_error_message) @@ -107,82 +88,68 @@ def debug_data( sys.exit(1) msg.good("Corpus is loadable") - # Create all gold data here to avoid iterating over the train_docs constantly - gold_train_data = _compile_gold(train_docs, pipeline, nlp) + # Create all gold data here to avoid iterating over the train_dataset constantly + gold_train_data = _compile_gold(train_dataset, pipeline, nlp) gold_train_unpreprocessed_data = _compile_gold( - train_docs_unpreprocessed, pipeline, nlp + train_dataset_unpreprocessed, pipeline ) - gold_dev_data = _compile_gold(dev_docs, pipeline, nlp) + gold_dev_data = _compile_gold(dev_dataset, pipeline, nlp) train_texts = gold_train_data["texts"] dev_texts = gold_dev_data["texts"] msg.divider("Training stats") - msg.text("Training pipeline: {}".format(", ".join(pipeline))) + msg.text(f"Training pipeline: {', '.join(pipeline)}") for pipe in [p for p in pipeline if p not in nlp.factories]: - msg.fail("Pipeline component '{}' not available in factories".format(pipe)) + msg.fail(f"Pipeline component '{pipe}' not available in factories") if base_model: - msg.text("Starting with base model '{}'".format(base_model)) + msg.text(f"Starting with base model '{base_model}'") else: - msg.text("Starting with blank model '{}'".format(lang)) - msg.text("{} training docs".format(len(train_docs))) - msg.text("{} evaluation docs".format(len(dev_docs))) + msg.text(f"Starting with blank model '{lang}'") + msg.text(f"{len(train_dataset)} training docs") + msg.text(f"{len(dev_dataset)} evaluation docs") - if not len(dev_docs): + if not len(gold_dev_data): msg.fail("No evaluation docs") overlap = len(train_texts.intersection(dev_texts)) if overlap: - msg.warn("{} training examples also in evaluation data".format(overlap)) + msg.warn(f"{overlap} training examples also in evaluation data") else: msg.good("No overlap between training and evaluation data") - if not base_model and len(train_docs) < BLANK_MODEL_THRESHOLD: - text = "Low number of examples to train from a blank model ({})".format( - len(train_docs) + if not base_model and len(train_dataset) < BLANK_MODEL_THRESHOLD: + text = ( + f"Low number of examples to train from a blank model ({len(train_dataset)})" ) - if len(train_docs) < BLANK_MODEL_MIN_THRESHOLD: + if len(train_dataset) < BLANK_MODEL_MIN_THRESHOLD: msg.fail(text) else: msg.warn(text) msg.text( - "It's recommended to use at least {} examples (minimum {})".format( - BLANK_MODEL_THRESHOLD, BLANK_MODEL_MIN_THRESHOLD - ), + f"It's recommended to use at least {BLANK_MODEL_THRESHOLD} examples " + f"(minimum {BLANK_MODEL_MIN_THRESHOLD})", show=verbose, ) msg.divider("Vocab & Vectors") n_words = gold_train_data["n_words"] msg.info( - "{} total {} in the data ({} unique)".format( - n_words, "word" if n_words == 1 else "words", len(gold_train_data["words"]) - ) + f"{n_words} total word(s) in the data 
({len(gold_train_data['words'])} unique)" ) if gold_train_data["n_misaligned_words"] > 0: - msg.warn( - "{} misaligned tokens in the training data".format( - gold_train_data["n_misaligned_words"] - ) - ) + n_misaligned = gold_train_data["n_misaligned_words"] + msg.warn(f"{n_misaligned} misaligned tokens in the training data") if gold_dev_data["n_misaligned_words"] > 0: - msg.warn( - "{} misaligned tokens in the dev data".format( - gold_dev_data["n_misaligned_words"] - ) - ) + n_misaligned = gold_dev_data["n_misaligned_words"] + msg.warn(f"{n_misaligned} misaligned tokens in the dev data") most_common_words = gold_train_data["words"].most_common(10) msg.text( - "10 most common words: {}".format( - _format_labels(most_common_words, counts=True) - ), + f"10 most common words: {_format_labels(most_common_words, counts=True)}", show=verbose, ) if len(nlp.vocab.vectors): msg.info( - "{} vectors ({} unique keys, {} dimensions)".format( - len(nlp.vocab.vectors), - nlp.vocab.vectors.n_keys, - nlp.vocab.vectors_length, - ) + f"{len(nlp.vocab.vectors)} vectors ({nlp.vocab.vectors.n_keys} " + f"unique keys, {nlp.vocab.vectors_length} dimensions)" ) n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values()) msg.warn( @@ -205,7 +172,7 @@ def debug_data( if "ner" in pipeline: # Get all unique NER labels present in the data labels = set( - label for label in gold_train_data["ner"] if label not in ("O", "-") + label for label in gold_train_data["ner"] if label not in ("O", "-", None) ) label_counts = gold_train_data["ner"] model_labels = _get_labels_from_model(nlp, "ner") @@ -218,19 +185,10 @@ def debug_data( msg.divider("Named Entity Recognition") msg.info( - "{} new {}, {} existing {}".format( - len(new_labels), - "label" if len(new_labels) == 1 else "labels", - len(existing_labels), - "label" if len(existing_labels) == 1 else "labels", - ) + f"{len(new_labels)} new label(s), {len(existing_labels)} existing label(s)" ) missing_values = label_counts["-"] - msg.text( - "{} missing {} (tokens with '-' label)".format( - missing_values, "value" if missing_values == 1 else "values" - ) - ) + msg.text(f"{missing_values} missing value(s) (tokens with '-' label)") for label in new_labels: if len(label) == 0: msg.fail("Empty label found in new labels") @@ -241,43 +199,28 @@ def debug_data( if label != "-" ] labels_with_counts = _format_labels(labels_with_counts, counts=True) - msg.text("New: {}".format(labels_with_counts), show=verbose) + msg.text(f"New: {labels_with_counts}", show=verbose) if existing_labels: - msg.text( - "Existing: {}".format(_format_labels(existing_labels)), show=verbose - ) - + msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose) if gold_train_data["ws_ents"]: - msg.fail( - "{} invalid whitespace entity span(s)".format( - gold_train_data["ws_ents"] - ) - ) + msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans") has_ws_ents_error = True if gold_train_data["punct_ents"]: - msg.warn( - "{} entity span(s) with punctuation".format( - gold_train_data["punct_ents"] - ) - ) + msg.warn(f"{gold_train_data['punct_ents']} entity span(s) with punctuation") has_punct_ents_warning = True for label in new_labels: if label_counts[label] <= NEW_LABEL_THRESHOLD: msg.warn( - "Low number of examples for new label '{}' ({})".format( - label, label_counts[label] - ) + f"Low number of examples for new label '{label}' ({label_counts[label]})" ) has_low_data_warning = True with msg.loading("Analyzing label distribution..."): - neg_docs = 
_get_examples_without_label(train_docs, label) + neg_docs = _get_examples_without_label(train_dataset, label) if neg_docs == 0: - msg.warn( - "No examples for texts WITHOUT new label '{}'".format(label) - ) + msg.warn(f"No examples for texts WITHOUT new label '{label}'") has_no_neg_warning = True if not has_low_data_warning: @@ -291,8 +234,8 @@ def debug_data( if has_low_data_warning: msg.text( - "To train a new entity type, your data should include at " - "least {} instances of the new label".format(NEW_LABEL_THRESHOLD), + f"To train a new entity type, your data should include at " + f"least {NEW_LABEL_THRESHOLD} instances of the new label", show=verbose, ) if has_no_neg_warning: @@ -321,27 +264,21 @@ def debug_data( new_labels = [l for l in labels if l not in model_labels] existing_labels = [l for l in labels if l in model_labels] msg.info( - "Text Classification: {} new label(s), {} existing label(s)".format( - len(new_labels), len(existing_labels) - ) + f"Text Classification: {len(new_labels)} new label(s), " + f"{len(existing_labels)} existing label(s)" ) if new_labels: labels_with_counts = _format_labels( gold_train_data["cats"].most_common(), counts=True ) - msg.text("New: {}".format(labels_with_counts), show=verbose) + msg.text(f"New: {labels_with_counts}", show=verbose) if existing_labels: - msg.text( - "Existing: {}".format(_format_labels(existing_labels)), show=verbose - ) + msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose) if set(gold_train_data["cats"]) != set(gold_dev_data["cats"]): msg.fail( - "The train and dev labels are not the same. " - "Train labels: {}. " - "Dev labels: {}.".format( - _format_labels(gold_train_data["cats"]), - _format_labels(gold_dev_data["cats"]), - ) + f"The train and dev labels are not the same. " + f"Train labels: {_format_labels(gold_train_data['cats'])}. " + f"Dev labels: {_format_labels(gold_dev_data['cats'])}." ) if gold_train_data["n_cats_multilabel"] > 0: msg.info( @@ -371,27 +308,16 @@ def debug_data( msg.divider("Part-of-speech Tagging") labels = [label for label in gold_train_data["tags"]] tag_map = nlp.vocab.morphology.tag_map - msg.info( - "{} {} in data ({} {} in tag map)".format( - len(labels), - "label" if len(labels) == 1 else "labels", - len(tag_map), - "label" if len(tag_map) == 1 else "labels", - ) - ) + msg.info(f"{len(labels)} label(s) in data ({len(tag_map)} label(s) in tag map)") labels_with_counts = _format_labels( gold_train_data["tags"].most_common(), counts=True ) msg.text(labels_with_counts, show=verbose) non_tagmap = [l for l in labels if l not in tag_map] if not non_tagmap: - msg.good("All labels present in tag map for language '{}'".format(nlp.lang)) + msg.good(f"All labels present in tag map for language '{nlp.lang}'") for label in non_tagmap: - msg.fail( - "Label '{}' not found in tag map for language '{}'".format( - label, nlp.lang - ) - ) + msg.fail(f"Label '{label}' not found in tag map for language '{nlp.lang}'") if "parser" in pipeline: has_low_data_warning = False @@ -399,21 +325,18 @@ def debug_data( # profile sentence length msg.info( - "Found {} sentence{} with an average length of {:.1f} words.".format( - gold_train_data["n_sents"], - "s" if len(train_docs) > 1 else "", - gold_train_data["n_words"] / gold_train_data["n_sents"], - ) + f"Found {gold_train_data['n_sents']} sentence(s) with an average " + f"length of {gold_train_data['n_words'] / gold_train_data['n_sents']:.1f} words." 
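# The msg.* calls throughout debug_data.py come from wasabi's Printer (imported
# at the top of the file); a tiny standalone sketch with invented messages,
# assuming only that wasabi is installed.
from wasabi import Printer

msg = Printer(no_print=False)  # no_print=True only collects output, as in convert.py
msg.divider("Example checks")
msg.good("Corpus is loadable")
msg.warn("2 misaligned tokens in the training data")
msg.text("10 most common words: 'the' (42), 'a' (17)", show=True)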
) # check for documents with multiple sentences sents_per_doc = gold_train_data["n_sents"] / len(gold_train_data["texts"]) if sents_per_doc < 1.1: msg.warn( - "The training data contains {:.2f} sentences per " - "document. When there are very few documents containing more " - "than one sentence, the parser will not learn how to segment " - "longer texts into sentences.".format(sents_per_doc) + f"The training data contains {sents_per_doc:.2f} sentences per " + f"document. When there are very few documents containing more " + f"than one sentence, the parser will not learn how to segment " + f"longer texts into sentences." ) # profile labels @@ -424,32 +347,13 @@ def debug_data( labels_dev = [label for label in gold_dev_data["deps"]] if gold_train_unpreprocessed_data["n_nonproj"] > 0: - msg.info( - "Found {} nonprojective train sentence{}".format( - gold_train_unpreprocessed_data["n_nonproj"], - "s" if gold_train_unpreprocessed_data["n_nonproj"] > 1 else "", - ) - ) + n_nonproj = gold_train_unpreprocessed_data["n_nonproj"] + msg.info(f"Found {n_nonproj} nonprojective train sentence(s)") if gold_dev_data["n_nonproj"] > 0: - msg.info( - "Found {} nonprojective dev sentence{}".format( - gold_dev_data["n_nonproj"], - "s" if gold_dev_data["n_nonproj"] > 1 else "", - ) - ) - - msg.info( - "{} {} in train data".format( - len(labels_train_unpreprocessed), - "label" if len(labels_train) == 1 else "labels", - ) - ) - msg.info( - "{} {} in projectivized train data".format( - len(labels_train), "label" if len(labels_train) == 1 else "labels" - ) - ) - + n_nonproj = gold_dev_data["n_nonproj"] + msg.info(f"Found {n_nonproj} nonprojective dev sentence(s)") + msg.info(f"{labels_train_unpreprocessed} label(s) in train data") + msg.info(f"{len(labels_train)} label(s) in projectivized train data") labels_with_counts = _format_labels( gold_train_unpreprocessed_data["deps"].most_common(), counts=True ) @@ -459,9 +363,8 @@ def debug_data( for label in gold_train_unpreprocessed_data["deps"]: if gold_train_unpreprocessed_data["deps"][label] <= DEP_LABEL_THRESHOLD: msg.warn( - "Low number of examples for label '{}' ({})".format( - label, gold_train_unpreprocessed_data["deps"][label] - ) + f"Low number of examples for label '{label}' " + f"({gold_train_unpreprocessed_data['deps'][label]})" ) has_low_data_warning = True @@ -470,22 +373,19 @@ def debug_data( for label in gold_train_data["deps"]: if gold_train_data["deps"][label] <= DEP_LABEL_THRESHOLD and "||" in label: rare_projectivized_labels.append( - "{}: {}".format(label, str(gold_train_data["deps"][label])) + f"{label}: {gold_train_data['deps'][label]}" ) if len(rare_projectivized_labels) > 0: msg.warn( - "Low number of examples for {} label{} in the " - "projectivized dependency trees used for training. You may " - "want to projectivize labels such as punct before " - "training in order to improve parser performance.".format( - len(rare_projectivized_labels), - "s" if len(rare_projectivized_labels) > 1 else "", - ) + f"Low number of examples for {len(rare_projectivized_labels)} " + "label(s) in the projectivized dependency trees used for " + "training. You may want to projectivize labels such as punct " + "before training in order to improve parser performance." 
) msg.warn( - "Projectivized labels with low numbers of examples: " - "{}".format("\n".join(rare_projectivized_labels)), + f"Projectivized labels with low numbers of examples: ", + ", ".join(rare_projectivized_labels), show=verbose, ) has_low_data_warning = True @@ -493,50 +393,44 @@ def debug_data( # labels only in train if set(labels_train) - set(labels_dev): msg.warn( - "The following labels were found only in the train data: " - "{}".format(", ".join(set(labels_train) - set(labels_dev))), + "The following labels were found only in the train data:", + ", ".join(set(labels_train) - set(labels_dev)), show=verbose, ) # labels only in dev if set(labels_dev) - set(labels_train): msg.warn( - "The following labels were found only in the dev data: " - + ", ".join(set(labels_dev) - set(labels_train)), + "The following labels were found only in the dev data:", + ", ".join(set(labels_dev) - set(labels_train)), show=verbose, ) if has_low_data_warning: msg.text( - "To train a parser, your data should include at " - "least {} instances of each label.".format(DEP_LABEL_THRESHOLD), + f"To train a parser, your data should include at " + f"least {DEP_LABEL_THRESHOLD} instances of each label.", show=verbose, ) # multiple root labels if len(gold_train_unpreprocessed_data["roots"]) > 1: msg.warn( - "Multiple root labels ({}) ".format( - ", ".join(gold_train_unpreprocessed_data["roots"]) - ) - + "found in training data. spaCy's parser uses a single root " - "label ROOT so this distinction will not be available." + f"Multiple root labels " + f"({', '.join(gold_train_unpreprocessed_data['roots'])}) " + f"found in training data. spaCy's parser uses a single root " + f"label ROOT so this distinction will not be available." ) # these should not happen, but just in case if gold_train_data["n_nonproj"] > 0: msg.fail( - "Found {} nonprojective projectivized train sentence{}".format( - gold_train_data["n_nonproj"], - "s" if gold_train_data["n_nonproj"] > 1 else "", - ) + f"Found {gold_train_data['n_nonproj']} nonprojective " + f"projectivized train sentence(s)" ) if gold_train_data["n_cycles"] > 0: msg.fail( - "Found {} projectivized train sentence{} with cycles".format( - gold_train_data["n_cycles"], - "s" if gold_train_data["n_cycles"] > 1 else "", - ) + f"Found {gold_train_data['n_cycles']} projectivized train sentence(s) with cycles" ) msg.divider("Summary") @@ -544,42 +438,34 @@ def debug_data( warn_counts = msg.counts[MESSAGES.WARN] fail_counts = msg.counts[MESSAGES.FAIL] if good_counts: - msg.good( - "{} {} passed".format( - good_counts, "check" if good_counts == 1 else "checks" - ) - ) + msg.good(f"{good_counts} {'check' if good_counts == 1 else 'checks'} passed") if warn_counts: - msg.warn( - "{} {}".format(warn_counts, "warning" if warn_counts == 1 else "warnings") - ) - if fail_counts: - msg.fail("{} {}".format(fail_counts, "error" if fail_counts == 1 else "errors")) - + msg.warn(f"{warn_counts} {'warning' if warn_counts == 1 else 'warnings'}") if fail_counts: + msg.fail(f"{fail_counts} {'error' if fail_counts == 1 else 'errors'}") sys.exit(1) def _load_file(file_path, msg): file_name = file_path.parts[-1] if file_path.suffix == ".json": - with msg.loading("Loading {}...".format(file_name)): + with msg.loading(f"Loading {file_name}..."): data = srsly.read_json(file_path) - msg.good("Loaded {}".format(file_name)) + msg.good(f"Loaded {file_name}") return data elif file_path.suffix == ".jsonl": - with msg.loading("Loading {}...".format(file_name)): + with msg.loading(f"Loading {file_name}..."): data = 
srsly.read_jsonl(file_path) - msg.good("Loaded {}".format(file_name)) + msg.good(f"Loaded {file_name}") return data msg.fail( - "Can't load file extension {}".format(file_path.suffix), + f"Can't load file extension {file_path.suffix}", "Expected .json or .jsonl", exits=1, ) -def _compile_gold(train_docs, pipeline, nlp): +def _compile_gold(examples, pipeline, nlp): data = { "ner": Counter(), "cats": Counter(), @@ -598,7 +484,9 @@ def _compile_gold(train_docs, pipeline, nlp): "n_cats_multilabel": 0, "texts": set(), } - for doc, gold in train_docs: + for example in examples: + gold = example.gold + doc = example.doc valid_words = [x for x in gold.words if x is not None] data["words"].update(valid_words) data["n_words"] += len(valid_words) @@ -651,17 +539,17 @@ def _compile_gold(train_docs, pipeline, nlp): def _format_labels(labels, counts=False): if counts: - return ", ".join(["'{}' ({})".format(l, c) for l, c in labels]) - return ", ".join(["'{}'".format(l) for l in labels]) + return ", ".join([f"'{l}' ({c})" for l, c in labels]) + return ", ".join([f"'{l}'" for l in labels]) def _get_examples_without_label(data, label): count = 0 - for doc, gold in data: + for ex in data: labels = [ label.split("-")[1] - for label in gold.ner - if label is not None and label not in ("O", "-") + for label in ex.gold.ner + if label not in ("O", "-", None) ] if label not in labels: count += 1 diff --git a/spacy/cli/download.py b/spacy/cli/download.py index 19f3e7860..0230e272d 100644 --- a/spacy/cli/download.py +++ b/spacy/cli/download.py @@ -1,28 +1,21 @@ -# coding: utf8 -from __future__ import unicode_literals - -import plac import requests import os import subprocess import sys from wasabi import msg -from .link import link -from ..util import get_package_path from .. import about -@plac.annotations( - model=("Model to download (shortcut or name)", "positional", None, str), - direct=("Force direct download of name + version", "flag", "d", bool), - pip_args=("Additional arguments to be passed to `pip install` on model install"), -) -def download(model, direct=False, *pip_args): +def download( + model: ("Model to download (shortcut or name)", "positional", None, str), + direct: ("Force direct download of name + version", "flag", "d", bool) = False, + *pip_args: ("Additional arguments to be passed to `pip install` on model install"), +): """ - Download compatible model from default download path using pip. Model - can be shortcut, model name or, if --direct flag is set, full model name - with version. For direct downloads, the compatibility check will be skipped. + Download compatible model from default download path using pip. If --direct + flag is set, the command expects the full model name with version. + For direct downloads, the compatibility check will be skipped. """ if not require_package("spacy") and "--no-deps" not in pip_args: msg.warn( @@ -50,30 +43,8 @@ def download(model, direct=False, *pip_args): sys.exit(dl) msg.good( "Download and installation successful", - "You can now load the model via spacy.load('{}')".format(model_name), + f"You can now load the model via spacy.load('{model_name}')", ) - # Only create symlink if the model is installed via a shortcut like 'en'. - # There's no real advantage over an additional symlink for en_core_web_sm - # and if anything, it's more error prone and causes more confusion. 
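# With the shortcut-link logic removed in this patch, models are loaded by their
# full package name. A minimal sketch; it assumes the package was installed
# first, e.g. via `python -m spacy download en_core_web_sm`.
import spacy

nlp = spacy.load("en_core_web_sm")  # full package name instead of the old "en" shortcut
doc = nlp("spaCy now targets Python 3.6 and above.")
print([(token.text, token.pos_) for token in doc])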
- if model in shortcuts: - try: - # Get package path here because link uses - # pip.get_installed_distributions() to check if model is a - # package, which fails if model was just installed via - # subprocess - package_path = get_package_path(model_name) - link(model_name, model, force=True, model_path=package_path) - except: # noqa: E722 - # Dirty, but since spacy.download and the auto-linking is - # mostly a convenience wrapper, it's best to show a success - # message and loading instructions, even if linking fails. - msg.warn( - "Download successful but linking failed", - "Creating a shortcut link for '{}' didn't work (maybe you " - "don't have admin permissions?), but you can still load " - "the model via its full package name: " - "nlp = spacy.load('{}')".format(model, model_name), - ) # If a model is downloaded and then loaded within the same process, our # is_package check currently fails, because pkg_resources.working_set # is not refreshed automatically (see #3923). We're trying to work @@ -95,11 +66,11 @@ def get_json(url, desc): r = requests.get(url) if r.status_code != 200: msg.fail( - "Server error ({})".format(r.status_code), - "Couldn't fetch {}. Please find a model for your spaCy " - "installation (v{}), and download it manually. For more " - "details, see the documentation: " - "https://spacy.io/usage/models".format(desc, about.__version__), + f"Server error ({r.status_code})", + f"Couldn't fetch {desc}. Please find a model for your spaCy " + f"installation (v{about.__version__}), and download it manually. " + f"For more details, see the documentation: " + f"https://spacy.io/usage/models", exits=1, ) return r.json() @@ -111,7 +82,7 @@ def get_compatibility(): comp_table = get_json(about.__compatibility__, "compatibility table") comp = comp_table["spacy"] if version not in comp: - msg.fail("No compatible models found for v{} of spaCy".format(version), exits=1) + msg.fail(f"No compatible models found for v{version} of spaCy", exits=1) return comp[version] @@ -119,8 +90,7 @@ def get_version(model, comp): model = model.rsplit(".dev", 1)[0] if model not in comp: msg.fail( - "No compatible model found for '{}' " - "(spaCy v{}).".format(model, about.__version__), + f"No compatible model found for '{model}' (spaCy v{about.__version__})", exits=1, ) return comp[model][0] diff --git a/spacy/cli/evaluate.py b/spacy/cli/evaluate.py index 8a84684e5..735e304f9 100644 --- a/spacy/cli/evaluate.py +++ b/spacy/cli/evaluate.py @@ -1,8 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals, division, print_function - -import plac -import spacy from timeit import default_timer as timer from wasabi import msg @@ -11,23 +6,16 @@ from .. import util from .. 
import displacy -@plac.annotations( - model=("Model name or path", "positional", None, str), - data_path=("Location of JSON-formatted evaluation data", "positional", None, str), - gold_preproc=("Use gold preprocessing", "flag", "G", bool), - gpu_id=("Use GPU", "option", "g", int), - displacy_path=("Directory to output rendered parses as HTML", "option", "dp", str), - displacy_limit=("Limit of parses to render as HTML", "option", "dl", int), - return_scores=("Return dict containing model scores", "flag", "R", bool), -) def evaluate( - model, - data_path, - gpu_id=-1, - gold_preproc=False, - displacy_path=None, - displacy_limit=25, - return_scores=False, + # fmt: off + model: ("Model name or path", "positional", None, str), + data_path: ("Location of JSON-formatted evaluation data", "positional", None, str), + gpu_id: ("Use GPU", "option", "g", int) = -1, + gold_preproc: ("Use gold preprocessing", "flag", "G", bool) = False, + displacy_path: ("Directory to output rendered parses as HTML", "option", "dp", str) = None, + displacy_limit: ("Limit of parses to render as HTML", "option", "dl", int) = 25, + return_scores: ("Return dict containing model scores", "flag", "R", bool) = False, + # fmt: on ): """ Evaluate a model. To render a sample of parses in a HTML file, set an @@ -45,31 +33,36 @@ def evaluate( msg.fail("Visualization output directory not found", displacy_path, exits=1) corpus = GoldCorpus(data_path, data_path) if model.startswith("blank:"): - nlp = spacy.blank(model.replace("blank:", "")) + nlp = util.get_lang_class(model.replace("blank:", ""))() else: nlp = util.load_model(model) - dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc)) + dev_dataset = list(corpus.dev_dataset(nlp, gold_preproc=gold_preproc)) begin = timer() - scorer = nlp.evaluate(dev_docs, verbose=False) + scorer = nlp.evaluate(dev_dataset, verbose=False) end = timer() - nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs) + nwords = sum(len(ex.doc) for ex in dev_dataset) results = { - "Time": "%.2f s" % (end - begin), + "Time": f"{end - begin:.2f} s", "Words": nwords, - "Words/s": "%.0f" % (nwords / (end - begin)), - "TOK": "%.2f" % scorer.token_acc, - "POS": "%.2f" % scorer.tags_acc, - "UAS": "%.2f" % scorer.uas, - "LAS": "%.2f" % scorer.las, - "NER P": "%.2f" % scorer.ents_p, - "NER R": "%.2f" % scorer.ents_r, - "NER F": "%.2f" % scorer.ents_f, - "Textcat": "%.2f" % scorer.textcat_score, + "Words/s": f"{nwords / (end - begin):.0f}", + "TOK": f"{scorer.token_acc:.2f}", + "TAG": f"{scorer.tags_acc:.2f}", + "POS": f"{scorer.pos_acc:.2f}", + "MORPH": f"{scorer.morphs_acc:.2f}", + "UAS": f"{scorer.uas:.2f}", + "LAS": f"{scorer.las:.2f}", + "NER P": f"{scorer.ents_p:.2f}", + "NER R": f"{scorer.ents_r:.2f}", + "NER F": f"{scorer.ents_f:.2f}", + "Textcat": f"{scorer.textcat_score:.2f}", + "Sent P": f"{scorer.sent_p:.2f}", + "Sent R": f"{scorer.sent_r:.2f}", + "Sent F": f"{scorer.sent_f:.2f}", } msg.table(results, title="Results") if displacy_path: - docs, golds = zip(*dev_docs) + docs = [ex.doc for ex in dev_dataset] render_deps = "parser" in nlp.meta.get("pipeline", []) render_ents = "ner" in nlp.meta.get("pipeline", []) render_parses( @@ -80,7 +73,7 @@ def evaluate( deps=render_deps, ents=render_ents, ) - msg.good("Generated {} parses as HTML".format(displacy_limit), displacy_path) + msg.good(f"Generated {displacy_limit} parses as HTML", displacy_path) if return_scores: return scorer.scores diff --git a/spacy/cli/info.py b/spacy/cli/info.py index 080d0dc77..23f766368 100644 --- a/spacy/cli/info.py +++ 
b/spacy/cli/info.py @@ -1,44 +1,39 @@ -# coding: utf8 -from __future__ import unicode_literals - -import plac import platform from pathlib import Path from wasabi import msg import srsly -from ..compat import path2str, basestring_, unicode_ +from .validate import get_model_pkgs from .. import util from .. import about -@plac.annotations( - model=("Optional shortcut link of model", "positional", None, str), - markdown=("Generate Markdown for GitHub issues", "flag", "md", str), - silent=("Don't print anything (just return)", "flag", "s"), -) -def info(model=None, markdown=False, silent=False): +def info( + model: ("Optional model name", "positional", None, str) = None, + markdown: ("Generate Markdown for GitHub issues", "flag", "md", str) = False, + silent: ("Don't print anything (just return)", "flag", "s") = False, +): """ - Print info about spaCy installation. If a model shortcut link is - speficied as an argument, print model information. Flag --markdown - prints details in Markdown for easy copy-pasting to GitHub issues. + Print info about spaCy installation. If a model is speficied as an argument, + print model information. Flag --markdown prints details in Markdown for easy + copy-pasting to GitHub issues. """ if model: if util.is_package(model): model_path = util.get_package_path(model) else: - model_path = util.get_data_path() / model + model_path = model meta_path = model_path / "meta.json" if not meta_path.is_file(): msg.fail("Can't find model meta.json", meta_path, exits=1) meta = srsly.read_json(meta_path) if model_path.resolve() != model_path: - meta["link"] = path2str(model_path) - meta["source"] = path2str(model_path.resolve()) + meta["link"] = str(model_path) + meta["source"] = str(model_path.resolve()) else: - meta["source"] = path2str(model_path) + meta["source"] = str(model_path) if not silent: - title = "Info about model '{}'".format(model) + title = f"Info about model '{model}'" model_meta = { k: v for k, v in meta.items() if k not in ("accuracy", "speed") } @@ -47,12 +42,13 @@ def info(model=None, markdown=False, silent=False): else: msg.table(model_meta, title=title) return meta + all_models, _ = get_model_pkgs() data = { "spaCy version": about.__version__, - "Location": path2str(Path(__file__).parent.parent), + "Location": str(Path(__file__).parent.parent), "Platform": platform.platform(), "Python version": platform.python_version(), - "Models": list_models(), + "Models": ", ".join(model["name"] for model in all_models.values()), } if not silent: title = "Info about spaCy" @@ -63,19 +59,6 @@ def info(model=None, markdown=False, silent=False): return data -def list_models(): - def exclude_dir(dir_name): - # exclude common cache directories and hidden directories - exclude = ("cache", "pycache", "__pycache__") - return dir_name in exclude or dir_name.startswith(".") - - data_path = util.get_data_path() - if data_path: - models = [f.parts[-1] for f in data_path.iterdir() if f.is_dir()] - return ", ".join([m for m in models if not exclude_dir(m)]) - return "-" - - def print_markdown(data, title=None): """Print data in GitHub-flavoured Markdown format for issues etc. 
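# Standalone sketch of the Markdown output that print_markdown() produces for
# GitHub issues; the data dict here is invented for the example.
data = {"spaCy version": "3.0.0.dev0", "Platform": "Linux-5.4-x86_64"}
markdown = [f"* **{key}:** {value}" for key, value in data.items()]
print("\n## Info about spaCy")
print("\n{}\n".format("\n".join(markdown)))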
@@ -84,9 +67,9 @@ def print_markdown(data, title=None): """ markdown = [] for key, value in data.items(): - if isinstance(value, basestring_) and Path(value).exists(): + if isinstance(value, str) and Path(value).exists(): continue - markdown.append("* **{}:** {}".format(key, unicode_(value))) + markdown.append(f"* **{key}:** {value}") if title: - print("\n## {}".format(title)) + print(f"\n## {title}") print("\n{}\n".format("\n".join(markdown))) diff --git a/spacy/cli/init_model.py b/spacy/cli/init_model.py index 7fdd39932..700fa43de 100644 --- a/spacy/cli/init_model.py +++ b/spacy/cli/init_model.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - -import plac import math from tqdm import tqdm import numpy @@ -20,7 +16,6 @@ from ..errors import Errors, Warnings from ..util import ensure_path, get_lang_class, load_model, OOV_RANK from ..lookups import Lookups - try: import ftfy except ImportError: @@ -30,43 +25,21 @@ except ImportError: DEFAULT_OOV_PROB = -20 -@plac.annotations( - lang=("Model language", "positional", None, str), - output_dir=("Model output directory", "positional", None, Path), - freqs_loc=("Location of words frequencies file", "option", "f", Path), - jsonl_loc=("Location of JSONL-formatted attributes file", "option", "j", Path), - clusters_loc=("Optional location of brown clusters data", "option", "c", str), - vectors_loc=("Optional vectors file in Word2Vec format", "option", "v", str), - truncate_vectors=( - "Optional number of vectors to truncate to when reading in vectors file", - "option", - "t", - int, - ), - prune_vectors=("Optional number of vectors to prune to", "option", "V", int), - vectors_name=( - "Optional name for the word vectors, e.g. en_core_web_lg.vectors", - "option", - "vn", - str, - ), - model_name=("Optional name for the model meta", "option", "mn", str), - omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool), - base_model=("Base model (for languages with custom tokenizers)", "option", "b", str), -) def init_model( - lang, - output_dir, - freqs_loc=None, - clusters_loc=None, - jsonl_loc=None, - vectors_loc=None, - truncate_vectors=0, - prune_vectors=-1, - vectors_name=None, - model_name=None, - omit_extra_lookups=False, - base_model=None, + # fmt: off + lang: ("Model language", "positional", None, str), + output_dir: ("Model output directory", "positional", None, Path), + freqs_loc: ("Location of words frequencies file", "option", "f", Path) = None, + clusters_loc: ("Optional location of brown clusters data", "option", "c", str) = None, + jsonl_loc: ("Location of JSONL-formatted attributes file", "option", "j", Path) = None, + vectors_loc: ("Optional vectors file in Word2Vec format", "option", "v", str) = None, + prune_vectors: ("Optional number of vectors to prune to", "option", "V", int) = -1, + truncate_vectors: ("Optional number of vectors to truncate to when reading in vectors file", "option", "t", int) = 0, + vectors_name: ("Optional name for the word vectors, e.g. 
en_core_web_lg.vectors", "option", "vn", str) = None, + model_name: ("Optional name for the model meta", "option", "mn", str) = None, + omit_extra_lookups: ("Don't include extra lookups in model", "flag", "OEL", bool) = False, + base_model: ("Base model (for languages with custom tokenizers)", "option", "b", str) = None + # fmt: on ): """ Create a new model from raw data, like word frequencies, Brown clusters @@ -114,8 +87,7 @@ def init_model( vec_added = len(nlp.vocab.vectors) lex_added = len(nlp.vocab) msg.good( - "Sucessfully compiled vocab", - "{} entries, {} vectors".format(lex_added, vec_added), + "Sucessfully compiled vocab", f"{lex_added} entries, {vec_added} vectors", ) if not output_dir.exists(): output_dir.mkdir() @@ -203,9 +175,9 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None): nlp.vocab.vectors.add(lex.orth, row=lex.rank) else: if vectors_loc: - with msg.loading("Reading vectors from {}".format(vectors_loc)): - vectors_data, vector_keys = read_vectors(vectors_loc, truncate_vectors) - msg.good("Loaded vectors from {}".format(vectors_loc)) + with msg.loading(f"Reading vectors from {vectors_loc}"): + vectors_data, vector_keys = read_vectors(vectors_loc) + msg.good(f"Loaded vectors from {vectors_loc}") else: vectors_data, vector_keys = (None, None) if vector_keys is not None: @@ -215,7 +187,7 @@ def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None): if vectors_data is not None: nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys) if name is None: - nlp.vocab.vectors.name = "%s_model.vectors" % nlp.meta["lang"] + nlp.vocab.vectors.name = f"{nlp.meta['lang']}_model.vectors" else: nlp.vocab.vectors.name = name nlp.meta["vectors"]["name"] = nlp.vocab.vectors.name @@ -265,7 +237,7 @@ def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50): word = literal_eval(key) except SyntaxError: # Take odd strings literally. - word = literal_eval("'%s'" % key) + word = literal_eval(f"'{key}'") smooth_count = counts.smoother(int(freq)) probs[word] = math.log(smooth_count) - log_total oov_prob = math.log(counts.smoother(0)) - log_total diff --git a/spacy/cli/link.py b/spacy/cli/link.py deleted file mode 100644 index 8117829b5..000000000 --- a/spacy/cli/link.py +++ /dev/null @@ -1,77 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import plac -from pathlib import Path -from wasabi import msg - -from ..compat import symlink_to, path2str -from .. import util - - -@plac.annotations( - origin=("package name or local path to model", "positional", None, str), - link_name=("name of shortuct link to create", "positional", None, str), - force=("force overwriting of existing link", "flag", "f", bool), -) -def link(origin, link_name, force=False, model_path=None): - """ - Create a symlink for models within the spacy/data directory. Accepts - either the name of a pip package, or the local path to the model data - directory. Linking models allows loading them via spacy.load(link_name). 
- """ - if util.is_package(origin): - model_path = util.get_package_path(origin) - else: - model_path = Path(origin) if model_path is None else Path(model_path) - if not model_path.exists(): - msg.fail( - "Can't locate model data", - "The data should be located in {}".format(path2str(model_path)), - exits=1, - ) - data_path = util.get_data_path() - if not data_path or not data_path.exists(): - spacy_loc = Path(__file__).parent.parent - msg.fail( - "Can't find the spaCy data path to create model symlink", - "Make sure a directory `/data` exists within your spaCy " - "installation and try again. The data directory should be located " - "here:".format(path=spacy_loc), - exits=1, - ) - link_path = util.get_data_path() / link_name - if link_path.is_symlink() and not force: - msg.fail( - "Link '{}' already exists".format(link_name), - "To overwrite an existing link, use the --force flag", - exits=1, - ) - elif link_path.is_symlink(): # does a symlink exist? - # NB: It's important to check for is_symlink here and not for exists, - # because invalid/outdated symlinks would return False otherwise. - link_path.unlink() - elif link_path.exists(): # does it exist otherwise? - # NB: Check this last because valid symlinks also "exist". - msg.fail( - "Can't overwrite symlink '{}'".format(link_name), - "This can happen if your data directory contains a directory or " - "file of the same name.", - exits=1, - ) - details = "%s --> %s" % (path2str(model_path), path2str(link_path)) - try: - symlink_to(link_path, model_path) - except: # noqa: E722 - # This is quite dirty, but just making sure other errors are caught. - msg.fail( - "Couldn't link model to '{}'".format(link_name), - "Creating a symlink in spacy/data failed. Make sure you have the " - "required permissions and try re-running the command as admin, or " - "use a virtualenv. You can still import the model as a module and " - "call its load() method, or create the symlink manually.", - ) - msg.text(details) - raise - msg.good("Linking successful", details) - msg.text("You can now load the model via spacy.load('{}')".format(link_name)) diff --git a/spacy/cli/package.py b/spacy/cli/package.py index 8ed92259c..8e27e44d0 100644 --- a/spacy/cli/package.py +++ b/spacy/cli/package.py @@ -1,25 +1,21 @@ -# coding: utf8 -from __future__ import unicode_literals - -import plac import shutil from pathlib import Path from wasabi import msg, get_raw_input import srsly -from ..compat import path2str from .. import util from .. import about -@plac.annotations( - input_dir=("Directory with model data", "positional", None, str), - output_dir=("Output parent directory", "positional", None, str), - meta_path=("Path to meta.json", "option", "m", str), - create_meta=("Create meta.json, even if one exists", "flag", "c", bool), - force=("Force overwriting existing model in output directory", "flag", "f", bool), -) -def package(input_dir, output_dir, meta_path=None, create_meta=False, force=False): +def package( + # fmt: off + input_dir: ("Directory with model data", "positional", None, str), + output_dir: ("Output parent directory", "positional", None, str), + meta_path: ("Path to meta.json", "option", "m", str) = None, + create_meta: ("Create meta.json, even if one exists", "flag", "c", bool) = False, + force: ("Force overwriting existing model in output directory", "flag", "f", bool) = False, + # fmt: on +): """ Generate Python package for model data, including meta and required installation files. 
A new directory will be created in the specified @@ -47,7 +43,7 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False, force=Fals for key in ("lang", "name", "version"): if key not in meta or meta[key] == "": msg.fail( - "No '{}' setting found in meta.json".format(key), + f"No '{key}' setting found in meta.json", "This setting is required to build your package.", exits=1, ) @@ -58,22 +54,21 @@ def package(input_dir, output_dir, meta_path=None, create_meta=False, force=Fals if package_path.exists(): if force: - shutil.rmtree(path2str(package_path)) + shutil.rmtree(str(package_path)) else: msg.fail( "Package directory already exists", "Please delete the directory and try again, or use the " - "`--force` flag to overwrite existing " - "directories.".format(path=path2str(package_path)), + "`--force` flag to overwrite existing directories.", exits=1, ) Path.mkdir(package_path, parents=True) - shutil.copytree(path2str(input_path), path2str(package_path / model_name_v)) + shutil.copytree(str(input_path), str(package_path / model_name_v)) create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2)) create_file(main_path / "setup.py", TEMPLATE_SETUP) create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST) create_file(package_path / "__init__.py", TEMPLATE_INIT) - msg.good("Successfully created package '{}'".format(model_name_v), main_path) + msg.good(f"Successfully created package '{model_name_v}'", main_path) msg.text("To build the package, run `python setup.py sdist` in this directory.") @@ -88,7 +83,7 @@ def generate_meta(model_path, existing_meta, msg): ("lang", "Model language", meta.get("lang", "en")), ("name", "Model name", meta.get("name", "model")), ("version", "Model version", meta.get("version", "0.0.0")), - ("spacy_version", "Required spaCy version", ">=%s,<3.0.0" % about.__version__), + ("spacy_version", "Required spaCy version", f">={about.__version__},<3.0.0"), ("description", "Model description", meta.get("description", False)), ("author", "Author", meta.get("author", False)), ("email", "Author email", meta.get("email", False)), @@ -118,9 +113,6 @@ def generate_meta(model_path, existing_meta, msg): TEMPLATE_SETUP = """ #!/usr/bin/env python -# coding: utf8 -from __future__ import unicode_literals - import io import json from os import path, walk @@ -190,9 +182,6 @@ include meta.json TEMPLATE_INIT = """ -# coding: utf8 -from __future__ import unicode_literals - from pathlib import Path from spacy.util import load_model_from_init_py, get_model_meta diff --git a/spacy/cli/pretrain.py b/spacy/cli/pretrain.py index aaec1ea75..b2e3229ee 100644 --- a/spacy/cli/pretrain.py +++ b/spacy/cli/pretrain.py @@ -1,107 +1,50 @@ -# coding: utf8 -from __future__ import print_function, unicode_literals - -import plac import random import numpy import time import re from collections import Counter from pathlib import Path -from thinc.v2v import Affine, Maxout -from thinc.misc import LayerNorm as LN -from thinc.neural.util import prefer_gpu +from thinc.api import Linear, Maxout, chain, list2array, prefer_gpu +from thinc.api import CosineDistance, L2Distance from wasabi import msg import srsly +from ..gold import Example from ..errors import Errors +from ..ml.models.multi_task import build_masked_language_model from ..tokens import Doc from ..attrs import ID, HEAD -from .._ml import Tok2Vec, flatten, chain, create_default_optimizer -from .._ml import masked_language_model, get_cossim_loss +from ..ml.models.tok2vec import build_Tok2Vec_model from .. 
import util +from ..util import create_default_optimizer from .train import _load_pretrained_tok2vec -@plac.annotations( - texts_loc=( - "Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the " - "key 'tokens'", - "positional", - None, - str, - ), - vectors_model=("Name or path to spaCy model with vectors to learn from"), - output_dir=("Directory to write models to on each epoch", "positional", None, str), - width=("Width of CNN layers", "option", "cw", int), - conv_depth=("Depth of CNN layers", "option", "cd", int), - cnn_window=("Window size for CNN layers", "option", "cW", int), - cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int), - use_chars=("Whether to use character-based embedding", "flag", "chr", bool), - sa_depth=("Depth of self-attention layers", "option", "sa", int), - bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int), - embed_rows=("Number of embedding rows", "option", "er", int), - loss_func=( - "Loss function to use for the objective. Either 'L2' or 'cosine'", - "option", - "L", - str, - ), - use_vectors=("Whether to use the static vectors as input features", "flag", "uv"), - dropout=("Dropout rate", "option", "d", float), - batch_size=("Number of words per training batch", "option", "bs", int), - max_length=( - "Max words per example. Longer examples are discarded", - "option", - "xw", - int, - ), - min_length=( - "Min words per example. Shorter examples are discarded", - "option", - "nw", - int, - ), - seed=("Seed for random number generators", "option", "s", int), - n_iter=("Number of iterations to pretrain", "option", "i", int), - n_save_every=("Save model every X batches.", "option", "se", int), - init_tok2vec=( - "Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.", - "option", - "t2v", - Path, - ), - epoch_start=( - "The epoch to start counting at. Only relevant when using '--init-tok2vec' and the given weight file has been " - "renamed. Prevents unintended overwriting of existing weight files.", - "option", - "es", - int, - ), -) def pretrain( - texts_loc, - vectors_model, - output_dir, - width=96, - conv_depth=4, - bilstm_depth=0, - cnn_pieces=3, - sa_depth=0, - use_chars=False, - cnn_window=1, - embed_rows=2000, - loss_func="cosine", - use_vectors=False, - dropout=0.2, - n_iter=1000, - batch_size=3000, - max_length=500, - min_length=5, - seed=0, - n_save_every=None, - init_tok2vec=None, - epoch_start=None, + # fmt: off + texts_loc: ("Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the key 'tokens'", "positional", None, str), + vectors_model: ("Name or path to spaCy model with vectors to learn from", "positional", None, str), + output_dir: ("Directory to write models to on each epoch", "positional", None, str), + width: ("Width of CNN layers", "option", "cw", int) = 96, + conv_depth: ("Depth of CNN layers", "option", "cd", int) = 4, + bilstm_depth: ("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int) = 0, + cnn_pieces: ("Maxout size for CNN layers. 1 for Mish", "option", "cP", int) = 3, + sa_depth: ("Depth of self-attention layers", "option", "sa", int) = 0, + use_chars: ("Whether to use character-based embedding", "flag", "chr", bool) = False, + cnn_window: ("Window size for CNN layers", "option", "cW", int) = 1, + embed_rows: ("Number of embedding rows", "option", "er", int) = 2000, + loss_func: ("Loss function to use for the objective. 
Either 'L2' or 'cosine'", "option", "L", str) = "cosine", + use_vectors: ("Whether to use the static vectors as input features", "flag", "uv") = False, + dropout: ("Dropout rate", "option", "d", float) = 0.2, + n_iter: ("Number of iterations to pretrain", "option", "i", int) = 1000, + batch_size: ("Number of words per training batch", "option", "bs", int) = 3000, + max_length: ("Max words per example. Longer examples are discarded", "option", "xw", int) = 500, + min_length: ("Min words per example. Shorter examples are discarded", "option", "nw", int) = 5, + seed: ("Seed for random number generators", "option", "s", int) = 0, + n_save_every: ("Save model every X batches.", "option", "se", int) = None, + init_tok2vec: ("Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.", "option", "t2v", Path) = None, + epoch_start: ("The epoch to start counting at. Only relevant when using '--init-tok2vec' and the given weight file has been renamed. Prevents unintended overwriting of existing weight files.", "option", "es", int) = None, + # fmt: on ): """ Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components, @@ -140,7 +83,7 @@ def pretrain( ) if not output_dir.exists(): output_dir.mkdir() - msg.good("Created output directory: {}".format(output_dir)) + msg.good(f"Created output directory: {output_dir}") srsly.write_json(output_dir / "config.json", config) msg.good("Saved settings to config.json") @@ -159,26 +102,31 @@ def pretrain( msg.text("Reading input text from stdin...") texts = srsly.read_jsonl("-") - with msg.loading("Loading model '{}'...".format(vectors_model)): + with msg.loading(f"Loading model '{vectors_model}'..."): nlp = util.load_model(vectors_model) - msg.good("Loaded model '{}'".format(vectors_model)) - pretrained_vectors = None if not use_vectors else nlp.vocab.vectors.name + msg.good(f"Loaded model '{vectors_model}'") + pretrained_vectors = None if not use_vectors else nlp.vocab.vectors model = create_pretraining_model( nlp, - Tok2Vec( + # TODO: replace with config + build_Tok2Vec_model( width, embed_rows, conv_depth=conv_depth, pretrained_vectors=pretrained_vectors, bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental. subword_features=not use_chars, # Set to False for Chinese etc - cnn_maxout_pieces=cnn_pieces, # If set to 1, use Mish activation. + maxout_pieces=cnn_pieces, # If set to 1, use Mish activation. + window_size=1, + char_embed=False, + nM=64, + nC=8, ), ) # Load in pretrained weights if init_tok2vec is not None: components = _load_pretrained_tok2vec(nlp, init_tok2vec) - msg.text("Loaded pretrained tok2vec for: {}".format(components)) + msg.text(f"Loaded pretrained tok2vec for: {components}") # Parse the epoch number from the given weight file model_name = re.search(r"model\d+\.bin", str(init_tok2vec)) if model_name: @@ -187,33 +135,29 @@ def pretrain( else: if not epoch_start: msg.fail( - "You have to use the '--epoch-start' argument when using a renamed weight file for " - "'--init-tok2vec'", + "You have to use the --epoch-start argument when using a renamed weight file for --init-tok2vec", exits=True, ) elif epoch_start < 0: msg.fail( - "The argument '--epoch-start' has to be greater or equal to 0. '%d' is invalid" - % epoch_start, + f"The argument --epoch-start has to be greater or equal to 0. 
{epoch_start} is invalid", exits=True, ) else: # Without '--init-tok2vec' the '--epoch-start' argument is ignored epoch_start = 0 - optimizer = create_default_optimizer(model.ops) + optimizer = create_default_optimizer() tracker = ProgressTracker(frequency=10000) - msg.divider("Pre-training tok2vec layer - starting at epoch %d" % epoch_start) + msg.divider(f"Pre-training tok2vec layer - starting at epoch {epoch_start}") row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")} msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings) def _save_model(epoch, is_temp=False): is_temp_str = ".temp" if is_temp else "" with model.use_params(optimizer.averages): - with (output_dir / ("model%d%s.bin" % (epoch, is_temp_str))).open( - "wb" - ) as file_: - file_.write(model.tok2vec.to_bytes()) + with (output_dir / f"model{epoch}{is_temp_str}.bin").open("wb") as file_: + file_.write(model.get_ref("tok2vec").to_bytes()) log = { "nr_word": tracker.nr_word, "loss": tracker.loss, @@ -226,7 +170,9 @@ def pretrain( skip_counter = 0 for epoch in range(epoch_start, n_iter + epoch_start): for batch_id, batch in enumerate( - util.minibatch_by_words(((text, None) for text in texts), size=batch_size) + util.minibatch_by_words( + (Example(doc=text) for text in texts), size=batch_size + ) ): docs, count = make_docs( nlp, @@ -251,7 +197,7 @@ def pretrain( # Reshuffle the texts if texts were loaded from a file random.shuffle(texts) if skip_counter > 0: - msg.warn("Skipped {count} empty values".format(count=str(skip_counter))) + msg.warn(f"Skipped {skip_counter} empty values") msg.good("Successfully finished pretrain") @@ -316,13 +262,14 @@ def get_vectors_loss(ops, docs, prediction, objective="L2"): # and look them up all at once. This prevents data copying. ids = ops.flatten([doc.to_array(ID).ravel() for doc in docs]) target = docs[0].vocab.vectors.data[ids] + # TODO: this code originally didn't normalize, but shouldn't normalize=True ? if objective == "L2": - d_target = prediction - target - loss = (d_target ** 2).sum() + distance = L2Distance(normalize=False) elif objective == "cosine": - loss, d_target = get_cossim_loss(prediction, target) + distance = CosineDistance(normalize=False) else: raise ValueError(Errors.E142.format(loss_func=objective)) + d_target, loss = distance(prediction, target) return loss, d_target @@ -334,18 +281,18 @@ def create_pretraining_model(nlp, tok2vec): """ output_size = nlp.vocab.vectors.data.shape[1] output_layer = chain( - LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0) + Maxout(300, pieces=3, normalize=True, dropout=0.0), Linear(output_size) ) # This is annoying, but the parser etc have the flatten step after # the tok2vec. To load the weights in cleanly, we need to match # the shape of the models' components exactly. So what we cann # "tok2vec" has to be the same set of processes as what the components do. 
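The hunk above ports create_pretraining_model() to the Thinc 8 API: flatten becomes list2array(), the ad-hoc model.tok2vec attribute becomes a named ref, and begin_training() becomes initialize(). A minimal sketch of that pattern, assuming Thinc 8 is installed; the Relu stand-in and the array shapes are illustrative only, not spaCy's real tok2vec:

from thinc.api import Linear, Maxout, Relu, chain
import numpy

tok2vec = Relu(nO=96, nI=4)  # stand-in for the real tok2vec layer
output_layer = chain(Maxout(nO=300, nP=3, normalize=True, dropout=0.0), Linear(nO=16))
model = chain(tok2vec, output_layer)
# Named references replace plain Python attributes on the old Model class.
model.set_ref("tok2vec", tok2vec)
model.set_ref("output_layer", output_layer)
# initialize() infers missing shapes from sample input, like begin_training() did.
model.initialize(X=numpy.zeros((10, 4), dtype="f"))
weights = model.get_ref("tok2vec").to_bytes()  # mirrors what _save_model() serializes
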
- tok2vec = chain(tok2vec, flatten) + tok2vec = chain(tok2vec, list2array()) model = chain(tok2vec, output_layer) - model = masked_language_model(nlp.vocab, model) - model.tok2vec = tok2vec - model.output_layer = output_layer - model.begin_training([nlp.make_doc("Give it a doc to infer shapes")]) + model = build_masked_language_model(nlp.vocab, model) + model.set_ref("tok2vec", tok2vec) + model.set_ref("output_layer", output_layer) + model.initialize(X=[nlp.make_doc("Give it a doc to infer shapes")]) return model diff --git a/spacy/cli/profile.py b/spacy/cli/profile.py index 4ee72fc23..5b7a02212 100644 --- a/spacy/cli/profile.py +++ b/spacy/cli/profile.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals, division, print_function - -import plac import tqdm from pathlib import Path import srsly @@ -9,18 +5,19 @@ import cProfile import pstats import sys import itertools -import thinc.extra.datasets +import ml_datasets from wasabi import msg from ..util import load_model -@plac.annotations( - model=("Model to load", "positional", None, str), - inputs=("Location of input file. '-' for stdin.", "positional", None, str), - n_texts=("Maximum number of texts to use if available", "option", "n", int), -) -def profile(model, inputs=None, n_texts=10000): +def profile( + # fmt: off + model: ("Model to load", "positional", None, str), + inputs: ("Location of input file. '-' for stdin.", "positional", None, str) = None, + n_texts: ("Maximum number of texts to use if available", "option", "n", int) = 10000, + # fmt: on +): """ Profile a spaCy pipeline, to find out which functions take the most time. Input should be formatted as one JSON object per line with a key "text". @@ -32,13 +29,13 @@ def profile(model, inputs=None, n_texts=10000): if inputs is None: n_inputs = 25000 with msg.loading("Loading IMDB dataset via Thinc..."): - imdb_train, _ = thinc.extra.datasets.imdb() + imdb_train, _ = ml_datasets.imdb() inputs, _ = zip(*imdb_train) - msg.info("Loaded IMDB dataset and using {} examples".format(n_inputs)) + msg.info(f"Loaded IMDB dataset and using {n_inputs} examples") inputs = inputs[:n_inputs] - with msg.loading("Loading model '{}'...".format(model)): + with msg.loading(f"Loading model '{model}'..."): nlp = load_model(model) - msg.good("Loaded model '{}'".format(model)) + msg.good(f"Loaded model '{model}'") texts = list(itertools.islice(inputs, n_texts)) cProfile.runctx("parse_texts(nlp, texts)", globals(), locals(), "Profile.prof") s = pstats.Stats("Profile.prof") @@ -60,7 +57,7 @@ def _read_inputs(loc, msg): input_path = Path(loc) if not input_path.exists() or not input_path.is_file(): msg.fail("Not a valid input data file", loc, exits=1) - msg.info("Using data from {}".format(input_path.parts[-1])) + msg.info(f"Using data from {input_path.parts[-1]}") file_ = input_path.open() for line in file_: data = srsly.json_loads(line) diff --git a/spacy/cli/train.py b/spacy/cli/train.py index 6ce095c15..da3d1d5a6 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -1,11 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals, division, print_function - -import plac import os import tqdm from pathlib import Path -from thinc.neural._classes.model import Model +from thinc.api import use_ops from timeit import default_timer as timer import shutil import srsly @@ -13,94 +9,47 @@ from wasabi import msg import contextlib import random -from .._ml import create_default_optimizer +from ..util import create_default_optimizer from ..util import use_gpu as set_gpu from ..gold 
import GoldCorpus -from ..compat import path2str from ..lookups import Lookups from .. import util from .. import about -@plac.annotations( - # fmt: off - lang=("Model language", "positional", None, str), - output_path=("Output directory to store model in", "positional", None, Path), - train_path=("Location of JSON-formatted training data", "positional", None, Path), - dev_path=("Location of JSON-formatted development data", "positional", None, Path), - raw_text=("Path to jsonl file with unlabelled text documents.", "option", "rt", Path), - base_model=("Name of model to update (optional)", "option", "b", str), - pipeline=("Comma-separated names of pipeline components", "option", "p", str), - replace_components=("Replace components from base model", "flag", "R", bool), - vectors=("Model to load vectors from", "option", "v", str), - width=("Width of CNN layers of Tok2Vec component", "option", "cw", int), - conv_depth=("Depth of CNN layers of Tok2Vec component", "option", "cd", int), - cnn_window=("Window size for CNN layers of Tok2Vec component", "option", "cW", int), - cnn_pieces=("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int), - use_chars=("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool), - bilstm_depth=("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int), - embed_rows=("Number of embedding rows of Tok2Vec component", "option", "er", int), - n_iter=("Number of iterations", "option", "n", int), - n_early_stopping=("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int), - n_examples=("Number of examples", "option", "ns", int), - use_gpu=("Use GPU", "option", "g", int), - version=("Model version", "option", "V", str), - meta_path=("Optional path to meta.json to use as base.", "option", "m", Path), - init_tok2vec=("Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.", "option", "t2v", Path), - parser_multitasks=("Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'", "option", "pt", str), - entity_multitasks=("Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'", "option", "et", str), - noise_level=("Amount of corruption for data augmentation", "option", "nl", float), - orth_variant_level=("Amount of orthography variation for data augmentation", "option", "ovl", float), - eval_beam_widths=("Beam widths to evaluate, e.g. 
4,8", "option", "bw", str), - gold_preproc=("Use gold preprocessing", "flag", "G", bool), - learn_tokens=("Make parser learn gold-standard tokenization", "flag", "T", bool), - textcat_multilabel=("Textcat classes aren't mutually exclusive (multilabel)", "flag", "TML", bool), - textcat_arch=("Textcat model architecture", "option", "ta", str), - textcat_positive_label=("Textcat positive label for binary classes with two labels", "option", "tpl", str), - tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path), - omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool), - verbose=("Display more information for debug", "flag", "VV", bool), - debug=("Run data diagnostics before training", "flag", "D", bool), - # fmt: on -) def train( - lang, - output_path, - train_path, - dev_path, - raw_text=None, - base_model=None, - pipeline="tagger,parser,ner", - replace_components=False, - vectors=None, - width=96, - conv_depth=4, - cnn_window=1, - cnn_pieces=3, - use_chars=False, - bilstm_depth=0, - embed_rows=2000, - n_iter=30, - n_early_stopping=None, - n_examples=0, - use_gpu=-1, - version="0.0.0", - meta_path=None, - init_tok2vec=None, - parser_multitasks="", - entity_multitasks="", - noise_level=0.0, - orth_variant_level=0.0, - eval_beam_widths="", - gold_preproc=False, - learn_tokens=False, - textcat_multilabel=False, - textcat_arch="bow", - textcat_positive_label=None, - tag_map_path=None, - omit_extra_lookups=False, - verbose=False, - debug=False, + # fmt: off + lang: ("Model language", "positional", None, str), + output_path: ("Output directory to store model in", "positional", None, Path), + train_path: ("Location of JSON-formatted training data", "positional", None, Path), + dev_path: ("Location of JSON-formatted development data", "positional", None, Path), + raw_text: ("Path to jsonl file with unlabelled text documents.", "option", "rt", Path) = None, + base_model: ("Name of model to update (optional)", "option", "b", str) = None, + pipeline: ("Comma-separated names of pipeline components", "option", "p", str) = "tagger,parser,ner", + vectors: ("Model to load vectors from", "option", "v", str) = None, + replace_components: ("Replace components from base model", "flag", "R", bool) = False, + n_iter: ("Number of iterations", "option", "n", int) = 30, + n_early_stopping: ("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int) = None, + n_examples: ("Number of examples", "option", "ns", int) = 0, + use_gpu: ("Use GPU", "option", "g", int) = -1, + version: ("Model version", "option", "V", str) = "0.0.0", + meta_path: ("Optional path to meta.json to use as base.", "option", "m", Path) = None, + init_tok2vec: ("Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.", "option", "t2v", Path) = None, + parser_multitasks: ("Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'", "option", "pt", str) = "", + entity_multitasks: ("Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'", "option", "et", str) = "", + noise_level: ("Amount of corruption for data augmentation", "option", "nl", float) = 0.0, + orth_variant_level: ("Amount of orthography variation for data augmentation", "option", "ovl", float) = 0.0, + eval_beam_widths: ("Beam widths to evaluate, e.g. 
4,8", "option", "bw", str) = "", + gold_preproc: ("Use gold preprocessing", "flag", "G", bool) = False, + learn_tokens: ("Make parser learn gold-standard tokenization", "flag", "T", bool) = False, + textcat_multilabel: ("Textcat classes aren't mutually exclusive (multilabel)", "flag", "TML", bool) = False, + textcat_arch: ("Textcat model architecture", "option", "ta", str) = "bow", + textcat_positive_label: ("Textcat positive label for binary classes with two labels", "option", "tpl", str) = None, + tag_map_path: ("Location of JSON-formatted tag map", "option", "tm", Path) = None, + omit_extra_lookups: ("Don't include extra lookups in model", "flag", "OEL", bool) = False, + verbose: ("Display more information for debug", "flag", "VV", bool) = False, + debug: ("Run data diagnostics before training", "flag", "D", bool) = False, + # fmt: on ): """ Train or update a spaCy model. Requires data to be formatted in spaCy's @@ -134,7 +83,7 @@ def train( ) if not output_path.exists(): output_path.mkdir() - msg.good("Created output directory: {}".format(output_path)) + msg.good(f"Created output directory: {output_path}") tag_map = {} if tag_map_path is not None: @@ -163,50 +112,78 @@ def train( eval_beam_widths.sort() has_beam_widths = eval_beam_widths != [1] + default_dir = Path(__file__).parent.parent / "ml" / "models" / "defaults" + # Set up the base model and pipeline. If a base model is specified, load # the model and make sure the pipeline matches the pipeline setting. If # training starts from a blank model, intitalize the language class. pipeline = [p.strip() for p in pipeline.split(",")] + msg.text(f"Training pipeline: {pipeline}") disabled_pipes = None pipes_added = False - msg.text("Training pipeline: {}".format(pipeline)) if use_gpu >= 0: activated_gpu = None try: activated_gpu = set_gpu(use_gpu) except Exception as e: - msg.warn("Exception: {}".format(e)) + msg.warn(f"Exception: {e}") if activated_gpu is not None: - msg.text("Using GPU: {}".format(use_gpu)) + msg.text(f"Using GPU: {use_gpu}") else: - msg.warn("Unable to activate GPU: {}".format(use_gpu)) + msg.warn(f"Unable to activate GPU: {use_gpu}") msg.text("Using CPU only") use_gpu = -1 if base_model: - msg.text("Starting with base model '{}'".format(base_model)) + msg.text(f"Starting with base model '{base_model}'") nlp = util.load_model(base_model) if nlp.lang != lang: msg.fail( - "Model language ('{}') doesn't match language specified as " - "`lang` argument ('{}') ".format(nlp.lang, lang), + f"Model language ('{nlp.lang}') doesn't match language " + f"specified as `lang` argument ('{lang}') ", exits=1, ) + if vectors: + msg.text(f"Loading vectors from model '{vectors}'") + _load_vectors(nlp, vectors) + + nlp.select_pipes(disable=[p for p in nlp.pipe_names if p not in pipeline]) for pipe in pipeline: - pipe_cfg = {} + # first, create the model. 
+ # Bit of a hack after the refactor to get the vectors into a default config + # use train-from-config instead :-) if pipe == "parser": - pipe_cfg = {"learn_tokens": learn_tokens} + config_loc = default_dir / "parser_defaults.cfg" + elif pipe == "tagger": + config_loc = default_dir / "tagger_defaults.cfg" + elif pipe == "ner": + config_loc = default_dir / "ner_defaults.cfg" elif pipe == "textcat": - pipe_cfg = { - "exclusive_classes": not textcat_multilabel, - "architecture": textcat_arch, - "positive_label": textcat_positive_label, + config_loc = default_dir / "textcat_defaults.cfg" + elif pipe == "senter": + config_loc = default_dir / "senter_defaults.cfg" + else: + raise ValueError(f"Component {pipe} currently not supported.") + pipe_cfg = util.load_config(config_loc, create_objects=False) + if vectors: + pretrained_config = { + "@architectures": "spacy.VocabVectors.v1", + "name": vectors, } + pipe_cfg["model"]["tok2vec"]["pretrained_vectors"] = pretrained_config + + if pipe == "parser": + pipe_cfg["learn_tokens"] = learn_tokens + elif pipe == "textcat": + pipe_cfg["exclusive_classes"] = not textcat_multilabel + pipe_cfg["architecture"] = textcat_arch + pipe_cfg["positive_label"] = textcat_positive_label + if pipe not in nlp.pipe_names: - msg.text("Adding component to base model '{}'".format(pipe)) + msg.text(f"Adding component to base model '{pipe}'") nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg)) pipes_added = True elif replace_components: - msg.text("Replacing component from base model '{}'".format(pipe)) + msg.text(f"Replacing component from base model '{pipe}'") nlp.replace_pipe(pipe, nlp.create_pipe(pipe, config=pipe_cfg)) pipes_added = True else: @@ -219,33 +196,59 @@ def train( } if base_cfg != pipe_cfg: msg.fail( - "The base textcat model configuration does" - "not match the provided training options. " - "Existing cfg: {}, provided cfg: {}".format( base_cfg, pipe_cfg ), + f"The base textcat model configuration does " + f"not match the provided training options. " + f"Existing cfg: {base_cfg}, provided cfg: {pipe_cfg}", exits=1, ) - msg.text("Extending component from base model '{}'".format(pipe)) - disabled_pipes = nlp.disable_pipes( - [p for p in nlp.pipe_names if p not in pipeline] + msg.text(f"Extending component from base model '{pipe}'") + disabled_pipes = nlp.select_pipes( + disable=[p for p in nlp.pipe_names if p not in pipeline] ) else: - msg.text("Starting with blank model '{}'".format(lang)) + msg.text(f"Starting with blank model '{lang}'") lang_cls = util.get_lang_class(lang) nlp = lang_cls() + + if vectors: + msg.text(f"Loading vectors from model '{vectors}'") + _load_vectors(nlp, vectors) + for pipe in pipeline: + # first, create the model.
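The *_defaults.cfg files referenced above are not part of this diff, so the following is only a rough sketch (with hypothetical keys and values) of what the vectors override does to the loaded pipe_cfg dict: the tok2vec sub-model's pretrained_vectors slot is replaced by a registry reference, presumably resolved by spacy.VocabVectors.v1 to the vocab's vectors.

# util.load_config(config_loc, create_objects=False) in the hunk above returns a
# nested dict that gets indexed like this; the exact default keys may differ.
pipe_cfg = {
    "learn_tokens": False,
    "model": {
        "@architectures": "spacy.TransitionBasedParser.v1",
        "tok2vec": {"@architectures": "spacy.HashEmbedCNN.v1", "pretrained_vectors": None},
    },
}
vectors = "en_vectors_web_lg"  # whatever was passed as the --vectors argument
if vectors:
    # Point the tok2vec sub-model at the vocab's vectors via a registry reference.
    pipe_cfg["model"]["tok2vec"]["pretrained_vectors"] = {
        "@architectures": "spacy.VocabVectors.v1",
        "name": vectors,
    }
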
+ # Bit of a hack after the refactor to get the vectors into a default config + # use train-from-config instead :-) if pipe == "parser": - pipe_cfg = {"learn_tokens": learn_tokens} + config_loc = default_dir / "parser_defaults.cfg" + elif pipe == "tagger": + config_loc = default_dir / "tagger_defaults.cfg" + elif pipe == "morphologizer": + config_loc = default_dir / "morphologizer_defaults.cfg" + elif pipe == "ner": + config_loc = default_dir / "ner_defaults.cfg" elif pipe == "textcat": - pipe_cfg = { - "exclusive_classes": not textcat_multilabel, - "architecture": textcat_arch, - "positive_label": textcat_positive_label, - } + config_loc = default_dir / "textcat_defaults.cfg" + elif pipe == "senter": + config_loc = default_dir / "senter_defaults.cfg" else: - pipe_cfg = {} - nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg)) + raise ValueError(f"Component {pipe} currently not supported.") + pipe_cfg = util.load_config(config_loc, create_objects=False) + if vectors: + pretrained_config = { + "@architectures": "spacy.VocabVectors.v1", + "name": vectors, + } + pipe_cfg["model"]["tok2vec"]["pretrained_vectors"] = pretrained_config + + if pipe == "parser": + pipe_cfg["learn_tokens"] = learn_tokens + elif pipe == "textcat": + pipe_cfg["exclusive_classes"] = not textcat_multilabel + pipe_cfg["architecture"] = textcat_arch + pipe_cfg["positive_label"] = textcat_positive_label + + pipe = nlp.create_pipe(pipe, config=pipe_cfg) + nlp.add_pipe(pipe) # Update tag map with provided mapping nlp.vocab.morphology.tag_map.update(tag_map) @@ -268,57 +271,49 @@ def train( if multitasks: if pipe_name not in pipeline: msg.fail( - "Can't use multitask objective without '{}' in the " - "pipeline".format(pipe_name) + f"Can't use multitask objective without '{pipe_name}' in " + f"the pipeline" ) pipe = nlp.get_pipe(pipe_name) for objective in multitasks.split(","): pipe.add_multitask_objective(objective) # Prepare training corpus - msg.text("Counting training words (limit={})".format(n_examples)) + msg.text(f"Counting training words (limit={n_examples})") corpus = GoldCorpus(train_path, dev_path, limit=n_examples) n_train_words = corpus.count_train() if base_model and not pipes_added: # Start with an existing model, use default optimizer - optimizer = create_default_optimizer(Model.ops) + optimizer = create_default_optimizer() else: # Start with a blank model, call begin_training cfg = {"device": use_gpu} - cfg["conv_depth"] = conv_depth - cfg["token_vector_width"] = width - cfg["bilstm_depth"] = bilstm_depth - cfg["cnn_maxout_pieces"] = cnn_pieces - cfg["embed_size"] = embed_rows - cfg["conv_window"] = cnn_window - cfg["subword_features"] = not use_chars - optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg) - + optimizer = nlp.begin_training(lambda: corpus.train_examples, **cfg) nlp._optimizer = None - # Load in pretrained weights + # Load in pretrained weights (TODO: this may be broken in the config rewrite) if init_tok2vec is not None: components = _load_pretrained_tok2vec(nlp, init_tok2vec) - msg.text("Loaded pretrained tok2vec for: {}".format(components)) + msg.text(f"Loaded pretrained tok2vec for: {components}") # Verify textcat config if "textcat" in pipeline: textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", []) if textcat_positive_label and textcat_positive_label not in textcat_labels: msg.fail( - "The textcat_positive_label (tpl) '{}' does not match any " - "label in the training data.".format(textcat_positive_label), + f"The textcat_positive_label (tpl) 
'{textcat_positive_label}' " + f"does not match any label in the training data.", exits=1, ) if textcat_positive_label and len(textcat_labels) != 2: msg.fail( - "A textcat_positive_label (tpl) '{}' was provided for training " - "data that does not appear to be a binary classification " - "problem with two labels.".format(textcat_positive_label), + f"A textcat_positive_label (tpl) '{textcat_positive_label}' was " + "provided for training data that does not appear to be a " + "binary classification problem with two labels.", exits=1, ) - train_docs = corpus.train_docs( + train_data = corpus.train_data( nlp, noise_level=noise_level, gold_preproc=gold_preproc, @@ -328,9 +323,9 @@ def train( train_labels = set() if textcat_multilabel: multilabel_found = False - for text, gold in train_docs: - train_labels.update(gold.cats.keys()) - if list(gold.cats.values()).count(1.0) != 1: + for ex in train_data: + train_labels.update(ex.gold.cats.keys()) + if list(ex.gold.cats.values()).count(1.0) != 1: multilabel_found = True if not multilabel_found and not base_model: msg.warn( @@ -340,9 +335,9 @@ def train( "mutually-exclusive classes." ) if not textcat_multilabel: - for text, gold in train_docs: - train_labels.update(gold.cats.keys()) - if list(gold.cats.values()).count(1.0) != 1 and not base_model: + for ex in train_data: + train_labels.update(ex.gold.cats.keys()) + if list(ex.gold.cats.values()).count(1.0) != 1 and not base_model: msg.warn( "Some textcat training instances do not have exactly " "one positive label. Modifying training options to " @@ -354,20 +349,20 @@ def train( break if base_model and set(textcat_labels) != train_labels: msg.fail( - "Cannot extend textcat model using data with different " - "labels. Base model labels: {}, training data labels: " - "{}.".format(textcat_labels, list(train_labels)), + f"Cannot extend textcat model using data with different " + f"labels. Base model labels: {textcat_labels}, training data " + f"labels: {list(train_labels)}", exits=1, ) if textcat_multilabel: msg.text( - "Textcat evaluation score: ROC AUC score macro-averaged across " - "the labels '{}'".format(", ".join(textcat_labels)) + f"Textcat evaluation score: ROC AUC score macro-averaged across " + f"the labels '{', '.join(textcat_labels)}'" ) elif textcat_positive_label and len(textcat_labels) == 2: msg.text( - "Textcat evaluation score: F1-score for the " - "label '{}'".format(textcat_positive_label) + f"Textcat evaluation score: F1-score for the " + f"label '{textcat_positive_label}'" ) elif len(textcat_labels) > 1: if len(textcat_labels) == 2: @@ -377,8 +372,8 @@ def train( "an evaluation on the positive class."
) msg.text( - "Textcat evaluation score: F1-score macro-averaged across " - "the labels '{}'".format(", ".join(textcat_labels)) + f"Textcat evaluation score: F1-score macro-averaged across " + f"the labels '{', '.join(textcat_labels)}'" ) else: msg.fail( @@ -398,7 +393,7 @@ def train( iter_since_best = 0 best_score = 0.0 for i in range(n_iter): - train_docs = corpus.train_docs( + train_data = corpus.train_dataset( nlp, noise_level=noise_level, orth_variant_level=orth_variant_level, @@ -414,14 +409,12 @@ def train( words_seen = 0 with tqdm.tqdm(total=n_train_words, leave=False) as pbar: losses = {} - for batch in util.minibatch_by_words(train_docs, size=batch_sizes): + for batch in util.minibatch_by_words(train_data, size=batch_sizes): if not batch: continue - docs, golds = zip(*batch) try: nlp.update( - docs, - golds, + batch, sgd=optimizer, drop=next(dropout_rates), losses=losses, @@ -430,66 +423,64 @@ def train( err = "Error during training" if init_tok2vec: err += " Did you provide the same parameters during 'train' as during 'pretrain'?" - msg.fail(err, "Original error message: {}".format(e), exits=1) + msg.fail(err, f"Original error message: {e}", exits=1) if raw_text: # If raw text is available, perform 'rehearsal' updates, # which use unlabelled data to reduce overfitting. raw_batch = list(next(raw_batches)) nlp.rehearse(raw_batch, sgd=optimizer, losses=losses) + docs = [ex.doc for ex in batch] if not int(os.environ.get("LOG_FRIENDLY", 0)): pbar.update(sum(len(doc) for doc in docs)) words_seen += sum(len(doc) for doc in docs) with nlp.use_params(optimizer.averages): util.set_env_log(False) - epoch_model_path = output_path / ("model%d" % i) + epoch_model_path = output_path / f"model{i}" nlp.to_disk(epoch_model_path) nlp_loaded = util.load_model_from_path(epoch_model_path) for beam_width in eval_beam_widths: for name, component in nlp_loaded.pipeline: if hasattr(component, "cfg"): component.cfg["beam_width"] = beam_width - dev_docs = list( - corpus.dev_docs( + dev_dataset = list( + corpus.dev_dataset( nlp_loaded, gold_preproc=gold_preproc, ignore_misaligned=True, ) ) - nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs) + nwords = sum(len(ex.doc) for ex in dev_dataset) start_time = timer() - scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose) + scorer = nlp_loaded.evaluate(dev_dataset, verbose=verbose) end_time = timer() if use_gpu < 0: gpu_wps = None cpu_wps = nwords / (end_time - start_time) else: gpu_wps = nwords / (end_time - start_time) - # Only evaluate on CPU in the first iteration (for - # timing) if GPU is enabled - if i == 0: - with Model.use_device("cpu"): - nlp_loaded = util.load_model_from_path(epoch_model_path) - for name, component in nlp_loaded.pipeline: - if hasattr(component, "cfg"): - component.cfg["beam_width"] = beam_width - dev_docs = list( - corpus.dev_docs( - nlp_loaded, - gold_preproc=gold_preproc, - ignore_misaligned=True, - ) + with use_ops("numpy"): + nlp_loaded = util.load_model_from_path(epoch_model_path) + for name, component in nlp_loaded.pipeline: + if hasattr(component, "cfg"): + component.cfg["beam_width"] = beam_width + dev_dataset = list( + corpus.dev_dataset( + nlp_loaded, + gold_preproc=gold_preproc, + ignore_misaligned=True, ) - start_time = timer() - scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose) - end_time = timer() - cpu_wps = nwords / (end_time - start_time) - acc_loc = output_path / ("model%d" % i) / "accuracy.json" + ) + start_time = timer() + scorer = nlp_loaded.evaluate(dev_dataset, verbose=verbose) + end_time = timer() 
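The block above swaps the old Model.use_device("cpu") trick for Thinc 8's use_ops("numpy") context manager, so the reloaded pipeline is timed on the CPU backend even when training ran on GPU. A small self-contained sketch of the idea, with a plain Linear layer standing in for the reloaded spaCy model (names and sizes are illustrative):

from timeit import default_timer as timer
from thinc.api import Linear, use_ops
import numpy

X = numpy.zeros((1000, 64), dtype="f")
with use_ops("numpy"):
    # Models created (or loaded) inside this block allocate through NumpyOps.
    cpu_model = Linear(nO=32, nI=64)
    cpu_model.initialize(X=X)
    start = timer()
    cpu_model.predict(X)
    cpu_wps = X.shape[0] / (timer() - start)  # analogous to nwords / elapsed time
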
+ cpu_wps = nwords / (end_time - start_time) + acc_loc = output_path / f"model{i}" / "accuracy.json" srsly.write_json(acc_loc, scorer.scores) # Update model meta.json meta["lang"] = nlp.lang meta["pipeline"] = nlp.pipe_names - meta["spacy_version"] = ">=%s" % about.__version__ + meta["spacy_version"] = f">={about.__version__}" if beam_width == 1: meta["speed"] = { "nwords": nwords, @@ -517,10 +508,10 @@ def train( "keys": nlp.vocab.vectors.n_keys, "name": nlp.vocab.vectors.name, } - meta.setdefault("name", "model%d" % i) + meta.setdefault("name", f"model{i}") meta.setdefault("version", version) meta["labels"] = nlp.meta["labels"] - meta_loc = output_path / ("model%d" % i) / "meta.json" + meta_loc = output_path / f"model{i}" / "meta.json" srsly.write_json(meta_loc, meta) util.set_env_log(verbose) @@ -538,8 +529,8 @@ def train( for cat, cat_score in textcats_per_cat.items(): if cat_score.get("roc_auc_score", 0) < 0: msg.warn( - "Textcat ROC AUC score is undefined due to " - "only one value in label '{}'.".format(cat) + f"Textcat ROC AUC score is undefined due to " + f"only one value in label '{cat}'." ) msg.row(progress, **row_settings) # Early stopping @@ -552,20 +543,14 @@ def train( best_score = current_score if iter_since_best >= n_early_stopping: msg.text( - "Early stopping, best iteration " - "is: {}".format(i - iter_since_best) + f"Early stopping, best iteration is: {i - iter_since_best}" ) msg.text( - "Best score = {}; Final iteration " - "score = {}".format(best_score, current_score) + f"Best score = {best_score}; Final iteration score = {current_score}" ) break except Exception as e: - msg.warn( - "Aborting and saving the final best model. " - "Encountered exception: {}".format(e), - exits=1, - ) + msg.warn(f"Aborting and saving final best model. 
Encountered exception: {e}") finally: best_pipes = nlp.pipe_names if disabled_pipes: @@ -620,12 +605,16 @@ def _score_for_model(meta): acc = meta["accuracy"] if "tagger" in pipes: mean_acc.append(acc["tags_acc"]) + if "morphologizer" in pipes: + mean_acc.append((acc["morphs_acc"] + acc["pos_acc"]) / 2) if "parser" in pipes: mean_acc.append((acc["uas"] + acc["las"]) / 2) if "ner" in pipes: mean_acc.append((acc["ents_p"] + acc["ents_r"] + acc["ents_f"]) / 3) if "textcat" in pipes: mean_acc.append(acc["textcat_score"]) + if "senter" in pipes: + mean_acc.append((acc["sent_p"] + acc["sent_r"] + acc["sent_f"]) / 3) return sum(mean_acc) / len(mean_acc) @@ -650,8 +639,8 @@ def _load_pretrained_tok2vec(nlp, loc): weights_data = file_.read() loaded = [] for name, component in nlp.pipeline: - if hasattr(component, "model") and hasattr(component.model, "tok2vec"): - component.tok2vec.from_bytes(weights_data) + if hasattr(component, "model") and component.model.has_ref("tok2vec"): + component.get_ref("tok2vec").from_bytes(weights_data) loaded.append(name) return loaded @@ -662,12 +651,10 @@ def _collate_best_model(meta, output_path, components): for component in components: bests[component] = _find_best(output_path, component) best_dest = output_path / "model-best" - shutil.copytree(path2str(output_path / "model-final"), path2str(best_dest)) + shutil.copytree(str(output_path / "model-final"), str(best_dest)) for component, best_component_src in bests.items(): - shutil.rmtree(path2str(best_dest / component)) - shutil.copytree( - path2str(best_component_src / component), path2str(best_dest / component) - ) + shutil.rmtree(str(best_dest / component)) + shutil.copytree(str(best_component_src / component), str(best_dest / component)) accs = srsly.read_json(best_component_src / "accuracy.json") for metric in _get_metrics(component): meta["accuracy"][metric] = accs[metric] @@ -692,11 +679,15 @@ def _find_best(experiment_dir, component): def _get_metrics(component): if component == "parser": - return ("las", "uas", "las_per_type", "token_acc") + return ("las", "uas", "las_per_type", "sent_f", "token_acc") elif component == "tagger": return ("tags_acc", "token_acc") + elif component == "morphologizer": + return ("morphs_acc", "pos_acc", "token_acc") elif component == "ner": return ("ents_f", "ents_p", "ents_r", "ents_per_type", "token_acc") + elif component == "senter": + return ("sent_f", "sent_p", "sent_r", "token_acc") elif component == "textcat": return ("textcat_score", "token_acc") return ("token_acc",) @@ -709,15 +700,25 @@ def _configure_training_output(pipeline, use_gpu, has_beam_widths): if pipe == "tagger": row_head.extend(["Tag Loss ", " Tag % "]) output_stats.extend(["tag_loss", "tags_acc"]) + elif pipe == "morphologizer" or pipe == "morphologizertagger": + row_head.extend(["Morph Loss ", " Morph % ", " POS % "]) + output_stats.extend(["morph_loss", "morphs_acc", "pos_acc"]) elif pipe == "parser": - row_head.extend(["Dep Loss ", " UAS ", " LAS "]) - output_stats.extend(["dep_loss", "uas", "las"]) + row_head.extend( + ["Dep Loss ", " UAS ", " LAS ", "Sent P", "Sent R", "Sent F"] + ) + output_stats.extend( + ["dep_loss", "uas", "las", "sent_p", "sent_r", "sent_f"] + ) elif pipe == "ner": row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "]) output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"]) elif pipe == "textcat": row_head.extend(["Textcat Loss", "Textcat"]) output_stats.extend(["textcat_loss", "textcat_score"]) + elif pipe == "senter": + row_head.extend(["Senter Loss", "Sent P", 
"Sent R", "Sent F"]) + output_stats.extend(["senter_loss", "sent_p", "sent_r", "sent_f"]) row_head.extend(["Token %", "CPU WPS"]) output_stats.extend(["token_acc", "cpu_wps"]) @@ -727,7 +728,10 @@ def _configure_training_output(pipeline, use_gpu, has_beam_widths): if has_beam_widths: row_head.insert(1, "Beam W.") - return row_head, output_stats + # remove duplicates + row_head_dict = {k: 1 for k in row_head} + output_stats_dict = {k: 1 for k in output_stats} + return row_head_dict.keys(), output_stats_dict.keys() def _get_progress( @@ -739,7 +743,9 @@ def _get_progress( scores["dep_loss"] = losses.get("parser", 0.0) scores["ner_loss"] = losses.get("ner", 0.0) scores["tag_loss"] = losses.get("tagger", 0.0) + scores["morph_loss"] = losses.get("morphologizer", 0.0) scores["textcat_loss"] = losses.get("textcat", 0.0) + scores["senter_loss"] = losses.get("senter", 0.0) scores["cpu_wps"] = cpu_wps scores["gpu_wps"] = gpu_wps or 0.0 scores.update(dev_scores) diff --git a/spacy/cli/train_from_config.py b/spacy/cli/train_from_config.py new file mode 100644 index 000000000..c75c861cc --- /dev/null +++ b/spacy/cli/train_from_config.py @@ -0,0 +1,402 @@ +from typing import Optional, Dict, List, Union, Sequence +from timeit import default_timer as timer +from pydantic import BaseModel, FilePath +import plac +import tqdm +from pathlib import Path +from wasabi import msg +import thinc +import thinc.schedules +from thinc.api import Model +import random + +from ..gold import GoldCorpus +from .. import util +from ..errors import Errors + +registry = util.registry + +CONFIG_STR = """ +[training] +patience = 10 +eval_frequency = 10 +dropout = 0.2 +init_tok2vec = null +vectors = null +max_epochs = 100 +orth_variant_level = 0.0 +gold_preproc = false +max_length = 0 +use_gpu = 0 +scores = ["ents_p", "ents_r", "ents_f"] +score_weights = {"ents_f": 1.0} +limit = 0 + +[training.batch_size] +@schedules = "compounding.v1" +start = 100 +stop = 1000 +compound = 1.001 + +[optimizer] +@optimizers = "Adam.v1" +learn_rate = 0.001 +beta1 = 0.9 +beta2 = 0.999 + +[nlp] +lang = "en" +vectors = ${training:vectors} + +[nlp.pipeline.tok2vec] +factory = "tok2vec" + +[nlp.pipeline.ner] +factory = "ner" + +[nlp.pipeline.ner.model] +@architectures = "spacy.TransitionBasedParser.v1" +nr_feature_tokens = 3 +hidden_width = 64 +maxout_pieces = 3 + +[nlp.pipeline.ner.model.tok2vec] +@architectures = "spacy.Tok2VecTensors.v1" +width = ${nlp.pipeline.tok2vec.model:width} + +[nlp.pipeline.tok2vec.model] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = ${nlp:vectors} +width = 128 +depth = 4 +window_size = 1 +embed_size = 10000 +maxout_pieces = 3 +subword_features = true +""" + + +class PipelineComponent(BaseModel): + factory: str + model: Model + + class Config: + arbitrary_types_allowed = True + + +class ConfigSchema(BaseModel): + optimizer: Optional["Optimizer"] + + class training(BaseModel): + patience: int = 10 + eval_frequency: int = 100 + dropout: float = 0.2 + init_tok2vec: Optional[FilePath] = None + vectors: Optional[str] = None + max_epochs: int = 100 + orth_variant_level: float = 0.0 + gold_preproc: bool = False + max_length: int = 0 + use_gpu: int = 0 + scores: List[str] = ["ents_p", "ents_r", "ents_f"] + score_weights: Dict[str, Union[int, float]] = {"ents_f": 1.0} + limit: int = 0 + batch_size: Union[Sequence[int], int] + + class nlp(BaseModel): + lang: str + vectors: Optional[str] + pipeline: Optional[Dict[str, PipelineComponent]] + + class Config: + extra = "allow" + + +@plac.annotations( + # fmt: off + 
train_path=("Location of JSON-formatted training data", "positional", None, Path), + dev_path=("Location of JSON-formatted development data", "positional", None, Path), + config_path=("Path to config file", "positional", None, Path), + output_path=("Output directory to store model in", "option", "o", Path), + meta_path=("Optional path to meta.json to use as base.", "option", "m", Path), + raw_text=("Path to jsonl file with unlabelled text documents.", "option", "rt", Path), + use_gpu=("Use GPU", "option", "g", int), + # fmt: on +) +def train_from_config_cli( + train_path, + dev_path, + config_path, + output_path=None, + meta_path=None, + raw_text=None, + debug=False, + verbose=False, + use_gpu=-1 +): + """ + Train or update a spaCy model. Requires data to be formatted in spaCy's + JSON format. To convert data from other formats, use the `spacy convert` + command. + """ + if not config_path or not config_path.exists(): + msg.fail("Config file not found", config_path, exits=1) + if not train_path or not train_path.exists(): + msg.fail("Training data not found", train_path, exits=1) + if not dev_path or not dev_path.exists(): + msg.fail("Development data not found", dev_path, exits=1) + if meta_path is not None and not meta_path.exists(): + msg.fail("Can't find model meta.json", meta_path, exits=1) + if output_path is not None and not output_path.exists(): + output_path.mkdir() + + if use_gpu >= 0: + msg.info("Using GPU") + util.use_gpu(use_gpu) + else: + msg.info("Using CPU") + + train_from_config( + config_path, + {"train": train_path, "dev": dev_path}, + output_path=output_path, + meta_path=meta_path, + raw_text=raw_text, + ) + + +def train_from_config( + config_path, data_paths, raw_text=None, meta_path=None, output_path=None, +): + msg.info(f"Loading config from: {config_path}") + config = util.load_config(config_path, create_objects=False) + util.fix_random_seed(config["training"]["seed"]) + nlp_config = config["nlp"] + config = util.load_config(config_path, create_objects=True) + msg.info("Creating nlp from config") + nlp = util.load_model_from_config(nlp_config) + optimizer = config["optimizer"] + training = config["training"] + limit = training["limit"] + msg.info("Loading training corpus") + corpus = GoldCorpus(data_paths["train"], data_paths["dev"], limit=limit) + msg.info("Initializing the nlp pipeline") + nlp.begin_training(lambda: corpus.train_examples) + + train_batches = create_train_batches(nlp, corpus, training) + evaluate = create_evaluation_callback(nlp, optimizer, corpus, training) + + # Create iterator, which yields out info after each optimization step. + msg.info("Start training") + training_step_iterator = train_while_improving( + nlp, + optimizer, + train_batches, + evaluate, + dropout=training["dropout"], + accumulate_gradient=training["accumulate_gradient"], + patience=training.get("patience", 0), + max_steps=training.get("max_steps", 0), + eval_frequency=training["eval_frequency"], + ) + + msg.info(f"Training. 
Initial learn rate: {optimizer.learn_rate}") + print_row = setup_printer(training, nlp) + + try: + progress = tqdm.tqdm(total=training["eval_frequency"], leave=False) + for batch, info, is_best_checkpoint in training_step_iterator: + progress.update(1) + if is_best_checkpoint is not None: + progress.close() + print_row(info) + if is_best_checkpoint and output_path is not None: + nlp.to_disk(output_path) + progress = tqdm.tqdm(total=training["eval_frequency"], leave=False) + finally: + if output_path is not None: + final_model_path = output_path / "model-final" + if optimizer.averages: + with nlp.use_params(optimizer.averages): + nlp.to_disk(final_model_path) + else: + nlp.to_disk(final_model_path) + msg.good("Saved model to output directory", final_model_path) + + +def create_train_batches(nlp, corpus, cfg): + epochs_todo = cfg.get("max_epochs", 0) + while True: + train_examples = list(corpus.train_dataset( + nlp, + noise_level=0.0, + orth_variant_level=cfg["orth_variant_level"], + gold_preproc=cfg["gold_preproc"], + max_length=cfg["max_length"], + ignore_misaligned=True, + )) + if len(train_examples) == 0: + raise ValueError(Errors.E988) + random.shuffle(train_examples) + batches = util.minibatch_by_words(train_examples, size=cfg["batch_size"]) + for batch in batches: + yield batch + epochs_todo -= 1 + # We intentionally compare exactly to 0 here, so that max_epochs < 1 + # will not break. + if epochs_todo == 0: + break + + +def create_evaluation_callback(nlp, optimizer, corpus, cfg): + def evaluate(): + dev_examples = list( + corpus.dev_dataset( + nlp, gold_preproc=cfg["gold_preproc"], ignore_misaligned=True + ) + ) + n_words = sum(len(ex.doc) for ex in dev_examples) + start_time = timer() + + if optimizer.averages: + with nlp.use_params(optimizer.averages): + scorer = nlp.evaluate(dev_examples, batch_size=32) + else: + scorer = nlp.evaluate(dev_examples, batch_size=32) + end_time = timer() + wps = n_words / (end_time - start_time) + scores = scorer.scores + # Calculate a weighted sum based on score_weights for the main score + weights = cfg["score_weights"] + weighted_score = sum(scores[s] * weights.get(s, 0.0) for s in weights) + scores["speed"] = wps + return weighted_score, scores + + return evaluate + + +def train_while_improving( + nlp, optimizer, train_data, evaluate, *, dropout, eval_frequency, + accumulate_gradient=1, patience=0, max_steps=0 +): + """Train until an evaluation stops improving. Works as a generator, + with each iteration yielding a tuple `(batch, info, is_best_checkpoint)`, + where info is a dict, and is_best_checkpoint is in [True, False, None] -- + None indicating that the iteration was not evaluated as a checkpoint. + The evaluation is conducted by calling the evaluate callback, which should + + Positional arguments: + nlp: The spaCy pipeline to evaluate. + optimizer: The optimizer callable. + train_data (Iterable[Batch]): A generator of batches, with the training + data. Each batch should be a Sized[Tuple[Input, Annot]]. The training + data iterable needs to take care of iterating over the epochs and + shuffling. + evaluate (Callable[[], Tuple[float, Any]]): A callback to perform evaluation. + The callback should take no arguments and return a tuple + `(main_score, other_scores)`. The main_score should be a float where + higher is better. other_scores can be any object. + + Every iteration, the function yields out a tuple with: + + * batch: A zipped sequence of Tuple[Doc, GoldParse] pairs. 
+ * info: A dict with various information about the last update (see below). + * is_best_checkpoint: A value in None, False, True, indicating whether this + was the best evaluation so far. You should use this to save the model + checkpoints during training. If None, evaluation was not conducted on + that iteration. False means evaluation was conducted, but a previous + evaluation was better. + + The info dict provides the following information: + + epoch (int): How many passes over the data have been completed. + step (int): How many steps have been completed. + score (float): The main score form the last evaluation. + other_scores: : The other scores from the last evaluation. + loss: The accumulated losses throughout training. + checkpoints: A list of previous results, where each result is a + (score, step, epoch) tuple. + """ + if isinstance(dropout, float): + dropouts = thinc.schedules.constant(dropout) + else: + dropouts = dropout + results = [] + losses = {} + to_enable = [name for name, proc in nlp.pipeline if hasattr(proc, "model")] + + for step, batch in enumerate(train_data): + dropout = next(dropouts) + with nlp.select_pipes(enable=to_enable): + for subbatch in subdivide_batch(batch, accumulate_gradient): + nlp.update(subbatch, drop=dropout, losses=losses, sgd=False) + for name, proc in nlp.pipeline: + if hasattr(proc, "model"): + proc.model.finish_update(optimizer) + optimizer.step_schedules() + if not (step % eval_frequency): + score, other_scores = evaluate() + results.append((score, step)) + is_best_checkpoint = score == max(results)[0] + else: + score, other_scores = (None, None) + is_best_checkpoint = None + info = { + "step": step, + "score": score, + "other_scores": other_scores, + "losses": losses, + "checkpoints": results, + } + yield batch, info, is_best_checkpoint + if is_best_checkpoint is not None: + losses = {} + # Stop if no improvement in `patience` updates (if specified) + best_score, best_step = max(results) + if patience and (step - best_step) >= patience: + break + # Stop if we've exhausted our max steps (if specified) + if max_steps and (step * accumulate_gradient) >= max_steps: + break + + +def subdivide_batch(batch, accumulate_gradient): + batch = list(batch) + batch.sort(key=lambda eg: len(eg.doc)) + sub_len = len(batch) // accumulate_gradient + start = 0 + for i in range(accumulate_gradient): + subbatch = batch[start : start + sub_len] + if subbatch: + yield subbatch + start += len(subbatch) + subbatch = batch[start : ] + if subbatch: + yield subbatch + + +def setup_printer(training, nlp): + score_cols = training["scores"] + score_widths = [max(len(col), 6) for col in score_cols] + loss_cols = [f"Loss {pipe}" for pipe in nlp.pipe_names] + loss_widths = [max(len(col), 8) for col in loss_cols] + table_header = ["#"] + loss_cols + score_cols + ["Score"] + table_header = [col.upper() for col in table_header] + table_widths = [6] + loss_widths + score_widths + [6] + table_aligns = ["r" for _ in table_widths] + + msg.row(table_header, widths=table_widths) + msg.row(["-" * width for width in table_widths]) + + def print_row(info): + losses = [ + "{0:.2f}".format(float(info["losses"].get(pipe_name, 0.0))) + for pipe_name in nlp.pipe_names + ] + scores = [ + "{0:.2f}".format(float(info["other_scores"].get(col, 0.0))) for col in score_cols + ] + data = [info["step"]] + losses + scores + ["{0:.2f}".format(float(info["score"]))] + msg.row(data, widths=table_widths, aligns=table_aligns) + + return print_row diff --git a/spacy/cli/validate.py b/spacy/cli/validate.py 
index 93abad6f6..a23ce3453 100644 --- a/spacy/cli/validate.py +++ b/spacy/cli/validate.py @@ -1,14 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals, print_function - from pathlib import Path import sys import requests -import srsly from wasabi import msg -from ..compat import path2str -from ..util import get_data_path from .. import about @@ -17,51 +11,30 @@ def validate(): Validate that the currently installed version of spaCy is compatible with the installed models. Should be run after `pip install -U spacy`. """ - with msg.loading("Loading compatibility table..."): - r = requests.get(about.__compatibility__) - if r.status_code != 200: - msg.fail( - "Server error ({})".format(r.status_code), - "Couldn't fetch compatibility table.", - exits=1, - ) - msg.good("Loaded compatibility table") - compat = r.json()["spacy"] - version = about.__version__ - version = version.rsplit(".dev", 1)[0] - current_compat = compat.get(version) + model_pkgs, compat = get_model_pkgs() + spacy_version = about.__version__.rsplit(".dev", 1)[0] + current_compat = compat.get(spacy_version, {}) if not current_compat: - msg.fail( - "Can't find spaCy v{} in compatibility table".format(version), - about.__compatibility__, - exits=1, - ) - all_models = set() - for spacy_v, models in dict(compat).items(): - all_models.update(models.keys()) - for model, model_vs in models.items(): - compat[spacy_v][model] = [reformat_version(v) for v in model_vs] - model_links = get_model_links(current_compat) - model_pkgs = get_model_pkgs(current_compat, all_models) - incompat_links = {l for l, d in model_links.items() if not d["compat"]} + msg.warn(f"No compatible models found for v{spacy_version} of spaCy") incompat_models = {d["name"] for _, d in model_pkgs.items() if not d["compat"]} - incompat_models.update( - [d["name"] for _, d in model_links.items() if not d["compat"]] - ) na_models = [m for m in incompat_models if m not in current_compat] update_models = [m for m in incompat_models if m in current_compat] spacy_dir = Path(__file__).parent.parent - msg.divider("Installed models (spaCy v{})".format(about.__version__)) - msg.info("spaCy installation: {}".format(path2str(spacy_dir))) + msg.divider(f"Installed models (spaCy v{about.__version__})") + msg.info(f"spaCy installation: {spacy_dir}") - if model_links or model_pkgs: - header = ("TYPE", "NAME", "MODEL", "VERSION", "") + if model_pkgs: + header = ("NAME", "VERSION", "") rows = [] for name, data in model_pkgs.items(): - rows.append(get_model_row(current_compat, name, data, msg)) - for name, data in model_links.items(): - rows.append(get_model_row(current_compat, name, data, msg, "link")) + if data["compat"]: + comp = msg.text("", color="green", icon="good", no_print=True) + version = msg.text(data["version"], color="green", no_print=True) + else: + version = msg.text(data["version"], color="red", no_print=True) + comp = f"--> {compat.get(data['name'], ['n/a'])[0]}" + rows.append((data["name"], version, comp)) msg.table(rows, header=header) else: msg.text("No models found in your current environment.", exits=0) @@ -71,44 +44,32 @@ def validate(): cmd = "python -m spacy download {}" print("\n".join([cmd.format(pkg) for pkg in update_models]) + "\n") if na_models: - msg.text( - "The following models are not available for spaCy " - "v{}: {}".format(about.__version__, ", ".join(na_models)) + msg.warn( + f"The following models are not available for spaCy v{about.__version__}:", + ", ".join(na_models), ) - if incompat_links: - msg.text( - "You may also want to 
overwrite the incompatible links using the " - "`python -m spacy link` command with `--force`, or remove them " - "from the data directory. " - "Data path: {path}".format(path=path2str(get_data_path())) - ) - if incompat_models or incompat_links: + if incompat_models: sys.exit(1) -def get_model_links(compat): - links = {} - data_path = get_data_path() - if data_path: - models = [p for p in data_path.iterdir() if is_model_path(p)] - for model in models: - meta_path = Path(model) / "meta.json" - if not meta_path.exists(): - continue - meta = srsly.read_json(meta_path) - link = model.parts[-1] - name = meta["lang"] + "_" + meta["name"] - links[link] = { - "name": name, - "version": meta["version"], - "compat": is_compat(compat, name, meta["version"]), - } - return links - - -def get_model_pkgs(compat, all_models): +def get_model_pkgs(): import pkg_resources + with msg.loading("Loading compatibility table..."): + r = requests.get(about.__compatibility__) + if r.status_code != 200: + msg.fail( + f"Server error ({r.status_code})", + "Couldn't fetch compatibility table.", + exits=1, + ) + msg.good("Loaded compatibility table") + compat = r.json()["spacy"] + all_models = set() + for spacy_v, models in dict(compat).items(): + all_models.update(models.keys()) + for model, model_vs in models.items(): + compat[spacy_v][model] = [reformat_version(v) for v in model_vs] pkgs = {} for pkg_name, pkg_data in pkg_resources.working_set.by_key.items(): package = pkg_name.replace("-", "_") @@ -117,29 +78,9 @@ def get_model_pkgs(compat, all_models): pkgs[pkg_name] = { "name": package, "version": version, - "compat": is_compat(compat, package, version), + "compat": package in compat and version in compat[package], } - return pkgs - - -def get_model_row(compat, name, data, msg, model_type="package"): - if data["compat"]: - comp = msg.text("", color="green", icon="good", no_print=True) - version = msg.text(data["version"], color="green", no_print=True) - else: - version = msg.text(data["version"], color="red", no_print=True) - comp = "--> {}".format(compat.get(data["name"], ["n/a"])[0]) - return (model_type, name, data["name"], version, comp) - - -def is_model_path(model_path): - exclude = ["cache", "pycache", "__pycache__"] - name = model_path.parts[-1] - return model_path.is_dir() and name not in exclude and not name.startswith(".") - - -def is_compat(compat, name, version): - return name in compat and version in compat[name] + return pkgs, compat def reformat_version(version): diff --git a/spacy/compat.py b/spacy/compat.py index 0ea31c6b3..d8377633f 100644 --- a/spacy/compat.py +++ b/spacy/compat.py @@ -1,4 +1,3 @@ -# coding: utf8 """ Helpers for Python and platform compatibility. To distinguish them from the builtin functions, replacement functions are suffixed with an underscore, @@ -6,15 +5,9 @@ e.g. `unicode_`. 
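The rewritten `get_model_pkgs()` above folds the old `is_compat()` helper into a one-line membership check. A minimal sketch of that check with a made-up compatibility table (the table contents here are illustrative, not real release data):

```python
# Illustrative compatibility table; the real one is fetched from
# about.__compatibility__ and keyed by spaCy version.
compat = {
    "en_core_web_sm": ["2.2.5", "2.2.0"],
    "de_core_news_sm": ["2.2.5"],
}

def is_compatible(package, version):
    # Mirrors the inline check in get_model_pkgs():
    # package in compat and version in compat[package]
    return package in compat and version in compat[package]

print(is_compatible("en_core_web_sm", "2.2.5"))   # True
print(is_compatible("en_core_web_sm", "2.1.8"))   # False
print(is_compatible("xx_ent_wiki_sm", "2.2.0"))   # False (not in table)
```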
DOCS: https://spacy.io/api/top-level#compat """ -from __future__ import unicode_literals - -import os import sys -import itertools -import ast -import types -from thinc.neural.util import copy_array +from thinc.util import copy_array try: import cPickle as pickle @@ -36,91 +29,23 @@ try: except ImportError: cupy = None -try: - from thinc.neural.optimizers import Optimizer # noqa: F401 -except ImportError: - from thinc.neural.optimizers import Adam as Optimizer # noqa: F401 +from thinc.api import Optimizer # noqa: F401 pickle = pickle copy_reg = copy_reg CudaStream = CudaStream cupy = cupy copy_array = copy_array -izip = getattr(itertools, "izip", zip) is_windows = sys.platform.startswith("win") is_linux = sys.platform.startswith("linux") is_osx = sys.platform == "darwin" -# See: https://github.com/benjaminp/six/blob/master/six.py -is_python2 = sys.version_info[0] == 2 -is_python3 = sys.version_info[0] == 3 -is_python_pre_3_5 = is_python2 or (is_python3 and sys.version_info[1] < 5) -if is_python2: - bytes_ = str - unicode_ = unicode # noqa: F821 - basestring_ = basestring # noqa: F821 - input_ = raw_input # noqa: F821 - path2str = lambda path: str(path).decode("utf8") - class_types = (type, types.ClassType) - -elif is_python3: - bytes_ = bytes - unicode_ = str - basestring_ = str - input_ = input - path2str = lambda path: str(path) - class_types = (type, types.ClassType) if is_python_pre_3_5 else type - - -def b_to_str(b_str): - """Convert a bytes object to a string. - - b_str (bytes): The object to convert. - RETURNS (unicode): The converted string. - """ - if is_python2: - return b_str - # Important: if no encoding is set, string becomes "b'...'" - return str(b_str, encoding="utf8") - - -def symlink_to(orig, dest): - """Create a symlink. Used for model shortcut links. - - orig (unicode / Path): The origin path. - dest (unicode / Path): The destination path of the symlink. - """ - if is_windows: - import subprocess - - subprocess.check_call( - ["mklink", "/d", path2str(orig), path2str(dest)], shell=True - ) - else: - orig.symlink_to(dest) - - -def symlink_remove(link): - """Remove a symlink. Used for model shortcut links. - - link (unicode / Path): The path to the symlink. - """ - # https://stackoverflow.com/q/26554135/6400719 - if os.path.isdir(path2str(link)) and is_windows: - # this should only be on Py2.7 and windows - os.rmdir(path2str(link)) - else: - os.unlink(path2str(link)) - - -def is_config(python2=None, python3=None, windows=None, linux=None, osx=None): +def is_config(windows=None, linux=None, osx=None, **kwargs): """Check if a specific configuration of Python version and operating system matches the user's setup. Mostly used to display targeted error messages. - python2 (bool): spaCy is executed with Python 2.x. - python3 (bool): spaCy is executed with Python 3.x. windows (bool): spaCy is executed on Windows. linux (bool): spaCy is executed on Linux. osx (bool): spaCy is executed on OS X or macOS. @@ -129,53 +54,7 @@ def is_config(python2=None, python3=None, windows=None, linux=None, osx=None): DOCS: https://spacy.io/api/top-level#compat.is_config """ return ( - python2 in (None, is_python2) - and python3 in (None, is_python3) - and windows in (None, is_windows) + windows in (None, is_windows) and linux in (None, is_linux) and osx in (None, is_osx) ) - - -def import_file(name, loc): - """Import module from a file. Used to load models from a directory. - - name (unicode): Name of module to load. - loc (unicode / Path): Path to the file. - RETURNS: The loaded module. 
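With the Python 2 branches gone, `is_config()` only distinguishes operating systems. A short usage sketch, assuming this branch of spaCy is installed so that `spacy.compat` exposes the trimmed-down signature shown above:

```python
from spacy.compat import is_config, is_windows, is_linux, is_osx

# Flags left as None are treated as "don't care".
if is_config(windows=True):
    print("Running on Windows")
elif is_config(linux=True) or is_config(osx=True):
    print("Running on Linux or macOS")

print(is_windows, is_linux, is_osx)
```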
- """ - loc = path2str(loc) - if is_python_pre_3_5: - import imp - - return imp.load_source(name, loc) - else: - import importlib.util - - spec = importlib.util.spec_from_file_location(name, str(loc)) - module = importlib.util.module_from_spec(spec) - spec.loader.exec_module(module) - return module - - -def unescape_unicode(string): - """Python2.7's re module chokes when compiling patterns that have ranges - between escaped unicode codepoints if the two codepoints are unrecognised - in the unicode database. For instance: - - re.compile('[\\uAA77-\\uAA79]').findall("hello") - - Ends up matching every character (on Python 2). This problem doesn't occur - if we're dealing with unicode literals. - """ - if string is None: - return string - # We only want to unescape the unicode, so we first must protect the other - # backslashes. - string = string.replace("\\", "\\\\") - # Now we remove that protection for the unicode. - string = string.replace("\\\\u", "\\u") - string = string.replace("\\\\U", "\\U") - # Now we unescape by evaling the string with the AST. This can't execute - # code -- it only does the representational level. - return ast.literal_eval("u'''" + string + "'''") diff --git a/spacy/data/__init__.py b/spacy/data/__init__.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/spacy/displacy/__init__.py b/spacy/displacy/__init__.py index a0cccbbde..3f84dabce 100644 --- a/spacy/displacy/__init__.py +++ b/spacy/displacy/__init__.py @@ -1,17 +1,13 @@ -# coding: utf8 """ spaCy's built in visualization suite for dependencies and named entities. DOCS: https://spacy.io/api/top-level#displacy USAGE: https://spacy.io/usage/visualizers """ -from __future__ import unicode_literals - import warnings from .render import DependencyRenderer, EntityRenderer from ..tokens import Doc, Span -from ..compat import b_to_str from ..errors import Errors, Warnings from ..util import is_in_jupyter @@ -95,20 +91,20 @@ def serve( render(docs, style=style, page=page, minify=minify, options=options, manual=manual) httpd = simple_server.make_server(host, port, app) - print("\nUsing the '{}' visualizer".format(style)) - print("Serving on http://{}:{} ...\n".format(host, port)) + print(f"\nUsing the '{style}' visualizer") + print(f"Serving on http://{host}:{port} ...\n") try: httpd.serve_forever() except KeyboardInterrupt: - print("Shutting down server on port {}.".format(port)) + print(f"Shutting down server on port {port}.") finally: httpd.server_close() def app(environ, start_response): # Headers and status need to be bytes in Python 2, see #1227 - headers = [(b_to_str(b"Content-type"), b_to_str(b"text/html; charset=utf-8"))] - start_response(b_to_str(b"200 OK"), headers) + headers = [("Content-type", "text/html; charset=utf-8")] + start_response("200 OK", headers) res = _html["parsed"].encode(encoding="utf-8") return [res] diff --git a/spacy/displacy/render.py b/spacy/displacy/render.py index 57d67c96b..0d4cdb77f 100644 --- a/spacy/displacy/render.py +++ b/spacy/displacy/render.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import uuid from .templates import ( @@ -61,7 +58,7 @@ class DependencyRenderer(object): settings = p.get("settings", {}) self.direction = settings.get("direction", DEFAULT_DIR) self.lang = settings.get("lang", DEFAULT_LANG) - render_id = "{}-{}".format(id_prefix, i) + render_id = f"{id_prefix}-{i}" svg = self.render_svg(render_id, p["words"], p["arcs"]) rendered.append(svg) if page: diff --git a/spacy/displacy/templates.py 
b/spacy/displacy/templates.py index f29eab86f..ff99000f4 100644 --- a/spacy/displacy/templates.py +++ b/spacy/displacy/templates.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Setting explicit height and max-width: none on the SVG is required for # Jupyter to render it properly in a cell diff --git a/spacy/errors.py b/spacy/errors.py index 0750ab616..905f7d443 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - def add_codes(err_cls): """Add error codes to string messages via class attribute names.""" @@ -92,9 +88,9 @@ class Warnings(object): W022 = ("Training a new part-of-speech tagger using a model with no " "lemmatization rules or data. This means that the trained model " "may not be able to lemmatize correctly. If this is intentional " - "or the language you're using doesn't have lemmatization data. " - "If this is surprising, make sure you have the spacy-lookups-data " - "package installed.") + "or the language you're using doesn't have lemmatization data, " + "you can ignore this warning. If this is surprising, make sure you " + "have the spacy-lookups-data package installed.") W023 = ("Multiprocessing of Language.pipe is not supported in Python 2. " "'n_process' will be set to 1.") W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in " @@ -116,6 +112,19 @@ class Warnings(object): " to check the alignment. Misaligned entities ('-') will be " "ignored during training.") + # TODO: fix numbering after merging develop into master + W095 = ("Skipping unsupported morphological feature(s): {feature}. " + "Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or " + "string \"Field1=Value1,Value2|Field2=Value3\".") + W096 = ("The method 'disable_pipes' has become deprecated - use 'select_pipes' " + "instead.") + W097 = ("No Model config was provided to create the '{name}' component, " + "and no default configuration could be found either.") + W098 = ("No Model config was provided to create the '{name}' component, " + "so a default configuration was used.") + W099 = ("Expected 'dict' type for the 'model' argument of pipe '{pipe}', " + "but got '{type}' instead, so ignoring it.") + @add_codes class Errors(object): @@ -137,7 +146,7 @@ class Errors(object): E007 = ("'{name}' already exists in pipeline. Existing names: {opts}") E008 = ("Some current components would be lost when restoring previous " "pipeline state. If you added components after calling " - "`nlp.disable_pipes()`, you should remove them explicitly with " + "`nlp.select_pipes()`, you should remove them explicitly with " "`nlp.remove_pipe()` before the pipeline is restored. Names of " "the new components: {names}") E009 = ("The `update` method expects same number of docs and golds, but " @@ -198,7 +207,7 @@ class Errors(object): "the documentation:\nhttps://spacy.io/usage/models") E030 = ("Sentence boundaries unset. You can add the 'sentencizer' " "component to the pipeline with: " - "nlp.add_pipe(nlp.create_pipe('sentencizer')) " + "nlp.add_pipe(nlp.create_pipe('sentencizer')). " "Alternatively, add the dependency parser, or set sentence " "boundaries by setting doc[i].is_sent_start.") E031 = ("Invalid token: empty string ('') at position {i}.") @@ -234,15 +243,10 @@ class Errors(object): E047 = ("Can't assign a value to unregistered extension attribute " "'{name}'. 
Did you forget to call the `set_extension` method?") E048 = ("Can't import language {lang} from spacy.lang: {err}") - E049 = ("Can't find spaCy data directory: '{path}'. Check your " - "installation and permissions, or use spacy.util.set_data_path " - "to customise the location if necessary.") - E050 = ("Can't find model '{name}'. It doesn't seem to be a shortcut " - "link, a Python package or a valid path to a data directory.") - E051 = ("Cant' load '{name}'. If you're using a shortcut link, make sure " - "it points to a valid package (not just a data directory).") + E050 = ("Can't find model '{name}'. It doesn't seem to be a Python " + "package or a valid path to a data directory.") E052 = ("Can't find model directory: {path}") - E053 = ("Could not read meta.json from {path}") + E053 = ("Could not read {name} from {path}") E054 = ("No valid '{setting}' setting found in model meta.json.") E055 = ("Invalid ORTH value in exception:\nKey: {key}\nOrths: {orths}") E056 = ("Invalid tokenizer exception: ORTH values combined don't match " @@ -360,8 +364,8 @@ class Errors(object): E108 = ("As of spaCy v2.1, the pipe name `sbd` has been deprecated " "in favor of the pipe name `sentencizer`, which does the same " "thing. For example, use `nlp.create_pipeline('sentencizer')`") - E109 = ("Model for component '{name}' not initialized. Did you forget to " - "load a model, or forget to call begin_training()?") + E109 = ("Component '{name}' could not be run. Did you forget to " + "call begin_training()?") E110 = ("Invalid displaCy render wrapper. Expected callable, got: {obj}") E111 = ("Pickling a token is not supported, because tokens are only views " "of the parent Doc and can't exist on their own. A pickled token " @@ -431,8 +435,6 @@ class Errors(object): E134 = ("Entity '{entity}' is not defined in the Knowledge Base.") E135 = ("If you meant to replace a built-in component, use `create_pipe`: " "`nlp.replace_pipe('{name}', nlp.create_pipe('{name}'))`") - E136 = ("This additional feature requires the jsonschema library to be " - "installed:\npip install jsonschema") E137 = ("Expected 'dict' type, but got '{type}' from '{line}'. Make sure " "to provide a valid JSON object as input with either the `text` " "or `tokens` key. For more info, see the docs:\n" @@ -440,8 +442,7 @@ class Errors(object): E138 = ("Invalid JSONL format for raw text '{text}'. Make sure the input " "includes either the `text` or `tokens` key. For more info, see " "the docs:\nhttps://spacy.io/api/cli#pretrain-jsonl") - E139 = ("Knowledge Base for component '{name}' not initialized. Did you " - "forget to call set_kb()?") + E139 = ("Knowledge Base for component '{name}' is empty.") E140 = ("The list of entities, prior probabilities and entity vectors " "should be of equal length.") E141 = ("Entity vectors should be of length {required} instead of the " @@ -568,6 +569,38 @@ class Errors(object): E198 = ("Unable to return {n} most similar vectors for the current vectors " "table, which contains {n_rows} vectors.") + # TODO: fix numbering after merging develop into master + + E987 = ("The text of an example training instance is either a Doc or " + "a string, but found {type} instead.") + E988 = ("Could not parse any training examples. Ensure the data is " + "formatted correctly.") + E989 = ("'nlp.update()' was called with two positional arguments. This " + "may be due to a backwards-incompatible change to the format " + "of the training data in spaCy 3.0 onwards. 
The 'update' " + "function should now be called with a batch of 'Example' " + "objects, instead of (text, annotation) tuples. ") + E990 = ("An entity linking component needs to be initialized with a " + "KnowledgeBase object, but found {type} instead.") + E991 = ("The function 'select_pipes' should be called with either a " + "'disable' argument to list the names of the pipe components " + "that should be disabled, or with an 'enable' argument that " + "specifies which pipes should not be disabled.") + E992 = ("The function `select_pipes` was called with `enable`={enable} " + "and `disable`={disable} but that information is conflicting " + "for the `nlp` pipeline with components {names}.") + E993 = ("The config for 'nlp' should include either a key 'name' to " + "refer to an existing model by name or path, or a key 'lang' " + "to create a new blank model.") + E996 = ("Could not parse {file}: {msg}") + E997 = ("Tokenizer special cases are not allowed to modify the text. " + "This would map '{chunk}' to '{orth}' given token attributes " + "'{token_attrs}'.") + E998 = ("To create GoldParse objects from Example objects without a " + "Doc, get_gold_parses() should be called with a Vocab object.") + E999 = ("Encountered an unexpected format for the dictionary holding " + "gold annotations: {gold_dict}") + @add_codes class TempErrors(object): @@ -592,10 +625,10 @@ class MatchPatternError(ValueError): errors (dict): Validation errors (sequence of strings) mapped to pattern ID, i.e. the index of the added pattern. """ - msg = "Invalid token patterns for matcher rule '{}'\n".format(key) + msg = f"Invalid token patterns for matcher rule '{key}'\n" for pattern_idx, error_msgs in errors.items(): - pattern_errors = "\n".join(["- {}".format(e) for e in error_msgs]) - msg += "\nPattern {}:\n{}\n".format(pattern_idx, pattern_errors) + pattern_errors = "\n".join([f"- {e}" for e in error_msgs]) + msg += f"\nPattern {pattern_idx}:\n{pattern_errors}\n" ValueError.__init__(self, msg) diff --git a/spacy/glossary.py b/spacy/glossary.py index 44a8277da..938a575cd 100644 --- a/spacy/glossary.py +++ b/spacy/glossary.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - def explain(term): """Get a description for a given POS tag, dependency label or entity type. 
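The new W096/E991/E992 messages above document the `disable_pipes` → `select_pipes` rename: you pass either `enable` or `disable`, not both. A hedged sketch of how that reads in user code; the pipeline here is a placeholder, and `select_pipes` only exists on this development branch:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("sentencizer"))

# Run only the sentencizer; other components (if any) are disabled and the
# previous pipeline state is restored when the context manager exits.
with nlp.select_pipes(enable=["sentencizer"]):
    doc = nlp("This is one sentence. This is another.")
    print([sent.text for sent in doc.sents])

# Passing both enable= and disable= with conflicting contents raises E992.
```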
diff --git a/spacy/gold.pxd b/spacy/gold.pxd index 20a25a939..bf724868f 100644 --- a/spacy/gold.pxd +++ b/spacy/gold.pxd @@ -1,9 +1,10 @@ from cymem.cymem cimport Pool -from .structs cimport TokenC from .typedefs cimport attr_t from .syntax.transition_system cimport Transition +from .tokens import Doc + cdef struct GoldParseC: int* tags @@ -19,23 +20,49 @@ cdef class GoldParse: cdef Pool mem cdef GoldParseC c + cdef readonly TokenAnnotation orig cdef int length cdef public int loss cdef public list words cdef public list tags - cdef public list morphology + cdef public list pos + cdef public list morphs + cdef public list lemmas + cdef public list sent_starts cdef public list heads cdef public list labels cdef public dict orths cdef public list ner - cdef public list ents cdef public dict brackets - cdef public object cats + cdef public dict cats cdef public dict links cdef readonly list cand_to_gold cdef readonly list gold_to_cand - cdef readonly list orig_annot +cdef class TokenAnnotation: + cdef public list ids + cdef public list words + cdef public list tags + cdef public list pos + cdef public list morphs + cdef public list lemmas + cdef public list heads + cdef public list deps + cdef public list entities + cdef public list sent_starts + cdef public dict brackets_by_start + + +cdef class DocAnnotation: + cdef public object cats + cdef public object links + + +cdef class Example: + cdef public object doc + cdef public TokenAnnotation token_annotation + cdef public DocAnnotation doc_annotation + cdef public object goldparse diff --git a/spacy/gold.pyx b/spacy/gold.pyx index cf67a2ac7..13e448342 100644 --- a/spacy/gold.pyx +++ b/spacy/gold.pyx @@ -1,7 +1,4 @@ # cython: profile=True -# coding: utf8 -from __future__ import unicode_literals, print_function - import re import random import numpy @@ -15,11 +12,7 @@ import warnings from .syntax import nonproj from .tokens import Doc, Span from .errors import Errors, AlignmentError, Warnings -from .compat import path2str from . import util -from .util import minibatch, itershuffle - -from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek punct_re = re.compile(r"\W") @@ -161,30 +154,32 @@ class GoldCorpus(object): def __init__(self, train, dev, gold_preproc=False, limit=None): """Create a GoldCorpus. - train_path (unicode or Path): File or directory of training data. - dev_path (unicode or Path): File or directory of development data. + train (unicode or Path): File or directory of training data. + dev (unicode or Path): File or directory of development data. RETURNS (GoldCorpus): The newly created object. 
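`GoldCorpus` above now streams `Example` objects (rather than raw annotation tuples) through a temporary msgpack directory. A hedged usage sketch, assuming this development branch is installed and that `train.json` and `dev.json` are hypothetical files in spaCy's JSON training format:

```python
import spacy
from spacy.gold import GoldCorpus   # development-branch API

nlp = spacy.blank("en")
# "train.json" / "dev.json" are placeholder paths for illustration.
corpus = GoldCorpus("train.json", "dev.json")

print(corpus.count_train())          # number of training words
for example in corpus.dev_dataset(nlp, gold_preproc=False):
    # each item is an Example whose .gold (a GoldParse) is ready for scoring
    print(example.text, example.gold.words[:5])
    break
```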
""" self.limit = limit if isinstance(train, str) or isinstance(train, Path): - train = self.read_tuples(self.walk_corpus(train)) - dev = self.read_tuples(self.walk_corpus(dev)) + train = self.read_examples(self.walk_corpus(train)) + dev = self.read_examples(self.walk_corpus(dev)) # Write temp directory with one doc per file, so we can shuffle and stream self.tmp_dir = Path(tempfile.mkdtemp()) self.write_msgpack(self.tmp_dir / "train", train, limit=self.limit) self.write_msgpack(self.tmp_dir / "dev", dev, limit=self.limit) def __del__(self): - shutil.rmtree(path2str(self.tmp_dir)) + shutil.rmtree(self.tmp_dir) @staticmethod - def write_msgpack(directory, doc_tuples, limit=0): + def write_msgpack(directory, examples, limit=0): if not directory.exists(): directory.mkdir() n = 0 - for i, doc_tuple in enumerate(doc_tuples): - srsly.write_msgpack(directory / "{}.msg".format(i), [doc_tuple]) - n += len(doc_tuple[1]) + for i, example in enumerate(examples): + ex_dict = example.to_dict() + text = example.text + srsly.write_msgpack(directory / f"{i}.msg", (text, ex_dict)) + n += 1 if limit and n >= limit: break @@ -209,130 +204,164 @@ class GoldCorpus(object): return locs @staticmethod - def read_tuples(locs, limit=0): + def read_examples(locs, limit=0): + """ Yield training examples """ i = 0 for loc in locs: loc = util.ensure_path(loc) - if loc.parts[-1].endswith("json"): - gold_tuples = read_json_file(loc) - elif loc.parts[-1].endswith("jsonl"): + file_name = loc.parts[-1] + if file_name.endswith("json"): + examples = read_json_file(loc) + elif file_name.endswith("jsonl"): gold_tuples = srsly.read_jsonl(loc) first_gold_tuple = next(gold_tuples) gold_tuples = itertools.chain([first_gold_tuple], gold_tuples) # TODO: proper format checks with schemas if isinstance(first_gold_tuple, dict): - gold_tuples = read_json_object(gold_tuples) - elif loc.parts[-1].endswith("msg"): - gold_tuples = srsly.read_msgpack(loc) + if first_gold_tuple.get("paragraphs", None): + examples = read_json_object(gold_tuples) + elif first_gold_tuple.get("doc_annotation", None): + examples = [] + for ex_dict in gold_tuples: + doc = ex_dict.get("doc", None) + if doc is None: + doc = ex_dict.get("text", None) + if not (doc is None or isinstance(doc, Doc) or isinstance(doc, str)): + raise ValueError(Errors.E987.format(type=type(doc))) + examples.append(Example.from_dict(ex_dict, doc=doc)) + + elif file_name.endswith("msg"): + text, ex_dict = srsly.read_msgpack(loc) + examples = [Example.from_dict(ex_dict, doc=text)] else: supported = ("json", "jsonl", "msg") - raise ValueError(Errors.E124.format(path=path2str(loc), formats=supported)) - for item in gold_tuples: - yield item - i += len(item[1]) - if limit and i >= limit: - return + raise ValueError(Errors.E124.format(path=loc, formats=supported)) + try: + for example in examples: + yield example + i += 1 + if limit and i >= limit: + return + except KeyError as e: + msg = "Missing key {}".format(e) + raise KeyError(Errors.E996.format(file=file_name, msg=msg)) + except UnboundLocalError as e: + msg = "Unexpected document structure" + raise ValueError(Errors.E996.format(file=file_name, msg=msg)) @property - def dev_tuples(self): + def dev_examples(self): locs = (self.tmp_dir / "dev").iterdir() - yield from self.read_tuples(locs, limit=self.limit) + yield from self.read_examples(locs, limit=self.limit) @property - def train_tuples(self): + def train_examples(self): locs = (self.tmp_dir / "train").iterdir() - yield from self.read_tuples(locs, limit=self.limit) + yield from 
self.read_examples(locs, limit=self.limit) def count_train(self): + """Returns count of words in train examples""" n = 0 i = 0 - for raw_text, paragraph_tuples in self.train_tuples: - for sent_tuples, brackets in paragraph_tuples: - n += len(sent_tuples[1]) - if self.limit and i >= self.limit: - break - i += 1 + for example in self.train_examples: + n += len(example.token_annotation.words) + if self.limit and i >= self.limit: + break + i += 1 return n - def train_docs(self, nlp, gold_preproc=False, max_length=None, + def train_dataset(self, nlp, gold_preproc=False, max_length=None, noise_level=0.0, orth_variant_level=0.0, ignore_misaligned=False): locs = list((self.tmp_dir / 'train').iterdir()) random.shuffle(locs) - train_tuples = self.read_tuples(locs, limit=self.limit) - gold_docs = self.iter_gold_docs(nlp, train_tuples, gold_preproc, + train_examples = self.read_examples(locs, limit=self.limit) + gold_examples = self.iter_gold_docs(nlp, train_examples, gold_preproc, max_length=max_length, noise_level=noise_level, orth_variant_level=orth_variant_level, make_projective=True, ignore_misaligned=ignore_misaligned) - yield from gold_docs + yield from gold_examples - def train_docs_without_preprocessing(self, nlp, gold_preproc=False): - gold_docs = self.iter_gold_docs(nlp, self.train_tuples, gold_preproc=gold_preproc) - yield from gold_docs + def train_dataset_without_preprocessing(self, nlp, gold_preproc=False, + ignore_misaligned=False): + examples = self.iter_gold_docs(nlp, self.train_examples, + gold_preproc=gold_preproc, + ignore_misaligned=ignore_misaligned) + yield from examples - def dev_docs(self, nlp, gold_preproc=False, ignore_misaligned=False): - gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc, - ignore_misaligned=ignore_misaligned) - yield from gold_docs + def dev_dataset(self, nlp, gold_preproc=False, ignore_misaligned=False): + examples = self.iter_gold_docs(nlp, self.dev_examples, + gold_preproc=gold_preproc, + ignore_misaligned=ignore_misaligned) + yield from examples @classmethod - def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None, - noise_level=0.0, orth_variant_level=0.0, make_projective=False, - ignore_misaligned=False): - for raw_text, paragraph_tuples in tuples: + def iter_gold_docs(cls, nlp, examples, gold_preproc, max_length=None, + noise_level=0.0, orth_variant_level=0.0, + make_projective=False, ignore_misaligned=False): + """ Setting gold_preproc will result in creating a doc per sentence """ + for example in examples: if gold_preproc: - raw_text = None + split_examples = example.split_sents() + example_golds = [] + for split_example in split_examples: + split_example_docs = cls._make_docs(nlp, split_example, + gold_preproc, noise_level=noise_level, + orth_variant_level=orth_variant_level) + split_example_golds = cls._make_golds(split_example_docs, + vocab=nlp.vocab, make_projective=make_projective, + ignore_misaligned=ignore_misaligned) + example_golds.extend(split_example_golds) else: - paragraph_tuples = merge_sents(paragraph_tuples) - docs, paragraph_tuples = cls._make_docs(nlp, raw_text, - paragraph_tuples, gold_preproc, noise_level=noise_level, - orth_variant_level=orth_variant_level) - golds = cls._make_golds(docs, paragraph_tuples, make_projective, - ignore_misaligned=ignore_misaligned) - for doc, gold in zip(docs, golds): - if gold is not None: - if (not max_length) or len(doc) < max_length: - yield doc, gold + example_docs = cls._make_docs(nlp, example, + gold_preproc, noise_level=noise_level, + 
orth_variant_level=orth_variant_level) + example_golds = cls._make_golds(example_docs, vocab=nlp.vocab, + make_projective=make_projective, + ignore_misaligned=ignore_misaligned) + for ex in example_golds: + if ex.goldparse is not None: + if (not max_length) or len(ex.doc) < max_length: + yield ex @classmethod - def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0, orth_variant_level=0.0): - if raw_text is not None: - raw_text, paragraph_tuples = make_orth_variants(nlp, raw_text, paragraph_tuples, orth_variant_level=orth_variant_level) - raw_text = add_noise(raw_text, noise_level) - return [nlp.make_doc(raw_text)], paragraph_tuples + def _make_docs(cls, nlp, example, gold_preproc, noise_level=0.0, orth_variant_level=0.0): + var_example = make_orth_variants(nlp, example, orth_variant_level=orth_variant_level) + # gold_preproc is not used ?! + if example.text is not None: + var_text = add_noise(var_example.text, noise_level) + var_doc = nlp.make_doc(var_text) + var_example.doc = var_doc else: - docs = [] - raw_text, paragraph_tuples = make_orth_variants(nlp, None, paragraph_tuples, orth_variant_level=orth_variant_level) - return [Doc(nlp.vocab, words=add_noise(sent_tuples[1], noise_level)) - for (sent_tuples, brackets) in paragraph_tuples], paragraph_tuples - + var_doc = Doc(nlp.vocab, words=add_noise(var_example.token_annotation.words, noise_level)) + var_example.doc = var_doc + return [var_example] @classmethod - def _make_golds(cls, docs, paragraph_tuples, make_projective, ignore_misaligned=False): - if len(docs) != len(paragraph_tuples): - n_annots = len(paragraph_tuples) - raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots)) - golds = [] - for doc, (sent_tuples, (cats, brackets)) in zip(docs, paragraph_tuples): - try: - gold = GoldParse.from_annot_tuples(doc, sent_tuples, cats=cats, - make_projective=make_projective) - except AlignmentError: - if ignore_misaligned: - gold = None - else: - raise - golds.append(gold) - return golds + def _make_golds(cls, examples, vocab=None, make_projective=False, + ignore_misaligned=False): + filtered_examples = [] + for example in examples: + gold_parses = example.get_gold_parses(vocab=vocab, + make_projective=make_projective, + ignore_misaligned=ignore_misaligned) + assert len(gold_parses) == 1 + doc, gold = gold_parses[0] + if doc: + assert doc == example.doc + example.goldparse = gold + filtered_examples.append(example) + return filtered_examples -def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): +def make_orth_variants(nlp, example, orth_variant_level=0.0): if random.random() >= orth_variant_level: - return raw, paragraph_tuples - raw_orig = str(raw) - lower = False + return example + if not example.token_annotation: + return example + raw = example.text if random.random() >= 0.5: lower = True if raw is not None: @@ -340,9 +369,15 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): ndsv = nlp.Defaults.single_orth_variants ndpv = nlp.Defaults.paired_orth_variants # modify words in paragraph_tuples - variant_paragraph_tuples = [] - for sent_tuples, brackets in paragraph_tuples: - ids, words, tags, heads, labels, ner = sent_tuples + variant_example = Example(doc=raw) + token_annotation = example.token_annotation + words = token_annotation.words + tags = token_annotation.tags + if not words or not tags: + # add the unmodified annotation + token_dict = token_annotation.to_dict() + variant_example.set_token_annotation(**token_dict) + else: if lower: 
words = [w.lower() for w in words] # single variants @@ -371,7 +406,10 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): pair_idx = pair.index(words[word_idx]) words[word_idx] = punct_choices[punct_idx][pair_idx] - variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner), brackets)) + token_dict = token_annotation.to_dict() + token_dict["words"] = words + token_dict["tags"] = tags + variant_example.set_token_annotation(**token_dict) # modify raw to match variant_paragraph_tuples if raw is not None: variants = [] @@ -389,36 +427,32 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): while raw_idx < len(raw) and re.match("\s", raw[raw_idx]): variant_raw += raw[raw_idx] raw_idx += 1 - for sent_tuples, brackets in variant_paragraph_tuples: - ids, words, tags, heads, labels, ner = sent_tuples - for word in words: - match_found = False - # skip whitespace words - if word.isspace(): - match_found = True - # add identical word - elif word not in variants and raw[raw_idx:].startswith(word): - variant_raw += word - raw_idx += len(word) - match_found = True - # add variant word - else: - for variant in variants: - if not match_found and \ - raw[raw_idx:].startswith(variant): - raw_idx += len(variant) - variant_raw += word - match_found = True - # something went wrong, abort - # (add a warning message?) - if not match_found: - return raw_orig, paragraph_tuples - # add following whitespace - while raw_idx < len(raw) and re.match("\s", raw[raw_idx]): - variant_raw += raw[raw_idx] - raw_idx += 1 - return variant_raw, variant_paragraph_tuples - return raw, variant_paragraph_tuples + for word in variant_example.token_annotation.words: + match_found = False + # add identical word + if word not in variants and raw[raw_idx:].startswith(word): + variant_raw += word + raw_idx += len(word) + match_found = True + # add variant word + else: + for variant in variants: + if not match_found and \ + raw[raw_idx:].startswith(variant): + raw_idx += len(variant) + variant_raw += word + match_found = True + # something went wrong, abort + # (add a warning message?) + if not match_found: + return example + # add following whitespace + while raw_idx < len(raw) and re.match("\s", raw[raw_idx]): + variant_raw += raw[raw_idx] + raw_idx += 1 + variant_example.doc = variant_raw + return variant_example + return variant_example def add_noise(orig, noise_level): @@ -443,52 +477,70 @@ def _corrupt(c, noise_level): def read_json_object(json_corpus_section): """Take a list of JSON-formatted documents (e.g. from an already loaded - training data file) and yield tuples in the GoldParse format. + training data file) and yield annotations in the GoldParse format. json_corpus_section (list): The data. - YIELDS (tuple): The reformatted data. + YIELDS (Example): The reformatted data - one training example per paragraph """ for json_doc in json_corpus_section: - tuple_doc = json_to_tuple(json_doc) - for tuple_paragraph in tuple_doc: - yield tuple_paragraph + examples = json_to_examples(json_doc) + for ex in examples: + yield ex -def json_to_tuple(doc): - """Convert an item in the JSON-formatted training data to the tuple format +def json_to_examples(doc): + """Convert an item in the JSON-formatted training data to the format used by GoldParse. doc (dict): One entry in the training data. - YIELDS (tuple): The reformatted data. 
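`json_to_examples()` below reads spaCy's JSON training format one paragraph at a time. For reference, here is a minimal document in that format, restricted to the keys the converter actually looks up; the annotation values are invented for illustration. Note that `head` is stored as an offset relative to the token, which the converter turns into an absolute index.

```python
# One document in spaCy's JSON training format, using only the keys that
# json_to_examples() reads. All annotation values are made up.
doc_json = {
    "id": 0,
    "paragraphs": [
        {
            "raw": "I like London.",
            "cats": [{"label": "TRAVEL", "value": 1.0}],
            "sentences": [
                {
                    "tokens": [
                        {"id": 0, "orth": "I", "tag": "PRP", "head": 1, "dep": "nsubj", "ner": "O"},
                        {"id": 1, "orth": "like", "tag": "VBP", "head": 0, "dep": "ROOT", "ner": "O"},
                        {"id": 2, "orth": "London", "tag": "NNP", "head": -1, "dep": "dobj", "ner": "U-GPE"},
                        {"id": 3, "orth": ".", "tag": ".", "head": -2, "dep": "punct", "ner": "O"},
                    ],
                    "brackets": [],
                }
            ],
        }
    ],
}
```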
+ YIELDS (Example): The reformatted data - one training example per paragraph """ paragraphs = [] for paragraph in doc["paragraphs"]: - sents = [] - cats = {} - for cat in paragraph.get("cats", {}): - cats[cat["label"]] = cat["value"] + example = Example(doc=paragraph.get("raw", None)) + words = [] + ids = [] + tags = [] + pos = [] + morphs = [] + lemmas = [] + heads = [] + labels = [] + ner = [] + sent_starts = [] + brackets = [] for sent in paragraph["sentences"]: - words = [] - ids = [] - tags = [] - heads = [] - labels = [] - ner = [] + sent_start_i = len(words) for i, token in enumerate(sent["tokens"]): words.append(token["orth"]) - ids.append(i) + ids.append(token.get('id', sent_start_i + i)) tags.append(token.get('tag', "-")) - heads.append(token.get("head", 0) + i) + pos.append(token.get("pos", "")) + morphs.append(token.get("morph", "")) + lemmas.append(token.get("lemma", "")) + heads.append(token.get("head", 0) + sent_start_i + i) labels.append(token.get("dep", "")) # Ensure ROOT label is case-insensitive if labels[-1].lower() == "root": labels[-1] = "ROOT" ner.append(token.get("ner", "-")) - sents.append([ - [ids, words, tags, heads, labels, ner], - [cats, sent.get("brackets", [])]]) - if sents: - yield [paragraph.get("raw", None), sents] + if i == 0: + sent_starts.append(1) + else: + sent_starts.append(0) + if "brackets" in sent: + brackets.extend((b["first"] + sent_start_i, + b["last"] + sent_start_i, b["label"]) + for b in sent["brackets"]) + cats = {} + for cat in paragraph.get("cats", {}): + cats[cat["label"]] = cat["value"] + example.set_token_annotation(ids=ids, words=words, tags=tags, + pos=pos, morphs=morphs, lemmas=lemmas, heads=heads, + deps=labels, entities=ner, sent_starts=sent_starts, + brackets=brackets) + example.set_doc_annotation(cats=cats) + yield example def read_json_file(loc, docs_filter=None, limit=None): @@ -500,8 +552,8 @@ def read_json_file(loc, docs_filter=None, limit=None): for doc in _json_iterate(loc): if docs_filter is not None and not docs_filter(doc): continue - for json_tuple in json_to_tuple(doc): - yield json_tuple + for json_data in json_to_examples(doc): + yield json_data def _json_iterate(loc): @@ -571,6 +623,14 @@ def iob_to_biluo(tags): return out +def biluo_to_iob(tags): + out = [] + for tag in tags: + tag = tag.replace("U-", "B-", 1).replace("L-", "I-", 1) + out.append(tag) + return out + + def _consume_os(tags): while tags and tags[0] == "O": yield tags.pop(0) @@ -594,30 +654,364 @@ def _consume_ent(tags): else: start = "B-" + label end = "L-" + label - middle = ["I-%s" % label for _ in range(1, length - 1)] + middle = [f"I-{label}" for _ in range(1, length - 1)] return [start] + middle + [end] +cdef class TokenAnnotation: + def __init__(self, ids=None, words=None, tags=None, pos=None, morphs=None, + lemmas=None, heads=None, deps=None, entities=None, sent_starts=None, + brackets=None): + self.ids = ids if ids else [] + self.words = words if words else [] + self.tags = tags if tags else [] + self.pos = pos if pos else [] + self.morphs = morphs if morphs else [] + self.lemmas = lemmas if lemmas else [] + self.heads = heads if heads else [] + self.deps = deps if deps else [] + self.entities = entities if entities else [] + self.sent_starts = sent_starts if sent_starts else [] + self.brackets_by_start = {} + if brackets: + for b_start, b_end, b_label in brackets: + self.brackets_by_start.setdefault(b_start, []).append((b_end, b_label)) + + @property + def brackets(self): + brackets = [] + for start, ends_labels in 
self.brackets_by_start.items(): + for end, label in ends_labels: + brackets.append((start, end, label)) + return brackets + + @classmethod + def from_dict(cls, token_dict): + return cls(ids=token_dict.get("ids", None), + words=token_dict.get("words", None), + tags=token_dict.get("tags", None), + pos=token_dict.get("pos", None), + morphs=token_dict.get("morphs", None), + lemmas=token_dict.get("lemmas", None), + heads=token_dict.get("heads", None), + deps=token_dict.get("deps", None), + entities=token_dict.get("entities", None), + sent_starts=token_dict.get("sent_starts", None), + brackets=token_dict.get("brackets", None)) + + def to_dict(self): + return {"ids": self.ids, + "words": self.words, + "tags": self.tags, + "pos": self.pos, + "morphs": self.morphs, + "lemmas": self.lemmas, + "heads": self.heads, + "deps": self.deps, + "entities": self.entities, + "sent_starts": self.sent_starts, + "brackets": self.brackets} + + def get_id(self, i): + return self.ids[i] if i < len(self.ids) else i + + def get_word(self, i): + return self.words[i] if i < len(self.words) else "" + + def get_tag(self, i): + return self.tags[i] if i < len(self.tags) else "-" + + def get_pos(self, i): + return self.pos[i] if i < len(self.pos) else "" + + def get_morph(self, i): + return self.morphs[i] if i < len(self.morphs) else "" + + def get_lemma(self, i): + return self.lemmas[i] if i < len(self.lemmas) else "" + + def get_head(self, i): + return self.heads[i] if i < len(self.heads) else i + + def get_dep(self, i): + return self.deps[i] if i < len(self.deps) else "" + + def get_entity(self, i): + return self.entities[i] if i < len(self.entities) else "-" + + def get_sent_start(self, i): + return self.sent_starts[i] if i < len(self.sent_starts) else None + + def __str__(self): + return str(self.to_dict()) + + def __repr__(self): + return self.__str__() + + +cdef class DocAnnotation: + def __init__(self, cats=None, links=None): + self.cats = cats if cats else {} + self.links = links if links else {} + + @classmethod + def from_dict(cls, doc_dict): + return cls(cats=doc_dict.get("cats", None), links=doc_dict.get("links", None)) + + def to_dict(self): + return {"cats": self.cats, "links": self.links} + + def __str__(self): + return str(self.to_dict()) + + def __repr__(self): + return self.__str__() + + +cdef class Example: + def __init__(self, doc_annotation=None, token_annotation=None, doc=None, + goldparse=None): + """ Doc can either be text, or an actual Doc """ + self.doc = doc + self.doc_annotation = doc_annotation if doc_annotation else DocAnnotation() + self.token_annotation = token_annotation if token_annotation else TokenAnnotation() + self.goldparse = goldparse + + @classmethod + def from_gold(cls, goldparse, doc=None): + doc_annotation = DocAnnotation(cats=goldparse.cats, links=goldparse.links) + token_annotation = goldparse.get_token_annotation() + return cls(doc_annotation, token_annotation, doc) + + @classmethod + def from_dict(cls, example_dict, doc=None): + token_dict = example_dict.get("token_annotation", {}) + token_annotation = TokenAnnotation.from_dict(token_dict) + doc_dict = example_dict.get("doc_annotation", {}) + doc_annotation = DocAnnotation.from_dict(doc_dict) + return cls(doc_annotation, token_annotation, doc) + + def to_dict(self): + """ Note that this method does NOT export the doc, only the annotations ! 
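`Example.from_dict()` / `to_dict()` above round-trip the annotations but, as the docstring notes, not the `Doc` itself. A hedged sketch of the dict layout, with invented annotation values, assuming the development branch shown in this diff is installed:

```python
from spacy.gold import Example   # development-branch API

ex_dict = {
    "token_annotation": {
        "ids": [0, 1, 2],
        "words": ["Flies", "like", "honey"],
        "tags": ["NNS", "VBP", "NN"],
        "heads": [1, 1, 1],
        "deps": ["nsubj", "ROOT", "dobj"],
        "entities": ["O", "O", "O"],
        "sent_starts": [1, 0, 0],
    },
    "doc_annotation": {"cats": {"POSITIVE": 1.0}, "links": {}},
}

example = Example.from_dict(ex_dict, doc="Flies like honey")
print(example.text)                 # "Flies like honey"
print(example.to_dict().keys())     # token_annotation, doc_annotation only
```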
""" + token_dict = self.token_annotation.to_dict() + doc_dict = self.doc_annotation.to_dict() + return {"token_annotation": token_dict, "doc_annotation": doc_dict} + + @property + def text(self): + if self.doc is None: + return None + if isinstance(self.doc, Doc): + return self.doc.text + return self.doc + + @property + def gold(self): + if self.goldparse is None: + doc, gold = self.get_gold_parses()[0] + self.goldparse = gold + return self.goldparse + + def set_token_annotation(self, ids=None, words=None, tags=None, pos=None, + morphs=None, lemmas=None, heads=None, deps=None, + entities=None, sent_starts=None, brackets=None): + self.token_annotation = TokenAnnotation(ids=ids, words=words, tags=tags, + pos=pos, morphs=morphs, lemmas=lemmas, heads=heads, + deps=deps, entities=entities, + sent_starts=sent_starts, brackets=brackets) + + def set_doc_annotation(self, cats=None, links=None): + if cats: + self.doc_annotation.cats = cats + if links: + self.doc_annotation.links = links + + def split_sents(self): + """ Split the token annotations into multiple Examples based on + sent_starts and return a list of the new Examples""" + if not self.token_annotation.words: + return [self] + s_example = Example(doc=None, doc_annotation=self.doc_annotation) + s_ids, s_words, s_tags, s_pos, s_morphs = [], [], [], [], [] + s_lemmas, s_heads, s_deps, s_ents, s_sent_starts = [], [], [], [], [] + s_brackets = [] + sent_start_i = 0 + cdef TokenAnnotation t = self.token_annotation + split_examples = [] + cdef int b_start, b_end + cdef unicode b_label + for i in range(len(t.words)): + if i > 0 and t.sent_starts[i] == 1: + s_example.set_token_annotation(ids=s_ids, + words=s_words, tags=s_tags, pos=s_pos, morphs=s_morphs, + lemmas=s_lemmas, heads=s_heads, deps=s_deps, + entities=s_ents, sent_starts=s_sent_starts, + brackets=s_brackets) + split_examples.append(s_example) + s_example = Example(doc=None, doc_annotation=self.doc_annotation) + s_ids, s_words, s_tags, s_pos, s_heads = [], [], [], [], [] + s_deps, s_ents, s_morphs, s_lemmas = [], [], [], [] + s_sent_starts, s_brackets = [], [] + sent_start_i = i + s_ids.append(t.get_id(i)) + s_words.append(t.get_word(i)) + s_tags.append(t.get_tag(i)) + s_pos.append(t.get_pos(i)) + s_morphs.append(t.get_morph(i)) + s_lemmas.append(t.get_lemma(i)) + s_heads.append(t.get_head(i) - sent_start_i) + s_deps.append(t.get_dep(i)) + s_ents.append(t.get_entity(i)) + s_sent_starts.append(t.get_sent_start(i)) + for b_end, b_label in t.brackets_by_start.get(i, []): + s_brackets.append( + (i - sent_start_i, b_end - sent_start_i, b_label) + ) + i += 1 + s_example.set_token_annotation(ids=s_ids, words=s_words, tags=s_tags, + pos=s_pos, morphs=s_morphs, lemmas=s_lemmas, heads=s_heads, + deps=s_deps, entities=s_ents, sent_starts=s_sent_starts, + brackets=s_brackets) + split_examples.append(s_example) + return split_examples + + + def get_gold_parses(self, merge=True, vocab=None, make_projective=False, + ignore_misaligned=False): + """Return a list of (doc, GoldParse) objects. 
+ If merge is set to True, keep all Token annotations as one big list.""" + d = self.doc_annotation + # merge == do not modify Example + if merge: + t = self.token_annotation + doc = self.doc + if doc is None or not isinstance(doc, Doc): + if not vocab: + raise ValueError(Errors.E998) + doc = Doc(vocab, words=t.words) + try: + gp = GoldParse.from_annotation(doc, d, t, + make_projective=make_projective) + except AlignmentError: + if ignore_misaligned: + gp = None + else: + raise + return [(doc, gp)] + # not merging: one GoldParse per sentence, defining docs with the words + # from each sentence + else: + parses = [] + split_examples = self.split_sents() + for split_example in split_examples: + if not vocab: + raise ValueError(Errors.E998) + split_doc = Doc(vocab, words=split_example.token_annotation.words) + try: + gp = GoldParse.from_annotation(split_doc, d, + split_example.token_annotation, + make_projective=make_projective) + except AlignmentError: + if ignore_misaligned: + gp = None + else: + raise + if gp is not None: + parses.append((split_doc, gp)) + return parses + + @classmethod + def to_example_objects(cls, examples, make_doc=None, keep_raw_text=False): + """ + Return a list of Example objects, from a variety of input formats. + make_doc needs to be provided when the examples contain text strings and keep_raw_text=False + """ + if isinstance(examples, Example): + return [examples] + if isinstance(examples, tuple): + examples = [examples] + converted_examples = [] + for ex in examples: + if isinstance(ex, Example): + converted_examples.append(ex) + # convert string to Doc to Example + elif isinstance(ex, str): + if keep_raw_text: + converted_examples.append(Example(doc=ex)) + else: + doc = make_doc(ex) + converted_examples.append(Example(doc=doc)) + # convert Doc to Example + elif isinstance(ex, Doc): + converted_examples.append(Example(doc=ex)) + # convert tuples to Example + elif isinstance(ex, tuple) and len(ex) == 2: + doc, gold = ex + gold_dict = {} + # convert string to Doc + if isinstance(doc, str) and not keep_raw_text: + doc = make_doc(doc) + # convert dict to GoldParse + if isinstance(gold, dict): + gold_dict = gold + if doc is not None or gold.get("words", None) is not None: + gold = GoldParse(doc, **gold) + else: + gold = None + if gold is not None: + converted_examples.append(Example.from_gold(goldparse=gold, doc=doc)) + else: + raise ValueError(Errors.E999.format(gold_dict=gold_dict)) + else: + converted_examples.append(ex) + return converted_examples + + cdef class GoldParse: """Collection for training annotations. 
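`Example.to_example_objects()` above normalizes a mix of input shapes into `Example` objects. A hedged sketch of the shapes it handles; the `(text, annotations)` tuple is the familiar v2-style training pair, and all annotation values are invented:

```python
import spacy
from spacy.gold import Example   # development-branch API

nlp = spacy.blank("en")

mixed_inputs = [
    "Just a raw text",                                      # str
    nlp.make_doc("A pre-made Doc"),                         # Doc
    ("Apple is a company", {"entities": [(0, 5, "ORG")]}),  # (text, gold dict)
]
examples = Example.to_example_objects(mixed_inputs, make_doc=nlp.make_doc)
print(len(examples), [type(ex).__name__ for ex in examples])
```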
DOCS: https://spacy.io/api/goldparse """ @classmethod - def from_annot_tuples(cls, doc, annot_tuples, cats=None, make_projective=False): - _, words, tags, heads, deps, entities = annot_tuples - return cls(doc, words=words, tags=tags, heads=heads, deps=deps, - entities=entities, cats=cats, + def from_annotation(cls, doc, doc_annotation, token_annotation, make_projective=False): + return cls(doc, words=token_annotation.words, + tags=token_annotation.tags, + pos=token_annotation.pos, + morphs=token_annotation.morphs, + lemmas=token_annotation.lemmas, + heads=token_annotation.heads, + deps=token_annotation.deps, + entities=token_annotation.entities, + sent_starts=token_annotation.sent_starts, + cats=doc_annotation.cats, + links=doc_annotation.links, make_projective=make_projective) - def __init__(self, doc, annot_tuples=None, words=None, tags=None, morphology=None, - heads=None, deps=None, entities=None, make_projective=False, - cats=None, links=None, **_): + def get_token_annotation(self): + ids = None + if self.words: + ids = list(range(len(self.words))) + + return TokenAnnotation(ids=ids, words=self.words, tags=self.tags, + pos=self.pos, morphs=self.morphs, + lemmas=self.lemmas, heads=self.heads, + deps=self.labels, entities=self.ner, + sent_starts=self.sent_starts) + + def __init__(self, doc, words=None, tags=None, pos=None, morphs=None, + lemmas=None, heads=None, deps=None, entities=None, + sent_starts=None, make_projective=False, cats=None, + links=None): """Create a GoldParse. The fields will not be initialized if len(doc) is zero. doc (Doc): The document the annotations refer to. words (iterable): A sequence of unicode word strings. tags (iterable): A sequence of strings, representing tag annotations. + pos (iterable): A sequence of strings, representing UPOS annotations. + morphs (iterable): A sequence of strings, representing morph + annotations. + lemmas (iterable): A sequence of strings, representing lemma + annotations. heads (iterable): A sequence of integers, representing syntactic head offsets. deps (iterable): A sequence of strings, representing the syntactic @@ -625,6 +1019,8 @@ cdef class GoldParse: entities (iterable): A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. + sent_starts (iterable): A sequence of sentence position tags, 1 for + the first word in a sentence, 0 for all others. cats (dict): Labels for text classification. Each key in the dictionary may be a string or an int, or a `(start_char, end_char, label)` tuple, indicating that the label is applied to only part of the @@ -647,36 +1043,32 @@ cdef class GoldParse: self.length = len(doc) self.cats = {} if cats is None else dict(cats) - self.links = links - - # orig_annot is used as an iterator in `nlp.evalate` even if self.length == 0, - # so set a empty list to avoid error. - # if self.lenght > 0, this is modified latter. 
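The `GoldParse` constructor above now accepts `pos`, `morphs`, `lemmas` and `sent_starts` alongside the older fields. A hedged sketch with invented annotations, assuming this development branch; the words match the `Doc` exactly so the alignment is trivial:

```python
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.gold import GoldParse   # development-branch signature

words = ["Berlin", "is", "nice", "."]
doc = Doc(Vocab(), words=words)
gold = GoldParse(
    doc,
    words=words,
    tags=["NNP", "VBZ", "JJ", "."],
    pos=["PROPN", "AUX", "ADJ", "PUNCT"],
    lemmas=["Berlin", "be", "nice", "."],
    heads=[1, 1, 1, 1],                    # absolute indices; root points to itself
    deps=["nsubj", "ROOT", "acomp", "punct"],
    entities=["U-GPE", "O", "O", "O"],     # BILUO tags
    sent_starts=[1, 0, 0, 0],
)
print(gold.words, gold.pos, gold.sent_starts)
```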
- self.orig_annot = [] + self.links = {} if links is None else dict(links) # temporary doc for aligning entity annotation entdoc = None # avoid allocating memory if the doc does not contain any tokens if self.length == 0: - self.words = [] - self.tags = [] - self.heads = [] - self.labels = [] - self.ner = [] - self.morphology = [] - + # set a minimal orig so that the scorer can score an empty doc + self.orig = TokenAnnotation(ids=[]) else: - if words is None: + if not words: words = [token.text for token in doc] - if tags is None: + if not tags: tags = [None for _ in words] - if heads is None: + if not pos: + pos = [None for _ in words] + if not morphs: + morphs = [None for _ in words] + if not lemmas: + lemmas = [None for _ in words] + if not heads: heads = [None for _ in words] - if deps is None: + if not deps: deps = [None for _ in words] - if morphology is None: - morphology = [None for _ in words] + if not sent_starts: + sent_starts = [None for _ in words] if entities is None: entities = ["-" for _ in words] elif len(entities) == 0: @@ -685,7 +1077,7 @@ cdef class GoldParse: # Translate the None values to '-', to make processing easier. # See Issue #2603 entities = [(ent if ent is not None else "-") for ent in entities] - if not isinstance(entities[0], basestring): + if not isinstance(entities[0], str): # Assume we have entities specified by character offset. # Create a temporary Doc corresponding to provided words # (to preserve gold tokenization) and text (to preserve @@ -717,13 +1109,16 @@ cdef class GoldParse: self.words = [None] * len(doc) self.tags = [None] * len(doc) + self.pos = [None] * len(doc) + self.morphs = [None] * len(doc) + self.lemmas = [None] * len(doc) self.heads = [None] * len(doc) self.labels = [None] * len(doc) self.ner = [None] * len(doc) - self.morphology = [None] * len(doc) + self.sent_starts = [None] * len(doc) # This needs to be done before we align the words - if make_projective and heads is not None and deps is not None: + if make_projective and any(heads) and any(deps) : heads, deps = nonproj.projectivize(heads, deps) # Do many-to-one alignment for misaligned tokens. 
@@ -739,22 +1134,30 @@ cdef class GoldParse: self.cand_to_gold = [(j if j >= 0 else None) for j in i2j] self.gold_to_cand = [(i if i >= 0 else None) for i in j2i] - annot_tuples = (range(len(words)), words, tags, heads, deps, entities) - self.orig_annot = list(zip(*annot_tuples)) + self.orig = TokenAnnotation(ids=list(range(len(words))), + words=words, tags=tags, pos=pos, morphs=morphs, + lemmas=lemmas, heads=heads, deps=deps, entities=entities, + sent_starts=sent_starts, brackets=[]) for i, gold_i in enumerate(self.cand_to_gold): if doc[i].text.isspace(): self.words[i] = doc[i].text self.tags[i] = "_SP" + self.pos[i] = "SPACE" + self.morphs[i] = None + self.lemmas[i] = None self.heads[i] = None self.labels[i] = None self.ner[i] = None - self.morphology[i] = set() + self.sent_starts[i] = 0 if gold_i is None: if i in i2j_multi: self.words[i] = words[i2j_multi[i]] self.tags[i] = tags[i2j_multi[i]] - self.morphology[i] = morphology[i2j_multi[i]] + self.pos[i] = pos[i2j_multi[i]] + self.morphs[i] = morphs[i2j_multi[i]] + self.lemmas[i] = lemmas[i2j_multi[i]] + self.sent_starts[i] = sent_starts[i2j_multi[i]] is_last = i2j_multi[i] != i2j_multi.get(i+1) # Set next word in multi-token span as head, until last if not is_last: @@ -772,7 +1175,10 @@ cdef class GoldParse: else: self.words[i] = words[gold_i] self.tags[i] = tags[gold_i] - self.morphology[i] = morphology[gold_i] + self.pos[i] = pos[gold_i] + self.morphs[i] = morphs[gold_i] + self.lemmas[i] = lemmas[gold_i] + self.sent_starts[i] = sent_starts[gold_i] if heads[gold_i] is None: self.heads[i] = None else: @@ -825,7 +1231,7 @@ cdef class GoldParse: cycle = nonproj.contains_cycle(self.heads) if cycle is not None: raise ValueError(Errors.E069.format(cycle=cycle, - cycle_tokens=" ".join(["'{}'".format(self.words[tok_id]) for tok_id in cycle]), + cycle_tokens=" ".join([f"'{self.words[tok_id]}'" for tok_id in cycle]), doc_tokens=" ".join(words[:50]))) def __len__(self): @@ -842,21 +1248,6 @@ cdef class GoldParse: """ return not nonproj.is_nonproj_tree(self.heads) - property sent_starts: - def __get__(self): - return [self.c.sent_start[i] for i in range(self.length)] - - def __set__(self, sent_starts): - for gold_i, is_sent_start in enumerate(sent_starts): - i = self.gold_to_cand[gold_i] - if i is not None: - if is_sent_start in (1, True): - self.c.sent_start[i] = 1 - elif is_sent_start in (-1, False): - self.c.sent_start[i] = -1 - else: - self.c.sent_start[i] = 0 - def docs_to_json(docs, id=0, ner_missing_tag="O"): """Convert a list of Doc objects into the JSON-serializable format used by @@ -883,6 +1274,9 @@ def docs_to_json(docs, id=0, ner_missing_tag="O"): json_token = {"id": token.i, "orth": token.text} if doc.is_tagged: json_token["tag"] = token.tag_ + json_token["pos"] = token.pos_ + json_token["morph"] = token.morph_ + json_token["lemma"] = token.lemma_ if doc.is_parsed: json_token["head"] = token.head.i-token.i json_token["dep"] = token.dep_ @@ -940,12 +1334,12 @@ def biluo_tags_from_offsets(doc, entities, missing="O"): # Only interested if the tokenization is correct if start_token is not None and end_token is not None: if start_token == end_token: - biluo[start_token] = "U-%s" % label + biluo[start_token] = f"U-{label}" else: - biluo[start_token] = "B-%s" % label + biluo[start_token] = f"B-{label}" for i in range(start_token+1, end_token): - biluo[i] = "I-%s" % label - biluo[end_token] = "L-%s" % label + biluo[i] = f"I-{label}" + biluo[end_token] = f"L-{label}" # Now distinguish the O cases from ones where we miss the tokenization 
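The BILUO tag construction above and the new `biluo_to_iob()` helper added earlier in `gold.pyx` are both easy to check by hand. A standalone sketch of the two schemes (no spaCy import needed; the entity labels are arbitrary):

```python
def biluo_to_iob(tags):
    # Mirrors the helper added in gold.pyx: U- becomes B-, L- becomes I-.
    return [tag.replace("U-", "B-", 1).replace("L-", "I-", 1) for tag in tags]

# A two-token entity followed by a token outside any entity:
biluo = ["B-GPE", "L-GPE", "O"]
print(biluo_to_iob(biluo))          # ['B-GPE', 'I-GPE', 'O']

# A single-token entity uses U- in BILUO and plain B- in IOB:
print(biluo_to_iob(["U-PERSON"]))   # ['B-PERSON']
```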
entity_chars = set() for start_char, end_char, label in entities: diff --git a/spacy/kb.pxd b/spacy/kb.pxd index d5aa382b1..53038b5db 100644 --- a/spacy/kb.pxd +++ b/spacy/kb.pxd @@ -1,15 +1,15 @@ """Knowledge-base for entity or concept linking.""" from cymem.cymem cimport Pool from preshed.maps cimport PreshMap - from libcpp.vector cimport vector from libc.stdint cimport int32_t, int64_t from libc.stdio cimport FILE -from spacy.vocab cimport Vocab +from .vocab cimport Vocab from .typedefs cimport hash_t - from .structs cimport KBEntryC, AliasC + + ctypedef vector[KBEntryC] entry_vec ctypedef vector[AliasC] alias_vec ctypedef vector[float] float_vec @@ -113,7 +113,7 @@ cdef class KnowledgeBase: return new_index cdef inline void _create_empty_vectors(self, hash_t dummy_hash) nogil: - """ + """ Initializing the vectors and making sure the first element of each vector is a dummy, because the PreshMap maps pointing to indices in these vectors can not contain 0 as value cf. https://github.com/explosion/preshed/issues/17 @@ -169,4 +169,3 @@ cdef class Reader: cdef int read_alias(self, int64_t* entry_index, float* prob) except -1 cdef int _read(self, void* value, size_t size) except -1 - diff --git a/spacy/kb.pyx b/spacy/kb.pyx index 36a6dbd93..86a8d49b8 100644 --- a/spacy/kb.pyx +++ b/spacy/kb.pyx @@ -1,23 +1,17 @@ -# cython: infer_types=True -# cython: profile=True -# coding: utf8 -import warnings - -from spacy.errors import Errors, Warnings - -from pathlib import Path +# cython: infer_types=True, profile=True from cymem.cymem cimport Pool from preshed.maps cimport PreshMap - from cpython.exc cimport PyErr_SetFromErrno - from libc.stdio cimport fopen, fclose, fread, fwrite, feof, fseek from libc.stdint cimport int32_t, int64_t +from libcpp.vector cimport vector + +from pathlib import Path +import warnings +from os import path from .typedefs cimport hash_t - -from os import path -from libcpp.vector cimport vector +from .errors import Errors, Warnings cdef class Candidate: @@ -449,7 +443,7 @@ cdef class KnowledgeBase: cdef class Writer: def __init__(self, object loc): if path.exists(loc): - assert not path.isdir(loc), "%s is directory." 
% loc + assert not path.isdir(loc), f"{loc} is directory" if isinstance(loc, Path): loc = bytes(loc) cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc @@ -586,5 +580,3 @@ cdef class Reader: cdef int _read(self, void* value, size_t size) except -1: status = fread(value, size, 1, self._fp) return status - - diff --git a/spacy/lang/af/__init__.py b/spacy/lang/af/__init__.py index 90ea324f0..0da123419 100644 --- a/spacy/lang/af/__init__.py +++ b/spacy/lang/af/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language from ...attrs import LANG diff --git a/spacy/lang/af/stop_words.py b/spacy/lang/af/stop_words.py index 2b3bcc019..4b5a04a5e 100644 --- a/spacy/lang/af/stop_words.py +++ b/spacy/lang/af/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-af STOP_WORDS = set( diff --git a/spacy/lang/ar/__init__.py b/spacy/lang/ar/__init__.py index c120703f6..6a1a8af3a 100644 --- a/spacy/lang/ar/__init__.py +++ b/spacy/lang/ar/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/ar/examples.py b/spacy/lang/ar/examples.py index 2a10f4fcc..a51bb9ded 100644 --- a/spacy/lang/ar/examples.py +++ b/spacy/lang/ar/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/ar/lex_attrs.py b/spacy/lang/ar/lex_attrs.py index 19e7aef8a..54ad7a8c3 100644 --- a/spacy/lang/ar/lex_attrs.py +++ b/spacy/lang/ar/lex_attrs.py @@ -1,5 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals from ...attrs import LIKE_NUM _num_words = set( diff --git a/spacy/lang/ar/punctuation.py b/spacy/lang/ar/punctuation.py index 6625c5475..f30204c02 100644 --- a/spacy/lang/ar/punctuation.py +++ b/spacy/lang/ar/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY from ..char_classes import UNITS, ALPHA_UPPER diff --git a/spacy/lang/ar/stop_words.py b/spacy/lang/ar/stop_words.py index de2fc7443..f4da54dda 100644 --- a/spacy/lang/ar/stop_words.py +++ b/spacy/lang/ar/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - STOP_WORDS = set( """ من diff --git a/spacy/lang/ar/tokenizer_exceptions.py b/spacy/lang/ar/tokenizer_exceptions.py index 030daecd5..a11f3b43a 100644 --- a/spacy/lang/ar/tokenizer_exceptions.py +++ b/spacy/lang/ar/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA diff --git a/spacy/lang/bg/__init__.py b/spacy/lang/bg/__init__.py index 9b4c647e3..437feb9ed 100644 --- a/spacy/lang/bg/__init__.py +++ b/spacy/lang/bg/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language from ...attrs import LANG diff --git a/spacy/lang/bg/examples.py b/spacy/lang/bg/examples.py index b08b8926d..a6d40da1a 100644 --- a/spacy/lang/bg/examples.py +++ b/spacy/lang/bg/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. 
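# ---------------------------------------------------------------------------
# Editor's aside (not part of the diff): the `# coding: utf8` and
# `from __future__ import unicode_literals` headers removed throughout the
# language data are redundant on Python 3, where source files default to
# UTF-8 and string literals are unicode; likewise %-formatting gives way to
# f-strings, which Python 3.6+ guarantees. A tiny sanity check:
loc = "/tmp/kb"                    # hypothetical path, for illustration only
assert f"{loc} is directory" == "%s is directory" % loc
assert isinstance("naïve", str) and len("naïve") == 5   # unicode by default
# ---------------------------------------------------------------------------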
diff --git a/spacy/lang/bg/stop_words.py b/spacy/lang/bg/stop_words.py index e7c65cbc2..aae7692a2 100644 --- a/spacy/lang/bg/stop_words.py +++ b/spacy/lang/bg/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/Alir3z4/stop-words STOP_WORDS = set( diff --git a/spacy/lang/bn/__init__.py b/spacy/lang/bn/__init__.py index e70232552..901676554 100644 --- a/spacy/lang/bn/__init__.py +++ b/spacy/lang/bn/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES from .tag_map import TAG_MAP diff --git a/spacy/lang/bn/examples.py b/spacy/lang/bn/examples.py index 2d5bdb238..c3be4c556 100644 --- a/spacy/lang/bn/examples.py +++ b/spacy/lang/bn/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/bn/morph_rules.py b/spacy/lang/bn/morph_rules.py index 21a76c7e6..44d6108e9 100644 --- a/spacy/lang/bn/morph_rules.py +++ b/spacy/lang/bn/morph_rules.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import LEMMA, PRON_LEMMA diff --git a/spacy/lang/bn/punctuation.py b/spacy/lang/bn/punctuation.py index f624b4ba4..becfe8d2a 100644 --- a/spacy/lang/bn/punctuation.py +++ b/spacy/lang/bn/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_ICONS from ..char_classes import ALPHA_LOWER, ALPHA, HYPHENS, CONCAT_QUOTES, UNITS diff --git a/spacy/lang/bn/stop_words.py b/spacy/lang/bn/stop_words.py index 6c9967df8..bf38e3254 100644 --- a/spacy/lang/bn/stop_words.py +++ b/spacy/lang/bn/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ অতএব অথচ অথবা অনুযায়ী অনেক অনেকে অনেকেই অন্তত অবধি অবশ্য অর্থাৎ অন্য অনুযায়ী অর্ধভাগে diff --git a/spacy/lang/bn/tag_map.py b/spacy/lang/bn/tag_map.py index 1efb35858..bc4c5ef6b 100644 --- a/spacy/lang/bn/tag_map.py +++ b/spacy/lang/bn/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, ADJ, CONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB from ...symbols import CCONJ, NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, SYM @@ -14,8 +11,8 @@ TAG_MAP = { '""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"}, "''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"}, ":": {POS: PUNCT}, - "৳": {POS: SYM, "Other": {"SymType": "currency"}}, - "#": {POS: SYM, "Other": {"SymType": "numbersign"}}, + "৳": {POS: SYM, "SymType": "currency"}, + "#": {POS: SYM, "SymType": "numbersign"}, "AFX": {POS: ADJ, "Hyph": "yes"}, "CC": {POS: CONJ, "ConjType": "coor"}, "CD": {POS: NUM, "NumType": "card"}, diff --git a/spacy/lang/bn/tokenizer_exceptions.py b/spacy/lang/bn/tokenizer_exceptions.py index 32acb1730..18e313a25 100644 --- a/spacy/lang/bn/tokenizer_exceptions.py +++ b/spacy/lang/bn/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding=utf-8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA diff --git a/spacy/lang/ca/__init__.py b/spacy/lang/ca/__init__.py index 6d4c00a6b..a1ff2f2df 100644 --- a/spacy/lang/ca/__init__.py +++ b/spacy/lang/ca/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import 
TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/ca/examples.py b/spacy/lang/ca/examples.py index 3020ee707..ae6aa3e24 100644 --- a/spacy/lang/ca/examples.py +++ b/spacy/lang/ca/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/ca/lex_attrs.py b/spacy/lang/ca/lex_attrs.py index 6314efa92..be8b7a6ea 100644 --- a/spacy/lang/ca/lex_attrs.py +++ b/spacy/lang/ca/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/ca/punctuation.py b/spacy/lang/ca/punctuation.py index 4439376c8..d50b75589 100644 --- a/spacy/lang/ca/punctuation.py +++ b/spacy/lang/ca/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_INFIXES from ..char_classes import ALPHA diff --git a/spacy/lang/ca/stop_words.py b/spacy/lang/ca/stop_words.py index a803db2a5..1a87b2f9d 100644 --- a/spacy/lang/ca/stop_words.py +++ b/spacy/lang/ca/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ a abans ací ah així això al aleshores algun alguna algunes alguns alhora allà allí allò diff --git a/spacy/lang/ca/tag_map.py b/spacy/lang/ca/tag_map.py deleted file mode 100644 index 472e772ef..000000000 --- a/spacy/lang/ca/tag_map.py +++ /dev/null @@ -1,28 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ -from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ - - -TAG_MAP = { - "ADV": {POS: ADV}, - "NOUN": {POS: NOUN}, - "ADP": {POS: ADP}, - "PRON": {POS: PRON}, - "SCONJ": {POS: SCONJ}, - "PROPN": {POS: PROPN}, - "DET": {POS: DET}, - "SYM": {POS: SYM}, - "INTJ": {POS: INTJ}, - "PUNCT": {POS: PUNCT}, - "NUM": {POS: NUM}, - "AUX": {POS: AUX}, - "X": {POS: X}, - "CONJ": {POS: CONJ}, - "CCONJ": {POS: CCONJ}, - "ADJ": {POS: ADJ}, - "VERB": {POS: VERB}, - "PART": {POS: PART}, - "SP": {POS: SPACE}, -} diff --git a/spacy/lang/ca/tokenizer_exceptions.py b/spacy/lang/ca/tokenizer_exceptions.py index d95e5e626..b4ae61a2d 100644 --- a/spacy/lang/ca/tokenizer_exceptions.py +++ b/spacy/lang/ca/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA @@ -33,9 +30,9 @@ _exc["12m."] = [{ORTH: "12"}, {ORTH: "m.", LEMMA: "p.m."}] for h in range(1, 12 + 1): for period in ["a.m.", "am"]: - _exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "a.m."}] + _exc[f"{h}{period}"] = [{ORTH: f"{h}"}, {ORTH: period, LEMMA: "a.m."}] for period in ["p.m.", "pm"]: - _exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "p.m."}] + _exc[f"{h}{period}"] = [{ORTH: f"{h}"}, {ORTH: period, LEMMA: "p.m."}] TOKENIZER_EXCEPTIONS = _exc diff --git a/spacy/lang/char_classes.py b/spacy/lang/char_classes.py index bd0f7e437..b8094319f 100644 --- a/spacy/lang/char_classes.py +++ b/spacy/lang/char_classes.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - split_chars = lambda char: list(char.strip().split(" ")) merge_chars = lambda char: char.strip().replace(" ", "|") group_chars = lambda char: char.strip().replace(" ", "") diff --git a/spacy/lang/cs/__init__.py b/spacy/lang/cs/__init__.py index 5b1397ba2..a27e3339d 100644 --- 
a/spacy/lang/cs/__init__.py +++ b/spacy/lang/cs/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language from ...attrs import LANG diff --git a/spacy/lang/cs/stop_words.py b/spacy/lang/cs/stop_words.py index 59d3c102e..70aab030b 100644 --- a/spacy/lang/cs/stop_words.py +++ b/spacy/lang/cs/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/Alir3z4/stop-words STOP_WORDS = set( diff --git a/spacy/lang/da/__init__.py b/spacy/lang/da/__init__.py index 0190656e5..e0f0061ec 100644 --- a/spacy/lang/da/__init__.py +++ b/spacy/lang/da/__init__.py @@ -1,12 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .morph_rules import MORPH_RULES -from ..tag_map import TAG_MAP from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...language import Language @@ -22,7 +18,6 @@ class DanishDefaults(Language.Defaults): morph_rules = MORPH_RULES infixes = TOKENIZER_INFIXES suffixes = TOKENIZER_SUFFIXES - tag_map = TAG_MAP stop_words = STOP_WORDS diff --git a/spacy/lang/da/examples.py b/spacy/lang/da/examples.py index 525c6519c..efa1a7c0e 100644 --- a/spacy/lang/da/examples.py +++ b/spacy/lang/da/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/da/lex_attrs.py b/spacy/lang/da/lex_attrs.py index 9fefc1eba..403af686c 100644 --- a/spacy/lang/da/lex_attrs.py +++ b/spacy/lang/da/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/da/morph_rules.py b/spacy/lang/da/morph_rules.py index 7ffe2ac6f..06704f482 100644 --- a/spacy/lang/da/morph_rules.py +++ b/spacy/lang/da/morph_rules.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import LEMMA, PRON_LEMMA # Source: Danish Universal Dependencies and http://fjern-uv.dk/pronom.php diff --git a/spacy/lang/da/punctuation.py b/spacy/lang/da/punctuation.py index b6b852c55..e050ab7aa 100644 --- a/spacy/lang/da/punctuation.py +++ b/spacy/lang/da/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_ICONS from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER from ..punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/da/stop_words.py b/spacy/lang/da/stop_words.py index 48de0c7ca..05b2084dd 100644 --- a/spacy/lang/da/stop_words.py +++ b/spacy/lang/da/stop_words.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - # Source: Handpicked by Jens Dahl Møllerhøj. STOP_WORDS = set( diff --git a/spacy/lang/da/tokenizer_exceptions.py b/spacy/lang/da/tokenizer_exceptions.py index 9e4637bfb..36d03bde3 100644 --- a/spacy/lang/da/tokenizer_exceptions.py +++ b/spacy/lang/da/tokenizer_exceptions.py @@ -1,11 +1,7 @@ -# encoding: utf8 """ Tokenizer Exceptions. Source: https://forkortelse.dk/ and various others. """ - -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA, NORM @@ -575,7 +571,7 @@ for exc_data in [ # Dates for h in range(1, 31 + 1): for period in ["."]: - _exc["%d%s" % (h, period)] = [{ORTH: "%d." 
% h}] + _exc[f"{h}{period}"] = [{ORTH: f"{h}."}] _custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: "."}]} _exc.update(_custom_base_exc) diff --git a/spacy/lang/de/__init__.py b/spacy/lang/de/__init__.py index ca01428ba..25785d125 100644 --- a/spacy/lang/de/__init__.py +++ b/spacy/lang/de/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES from .punctuation import TOKENIZER_INFIXES diff --git a/spacy/lang/de/examples.py b/spacy/lang/de/examples.py index 0c64a693a..735d1c316 100644 --- a/spacy/lang/de/examples.py +++ b/spacy/lang/de/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/de/punctuation.py b/spacy/lang/de/punctuation.py index 93454ffff..69d402237 100644 --- a/spacy/lang/de/punctuation.py +++ b/spacy/lang/de/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES from ..char_classes import CURRENCY, UNITS, PUNCT from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER diff --git a/spacy/lang/de/stop_words.py b/spacy/lang/de/stop_words.py index 0c8b375e0..f52687eb9 100644 --- a/spacy/lang/de/stop_words.py +++ b/spacy/lang/de/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ á a ab aber ach acht achte achten achter achtes ag alle allein allem allen diff --git a/spacy/lang/de/syntax_iterators.py b/spacy/lang/de/syntax_iterators.py index 73c1b1a6e..e322e1add 100644 --- a/spacy/lang/de/syntax_iterators.py +++ b/spacy/lang/de/syntax_iterators.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import NOUN, PROPN, PRON from ...errors import Errors diff --git a/spacy/lang/de/tag_map.py b/spacy/lang/de/tag_map.py index c169501a9..ca7ec61f1 100644 --- a/spacy/lang/de/tag_map.py +++ b/spacy/lang/de/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, VERB diff --git a/spacy/lang/de/tokenizer_exceptions.py b/spacy/lang/de/tokenizer_exceptions.py index ebbbfba8c..3c2f02c7a 100644 --- a/spacy/lang/de/tokenizer_exceptions.py +++ b/spacy/lang/de/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA diff --git a/spacy/lang/el/__init__.py b/spacy/lang/el/__init__.py index d03a42da9..5269199b3 100644 --- a/spacy/lang/el/__init__.py +++ b/spacy/lang/el/__init__.py @@ -1,7 +1,3 @@ -# -*- coding: utf-8 -*- - -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from ..tag_map import TAG_MAP from .stop_words import STOP_WORDS diff --git a/spacy/lang/el/examples.py b/spacy/lang/el/examples.py index 521e7b30d..62515c07a 100644 --- a/spacy/lang/el/examples.py +++ b/spacy/lang/el/examples.py @@ -1,7 +1,3 @@ -# -*- coding: utf-8 -*- - -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. 
>>> from spacy.lang.el.examples import sentences diff --git a/spacy/lang/el/get_pos_from_wiktionary.py b/spacy/lang/el/get_pos_from_wiktionary.py index f41833974..369973cc0 100644 --- a/spacy/lang/el/get_pos_from_wiktionary.py +++ b/spacy/lang/el/get_pos_from_wiktionary.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - def get_pos_from_wiktionary(): import re from gensim.corpora.wikicorpus import extract_pages diff --git a/spacy/lang/el/lemmatizer.py b/spacy/lang/el/lemmatizer.py index 6f5b3999b..cf3a7fe97 100644 --- a/spacy/lang/el/lemmatizer.py +++ b/spacy/lang/el/lemmatizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...lemmatizer import Lemmatizer diff --git a/spacy/lang/el/lex_attrs.py b/spacy/lang/el/lex_attrs.py index cf32fe12c..5c8f96848 100644 --- a/spacy/lang/el/lex_attrs.py +++ b/spacy/lang/el/lex_attrs.py @@ -1,7 +1,3 @@ -# -*- coding: utf-8 -*- - -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ diff --git a/spacy/lang/el/punctuation.py b/spacy/lang/el/punctuation.py index fbf773f4d..2d5690407 100644 --- a/spacy/lang/el/punctuation.py +++ b/spacy/lang/el/punctuation.py @@ -1,7 +1,3 @@ -# -*- coding: utf-8 -*- - -from __future__ import unicode_literals - from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY from ..char_classes import LIST_ICONS, ALPHA_LOWER, ALPHA_UPPER, ALPHA, HYPHENS from ..char_classes import CONCAT_QUOTES, CURRENCY diff --git a/spacy/lang/el/stop_words.py b/spacy/lang/el/stop_words.py index f13c47ec2..7c436219f 100644 --- a/spacy/lang/el/stop_words.py +++ b/spacy/lang/el/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Stop words # Link to greek stop words: https://www.translatum.gr/forum/index.php?topic=3550.0?topic=3550.0 STOP_WORDS = set( diff --git a/spacy/lang/el/syntax_iterators.py b/spacy/lang/el/syntax_iterators.py index 4317bdeb4..ea3af576c 100644 --- a/spacy/lang/el/syntax_iterators.py +++ b/spacy/lang/el/syntax_iterators.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import NOUN, PROPN, PRON from ...errors import Errors diff --git a/spacy/lang/el/tag_map_fine.py b/spacy/lang/el/tag_map_fine.py index b346299bc..f37f84c57 100644 --- a/spacy/lang/el/tag_map_fine.py +++ b/spacy/lang/el/tag_map_fine.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB from ...symbols import NOUN, PROPN, PART, INTJ, PRON, AUX @@ -659,7 +656,7 @@ TAG_MAP = { "Gender": "Fem", "Number": "Plur", "Case": "Acc", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfFePlGe": { POS: DET, @@ -667,7 +664,7 @@ TAG_MAP = { "Gender": "Fem", "Number": "Plur", "Case": "Gen", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfFePlNm": { POS: DET, @@ -675,7 +672,7 @@ TAG_MAP = { "Gender": "Fem", "Number": "Plur", "Case": "Nom", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfFeSgAc": { POS: DET, @@ -683,7 +680,7 @@ TAG_MAP = { "Gender": "Fem", "Number": "Sing", "Case": "Acc", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfFeSgDa": { POS: DET, @@ -691,7 +688,7 @@ TAG_MAP = { "Gender": "Fem", "Number": "Sing", "Case": "Dat", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfFeSgGe": { POS: DET, @@ -699,7 +696,7 @@ TAG_MAP = { "Gender": "Fem", "Number": "Sing", "Case": "Gen", - "Other": {"Definite": 
"Def"}, + "Definite": "Def", }, "AtDfFeSgNm": { POS: DET, @@ -707,7 +704,7 @@ TAG_MAP = { "Gender": "Fem", "Number": "Sing", "Case": "Nom", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfMaPlAc": { POS: DET, @@ -715,7 +712,7 @@ TAG_MAP = { "Gender": "Masc", "Number": "Plur", "Case": "Acc", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfMaPlGe": { POS: DET, @@ -723,7 +720,7 @@ TAG_MAP = { "Gender": "Masc", "Number": "Plur", "Case": "Gen", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfMaPlNm": { POS: DET, @@ -731,7 +728,7 @@ TAG_MAP = { "Gender": "Masc", "Number": "Plur", "Case": "Nom", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfMaSgAc": { POS: DET, @@ -739,7 +736,7 @@ TAG_MAP = { "Gender": "Masc", "Number": "Sing", "Case": "Acc", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfMaSgDa": { POS: DET, @@ -747,7 +744,7 @@ TAG_MAP = { "Gender": "Masc", "Number": "Sing", "Case": "Dat", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfMaSgGe": { POS: DET, @@ -755,7 +752,7 @@ TAG_MAP = { "Gender": "Masc", "Number": "Sing", "Case": "Gen", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfMaSgNm": { POS: DET, @@ -763,7 +760,7 @@ TAG_MAP = { "Gender": "Masc", "Number": "Sing", "Case": "Nom", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfNePlAc": { POS: DET, @@ -771,7 +768,7 @@ TAG_MAP = { "Gender": "Neut", "Number": "Plur", "Case": "Acc", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfNePlDa": { POS: DET, @@ -779,7 +776,7 @@ TAG_MAP = { "Gender": "Neut", "Number": "Plur", "Case": "Dat", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfNePlGe": { POS: DET, @@ -787,7 +784,7 @@ TAG_MAP = { "Gender": "Neut", "Number": "Plur", "Case": "Gen", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfNePlNm": { POS: DET, @@ -795,7 +792,7 @@ TAG_MAP = { "Gender": "Neut", "Number": "Plur", "Case": "Nom", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfNeSgAc": { POS: DET, @@ -803,7 +800,7 @@ TAG_MAP = { "Gender": "Neut", "Number": "Sing", "Case": "Acc", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfNeSgDa": { POS: DET, @@ -811,7 +808,7 @@ TAG_MAP = { "Gender": "Neut", "Number": "Sing", "Case": "Dat", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfNeSgGe": { POS: DET, @@ -819,7 +816,7 @@ TAG_MAP = { "Gender": "Neut", "Number": "Sing", "Case": "Gen", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtDfNeSgNm": { POS: DET, @@ -827,7 +824,7 @@ TAG_MAP = { "Gender": "Neut", "Number": "Sing", "Case": "Nom", - "Other": {"Definite": "Def"}, + "Definite": "Def", }, "AtIdFeSgAc": { POS: DET, @@ -835,7 +832,7 @@ TAG_MAP = { "Gender": "Fem", "Number": "Sing", "Case": "Acc", - "Other": {"Definite": "Ind"}, + "Definite": "Ind", }, "AtIdFeSgDa": { POS: DET, @@ -843,7 +840,7 @@ TAG_MAP = { "Gender": "Fem", "Number": "Sing", "Case": "Dat", - "Other": {"Definite": "Ind"}, + "Definite": "Ind", }, "AtIdFeSgGe": { POS: DET, @@ -851,7 +848,7 @@ TAG_MAP = { "Gender": "Fem", "Number": "Sing", "Case": "Gen", - "Other": {"Definite": "Ind"}, + "Definite": "Ind", }, "AtIdFeSgNm": { POS: DET, @@ -859,7 +856,7 @@ TAG_MAP = { "Gender": "Fem", "Number": "Sing", "Case": "Nom", - "Other": {"Definite": "Ind"}, + "Definite": "Ind", }, "AtIdMaSgAc": { POS: DET, @@ -867,7 +864,7 @@ TAG_MAP = { "Gender": "Masc", "Number": "Sing", "Case": "Acc", - "Other": {"Definite": "Ind"}, + "Definite": "Ind", }, "AtIdMaSgGe": { POS: DET, @@ -875,7 +872,7 @@ TAG_MAP = { 
"Gender": "Masc", "Number": "Sing", "Case": "Gen", - "Other": {"Definite": "Ind"}, + "Definite": "Ind", }, "AtIdMaSgNm": { POS: DET, @@ -883,7 +880,7 @@ TAG_MAP = { "Gender": "Masc", "Number": "Sing", "Case": "Nom", - "Other": {"Definite": "Ind"}, + "Definite": "Ind", }, "AtIdNeSgAc": { POS: DET, @@ -891,7 +888,7 @@ TAG_MAP = { "Gender": "Neut", "Number": "Sing", "Case": "Acc", - "Other": {"Definite": "Ind"}, + "Definite": "Ind", }, "AtIdNeSgGe": { POS: DET, @@ -899,7 +896,7 @@ TAG_MAP = { "Gender": "Neut", "Number": "Sing", "Case": "Gen", - "Other": {"Definite": "Ind"}, + "Definite": "Ind", }, "AtIdNeSgNm": { POS: DET, @@ -907,7 +904,7 @@ TAG_MAP = { "Gender": "Neut", "Number": "Sing", "Case": "Nom", - "Other": {"Definite": "Ind"}, + "Definite": "Ind", }, "CjCo": {POS: CCONJ}, "CjSb": {POS: SCONJ}, diff --git a/spacy/lang/el/tokenizer_exceptions.py b/spacy/lang/el/tokenizer_exceptions.py index a3c36542e..112fd991b 100644 --- a/spacy/lang/el/tokenizer_exceptions.py +++ b/spacy/lang/el/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA, NORM @@ -134,14 +131,14 @@ _exc.update(_other_exc) for h in range(1, 12 + 1): for period in ["π.μ.", "πμ"]: - _exc["%d%s" % (h, period)] = [ - {ORTH: "%d" % h}, + _exc[f"{h}{period}"] = [ + {ORTH: f"{h}"}, {ORTH: period, LEMMA: "π.μ.", NORM: "π.μ."}, ] for period in ["μ.μ.", "μμ"]: - _exc["%d%s" % (h, period)] = [ - {ORTH: "%d" % h}, + _exc[f"{h}{period}"] = [ + {ORTH: f"{h}"}, {ORTH: period, LEMMA: "μ.μ.", NORM: "μ.μ."}, ] diff --git a/spacy/lang/en/__init__.py b/spacy/lang/en/__init__.py index 4304b3c6a..de09ec1e7 100644 --- a/spacy/lang/en/__init__.py +++ b/spacy/lang/en/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tag_map import TAG_MAP from .stop_words import STOP_WORDS diff --git a/spacy/lang/en/examples.py b/spacy/lang/en/examples.py index 946289c7c..2cca9e05f 100644 --- a/spacy/lang/en/examples.py +++ b/spacy/lang/en/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/en/lex_attrs.py b/spacy/lang/en/lex_attrs.py index f92d41139..96fb4c9fa 100644 --- a/spacy/lang/en/lex_attrs.py +++ b/spacy/lang/en/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/en/morph_rules.py b/spacy/lang/en/morph_rules.py index 5ed4eac59..aa3e6ce57 100644 --- a/spacy/lang/en/morph_rules.py +++ b/spacy/lang/en/morph_rules.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import LEMMA, PRON_LEMMA # Several entries here look pretty suspicious. 
These will get the POS SCONJ diff --git a/spacy/lang/en/stop_words.py b/spacy/lang/en/stop_words.py index 3505b13bf..1ca5cbc16 100644 --- a/spacy/lang/en/stop_words.py +++ b/spacy/lang/en/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Stop words STOP_WORDS = set( """ diff --git a/spacy/lang/en/syntax_iterators.py b/spacy/lang/en/syntax_iterators.py index 6d366ec90..c41120afb 100644 --- a/spacy/lang/en/syntax_iterators.py +++ b/spacy/lang/en/syntax_iterators.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import NOUN, PROPN, PRON from ...errors import Errors diff --git a/spacy/lang/en/tag_map.py b/spacy/lang/en/tag_map.py index ecb3103cc..2078798f7 100644 --- a/spacy/lang/en/tag_map.py +++ b/spacy/lang/en/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON diff --git a/spacy/lang/en/tokenizer_exceptions.py b/spacy/lang/en/tokenizer_exceptions.py index 6a553052b..908ac3940 100644 --- a/spacy/lang/en/tokenizer_exceptions.py +++ b/spacy/lang/en/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA @@ -334,13 +331,13 @@ for exc_data in [ for h in range(1, 12 + 1): for period in ["a.m.", "am"]: - _exc["%d%s" % (h, period)] = [ - {ORTH: "%d" % h}, + _exc[f"{h}{period}"] = [ + {ORTH: f"{h}"}, {ORTH: period, LEMMA: "a.m.", NORM: "a.m."}, ] for period in ["p.m.", "pm"]: - _exc["%d%s" % (h, period)] = [ - {ORTH: "%d" % h}, + _exc[f"{h}{period}"] = [ + {ORTH: f"{h}"}, {ORTH: period, LEMMA: "p.m.", NORM: "p.m."}, ] diff --git a/spacy/lang/es/__init__.py b/spacy/lang/es/__init__.py index 249748a17..f3b1f756e 100644 --- a/spacy/lang/es/__init__.py +++ b/spacy/lang/es/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tag_map import TAG_MAP from .stop_words import STOP_WORDS diff --git a/spacy/lang/es/examples.py b/spacy/lang/es/examples.py index 0e31b56af..a1db41a16 100644 --- a/spacy/lang/es/examples.py +++ b/spacy/lang/es/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
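# ---------------------------------------------------------------------------
# Editor's aside (illustrative sketch, not part of the diff): the rewritten
# loops in en/tokenizer_exceptions.py generate one exception per hour/suffix
# pair, so e.g. "11am" is split into "11" + "am" with the suffix normalised;
# roughly:
from spacy.symbols import ORTH, LEMMA, NORM

_exc_sample = {}
for h in range(1, 12 + 1):
    for period in ["a.m.", "am"]:
        _exc_sample[f"{h}{period}"] = [
            {ORTH: f"{h}"},
            {ORTH: period, LEMMA: "a.m.", NORM: "a.m."},
        ]
assert _exc_sample["11am"] == [{ORTH: "11"}, {ORTH: "am", LEMMA: "a.m.", NORM: "a.m."}]
# ---------------------------------------------------------------------------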
diff --git a/spacy/lang/es/lex_attrs.py b/spacy/lang/es/lex_attrs.py index 632a638fc..988dbaba1 100644 --- a/spacy/lang/es/lex_attrs.py +++ b/spacy/lang/es/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/es/stop_words.py b/spacy/lang/es/stop_words.py index 20e929b48..004df4fca 100644 --- a/spacy/lang/es/stop_words.py +++ b/spacy/lang/es/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí diff --git a/spacy/lang/es/syntax_iterators.py b/spacy/lang/es/syntax_iterators.py index 5fda35211..3c65bd441 100644 --- a/spacy/lang/es/syntax_iterators.py +++ b/spacy/lang/es/syntax_iterators.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import NOUN, PROPN, PRON, VERB, AUX from ...errors import Errors diff --git a/spacy/lang/es/tag_map.py b/spacy/lang/es/tag_map.py index 7a7c9d549..1748162c0 100644 --- a/spacy/lang/es/tag_map.py +++ b/spacy/lang/es/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, SCONJ, AUX, CONJ diff --git a/spacy/lang/es/tokenizer_exceptions.py b/spacy/lang/es/tokenizer_exceptions.py index 2c2631086..7836f1c43 100644 --- a/spacy/lang/es/tokenizer_exceptions.py +++ b/spacy/lang/es/tokenizer_exceptions.py @@ -1,12 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA, NORM, PRON_LEMMA _exc = { "pal": [{ORTH: "pa", LEMMA: "para"}, {ORTH: "l", LEMMA: "el", NORM: "el"}], - "pala": [{ORTH: "pa", LEMMA: "para"}, {ORTH: "la", LEMMA: "la", NORM: "la"}], } @@ -31,9 +27,9 @@ _exc["12m."] = [{ORTH: "12"}, {ORTH: "m.", LEMMA: "p.m."}] for h in range(1, 12 + 1): for period in ["a.m.", "am"]: - _exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "a.m."}] + _exc[f"{h}{period}"] = [{ORTH: f"{h}"}, {ORTH: period, LEMMA: "a.m."}] for period in ["p.m.", "pm"]: - _exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "p.m."}] + _exc[f"{h}{period}"] = [{ORTH: f"{h}"}, {ORTH: period, LEMMA: "p.m."}] for orth in [ diff --git a/spacy/lang/et/__init__.py b/spacy/lang/et/__init__.py index d84c081ef..e0b0a8a87 100644 --- a/spacy/lang/et/__init__.py +++ b/spacy/lang/et/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language from ...attrs import LANG diff --git a/spacy/lang/et/stop_words.py b/spacy/lang/et/stop_words.py index 15070db5f..e1da1f14d 100644 --- a/spacy/lang/et/stop_words.py +++ b/spacy/lang/et/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-et STOP_WORDS = set( diff --git a/spacy/lang/eu/__init__.py b/spacy/lang/eu/__init__.py index 4f3338c1d..352eb1548 100644 --- a/spacy/lang/eu/__init__.py +++ b/spacy/lang/eu/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/eu/examples.py b/spacy/lang/eu/examples.py index 463494abd..3b9ef71b6 100644 --- a/spacy/lang/eu/examples.py +++ b/spacy/lang/eu/examples.py @@ -1,6 +1,3 @@ -# 
coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/eu/lex_attrs.py b/spacy/lang/eu/lex_attrs.py index 19b75c111..a3ab018ee 100644 --- a/spacy/lang/eu/lex_attrs.py +++ b/spacy/lang/eu/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM # Source http://mylanguages.org/basque_numbers.php diff --git a/spacy/lang/eu/punctuation.py b/spacy/lang/eu/punctuation.py index b8b1a1c83..5d35d0a25 100644 --- a/spacy/lang/eu/punctuation.py +++ b/spacy/lang/eu/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/eu/stop_words.py b/spacy/lang/eu/stop_words.py index dda11a7fd..d213b5b81 100644 --- a/spacy/lang/eu/stop_words.py +++ b/spacy/lang/eu/stop_words.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - # Source: https://github.com/stopwords-iso/stopwords-eu # https://www.ranks.nl/stopwords/basque # https://www.mustgo.com/worldlanguages/basque/ diff --git a/spacy/lang/eu/tag_map.py b/spacy/lang/eu/tag_map.py index 2499d7e3e..e0940edb7 100644 --- a/spacy/lang/eu/tag_map.py +++ b/spacy/lang/eu/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON diff --git a/spacy/lang/fa/__init__.py b/spacy/lang/fa/__init__.py index c93bca671..4c5a7074c 100644 --- a/spacy/lang/fa/__init__.py +++ b/spacy/lang/fa/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...language import Language from ...attrs import LANG, NORM from ...util import update_exc, add_lookups diff --git a/spacy/lang/fa/examples.py b/spacy/lang/fa/examples.py index 3f65a366d..9c6fb0345 100644 --- a/spacy/lang/fa/examples.py +++ b/spacy/lang/fa/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/fa/generate_verbs_exc.py b/spacy/lang/fa/generate_verbs_exc.py index 5d0ff944d..62094c6de 100644 --- a/spacy/lang/fa/generate_verbs_exc.py +++ b/spacy/lang/fa/generate_verbs_exc.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - verb_roots = """ #هست آخت#آهنج diff --git a/spacy/lang/fa/lex_attrs.py b/spacy/lang/fa/lex_attrs.py index dbea66b68..99b8e2787 100644 --- a/spacy/lang/fa/lex_attrs.py +++ b/spacy/lang/fa/lex_attrs.py @@ -1,5 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals from ...attrs import LIKE_NUM diff --git a/spacy/lang/fa/punctuation.py b/spacy/lang/fa/punctuation.py index 33aa46ae2..4b258c13d 100644 --- a/spacy/lang/fa/punctuation.py +++ b/spacy/lang/fa/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY from ..char_classes import UNITS, ALPHA_UPPER diff --git a/spacy/lang/fa/stop_words.py b/spacy/lang/fa/stop_words.py index 682fb7a71..f462f2e7a 100644 --- a/spacy/lang/fa/stop_words.py +++ b/spacy/lang/fa/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Stop words from HAZM package STOP_WORDS = set( """ diff --git a/spacy/lang/fa/syntax_iterators.py b/spacy/lang/fa/syntax_iterators.py index 6d366ec90..c41120afb 100644 --- a/spacy/lang/fa/syntax_iterators.py +++ b/spacy/lang/fa/syntax_iterators.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import NOUN, PROPN, PRON from ...errors import Errors diff --git a/spacy/lang/fa/tag_map.py b/spacy/lang/fa/tag_map.py index b9043adf0..f1f106915 100644 --- a/spacy/lang/fa/tag_map.py +++ b/spacy/lang/fa/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, ADJ, CONJ, NUM, DET, ADV, ADP, X, VERB from ...symbols import PRON, NOUN, PART, INTJ, AUX diff --git a/spacy/lang/fa/tokenizer_exceptions.py b/spacy/lang/fa/tokenizer_exceptions.py index b3f8dcbf5..db9e3f6fc 100644 --- a/spacy/lang/fa/tokenizer_exceptions.py +++ b/spacy/lang/fa/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA, TAG, NORM diff --git a/spacy/lang/fi/__init__.py b/spacy/lang/fi/__init__.py index 45d2f886f..db58ad3ba 100644 --- a/spacy/lang/fi/__init__.py +++ b/spacy/lang/fi/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/fi/examples.py b/spacy/lang/fi/examples.py index 88be248a6..930fac273 100644 --- a/spacy/lang/fi/examples.py +++ b/spacy/lang/fi/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. 
>>> from spacy.lang.fi.examples import sentences diff --git a/spacy/lang/fi/lex_attrs.py b/spacy/lang/fi/lex_attrs.py index e960b55eb..4d500cead 100644 --- a/spacy/lang/fi/lex_attrs.py +++ b/spacy/lang/fi/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/fi/punctuation.py b/spacy/lang/fi/punctuation.py index a85c0b228..6e14dde38 100644 --- a/spacy/lang/fi/punctuation.py +++ b/spacy/lang/fi/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_HYPHENS from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER from ..punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/fi/stop_words.py b/spacy/lang/fi/stop_words.py index e8e39ec6f..8e8dcfa56 100644 --- a/spacy/lang/fi/stop_words.py +++ b/spacy/lang/fi/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source https://github.com/stopwords-iso/stopwords-fi/blob/master/stopwords-fi.txt # Reformatted with some minor corrections STOP_WORDS = set( diff --git a/spacy/lang/fi/tokenizer_exceptions.py b/spacy/lang/fi/tokenizer_exceptions.py index 7cdc7cf11..51ca45d63 100644 --- a/spacy/lang/fi/tokenizer_exceptions.py +++ b/spacy/lang/fi/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA diff --git a/spacy/lang/fr/__init__.py b/spacy/lang/fr/__init__.py index 7727aff0e..7f9928136 100644 --- a/spacy/lang/fr/__init__.py +++ b/spacy/lang/fr/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/fr/_tokenizer_exceptions_list.py b/spacy/lang/fr/_tokenizer_exceptions_list.py index c9fcfff2d..7f908dac8 100644 --- a/spacy/lang/fr/_tokenizer_exceptions_list.py +++ b/spacy/lang/fr/_tokenizer_exceptions_list.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - FR_BASE_EXCEPTIONS = [ "(+)-amphétamine", "(5R,6S)-7,8-didehydro-4,5-époxy-3-méthoxy-N-méthylmorphinan-6-ol", diff --git a/spacy/lang/fr/examples.py b/spacy/lang/fr/examples.py index a874c22fc..a74a62204 100644 --- a/spacy/lang/fr/examples.py +++ b/spacy/lang/fr/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/fr/lemmatizer.py b/spacy/lang/fr/lemmatizer.py index 79f4dd28d..fe128df1f 100644 --- a/spacy/lang/fr/lemmatizer.py +++ b/spacy/lang/fr/lemmatizer.py @@ -1,10 +1,6 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...lemmatizer import Lemmatizer from ...symbols import POS, NOUN, VERB, ADJ, ADV, PRON, DET, AUX, PUNCT, ADP from ...symbols import SCONJ, CCONJ -from ...symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos class FrenchLemmatizer(Lemmatizer): @@ -85,13 +81,13 @@ class FrenchLemmatizer(Lemmatizer): return True elif univ_pos == "adj" and morphology.get("Degree") == "pos": return True - elif VerbForm_inf in morphology: + elif "VerbForm=inf" in morphology: return True - elif VerbForm_none in morphology: + elif "VerbForm=none" in morphology: return True - elif Number_sing in morphology: + elif "Number=sing" in morphology: return True - elif Degree_pos in morphology: + elif "Degree=pos" in morphology: return True else: return False diff --git a/spacy/lang/fr/lex_attrs.py b/spacy/lang/fr/lex_attrs.py index e3ccd9fdd..da98c6e37 100644 --- a/spacy/lang/fr/lex_attrs.py +++ b/spacy/lang/fr/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/fr/punctuation.py b/spacy/lang/fr/punctuation.py index 7d50c4a9e..873d01d87 100644 --- a/spacy/lang/fr/punctuation.py +++ b/spacy/lang/fr/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY from ..char_classes import CONCAT_QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER diff --git a/spacy/lang/fr/stop_words.py b/spacy/lang/fr/stop_words.py index ae8432043..a331f3c0f 100644 --- a/spacy/lang/fr/stop_words.py +++ b/spacy/lang/fr/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ a à â abord absolument afin ah ai aie ailleurs ainsi ait allaient allo allons diff --git a/spacy/lang/fr/syntax_iterators.py b/spacy/lang/fr/syntax_iterators.py index 2ed2c1b35..c09b0e840 100644 --- a/spacy/lang/fr/syntax_iterators.py +++ b/spacy/lang/fr/syntax_iterators.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import NOUN, PROPN, PRON from ...errors import Errors diff --git a/spacy/lang/fr/tag_map.py b/spacy/lang/fr/tag_map.py index 93b43c2ec..2b1b20c52 100644 --- a/spacy/lang/fr/tag_map.py +++ b/spacy/lang/fr/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, SCONJ diff --git a/spacy/lang/fr/tokenizer_exceptions.py b/spacy/lang/fr/tokenizer_exceptions.py index 4eb4c1568..7bf4922d8 100644 --- a/spacy/lang/fr/tokenizer_exceptions.py +++ b/spacy/lang/fr/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import re from .punctuation import ELISION, HYPHENS @@ -91,7 +88,7 @@ for verb, verb_lemma in [ ]: for orth in [verb, verb.title()]: for pronoun in ["elle", "il", "on"]: - token = "{}-t-{}".format(orth, pronoun) + token = f"{orth}-t-{pronoun}" _exc[token] = [ {LEMMA: verb_lemma, ORTH: orth}, # , TAG: "VERB"}, {LEMMA: "t", ORTH: "-t"}, @@ -100,7 +97,7 @@ for verb, verb_lemma in [ for verb, verb_lemma in [("est", "être")]: for orth in 
[verb, verb.title()]: - token = "{}-ce".format(orth) + token = f"{orth}-ce" _exc[token] = [ {LEMMA: verb_lemma, ORTH: orth}, # , TAG: "VERB"}, {LEMMA: "ce", ORTH: "-ce"}, @@ -109,7 +106,7 @@ for verb, verb_lemma in [("est", "être")]: for pre, pre_lemma in [("qu'", "que"), ("n'", "ne")]: for orth in [pre, pre.title()]: - _exc["%sest-ce" % orth] = [ + _exc[f"{orth}est-ce"] = [ {LEMMA: pre_lemma, ORTH: orth}, {LEMMA: "être", ORTH: "est"}, {LEMMA: "ce", ORTH: "-ce"}, diff --git a/spacy/lang/ga/__init__.py b/spacy/lang/ga/__init__.py index 42b4d0d18..4c3d219c7 100644 --- a/spacy/lang/ga/__init__.py +++ b/spacy/lang/ga/__init__.py @@ -1,8 +1,6 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS +from .tag_map import TAG_MAP from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...language import Language @@ -16,6 +14,7 @@ class IrishDefaults(Language.Defaults): tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) stop_words = set(STOP_WORDS) + tag_map = TAG_MAP class Irish(Language): diff --git a/spacy/lang/ga/irish_morphology_helpers.py b/spacy/lang/ga/irish_morphology_helpers.py index 2133f0d22..d606da975 100644 --- a/spacy/lang/ga/irish_morphology_helpers.py +++ b/spacy/lang/ga/irish_morphology_helpers.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # fmt: off consonants = ["b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "z"] broad_vowels = ["a", "á", "o", "ó", "u", "ú"] diff --git a/spacy/lang/ga/stop_words.py b/spacy/lang/ga/stop_words.py index d8f705b59..4ef052ca5 100644 --- a/spacy/lang/ga/stop_words.py +++ b/spacy/lang/ga/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ a ach ag agus an aon ar arna as diff --git a/spacy/lang/ga/tag_map.py b/spacy/lang/ga/tag_map.py index 1d8284014..efcaf5d1f 100644 --- a/spacy/lang/ga/tag_map.py +++ b/spacy/lang/ga/tag_map.py @@ -1,29 +1,26 @@ -# coding: utf8 -from __future__ import unicode_literals - # fmt: off TAG_MAP = { - "ADJ__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, + "ADJ__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "masc", "Number": "sing", "Form": "len"}, "ADJ__Case=Gen|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "fem", "Number": "sing"}, "ADJ__Case=Gen|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "masc", "Number": "sing"}, - "ADJ__Case=Gen|NounType=Strong|Number=Plur": {"pos": "ADJ", "Case": "gen", "Number": "plur", "Other": {"NounType": "strong"}}, - "ADJ__Case=Gen|NounType=Weak|Number=Plur": {"pos": "ADJ", "Case": "gen", "Number": "plur", "Other": {"NounType": "weak"}}, - "ADJ__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "ADJ__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, + "ADJ__Case=Gen|NounType=Strong|Number=Plur": {"pos": "ADJ", "Case": "gen", "Number": "plur", "NounType": "strong"}, + "ADJ__Case=Gen|NounType=Weak|Number=Plur": {"pos": "ADJ", "Case": "gen", "Number": "plur", "NounType": "weak"}, + "ADJ__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": 
"sing", "Form": "len"}, + "ADJ__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Form": "len"}, "ADJ__Case=NomAcc|Gender=Fem|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": "plur"}, "ADJ__Case=NomAcc|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": "sing"}, "ADJ__Case=NomAcc|Gender=Masc|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "plur"}, "ADJ__Case=NomAcc|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "sing"}, - "ADJ__Case=NomAcc|NounType=NotSlender|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Number": "plur", "Other": {"NounType": "notslender"}}, - "ADJ__Case=NomAcc|NounType=Slender|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Number": "plur", "Other": {"NounType": "slender"}}, - "ADJ__Degree=Cmp,Sup|Form=Len": {"pos": "ADJ", "Degree": "cmp|sup", "Other": {"Form": "len"}}, + "ADJ__Case=NomAcc|NounType=NotSlender|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Number": "plur", "NounType": "notslender"}, + "ADJ__Case=NomAcc|NounType=Slender|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Number": "plur", "NounType": "slender"}, + "ADJ__Degree=Cmp,Sup|Form=Len": {"pos": "ADJ", "Degree": "cmp|sup", "Form": "len"}, "ADJ__Degree=Cmp,Sup": {"pos": "ADJ", "Degree": "cmp|sup"}, - "ADJ__Degree=Pos|Form=Ecl": {"pos": "ADJ", "Degree": "pos", "Other": {"Form": "ecl"}}, - "ADJ__Degree=Pos|Form=HPref": {"pos": "ADJ", "Degree": "pos", "Other": {"Form": "hpref"}}, - "ADJ__Degree=Pos|Form=Len": {"pos": "ADJ", "Degree": "pos", "Other": {"Form": "len"}}, + "ADJ__Degree=Pos|Form=Ecl": {"pos": "ADJ", "Degree": "pos", "Form": "ecl"}, + "ADJ__Degree=Pos|Form=HPref": {"pos": "ADJ", "Degree": "pos", "Form": "hpref"}, + "ADJ__Degree=Pos|Form=Len": {"pos": "ADJ", "Degree": "pos", "Form": "len"}, "ADJ__Degree=Pos": {"pos": "ADJ", "Degree": "pos"}, "ADJ__Foreign=Yes": {"pos": "ADJ", "Foreign": "yes"}, - "ADJ__Form=Len|VerbForm=Part": {"pos": "ADJ", "VerbForm": "part", "Other": {"Form": "len"}}, + "ADJ__Form=Len|VerbForm=Part": {"pos": "ADJ", "VerbForm": "part", "Form": "len"}, "ADJ__Gender=Masc|Number=Sing|PartType=Voc": {"pos": "ADJ", "Gender": "masc", "Number": "sing", "Case": "voc"}, "ADJ__Gender=Masc|Number=Sing|Case=Voc": {"pos": "ADJ", "Gender": "masc", "Number": "sing", "Case": "voc"}, "ADJ__Number=Plur|PartType=Voc": {"pos": "ADJ", "Number": "plur", "Case": "voc"}, @@ -32,9 +29,9 @@ TAG_MAP = { "ADJ___": {"pos": "ADJ"}, "ADJ__VerbForm=Part": {"pos": "ADJ", "VerbForm": "part"}, "ADP__Foreign=Yes": {"pos": "ADP", "Foreign": "yes"}, - "ADP__Form=Len|Number=Plur|Person=1": {"pos": "ADP", "Number": "plur", "Person": 1, "Other": {"Form": "len"}}, - "ADP__Form=Len|Number=Plur|Person=3": {"pos": "ADP", "Number": "plur", "Person": 3, "Other": {"Form": "len"}}, - "ADP__Form=Len|Number=Sing|Person=1": {"pos": "ADP", "Number": "sing", "Person": 1, "Other": {"Form": "len"}}, + "ADP__Form=Len|Number=Plur|Person=1": {"pos": "ADP", "Number": "plur", "Person": 1, "Form": "len"}, + "ADP__Form=Len|Number=Plur|Person=3": {"pos": "ADP", "Number": "plur", "Person": 3, "Form": "len"}, + "ADP__Form=Len|Number=Sing|Person=1": {"pos": "ADP", "Number": "sing", "Person": 1, "Form": "len"}, "ADP__Gender=Fem|Number=Sing|Person=3": {"pos": "ADP", "Gender": "fem", "Number": "sing", "Person": 3}, "ADP__Gender=Fem|Number=Sing|Person=3|Poss=Yes": {"pos": "ADP", "Gender": "fem", "Number": "sing", "Person": 3, "Poss": "yes"}, 
"ADP__Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"pos": "ADP", "Gender": "fem", "Number": "sing", "Person": 3, "Poss": "yes", "PronType": "prs"}, @@ -60,41 +57,41 @@ TAG_MAP = { "ADP__Person=3|Poss=Yes": {"pos": "ADP", "Person": 3, "Poss": "yes"}, "ADP___": {"pos": "ADP"}, "ADP__Poss=Yes": {"pos": "ADP", "Poss": "yes"}, - "ADP__PrepForm=Cmpd": {"pos": "ADP", "Other": {"PrepForm": "cmpd"}}, + "ADP__PrepForm=Cmpd": {"pos": "ADP", "PrepForm": "cmpd"}, "ADP__PronType=Art": {"pos": "ADP", "PronType": "art"}, - "ADV__Form=Len": {"pos": "ADV", "Other": {"Form": "len"}}, + "ADV__Form=Len": {"pos": "ADV", "Form": "len"}, "ADV___": {"pos": "ADV"}, "ADV__PronType=Int": {"pos": "ADV", "PronType": "int"}, - "AUX__Form=VF|Polarity=Neg|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}}, - "AUX__Form=VF|Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}}, - "AUX__Form=VF|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}}, - "AUX__Form=VF|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}}, - "AUX__Form=VF|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Other": {"Form": "vf", "VerbForm": "cop"}}, - "AUX__Gender=Masc|Number=Sing|Person=3|VerbForm=Cop": {"pos": "AUX", "Gender": "masc", "Number": "sing", "Person": 3, "Other": {"VerbForm": "cop"}}, - "AUX__Mood=Int|Number=Sing|PronType=Art|VerbForm=Cop": {"pos": "AUX", "Number": "sing", "PronType": "art", "Other": {"Mood": "int", "VerbForm": "cop"}}, - "AUX__Mood=Int|Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Other": {"Mood": "int", "VerbForm": "cop"}}, - "AUX__Mood=Int|Polarity=Neg|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "pres", "Other": {"Mood": "int", "VerbForm": "cop"}}, - "AUX__Mood=Int|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Other": {"Mood": "int", "VerbForm": "cop"}}, - "AUX__PartType=Comp|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Other": {"PartType": "comp", "VerbForm": "cop"}}, - "AUX__Polarity=Neg|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Other": {"VerbForm": "cop"}}, - "AUX__Polarity=Neg|PronType=Rel|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "pres", "Other": {"VerbForm": "cop"}}, - "AUX__Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Other": {"VerbForm": "cop"}}, - "AUX__Polarity=Neg|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "pres", "Other": {"VerbForm": "cop"}}, + "AUX__Form=VF|Polarity=Neg|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Form": "vf", "VerbForm": "cop"}, + "AUX__Form=VF|Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Form": "vf", "VerbForm": "cop"}, + "AUX__Form=VF|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "past", "Form": "vf", "VerbForm": "cop"}, + "AUX__Form=VF|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Form": "vf", "VerbForm": "cop"}, + "AUX__Form=VF|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Form": "vf", "VerbForm": "cop"}, + 
"AUX__Gender=Masc|Number=Sing|Person=3|VerbForm=Cop": {"pos": "AUX", "Gender": "masc", "Number": "sing", "Person": 3, "VerbForm": "cop"}, + "AUX__Mood=Int|Number=Sing|PronType=Art|VerbForm=Cop": {"pos": "AUX", "Number": "sing", "PronType": "art", "Mood": "int", "VerbForm": "cop"}, + "AUX__Mood=Int|Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Mood": "int", "VerbForm": "cop"}, + "AUX__Mood=Int|Polarity=Neg|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "pres", "Mood": "int", "VerbForm": "cop"}, + "AUX__Mood=Int|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Mood": "int", "VerbForm": "cop"}, + "AUX__PartType=Comp|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "PartType": "comp", "VerbForm": "cop"}, + "AUX__Polarity=Neg|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "past", "VerbForm": "cop"}, + "AUX__Polarity=Neg|PronType=Rel|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "pres", "VerbForm": "cop"}, + "AUX__Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "VerbForm": "cop"}, + "AUX__Polarity=Neg|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "pres", "VerbForm": "cop"}, "AUX___": {"pos": "AUX"}, - "AUX__PronType=Dem|VerbForm=Cop": {"pos": "AUX", "PronType": "dem", "Other": {"VerbForm": "cop"}}, - "AUX__PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "past", "Other": {"VerbForm": "cop"}}, - "AUX__PronType=Rel|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "pres", "Other": {"VerbForm": "cop"}}, - "AUX__Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Other": {"VerbForm": "cop"}}, - "AUX__Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Other": {"VerbForm": "cop"}}, - "AUX__VerbForm=Cop": {"pos": "AUX", "Other": {"VerbForm": "cop"}}, + "AUX__PronType=Dem|VerbForm=Cop": {"pos": "AUX", "PronType": "dem", "VerbForm": "cop"}, + "AUX__PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "past", "VerbForm": "cop"}, + "AUX__PronType=Rel|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "pres", "VerbForm": "cop"}, + "AUX__Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "VerbForm": "cop"}, + "AUX__Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "VerbForm": "cop"}, + "AUX__VerbForm=Cop": {"pos": "AUX", "VerbForm": "cop"}, "CCONJ___": {"pos": "CCONJ"}, "DET__Case=Gen|Definite=Def|Gender=Fem|Number=Sing|PronType=Art": {"pos": "DET", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "sing", "PronType": "art"}, - "DET__Definite=Def|Form=Ecl": {"pos": "DET", "Definite": "def", "Other": {"Form": "ecl"}}, + "DET__Definite=Def|Form=Ecl": {"pos": "DET", "Definite": "def", "Form": "ecl"}, "DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art": {"pos": "DET", "Definite": "def", "Gender": "fem", "Number": "sing", "PronType": "art"}, "DET__Definite=Def|Number=Plur|PronType=Art": {"pos": "DET", "Definite": "def", "Number": "plur", "PronType": "art"}, "DET__Definite=Def|Number=Sing|PronType=Art": {"pos": "DET", "Definite": "def", "Number": "sing", "PronType": "art"}, "DET__Definite=Def": {"pos": "DET", "Definite": "def"}, - "DET__Form=HPref|PronType=Ind": {"pos": "DET", "PronType": "ind", "Other": {"Form": "hpref"}}, + "DET__Form=HPref|PronType=Ind": {"pos": "DET", "PronType": "ind", "Form": "hpref"}, 
"DET__Gender=Fem|Number=Sing|Person=3|Poss=Yes": {"pos": "DET", "Gender": "fem", "Number": "sing", "Person": 3, "Poss": "yes"}, "DET__Gender=Masc|Number=Sing|Person=3|Poss=Yes": {"pos": "DET", "Gender": "masc", "Number": "sing", "Person": 3, "Poss": "yes"}, "DET__Number=Plur|Person=1|Poss=Yes": {"pos": "DET", "Number": "plur", "Person": 1, "Poss": "yes"}, @@ -106,33 +103,33 @@ TAG_MAP = { "DET__PronType=Dem": {"pos": "DET", "PronType": "dem"}, "DET__PronType=Ind": {"pos": "DET", "PronType": "ind"}, "NOUN__Case=Dat|Definite=Ind|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Definite": "ind", "Gender": "fem", "Number": "sing"}, - "NOUN__Case=Dat|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}}, - "NOUN__Case=Dat|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, + "NOUN__Case=Dat|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing", "Form": "ecl"}, + "NOUN__Case=Dat|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing", "Form": "len"}, "NOUN__Case=Dat|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing"}, "NOUN__Case=Dat|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "masc", "Number": "sing"}, - "NOUN__Case=Gen|Definite=Def|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "plur", "Other": {"NounType": "strong"}}, + "NOUN__Case=Gen|Definite=Def|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "plur", "NounType": "strong"}, "NOUN__Case=Gen|Definite=Def|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "sing"}, - "NOUN__Case=Gen|Definite=Def|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "plur", "Other": {"NounType": "strong"}}, - "NOUN__Case=Gen|Definite=Def|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "plur", "Other": {"NounType": "weak"}}, + "NOUN__Case=Gen|Definite=Def|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "plur", "NounType": "strong"}, + "NOUN__Case=Gen|Definite=Def|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "plur", "NounType": "weak"}, "NOUN__Case=Gen|Definite=Def|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "sing"}, "NOUN__Case=Gen|Definite=Ind|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Definite": "ind", "Gender": "fem", "Number": "sing"}, - "NOUN__Case=Gen|Form=Ecl|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"Form": "ecl", "NounType": "strong"}}, - "NOUN__Case=Gen|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}}, - "NOUN__Case=Gen|Form=Ecl|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl", "NounType": "strong"}}, - "NOUN__Case=Gen|Form=Ecl|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", 
"Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl", "NounType": "weak"}}, - "NOUN__Case=Gen|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "ecl"}}, - "NOUN__Case=Gen|Form=HPref|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "hpref"}}, - "NOUN__Case=Gen|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Case=Gen|Form=Len|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "len", "NounType": "strong"}}, - "NOUN__Case=Gen|Form=Len|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "len", "NounType": "weak"}}, - "NOUN__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Case=Gen|Form=Len|VerbForm=Inf": {"pos": "NOUN", "Case": "gen", "VerbForm": "inf", "Other": {"Form": "len"}}, - "NOUN__Case=Gen|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"NounType": "strong"}}, - "NOUN__Case=Gen|Gender=Fem|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"NounType": "weak"}}, + "NOUN__Case=Gen|Form=Ecl|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Form": "ecl", "NounType": "strong"}, + "NOUN__Case=Gen|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Form": "ecl"}, + "NOUN__Case=Gen|Form=Ecl|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Form": "ecl", "NounType": "strong"}, + "NOUN__Case=Gen|Form=Ecl|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Form": "ecl", "NounType": "weak"}, + "NOUN__Case=Gen|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing", "Form": "ecl"}, + "NOUN__Case=Gen|Form=HPref|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Form": "hpref"}, + "NOUN__Case=Gen|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Form": "len"}, + "NOUN__Case=Gen|Form=Len|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Form": "len", "NounType": "strong"}, + "NOUN__Case=Gen|Form=Len|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Form": "len", "NounType": "weak"}, + "NOUN__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing", "Form": "len"}, + "NOUN__Case=Gen|Form=Len|VerbForm=Inf": {"pos": "NOUN", "Case": "gen", "VerbForm": "inf", "Form": "len"}, + "NOUN__Case=Gen|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "NounType": "strong"}, + "NOUN__Case=Gen|Gender=Fem|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "NounType": "weak"}, "NOUN__Case=Gen|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur"}, 
"NOUN__Case=Gen|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing"}, - "NOUN__Case=Gen|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"NounType": "strong"}}, - "NOUN__Case=Gen|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"NounType": "weak"}}, + "NOUN__Case=Gen|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "NounType": "strong"}, + "NOUN__Case=Gen|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "NounType": "weak"}, "NOUN__Case=Gen|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur"}, "NOUN__Case=Gen|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing"}, "NOUN__Case=Gen|Number=Sing": {"pos": "NOUN", "Case": "gen", "Number": "sing"}, @@ -143,79 +140,79 @@ TAG_MAP = { "NOUN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "plur"}, "NOUN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "sing"}, "NOUN__Case=NomAcc|Definite=Ind|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Definite": "ind", "Gender": "masc", "Number": "plur"}, - "NOUN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Other": {"Form": "ecl"}}, - "NOUN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}}, - "NOUN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl"}}, - "NOUN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "ecl"}}, - "NOUN__Case=NomAcc|Form=Emp|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "emp"}}, - "NOUN__Case=NomAcc|Form=HPref|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Other": {"Form": "hpref"}}, - "NOUN__Case=NomAcc|Form=HPref|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "hpref"}}, - "NOUN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Other": {"Form": "hpref"}}, - "NOUN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "hpref"}}, - "NOUN__Case=NomAcc|Form=Len|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Other": {"Form": "len"}}, - "NOUN__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Case=NomAcc|Form=Len|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Other": {"Form": "len"}}, - "NOUN__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, + 
"NOUN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Form": "ecl"}, + "NOUN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Form": "ecl"}, + "NOUN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Form": "ecl"}, + "NOUN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Form": "ecl"}, + "NOUN__Case=NomAcc|Form=Emp|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Form": "emp"}, + "NOUN__Case=NomAcc|Form=HPref|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Form": "hpref"}, + "NOUN__Case=NomAcc|Form=HPref|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Form": "hpref"}, + "NOUN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Form": "hpref"}, + "NOUN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Form": "hpref"}, + "NOUN__Case=NomAcc|Form=Len|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Form": "len"}, + "NOUN__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Form": "len"}, + "NOUN__Case=NomAcc|Form=Len|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Form": "len"}, + "NOUN__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Form": "len"}, "NOUN__Case=NomAcc|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur"}, "NOUN__Case=NomAcc|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing"}, "NOUN__Case=NomAcc|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur"}, "NOUN__Case=NomAcc|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing"}, "NOUN__Case=Voc|Definite=Def|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "voc", "Definite": "def", "Gender": "masc", "Number": "plur"}, - "NOUN__Case=Voc|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Case=Voc|Form=Len|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "plur", "Other": {"Form": "len"}}, - "NOUN__Case=Voc|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, + "NOUN__Case=Voc|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "fem", "Number": "sing", "Form": "len"}, + "NOUN__Case=Voc|Form=Len|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "plur", "Form": "len"}, + "NOUN__Case=Voc|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "sing", "Form": "len"}, "NOUN__Case=Voc|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "sing"}, "NOUN__Degree=Pos": {"pos": "NOUN", "Degree": "pos"}, "NOUN__Foreign=Yes": {"pos": "NOUN", "Foreign": "yes"}, - 
"NOUN__Form=Ecl|Number=Sing": {"pos": "NOUN", "Number": "sing", "Other": {"Form": "ecl"}}, - "NOUN__Form=Ecl|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Other": {"Form": "ecl"}}, - "NOUN__Form=Ecl|VerbForm=Vnoun": {"pos": "NOUN", "VerbForm": "vnoun", "Other": {"Form": "ecl"}}, - "NOUN__Form=HPref|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Other": {"Form": "hpref"}}, - "NOUN__Form=Len|Number=Sing": {"pos": "NOUN", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Form=Len|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Other": {"Form": "len"}}, + "NOUN__Form=Ecl|Number=Sing": {"pos": "NOUN", "Number": "sing", "Form": "ecl"}, + "NOUN__Form=Ecl|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Form": "ecl"}, + "NOUN__Form=Ecl|VerbForm=Vnoun": {"pos": "NOUN", "VerbForm": "vnoun", "Form": "ecl"}, + "NOUN__Form=HPref|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Form": "hpref"}, + "NOUN__Form=Len|Number=Sing": {"pos": "NOUN", "Number": "sing", "Form": "len"}, + "NOUN__Form=Len|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Form": "len"}, "NOUN__Gender=Fem|Number=Sing": {"pos": "NOUN", "Gender": "fem", "Number": "sing"}, - "NOUN__Number=Sing|PartType=Comp": {"pos": "NOUN", "Number": "sing", "Other": {"PartType": "comp"}}, + "NOUN__Number=Sing|PartType=Comp": {"pos": "NOUN", "Number": "sing", "PartType": "comp"}, "NOUN__Number=Sing": {"pos": "NOUN", "Number": "sing"}, "NOUN___": {"pos": "NOUN"}, "NOUN__Reflex=Yes": {"pos": "NOUN", "Reflex": "yes"}, "NOUN__VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf"}, "NOUN__VerbForm=Vnoun": {"pos": "NOUN", "VerbForm": "vnoun"}, "NUM__Definite=Def|NumType=Card": {"pos": "NUM", "Definite": "def", "NumType": "card"}, - "NUM__Form=Ecl|NumType=Card": {"pos": "NUM", "NumType": "card", "Other": {"Form": "ecl"}}, - "NUM__Form=Ecl|NumType=Ord": {"pos": "NUM", "NumType": "ord", "Other": {"Form": "ecl"}}, - "NUM__Form=HPref|NumType=Card": {"pos": "NUM", "NumType": "card", "Other": {"Form": "hpref"}}, - "NUM__Form=Len|NumType=Card": {"pos": "NUM", "NumType": "card", "Other": {"Form": "len"}}, - "NUM__Form=Len|NumType=Ord": {"pos": "NUM", "NumType": "ord", "Other": {"Form": "len"}}, + "NUM__Form=Ecl|NumType=Card": {"pos": "NUM", "NumType": "card", "Form": "ecl"}, + "NUM__Form=Ecl|NumType=Ord": {"pos": "NUM", "NumType": "ord", "Form": "ecl"}, + "NUM__Form=HPref|NumType=Card": {"pos": "NUM", "NumType": "card", "Form": "hpref"}, + "NUM__Form=Len|NumType=Card": {"pos": "NUM", "NumType": "card", "Form": "len"}, + "NUM__Form=Len|NumType=Ord": {"pos": "NUM", "NumType": "ord", "Form": "len"}, "NUM__NumType=Card": {"pos": "NUM", "NumType": "card"}, "NUM__NumType=Ord": {"pos": "NUM", "NumType": "ord"}, "NUM___": {"pos": "NUM"}, - "PART__Form=Ecl|PartType=Vb|PronType=Rel": {"pos": "PART", "PronType": "rel", "Other": {"Form": "ecl", "PartType": "vb"}}, - "PART__Mood=Imp|PartType=Vb|Polarity=Neg": {"pos": "PART", "Mood": "imp", "Polarity": "neg", "Other": {"PartType": "vb"}}, - "PART__Mood=Imp|PartType=Vb": {"pos": "PART", "Mood": "imp", "Other": {"PartType": "vb"}}, - "PART__Mood=Int|PartType=Vb|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Other": {"Mood": "int", "PartType": "vb"}}, - "PART__PartType=Ad": {"pos": "PART", "Other": {"PartType": "ad"}}, - "PART__PartType=Cmpl|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Other": {"PartType": "cmpl"}}, - "PART__PartType=Cmpl|Polarity=Neg|Tense=Past": {"pos": "PART", "Polarity": "neg", "Tense": "past", "Other": {"PartType": "cmpl"}}, - "PART__PartType=Cmpl": {"pos": "PART", "Other": {"PartType": 
"cmpl"}}, - "PART__PartType=Comp": {"pos": "PART", "Other": {"PartType": "comp"}}, - "PART__PartType=Cop|PronType=Rel": {"pos": "PART", "PronType": "rel", "Other": {"PartType": "cop"}}, - "PART__PartType=Deg": {"pos": "PART", "Other": {"PartType": "deg"}}, + "PART__Form=Ecl|PartType=Vb|PronType=Rel": {"pos": "PART", "PronType": "rel", "Form": "ecl", "PartType": "vb"}, + "PART__Mood=Imp|PartType=Vb|Polarity=Neg": {"pos": "PART", "Mood": "imp", "Polarity": "neg", "PartType": "vb"}, + "PART__Mood=Imp|PartType=Vb": {"pos": "PART", "Mood": "imp", "PartType": "vb"}, + "PART__Mood=Int|PartType=Vb|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Mood": "int", "PartType": "vb"}, + "PART__PartType=Ad": {"pos": "PART", "PartType": "ad"}, + "PART__PartType=Cmpl|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "PartType": "cmpl"}, + "PART__PartType=Cmpl|Polarity=Neg|Tense=Past": {"pos": "PART", "Polarity": "neg", "Tense": "past", "PartType": "cmpl"}, + "PART__PartType=Cmpl": {"pos": "PART", "PartType": "cmpl"}, + "PART__PartType=Comp": {"pos": "PART", "PartType": "comp"}, + "PART__PartType=Cop|PronType=Rel": {"pos": "PART", "PronType": "rel", "PartType": "cop"}, + "PART__PartType=Deg": {"pos": "PART", "PartType": "deg"}, "PART__PartType=Inf": {"pos": "PART", "PartType": "inf"}, - "PART__PartType=Num": {"pos": "PART", "Other": {"PartType": "num"}}, - "PART__PartType=Pat": {"pos": "PART", "Other": {"PartType": "pat"}}, - "PART__PartType=Vb|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb|Polarity=Neg|PronType=Rel": {"pos": "PART", "Polarity": "neg", "PronType": "rel", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb|Polarity=Neg|PronType=Rel|Tense=Past": {"pos": "PART", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb|Polarity=Neg|Tense=Past": {"pos": "PART", "Polarity": "neg", "Tense": "past", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb": {"pos": "PART", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb|PronType=Rel": {"pos": "PART", "PronType": "rel", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb|PronType=Rel|Tense=Past": {"pos": "PART", "PronType": "rel", "Tense": "past", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb|Tense=Past": {"pos": "PART", "Tense": "past", "Other": {"PartType": "vb"}}, - "PART__PartType=Voc": {"pos": "PART", "Other": {"PartType": "voc"}}, + "PART__PartType=Num": {"pos": "PART", "PartType": "num"}, + "PART__PartType=Pat": {"pos": "PART", "PartType": "pat"}, + "PART__PartType=Vb|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "PartType": "vb"}, + "PART__PartType=Vb|Polarity=Neg|PronType=Rel": {"pos": "PART", "Polarity": "neg", "PronType": "rel", "PartType": "vb"}, + "PART__PartType=Vb|Polarity=Neg|PronType=Rel|Tense=Past": {"pos": "PART", "Polarity": "neg", "PronType": "rel", "Tense": "past", "PartType": "vb"}, + "PART__PartType=Vb|Polarity=Neg|Tense=Past": {"pos": "PART", "Polarity": "neg", "Tense": "past", "PartType": "vb"}, + "PART__PartType=Vb": {"pos": "PART", "PartType": "vb"}, + "PART__PartType=Vb|PronType=Rel": {"pos": "PART", "PronType": "rel", "PartType": "vb"}, + "PART__PartType=Vb|PronType=Rel|Tense=Past": {"pos": "PART", "PronType": "rel", "Tense": "past", "PartType": "vb"}, + "PART__PartType=Vb|Tense=Past": {"pos": "PART", "Tense": "past", "PartType": "vb"}, + "PART__PartType=Voc": {"pos": "PART", "PartType": "voc"}, "PART___": {"pos": "PART"}, "PART__PronType=Rel": {"pos": "PART", "PronType": "rel"}, - 
"PRON__Form=Len|Number=Sing|Person=2": {"pos": "PRON", "Number": "sing", "Person": 2, "Other": {"Form": "len"}}, - "PRON__Form=Len|PronType=Ind": {"pos": "PRON", "PronType": "ind", "Other": {"Form": "len"}}, + "PRON__Form=Len|Number=Sing|Person=2": {"pos": "PRON", "Number": "sing", "Person": 2, "Form": "len"}, + "PRON__Form=Len|PronType=Ind": {"pos": "PRON", "PronType": "ind", "Form": "len"}, "PRON__Gender=Fem|Number=Sing|Person=3": {"pos": "PRON", "Gender": "fem", "Number": "sing", "Person": 3}, "PRON__Gender=Masc|Number=Sing|Person=3": {"pos": "PRON", "Gender": "masc", "Number": "sing", "Person": 3}, "PRON__Gender=Masc|Number=Sing|Person=3|PronType=Emp": {"pos": "PRON", "Gender": "masc", "Number": "sing", "Person": 3, "PronType": "emp"}, @@ -235,103 +232,103 @@ TAG_MAP = { "PRON__PronType=Ind": {"pos": "PRON", "PronType": "ind"}, "PRON__PronType=Int": {"pos": "PRON", "PronType": "int"}, "PRON__Reflex=Yes": {"pos": "PRON", "Reflex": "yes"}, - "PROPN__Abbr=Yes": {"pos": "PROPN", "Other": {"Abbr": "yes"}}, + "PROPN__Abbr=Yes": {"pos": "PROPN", "Abbr": "yes"}, "PROPN__Case=Dat|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "dat", "Gender": "fem", "Number": "sing"}, "PROPN__Case=Gen|Definite=Def|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "sing"}, - "PROPN__Case=Gen|Form=Ecl|Gender=Fem|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"Form": "ecl"}}, - "PROPN__Case=Gen|Form=Ecl|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl"}}, - "PROPN__Case=Gen|Form=HPref|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "hpref"}}, - "PROPN__Case=Gen|Form=Len|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "PROPN__Case=Gen|Form=Len|Gender=Fem": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Other": {"Form": "len"}}, - "PROPN__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, - "PROPN__Case=Gen|Form=Len|Gender=Masc": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Other": {"Form": "len"}}, + "PROPN__Case=Gen|Form=Ecl|Gender=Fem|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "plur", "Form": "ecl"}, + "PROPN__Case=Gen|Form=Ecl|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "plur", "Form": "ecl"}, + "PROPN__Case=Gen|Form=HPref|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing", "Form": "hpref"}, + "PROPN__Case=Gen|Form=Len|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing", "Form": "len"}, + "PROPN__Case=Gen|Form=Len|Gender=Fem": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Form": "len"}, + "PROPN__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "sing", "Form": "len"}, + "PROPN__Case=Gen|Form=Len|Gender=Masc": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Form": "len"}, "PROPN__Case=Gen|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing"}, "PROPN__Case=Gen|Gender=Fem": {"pos": "PROPN", "Case": "gen", "Gender": "fem"}, - "PROPN__Case=Gen|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"NounType": 
"weak"}}, + "PROPN__Case=Gen|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "plur", "NounType": "weak"}, "PROPN__Case=Gen|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "sing"}, "PROPN__Case=Gen|Gender=Masc": {"pos": "PROPN", "Case": "gen", "Gender": "masc"}, "PROPN__Case=NomAcc|Definite=Def|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Definite": "def", "Gender": "fem", "Number": "sing"}, "PROPN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "plur"}, "PROPN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "sing"}, - "PROPN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}}, - "PROPN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "ecl"}}, - "PROPN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "hpref"}}, - "PROPN__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "PROPN__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, + "PROPN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Form": "ecl"}, + "PROPN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Form": "ecl"}, + "PROPN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Form": "hpref"}, + "PROPN__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Form": "len"}, + "PROPN__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Form": "len"}, "PROPN__Case=NomAcc|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing"}, "PROPN__Case=NomAcc|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "plur"}, "PROPN__Case=NomAcc|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing"}, "PROPN__Case=NomAcc|Gender=Masc": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc"}, - "PROPN__Case=Voc|Form=Len|Gender=Fem": {"pos": "PROPN", "Case": "voc", "Gender": "fem", "Other": {"Form": "len"}}, + "PROPN__Case=Voc|Form=Len|Gender=Fem": {"pos": "PROPN", "Case": "voc", "Gender": "fem", "Form": "len"}, "PROPN__Case=Voc|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "voc", "Gender": "masc", "Number": "sing"}, "PROPN__Gender=Masc|Number=Sing": {"pos": "PROPN", "Gender": "masc", "Number": "sing"}, "PROPN___": {"pos": "PROPN"}, "PUNCT___": {"pos": "PUNCT"}, "SCONJ___": {"pos": "SCONJ"}, - "SCONJ__Tense=Past|VerbForm=Cop": {"pos": "SCONJ", "Tense": "past", "Other": {"VerbForm": "cop"}}, - "SCONJ__VerbForm=Cop": {"pos": "SCONJ", "Other": {"VerbForm": "cop"}}, - "SYM__Abbr=Yes": {"pos": "SYM", "Other": {"Abbr": "yes"}}, + "SCONJ__Tense=Past|VerbForm=Cop": {"pos": "SCONJ", 
"Tense": "past", "VerbForm": "cop"}, + "SCONJ__VerbForm=Cop": {"pos": "SCONJ", "VerbForm": "cop"}, + "SYM__Abbr=Yes": {"pos": "SYM", "Abbr": "yes"}, "VERB__Case=NomAcc|Gender=Masc|Mood=Ind|Number=Sing|Tense=Pres": {"pos": "VERB", "Case": "nom|acc", "Gender": "masc", "Mood": "ind", "Number": "sing", "Tense": "pres"}, - "VERB__Dialect=Munster|Form=Len|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Dialect": "munster", "Form": "len"}}, + "VERB__Dialect=Munster|Form=Len|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Dialect": "munster", "Form": "len"}, "VERB__Foreign=Yes": {"pos": "VERB", "Foreign": "yes"}, - "VERB__Form=Ecl|Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1, "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Cnd|Polarity=Neg": {"pos": "VERB", "Mood": "cnd", "Polarity": "neg", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Cnd": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "ecl", "Voice": "auto"}}, - "VERB__Form=Ecl|Mood=Imp|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "imp", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "ecl", "Voice": "auto"}}, - "VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "ecl", "Voice": "auto"}}, - "VERB__Form=Ecl|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "ecl", "Voice": "auto"}}, - "VERB__Form=Ecl|Mood=Sub|Tense=Pres": {"pos": "VERB", "Mood": "sub", "Tense": "pres", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl": {"pos": "VERB", "Other": {"Form": "ecl"}}, - "VERB__Form=Emp|Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres", "Other": {"Form": "emp"}}, - 
"VERB__Form=Emp|Mood=Ind|Number=Sing|Person=1|PronType=Rel|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "PronType": "rel", "Tense": "pres", "Other": {"Form": "emp"}}, - "VERB__Form=Emp|Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres", "Other": {"Form": "emp"}}, - "VERB__Form=Len|Mood=Cnd|Number=Plur|Person=3": {"pos": "VERB", "Mood": "cnd", "Number": "plur", "Person": 3, "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1, "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Cnd|Number=Sing|Person=2": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 2, "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Cnd|Polarity=Neg": {"pos": "VERB", "Mood": "cnd", "Polarity": "neg", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Cnd": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Imp|Number=Plur|Person=3|Tense=Past": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 3, "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Imp|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Imp|Voice=Auto": {"pos": "VERB", "Mood": "imp", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Ind|Number=Plur|Person=1|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "fut", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Number=Plur|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Number=Plur|Person=3|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 3, "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Polarity": "neg", "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len"}}, - 
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Sub|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "sub", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len"}}, - "VERB__Form=Len|Polarity=Neg": {"pos": "VERB", "Polarity": "neg", "Other": {"Form": "len"}}, - "VERB__Form=Len": {"pos": "VERB", "Other": {"Form": "len"}}, + "VERB__Form=Ecl|Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1, "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Cnd|Polarity=Neg": {"pos": "VERB", "Mood": "cnd", "Polarity": "neg", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Cnd": {"pos": "VERB", "Mood": "cnd", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Form": "ecl", "Voice": "auto"}, + "VERB__Form=Ecl|Mood=Imp|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "imp", "Number": "sing", "Person": 1, "Tense": "past", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Form": "ecl", "Voice": "auto"}, + "VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Form": "ecl", "Voice": "auto"}, + "VERB__Form=Ecl|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Form": "ecl"}, + "VERB__Form=Ecl|Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Form": "ecl", "Voice": "auto"}, + 
"VERB__Form=Ecl|Mood=Sub|Tense=Pres": {"pos": "VERB", "Mood": "sub", "Tense": "pres", "Form": "ecl"}, + "VERB__Form=Ecl": {"pos": "VERB", "Form": "ecl"}, + "VERB__Form=Emp|Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres", "Form": "emp"}, + "VERB__Form=Emp|Mood=Ind|Number=Sing|Person=1|PronType=Rel|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "PronType": "rel", "Tense": "pres", "Form": "emp"}, + "VERB__Form=Emp|Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres", "Form": "emp"}, + "VERB__Form=Len|Mood=Cnd|Number=Plur|Person=3": {"pos": "VERB", "Mood": "cnd", "Number": "plur", "Person": 3, "Form": "len"}, + "VERB__Form=Len|Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1, "Form": "len"}, + "VERB__Form=Len|Mood=Cnd|Number=Sing|Person=2": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 2, "Form": "len"}, + "VERB__Form=Len|Mood=Cnd|Polarity=Neg": {"pos": "VERB", "Mood": "cnd", "Polarity": "neg", "Form": "len"}, + "VERB__Form=Len|Mood=Cnd": {"pos": "VERB", "Mood": "cnd", "Form": "len"}, + "VERB__Form=Len|Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Form": "len", "Voice": "auto"}, + "VERB__Form=Len|Mood=Imp|Number=Plur|Person=3|Tense=Past": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 3, "Tense": "past", "Form": "len"}, + "VERB__Form=Len|Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Form": "len"}, + "VERB__Form=Len|Mood=Imp|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Form": "len", "Voice": "auto"}, + "VERB__Form=Len|Mood=Imp|Voice=Auto": {"pos": "VERB", "Mood": "imp", "Form": "len", "Voice": "auto"}, + "VERB__Form=Len|Mood=Ind|Number=Plur|Person=1|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "fut", "Form": "len"}, + "VERB__Form=Len|Mood=Ind|Number=Plur|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "past", "Form": "len"}, + "VERB__Form=Len|Mood=Ind|Number=Plur|Person=3|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 3, "Tense": "past", "Form": "len"}, + "VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Polarity": "neg", "Tense": "past", "Form": "len"}, + "VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Polarity": "neg", "Tense": "pres", "Form": "len"}, + "VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Form": "len"}, + "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Form": "len"}, + "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Form": "len", "Voice": "auto"}, + "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Form": "len"}, + "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Form": "len", "Voice": "auto"}, + "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Form": "len"}, + 
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Form": "len", "Voice": "auto"}, + "VERB__Form=Len|Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Form": "len"}, + "VERB__Form=Len|Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Form": "len", "Voice": "auto"}, + "VERB__Form=Len|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Form": "len"}, + "VERB__Form=Len|Mood=Ind|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Form": "len", "Voice": "auto"}, + "VERB__Form=Len|Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Form": "len"}, + "VERB__Form=Len|Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Form": "len", "Voice": "auto"}, + "VERB__Form=Len|Mood=Sub|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "sub", "Polarity": "neg", "Tense": "pres", "Form": "len"}, + "VERB__Form=Len|Polarity=Neg": {"pos": "VERB", "Polarity": "neg", "Form": "len"}, + "VERB__Form=Len": {"pos": "VERB", "Form": "len"}, "VERB__Mood=Cnd|Number=Plur|Person=3": {"pos": "VERB", "Mood": "cnd", "Number": "plur", "Person": 3}, "VERB__Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1}, "VERB__Mood=Cnd": {"pos": "VERB", "Mood": "cnd"}, - "VERB__Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Other": {"Voice": "auto"}}, + "VERB__Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Voice": "auto"}, "VERB__Mood=Imp|Number=Plur|Person=1|Polarity=Neg": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 1, "Polarity": "neg"}, "VERB__Mood=Imp|Number=Plur|Person=1": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 1}, "VERB__Mood=Imp|Number=Plur|Person=2": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 2}, @@ -341,28 +338,28 @@ TAG_MAP = { "VERB__Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres"}, "VERB__Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past"}, "VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres"}, - "VERB__Mood=Ind|Polarity=Neg|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Voice": "auto"}}, + "VERB__Mood=Ind|Polarity=Neg|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Voice": "auto"}, "VERB__Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres"}, "VERB__Mood=Ind|PronType=Rel|Tense=Fut": {"pos": "VERB", "Mood": "ind", "PronType": "rel", "Tense": "fut"}, "VERB__Mood=Ind|PronType=Rel|Tense=Pres": {"pos": "VERB", "Mood": "ind", "PronType": "rel", "Tense": "pres"}, "VERB__Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut"}, - "VERB__Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Voice": "auto"}}, + "VERB__Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Voice": "auto"}, "VERB__Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past"}, - "VERB__Mood=Ind|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Voice": "auto"}}, + "VERB__Mood=Ind|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Voice": "auto"}, 
"VERB__Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres"}, - "VERB__Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Voice": "auto"}}, + "VERB__Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Voice": "auto"}, "VERB___": {"pos": "VERB"}, - "X__Abbr=Yes": {"pos": "X", "Other": {"Abbr": "yes"}}, + "X__Abbr=Yes": {"pos": "X", "Abbr": "yes"}, "X__Case=NomAcc|Foreign=Yes|Gender=Fem|Number=Sing": {"pos": "X", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Foreign": "yes"}, - "X__Definite=Def|Dialect=Ulster": {"pos": "X", "Definite": "def", "Other": {"Dialect": "ulster"}}, - "X__Dialect=Munster|Form=Len|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "X", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Dialect": "munster", "Form": "len"}}, - "X__Dialect=Munster|Mood=Imp|Number=Sing|Person=2|Polarity=Neg": {"pos": "X", "Mood": "imp", "Number": "sing", "Person": 2, "Polarity": "neg", "Other": {"Dialect": "munster"}}, - "X__Dialect=Munster|Mood=Ind|Tense=Past|Voice=Auto": {"pos": "X", "Mood": "ind", "Tense": "past", "Other": {"Dialect": "munster", "Voice": "auto"}}, - "X__Dialect=Munster": {"pos": "X", "Other": {"Dialect": "munster"}}, - "X__Dialect=Munster|PronType=Dem": {"pos": "X", "PronType": "dem", "Other": {"Dialect": "munster"}}, - "X__Dialect=Ulster|Gender=Masc|Number=Sing|Person=3": {"pos": "X", "Gender": "masc", "Number": "sing", "Person": 3, "Other": {"Dialect": "ulster"}}, - "X__Dialect=Ulster|PartType=Vb|Polarity=Neg": {"pos": "X", "Polarity": "neg", "Other": {"Dialect": "ulster", "PartType": "vb"}}, - "X__Dialect=Ulster|VerbForm=Cop": {"pos": "X", "Other": {"Dialect": "ulster", "VerbForm": "cop"}}, + "X__Definite=Def|Dialect=Ulster": {"pos": "X", "Definite": "def", "Dialect": "ulster"}, + "X__Dialect=Munster|Form=Len|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "X", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Dialect": "munster", "Form": "len"}, + "X__Dialect=Munster|Mood=Imp|Number=Sing|Person=2|Polarity=Neg": {"pos": "X", "Mood": "imp", "Number": "sing", "Person": 2, "Polarity": "neg", "Dialect": "munster"}, + "X__Dialect=Munster|Mood=Ind|Tense=Past|Voice=Auto": {"pos": "X", "Mood": "ind", "Tense": "past", "Dialect": "munster", "Voice": "auto"}, + "X__Dialect=Munster": {"pos": "X", "Dialect": "munster"}, + "X__Dialect=Munster|PronType=Dem": {"pos": "X", "PronType": "dem", "Dialect": "munster"}, + "X__Dialect=Ulster|Gender=Masc|Number=Sing|Person=3": {"pos": "X", "Gender": "masc", "Number": "sing", "Person": 3, "Dialect": "ulster"}, + "X__Dialect=Ulster|PartType=Vb|Polarity=Neg": {"pos": "X", "Polarity": "neg", "Dialect": "ulster", "PartType": "vb"}, + "X__Dialect=Ulster|VerbForm=Cop": {"pos": "X", "Dialect": "ulster", "VerbForm": "cop"}, "X__Foreign=Yes": {"pos": "X", "Foreign": "yes"}, "X___": {"pos": "X"} } diff --git a/spacy/lang/ga/tokenizer_exceptions.py b/spacy/lang/ga/tokenizer_exceptions.py index c0e53f522..0c587c67e 100644 --- a/spacy/lang/ga/tokenizer_exceptions.py +++ b/spacy/lang/ga/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, DET, ADP, CCONJ, ADV, NOUN, X, AUX from ...symbols import ORTH, LEMMA, NORM diff --git a/spacy/lang/he/__init__.py b/spacy/lang/he/__init__.py index 411cdf107..0d324f64c 100644 --- a/spacy/lang/he/__init__.py +++ b/spacy/lang/he/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals 
- from .stop_words import STOP_WORDS from ..tokenizer_exceptions import BASE_EXCEPTIONS diff --git a/spacy/lang/he/examples.py b/spacy/lang/he/examples.py index 34cd157ae..d54d2a145 100644 --- a/spacy/lang/he/examples.py +++ b/spacy/lang/he/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/he/stop_words.py b/spacy/lang/he/stop_words.py index a01ec4246..2745460a7 100644 --- a/spacy/lang/he/stop_words.py +++ b/spacy/lang/he/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ אני diff --git a/spacy/lang/hi/__init__.py b/spacy/lang/hi/__init__.py index b0d45ddf3..9a96de95c 100644 --- a/spacy/lang/hi/__init__.py +++ b/spacy/lang/hi/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/hi/examples.py b/spacy/lang/hi/examples.py index 1dd182532..ecb0b328c 100644 --- a/spacy/lang/hi/examples.py +++ b/spacy/lang/hi/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/hi/lex_attrs.py b/spacy/lang/hi/lex_attrs.py index 12666d96a..20a8c2975 100644 --- a/spacy/lang/hi/lex_attrs.py +++ b/spacy/lang/hi/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..norm_exceptions import BASE_NORMS from ...attrs import NORM, LIKE_NUM diff --git a/spacy/lang/hi/stop_words.py b/spacy/lang/hi/stop_words.py index efad18c84..475b07da1 100644 --- a/spacy/lang/hi/stop_words.py +++ b/spacy/lang/hi/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/taranjeet/hindi-tokenizer/blob/master/stopwords.txt, https://data.mendeley.com/datasets/bsr3frvvjc/1#file-a21d5092-99d7-45d8-b044-3ae9edd391c6 STOP_WORDS = set( diff --git a/spacy/lang/hr/__init__.py b/spacy/lang/hr/__init__.py index 539b164d7..fbc66ece0 100644 --- a/spacy/lang/hr/__init__.py +++ b/spacy/lang/hr/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ..tokenizer_exceptions import BASE_EXCEPTIONS diff --git a/spacy/lang/hr/examples.py b/spacy/lang/hr/examples.py index dc52ce4f0..b28fb63c2 100644 --- a/spacy/lang/hr/examples.py +++ b/spacy/lang/hr/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. 
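The Irish (ga) tag map hunks above all apply one mechanical change: morphological features such as `Form`, `NounType`, `PartType`, `PrepForm`, `Dialect`, `Voice`, `Abbr`, `Mood=Int` and `VerbForm=Cop` are promoted from the nested `"Other"` dict to top-level keys of each `TAG_MAP` entry. As a purely illustrative sketch (not part of this diff, and not the tooling used to produce it), the transformation amounts to the following:

```python
def flatten_other(tag_map):
    """Promote features nested under "Other" to top-level keys."""
    flattened = {}
    for tag, features in tag_map.items():
        features = dict(features)          # copy so the input is not mutated
        other = features.pop("Other", {})  # e.g. {"Form": "len", "NounType": "strong"}
        # keep explicit top-level values if a key somehow appears in both places
        flattened[tag] = {**other, **features}
    return flattened


# One entry in the shape shown in the removed ("-") lines above:
old = {"ADV__Form=Len": {"pos": "ADV", "Other": {"Form": "len"}}}
assert flatten_other(old) == {"ADV__Form=Len": {"pos": "ADV", "Form": "len"}}
```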
diff --git a/spacy/lang/hr/stop_words.py b/spacy/lang/hr/stop_words.py index 408b802c5..dd10f792d 100644 --- a/spacy/lang/hr/stop_words.py +++ b/spacy/lang/hr/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-hr STOP_WORDS = set( """ diff --git a/spacy/lang/hu/__init__.py b/spacy/lang/hu/__init__.py index a331adc5b..df3fe4a44 100644 --- a/spacy/lang/hu/__init__.py +++ b/spacy/lang/hu/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES from .stop_words import STOP_WORDS diff --git a/spacy/lang/hu/examples.py b/spacy/lang/hu/examples.py index 3267887fe..711a438bd 100644 --- a/spacy/lang/hu/examples.py +++ b/spacy/lang/hu/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/hu/punctuation.py b/spacy/lang/hu/punctuation.py index bc043486f..1fea6d510 100644 --- a/spacy/lang/hu/punctuation.py +++ b/spacy/lang/hu/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CONCAT_QUOTES from ..char_classes import CONCAT_ICONS, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER diff --git a/spacy/lang/hu/stop_words.py b/spacy/lang/hu/stop_words.py index c9a217dd6..e39a26d35 100644 --- a/spacy/lang/hu/stop_words.py +++ b/spacy/lang/hu/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ a abban ahhoz ahogy ahol aki akik akkor akár alatt amely amelyek amelyekben diff --git a/spacy/lang/hu/tokenizer_exceptions.py b/spacy/lang/hu/tokenizer_exceptions.py index c18a2cec2..cc5eede17 100644 --- a/spacy/lang/hu/tokenizer_exceptions.py +++ b/spacy/lang/hu/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import re from ..punctuation import ALPHA_LOWER, CURRENCY diff --git a/spacy/lang/id/__init__.py b/spacy/lang/id/__init__.py index 8e2266a40..db576f4eb 100644 --- a/spacy/lang/id/__init__.py +++ b/spacy/lang/id/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS diff --git a/spacy/lang/id/_tokenizer_exceptions_list.py b/spacy/lang/id/_tokenizer_exceptions_list.py index fec878d5a..a0b35fa1a 100644 --- a/spacy/lang/id/_tokenizer_exceptions_list.py +++ b/spacy/lang/id/_tokenizer_exceptions_list.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - ID_BASE_EXCEPTIONS = set( """ aba-aba diff --git a/spacy/lang/id/examples.py b/spacy/lang/id/examples.py index 56ac9165e..1069232ff 100644 --- a/spacy/lang/id/examples.py +++ b/spacy/lang/id/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/id/lex_attrs.py b/spacy/lang/id/lex_attrs.py index 1d4584ae3..3167f4659 100644 --- a/spacy/lang/id/lex_attrs.py +++ b/spacy/lang/id/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import unicodedata from .punctuation import LIST_CURRENCY diff --git a/spacy/lang/id/punctuation.py b/spacy/lang/id/punctuation.py index e4794d42b..f6c2387d8 100644 --- a/spacy/lang/id/punctuation.py +++ b/spacy/lang/id/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES from ..char_classes import ALPHA, merge_chars, split_chars, _currency, _units diff --git a/spacy/lang/id/stop_words.py b/spacy/lang/id/stop_words.py index 0a9f91947..b1bfaea79 100644 --- a/spacy/lang/id/stop_words.py +++ b/spacy/lang/id/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - STOP_WORDS = set( """ ada adalah adanya adapun agak agaknya agar akan akankah akhir akhiri akhirnya diff --git a/spacy/lang/id/syntax_iterators.py b/spacy/lang/id/syntax_iterators.py index 2ed2c1b35..c09b0e840 100644 --- a/spacy/lang/id/syntax_iterators.py +++ b/spacy/lang/id/syntax_iterators.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import NOUN, PROPN, PRON from ...errors import Errors diff --git a/spacy/lang/id/tag_map.py b/spacy/lang/id/tag_map.py index 16391a840..3bd08e96a 100644 --- a/spacy/lang/id/tag_map.py +++ b/spacy/lang/id/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB from ...symbols import NOUN, PRON, AUX, SCONJ, INTJ, PART, PROPN diff --git a/spacy/lang/id/tokenizer_exceptions.py b/spacy/lang/id/tokenizer_exceptions.py index 86fe611bf..5259bddf8 100644 --- a/spacy/lang/id/tokenizer_exceptions.py +++ b/spacy/lang/id/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS from ...symbols import ORTH, LEMMA, NORM diff --git a/spacy/lang/is/__init__.py b/spacy/lang/is/__init__.py index 18e41432d..cdcfd6e71 100644 --- a/spacy/lang/is/__init__.py +++ b/spacy/lang/is/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language from ...attrs import LANG diff --git a/spacy/lang/is/stop_words.py b/spacy/lang/is/stop_words.py index e4ae0498b..917fb6df4 100644 --- a/spacy/lang/is/stop_words.py +++ b/spacy/lang/is/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/Xangis/extra-stopwords STOP_WORDS = set( diff --git a/spacy/lang/it/__init__.py b/spacy/lang/it/__init__.py index 06d146748..7ca79a35c 100644 --- a/spacy/lang/it/__init__.py +++ b/spacy/lang/it/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .tag_map import TAG_MAP from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS diff --git a/spacy/lang/it/examples.py b/spacy/lang/it/examples.py index af66b7eca..506721276 100644 --- a/spacy/lang/it/examples.py +++ b/spacy/lang/it/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/it/punctuation.py b/spacy/lang/it/punctuation.py index 1d641f144..f01ab4f0d 100644 --- a/spacy/lang/it/punctuation.py +++ b/spacy/lang/it/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES from ..char_classes import LIST_ELLIPSES, LIST_ICONS from ..char_classes import ALPHA, HYPHENS, CONCAT_QUOTES diff --git a/spacy/lang/it/stop_words.py b/spacy/lang/it/stop_words.py index 84233d381..e97613912 100644 --- a/spacy/lang/it/stop_words.py +++ b/spacy/lang/it/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ a abbastanza abbia abbiamo abbiano abbiate accidenti ad adesso affinche agl diff --git a/spacy/lang/it/tag_map.py b/spacy/lang/it/tag_map.py index 798c45d80..ce0e1d9ee 100644 --- a/spacy/lang/it/tag_map.py +++ b/spacy/lang/it/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, SCONJ, AUX, CONJ diff --git a/spacy/lang/it/tokenizer_exceptions.py b/spacy/lang/it/tokenizer_exceptions.py index 70519ba6a..7237443b5 100644 --- a/spacy/lang/it/tokenizer_exceptions.py +++ b/spacy/lang/it/tokenizer_exceptions.py @@ -1,5 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals from ...symbols import ORTH, LEMMA _exc = { diff --git a/spacy/lang/ja/__init__.py b/spacy/lang/ja/__init__.py index 22590043f..d1ce651d7 100644 --- a/spacy/lang/ja/__init__.py +++ b/spacy/lang/ja/__init__.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals, print_function - import re from collections import namedtuple diff --git a/spacy/lang/ja/examples.py b/spacy/lang/ja/examples.py index e00001ed5..c3a011862 100644 --- a/spacy/lang/ja/examples.py +++ b/spacy/lang/ja/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/ja/stop_words.py b/spacy/lang/ja/stop_words.py index bb232a2d2..98560d7e2 100644 --- a/spacy/lang/ja/stop_words.py +++ b/spacy/lang/ja/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - # This list was created by taking the top 2000 words from a Wikipedia dump and # filtering out everything that wasn't hiragana. ー (one) was also added. 
# Considered keeping some non-hiragana words but too many place names were diff --git a/spacy/lang/ja/tag_map.py b/spacy/lang/ja/tag_map.py index 4ff0a35ee..d922cd22b 100644 --- a/spacy/lang/ja/tag_map.py +++ b/spacy/lang/ja/tag_map.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, INTJ, X, ADJ, AUX, ADP, PART, SCONJ, NOUN from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET, SPACE diff --git a/spacy/lang/kn/__init__.py b/spacy/lang/kn/__init__.py index c86354248..ef3b10f81 100644 --- a/spacy/lang/kn/__init__.py +++ b/spacy/lang/kn/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language from ...attrs import LANG diff --git a/spacy/lang/kn/stop_words.py b/spacy/lang/kn/stop_words.py index 652341e73..dba9740af 100644 --- a/spacy/lang/kn/stop_words.py +++ b/spacy/lang/kn/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ ಹಲವು diff --git a/spacy/lang/ko/__init__.py b/spacy/lang/ko/__init__.py index ec79a95ab..4ecdfbc58 100644 --- a/spacy/lang/ko/__init__.py +++ b/spacy/lang/ko/__init__.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals, print_function - from .stop_words import STOP_WORDS from .tag_map import TAG_MAP from ...attrs import LANG diff --git a/spacy/lang/ko/examples.py b/spacy/lang/ko/examples.py index 0306e5db8..edb755eaa 100644 --- a/spacy/lang/ko/examples.py +++ b/spacy/lang/ko/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/ko/lex_attrs.py b/spacy/lang/ko/lex_attrs.py index 1904a0ece..ac5bc7e48 100644 --- a/spacy/lang/ko/lex_attrs.py +++ b/spacy/lang/ko/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/ko/stop_words.py b/spacy/lang/ko/stop_words.py index 676dca1b4..3eba9fc82 100644 --- a/spacy/lang/ko/stop_words.py +++ b/spacy/lang/ko/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - STOP_WORDS = set( """ 이 diff --git a/spacy/lang/ko/tag_map.py b/spacy/lang/ko/tag_map.py index 57317c969..26a8c56b9 100644 --- a/spacy/lang/ko/tag_map.py +++ b/spacy/lang/ko/tag_map.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, INTJ, X, SYM, ADJ, AUX, ADP, CONJ, NOUN, PRON from ...symbols import VERB, ADV, PROPN, NUM, DET diff --git a/spacy/lang/lb/__init__.py b/spacy/lang/lb/__init__.py index 8d85b8fc7..8235e7610 100644 --- a/spacy/lang/lb/__init__.py +++ b/spacy/lang/lb/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_INFIXES from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/lb/examples.py b/spacy/lang/lb/examples.py index 3cbba31d9..a7a10489c 100644 --- a/spacy/lang/lb/examples.py +++ b/spacy/lang/lb/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/lb/lex_attrs.py b/spacy/lang/lb/lex_attrs.py index e38c74974..d2d50d9dc 100644 --- a/spacy/lang/lb/lex_attrs.py +++ b/spacy/lang/lb/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/lb/punctuation.py b/spacy/lang/lb/punctuation.py index 2a4587856..e382c56c5 100644 --- a/spacy/lang/lb/punctuation.py +++ b/spacy/lang/lb/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_ICONS, ALPHA, ALPHA_LOWER, ALPHA_UPPER ELISION = " ' ’ ".strip().replace(" ", "") diff --git a/spacy/lang/lb/stop_words.py b/spacy/lang/lb/stop_words.py index 41e6f79d2..8f22ea6e6 100644 --- a/spacy/lang/lb/stop_words.py +++ b/spacy/lang/lb/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - STOP_WORDS = set( """ a diff --git a/spacy/lang/lb/tag_map.py b/spacy/lang/lb/tag_map.py index 424a83bb4..cd2e8b93c 100644 --- a/spacy/lang/lb/tag_map.py +++ b/spacy/lang/lb/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, ADJ, CONJ, NUM, DET, ADV, ADP, X, VERB from ...symbols import NOUN, PART, SPACE, AUX diff --git a/spacy/lang/lb/tokenizer_exceptions.py b/spacy/lang/lb/tokenizer_exceptions.py index 1c9b2dde3..070bb0fd7 100644 --- a/spacy/lang/lb/tokenizer_exceptions.py +++ b/spacy/lang/lb/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA, NORM # TODO diff --git a/spacy/lang/lex_attrs.py b/spacy/lang/lex_attrs.py index c9cd82d7b..0310b2b36 100644 --- a/spacy/lang/lex_attrs.py +++ b/spacy/lang/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import unicodedata import re diff --git a/spacy/lang/lij/__init__.py b/spacy/lang/lij/__init__.py index 9b4b29798..a75f081bf 100644 --- a/spacy/lang/lij/__init__.py +++ b/spacy/lang/lij/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_INFIXES diff --git a/spacy/lang/lij/examples.py b/spacy/lang/lij/examples.py index c4034ae7e..ba7fe43fd 100644 --- a/spacy/lang/lij/examples.py +++ b/spacy/lang/lij/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
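Several of the `lex_attrs.py` hunks in this stretch (Luxembourgish above, plus the shared `spacy/lang/lex_attrs.py`) keep only the `LIKE_NUM` import once the headers are gone. A self-contained sketch of that pattern, using the absolute import so the snippet runs standalone and a hypothetical sample word list rather than the module's real one:

```python
from spacy.attrs import LIKE_NUM

# Hypothetical sample; the real modules ship full number-word lists.
_num_words = ["null", "eent", "zwee", "dräi"]


def like_num(text):
    # Strip common thousands/decimal separators before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "3/4" also count as number-like.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _num_words


LEX_ATTRS = {LIKE_NUM: like_num}

print(like_num("10.000"), like_num("dräi"), like_num("Haus"))  # True True False
```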
diff --git a/spacy/lang/lij/punctuation.py b/spacy/lang/lij/punctuation.py index 4439376c8..d50b75589 100644 --- a/spacy/lang/lij/punctuation.py +++ b/spacy/lang/lij/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_INFIXES from ..char_classes import ALPHA diff --git a/spacy/lang/lij/stop_words.py b/spacy/lang/lij/stop_words.py index ffd53370d..1d6f09d27 100644 --- a/spacy/lang/lij/stop_words.py +++ b/spacy/lang/lij/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ a à â a-a a-e a-i a-o aiva aloa an ancheu ancon apreuvo ascì atra atre atri atro avanti avei diff --git a/spacy/lang/lij/tokenizer_exceptions.py b/spacy/lang/lij/tokenizer_exceptions.py index 2109add62..2befabca3 100644 --- a/spacy/lang/lij/tokenizer_exceptions.py +++ b/spacy/lang/lij/tokenizer_exceptions.py @@ -1,5 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals from ...symbols import ORTH, LEMMA _exc = {} diff --git a/spacy/lang/lt/__init__.py b/spacy/lang/lt/__init__.py index ce2c8d6a4..9becfbe15 100644 --- a/spacy/lang/lt/__init__.py +++ b/spacy/lang/lt/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS diff --git a/spacy/lang/lt/examples.py b/spacy/lang/lt/examples.py index 99dbe9d4d..eaf941f1a 100644 --- a/spacy/lang/lt/examples.py +++ b/spacy/lang/lt/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/lt/lex_attrs.py b/spacy/lang/lt/lex_attrs.py index 81879948f..28894a59b 100644 --- a/spacy/lang/lt/lex_attrs.py +++ b/spacy/lang/lt/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = { diff --git a/spacy/lang/lt/morph_rules.py b/spacy/lang/lt/morph_rules.py index 3bf26d9d8..f7bfd3cc6 100644 --- a/spacy/lang/lt/morph_rules.py +++ b/spacy/lang/lt/morph_rules.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import LEMMA, PRON_LEMMA diff --git a/spacy/lang/lt/punctuation.py b/spacy/lang/lt/punctuation.py index 5eedc8116..506aa8f32 100644 --- a/spacy/lang/lt/punctuation.py +++ b/spacy/lang/lt/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ICONS, LIST_ELLIPSES from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA from ..char_classes import HYPHENS diff --git a/spacy/lang/lt/stop_words.py b/spacy/lang/lt/stop_words.py index fed05d80d..8c11b3f7b 100644 --- a/spacy/lang/lt/stop_words.py +++ b/spacy/lang/lt/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - STOP_WORDS = { "a", "abejais", diff --git a/spacy/lang/lt/tag_map.py b/spacy/lang/lt/tag_map.py index 6ea4f8ae0..f08db535f 100644 --- a/spacy/lang/lt/tag_map.py +++ b/spacy/lang/lt/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, ADJ, ADP, ADV, CONJ, INTJ, NOUN, NUM, PART from ...symbols import PRON, PROPN, PUNCT, SYM, VERB, X diff --git a/spacy/lang/lt/tokenizer_exceptions.py b/spacy/lang/lt/tokenizer_exceptions.py index 4287b26dd..012dfbd20 100644 --- a/spacy/lang/lt/tokenizer_exceptions.py +++ 
b/spacy/lang/lt/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH _exc = {} diff --git a/spacy/lang/lv/__init__.py b/spacy/lang/lv/__init__.py index bb8c0763b..dd8919b73 100644 --- a/spacy/lang/lv/__init__.py +++ b/spacy/lang/lv/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language from ...attrs import LANG diff --git a/spacy/lang/lv/stop_words.py b/spacy/lang/lv/stop_words.py index 075ad6347..2685c2430 100644 --- a/spacy/lang/lv/stop_words.py +++ b/spacy/lang/lv/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-lv STOP_WORDS = set( diff --git a/spacy/lang/mr/__init__.py b/spacy/lang/mr/__init__.py index fd95f9354..eb52a3935 100644 --- a/spacy/lang/mr/__init__.py +++ b/spacy/lang/mr/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language from ...attrs import LANG diff --git a/spacy/lang/mr/stop_words.py b/spacy/lang/mr/stop_words.py index 0b0cd035d..9b0cee951 100644 --- a/spacy/lang/mr/stop_words.py +++ b/spacy/lang/mr/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-mr/blob/master/stopwords-mr.txt, https://github.com/6/stopwords-json/edit/master/dist/mr.json STOP_WORDS = set( """ diff --git a/spacy/lang/nb/__init__.py b/spacy/lang/nb/__init__.py index e6c58b7de..d10c44c50 100644 --- a/spacy/lang/nb/__init__.py +++ b/spacy/lang/nb/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/nb/examples.py b/spacy/lang/nb/examples.py index c15426ded..b1a63ad74 100644 --- a/spacy/lang/nb/examples.py +++ b/spacy/lang/nb/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/nb/morph_rules.py b/spacy/lang/nb/morph_rules.py index e20814535..e96b9fd6b 100644 --- a/spacy/lang/nb/morph_rules.py +++ b/spacy/lang/nb/morph_rules.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - from ...symbols import LEMMA, PRON_LEMMA # This dict includes all the PRON and DET tag combinations found in the @@ -198,7 +195,7 @@ MORPH_RULES = { "seg": { LEMMA: PRON_LEMMA, "Person": "Three", - "Number": ("Sing", "Plur"), + "Number": "Sing,Plur", "Reflex": "Yes", } }, @@ -251,7 +248,7 @@ MORPH_RULES = { }, "deres": { LEMMA: "deres", - "Person": ("Two", "Three"), + "Person": "Two,Three", "Number": "Sing", "Poss": "Yes", "Gender": "Masc", @@ -312,7 +309,7 @@ MORPH_RULES = { }, "deres": { LEMMA: "deres", - "Person": ("Two", "Three"), + "Person": "Two,Three", "Number": "Sing", "Poss": "Yes", "Gender": "Fem", @@ -373,7 +370,7 @@ MORPH_RULES = { }, "deres": { LEMMA: "deres", - "Person": ("Two", "Three"), + "Person": "Two,Three", "Number": "Sing", "Poss": "Yes", "Gender": "Neut", @@ -403,7 +400,7 @@ MORPH_RULES = { "våre": {LEMMA: "vår", "Person": "One", "Number": "Plur", "Poss": "Yes"}, "deres": { LEMMA: "deres", - "Person": ("Two", "Three"), + "Person": "Two,Three", "Number": "Plur", "Poss": "Yes", }, @@ -451,21 +448,21 @@ MORPH_RULES = { "PronType": "Prs", "Number": "Sing", "Person": "Three", - "Gender": ("Fem", "Masc"), + "Gender": "Fem,Masc", }, "den": { LEMMA: PRON_LEMMA, "PronType": "Prs", "Number": "Sing", "Person": "Three", - "Gender": ("Fem", "Masc"), + "Gender": "Fem,Masc", }, "ingen": { LEMMA: PRON_LEMMA, "PronType": "Prs", "Number": "Sing", "Person": "Three", - "Gender": ("Fem", "Masc"), + "Gender": "Fem,Masc", "Polarity": "Neg", }, }, @@ -478,7 +475,7 @@ MORPH_RULES = { LEMMA: PRON_LEMMA, "PronType": "Prs", "Number": "Sing", - "Case": ("Gen", "Nom"), + "Case": "Gen,Nom", } }, "PRON__Animacy=Anim|Case=Gen|Number=Sing|PronType=Prs": { diff --git a/spacy/lang/nb/punctuation.py b/spacy/lang/nb/punctuation.py index 4c10b5a68..9b800029c 100644 --- a/spacy/lang/nb/punctuation.py +++ b/spacy/lang/nb/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER from ..char_classes import CURRENCY, PUNCT, UNITS, LIST_CURRENCY diff --git a/spacy/lang/nb/stop_words.py b/spacy/lang/nb/stop_words.py index caa2012e7..fd65dd788 100644 --- a/spacy/lang/nb/stop_words.py +++ b/spacy/lang/nb/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ alle allerede alt and andre annen annet at av diff --git a/spacy/lang/nb/syntax_iterators.py b/spacy/lang/nb/syntax_iterators.py index 2ed2c1b35..c09b0e840 100644 --- a/spacy/lang/nb/syntax_iterators.py +++ b/spacy/lang/nb/syntax_iterators.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import NOUN, PROPN, PRON from ...errors import Errors diff --git a/spacy/lang/nb/tag_map.py b/spacy/lang/nb/tag_map.py index ca0ece265..a67586ed9 100644 --- a/spacy/lang/nb/tag_map.py +++ b/spacy/lang/nb/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, ADJ, CONJ, CCONJ, SCONJ, SYM, NUM, DET, ADV, ADP, X from ...symbols import VERB, NOUN, PROPN, PART, INTJ, PRON, AUX diff --git a/spacy/lang/nb/tokenizer_exceptions.py b/spacy/lang/nb/tokenizer_exceptions.py index 
3f4aa79f6..eb67e4c89 100644 --- a/spacy/lang/nb/tokenizer_exceptions.py +++ b/spacy/lang/nb/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA diff --git a/spacy/lang/nl/__init__.py b/spacy/lang/nl/__init__.py index 407d23f73..e99665e1d 100644 --- a/spacy/lang/nl/__init__.py +++ b/spacy/lang/nl/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .tag_map import TAG_MAP diff --git a/spacy/lang/nl/examples.py b/spacy/lang/nl/examples.py index a459760f4..8c8c50c60 100644 --- a/spacy/lang/nl/examples.py +++ b/spacy/lang/nl/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/nl/lemmatizer.py b/spacy/lang/nl/lemmatizer.py index 9a92bee44..e7501ec52 100644 --- a/spacy/lang/nl/lemmatizer.py +++ b/spacy/lang/nl/lemmatizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...lemmatizer import Lemmatizer from ...symbols import NOUN, VERB, ADJ, NUM, DET, PRON, ADP, AUX, ADV diff --git a/spacy/lang/nl/lex_attrs.py b/spacy/lang/nl/lex_attrs.py index 69343b589..f1acaefeb 100644 --- a/spacy/lang/nl/lex_attrs.py +++ b/spacy/lang/nl/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/nl/punctuation.py b/spacy/lang/nl/punctuation.py index e7207038b..d9dd2a6e3 100644 --- a/spacy/lang/nl/punctuation.py +++ b/spacy/lang/nl/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_UNITS, merge_chars from ..char_classes import LIST_PUNCT, LIST_QUOTES, CURRENCY, PUNCT from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER diff --git a/spacy/lang/nl/stop_words.py b/spacy/lang/nl/stop_words.py index 44551f2d4..a2c6198e7 100644 --- a/spacy/lang/nl/stop_words.py +++ b/spacy/lang/nl/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - # The original stop words list (added in f46ffe3) was taken from # http://www.damienvanholten.com/downloads/dutch-stop-words.txt # and consisted of about 100 tokens. diff --git a/spacy/lang/nl/tag_map.py b/spacy/lang/nl/tag_map.py index 4fde5d39f..5bd7747c6 100644 --- a/spacy/lang/nl/tag_map.py +++ b/spacy/lang/nl/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, ADJ, NUM, DET, ADV, ADP, X, VERB from ...symbols import NOUN, PROPN, SPACE, PRON, CONJ diff --git a/spacy/lang/nl/tokenizer_exceptions.py b/spacy/lang/nl/tokenizer_exceptions.py index c0915f127..df69c7a8a 100644 --- a/spacy/lang/nl/tokenizer_exceptions.py +++ b/spacy/lang/nl/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH # Extensive list of both common and uncommon dutch abbreviations copied from diff --git a/spacy/lang/norm_exceptions.py b/spacy/lang/norm_exceptions.py index 341967a78..f35f613b1 100644 --- a/spacy/lang/norm_exceptions.py +++ b/spacy/lang/norm_exceptions.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # These exceptions are used to add NORM values based on a token's ORTH value. 
# Individual languages can also add their own exceptions and overwrite them - # for example, British vs. American spelling in English. diff --git a/spacy/lang/pl/__init__.py b/spacy/lang/pl/__init__.py index 52b662a90..660931ffd 100644 --- a/spacy/lang/pl/__init__.py +++ b/spacy/lang/pl/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .punctuation import TOKENIZER_SUFFIXES from .tag_map import TAG_MAP diff --git a/spacy/lang/pl/examples.py b/spacy/lang/pl/examples.py index 14b6c7030..b1ea5880f 100644 --- a/spacy/lang/pl/examples.py +++ b/spacy/lang/pl/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/pl/lex_attrs.py b/spacy/lang/pl/lex_attrs.py index f1379aa50..ce56e28a8 100644 --- a/spacy/lang/pl/lex_attrs.py +++ b/spacy/lang/pl/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/pl/punctuation.py b/spacy/lang/pl/punctuation.py index c87464b1b..31e56b9ae 100644 --- a/spacy/lang/pl/punctuation.py +++ b/spacy/lang/pl/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_PUNCT, LIST_HYPHENS from ..char_classes import LIST_ICONS, LIST_QUOTES, CURRENCY, UNITS, PUNCT from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER diff --git a/spacy/lang/pl/stop_words.py b/spacy/lang/pl/stop_words.py index 11df67328..075aec391 100644 --- a/spacy/lang/pl/stop_words.py +++ b/spacy/lang/pl/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 - -from __future__ import unicode_literals - # sources: https://github.com/bieli/stopwords/blob/master/polish.stopwords.txt and https://github.com/stopwords-iso/stopwords-pl STOP_WORDS = set( diff --git a/spacy/lang/pl/tag_map.py b/spacy/lang/pl/tag_map.py index 5356c26cb..b83ee4d4c 100644 --- a/spacy/lang/pl/tag_map.py +++ b/spacy/lang/pl/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ( POS, ADJ, diff --git a/spacy/lang/pt/__init__.py b/spacy/lang/pt/__init__.py index c09996126..d212d1e39 100644 --- a/spacy/lang/pt/__init__.py +++ b/spacy/lang/pt/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/pt/examples.py b/spacy/lang/pt/examples.py index b7206ffd7..13f3512cf 100644 --- a/spacy/lang/pt/examples.py +++ b/spacy/lang/pt/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/pt/lex_attrs.py b/spacy/lang/pt/lex_attrs.py index 4ad0eeecb..3c6979ab4 100644 --- a/spacy/lang/pt/lex_attrs.py +++ b/spacy/lang/pt/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/pt/punctuation.py b/spacy/lang/pt/punctuation.py index 370e6aaad..08e31f9d0 100644 --- a/spacy/lang/pt/punctuation.py +++ b/spacy/lang/pt/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES from ..punctuation import TOKENIZER_SUFFIXES as BASE_TOKENIZER_SUFFIXES from ..punctuation import TOKENIZER_INFIXES as BASE_TOKENIZER_INFIXES diff --git a/spacy/lang/pt/stop_words.py b/spacy/lang/pt/stop_words.py index 774b06809..ff45ad3a7 100644 --- a/spacy/lang/pt/stop_words.py +++ b/spacy/lang/pt/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ à às área acerca ademais adeus agora ainda algo algumas alguns ali além ambas ambos antes diff --git a/spacy/lang/pt/tag_map.py b/spacy/lang/pt/tag_map.py index cdc7de57e..dc65998a4 100644 --- a/spacy/lang/pt/tag_map.py +++ b/spacy/lang/pt/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB, CCONJ from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, SCONJ, AUX diff --git a/spacy/lang/pt/tokenizer_exceptions.py b/spacy/lang/pt/tokenizer_exceptions.py index 981c0624b..c5c5d49e8 100644 --- a/spacy/lang/pt/tokenizer_exceptions.py +++ b/spacy/lang/pt/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH diff --git a/spacy/lang/punctuation.py b/spacy/lang/punctuation.py index ccb72de28..bf7357e48 100644 --- a/spacy/lang/punctuation.py +++ b/spacy/lang/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY from .char_classes import LIST_ICONS, HYPHENS, CURRENCY, UNITS from .char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT diff --git a/spacy/lang/ro/__init__.py b/spacy/lang/ro/__init__.py index c7b744ca5..cae64a85c 100644 --- a/spacy/lang/ro/__init__.py +++ b/spacy/lang/ro/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES diff --git a/spacy/lang/ro/examples.py b/spacy/lang/ro/examples.py index a372d7cb2..bfa258ffc 100644 --- a/spacy/lang/ro/examples.py +++ b/spacy/lang/ro/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/ro/lex_attrs.py b/spacy/lang/ro/lex_attrs.py index bb8391ad1..0f86f53cd 100644 --- a/spacy/lang/ro/lex_attrs.py +++ b/spacy/lang/ro/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/ro/punctuation.py b/spacy/lang/ro/punctuation.py index 87f9a1248..529e1c977 100644 --- a/spacy/lang/ro/punctuation.py +++ b/spacy/lang/ro/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import itertools from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY diff --git a/spacy/lang/ro/stop_words.py b/spacy/lang/ro/stop_words.py index b5ba73458..1d90be85d 100644 --- a/spacy/lang/ro/stop_words.py +++ b/spacy/lang/ro/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-ro STOP_WORDS = set( """ diff --git a/spacy/lang/ro/tag_map.py b/spacy/lang/ro/tag_map.py index cb5239809..d6820b4f2 100644 --- a/spacy/lang/ro/tag_map.py +++ b/spacy/lang/ro/tag_map.py @@ -1,5 +1,3 @@ -from __future__ import unicode_literals - from ...symbols import POS, ADJ, ADP, ADV, INTJ, NOUN, NUM, PART from ...symbols import PRON, PROPN, PUNCT, SYM, VERB, X, CCONJ, SCONJ, DET, AUX diff --git a/spacy/lang/ro/tokenizer_exceptions.py b/spacy/lang/ro/tokenizer_exceptions.py index b27344d2a..eb5f95dfb 100644 --- a/spacy/lang/ro/tokenizer_exceptions.py +++ b/spacy/lang/ro/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH from .punctuation import _make_ro_variants diff --git a/spacy/lang/ru/__init__.py b/spacy/lang/ru/__init__.py index f0e77d811..52cab1db1 100644 --- a/spacy/lang/ru/__init__.py +++ b/spacy/lang/ru/__init__.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals, print_function - from .stop_words import STOP_WORDS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/ru/examples.py b/spacy/lang/ru/examples.py index 2db621dac..adb007625 100644 --- a/spacy/lang/ru/examples.py +++ b/spacy/lang/ru/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/ru/lemmatizer.py b/spacy/lang/ru/lemmatizer.py index 96d32f59c..ed0e858f5 100644 --- a/spacy/lang/ru/lemmatizer.py +++ b/spacy/lang/ru/lemmatizer.py @@ -1,9 +1,5 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ADJ, DET, NOUN, NUM, PRON, PROPN, PUNCT, VERB, POS from ...lemmatizer import Lemmatizer -from ...compat import unicode_ class RussianLemmatizer(Lemmatizer): @@ -85,7 +81,7 @@ class RussianLemmatizer(Lemmatizer): @staticmethod def normalize_univ_pos(univ_pos): - if isinstance(univ_pos, unicode_): + if isinstance(univ_pos, str): return univ_pos.upper() symbols_to_str = { diff --git a/spacy/lang/ru/lex_attrs.py b/spacy/lang/ru/lex_attrs.py index 448c5b285..7979c7ea6 100644 --- a/spacy/lang/ru/lex_attrs.py +++ b/spacy/lang/ru/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/ru/stop_words.py b/spacy/lang/ru/stop_words.py index 89069b3cf..16cb55ef9 100644 --- a/spacy/lang/ru/stop_words.py +++ b/spacy/lang/ru/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ а diff --git a/spacy/lang/ru/tag_map.py b/spacy/lang/ru/tag_map.py index baf065588..294919811 100644 --- a/spacy/lang/ru/tag_map.py +++ b/spacy/lang/ru/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ diff --git a/spacy/lang/ru/tokenizer_exceptions.py b/spacy/lang/ru/tokenizer_exceptions.py index ea7b5b20d..df3169baf 100644 --- a/spacy/lang/ru/tokenizer_exceptions.py +++ b/spacy/lang/ru/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA, NORM diff --git a/spacy/lang/si/__init__.py b/spacy/lang/si/__init__.py index a58a63f03..3b065860c 100644 --- a/spacy/lang/si/__init__.py +++ b/spacy/lang/si/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/si/examples.py b/spacy/lang/si/examples.py index 842dfdd7e..b34051d00 100644 --- a/spacy/lang/si/examples.py +++ b/spacy/lang/si/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
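In the `RussianLemmatizer` hunk above, the old `unicode_` compat check gives way to a plain `isinstance(univ_pos, str)` on Python 3, while symbol IDs are still mapped back to their string names via `symbols_to_str`. A standalone sketch of that normalization step, abbreviated to the symbols imported in the hunk (this is not the class itself):

```python
from spacy.symbols import ADJ, DET, NOUN, NUM, PRON, PROPN, PUNCT, VERB

# Abbreviated stand-in for the symbols_to_str mapping in the real method.
_symbols_to_str = {
    ADJ: "ADJ", DET: "DET", NOUN: "NOUN", NUM: "NUM",
    PRON: "PRON", PROPN: "PROPN", PUNCT: "PUNCT", VERB: "VERB",
}


def normalize_univ_pos(univ_pos):
    # On Python 3, a plain str check replaces the old unicode_ helper.
    if isinstance(univ_pos, str):
        return univ_pos.upper()
    return _symbols_to_str.get(univ_pos)


print(normalize_univ_pos("noun"))  # NOUN
print(normalize_univ_pos(NOUN))    # NOUN
```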
diff --git a/spacy/lang/si/lex_attrs.py b/spacy/lang/si/lex_attrs.py index 5d5f06187..aa061852d 100644 --- a/spacy/lang/si/lex_attrs.py +++ b/spacy/lang/si/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ diff --git a/spacy/lang/si/stop_words.py b/spacy/lang/si/stop_words.py index 8bbdec6b7..bde662bf7 100644 --- a/spacy/lang/si/stop_words.py +++ b/spacy/lang/si/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ අතර diff --git a/spacy/lang/sk/__init__.py b/spacy/lang/sk/__init__.py index cb17c0b6d..c7b171de4 100644 --- a/spacy/lang/sk/__init__.py +++ b/spacy/lang/sk/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .tag_map import TAG_MAP from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/sk/examples.py b/spacy/lang/sk/examples.py index 486ea375e..736109a7c 100644 --- a/spacy/lang/sk/examples.py +++ b/spacy/lang/sk/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/sk/lex_attrs.py b/spacy/lang/sk/lex_attrs.py index 3dea4d8f0..0caf62e8e 100644 --- a/spacy/lang/sk/lex_attrs.py +++ b/spacy/lang/sk/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ diff --git a/spacy/lang/sk/stop_words.py b/spacy/lang/sk/stop_words.py index 3e78acb10..017e7beef 100644 --- a/spacy/lang/sk/stop_words.py +++ b/spacy/lang/sk/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/Ardevop-sk/stopwords-sk STOP_WORDS = set( diff --git a/spacy/lang/sk/tag_map.py b/spacy/lang/sk/tag_map.py index 28b36d3c1..d159a6a51 100644 --- a/spacy/lang/sk/tag_map.py +++ b/spacy/lang/sk/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, AUX, ADJ, CCONJ, NUM, ADV, ADP, X, VERB from ...symbols import NOUN, PART, INTJ, PRON diff --git a/spacy/lang/sl/__init__.py b/spacy/lang/sl/__init__.py index 2d4977bdf..ce46e92dc 100644 --- a/spacy/lang/sl/__init__.py +++ b/spacy/lang/sl/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language from ...attrs import LANG diff --git a/spacy/lang/sl/stop_words.py b/spacy/lang/sl/stop_words.py index 187e95876..6fb01a183 100644 --- a/spacy/lang/sl/stop_words.py +++ b/spacy/lang/sl/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-sl # TODO: probably needs to be tidied up – the list seems to have month names in # it, which shouldn't be considered stop words. 
diff --git a/spacy/lang/sq/__init__.py b/spacy/lang/sq/__init__.py index 6f33b37c2..034604838 100644 --- a/spacy/lang/sq/__init__.py +++ b/spacy/lang/sq/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language from ...attrs import LANG diff --git a/spacy/lang/sq/examples.py b/spacy/lang/sq/examples.py index c51a0da39..06ed20fa1 100644 --- a/spacy/lang/sq/examples.py +++ b/spacy/lang/sq/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/sq/stop_words.py b/spacy/lang/sq/stop_words.py index f91861ca1..f2b1a4f4a 100644 --- a/spacy/lang/sq/stop_words.py +++ b/spacy/lang/sq/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/andrixh/index-albanian STOP_WORDS = set( diff --git a/spacy/lang/sr/__init__.py b/spacy/lang/sr/__init__.py index 286d6693b..7f2172707 100644 --- a/spacy/lang/sr/__init__.py +++ b/spacy/lang/sr/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/sr/examples.py b/spacy/lang/sr/examples.py index d636220c3..ec7f57ced 100644 --- a/spacy/lang/sr/examples.py +++ b/spacy/lang/sr/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/sr/lex_attrs.py b/spacy/lang/sr/lex_attrs.py index c90dc0da7..dc48909bc 100644 --- a/spacy/lang/sr/lex_attrs.py +++ b/spacy/lang/sr/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/sr/stop_words.py b/spacy/lang/sr/stop_words.py index 9712327f8..5df5509d2 100644 --- a/spacy/lang/sr/stop_words.py +++ b/spacy/lang/sr/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ а diff --git a/spacy/lang/sr/tokenizer_exceptions.py b/spacy/lang/sr/tokenizer_exceptions.py index 8fca346a3..82df15186 100755 --- a/spacy/lang/sr/tokenizer_exceptions.py +++ b/spacy/lang/sr/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA, NORM diff --git a/spacy/lang/sv/__init__.py b/spacy/lang/sv/__init__.py index 3a749eeee..8179b1c84 100644 --- a/spacy/lang/sv/__init__.py +++ b/spacy/lang/sv/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tag_map import TAG_MAP from .stop_words import STOP_WORDS diff --git a/spacy/lang/sv/examples.py b/spacy/lang/sv/examples.py index 58e095195..bc6cd7a54 100644 --- a/spacy/lang/sv/examples.py +++ b/spacy/lang/sv/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/sv/morph_rules.py b/spacy/lang/sv/morph_rules.py index 77744813f..3ef6aedc5 100644 --- a/spacy/lang/sv/morph_rules.py +++ b/spacy/lang/sv/morph_rules.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import LEMMA, PRON_LEMMA @@ -108,7 +105,7 @@ MORPH_RULES = { "PronType": "Prs", "Person": "Three", "Number": "Plur", - "Case": ("Nom", "Acc"), + "Case": "Nom,Acc", }, "dem": { LEMMA: PRON_LEMMA, @@ -169,7 +166,7 @@ MORPH_RULES = { LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", - "Number": ("Sing", "Plur"), + "Number": "Sing,Plur", "Gender": "Masc", "Poss": "Yes", "Reflex": "Yes", @@ -178,7 +175,7 @@ MORPH_RULES = { LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", - "Number": ("Sing", "Plur"), + "Number": "Sing,Plur", "Gender": "Fem", "Poss": "Yes", "Reflex": "Yes", @@ -187,7 +184,7 @@ MORPH_RULES = { LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", - "Number": ("Sing", "Plur"), + "Number": "Sing,Plur", "Poss": "Yes", "Reflex": "Yes", }, @@ -275,7 +272,7 @@ MORPH_RULES = { "VBZ": { "är": { "VerbForm": "Fin", - "Person": ("One", "Two", "Three"), + "Person": "One,Two,Three", "Tense": "Pres", "Mood": "Ind", } diff --git a/spacy/lang/sv/stop_words.py b/spacy/lang/sv/stop_words.py index 206abce5a..2422b2a9e 100644 --- a/spacy/lang/sv/stop_words.py +++ b/spacy/lang/sv/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ aderton adertonde adjö aldrig alla allas allt alltid alltså än andra andras diff --git a/spacy/lang/sv/syntax_iterators.py b/spacy/lang/sv/syntax_iterators.py index 84493ae79..ec92c08d3 100644 --- a/spacy/lang/sv/syntax_iterators.py +++ b/spacy/lang/sv/syntax_iterators.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import NOUN, PROPN, PRON from ...errors import Errors diff --git a/spacy/lang/sv/tag_map.py b/spacy/lang/sv/tag_map.py index 7d4e29030..d4f5b6291 100644 --- a/spacy/lang/sv/tag_map.py +++ b/spacy/lang/sv/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, ADJ, CCONJ, SCONJ, NUM, DET, ADV from ...symbols import ADP, X, VERB, NOUN, PROPN, PART, INTJ, PRON diff --git a/spacy/lang/sv/tokenizer_exceptions.py b/spacy/lang/sv/tokenizer_exceptions.py index e95c67f37..a78a51f31 100644 --- a/spacy/lang/sv/tokenizer_exceptions.py +++ b/spacy/lang/sv/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA _exc = {} diff --git a/spacy/lang/ta/__init__.py b/spacy/lang/ta/__init__.py index cb23339e6..d7a04afea 100644 --- a/spacy/lang/ta/__init__.py +++ b/spacy/lang/ta/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/ta/examples.py b/spacy/lang/ta/examples.py index 3ce3c3544..a53227220 100644 --- a/spacy/lang/ta/examples.py +++ b/spacy/lang/ta/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
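Beyond the header removals, the nb and sv `morph_rules.py` hunks above also change how multi-valued features are written: tuples such as `("Sing", "Plur")` or `("One", "Two", "Three")` become comma-joined strings like `"Sing,Plur"`. A small hypothetical helper (not part of the diff) that captures the conversion:

```python
# Hypothetical helper illustrating the value rewrite in the morph_rules
# hunks above: tuple-valued features become one comma-joined string.
def to_feature_string(value):
    if isinstance(value, (tuple, list)):
        return ",".join(value)
    return value


assert to_feature_string(("Sing", "Plur")) == "Sing,Plur"
assert to_feature_string(("One", "Two", "Three")) == "One,Two,Three"
assert to_feature_string("Sing") == "Sing"
```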
diff --git a/spacy/lang/ta/lex_attrs.py b/spacy/lang/ta/lex_attrs.py index 40158ad7a..f830f4ac9 100644 --- a/spacy/lang/ta/lex_attrs.py +++ b/spacy/lang/ta/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/ta/stop_words.py b/spacy/lang/ta/stop_words.py index 91ebe8fd8..abbff949d 100644 --- a/spacy/lang/ta/stop_words.py +++ b/spacy/lang/ta/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Stop words STOP_WORDS = set( diff --git a/spacy/lang/tag_map.py b/spacy/lang/tag_map.py index 3a744f180..5bff905bd 100644 --- a/spacy/lang/tag_map.py +++ b/spacy/lang/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ diff --git a/spacy/lang/te/__init__.py b/spacy/lang/te/__init__.py index a4709177d..424164cc7 100644 --- a/spacy/lang/te/__init__.py +++ b/spacy/lang/te/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/te/examples.py b/spacy/lang/te/examples.py index 815ec8227..cff7d3cb0 100644 --- a/spacy/lang/te/examples.py +++ b/spacy/lang/te/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/te/lex_attrs.py b/spacy/lang/te/lex_attrs.py index 6da766dca..ae11827f6 100644 --- a/spacy/lang/te/lex_attrs.py +++ b/spacy/lang/te/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ diff --git a/spacy/lang/te/stop_words.py b/spacy/lang/te/stop_words.py index 11e157177..b18dab697 100644 --- a/spacy/lang/te/stop_words.py +++ b/spacy/lang/te/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - # Source: https://github.com/Xangis/extra-stopwords (MIT License) STOP_WORDS = set( diff --git a/spacy/lang/th/__init__.py b/spacy/lang/th/__init__.py index 512be0c59..4333afcc9 100644 --- a/spacy/lang/th/__init__.py +++ b/spacy/lang/th/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .tag_map import TAG_MAP from .stop_words import STOP_WORDS diff --git a/spacy/lang/th/lex_attrs.py b/spacy/lang/th/lex_attrs.py index 047d046c2..bc4e5293e 100644 --- a/spacy/lang/th/lex_attrs.py +++ b/spacy/lang/th/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/th/tag_map.py b/spacy/lang/th/tag_map.py index 119a2f6a0..7fb12d538 100644 --- a/spacy/lang/th/tag_map.py +++ b/spacy/lang/th/tag_map.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, NOUN, PRON, ADJ, ADV, INTJ, PROPN, DET, NUM, AUX, VERB from ...symbols import ADP, CCONJ, PART, PUNCT, SPACE, SCONJ diff --git a/spacy/lang/th/tokenizer_exceptions.py b/spacy/lang/th/tokenizer_exceptions.py index 4de0f1195..0529b3a99 100644 --- a/spacy/lang/th/tokenizer_exceptions.py +++ b/spacy/lang/th/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA diff --git a/spacy/lang/tl/__init__.py 
b/spacy/lang/tl/__init__.py index 30ad93139..f477029f7 100644 --- a/spacy/lang/tl/__init__.py +++ b/spacy/lang/tl/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/tl/lex_attrs.py b/spacy/lang/tl/lex_attrs.py index 61dc9d4f3..60bdc923b 100644 --- a/spacy/lang/tl/lex_attrs.py +++ b/spacy/lang/tl/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/tl/stop_words.py b/spacy/lang/tl/stop_words.py index 510b3a418..2560cdaed 100644 --- a/spacy/lang/tl/stop_words.py +++ b/spacy/lang/tl/stop_words.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - STOP_WORDS = set( """ akin diff --git a/spacy/lang/tl/tokenizer_exceptions.py b/spacy/lang/tl/tokenizer_exceptions.py index 77e1fb0c6..ea14746c4 100644 --- a/spacy/lang/tl/tokenizer_exceptions.py +++ b/spacy/lang/tl/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA diff --git a/spacy/lang/tokenizer_exceptions.py b/spacy/lang/tokenizer_exceptions.py index 29ce75442..3bb299d6d 100644 --- a/spacy/lang/tokenizer_exceptions.py +++ b/spacy/lang/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import re from .char_classes import ALPHA_LOWER @@ -37,7 +34,7 @@ URL_PATTERN = ( r"|" # host & domain names # mods: match is case-sensitive, so include [A-Z] - "(?:" # noqa + "(?:" # noqa: E131 "(?:" "[A-Za-z0-9\u00a1-\uffff]" "[A-Za-z0-9\u00a1-\uffff_-]{0,62}" @@ -127,7 +124,6 @@ emoticons = set( (-: =) (= -") :] :-] [: diff --git a/spacy/lang/tr/__init__.py b/spacy/lang/tr/__init__.py index 2553e7c0f..a29d78261 100644 --- a/spacy/lang/tr/__init__.py +++ b/spacy/lang/tr/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS diff --git a/spacy/lang/tr/examples.py b/spacy/lang/tr/examples.py index a0464dfe3..dfb324a4e 100644 --- a/spacy/lang/tr/examples.py +++ b/spacy/lang/tr/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
>>> from spacy.lang.tr.examples import sentences diff --git a/spacy/lang/tr/lex_attrs.py b/spacy/lang/tr/lex_attrs.py index 93f26fc8e..3dbc1833a 100644 --- a/spacy/lang/tr/lex_attrs.py +++ b/spacy/lang/tr/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/tr/stop_words.py b/spacy/lang/tr/stop_words.py index 65905499a..85dcff6a5 100644 --- a/spacy/lang/tr/stop_words.py +++ b/spacy/lang/tr/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-tr STOP_WORDS = set( """ diff --git a/spacy/lang/tr/tokenizer_exceptions.py b/spacy/lang/tr/tokenizer_exceptions.py index f48e035d4..97f524a87 100644 --- a/spacy/lang/tr/tokenizer_exceptions.py +++ b/spacy/lang/tr/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, NORM _exc = {"sağol": [{ORTH: "sağ"}, {ORTH: "ol", NORM: "olun"}]} diff --git a/spacy/lang/tt/__init__.py b/spacy/lang/tt/__init__.py index 3655e6264..80574a70d 100644 --- a/spacy/lang/tt/__init__.py +++ b/spacy/lang/tt/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .lex_attrs import LEX_ATTRS from .punctuation import TOKENIZER_INFIXES from .stop_words import STOP_WORDS diff --git a/spacy/lang/tt/examples.py b/spacy/lang/tt/examples.py index ac668a0c2..723fcdd15 100644 --- a/spacy/lang/tt/examples.py +++ b/spacy/lang/tt/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. >>> from spacy.lang.tt.examples import sentences diff --git a/spacy/lang/tt/lex_attrs.py b/spacy/lang/tt/lex_attrs.py index ad3d6b9eb..a2ae03061 100644 --- a/spacy/lang/tt/lex_attrs.py +++ b/spacy/lang/tt/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ diff --git a/spacy/lang/tt/punctuation.py b/spacy/lang/tt/punctuation.py index 9ee66a59e..f644a8ccb 100644 --- a/spacy/lang/tt/punctuation.py +++ b/spacy/lang/tt/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, HYPHENS from ..char_classes import LIST_ELLIPSES, LIST_ICONS diff --git a/spacy/lang/tt/stop_words.py b/spacy/lang/tt/stop_words.py index 9f6e9bb86..44169b757 100644 --- a/spacy/lang/tt/stop_words.py +++ b/spacy/lang/tt/stop_words.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - # Tatar stopwords are from https://github.com/aliiae/stopwords-tt STOP_WORDS = set( diff --git a/spacy/lang/tt/tokenizer_exceptions.py b/spacy/lang/tt/tokenizer_exceptions.py index 89f7a990b..efe9e1fc0 100644 --- a/spacy/lang/tt/tokenizer_exceptions.py +++ b/spacy/lang/tt/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA, NORM _exc = {} diff --git a/spacy/lang/uk/__init__.py b/spacy/lang/uk/__init__.py index e74ff2d86..51165112a 100644 --- a/spacy/lang/uk/__init__.py +++ b/spacy/lang/uk/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS diff --git a/spacy/lang/uk/examples.py b/spacy/lang/uk/examples.py index 4f2b034eb..f75d44488 100644 
--- a/spacy/lang/uk/examples.py +++ b/spacy/lang/uk/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/uk/lemmatizer.py b/spacy/lang/uk/lemmatizer.py index 3eeed5dd4..ff61d711f 100644 --- a/spacy/lang/uk/lemmatizer.py +++ b/spacy/lang/uk/lemmatizer.py @@ -1,4 +1,3 @@ -# coding: utf8 from ...symbols import ADJ, DET, NOUN, NUM, PRON, PROPN, PUNCT, VERB, POS from ...lemmatizer import Lemmatizer diff --git a/spacy/lang/uk/lex_attrs.py b/spacy/lang/uk/lex_attrs.py index 0ade751d6..510e5b85d 100644 --- a/spacy/lang/uk/lex_attrs.py +++ b/spacy/lang/uk/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ diff --git a/spacy/lang/uk/stop_words.py b/spacy/lang/uk/stop_words.py index cdf24dd70..b11d7a044 100644 --- a/spacy/lang/uk/stop_words.py +++ b/spacy/lang/uk/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """а або diff --git a/spacy/lang/uk/tag_map.py b/spacy/lang/uk/tag_map.py deleted file mode 100644 index 472e772ef..000000000 --- a/spacy/lang/uk/tag_map.py +++ /dev/null @@ -1,28 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ -from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ - - -TAG_MAP = { - "ADV": {POS: ADV}, - "NOUN": {POS: NOUN}, - "ADP": {POS: ADP}, - "PRON": {POS: PRON}, - "SCONJ": {POS: SCONJ}, - "PROPN": {POS: PROPN}, - "DET": {POS: DET}, - "SYM": {POS: SYM}, - "INTJ": {POS: INTJ}, - "PUNCT": {POS: PUNCT}, - "NUM": {POS: NUM}, - "AUX": {POS: AUX}, - "X": {POS: X}, - "CONJ": {POS: CONJ}, - "CCONJ": {POS: CCONJ}, - "ADJ": {POS: ADJ}, - "VERB": {POS: VERB}, - "PART": {POS: PART}, - "SP": {POS: SPACE}, -} diff --git a/spacy/lang/uk/tokenizer_exceptions.py b/spacy/lang/uk/tokenizer_exceptions.py index a94d77af3..36f0b2e72 100644 --- a/spacy/lang/uk/tokenizer_exceptions.py +++ b/spacy/lang/uk/tokenizer_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import ORTH, LEMMA, POS, NORM, NOUN diff --git a/spacy/lang/ur/__init__.py b/spacy/lang/ur/__init__.py index 6eea0cf3b..c7f65adc3 100644 --- a/spacy/lang/ur/__init__.py +++ b/spacy/lang/ur/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/ur/examples.py b/spacy/lang/ur/examples.py index f47c11600..e55b337be 100644 --- a/spacy/lang/ur/examples.py +++ b/spacy/lang/ur/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
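The Turkish `tokenizer_exceptions.py` hunk a little further up keeps the special case `_exc = {"sağol": [{ORTH: "sağ"}, {ORTH: "ol", NORM: "olun"}]}`. A hedged usage sketch of what that entry does at runtime, using a blank pipeline so no trained model is assumed:

```python
from spacy.lang.tr import Turkish

nlp = Turkish()  # blank Turkish pipeline; tokenizer exceptions are built in
doc = nlp("sağol")

# Expected output, given the special case shown in the hunk:
print([t.text for t in doc])   # ['sağ', 'ol']
print([t.norm_ for t in doc])  # ['sağ', 'olun']
```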
diff --git a/spacy/lang/ur/lex_attrs.py b/spacy/lang/ur/lex_attrs.py index 12d85be4b..e590ed3e3 100644 --- a/spacy/lang/ur/lex_attrs.py +++ b/spacy/lang/ur/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM # Source https://quizlet.com/4271889/1-100-urdu-number-wordsurdu-numerals-flash-cards/ diff --git a/spacy/lang/ur/punctuation.py b/spacy/lang/ur/punctuation.py index b8b1a1c83..5d35d0a25 100644 --- a/spacy/lang/ur/punctuation.py +++ b/spacy/lang/ur/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/ur/stop_words.py b/spacy/lang/ur/stop_words.py index 73c159d5c..abfa36497 100644 --- a/spacy/lang/ur/stop_words.py +++ b/spacy/lang/ur/stop_words.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - # Source: collected from different resource on internet STOP_WORDS = set( """ diff --git a/spacy/lang/ur/tag_map.py b/spacy/lang/ur/tag_map.py index aad548e9b..4ae0d7014 100644 --- a/spacy/lang/ur/tag_map.py +++ b/spacy/lang/ur/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, SCONJ from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB diff --git a/spacy/lang/vi/__init__.py b/spacy/lang/vi/__init__.py index 425f84e3d..7496763ee 100644 --- a/spacy/lang/vi/__init__.py +++ b/spacy/lang/vi/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LANG, NORM from ..norm_exceptions import BASE_NORMS from ...language import Language diff --git a/spacy/lang/vi/lex_attrs.py b/spacy/lang/vi/lex_attrs.py index b6cd1188a..b3dbf2192 100644 --- a/spacy/lang/vi/lex_attrs.py +++ b/spacy/lang/vi/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/vi/stop_words.py b/spacy/lang/vi/stop_words.py index 13284dc59..1d2ecdf8d 100644 --- a/spacy/lang/vi/stop_words.py +++ b/spacy/lang/vi/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - # Source: https://github.com/stopwords/vietnamese-stopwords STOP_WORDS = set( """ diff --git a/spacy/lang/vi/tag_map.py b/spacy/lang/vi/tag_map.py deleted file mode 100644 index 472e772ef..000000000 --- a/spacy/lang/vi/tag_map.py +++ /dev/null @@ -1,28 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ -from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ - - -TAG_MAP = { - "ADV": {POS: ADV}, - "NOUN": {POS: NOUN}, - "ADP": {POS: ADP}, - "PRON": {POS: PRON}, - "SCONJ": {POS: SCONJ}, - "PROPN": {POS: PROPN}, - "DET": {POS: DET}, - "SYM": {POS: SYM}, - "INTJ": {POS: INTJ}, - "PUNCT": {POS: PUNCT}, - "NUM": {POS: NUM}, - "AUX": {POS: AUX}, - "X": {POS: X}, - "CONJ": {POS: CONJ}, - "CCONJ": {POS: CCONJ}, - "ADJ": {POS: ADJ}, - "VERB": {POS: VERB}, - "PART": {POS: PART}, - "SP": {POS: SPACE}, -} diff --git a/spacy/lang/xx/__init__.py b/spacy/lang/xx/__init__.py index 66d8c7917..347c624fd 100644 --- a/spacy/lang/xx/__init__.py +++ b/spacy/lang/xx/__init__.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..norm_exceptions import BASE_NORMS from ...language import Language diff --git 
a/spacy/lang/xx/examples.py b/spacy/lang/xx/examples.py index 38cd5e0cd..8d63c3c20 100644 --- a/spacy/lang/xx/examples.py +++ b/spacy/lang/xx/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/yo/__init__.py b/spacy/lang/yo/__init__.py index f227203cc..08e3166e1 100644 --- a/spacy/lang/yo/__init__.py +++ b/spacy/lang/yo/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from ..tokenizer_exceptions import BASE_EXCEPTIONS diff --git a/spacy/lang/yo/examples.py b/spacy/lang/yo/examples.py index 170ddc803..0a610f125 100644 --- a/spacy/lang/yo/examples.py +++ b/spacy/lang/yo/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/yo/lex_attrs.py b/spacy/lang/yo/lex_attrs.py index a9f1b85f6..ead68ced2 100644 --- a/spacy/lang/yo/lex_attrs.py +++ b/spacy/lang/yo/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import unicodedata from ...attrs import LIKE_NUM diff --git a/spacy/lang/yo/stop_words.py b/spacy/lang/yo/stop_words.py index 53d382ad3..5c7a7fc45 100644 --- a/spacy/lang/yo/stop_words.py +++ b/spacy/lang/yo/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - # stop words as whitespace-separated list. # Source: https://raw.githubusercontent.com/dohliam/more-stoplists/master/yo/yo.txt diff --git a/spacy/lang/zh/__init__.py b/spacy/lang/zh/__init__.py index 9d1cb71a7..fc7573f8d 100644 --- a/spacy/lang/zh/__init__.py +++ b/spacy/lang/zh/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import tempfile import srsly from pathlib import Path diff --git a/spacy/lang/zh/examples.py b/spacy/lang/zh/examples.py index b28215741..8be1336d2 100644 --- a/spacy/lang/zh/examples.py +++ b/spacy/lang/zh/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/zh/lex_attrs.py b/spacy/lang/zh/lex_attrs.py index 0b29c226e..08c8e3160 100644 --- a/spacy/lang/zh/lex_attrs.py +++ b/spacy/lang/zh/lex_attrs.py @@ -1,8 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals import re + from ...attrs import LIKE_NUM + _single_num_words = [ "〇", "一", diff --git a/spacy/lang/zh/stop_words.py b/spacy/lang/zh/stop_words.py index 0af4c1859..42ae4a1de 100644 --- a/spacy/lang/zh/stop_words.py +++ b/spacy/lang/zh/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - # stop words as whitespace-separated list # Chinese stop words,maybe not enough STOP_WORDS = set( diff --git a/spacy/lang/zh/tag_map.py b/spacy/lang/zh/tag_map.py index 41e2d2158..1ff0827be 100644 --- a/spacy/lang/zh/tag_map.py +++ b/spacy/lang/zh/tag_map.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, ADJ, SCONJ, CCONJ, NUM, DET, ADV, ADP, X from ...symbols import NOUN, PART, INTJ, PRON, VERB, SPACE diff --git a/spacy/language.py b/spacy/language.py index 0e5c46459..f8732b471 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -1,19 +1,13 @@ -# coding: utf8 -from __future__ import absolute_import, unicode_literals - import random import itertools -import warnings - -from thinc.extra import load_nlp - -from spacy.util import minibatch import weakref import functools -from collections import OrderedDict from contextlib import contextmanager from copy import copy, deepcopy -from thinc.neural import Model +from pathlib import Path +import warnings + +from thinc.api import get_current_ops, Config import srsly import multiprocessing as mp from itertools import chain, cycle @@ -24,10 +18,9 @@ from .vocab import Vocab from .lemmatizer import Lemmatizer from .lookups import Lookups from .analysis import analyze_pipes, analyze_all_pipes, validate_attrs -from .compat import izip, basestring_, is_python2, class_types -from .gold import GoldParse +from .gold import Example from .scorer import Scorer -from ._ml import link_vectors_to_models, create_default_optimizer +from .util import link_vectors_to_models, create_default_optimizer, registry from .attrs import IS_STOP, LANG, NORM from .lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES from .lang.punctuation import TOKENIZER_INFIXES @@ -145,7 +138,13 @@ class Language(object): factories = {"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp)} def __init__( - self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs + self, + vocab=True, + make_doc=True, + max_length=10 ** 6, + meta={}, + config=None, + **kwargs, ): """Initialise a Language object. @@ -155,6 +154,7 @@ class Language(object): object. Usually a `Tokenizer`. meta (dict): Custom meta data for the Language class. Is written to by models to add model meta data. + config (Config): Configuration data for creating the pipeline components. max_length (int) : Maximum number of characters in a single text. 
The current v2 models may run out memory on extremely long texts, due to large internal @@ -169,6 +169,9 @@ class Language(object): user_factories = util.registry.factories.get_all() self.factories.update(user_factories) self._meta = dict(meta) + self._config = config + if not self._config: + self._config = Config() self._path = None if vocab is True: factory = self.Defaults.create_vocab @@ -199,7 +202,7 @@ class Language(object): self._meta.setdefault("lang", self.lang) self._meta.setdefault("name", "model") self._meta.setdefault("version", "0.0.0") - self._meta.setdefault("spacy_version", ">={}".format(about.__version__)) + self._meta.setdefault("spacy_version", f">={about.__version__}") self._meta.setdefault("description", "") self._meta.setdefault("author", "") self._meta.setdefault("email", "") @@ -220,6 +223,10 @@ class Language(object): def meta(self, value): self._meta = value + @property + def config(self): + return self._config + # Conveniences to access pipeline components # Shouldn't be used anymore! @property @@ -242,6 +249,10 @@ class Language(object): def linker(self): return self.get_pipe("entity_linker") + @property + def senter(self): + return self.get_pipe("senter") + @property def matcher(self): return self.get_pipe("matcher") @@ -272,7 +283,7 @@ class Language(object): RETURNS (dict): Labels keyed by component name. """ - labels = OrderedDict() + labels = {} for name, pipe in self.pipeline: if hasattr(pipe, "labels"): labels[name] = list(pipe.labels) @@ -306,7 +317,28 @@ class Language(object): else: raise KeyError(Errors.E002.format(name=name)) factory = self.factories[name] - return factory(self, **config) + + # transform the model's config to an actual Model + factory_cfg = dict(config) + + # check whether we have a proper model config, or load a default one + if "model" in factory_cfg and not isinstance(factory_cfg["model"], dict): + warnings.warn( + Warnings.W099.format(type=type(factory_cfg["model"]), pipe=name) + ) + + # refer to the model configuration in the cfg settings for this component + if "model" in factory_cfg: + self.config[name] = {"model": factory_cfg["model"]} + + # create all objects in the config + factory_cfg = registry.make_from_config({"config": factory_cfg}, validate=True)[ + "config" + ] + model = factory_cfg.get("model", None) + if model is not None: + del factory_cfg["model"] + return factory(self, model, **factory_cfg) def add_pipe( self, component, name=None, before=None, after=None, first=None, last=None @@ -329,7 +361,7 @@ class Language(object): """ if not hasattr(component, "__call__"): msg = Errors.E003.format(component=repr(component), name=name) - if isinstance(component, basestring_) and component in self.factories: + if isinstance(component, str) and component in self.factories: msg += Errors.E004.format(component=component) raise ValueError(msg) if name is None: @@ -381,7 +413,7 @@ class Language(object): raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names)) if not hasattr(component, "__call__"): msg = Errors.E003.format(component=repr(component), name=name) - if isinstance(component, basestring_) and component in self.factories: + if isinstance(component, str) and component in self.factories: msg += Errors.E135.format(name=name) raise ValueError(msg) self.pipeline[self.pipe_names.index(name)] = (name, component) @@ -420,7 +452,7 @@ class Language(object): def __call__(self, text, disable=[], component_cfg=None): """Apply the pipeline to some text. 
The text can span multiple sentences, - and can contain arbtrary whitespace. Alignment into the original string + and can contain arbitrary whitespace. Alignment into the original string is preserved. text (unicode): The text to be processed. @@ -443,7 +475,10 @@ class Language(object): continue if not hasattr(proc, "__call__"): raise ValueError(Errors.E003.format(component=type(proc), name=name)) - doc = proc(doc, **component_cfg.get(name, {})) + try: + doc = proc(doc, **component_cfg.get(name, {})) + except KeyError: + raise ValueError(Errors.E109.format(name=name)) if doc is None: raise ValueError(Errors.E005.format(name=name)) return doc @@ -454,39 +489,59 @@ class Language(object): of the block. Otherwise, a DisabledPipes object is returned, that has a `.restore()` method you can use to undo your changes. - DOCS: https://spacy.io/api/language#disable_pipes + This method has been deprecated since 3.0 """ + warnings.warn(Warnings.W096, DeprecationWarning) if len(names) == 1 and isinstance(names[0], (list, tuple)): names = names[0] # support list of names instead of spread - return DisabledPipes(self, *names) + return DisabledPipes(self, names) + + def select_pipes(self, disable=None, enable=None): + """Disable one or more pipeline components. If used as a context + manager, the pipeline will be restored to the initial state at the end + of the block. Otherwise, a DisabledPipes object is returned, that has + a `.restore()` method you can use to undo your changes. + + disable (str or iterable): The name(s) of the pipes to disable + enable (str or iterable): The name(s) of the pipes to enable - all others will be disabled + + DOCS: https://spacy.io/api/language#select_pipes + """ + if enable is None and disable is None: + raise ValueError(Errors.E991) + if disable is not None and isinstance(disable, str): + disable = [disable] + if enable is not None: + if isinstance(enable, str): + enable = [enable] + to_disable = [pipe for pipe in self.pipe_names if pipe not in enable] + # raise an error if the enable and disable keywords are not consistent + if disable is not None and disable != to_disable: + raise ValueError( + Errors.E992.format( + enable=enable, disable=disable, names=self.pipe_names + ) + ) + disable = to_disable + return DisabledPipes(self, disable) def make_doc(self, text): return self.tokenizer(text) - def _format_docs_and_golds(self, docs, golds): - """Format golds and docs before update models.""" - expected_keys = ("words", "tags", "heads", "deps", "entities", "cats", "links") - gold_objs = [] - doc_objs = [] - for doc, gold in zip(docs, golds): - if isinstance(doc, basestring_): - doc = self.make_doc(doc) - if not isinstance(gold, GoldParse): - unexpected = [k for k in gold if k not in expected_keys] - if unexpected: - err = Errors.E151.format(unexp=unexpected, exp=expected_keys) - raise ValueError(err) - gold = GoldParse(doc, **gold) - doc_objs.append(doc) - gold_objs.append(gold) - - return doc_objs, gold_objs - - def update(self, docs, golds, drop=0.0, sgd=None, losses=None, component_cfg=None): + def update( + self, + examples, + dummy=None, + *, + drop=0.0, + sgd=None, + losses=None, + component_cfg=None, + ): """Update the models in the pipeline. - docs (iterable): A batch of `Doc` objects. - golds (iterable): A batch of `GoldParse` objects. + examples (iterable): A batch of `Example` or `Doc` objects. + dummy: Should not be set - serves to catch backwards-incompatible scripts. drop (float): The dropout rate. sgd (callable): An optimizer. 
losses (dict): Dictionary to update with the loss, keyed by component. @@ -495,46 +550,44 @@ class Language(object): DOCS: https://spacy.io/api/language#update """ - if len(docs) != len(golds): - raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds))) - if len(docs) == 0: + if dummy is not None: + raise ValueError(Errors.E989) + + if len(examples) == 0: return + examples = Example.to_example_objects(examples, make_doc=self.make_doc) + if sgd is None: if self._optimizer is None: - self._optimizer = create_default_optimizer(Model.ops) + self._optimizer = create_default_optimizer() sgd = self._optimizer - # Allow dict of args to GoldParse, instead of GoldParse objects. - docs, golds = self._format_docs_and_golds(docs, golds) - grads = {} - def get_grads(W, dW, key=None): - grads[key] = (W, dW) - - get_grads.alpha = sgd.alpha - get_grads.b1 = sgd.b1 - get_grads.b2 = sgd.b2 - pipes = list(self.pipeline) - random.shuffle(pipes) if component_cfg is None: component_cfg = {} - for name, proc in pipes: + # Determine whether component should set annotations. In theory I guess + # we should do this by inspecting the meta? Or we could just always + # say "yes" + for name, proc in self.pipeline: + component_cfg.setdefault(name, {}) + component_cfg[name].setdefault("drop", drop) + component_cfg[name].setdefault("set_annotations", False) + for name, proc in self.pipeline: if not hasattr(proc, "update"): continue - grads = {} - kwargs = component_cfg.get(name, {}) - kwargs.setdefault("drop", drop) - proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs) - for key, (W, dW) in grads.items(): - sgd(W, dW, key=key) + proc.update(examples, sgd=None, losses=losses, **component_cfg[name]) + if sgd is not False: + for name, proc in self.pipeline: + if hasattr(proc, "model"): + proc.model.finish_update(sgd) - def rehearse(self, docs, sgd=None, losses=None, config=None): + def rehearse(self, examples, sgd=None, losses=None, config=None): """Make a "rehearsal" update to the models in the pipeline, to prevent forgetting. Rehearsal updates run an initial copy of the model over some data, and update the model so its current predictions are more like the initial ones. This is useful for keeping a pretrained model on-track, even if you're updating it with a smaller set of examples. - docs (iterable): A batch of `Doc` objects. + examples (iterable): A batch of `Doc` objects. drop (float): The dropout rate. sgd (callable): An optimizer. RETURNS (dict): Results from the update. 
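For orientation, here is a minimal sketch of how the reworked `update`/`select_pipes` API above might be driven from a training script. Everything in it is illustrative rather than part of this patch: `nlp` is assumed to be a loaded `Language` object, `train_examples` is assumed to already be a list of `Example` objects (plain `Doc`s or texts would be converted by `Example.to_example_objects` as shown above), and `"ner"` is a placeholder pipe name.

```python
import random
from spacy.util import minibatch

# Illustrative sketch only; see assumptions in the lead-in above.
optimizer = nlp.begin_training(lambda: train_examples)
with nlp.select_pipes(enable="ner"):  # all other pipes disabled, then restored on exit
    for epoch in range(10):
        random.shuffle(train_examples)
        losses = {}
        for batch in minibatch(train_examples, size=8):
            # drop, sgd and losses are keyword-only in the new signature
            nlp.update(batch, sgd=optimizer, drop=0.2, losses=losses)
        print(epoch, losses)
```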
@@ -542,22 +595,18 @@ class Language(object): EXAMPLE: >>> raw_text_batches = minibatch(raw_texts) >>> for labelled_batch in minibatch(zip(train_docs, train_golds)): - >>> docs, golds = zip(*train_docs) - >>> nlp.update(docs, golds) + >>> nlp.update(labelled_batch) >>> raw_batch = [nlp.make_doc(text) for text in next(raw_text_batches)] >>> nlp.rehearse(raw_batch) """ # TODO: document - if len(docs) == 0: + if len(examples) == 0: return + examples = Example.to_example_objects(examples, make_doc=self.make_doc) if sgd is None: if self._optimizer is None: - self._optimizer = create_default_optimizer(Model.ops) + self._optimizer = create_default_optimizer() sgd = self._optimizer - docs = list(docs) - for i, doc in enumerate(docs): - if isinstance(doc, basestring_): - docs[i] = self.make_doc(doc) pipes = list(self.pipeline) random.shuffle(pipes) if config is None: @@ -567,61 +616,61 @@ class Language(object): def get_grads(W, dW, key=None): grads[key] = (W, dW) - get_grads.alpha = sgd.alpha + get_grads.learn_rate = sgd.learn_rate get_grads.b1 = sgd.b1 get_grads.b2 = sgd.b2 for name, proc in pipes: if not hasattr(proc, "rehearse"): continue grads = {} - proc.rehearse(docs, sgd=get_grads, losses=losses, **config.get(name, {})) - for key, (W, dW) in grads.items(): - sgd(W, dW, key=key) + proc.rehearse( + examples, sgd=get_grads, losses=losses, **config.get(name, {}) + ) + for key, (W, dW) in grads.items(): + sgd(W, dW, key=key) return losses - def preprocess_gold(self, docs_golds): + def preprocess_gold(self, examples): """Can be called before training to pre-process gold data. By default, it handles nonprojectivity and adds missing tags to the tag map. - docs_golds (iterable): Tuples of `Doc` and `GoldParse` objects. - YIELDS (tuple): Tuples of preprocessed `Doc` and `GoldParse` objects. + examples (iterable): `Example` objects. + YIELDS (tuple): `Example` objects. """ for name, proc in self.pipeline: if hasattr(proc, "preprocess_gold"): - docs_golds = proc.preprocess_gold(docs_golds) - for doc, gold in docs_golds: - yield doc, gold + examples = proc.preprocess_gold(examples) + for ex in examples: + yield ex - def begin_training(self, get_gold_tuples=None, sgd=None, component_cfg=None, **cfg): + def begin_training(self, get_examples=None, sgd=None, component_cfg=None, **cfg): """Allocate models, pre-process training data and acquire a trainer and optimizer. Used as a contextmanager. - get_gold_tuples (function): Function returning gold data + get_examples (function): Function returning example training data (TODO: document format change since 3.0) component_cfg (dict): Config parameters for specific components. **cfg: Config parameters. RETURNS: An optimizer. 
DOCS: https://spacy.io/api/language#begin_training """ - if get_gold_tuples is None: - get_gold_tuples = lambda: [] + # TODO: throw warning when get_gold_tuples is provided instead of get_examples + if get_examples is None: + get_examples = lambda: [] # Populate vocab else: - for _, annots_brackets in get_gold_tuples(): - _ = annots_brackets.pop() - for annots, _ in annots_brackets: - for word in annots[1]: - _ = self.vocab[word] # noqa: F841 + for example in get_examples(): + for word in example.token_annotation.words: + _ = self.vocab[word] # noqa: F841 + if cfg.get("device", -1) >= 0: util.use_gpu(cfg["device"]) if self.vocab.vectors.data.shape[1] >= 1: - self.vocab.vectors.data = Model.ops.asarray(self.vocab.vectors.data) + ops = get_current_ops() + self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data) link_vectors_to_models(self.vocab) - if self.vocab.vectors.data.shape[1]: - cfg["pretrained_vectors"] = self.vocab.vectors.name - cfg["pretrained_dims"] = self.vocab.vectors.data.shape[1] if sgd is None: - sgd = create_default_optimizer(Model.ops) + sgd = create_default_optimizer() self._optimizer = sgd if component_cfg is None: component_cfg = {} @@ -630,11 +679,9 @@ class Language(object): kwargs = component_cfg.get(name, {}) kwargs.update(cfg) proc.begin_training( - get_gold_tuples, - pipeline=self.pipeline, - sgd=self._optimizer, - **kwargs + get_examples, pipeline=self.pipeline, sgd=self._optimizer, **kwargs ) + self._link_components() return self._optimizer def resume_training(self, sgd=None, **cfg): @@ -648,13 +695,12 @@ class Language(object): """ if cfg.get("device", -1) >= 0: util.use_gpu(cfg["device"]) + ops = get_current_ops() if self.vocab.vectors.data.shape[1] >= 1: - self.vocab.vectors.data = Model.ops.asarray(self.vocab.vectors.data) + self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data) link_vectors_to_models(self.vocab) - if self.vocab.vectors.data.shape[1]: - cfg["pretrained_vectors"] = self.vocab.vectors.name if sgd is None: - sgd = create_default_optimizer(Model.ops) + sgd = create_default_optimizer() self._optimizer = sgd for name, proc in self.pipeline: if hasattr(proc, "_rehearsal_model"): @@ -662,11 +708,11 @@ class Language(object): return self._optimizer def evaluate( - self, docs_golds, verbose=False, batch_size=256, scorer=None, component_cfg=None + self, examples, verbose=False, batch_size=256, scorer=None, component_cfg=None ): """Evaluate a model's pipeline components. - docs_golds (iterable): Tuples of `Doc` and `GoldParse` objects. + examples (iterable): `Example` objects. verbose (bool): Print debugging information. batch_size (int): Batch size to use. scorer (Scorer): Optional `Scorer` to use. 
If not passed in, a new one @@ -677,30 +723,24 @@ class Language(object): DOCS: https://spacy.io/api/language#evaluate """ + examples = Example.to_example_objects(examples, make_doc=self.make_doc) if scorer is None: scorer = Scorer(pipeline=self.pipeline) if component_cfg is None: component_cfg = {} - docs, golds = zip(*docs_golds) - docs = [ - self.make_doc(doc) if isinstance(doc, basestring_) else doc for doc in docs - ] - golds = list(golds) for name, pipe in self.pipeline: kwargs = component_cfg.get(name, {}) kwargs.setdefault("batch_size", batch_size) if not hasattr(pipe, "pipe"): - docs = _pipe(docs, pipe, kwargs) + examples = _pipe(examples, pipe, kwargs) else: - docs = pipe.pipe(docs, **kwargs) - for doc, gold in zip(docs, golds): - if not isinstance(gold, GoldParse): - gold = GoldParse(doc, **gold) + examples = pipe.pipe(examples, as_example=True, **kwargs) + for ex in examples: if verbose: - print(doc) + print(ex.doc) kwargs = component_cfg.get("scorer", {}) kwargs.setdefault("verbose", verbose) - scorer.score(doc, gold, **kwargs) + scorer.score(ex, **kwargs) return scorer @contextmanager @@ -719,7 +759,7 @@ class Language(object): contexts = [ pipe.use_params(params) for name, pipe in self.pipeline - if hasattr(pipe, "use_params") + if hasattr(pipe, "use_params") and hasattr(pipe, "model") ] # TODO: Having trouble with contextlib # Workaround: these aren't actually context managers atm. @@ -745,6 +785,7 @@ class Language(object): cleanup=False, component_cfg=None, n_process=1, + as_example=False, ): """Process texts as a stream, and yield `Doc` objects in order. @@ -764,9 +805,6 @@ class Language(object): DOCS: https://spacy.io/api/language#pipe """ - if is_python2 and n_process != 1: - warnings.warn(Warnings.W023) - n_process = 1 if n_threads != -1: warnings.warn(Warnings.W016, DeprecationWarning) if n_process == -1: @@ -781,8 +819,9 @@ class Language(object): disable=disable, n_process=n_process, component_cfg=component_cfg, + as_example=as_example, ) - for doc, context in izip(docs, contexts): + for doc, context in zip(docs, contexts): yield (doc, context) return if component_cfg is None: @@ -852,7 +891,7 @@ class Language(object): *[mp.Pipe(False) for _ in range(n_process)] ) - batch_texts = minibatch(texts, batch_size) + batch_texts = util.minibatch(texts, batch_size) # Sender sends texts to the workers. # This is necessary to properly handle infinite length of texts. # (In this case, all data cannot be sent to the workers at once) @@ -864,14 +903,7 @@ class Language(object): procs = [ mp.Process( target=_apply_pipes, - args=( - self.make_doc, - pipes, - rch, - sch, - Underscore.get_state(), - load_nlp.VECTORS, - ), + args=(self.make_doc, pipes, rch, sch, Underscore.get_state()), ) for rch, sch in zip(texts_q, bytedocs_send_ch) ] @@ -892,6 +924,16 @@ class Language(object): for proc in procs: proc.terminate() + def _link_components(self): + """Register 'listeners' within pipeline components, to allow them to + effectively share weights. + """ + for i, (name1, proc1) in enumerate(self.pipeline): + if hasattr(proc1, "find_listeners"): + for name2, proc2 in self.pipeline[i:]: + if hasattr(proc2, "model"): + proc1.find_listeners(proc2.model) + def to_disk(self, path, exclude=tuple(), disable=None): """Save the current state to a directory. If a model is loaded, this will include the model. 
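A hedged sketch of the `evaluate()` call shown a little further up: it now consumes `Example` objects (or inputs that can be converted to them) rather than `(Doc, GoldParse)` pairs. The names `nlp` and `dev_examples` are placeholders.

```python
# Assumes `nlp` is a Language object and `dev_examples` is a list of Example
# objects; texts or Docs would be converted via Example.to_example_objects.
scorer = nlp.evaluate(dev_examples, batch_size=64)
print(scorer.scores)  # the Scorer exposes its aggregated metrics as a dict
```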
@@ -906,13 +948,14 @@ class Language(object): warnings.warn(Warnings.W014, DeprecationWarning) exclude = disable path = util.ensure_path(path) - serializers = OrderedDict() + serializers = {} serializers["tokenizer"] = lambda p: self.tokenizer.to_disk( p, exclude=["vocab"] ) serializers["meta.json"] = lambda p: p.open("w").write( srsly.json_dumps(self.meta) ) + serializers["config.cfg"] = lambda p: self.config.to_disk(p) for name, proc in self.pipeline: if not hasattr(proc, "name"): continue @@ -939,7 +982,9 @@ class Language(object): warnings.warn(Warnings.W014, DeprecationWarning) exclude = disable path = util.ensure_path(path) - deserializers = OrderedDict() + deserializers = {} + if Path(path / "config.cfg").exists(): + deserializers["config.cfg"] = lambda p: self.config.from_disk(p) deserializers["meta.json"] = lambda p: self.meta.update(srsly.read_json(p)) deserializers["vocab"] = lambda p: self.vocab.from_disk( p @@ -960,6 +1005,7 @@ class Language(object): exclude = list(exclude) + ["vocab"] util.from_disk(path, deserializers, exclude) self._path = path + self._link_components() return self def to_bytes(self, exclude=tuple(), disable=None, **kwargs): @@ -973,12 +1019,11 @@ class Language(object): if disable is not None: warnings.warn(Warnings.W014, DeprecationWarning) exclude = disable - serializers = OrderedDict() + serializers = {} serializers["vocab"] = lambda: self.vocab.to_bytes() serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"]) - serializers["meta.json"] = lambda: srsly.json_dumps( - OrderedDict(sorted(self.meta.items())) - ) + serializers["meta.json"] = lambda: srsly.json_dumps(self.meta) + serializers["config.cfg"] = lambda: self.config.to_bytes() for name, proc in self.pipeline: if name in exclude: continue @@ -1000,7 +1045,8 @@ class Language(object): if disable is not None: warnings.warn(Warnings.W014, DeprecationWarning) exclude = disable - deserializers = OrderedDict() + deserializers = {} + deserializers["config.cfg"] = lambda b: self.config.from_bytes(b) deserializers["meta.json"] = lambda b: self.meta.update(srsly.json_loads(b)) deserializers["vocab"] = lambda b: self.vocab.from_bytes( b @@ -1018,6 +1064,7 @@ class Language(object): ) exclude = util.get_serialization_exclude(deserializers, exclude, kwargs) util.from_bytes(bytes_data, deserializers, exclude) + self._link_components() return self @@ -1026,14 +1073,21 @@ class component(object): and class components and will automatically register components in the Language.factories. If the component is a class and needs access to the nlp object or config parameters, it can expose a from_nlp classmethod - that takes the nlp object and **cfg arguments and returns the initialized - component. + that takes the nlp & model objects and **cfg arguments, and returns the + initialized component. """ # NB: This decorator needs to live here, because it needs to write to # Language.factories. All other solutions would cause circular import. - def __init__(self, name=None, assigns=tuple(), requires=tuple(), retokenizes=False): + def __init__( + self, + name=None, + assigns=tuple(), + requires=tuple(), + retokenizes=False, + default_model=lambda: None, + ): """Decorate a pipeline component. name (unicode): Default component and factory name. 
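To make the new `default_model` argument of the `@component` decorator concrete, here is a hedged sketch of a class component registered through it. The component name, the `token.tag` annotation and the `Linear` stand-in model are all illustrative assumptions, not part of this patch; the point is only that the factory built by the decorator now receives a `model`, falling back to `default_model()` when none is configured.

```python
from thinc.api import Linear
from spacy.language import component

@component("my_pipe", assigns=["token.tag"], default_model=lambda: Linear())
class MyPipe:
    @classmethod
    def from_nlp(cls, nlp, model, **cfg):
        # The generated factory calls from_nlp(nlp, model, **cfg) if it exists;
        # `model` is either the configured model or the result of default_model().
        return cls(model)

    def __init__(self, model):
        self.model = model

    def __call__(self, doc):
        return doc
```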
@@ -1045,6 +1099,7 @@ class component(object): self.assigns = validate_attrs(assigns) self.requires = validate_attrs(requires) self.retokenizes = retokenizes + self.default_model = default_model def __call__(self, *args, **kwargs): obj = args[0] @@ -1056,10 +1111,15 @@ class component(object): obj.requires = self.requires obj.retokenizes = self.retokenizes - def factory(nlp, **cfg): + def factory(nlp, model, **cfg): + if model is None: + model = self.default_model() + warnings.warn(Warnings.W098.format(name=self.name)) + if model is None: + warnings.warn(Warnings.W097.format(name=self.name)) if hasattr(obj, "from_nlp"): - return obj.from_nlp(nlp, **cfg) - elif isinstance(obj, class_types): + return obj.from_nlp(nlp, model, **cfg) + elif isinstance(obj, type): return obj() return obj @@ -1075,7 +1135,7 @@ def _fix_pretrained_vectors_name(nlp): elif not nlp.vocab.vectors.size: nlp.vocab.vectors.name = None elif "name" in nlp.meta and "lang" in nlp.meta: - vectors_name = "%s_%s.vectors" % (nlp.meta["lang"], nlp.meta["name"]) + vectors_name = f"{nlp.meta['lang']}_{nlp.meta['name']}.vectors" nlp.vocab.vectors.name = vectors_name else: raise ValueError(Errors.E092) @@ -1091,7 +1151,7 @@ def _fix_pretrained_vectors_name(nlp): class DisabledPipes(list): """Manager for temporary pipeline disabling.""" - def __init__(self, nlp, *names): + def __init__(self, nlp, names): self.nlp = nlp self.names = names # Important! Not deep copy -- we just want the container (but we also @@ -1118,18 +1178,18 @@ class DisabledPipes(list): self[:] = [] -def _pipe(docs, proc, kwargs): +def _pipe(examples, proc, kwargs): # We added some args for pipe that __call__ doesn't expect. kwargs = dict(kwargs) for arg in ["n_threads", "batch_size"]: if arg in kwargs: kwargs.pop(arg) - for doc in docs: - doc = proc(doc, **kwargs) - yield doc + for ex in examples: + ex = proc(ex, **kwargs) + yield ex -def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state, vectors): +def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state): """Worker for Language.pipe receiver (multiprocessing.Connection): Pipe to receive text. Usually @@ -1137,10 +1197,8 @@ def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state, vectors): sender (multiprocessing.Connection): Pipe to send doc. Usually created by `multiprocessing.Pipe()` underscore_state (tuple): The data in the Underscore class of the parent - vectors (dict): The global vectors data, copied from the parent """ Underscore.load_state(underscore_state) - load_nlp.VECTORS = vectors while True: texts = receiver.get() docs = (make_doc(text) for text in texts) diff --git a/spacy/lemmatizer.py b/spacy/lemmatizer.py index 1f0f0da3f..517a10866 100644 --- a/spacy/lemmatizer.py +++ b/spacy/lemmatizer.py @@ -1,8 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - -from collections import OrderedDict - from .symbols import NOUN, VERB, ADJ, PUNCT, PROPN from .errors import Errors from .lookups import Lookups @@ -158,7 +153,7 @@ class Lemmatizer(object): else: oov_forms.append(form) # Remove duplicates but preserve the ordering of applied "rules" - forms = list(OrderedDict.fromkeys(forms)) + forms = list(dict.fromkeys(forms)) # Put exceptions at the front of the list, so they get priority. # This is a dodgy heuristic -- but it's the best we can do until we get # frequencies on this. 
We can at least prune out problematic exceptions, diff --git a/spacy/lexeme.pxd b/spacy/lexeme.pxd index 167f57462..c99b6912a 100644 --- a/spacy/lexeme.pxd +++ b/spacy/lexeme.pxd @@ -1,3 +1,5 @@ +from numpy cimport ndarray + from .typedefs cimport attr_t, hash_t, flags_t, len_t, tag_t from .attrs cimport attr_id_t from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, LANG @@ -6,8 +8,6 @@ from .structs cimport LexemeC from .strings cimport StringStore from .vocab cimport Vocab -from numpy cimport ndarray - cdef LexemeC EMPTY_LEXEME cdef attr_t OOV_RANK diff --git a/spacy/lexeme.pyx b/spacy/lexeme.pyx index dec2993fa..40aab697e 100644 --- a/spacy/lexeme.pyx +++ b/spacy/lexeme.pyx @@ -1,7 +1,4 @@ # cython: embedsignature=True -# coding: utf8 -from __future__ import unicode_literals, print_function - # Compiler crashes on memory view coercion without this. Should report bug. from cython.view cimport array as cvarray from libc.string cimport memset @@ -9,8 +6,8 @@ cimport numpy as np np.import_array() import numpy +from thinc.api import get_array_module import warnings -from thinc.neural.util import get_array_module from libc.stdint cimport UINT64_MAX from .typedefs cimport attr_t, flags_t diff --git a/spacy/lookups.py b/spacy/lookups.py index 1fa29bdfe..7e49f4dca 100644 --- a/spacy/lookups.py +++ b/spacy/lookups.py @@ -1,9 +1,6 @@ -# coding: utf-8 -from __future__ import unicode_literals - import srsly -from collections import OrderedDict from preshed.bloom import BloomFilter +from collections import OrderedDict from .errors import Errors from .util import SimpleFrozenDict, ensure_path @@ -28,7 +25,7 @@ class Lookups(object): DOCS: https://spacy.io/api/lookups#init """ - self._tables = OrderedDict() + self._tables = {} def __contains__(self, name): """Check if the lookups contain a table of a given name. Delegates to @@ -118,7 +115,7 @@ class Lookups(object): DOCS: https://spacy.io/api/lookups#from_bytes """ - self._tables = OrderedDict() + self._tables = {} for key, value in srsly.msgpack_loads(bytes_data).items(): self._tables[key] = Table(key) self._tables[key].update(value) @@ -254,12 +251,12 @@ class Table(OrderedDict): DOCS: https://spacy.io/api/lookups#table.to_bytes """ - data = [ - ("name", self.name), - ("dict", dict(self.items())), - ("bloom", self.bloom.to_bytes()), - ] - return srsly.msgpack_dumps(OrderedDict(data)) + data = { + "name": self.name, + "dict": dict(self.items()), + "bloom": self.bloom.to_bytes(), + } + return srsly.msgpack_dumps(data) def from_bytes(self, bytes_data): """Load a table from a bytestring. 
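The `OrderedDict` → `dict` switch above leans on the fact that plain dicts preserve insertion order on Python 3.6+ (guaranteed by the language from 3.7), which is also what keeps the lemmatizer's `dict.fromkeys` de-duplication order-stable. A tiny illustration:

```python
# dict.fromkeys() keeps the first occurrence of each key, in insertion order,
# so it can replace OrderedDict.fromkeys() for order-preserving de-duplication.
forms = ["cactus", "cacti", "cactus", "cactuses"]
assert list(dict.fromkeys(forms)) == ["cactus", "cacti", "cactuses"]
```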
diff --git a/spacy/matcher/__init__.py b/spacy/matcher/__init__.py index 91874ed43..286844787 100644 --- a/spacy/matcher/__init__.py +++ b/spacy/matcher/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .matcher import Matcher from .phrasematcher import PhraseMatcher from .dependencymatcher import DependencyMatcher diff --git a/spacy/matcher/_schemas.py b/spacy/matcher/_schemas.py deleted file mode 100644 index 4ef7ae49a..000000000 --- a/spacy/matcher/_schemas.py +++ /dev/null @@ -1,204 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - - -TOKEN_PATTERN_SCHEMA = { - "$schema": "http://json-schema.org/draft-06/schema", - "definitions": { - "string_value": { - "anyOf": [ - {"type": "string"}, - { - "type": "object", - "properties": { - "REGEX": {"type": "string"}, - "IN": {"type": "array", "items": {"type": "string"}}, - "NOT_IN": {"type": "array", "items": {"type": "string"}}, - }, - "additionalProperties": False, - }, - ] - }, - "integer_value": { - "anyOf": [ - {"type": "integer"}, - { - "type": "object", - "properties": { - "REGEX": {"type": "string"}, - "IN": {"type": "array", "items": {"type": "integer"}}, - "NOT_IN": {"type": "array", "items": {"type": "integer"}}, - "==": {"type": "integer"}, - ">=": {"type": "integer"}, - "<=": {"type": "integer"}, - ">": {"type": "integer"}, - "<": {"type": "integer"}, - }, - "additionalProperties": False, - }, - ] - }, - "boolean_value": {"type": "boolean"}, - "underscore_value": { - "anyOf": [ - {"type": ["string", "integer", "number", "array", "boolean", "null"]}, - { - "type": "object", - "properties": { - "REGEX": {"type": "string"}, - "IN": { - "type": "array", - "items": {"type": ["string", "integer"]}, - }, - "NOT_IN": { - "type": "array", - "items": {"type": ["string", "integer"]}, - }, - "==": {"type": "integer"}, - ">=": {"type": "integer"}, - "<=": {"type": "integer"}, - ">": {"type": "integer"}, - "<": {"type": "integer"}, - }, - "additionalProperties": False, - }, - ] - }, - }, - "type": "array", - "items": { - "type": "object", - "properties": { - "ORTH": { - "title": "Verbatim token text", - "$ref": "#/definitions/string_value", - }, - "TEXT": { - "title": "Verbatim token text (spaCy v2.1+)", - "$ref": "#/definitions/string_value", - }, - "LOWER": { - "title": "Lowercase form of token text", - "$ref": "#/definitions/string_value", - }, - "POS": { - "title": "Coarse-grained part-of-speech tag", - "$ref": "#/definitions/string_value", - }, - "TAG": { - "title": "Fine-grained part-of-speech tag", - "$ref": "#/definitions/string_value", - }, - "DEP": {"title": "Dependency label", "$ref": "#/definitions/string_value"}, - "LEMMA": { - "title": "Lemma (base form)", - "$ref": "#/definitions/string_value", - }, - "SHAPE": { - "title": "Abstract token shape", - "$ref": "#/definitions/string_value", - }, - "ENT_TYPE": { - "title": "Entity label of single token", - "$ref": "#/definitions/string_value", - }, - "NORM": { - "title": "Normalized form of the token text", - "$ref": "#/definitions/string_value", - }, - "LENGTH": { - "title": "Token character length", - "$ref": "#/definitions/integer_value", - }, - "IS_ALPHA": { - "title": "Token consists of alphabetic characters", - "$ref": "#/definitions/boolean_value", - }, - "IS_ASCII": { - "title": "Token consists of ASCII characters", - "$ref": "#/definitions/boolean_value", - }, - "IS_DIGIT": { - "title": "Token consists of digits", - "$ref": "#/definitions/boolean_value", - }, - "IS_LOWER": { - "title": "Token is lowercase", - "$ref": 
"#/definitions/boolean_value", - }, - "IS_UPPER": { - "title": "Token is uppercase", - "$ref": "#/definitions/boolean_value", - }, - "IS_TITLE": { - "title": "Token is titlecase", - "$ref": "#/definitions/boolean_value", - }, - "IS_PUNCT": { - "title": "Token is punctuation", - "$ref": "#/definitions/boolean_value", - }, - "IS_SPACE": { - "title": "Token is whitespace", - "$ref": "#/definitions/boolean_value", - }, - "IS_BRACKET": { - "title": "Token is a bracket", - "$ref": "#/definitions/boolean_value", - }, - "IS_QUOTE": { - "title": "Token is a quotation mark", - "$ref": "#/definitions/boolean_value", - }, - "IS_LEFT_PUNCT": { - "title": "Token is a left punctuation mark", - "$ref": "#/definitions/boolean_value", - }, - "IS_RIGHT_PUNCT": { - "title": "Token is a right punctuation mark", - "$ref": "#/definitions/boolean_value", - }, - "IS_CURRENCY": { - "title": "Token is a currency symbol", - "$ref": "#/definitions/boolean_value", - }, - "IS_STOP": { - "title": "Token is stop word", - "$ref": "#/definitions/boolean_value", - }, - "IS_SENT_START": { - "title": "Token is the first in a sentence", - "$ref": "#/definitions/boolean_value", - }, - "SENT_START": { - "title": "Token is the first in a sentence", - "$ref": "#/definitions/boolean_value", - }, - "LIKE_NUM": { - "title": "Token resembles a number", - "$ref": "#/definitions/boolean_value", - }, - "LIKE_URL": { - "title": "Token resembles a URL", - "$ref": "#/definitions/boolean_value", - }, - "LIKE_EMAIL": { - "title": "Token resembles an email address", - "$ref": "#/definitions/boolean_value", - }, - "_": { - "title": "Custom extension token attributes (token._.)", - "type": "object", - "patternProperties": { - "^.*$": {"$ref": "#/definitions/underscore_value"} - }, - }, - "OP": { - "title": "Operators / quantifiers", - "type": "string", - "enum": ["+", "*", "?", "!"], - }, - }, - "additionalProperties": False, - }, -} diff --git a/spacy/matcher/dependencymatcher.pyx b/spacy/matcher/dependencymatcher.pyx index 56d27024d..ff707a71c 100644 --- a/spacy/matcher/dependencymatcher.pyx +++ b/spacy/matcher/dependencymatcher.pyx @@ -1,9 +1,9 @@ -# cython: infer_types=True -# cython: profile=True -from __future__ import unicode_literals - +# cython: infer_types=True, profile=True from cymem.cymem cimport Pool from preshed.maps cimport PreshMap +from libcpp cimport bool + +import numpy from .matcher cimport Matcher from ..vocab cimport Vocab @@ -12,8 +12,6 @@ from ..tokens.doc cimport Doc from .matcher import unpickle_matcher from ..errors import Errors -from libcpp cimport bool -import numpy DELIMITER = "||" INDEX_HEAD = 1 @@ -41,7 +39,8 @@ cdef class DependencyMatcher: RETURNS (DependencyMatcher): The newly constructed object. """ size = 20 - self.token_matcher = Matcher(vocab) + # TODO: make matcher work with validation + self.token_matcher = Matcher(vocab, validate=False) self._keys_to_token = {} self._patterns = {} self._root = {} @@ -131,7 +130,7 @@ cdef class DependencyMatcher: # TODO: Better ways to hash edges in pattern? 
for j in range(len(_patterns[i])): k = self._normalize_key(unicode(key) + DELIMITER + unicode(i) + DELIMITER + unicode(j)) - self.token_matcher.add(k, None, _patterns[i][j]) + self.token_matcher.add(k, [_patterns[i][j]]) _keys_to_token[k] = j _keys_to_token_list.append(_keys_to_token) self._keys_to_token.setdefault(key, []) diff --git a/spacy/matcher/matcher.pxd b/spacy/matcher/matcher.pxd index dd04153bf..689734079 100644 --- a/spacy/matcher/matcher.pxd +++ b/spacy/matcher/matcher.pxd @@ -63,7 +63,7 @@ cdef class Matcher: cdef Pool mem cdef vector[TokenPatternC*] patterns cdef readonly Vocab vocab - cdef public object validator + cdef public object validate cdef public object _patterns cdef public object _callbacks cdef public object _extensions diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index 0c1a56187..8bd66cbca 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -1,7 +1,4 @@ -# cython: infer_types=True -# cython: profile=True -from __future__ import unicode_literals - +# cython: infer_types=True, profile=True from libcpp.vector cimport vector from libc.stdint cimport int32_t from cymem.cymem cimport Pool @@ -19,8 +16,7 @@ from ..tokens.span cimport Span from ..tokens.token cimport Token from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH, POS, TAG, DEP, LEMMA -from ._schemas import TOKEN_PATTERN_SCHEMA -from ..util import get_json_validator, validate_json +from ..schemas import validate_token_pattern from ..errors import Errors, MatchPatternError, Warnings from ..strings import get_string_id from ..attrs import IDS @@ -36,7 +32,7 @@ cdef class Matcher: USAGE: https://spacy.io/usage/rule-based-matching """ - def __init__(self, vocab, validate=False): + def __init__(self, vocab, validate=True): """Create the Matcher. 
vocab (Vocab): The vocabulary object, which must be shared with the @@ -50,10 +46,7 @@ cdef class Matcher: self._seen_attrs = set() self.vocab = vocab self.mem = Pool() - if validate: - self.validator = get_json_validator(TOKEN_PATTERN_SCHEMA) - else: - self.validator = None + self.validate = validate def __reduce__(self): data = (self.vocab, self._patterns, self._callbacks) @@ -123,8 +116,8 @@ cdef class Matcher: raise ValueError(Errors.E012.format(key=key)) if not isinstance(pattern, list): raise ValueError(Errors.E178.format(pat=pattern, key=key)) - if self.validator: - errors[i] = validate_json(pattern, self.validator) + if self.validate: + errors[i] = validate_token_pattern(pattern) if any(err for err in errors.values()): raise MatchPatternError(key, errors) key = self._normalize_key(key) @@ -679,8 +672,6 @@ def _get_attr_values(spec, string_store): attr = "ORTH" if attr == "IS_SENT_START": attr = "SENT_START" - if attr not in TOKEN_PATTERN_SCHEMA["items"]["properties"]: - raise ValueError(Errors.E152.format(attr=attr)) attr = IDS.get(attr) if isinstance(value, basestring): value = string_store.add(value) @@ -695,7 +686,7 @@ def _get_attr_values(spec, string_store): if attr is not None: attr_values.append((attr, value)) else: - # should be caught above using TOKEN_PATTERN_SCHEMA + # should be caught in validation raise ValueError(Errors.E152.format(attr=attr)) return attr_values diff --git a/spacy/matcher/phrasematcher.pxd b/spacy/matcher/phrasematcher.pxd index a8e5e5085..3b42f3fab 100644 --- a/spacy/matcher/phrasematcher.pxd +++ b/spacy/matcher/phrasematcher.pxd @@ -1,5 +1,4 @@ from libcpp.vector cimport vector - from cymem.cymem cimport Pool from preshed.maps cimport key_t, MapStruct diff --git a/spacy/matcher/phrasematcher.pyx b/spacy/matcher/phrasematcher.pyx index b66ec35b8..14cc39787 100644 --- a/spacy/matcher/phrasematcher.pyx +++ b/spacy/matcher/phrasematcher.pyx @@ -1,9 +1,5 @@ -# cython: infer_types=True -# cython: profile=True -from __future__ import unicode_literals - +# cython: infer_types=True, profile=True from libc.stdint cimport uintptr_t - from preshed.maps cimport map_init, map_set, map_get, map_clear, map_iter import warnings @@ -13,7 +9,7 @@ from ..structs cimport TokenC from ..tokens.token cimport Token from ..typedefs cimport attr_t -from ._schemas import TOKEN_PATTERN_SCHEMA +from ..schemas import TokenPattern from ..errors import Errors, Warnings @@ -58,7 +54,7 @@ cdef class PhraseMatcher: attr = attr.upper() if attr == "TEXT": attr = "ORTH" - if attr not in TOKEN_PATTERN_SCHEMA["items"]["properties"]: + if attr.lower() not in TokenPattern().dict(): raise ValueError(Errors.E152.format(attr=attr)) self.attr = self.vocab.strings[attr] diff --git a/spacy/ml/__init__.py b/spacy/ml/__init__.py index 57e7ef571..e69de29bb 100644 --- a/spacy/ml/__init__.py +++ b/spacy/ml/__init__.py @@ -1,5 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from .tok2vec import Tok2Vec # noqa: F401 -from .common import FeedForward, LayerNormalizedMaxout # noqa: F401 diff --git a/spacy/ml/_biluo.py b/spacy/ml/_biluo.py new file mode 100644 index 000000000..28339089a --- /dev/null +++ b/spacy/ml/_biluo.py @@ -0,0 +1,109 @@ +"""Thinc layer to do simpler transition-based parsing, NER, etc.""" +from typing import List, Tuple, Dict, Optional +import numpy +from thinc.api import Ops, Model, with_array, softmax_activation, padded2list +from thinc.api import to_numpy +from thinc.types import Padded, Ints1d, Ints3d, Floats2d, Floats3d + +from ..tokens import Doc + + +def 
BILUO() -> Model[Padded, Padded]: + return Model( + "biluo", + forward, + init=init, + dims={"nO": None}, + attrs={"get_num_actions": get_num_actions} + ) + + +def init(model, X: Optional[Padded]=None, Y: Optional[Padded]=None): + if X is not None and Y is not None: + if X.data.shape != Y.data.shape: + # TODO: Fix error + raise ValueError("Mismatched shapes (TODO: Fix message)") + model.set_dim("nO", X.data.shape[2]) + elif X is not None: + model.set_dim("nO", X.data.shape[2]) + elif Y is not None: + model.set_dim("nO", Y.data.shape[2]) + elif model.get_dim("nO") is None: + raise ValueError("Dimension unset for BILUO: nO") + + +def forward(model: Model[Padded, Padded], Xp: Padded, is_train: bool): + n_labels = (model.get_dim("nO") - 1) // 4 + n_tokens, n_docs, n_actions = Xp.data.shape + # At each timestep, we make a validity mask of shape (n_docs, n_actions) + # to indicate which actions are valid next for each sequence. To construct + # the mask, we have a state of shape (2, n_actions) and a validity table of + # shape (2, n_actions+1, n_actions). The first dimension of the state indicates + # whether it's the last token, the second dimension indicates the previous + # action, plus a special 'null action' for the first entry. + valid_transitions = model.ops.asarray(_get_transition_table(n_labels)) + prev_actions = model.ops.alloc1i(n_docs) + # Initialize as though prev action was O + prev_actions.fill(n_actions - 1) + Y = model.ops.alloc3f(*Xp.data.shape) + masks = model.ops.alloc3f(*Y.shape) + max_value = Xp.data.max() + for t in range(Xp.data.shape[0]): + is_last = (Xp.lengths < (t+2)).astype("i") + masks[t] = valid_transitions[is_last, prev_actions] + # Don't train the out-of-bounds sequences. + masks[t, Xp.size_at_t[t]:] = 0 + # Valid actions get 0*10e8, invalid get large negative value + Y[t] = Xp.data[t] + ((masks[t]-1) * max_value * 10) + prev_actions = Y[t].argmax(axis=-1) + + def backprop_biluo(dY: Padded) -> Padded: + dY.data *= masks + return dY + + return Padded(Y, Xp.size_at_t, Xp.lengths, Xp.indices), backprop_biluo + + +def get_num_actions(n_labels: int) -> int: + # One BEGIN action per label + # One IN action per label + # One LAST action per label + # One UNIT action per label + # One OUT action + return n_labels + n_labels + n_labels + n_labels + 1 + + +def _get_transition_table( + n_labels: int, *, _cache: Dict[int, Floats3d] = {} +) -> Floats3d: + n_actions = get_num_actions(n_labels) + if n_actions in _cache: + return _cache[n_actions] + table = numpy.zeros((2, n_actions, n_actions), dtype="f") + B_start, B_end = (0, n_labels) + I_start, I_end = (B_end, B_end + n_labels) + L_start, L_end = (I_end, I_end + n_labels) + U_start, U_end = (L_end, L_end + n_labels) + # Using ranges allows us to set specific cells, which is necessary to express + # that only actions of the same label are valid continuations. + B_range = numpy.arange(B_start, B_end) + I_range = numpy.arange(I_start, I_end) + L_range = numpy.arange(L_start, L_end) + O_action = U_end + # If this is the last token and the previous action was B or I, only L + # of that label is valid + table[1, B_range, L_range] = 1 + table[1, I_range, L_range] = 1 + # If this isn't the last token and the previous action was B or I, only I or + # L of that label are valid. 
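+    # (For example, with n_labels=1 the actions are B=0, I=1, L=2, U=3, O=4 and
+    # get_num_actions(1) == 5; mid-sequence the row for B therefore only allows
+    # the I and L columns, while the rows for L, U and O allow B, U and O.)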
+ table[0, B_range, I_range] = 1 + table[0, B_range, L_range] = 1 + table[0, I_range, I_range] = 1 + table[0, I_range, L_range] = 1 + # If this isn't the last token and the previous was L, U or O, B is valid + table[0, L_start:, :B_end] = 1 + # Regardless of whether this is the last token, if the previous action was + # {L, U, O}, U and O are valid. + table[:, L_start:, U_start:] = 1 + _cache[n_actions] = table + return table diff --git a/spacy/ml/_character_embed.py b/spacy/ml/_character_embed.py new file mode 100644 index 000000000..f4890144a --- /dev/null +++ b/spacy/ml/_character_embed.py @@ -0,0 +1,54 @@ +from thinc.api import Model + + +def CharacterEmbed(nM, nC): + # nM: Number of dimensions per character. nC: Number of characters. + nO = nM * nC if (nM is not None and nC is not None) else None + return Model( + "charembed", + forward, + init=init, + dims={"nM": nM, "nC": nC, "nO": nO, "nV": 256}, + params={"E": None}, + ).initialize() + + +def init(model, X=None, Y=None): + vectors_table = model.ops.alloc3f( + model.get_dim("nC"), model.get_dim("nV"), model.get_dim("nM") + ) + model.set_param("E", vectors_table) + + +def forward(model, docs, is_train): + if docs is None: + return [] + ids = [] + output = [] + E = model.get_param("E") + nC = model.get_dim("nC") + nM = model.get_dim("nM") + nO = model.get_dim("nO") + # This assists in indexing; it's like looping over this dimension. + # Still consider this weird witch craft...But thanks to Mark Neumann + # for the tip. + nCv = model.ops.xp.arange(nC) + for doc in docs: + doc_ids = doc.to_utf8_array(nr_char=nC) + doc_vectors = model.ops.alloc3f(len(doc), nC, nM) + # Let's say I have a 2d array of indices, and a 3d table of data. What numpy + # incantation do I chant to get + # output[i, j, k] == data[j, ids[i, j], k]? 
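+    # Answer: NumPy advanced indexing with a broadcast index. nCv has shape (nC,)
+    # and doc_ids[:, nCv] has shape (len(doc), nC), so E[nCv, doc_ids[:, nCv]]
+    # broadcasts to shape (len(doc), nC, nM) with result[i, j, k] == E[j, ids[i, j], k],
+    # which is what the assignment below writes into doc_vectors.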
+ doc_vectors[:, nCv] = E[nCv, doc_ids[:, nCv]] + output.append(doc_vectors.reshape((len(doc), nO))) + ids.append(doc_ids) + + def backprop(d_output): + dE = model.ops.alloc(E.shape, dtype=E.dtype) + for doc_ids, d_doc_vectors in zip(ids, d_output): + d_doc_vectors = d_doc_vectors.reshape((len(doc_ids), nC, nM)) + dE[nCv, doc_ids[:, nCv]] += d_doc_vectors[:, nCv] + model.inc_grad("E", dE) + return [] + + return output, backprop diff --git a/spacy/ml/_iob.py b/spacy/ml/_iob.py new file mode 100644 index 000000000..0ce9a71e6 --- /dev/null +++ b/spacy/ml/_iob.py @@ -0,0 +1,92 @@ +"""Thinc layer to do simpler transition-based parsing, NER, etc.""" +from typing import List, Tuple, Dict, Optional +from thinc.api import Ops, Model, with_array, softmax_activation, padded2list +from thinc.types import Padded, Ints1d, Ints3d, Floats2d, Floats3d + +from ..tokens import Doc + + +def IOB() -> Model[Padded, Padded]: + return Model( + "biluo", + forward, + init=init, + dims={"nO": None}, + attrs={"get_num_actions": get_num_actions} + ) + + +def init(model, X: Optional[Padded]=None, Y: Optional[Padded]=None): + if X is not None and Y is not None: + if X.data.shape != Y.data.shape: + # TODO: Fix error + raise ValueError("Mismatched shapes (TODO: Fix message)") + model.set_dim("nO", X.data.shape[2]) + elif X is not None: + model.set_dim("nO", X.data.shape[2]) + elif Y is not None: + model.set_dim("nO", Y.data.shape[2]) + elif model.get_dim("nO") is None: + raise ValueError("Dimension unset for BILUO: nO") + + +def forward(model: Model[Padded, Padded], Xp: Padded, is_train: bool): + n_labels = (model.get_dim("nO") - 1) // 2 + n_tokens, n_docs, n_actions = Xp.data.shape + # At each timestep, we make a validity mask of shape (n_docs, n_actions) + # to indicate which actions are valid next for each sequence. To construct + # the mask, we have a state of shape (2, n_actions) and a validity table of + # shape (2, n_actions+1, n_actions). The first dimension of the state indicates + # whether it's the last token, the second dimension indicates the previous + # action, plus a special 'null action' for the first entry. + valid_transitions = _get_transition_table(model.ops, n_labels) + prev_actions = model.ops.alloc1i(n_docs) + # Initialize as though prev action was O + prev_actions.fill(n_actions - 1) + Y = model.ops.alloc3f(*Xp.data.shape) + masks = model.ops.alloc3f(*Y.shape) + for t in range(Xp.data.shape[0]): + masks[t] = valid_transitions[prev_actions] + # Don't train the out-of-bounds sequences. + masks[t, Xp.size_at_t[t]:] = 0 + # Valid actions get 0*10e8, invalid get -1*10e8 + Y[t] = Xp.data[t] + ((masks[t]-1) * 10e8) + prev_actions = Y[t].argmax(axis=-1) + + def backprop_biluo(dY: Padded) -> Padded: + # Masking the gradient seems to do poorly here. But why? 
+ #dY.data *= masks + return dY + + return Padded(Y, Xp.size_at_t, Xp.lengths, Xp.indices), backprop_biluo + + +def get_num_actions(n_labels: int) -> int: + # One BEGIN action per label + # One IN action per label + # One LAST action per label + # One UNIT action per label + # One OUT action + return n_labels * 2 + 1 + + +def _get_transition_table( + ops: Ops, n_labels: int, _cache: Dict[int, Floats3d] = {} +) -> Floats3d: + n_actions = get_num_actions(n_labels) + if n_actions in _cache: + return ops.asarray(_cache[n_actions]) + table = ops.alloc2f(n_actions, n_actions) + B_start, B_end = (0, n_labels) + I_start, I_end = (B_end, B_end + n_labels) + O_action = I_end + B_range = ops.xp.arange(B_start, B_end) + I_range = ops.xp.arange(I_start, I_end) + # B and O are always valid + table[:, B_start : B_end] = 1 + table[:, O_action] = 1 + # I can only follow a matching B + table[B_range, I_range] = 1 + + _cache[n_actions] = table + return table diff --git a/spacy/ml/_legacy_tok2vec.py b/spacy/ml/_legacy_tok2vec.py deleted file mode 100644 index b077a46b7..000000000 --- a/spacy/ml/_legacy_tok2vec.py +++ /dev/null @@ -1,131 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals -from thinc.v2v import Model, Maxout -from thinc.i2v import HashEmbed, StaticVectors -from thinc.t2t import ExtractWindow -from thinc.misc import Residual -from thinc.misc import LayerNorm as LN -from thinc.misc import FeatureExtracter -from thinc.api import layerize, chain, clone, concatenate, with_flatten -from thinc.api import uniqued, wrap, noop - -from ..attrs import ID, ORTH, NORM, PREFIX, SUFFIX, SHAPE - - -def Tok2Vec(width, embed_size, **kwargs): - # Circular imports :( - from .._ml import CharacterEmbed - from .._ml import PyTorchBiLSTM - - pretrained_vectors = kwargs.get("pretrained_vectors", None) - cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3) - subword_features = kwargs.get("subword_features", True) - char_embed = kwargs.get("char_embed", False) - if char_embed: - subword_features = False - conv_depth = kwargs.get("conv_depth", 4) - bilstm_depth = kwargs.get("bilstm_depth", 0) - cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH] - with Model.define_operators({">>": chain, "|": concatenate, "**": clone}): - norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm") - if subword_features: - prefix = HashEmbed( - width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix" - ) - suffix = HashEmbed( - width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix" - ) - shape = HashEmbed( - width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape" - ) - else: - prefix, suffix, shape = (None, None, None) - if pretrained_vectors is not None: - glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID)) - - if subword_features: - embed = uniqued( - (glove | norm | prefix | suffix | shape) - >> LN(Maxout(width, width * 5, pieces=3)), - column=cols.index(ORTH), - ) - else: - embed = uniqued( - (glove | norm) >> LN(Maxout(width, width * 2, pieces=3)), - column=cols.index(ORTH), - ) - elif subword_features: - embed = uniqued( - (norm | prefix | suffix | shape) - >> LN(Maxout(width, width * 4, pieces=3)), - column=cols.index(ORTH), - ) - elif char_embed: - embed = concatenate_lists( - CharacterEmbed(nM=64, nC=8), - FeatureExtracter(cols) >> with_flatten(norm), - ) - reduce_dimensions = LN( - Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces) - ) - else: - embed = norm - - convolution = Residual( - ExtractWindow(nW=1) - >> LN(Maxout(width, 
width * 3, pieces=cnn_maxout_pieces)) - ) - if char_embed: - tok2vec = embed >> with_flatten( - reduce_dimensions >> convolution ** conv_depth, pad=conv_depth - ) - else: - tok2vec = FeatureExtracter(cols) >> with_flatten( - embed >> convolution ** conv_depth, pad=conv_depth - ) - - if bilstm_depth >= 1: - tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth) - # Work around thinc API limitations :(. TODO: Revise in Thinc 7 - tok2vec.nO = width - tok2vec.embed = embed - return tok2vec - - -@layerize -def flatten(seqs, drop=0.0): - ops = Model.ops - lengths = ops.asarray([len(seq) for seq in seqs], dtype="i") - - def finish_update(d_X, sgd=None): - return ops.unflatten(d_X, lengths, pad=0) - - X = ops.flatten(seqs, pad=0) - return X, finish_update - - -def concatenate_lists(*layers, **kwargs): # pragma: no cover - """Compose two or more models `f`, `g`, etc, such that their outputs are - concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))` - """ - if not layers: - return noop() - drop_factor = kwargs.get("drop_factor", 1.0) - ops = layers[0].ops - layers = [chain(layer, flatten) for layer in layers] - concat = concatenate(*layers) - - def concatenate_lists_fwd(Xs, drop=0.0): - if drop is not None: - drop *= drop_factor - lengths = ops.asarray([len(X) for X in Xs], dtype="i") - flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop) - ys = ops.unflatten(flat_y, lengths) - - def concatenate_lists_bwd(d_ys, sgd=None): - return bp_flat_y(ops.flatten(d_ys), sgd=sgd) - - return ys, concatenate_lists_bwd - - model = wrap(concatenate_lists_fwd, concat) - return model diff --git a/spacy/ml/_precomputable_affine.py b/spacy/ml/_precomputable_affine.py new file mode 100644 index 000000000..f4b5b16fe --- /dev/null +++ b/spacy/ml/_precomputable_affine.py @@ -0,0 +1,156 @@ +from thinc.api import Model, normal_init + + +def PrecomputableAffine(nO, nI, nF, nP): + model = Model( + "precomputable_affine", + forward, + init=init, + dims={"nO": nO, "nI": nI, "nF": nF, "nP": nP}, + params={"W": None, "b": None, "pad": None}, + ) + return model + + +def forward(model, X, is_train): + nF = model.get_dim("nF") + nO = model.get_dim("nO") + nP = model.get_dim("nP") + nI = model.get_dim("nI") + W = model.get_param("W") + Yf = model.ops.gemm(X, W.reshape((nF * nO * nP, nI)), trans2=True) + Yf = Yf.reshape((Yf.shape[0], nF, nO, nP)) + Yf = model.ops.xp.vstack((model.get_param("pad"), Yf)) + + def backward(dY_ids): + # This backprop is particularly tricky, because we get back a different + # thing from what we put out. We put out an array of shape: + # (nB, nF, nO, nP), and get back: + # (nB, nO, nP) and ids (nB, nF) + # The ids tell us the values of nF, so we would have: + # + # dYf = zeros((nB, nF, nO, nP)) + # for b in range(nB): + # for f in range(nF): + # dYf[b, ids[b, f]] += dY[b] + # + # However, we avoid building that array for efficiency -- and just pass + # in the indices. 
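+        # Concretely: the ids are used below to gather the input features
+        # (Xf = X[ids]) for the weight gradient and to compute the padding
+        # gradient, while dXf is returned per (example, feature) slot instead
+        # of being scattered back here.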
+ dY, ids = dY_ids + assert dY.ndim == 3 + assert dY.shape[1] == nO, dY.shape + assert dY.shape[2] == nP, dY.shape + # nB = dY.shape[0] + model.inc_grad("pad", _backprop_precomputable_affine_padding(model, dY, ids)) + Xf = X[ids] + Xf = Xf.reshape((Xf.shape[0], nF * nI)) + + model.inc_grad("b", dY.sum(axis=0)) + dY = dY.reshape((dY.shape[0], nO * nP)) + + Wopfi = W.transpose((1, 2, 0, 3)) + Wopfi = model.ops.xp.ascontiguousarray(Wopfi) + Wopfi = Wopfi.reshape((nO * nP, nF * nI)) + dXf = model.ops.gemm(dY.reshape((dY.shape[0], nO * nP)), Wopfi) + + # Reuse the buffer + dWopfi = Wopfi + dWopfi.fill(0.0) + model.ops.gemm(dY, Xf, out=dWopfi, trans1=True) + dWopfi = dWopfi.reshape((nO, nP, nF, nI)) + # (o, p, f, i) --> (f, o, p, i) + model.inc_grad("W", dWopfi.transpose((2, 0, 1, 3))) + return dXf.reshape((dXf.shape[0], nF, nI)) + + return Yf, backward + + +def _backprop_precomputable_affine_padding(model, dY, ids): + nB = dY.shape[0] + nF = model.get_dim("nF") + nP = model.get_dim("nP") + nO = model.get_dim("nO") + # Backprop the "padding", used as a filler for missing values. + # Values that are missing are set to -1, and each state vector could + # have multiple missing values. The padding has different values for + # different missing features. The gradient of the padding vector is: + # + # for b in range(nB): + # for f in range(nF): + # if ids[b, f] < 0: + # d_pad[f] += dY[b] + # + # Which can be rewritten as: + # + # (ids < 0).T @ dY + mask = model.ops.asarray(ids < 0, dtype="f") + d_pad = model.ops.gemm(mask, dY.reshape(nB, nO*nP), trans1=True) + return d_pad.reshape((1, nF, nO, nP)) + + +def init(model, X=None, Y=None): + """This is like the 'layer sequential unit variance', but instead + of taking the actual inputs, we randomly generate whitened data. + + Why's this all so complicated? We have a huge number of inputs, + and the maxout unit makes guessing the dynamics tricky. Instead + we set the maxout weights to values that empirically result in + whitened outputs given whitened inputs. + """ + if model.has_param("W") and model.get_param("W").any(): + return + + nF = model.get_dim("nF") + nO = model.get_dim("nO") + nP = model.get_dim("nP") + nI = model.get_dim("nI") + W = model.ops.alloc4f(nF, nO, nP, nI) + b = model.ops.alloc2f(nO, nP) + pad = model.ops.alloc4f(1, nF, nO, nP) + + ops = model.ops + W = normal_init(ops, W.shape, mean=float(ops.xp.sqrt(1.0 / nF * nI))) + model.set_param("W", W) + model.set_param("b", b) + model.set_param("pad", pad) + + ids = ops.alloc((5000, nF), dtype="f") + ids += ops.xp.random.uniform(0, 1000, ids.shape) + ids = ops.asarray(ids, dtype="i") + tokvecs = ops.alloc((5000, nI), dtype="f") + tokvecs += ops.xp.random.normal(loc=0.0, scale=1.0, size=tokvecs.size).reshape( + tokvecs.shape + ) + + def predict(ids, tokvecs): + # nS ids. nW tokvecs. Exclude the padding array. 
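+            # Run the affine layer over the random "token vectors", accumulate
+            # the per-feature activations into the sampled state vectors, then
+            # add the bias and apply the maxout (or relu) non-linearity.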
+ hiddens = model.predict(tokvecs[:-1]) # (nW, f, o, p) + vectors = model.ops.alloc((ids.shape[0], nO * nP), dtype="f") + # need nS vectors + hiddens = hiddens.reshape((hiddens.shape[0] * nF, nO * nP)) + model.ops.scatter_add(vectors, ids.flatten(), hiddens) + vectors = vectors.reshape((vectors.shape[0], nO, nP)) + vectors += b + vectors = model.ops.asarray(vectors) + if nP >= 2: + return model.ops.maxout(vectors)[0] + else: + return vectors * (vectors >= 0) + + tol_var = 0.01 + tol_mean = 0.01 + t_max = 10 + W = model.get_param("W").copy() + b = model.get_param("b").copy() + for t_i in range(t_max): + acts1 = predict(ids, tokvecs) + var = model.ops.xp.var(acts1) + mean = model.ops.xp.mean(acts1) + if abs(var - 1.0) >= tol_var: + W /= model.ops.xp.sqrt(var) + model.set_param("W", W) + elif abs(mean) >= tol_mean: + b -= mean + model.set_param("b", b) + else: + break diff --git a/spacy/ml/_wire.py b/spacy/ml/_wire.py deleted file mode 100644 index fa271b37c..000000000 --- a/spacy/ml/_wire.py +++ /dev/null @@ -1,42 +0,0 @@ -from __future__ import unicode_literals -from thinc.api import layerize, wrap, noop, chain, concatenate -from thinc.v2v import Model - - -def concatenate_lists(*layers, **kwargs): # pragma: no cover - """Compose two or more models `f`, `g`, etc, such that their outputs are - concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))` - """ - if not layers: - return layerize(noop()) - drop_factor = kwargs.get("drop_factor", 1.0) - ops = layers[0].ops - layers = [chain(layer, flatten) for layer in layers] - concat = concatenate(*layers) - - def concatenate_lists_fwd(Xs, drop=0.0): - if drop is not None: - drop *= drop_factor - lengths = ops.asarray([len(X) for X in Xs], dtype="i") - flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop) - ys = ops.unflatten(flat_y, lengths) - - def concatenate_lists_bwd(d_ys, sgd=None): - return bp_flat_y(ops.flatten(d_ys), sgd=sgd) - - return ys, concatenate_lists_bwd - - model = wrap(concatenate_lists_fwd, concat) - return model - - -@layerize -def flatten(seqs, drop=0.0): - ops = Model.ops - lengths = ops.asarray([len(seq) for seq in seqs], dtype="i") - - def finish_update(d_X, sgd=None): - return ops.unflatten(d_X, lengths, pad=0) - - X = ops.flatten(seqs, pad=0) - return X, finish_update diff --git a/spacy/ml/common.py b/spacy/ml/common.py deleted file mode 100644 index f90b53a15..000000000 --- a/spacy/ml/common.py +++ /dev/null @@ -1,23 +0,0 @@ -from __future__ import unicode_literals - -from thinc.api import chain -from thinc.v2v import Maxout -from thinc.misc import LayerNorm -from ..util import registry, make_layer - - -@registry.architectures.register("thinc.FeedForward.v1") -def FeedForward(config): - layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]] - model = chain(*layers) - model.cfg = config - return model - - -@registry.architectures.register("spacy.LayerNormalizedMaxout.v1") -def LayerNormalizedMaxout(config): - width = config["width"] - pieces = config["pieces"] - layer = LayerNorm(Maxout(width, pieces=pieces)) - layer.nO = width - return layer diff --git a/spacy/ml/extract_ngrams.py b/spacy/ml/extract_ngrams.py new file mode 100644 index 000000000..f9f691aae --- /dev/null +++ b/spacy/ml/extract_ngrams.py @@ -0,0 +1,36 @@ +import numpy +from thinc.api import Model + +from ..attrs import LOWER + + +def extract_ngrams(ngram_size, attr=LOWER) -> Model: + model = Model("extract_ngrams", forward) + model.attrs["ngram_size"] = ngram_size + model.attrs["attr"] = attr + return model + + +def 
forward(model, docs, is_train: bool): + batch_keys = [] + batch_vals = [] + for doc in docs: + unigrams = model.ops.asarray(doc.to_array([model.attrs["attr"]])) + ngrams = [unigrams] + for n in range(2, model.attrs["ngram_size"] + 1): + ngrams.append(model.ops.ngrams(n, unigrams)) + keys = model.ops.xp.concatenate(ngrams) + keys, vals = model.ops.xp.unique(keys, return_counts=True) + batch_keys.append(keys) + batch_vals.append(vals) + # The dtype here matches what thinc is expecting -- which differs per + # platform (by int definition). This should be fixed once the problem + # is fixed on Thinc's side. + lengths = model.ops.asarray([arr.shape[0] for arr in batch_keys], dtype=numpy.int_) + batch_keys = model.ops.xp.concatenate(batch_keys) + batch_vals = model.ops.asarray(model.ops.xp.concatenate(batch_vals), dtype="f") + + def backprop(dY): + return [] + + return (batch_keys, batch_vals, lengths), backprop diff --git a/spacy/ml/models/__init__.py b/spacy/ml/models/__init__.py new file mode 100644 index 000000000..ef1e8efca --- /dev/null +++ b/spacy/ml/models/__init__.py @@ -0,0 +1,7 @@ +from .entity_linker import * # noqa +from .parser import * # noqa +from .simple_ner import * +from .tagger import * # noqa +from .tensorizer import * # noqa +from .textcat import * # noqa +from .tok2vec import * # noqa diff --git a/spacy/ml/models/entity_linker.py b/spacy/ml/models/entity_linker.py new file mode 100644 index 000000000..00689e85b --- /dev/null +++ b/spacy/ml/models/entity_linker.py @@ -0,0 +1,33 @@ +from pathlib import Path + +from thinc.api import chain, clone, list2ragged, reduce_mean, residual +from thinc.api import Model, Maxout, Linear + +from ...util import registry +from ...kb import KnowledgeBase +from ...vocab import Vocab + + +@registry.architectures.register("spacy.EntityLinker.v1") +def build_nel_encoder(tok2vec, nO=None): + with Model.define_operators({">>": chain, "**": clone}): + token_width = tok2vec.get_dim("nO") + output_layer = Linear(nO=nO, nI=token_width) + model = ( + tok2vec + >> list2ragged() + >> reduce_mean() + >> residual(Maxout(nO=token_width, nI=token_width, nP=2, dropout=0.0)) + >> output_layer + ) + model.set_ref("output_layer", output_layer) + model.set_ref("tok2vec", tok2vec) + return model + + +@registry.assets.register("spacy.KBFromFile.v1") +def load_kb(nlp_path, kb_path) -> KnowledgeBase: + vocab = Vocab().from_disk(Path(nlp_path) / "vocab") + kb = KnowledgeBase(vocab=vocab) + kb.load_bulk(kb_path) + return kb diff --git a/spacy/ml/models/multi_task.py b/spacy/ml/models/multi_task.py new file mode 100644 index 000000000..1c193df82 --- /dev/null +++ b/spacy/ml/models/multi_task.py @@ -0,0 +1,29 @@ +from thinc.api import chain, Maxout, LayerNorm, Softmax, Linear, zero_init + + +def build_multi_task_model(n_tags, tok2vec=None, token_vector_width=96): + model = chain( + tok2vec, + Maxout(nO=token_vector_width * 2, nI=token_vector_width, nP=3, dropout=0.0), + LayerNorm(token_vector_width * 2), + Softmax(nO=n_tags, nI=token_vector_width * 2), + ) + return model + + +def build_cloze_multi_task_model(vocab, tok2vec): + output_size = vocab.vectors.data.shape[1] + output_layer = chain( + Maxout( + nO=output_size, nI=tok2vec.get_dim("nO"), nP=3, normalize=True, dropout=0.0 + ), + Linear(nO=output_size, nI=output_size, init_W=zero_init), + ) + model = chain(tok2vec, output_layer) + model = build_masked_language_model(vocab, model) + return model + + +def build_masked_language_model(*args, **kwargs): + # TODO cf 
https://github.com/explosion/spaCy/blob/2c107f02a4d60bda2440db0aad1a88cbbf4fb52d/spacy/_ml.py#L828 + raise NotImplementedError diff --git a/spacy/ml/models/parser.py b/spacy/ml/models/parser.py new file mode 100644 index 000000000..710d36a1d --- /dev/null +++ b/spacy/ml/models/parser.py @@ -0,0 +1,38 @@ +from pydantic import StrictInt +from thinc.api import Model, chain, list2array, Linear, zero_init, use_ops, with_array + +from ...util import registry +from .._precomputable_affine import PrecomputableAffine +from ..tb_framework import TransitionModel + + +@registry.architectures.register("spacy.TransitionBasedParser.v1") +def build_tb_parser_model( + tok2vec: Model, + nr_feature_tokens: StrictInt, + hidden_width: StrictInt, + maxout_pieces: StrictInt, + use_upper=True, + nO=None, +): + token_vector_width = tok2vec.get_dim("nO") + tok2vec = chain( + tok2vec, + with_array(Linear(hidden_width, token_vector_width)), + list2array(), + ) + tok2vec.set_dim("nO", hidden_width) + + lower = PrecomputableAffine( + nO=hidden_width if use_upper else nO, + nF=nr_feature_tokens, + nI=tok2vec.get_dim("nO"), + nP=maxout_pieces + ) + if use_upper: + with use_ops("numpy"): + # Initialize weights at zero, as it's a classification layer. + upper = Linear(nO=nO, init_W=zero_init) + else: + upper = None + return TransitionModel(tok2vec, lower, upper) diff --git a/spacy/ml/models/simple_ner.py b/spacy/ml/models/simple_ner.py new file mode 100644 index 000000000..01661f55b --- /dev/null +++ b/spacy/ml/models/simple_ner.py @@ -0,0 +1,82 @@ +import functools +from typing import List, Tuple, Dict, Optional +from thinc.api import Ops, Model, Linear, Softmax, with_array, softmax_activation, padded2list +from thinc.api import chain, list2padded, configure_normal_init +from thinc.api import Dropout +from thinc.types import Padded, Ints1d, Ints3d, Floats2d, Floats3d + +from ...tokens import Doc +from .._biluo import BILUO +from .._iob import IOB +from ...util import registry + + +@registry.architectures.register("spacy.BiluoTagger.v1") +def BiluoTagger(tok2vec: Model[List[Doc], List[Floats2d]]) -> Model[List[Doc], List[Floats2d]]: + biluo = BILUO() + linear = Linear( + nO=None, + nI=tok2vec.get_dim("nO"), + init_W=configure_normal_init(mean=0.02) + ) + model = chain( + tok2vec, + list2padded(), + with_array(chain(Dropout(0.1), linear)), + biluo, + with_array(softmax_activation()), + padded2list() + ) + + return Model( + "biluo-tagger", + forward, + init=init, + layers=[model, linear], + refs={"tok2vec": tok2vec, "linear": linear, "biluo": biluo}, + dims={"nO": None}, + attrs={"get_num_actions": biluo.attrs["get_num_actions"]} + ) + +@registry.architectures.register("spacy.IOBTagger.v1") +def IOBTagger(tok2vec: Model[List[Doc], List[Floats2d]]) -> Model[List[Doc], List[Floats2d]]: + biluo = IOB() + linear = Linear(nO=None, nI=tok2vec.get_dim("nO")) + model = chain( + tok2vec, + list2padded(), + with_array(linear), + biluo, + with_array(softmax_activation()), + padded2list() + ) + + return Model( + "iob-tagger", + forward, + init=init, + layers=[model], + refs={"tok2vec": tok2vec, "linear": linear, "biluo": biluo}, + dims={"nO": None}, + attrs={"get_num_actions": biluo.attrs["get_num_actions"]} + ) + + + +def init(model: Model[List[Doc], List[Floats2d]], X=None, Y=None) -> None: + if model.get_dim("nO") is None and Y: + model.set_dim("nO", Y[0].shape[1]) + nO = model.get_dim("nO") + biluo = model.get_ref("biluo") + linear = model.get_ref("linear") + biluo.set_dim("nO", nO) + if linear.has_dim("nO") is None: + 
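+        # Thinc's has_dim returns None when the dim is registered but unset, so
+        # only fill in nO from the data if it wasn't set explicitly.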
linear.set_dim("nO", nO) + model.layers[0].initialize(X=X, Y=Y) + + +def forward(model: Model, X: List[Doc], is_train: bool): + return model.layers[0](X, is_train) + + +__all__ = ["BiluoTagger"] diff --git a/spacy/ml/models/tagger.py b/spacy/ml/models/tagger.py new file mode 100644 index 000000000..683c8b518 --- /dev/null +++ b/spacy/ml/models/tagger.py @@ -0,0 +1,17 @@ +from thinc.api import zero_init, with_array, Softmax, chain, Model, Dropout +from thinc.api import glorot_uniform_init + +from ...util import registry + + +@registry.architectures.register("spacy.Tagger.v1") +def build_tagger_model(tok2vec, nO=None) -> Model: + token_vector_width = tok2vec.get_dim("nO") + # TODO: glorot_uniform_init seems to work a bit better than zero_init here?! + output_layer = Softmax(nO, nI=token_vector_width, init_W=zero_init) + softmax = with_array(output_layer) + model = chain(tok2vec, softmax) + model.set_ref("tok2vec", tok2vec) + model.set_ref("softmax", output_layer) + model.set_ref("output_layer", output_layer) + return model diff --git a/spacy/ml/models/tensorizer.py b/spacy/ml/models/tensorizer.py new file mode 100644 index 000000000..f66610b64 --- /dev/null +++ b/spacy/ml/models/tensorizer.py @@ -0,0 +1,10 @@ +from thinc.api import Linear, zero_init + +from ... import util +from ...util import registry + + +@registry.architectures.register("spacy.Tensorizer.v1") +def build_tensorizer(input_size, output_size): + input_size = util.env_opt("token_vector_width", input_size) + return Linear(output_size, input_size, init_W=zero_init) diff --git a/spacy/ml/models/textcat.py b/spacy/ml/models/textcat.py new file mode 100644 index 000000000..ce31d058c --- /dev/null +++ b/spacy/ml/models/textcat.py @@ -0,0 +1,135 @@ +from thinc.api import Model, reduce_mean, Linear, list2ragged, Logistic, ParametricAttention +from thinc.api import chain, concatenate, clone, Dropout +from thinc.api import SparseLinear, Softmax, softmax_activation, Maxout, reduce_sum, Relu, residual, expand_window +from thinc.api import HashEmbed, with_ragged, with_array, with_cpu, uniqued, FeatureExtractor + +from ..spacy_vectors import SpacyVectors +from ... import util +from ...attrs import ID, ORTH, NORM, PREFIX, SUFFIX, SHAPE, LOWER +from ...util import registry +from ..extract_ngrams import extract_ngrams + + +@registry.architectures.register("spacy.TextCatCNN.v1") +def build_simple_cnn_text_classifier(tok2vec, exclusive_classes, nO=None): + """ + Build a simple CNN text classifier, given a token-to-vector model as inputs. + If exclusive_classes=True, a softmax non-linearity is applied, so that the + outputs sum to 1. If exclusive_classes=False, a logistic non-linearity + is applied instead, so that outputs are in the range [0, 1]. 
+ """ + with Model.define_operators({">>": chain}): + if exclusive_classes: + output_layer = Softmax(nO=nO, nI=tok2vec.get_dim("nO")) + model = tok2vec >> list2ragged() >> reduce_mean() >> output_layer + model.set_ref("output_layer", output_layer) + else: + linear_layer = Linear(nO=nO, nI=tok2vec.get_dim("nO")) + model = ( + tok2vec >> list2ragged() >> reduce_mean() >> linear_layer >> Logistic() + ) + model.set_ref("output_layer", linear_layer) + model.set_ref("tok2vec", tok2vec) + model.set_dim("nO", nO) + return model + + +@registry.architectures.register("spacy.TextCatBOW.v1") +def build_bow_text_classifier(exclusive_classes, ngram_size, no_output_layer, nO=None): + with Model.define_operators({">>": chain}): + sparse_linear = SparseLinear(nO) + model = extract_ngrams(ngram_size, attr=ORTH) >> sparse_linear + model = with_cpu(model, model.ops) + if not no_output_layer: + output_layer = softmax_activation() if exclusive_classes else Logistic() + model = model >> with_cpu(output_layer, output_layer.ops) + model.set_ref("output_layer", sparse_linear) + return model + + +@registry.architectures.register("spacy.TextCat.v1") +def build_text_classifier(width, embed_size, pretrained_vectors, exclusive_classes, ngram_size, + window_size, conv_depth, nO=None): + cols = [ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID] + with Model.define_operators({">>": chain, "|": concatenate, "**": clone}): + lower = HashEmbed(nO=width, nV=embed_size, column=cols.index(LOWER)) + prefix = HashEmbed(nO=width // 2, nV=embed_size, column=cols.index(PREFIX)) + suffix = HashEmbed(nO=width // 2, nV=embed_size, column=cols.index(SUFFIX)) + shape = HashEmbed(nO=width // 2, nV=embed_size, column=cols.index(SHAPE)) + + width_nI = sum(layer.get_dim("nO") for layer in [lower, prefix, suffix, shape]) + trained_vectors = FeatureExtractor(cols) >> with_array( + uniqued( + (lower | prefix | suffix | shape) + >> Maxout(nO=width, nI=width_nI, normalize=True), + column=cols.index(ORTH), + ) + ) + + if pretrained_vectors: + nlp = util.load_model(pretrained_vectors) + vectors = nlp.vocab.vectors + vector_dim = vectors.data.shape[1] + + static_vectors = SpacyVectors(vectors) >> with_array( + Linear(width, vector_dim) + ) + vector_layer = trained_vectors | static_vectors + vectors_width = width * 2 + else: + vector_layer = trained_vectors + vectors_width = width + tok2vec = vector_layer >> with_array( + Maxout(width, vectors_width, normalize=True) + >> residual((expand_window(window_size=window_size) + >> Maxout(nO=width, nI=width * ((window_size * 2) + 1), normalize=True))) ** conv_depth, + pad=conv_depth, + ) + cnn_model = ( + tok2vec + >> list2ragged() + >> ParametricAttention(width) + >> reduce_sum() + >> residual(Maxout(nO=width, nI=width)) + >> Linear(nO=nO, nI=width) + >> Dropout(0.0) + ) + + linear_model = build_bow_text_classifier( + nO=nO, ngram_size=ngram_size, exclusive_classes=exclusive_classes, no_output_layer=False + ) + nO_double = nO*2 if nO else None + if exclusive_classes: + output_layer = Softmax(nO=nO, nI=nO_double) + else: + output_layer = ( + Linear(nO=nO, nI=nO_double) >> Dropout(0.0) >> Logistic() + ) + model = (linear_model | cnn_model) >> output_layer + model.set_ref("tok2vec", tok2vec) + if model.has_dim("nO") is not False: + model.set_dim("nO", nO) + model.set_ref("output_layer", linear_model.get_ref("output_layer")) + return model + + +@registry.architectures.register("spacy.TextCatLowData.v1") +def build_text_classifier_lowdata(width, pretrained_vectors, nO=None): + nlp = util.load_model(pretrained_vectors) + 
vectors = nlp.vocab.vectors + vector_dim = vectors.data.shape[1] + + # Note, before v.3, this was the default if setting "low_data" and "pretrained_dims" + with Model.define_operators({">>": chain, "**": clone}): + model = ( + SpacyVectors(vectors) + >> list2ragged() + >> with_ragged(0, Linear(width, vector_dim)) + >> ParametricAttention(width) + >> reduce_sum() + >> residual(Relu(width, width)) ** 2 + >> Linear(nO, width) + >> Dropout(0.0) + >> Logistic() + ) + return model diff --git a/spacy/ml/models/tok2vec.py b/spacy/ml/models/tok2vec.py new file mode 100644 index 000000000..a2e8f589a --- /dev/null +++ b/spacy/ml/models/tok2vec.py @@ -0,0 +1,340 @@ +from thinc.api import chain, clone, concatenate, with_array, uniqued +from thinc.api import Model, noop, with_padded, Maxout, expand_window +from thinc.api import HashEmbed, StaticVectors, PyTorchLSTM +from thinc.api import residual, LayerNorm, FeatureExtractor, Mish + +from ... import util +from ...util import registry +from ...ml import _character_embed +from ...pipeline.tok2vec import Tok2VecListener +from ...attrs import ID, ORTH, NORM, PREFIX, SUFFIX, SHAPE + + +@registry.architectures.register("spacy.Tok2VecTensors.v1") +def tok2vec_tensors_v1(width): + tok2vec = Tok2VecListener("tok2vec", width=width) + return tok2vec + + +@registry.architectures.register("spacy.VocabVectors.v1") +def get_vocab_vectors(name): + nlp = util.load_model(name) + return nlp.vocab.vectors + + +@registry.architectures.register("spacy.Tok2Vec.v1") +def Tok2Vec(extract, embed, encode): + field_size = 0 + if encode.attrs.get("receptive_field", None): + field_size = encode.attrs["receptive_field"] + with Model.define_operators({">>": chain, "|": concatenate}): + tok2vec = extract >> with_array(embed >> encode, pad=field_size) + tok2vec.set_dim("nO", encode.get_dim("nO")) + tok2vec.set_ref("embed", embed) + tok2vec.set_ref("encode", encode) + return tok2vec + + +@registry.architectures.register("spacy.Doc2Feats.v1") +def Doc2Feats(columns): + return FeatureExtractor(columns) + + +@registry.architectures.register("spacy.HashEmbedCNN.v1") +def hash_embed_cnn( + pretrained_vectors, + width, + depth, + embed_size, + maxout_pieces, + window_size, + subword_features, +): + # Does not use character embeddings: set to False by default + return build_Tok2Vec_model( + width=width, + embed_size=embed_size, + pretrained_vectors=pretrained_vectors, + conv_depth=depth, + bilstm_depth=0, + maxout_pieces=maxout_pieces, + window_size=window_size, + subword_features=subword_features, + char_embed=False, + nM=0, + nC=0, + ) + + +@registry.architectures.register("spacy.HashCharEmbedCNN.v1") +def hash_charembed_cnn( + pretrained_vectors, + width, + depth, + embed_size, + maxout_pieces, + window_size, + nM, + nC, +): + # Allows using character embeddings by setting nC, nM and char_embed=True + return build_Tok2Vec_model( + width=width, + embed_size=embed_size, + pretrained_vectors=pretrained_vectors, + conv_depth=depth, + bilstm_depth=0, + maxout_pieces=maxout_pieces, + window_size=window_size, + subword_features=False, + char_embed=True, + nM=nM, + nC=nC, + ) + + +@registry.architectures.register("spacy.HashEmbedBiLSTM.v1") +def hash_embed_bilstm_v1( + pretrained_vectors, width, depth, embed_size, subword_features, maxout_pieces +): + # Does not use character embeddings: set to False by default + return build_Tok2Vec_model( + width=width, + embed_size=embed_size, + pretrained_vectors=pretrained_vectors, + bilstm_depth=depth, + conv_depth=0, + maxout_pieces=maxout_pieces, + 
window_size=1, + subword_features=subword_features, + char_embed=False, + nM=0, + nC=0, + ) + + +@registry.architectures.register("spacy.HashCharEmbedBiLSTM.v1") +def hash_char_embed_bilstm_v1( + pretrained_vectors, width, depth, embed_size, maxout_pieces, nM, nC +): + # Allows using character embeddings by setting nC, nM and char_embed=True + return build_Tok2Vec_model( + width=width, + embed_size=embed_size, + pretrained_vectors=pretrained_vectors, + bilstm_depth=depth, + conv_depth=0, + maxout_pieces=maxout_pieces, + window_size=1, + subword_features=False, + char_embed=True, + nM=nM, + nC=nC, + ) + + +@registry.architectures.register("spacy.LayerNormalizedMaxout.v1") +def LayerNormalizedMaxout(width, maxout_pieces): + return Maxout( + nO=width, + nP=maxout_pieces, + dropout=0.0, + normalize=True, + ) + + +@registry.architectures.register("spacy.MultiHashEmbed.v1") +def MultiHashEmbed(columns, width, rows, use_subwords, pretrained_vectors, mix): + norm = HashEmbed(nO=width, nV=rows, column=columns.index("NORM")) + if use_subwords: + prefix = HashEmbed(nO=width, nV=rows // 2, column=columns.index("PREFIX")) + suffix = HashEmbed(nO=width, nV=rows // 2, column=columns.index("SUFFIX")) + shape = HashEmbed(nO=width, nV=rows // 2, column=columns.index("SHAPE")) + + if pretrained_vectors: + glove = StaticVectors( + vectors=pretrained_vectors.data, + nO=width, + column=columns.index(ID), + dropout=0.0, + ) + + with Model.define_operators({">>": chain, "|": concatenate}): + if not use_subwords and not pretrained_vectors: + embed_layer = norm + else: + if use_subwords and pretrained_vectors: + nr_columns = 5 + concat_columns = glove | norm | prefix | suffix | shape + elif use_subwords: + nr_columns = 4 + concat_columns = norm | prefix | suffix | shape + else: + nr_columns = 2 + concat_columns = glove | norm + + embed_layer = uniqued(concat_columns >> mix, column=columns.index("ORTH")) + + return embed_layer + + +@registry.architectures.register("spacy.CharacterEmbed.v1") +def CharacterEmbed(columns, width, rows, nM, nC, features): + norm = HashEmbed(nO=width, nV=rows, column=columns.index("NORM")) + chr_embed = _character_embed.CharacterEmbed(nM=nM, nC=nC) + with Model.define_operators({">>": chain, "|": concatenate}): + embed_layer = chr_embed | features >> with_array(norm) + embed_layer.set_dim("nO", nM * nC + width) + return embed_layer + + +@registry.architectures.register("spacy.MaxoutWindowEncoder.v1") +def MaxoutWindowEncoder(width, window_size, maxout_pieces, depth): + cnn = chain( + expand_window(window_size=window_size), + Maxout(nO=width, nI=width * ((window_size * 2) + 1), nP=maxout_pieces, dropout=0.0, normalize=True), + ) + model = clone(residual(cnn), depth) + model.set_dim("nO", width) + model.attrs["receptive_field"] = window_size * depth + return model + + +@registry.architectures.register("spacy.MishWindowEncoder.v1") +def MishWindowEncoder(width, window_size, depth): + cnn = chain( + expand_window(window_size=window_size), + Mish(nO=width, nI=width * ((window_size * 2) + 1)), + LayerNorm(width), + ) + model = clone(residual(cnn), depth) + model.set_dim("nO", width) + return model + + +@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1") +def TorchBiLSTMEncoder(width, depth): + import torch.nn + + # TODO FIX + from thinc.api import PyTorchRNNWrapper + + if depth == 0: + return noop() + return with_padded( + PyTorchRNNWrapper(torch.nn.LSTM(width, width // 2, depth, bidirectional=True)) + ) + + +def build_Tok2Vec_model( + width, + embed_size, + pretrained_vectors, + 
window_size, + maxout_pieces, + subword_features, + char_embed, + nM, + nC, + conv_depth, + bilstm_depth, +) -> Model: + if char_embed: + subword_features = False + cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH] + with Model.define_operators({">>": chain, "|": concatenate, "**": clone}): + norm = HashEmbed(nO=width, nV=embed_size, column=cols.index(NORM)) + if subword_features: + prefix = HashEmbed(nO=width, nV=embed_size // 2, column=cols.index(PREFIX)) + suffix = HashEmbed(nO=width, nV=embed_size // 2, column=cols.index(SUFFIX)) + shape = HashEmbed(nO=width, nV=embed_size // 2, column=cols.index(SHAPE)) + else: + prefix, suffix, shape = (None, None, None) + if pretrained_vectors is not None: + glove = StaticVectors( + vectors=pretrained_vectors.data, + nO=width, + column=cols.index(ID), + dropout=0.0, + ) + + if subword_features: + columns = 5 + embed = uniqued( + (glove | norm | prefix | suffix | shape) + >> Maxout( + nO=width, + nI=width * columns, + nP=maxout_pieces, + dropout=0.0, + normalize=True, + ), + column=cols.index(ORTH), + ) + else: + columns = 2 + embed = uniqued( + (glove | norm) + >> Maxout( + nO=width, + nI=width * columns, + nP=maxout_pieces, + dropout=0.0, + normalize=True, + ), + column=cols.index(ORTH), + ) + elif subword_features: + columns = 4 + embed = uniqued( + concatenate(norm, prefix, suffix, shape) + >> Maxout( + nO=width, + nI=width * columns, + nP=maxout_pieces, + dropout=0.0, + normalize=True, + ), + column=cols.index(ORTH), + ) + elif char_embed: + embed = _character_embed.CharacterEmbed(nM=nM, nC=nC) | FeatureExtractor( + cols + ) >> with_array(norm) + reduce_dimensions = Maxout( + nO=width, + nI=nM * nC + width, + nP=maxout_pieces, + dropout=0.0, + normalize=True, + ) + else: + embed = norm + + convolution = residual( + expand_window(window_size=window_size) + >> Maxout( + nO=width, + nI=width * ((window_size * 2) + 1), + nP=maxout_pieces, + dropout=0.0, + normalize=True, + ) + ) + if char_embed: + tok2vec = embed >> with_array( + reduce_dimensions >> convolution ** conv_depth, pad=conv_depth + ) + else: + tok2vec = FeatureExtractor(cols) >> with_array( + embed >> convolution ** conv_depth, pad=conv_depth + ) + + if bilstm_depth >= 1: + tok2vec = tok2vec >> PyTorchLSTM( + nO=width, nI=width, depth=bilstm_depth, bi=True + ) + if tok2vec.has_dim("nO") is not False: + tok2vec.set_dim("nO", width) + tok2vec.set_ref("embed", embed) + return tok2vec diff --git a/spacy/ml/spacy_vectors.py b/spacy/ml/spacy_vectors.py new file mode 100644 index 000000000..2a4988494 --- /dev/null +++ b/spacy/ml/spacy_vectors.py @@ -0,0 +1,27 @@ +import numpy +from thinc.api import Model, Unserializable + + +def SpacyVectors(vectors) -> Model: + attrs = {"vectors": Unserializable(vectors)} + model = Model("spacy_vectors", forward, attrs=attrs) + return model + + +def forward(model, docs, is_train: bool): + batch = [] + vectors = model.attrs["vectors"].obj + for doc in docs: + indices = numpy.zeros((len(doc),), dtype="i") + for i, word in enumerate(doc): + if word.orth in vectors.key2row: + indices[i] = vectors.key2row[word.orth] + else: + indices[i] = 0 + batch_vectors = vectors.data[indices] + batch.append(batch_vectors) + + def backprop(dY): + return None + + return batch, backprop diff --git a/spacy/ml/tb_framework.py b/spacy/ml/tb_framework.py new file mode 100644 index 000000000..e4301a644 --- /dev/null +++ b/spacy/ml/tb_framework.py @@ -0,0 +1,86 @@ +from thinc.api import Model, noop, use_ops, Linear +from ..syntax._parser_model import ParserStepModel + + +def 
TransitionModel(tok2vec, lower, upper, unseen_classes=set()): + """Set up a stepwise transition-based model""" + if upper is None: + has_upper = False + upper = noop() + else: + has_upper = True + # don't define nO for this object, because we can't dynamically change it + return Model( + name="parser_model", + forward=forward, + dims={"nI": tok2vec.get_dim("nI") if tok2vec.has_dim("nI") else None}, + layers=[tok2vec, lower, upper], + refs={"tok2vec": tok2vec, "lower": lower, "upper": upper}, + init=init, + attrs={ + "has_upper": has_upper, + "unseen_classes": set(unseen_classes), + "resize_output": resize_output + } + ) + + +def forward(model, X, is_train): + step_model = ParserStepModel( + X, + model.layers, + unseen_classes=model.attrs["unseen_classes"], + train=is_train, + has_upper=model.attrs["has_upper"] + ) + + return step_model, step_model.finish_steps + + +def init(model, X=None, Y=None): + tok2vec = model.get_ref("tok2vec").initialize() + lower = model.get_ref("lower").initialize(X=X) + if model.attrs["has_upper"]: + statevecs = model.ops.alloc2f(2, lower.get_dim("nO")) + model.get_ref("upper").initialize(X=statevecs) + + +def resize_output(model, new_nO): + tok2vec = model.get_ref("tok2vec") + lower = model.get_ref("lower") + upper = model.get_ref("upper") + if not model.attrs["has_upper"]: + if lower.has_dim("nO") is None: + lower.set_dim("nO", new_nO) + return + elif upper.has_dim("nO") is None: + upper.set_dim("nO", new_nO) + return + elif new_nO == upper.get_dim("nO"): + return + smaller = upper + nI = None + if smaller.has_dim("nI"): + nI = smaller.get_dim("nI") + with use_ops('numpy'): + larger = Linear(nO=new_nO, nI=nI) + larger.init = smaller.init + # it could be that the model is not initialized yet, then skip this bit + if nI: + larger_W = larger.ops.alloc2f(new_nO, nI) + larger_b = larger.ops.alloc1f(new_nO) + smaller_W = smaller.get_param("W") + smaller_b = smaller.get_param("b") + # Weights are stored in (nr_out, nr_in) format, so we're basically + # just adding rows here. 
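+        # Copy the existing weights into the top rows; rows for the newly added
+        # classes stay at zero and are recorded as unseen below.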
+ if smaller.has_dim("nO"): + larger_W[:smaller.get_dim("nO")] = smaller_W + larger_b[:smaller.get_dim("nO")] = smaller_b + for i in range(smaller.get_dim("nO"), new_nO): + model.attrs["unseen_classes"].add(i) + + larger.set_param("W", larger_W) + larger.set_param("b", larger_b) + model._layers[-1] = larger + model.set_ref("upper", larger) + return model diff --git a/spacy/ml/tok2vec.py b/spacy/ml/tok2vec.py deleted file mode 100644 index 8f86475ef..000000000 --- a/spacy/ml/tok2vec.py +++ /dev/null @@ -1,176 +0,0 @@ -from __future__ import unicode_literals - -from thinc.api import chain, layerize, clone, concatenate, with_flatten, uniqued -from thinc.api import noop, with_square_sequences -from thinc.v2v import Maxout, Model -from thinc.i2v import HashEmbed, StaticVectors -from thinc.t2t import ExtractWindow -from thinc.misc import Residual, LayerNorm, FeatureExtracter -from ..util import make_layer, registry -from ._wire import concatenate_lists - - -@registry.architectures.register("spacy.Tok2Vec.v1") -def Tok2Vec(config): - doc2feats = make_layer(config["@doc2feats"]) - embed = make_layer(config["@embed"]) - encode = make_layer(config["@encode"]) - field_size = getattr(encode, "receptive_field", 0) - tok2vec = chain(doc2feats, with_flatten(chain(embed, encode), pad=field_size)) - tok2vec.cfg = config - tok2vec.nO = encode.nO - tok2vec.embed = embed - tok2vec.encode = encode - return tok2vec - - -@registry.architectures.register("spacy.Doc2Feats.v1") -def Doc2Feats(config): - columns = config["columns"] - return FeatureExtracter(columns) - - -@registry.architectures.register("spacy.MultiHashEmbed.v1") -def MultiHashEmbed(config): - # For backwards compatibility with models before the architecture registry, - # we have to be careful to get exactly the same model structure. One subtle - # trick is that when we define concatenation with the operator, the operator - # is actually binary associative. So when we write (a | b | c), we're actually - # getting concatenate(concatenate(a, b), c). That's why the implementation - # is a bit ugly here. - cols = config["columns"] - width = config["width"] - rows = config["rows"] - - norm = HashEmbed(width, rows, column=cols.index("NORM"), name="embed_norm") - if config["use_subwords"]: - prefix = HashEmbed( - width, rows // 2, column=cols.index("PREFIX"), name="embed_prefix" - ) - suffix = HashEmbed( - width, rows // 2, column=cols.index("SUFFIX"), name="embed_suffix" - ) - shape = HashEmbed( - width, rows // 2, column=cols.index("SHAPE"), name="embed_shape" - ) - if config.get("@pretrained_vectors"): - glove = make_layer(config["@pretrained_vectors"]) - mix = make_layer(config["@mix"]) - - with Model.define_operators({">>": chain, "|": concatenate}): - if config["use_subwords"] and config["@pretrained_vectors"]: - mix._layers[0].nI = width * 5 - layer = uniqued( - (glove | norm | prefix | suffix | shape) >> mix, - column=cols.index("ORTH"), - ) - elif config["use_subwords"]: - mix._layers[0].nI = width * 4 - layer = uniqued( - (norm | prefix | suffix | shape) >> mix, column=cols.index("ORTH") - ) - elif config["@pretrained_vectors"]: - mix._layers[0].nI = width * 2 - layer = uniqued((glove | norm) >> mix, column=cols.index("ORTH"),) - else: - layer = norm - layer.cfg = config - return layer - - -@registry.architectures.register("spacy.CharacterEmbed.v1") -def CharacterEmbed(config): - from .. 
import _ml - - width = config["width"] - chars = config["chars"] - - chr_embed = _ml.CharacterEmbedModel(nM=width, nC=chars) - other_tables = make_layer(config["@embed_features"]) - mix = make_layer(config["@mix"]) - - model = chain(concatenate_lists(chr_embed, other_tables), mix) - model.cfg = config - return model - - -@registry.architectures.register("spacy.MaxoutWindowEncoder.v1") -def MaxoutWindowEncoder(config): - nO = config["width"] - nW = config["window_size"] - nP = config["pieces"] - depth = config["depth"] - - cnn = chain( - ExtractWindow(nW=nW), LayerNorm(Maxout(nO, nO * ((nW * 2) + 1), pieces=nP)) - ) - model = clone(Residual(cnn), depth) - model.nO = nO - model.receptive_field = nW * depth - return model - - -@registry.architectures.register("spacy.MishWindowEncoder.v1") -def MishWindowEncoder(config): - from thinc.v2v import Mish - - nO = config["width"] - nW = config["window_size"] - depth = config["depth"] - - cnn = chain(ExtractWindow(nW=nW), LayerNorm(Mish(nO, nO * ((nW * 2) + 1)))) - model = clone(Residual(cnn), depth) - model.nO = nO - return model - - -@registry.architectures.register("spacy.PretrainedVectors.v1") -def PretrainedVectors(config): - return StaticVectors(config["vectors_name"], config["width"], config["column"]) - - -@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1") -def TorchBiLSTMEncoder(config): - import torch.nn - from thinc.extra.wrappers import PyTorchWrapperRNN - - width = config["width"] - depth = config["depth"] - if depth == 0: - return layerize(noop()) - return with_square_sequences( - PyTorchWrapperRNN(torch.nn.LSTM(width, width // 2, depth, bidirectional=True)) - ) - - -_EXAMPLE_CONFIG = { - "@doc2feats": { - "arch": "Doc2Feats", - "config": {"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]}, - }, - "@embed": { - "arch": "spacy.MultiHashEmbed.v1", - "config": { - "width": 96, - "rows": 2000, - "columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"], - "use_subwords": True, - "@pretrained_vectors": { - "arch": "TransformedStaticVectors", - "config": { - "vectors_name": "en_vectors_web_lg.vectors", - "width": 96, - "column": 0, - }, - }, - "@mix": { - "arch": "LayerNormalizedMaxout", - "config": {"width": 96, "pieces": 3}, - }, - }, - }, - "@encode": { - "arch": "MaxoutWindowEncode", - "config": {"width": 96, "window_size": 1, "depth": 4, "pieces": 3}, - }, -} diff --git a/spacy/morphology.pxd b/spacy/morphology.pxd index 1a3cedf97..c57e3a1db 100644 --- a/spacy/morphology.pxd +++ b/spacy/morphology.pxd @@ -2,31 +2,31 @@ from cymem.cymem cimport Pool from preshed.maps cimport PreshMap, PreshMapArray from libc.stdint cimport uint64_t from murmurhash cimport mrmr +cimport numpy as np from .structs cimport TokenC, MorphAnalysisC from .strings cimport StringStore from .typedefs cimport hash_t, attr_t, flags_t from .parts_of_speech cimport univ_pos_t - from . 
cimport symbols + cdef class Morphology: cdef readonly Pool mem cdef readonly StringStore strings cdef PreshMap tags # Keyed by hash, value is pointer to tag - + cdef public object lemmatizer cdef readonly object tag_map cdef readonly object tag_names cdef readonly object reverse_index cdef readonly object exc - cdef readonly object _feat_map cdef readonly PreshMapArray _cache cdef readonly int n_tags - cpdef update(self, hash_t morph, features) - cdef hash_t insert(self, MorphAnalysisC tag) except 0 - + cdef MorphAnalysisC create_morph_tag(self, field_feature_pairs) except * + cdef int insert(self, MorphAnalysisC tag) except -1 + cdef int assign_untagged(self, TokenC* token) except -1 cdef int assign_tag(self, TokenC* token, tag) except -1 cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1 @@ -34,8 +34,7 @@ cdef class Morphology: cdef int _assign_tag_from_exceptions(self, TokenC* token, int tag_id) except -1 -cdef int check_feature(const MorphAnalysisC* tag, attr_t feature) nogil -cdef attr_t get_field(const MorphAnalysisC* tag, int field) nogil -cdef list list_features(const MorphAnalysisC* tag) - -cdef tag_to_json(const MorphAnalysisC* tag) +cdef int check_feature(const MorphAnalysisC* morph, attr_t feature) nogil +cdef list list_features(const MorphAnalysisC* morph) +cdef np.ndarray get_by_field(const MorphAnalysisC* morph, attr_t field) +cdef int get_n_by_field(attr_t* results, const MorphAnalysisC* morph, attr_t field) nogil diff --git a/spacy/morphology.pyx b/spacy/morphology.pyx index c146094a9..0b53b124c 100644 --- a/spacy/morphology.pyx +++ b/spacy/morphology.pyx @@ -1,143 +1,51 @@ # cython: infer_types -# coding: utf8 -from __future__ import unicode_literals - from libc.string cimport memset + import srsly from collections import Counter +import numpy +import warnings -from .compat import basestring_ -from .strings import get_string_id -from . import symbols from .attrs cimport POS, IS_SPACE -from .attrs import LEMMA, intify_attrs from .parts_of_speech cimport SPACE -from .parts_of_speech import IDS as POS_IDS from .lexeme cimport Lexeme -from .errors import Errors + +from .strings import get_string_id +from .attrs import LEMMA, intify_attrs +from .parts_of_speech import IDS as POS_IDS +from .errors import Errors, Warnings from .util import ensure_path - - -cdef enum univ_field_t: - Field_POS - Field_Abbr - Field_AdpType - Field_AdvType - Field_Animacy - Field_Aspect - Field_Case - Field_ConjType - Field_Connegative - Field_Definite - Field_Degree - Field_Derivation - Field_Echo - Field_Foreign - Field_Gender - Field_Hyph - Field_InfForm - Field_Mood - Field_NameType - Field_Negative - Field_NounType - Field_Number - Field_NumForm - Field_NumType - Field_NumValue - Field_PartForm - Field_PartType - Field_Person - Field_Polarity - Field_Polite - Field_Poss - Field_Prefix - Field_PrepCase - Field_PronType - Field_PunctSide - Field_PunctType - Field_Reflex - Field_Style - Field_StyleVariant - Field_Tense - Field_Typo - Field_VerbForm - Field_VerbType - Field_Voice +from . import symbols def _normalize_props(props): - """Transform deprecated string keys to correct names.""" + """Convert attrs dict so that POS is always by ID, other features are left + as is as long as they are strings or IDs. 
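+    Illustrative example: {"POS": "NOUN", "Case": "Nom,Acc"} becomes
+    {POS: POS_IDS["NOUN"], "Case": "Acc,Nom"} (values sorted, POS mapped to its ID).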
+ """ out = {} props = dict(props) - for key in FIELDS: - if key in props: - value = str(props[key]).lower() - # We don't have support for disjunctive int|rel features, so - # just take the first one :( - if "|" in value: - value = value.split("|")[0] - attr = '%s_%s' % (key, value) - if attr in FEATURES: - props.pop(key) - props[attr] = True for key, value in props.items(): + # convert POS value to ID if key == POS: if hasattr(value, 'upper'): value = value.upper() if value in POS_IDS: value = POS_IDS[value] out[key] = value - elif isinstance(key, int): - out[key] = value - elif value is True: - out[key] = value - elif key.lower() == 'pos': + elif isinstance(key, str) and key.lower() == 'pos': out[POS] = POS_IDS[value.upper()] - elif key.lower() != 'morph': + # sort values + elif isinstance(value, str) and Morphology.VALUE_SEP in value: + out[key] = Morphology.VALUE_SEP.join( + sorted(value.split(Morphology.VALUE_SEP))) + # accept any string or ID fields and values + elif isinstance(key, (int, str)) and isinstance(value, (int, str)): out[key] = value + else: + warnings.warn(Warnings.W029.format(feature={key: value})) return out -class MorphologyClassMap(object): - def __init__(self, features): - self.features = tuple(features) - self.fields = [] - self.feat2field = {} - seen_fields = set() - for feature in features: - field = feature.split("_", 1)[0] - if field not in seen_fields: - self.fields.append(field) - seen_fields.add(field) - self.feat2field[feature] = FIELDS[field] - self.id2feat = {get_string_id(name): name for name in features} - self.field2feats = {"POS": []} - self.col2info = [] - self.attr2field = dict(LOWER_FIELDS.items()) - self.feat2offset = {} - self.field2col = {} - self.field2id = dict(FIELDS.items()) - self.fieldid2field = {field_id: field for field, field_id in FIELDS.items()} - for feature in features: - field = self.fields[self.feat2field[feature]] - if field not in self.field2col: - self.field2col[field] = len(self.col2info) - if field != "POS" and field not in self.field2feats: - self.col2info.append((field, 0, "NIL")) - self.field2feats.setdefault(field, ["NIL"]) - offset = len(self.field2feats[field]) - self.field2feats[field].append(feature) - self.col2info.append((field, offset, feature)) - self.feat2offset[feature] = offset - - @property - def field_sizes(self): - return [len(self.field2feats[field]) for field in self.fields] - - def get_field_offset(self, field): - return self.field2col[field] - - cdef class Morphology: '''Store the possible morphological analyses for a language, and index them by hash. @@ -146,9 +54,15 @@ cdef class Morphology: analysis, so queries of morphological attributes are delegated to this class. ''' - def __init__(self, StringStore string_store, tag_map, lemmatizer, exc=None): + + FEATURE_SEP = "|" + FIELD_SEP = "=" + VALUE_SEP = "," + EMPTY_MORPH = "_" + + def __init__(self, StringStore strings, tag_map, lemmatizer, exc=None): self.mem = Pool() - self.strings = string_store + self.strings = strings self.tags = PreshMap() # Add special space symbol. We prefix with underscore, to make sure it # always sorts to the end. 
@@ -162,7 +76,6 @@ cdef class Morphology: self.lemmatizer = lemmatizer self.n_tags = len(tag_map) self.reverse_index = {} - self._feat_map = MorphologyClassMap(FEATURES) self._load_from_tag_map(tag_map) self._cache = PreshMapArray(self.n_tags) @@ -176,8 +89,7 @@ cdef class Morphology: def _load_from_tag_map(self, tag_map): for i, (tag_str, attrs) in enumerate(sorted(tag_map.items())): attrs = _normalize_props(attrs) - self.add({self._feat_map.id2feat[feat] for feat in attrs - if feat in self._feat_map.id2feat}) + self.add(attrs) self.tag_map[tag_str] = dict(attrs) self.reverse_index[self.strings.add(tag_str)] = i @@ -186,40 +98,78 @@ cdef class Morphology: self.exc), None, None) def add(self, features): - """Insert a morphological analysis in the morphology table, if not already - present. Returns the hash of the new analysis. + """Insert a morphological analysis in the morphology table, if not + already present. The morphological analysis may be provided in the UD + FEATS format as a string or in the tag map dict format. + Returns the hash of the new analysis. + """ + cdef MorphAnalysisC* tag_ptr + if features == self.EMPTY_MORPH: + features = "" + if isinstance(features, str): + tag_ptr = self.tags.get(self.strings[features]) + if tag_ptr != NULL: + return tag_ptr.key + features = self.feats_to_dict(features) + if not isinstance(features, dict): + warnings.warn(Warnings.W029.format(feature=features)) + features = {} + features = _normalize_props(features) + string_features = {self.strings.as_string(field): self.strings.as_string(values) for field, values in features.items()} + # normalized UFEATS string with sorted fields and values + norm_feats_string = self.FEATURE_SEP.join(sorted([ + self.FIELD_SEP.join([field, values]) + for field, values in string_features.items() + ])) + # intified ("Field", "Field=Value") pairs + field_feature_pairs = [] + for field in sorted(string_features): + values = string_features[field] + for value in values.split(self.VALUE_SEP): + field_feature_pairs.append(( + self.strings.add(field), + self.strings.add(field + self.FIELD_SEP + value), + )) + cdef MorphAnalysisC tag = self.create_morph_tag(field_feature_pairs) + # the hash key for the tag is either the hash of the normalized UFEATS + # string or the hash of an empty placeholder (using the empty string + # would give a hash key of 0, which is not good for PreshMap) + if norm_feats_string: + tag.key = self.strings.add(norm_feats_string) + else: + tag.key = self.strings.add(self.EMPTY_MORPH) + self.insert(tag) + return tag.key + + cdef MorphAnalysisC create_morph_tag(self, field_feature_pairs) except *: + """Creates a MorphAnalysisC from a list of intified + ("Field", "Field=Value") tuples where fields with multiple values have + been split into individual tuples, e.g.: + [("Field1", "Field1=Value1"), ("Field1", "Field1=Value2"), + ("Field2", "Field2=Value3")] """ - for f in features: - if isinstance(f, basestring_): - self.strings.add(f) - string_features = features - features = intify_features(features) - cdef attr_t feature - for feature in features: - if feature != 0 and feature not in self._feat_map.id2feat: - raise ValueError(Errors.E167.format(feat=self.strings[feature], feat_id=feature)) cdef MorphAnalysisC tag - tag = create_rich_tag(features) - cdef hash_t key = self.insert(tag) - return key + tag.length = len(field_feature_pairs) + tag.fields = self.mem.alloc(tag.length, sizeof(attr_t)) + tag.features = self.mem.alloc(tag.length, sizeof(attr_t)) + for i, (field, feature) in 
enumerate(field_feature_pairs): + tag.fields[i] = field + tag.features[i] = feature + return tag + + cdef int insert(self, MorphAnalysisC tag) except -1: + cdef hash_t key = tag.key + if self.tags.get(key) == NULL: + tag_ptr = self.mem.alloc(1, sizeof(MorphAnalysisC)) + tag_ptr[0] = tag + self.tags.set(key, tag_ptr) def get(self, hash_t morph): tag = self.tags.get(morph) if tag == NULL: return [] else: - return tag_to_json(tag) - - cpdef update(self, hash_t morph, features): - """Update a morphological analysis with new feature values.""" - tag = (self.tags.get(morph))[0] - features = intify_features(features) - cdef attr_t feature - for feature in features: - field = FEATURE_FIELDS[FEATURE_NAMES[feature]] - set_feature(&tag, field, feature, 1) - morph = self.insert(tag) - return morph + return self.strings[tag.key] def lemmatize(self, const univ_pos_t univ_pos, attr_t orth, morphology): if orth not in self.strings: @@ -253,19 +203,10 @@ cdef class Morphology: """ attrs = dict(attrs) attrs = _normalize_props(attrs) - self.add({self._feat_map.id2feat[feat] for feat in attrs - if feat in self._feat_map.id2feat}) + self.add(attrs) attrs = intify_attrs(attrs, self.strings, _do_deprecated=True) self.exc[(tag_str, self.strings.add(orth_str))] = attrs - cdef hash_t insert(self, MorphAnalysisC tag) except 0: - cdef hash_t key = hash_tag(tag) - if self.tags.get(key) == NULL: - tag_ptr = self.mem.alloc(1, sizeof(MorphAnalysisC)) - tag_ptr[0] = tag - self.tags.set(key, tag_ptr) - return key - cdef int assign_untagged(self, TokenC* token) except -1: """Set morphological attributes on a token without a POS tag. Uses the lemmatizer's lookup() method, which looks up the string in the @@ -326,782 +267,60 @@ cdef class Morphology: for form_str, attrs in entries.items(): self.add_special_case(tag_str, form_str, attrs) - @classmethod - def create_class_map(cls): - return MorphologyClassMap(FEATURES) + @staticmethod + def feats_to_dict(feats): + if not feats: + return {} + return {field: Morphology.VALUE_SEP.join(sorted(values.split(Morphology.VALUE_SEP))) for field, values in + [feat.split(Morphology.FIELD_SEP) for feat in feats.split(Morphology.FEATURE_SEP)]} + + @staticmethod + def dict_to_feats(feats_dict): + if len(feats_dict) == 0: + return "" + return Morphology.FEATURE_SEP.join(sorted([Morphology.FIELD_SEP.join([field, Morphology.VALUE_SEP.join(sorted(values.split(Morphology.VALUE_SEP)))]) for field, values in feats_dict.items()])) + + @staticmethod + def list_to_feats(feats_list): + if len(feats_list) == 0: + return "" + feats_dict = {} + for feat in feats_list: + field, value = feat.split(Morphology.FIELD_SEP) + if field not in feats_dict: + feats_dict[field] = set() + feats_dict[field].add(value) + feats_dict = {field: Morphology.VALUE_SEP.join(sorted(values)) for field, values in feats_dict.items()} + return Morphology.dict_to_feats(feats_dict) -cpdef univ_pos_t get_int_tag(pos_): - return 0 - -cpdef intify_features(features): - return {get_string_id(feature) for feature in features} - -cdef hash_t hash_tag(MorphAnalysisC tag) nogil: - return mrmr.hash64(&tag, sizeof(tag), 0) +cdef int check_feature(const MorphAnalysisC* morph, attr_t feature) nogil: + cdef int i + for i in range(morph.length): + if morph.features[i] == feature: + return True + return False -cdef MorphAnalysisC create_rich_tag(features) except *: - cdef MorphAnalysisC tag - cdef attr_t feature - memset(&tag, 0, sizeof(tag)) - for feature in features: - field = FEATURE_FIELDS[FEATURE_NAMES[feature]] - set_feature(&tag, field, 
feature, 1) - return tag +cdef list list_features(const MorphAnalysisC* morph): + cdef int i + features = [] + for i in range(morph.length): + features.append(morph.features[i]) + return features -cdef tag_to_json(const MorphAnalysisC* tag): - return [FEATURE_NAMES[f] for f in list_features(tag)] +cdef np.ndarray get_by_field(const MorphAnalysisC* morph, attr_t field): + cdef np.ndarray results = numpy.zeros((morph.length,), dtype="uint64") + n = get_n_by_field(results.data, morph, field) + return results[:n] -cdef MorphAnalysisC tag_from_json(json_tag): - raise NotImplementedError - - -cdef list list_features(const MorphAnalysisC* tag): - output = [] - if tag.abbr != 0: - output.append(tag.abbr) - if tag.adp_type != 0: - output.append(tag.adp_type) - if tag.adv_type != 0: - output.append(tag.adv_type) - if tag.animacy != 0: - output.append(tag.animacy) - if tag.aspect != 0: - output.append(tag.aspect) - if tag.case != 0: - output.append(tag.case) - if tag.conj_type != 0: - output.append(tag.conj_type) - if tag.connegative != 0: - output.append(tag.connegative) - if tag.definite != 0: - output.append(tag.definite) - if tag.degree != 0: - output.append(tag.degree) - if tag.derivation != 0: - output.append(tag.derivation) - if tag.echo != 0: - output.append(tag.echo) - if tag.foreign != 0: - output.append(tag.foreign) - if tag.gender != 0: - output.append(tag.gender) - if tag.hyph != 0: - output.append(tag.hyph) - if tag.inf_form != 0: - output.append(tag.inf_form) - if tag.mood != 0: - output.append(tag.mood) - if tag.negative != 0: - output.append(tag.negative) - if tag.number != 0: - output.append(tag.number) - if tag.name_type != 0: - output.append(tag.name_type) - if tag.noun_type != 0: - output.append(tag.noun_type) - if tag.part_form != 0: - output.append(tag.part_form) - if tag.part_type != 0: - output.append(tag.part_type) - if tag.person != 0: - output.append(tag.person) - if tag.polite != 0: - output.append(tag.polite) - if tag.polarity != 0: - output.append(tag.polarity) - if tag.poss != 0: - output.append(tag.poss) - if tag.prefix != 0: - output.append(tag.prefix) - if tag.prep_case != 0: - output.append(tag.prep_case) - if tag.pron_type != 0: - output.append(tag.pron_type) - if tag.punct_type != 0: - output.append(tag.punct_type) - if tag.reflex != 0: - output.append(tag.reflex) - if tag.style != 0: - output.append(tag.style) - if tag.style_variant != 0: - output.append(tag.style_variant) - if tag.typo != 0: - output.append(tag.typo) - if tag.verb_form != 0: - output.append(tag.verb_form) - if tag.voice != 0: - output.append(tag.voice) - if tag.verb_type != 0: - output.append(tag.verb_type) - return output - - -cdef attr_t get_field(const MorphAnalysisC* tag, int field_id) nogil: - field = field_id - if field == Field_POS: - return tag.pos - if field == Field_Abbr: - return tag.abbr - elif field == Field_AdpType: - return tag.adp_type - elif field == Field_AdvType: - return tag.adv_type - elif field == Field_Animacy: - return tag.animacy - elif field == Field_Aspect: - return tag.aspect - elif field == Field_Case: - return tag.case - elif field == Field_ConjType: - return tag.conj_type - elif field == Field_Connegative: - return tag.connegative - elif field == Field_Definite: - return tag.definite - elif field == Field_Degree: - return tag.degree - elif field == Field_Derivation: - return tag.derivation - elif field == Field_Echo: - return tag.echo - elif field == Field_Foreign: - return tag.foreign - elif field == Field_Gender: - return tag.gender - elif field == Field_Hyph: - 
return tag.hyph - elif field == Field_InfForm: - return tag.inf_form - elif field == Field_Mood: - return tag.mood - elif field == Field_Negative: - return tag.negative - elif field == Field_Number: - return tag.number - elif field == Field_NameType: - return tag.name_type - elif field == Field_NounType: - return tag.noun_type - elif field == Field_NumForm: - return tag.num_form - elif field == Field_NumType: - return tag.num_type - elif field == Field_NumValue: - return tag.num_value - elif field == Field_PartForm: - return tag.part_form - elif field == Field_PartType: - return tag.part_type - elif field == Field_Person: - return tag.person - elif field == Field_Polite: - return tag.polite - elif field == Field_Polarity: - return tag.polarity - elif field == Field_Poss: - return tag.poss - elif field == Field_Prefix: - return tag.prefix - elif field == Field_PrepCase: - return tag.prep_case - elif field == Field_PronType: - return tag.pron_type - elif field == Field_PunctSide: - return tag.punct_side - elif field == Field_PunctType: - return tag.punct_type - elif field == Field_Reflex: - return tag.reflex - elif field == Field_Style: - return tag.style - elif field == Field_StyleVariant: - return tag.style_variant - elif field == Field_Tense: - return tag.tense - elif field == Field_Typo: - return tag.typo - elif field == Field_VerbForm: - return tag.verb_form - elif field == Field_Voice: - return tag.voice - elif field == Field_VerbType: - return tag.verb_type - else: - raise ValueError(Errors.E168.format(field=field_id)) - - -cdef int check_feature(const MorphAnalysisC* tag, attr_t feature) nogil: - if tag.abbr == feature: - return 1 - elif tag.adp_type == feature: - return 1 - elif tag.adv_type == feature: - return 1 - elif tag.animacy == feature: - return 1 - elif tag.aspect == feature: - return 1 - elif tag.case == feature: - return 1 - elif tag.conj_type == feature: - return 1 - elif tag.connegative == feature: - return 1 - elif tag.definite == feature: - return 1 - elif tag.degree == feature: - return 1 - elif tag.derivation == feature: - return 1 - elif tag.echo == feature: - return 1 - elif tag.foreign == feature: - return 1 - elif tag.gender == feature: - return 1 - elif tag.hyph == feature: - return 1 - elif tag.inf_form == feature: - return 1 - elif tag.mood == feature: - return 1 - elif tag.negative == feature: - return 1 - elif tag.number == feature: - return 1 - elif tag.name_type == feature: - return 1 - elif tag.noun_type == feature: - return 1 - elif tag.num_form == feature: - return 1 - elif tag.num_type == feature: - return 1 - elif tag.num_value == feature: - return 1 - elif tag.part_form == feature: - return 1 - elif tag.part_type == feature: - return 1 - elif tag.person == feature: - return 1 - elif tag.polite == feature: - return 1 - elif tag.polarity == feature: - return 1 - elif tag.poss == feature: - return 1 - elif tag.prefix == feature: - return 1 - elif tag.prep_case == feature: - return 1 - elif tag.pron_type == feature: - return 1 - elif tag.punct_side == feature: - return 1 - elif tag.punct_type == feature: - return 1 - elif tag.reflex == feature: - return 1 - elif tag.style == feature: - return 1 - elif tag.style_variant == feature: - return 1 - elif tag.tense == feature: - return 1 - elif tag.typo == feature: - return 1 - elif tag.verb_form == feature: - return 1 - elif tag.voice == feature: - return 1 - elif tag.verb_type == feature: - return 1 - else: - return 0 - -cdef int set_feature(MorphAnalysisC* tag, - univ_field_t field, attr_t feature, int 
value) except -1: - if value == True: - value_ = feature - else: - value_ = 0 - prev_value = get_field(tag, field) - if prev_value != 0 and value_ == 0 and field != Field_POS: - tag.length -= 1 - elif prev_value == 0 and value_ != 0 and field != Field_POS: - tag.length += 1 - if feature == 0: - pass - elif field == Field_POS: - tag.pos = get_string_id(FEATURE_NAMES[value_].split('_')[1]) - elif field == Field_Abbr: - tag.abbr = value_ - elif field == Field_AdpType: - tag.adp_type = value_ - elif field == Field_AdvType: - tag.adv_type = value_ - elif field == Field_Animacy: - tag.animacy = value_ - elif field == Field_Aspect: - tag.aspect = value_ - elif field == Field_Case: - tag.case = value_ - elif field == Field_ConjType: - tag.conj_type = value_ - elif field == Field_Connegative: - tag.connegative = value_ - elif field == Field_Definite: - tag.definite = value_ - elif field == Field_Degree: - tag.degree = value_ - elif field == Field_Derivation: - tag.derivation = value_ - elif field == Field_Echo: - tag.echo = value_ - elif field == Field_Foreign: - tag.foreign = value_ - elif field == Field_Gender: - tag.gender = value_ - elif field == Field_Hyph: - tag.hyph = value_ - elif field == Field_InfForm: - tag.inf_form = value_ - elif field == Field_Mood: - tag.mood = value_ - elif field == Field_Negative: - tag.negative = value_ - elif field == Field_Number: - tag.number = value_ - elif field == Field_NameType: - tag.name_type = value_ - elif field == Field_NounType: - tag.noun_type = value_ - elif field == Field_NumForm: - tag.num_form = value_ - elif field == Field_NumType: - tag.num_type = value_ - elif field == Field_NumValue: - tag.num_value = value_ - elif field == Field_PartForm: - tag.part_form = value_ - elif field == Field_PartType: - tag.part_type = value_ - elif field == Field_Person: - tag.person = value_ - elif field == Field_Polite: - tag.polite = value_ - elif field == Field_Polarity: - tag.polarity = value_ - elif field == Field_Poss: - tag.poss = value_ - elif field == Field_Prefix: - tag.prefix = value_ - elif field == Field_PrepCase: - tag.prep_case = value_ - elif field == Field_PronType: - tag.pron_type = value_ - elif field == Field_PunctSide: - tag.punct_side = value_ - elif field == Field_PunctType: - tag.punct_type = value_ - elif field == Field_Reflex: - tag.reflex = value_ - elif field == Field_Style: - tag.style = value_ - elif field == Field_StyleVariant: - tag.style_variant = value_ - elif field == Field_Tense: - tag.tense = value_ - elif field == Field_Typo: - tag.typo = value_ - elif field == Field_VerbForm: - tag.verb_form = value_ - elif field == Field_Voice: - tag.voice = value_ - elif field == Field_VerbType: - tag.verb_type = value_ - else: - raise ValueError(Errors.E167.format(field=FEATURE_NAMES.get(feature), field_id=feature)) - - -FIELDS = { - 'POS': Field_POS, - 'Abbr': Field_Abbr, - 'AdpType': Field_AdpType, - 'AdvType': Field_AdvType, - 'Animacy': Field_Animacy, - 'Aspect': Field_Aspect, - 'Case': Field_Case, - 'ConjType': Field_ConjType, - 'Connegative': Field_Connegative, - 'Definite': Field_Definite, - 'Degree': Field_Degree, - 'Derivation': Field_Derivation, - 'Echo': Field_Echo, - 'Foreign': Field_Foreign, - 'Gender': Field_Gender, - 'Hyph': Field_Hyph, - 'InfForm': Field_InfForm, - 'Mood': Field_Mood, - 'NameType': Field_NameType, - 'Negative': Field_Negative, - 'NounType': Field_NounType, - 'Number': Field_Number, - 'NumForm': Field_NumForm, - 'NumType': Field_NumType, - 'NumValue': Field_NumValue, - 'PartForm': Field_PartForm, - 
'PartType': Field_PartType, - 'Person': Field_Person, - 'Polite': Field_Polite, - 'Polarity': Field_Polarity, - 'Poss': Field_Poss, - 'Prefix': Field_Prefix, - 'PrepCase': Field_PrepCase, - 'PronType': Field_PronType, - 'PunctSide': Field_PunctSide, - 'PunctType': Field_PunctType, - 'Reflex': Field_Reflex, - 'Style': Field_Style, - 'StyleVariant': Field_StyleVariant, - 'Tense': Field_Tense, - 'Typo': Field_Typo, - 'VerbForm': Field_VerbForm, - 'VerbType': Field_VerbType, - 'Voice': Field_Voice, -} - -LOWER_FIELDS = { - 'pos': Field_POS, - 'abbr': Field_Abbr, - 'adp_type': Field_AdpType, - 'adv_type': Field_AdvType, - 'animacy': Field_Animacy, - 'aspect': Field_Aspect, - 'case': Field_Case, - 'conj_type': Field_ConjType, - 'connegative': Field_Connegative, - 'definite': Field_Definite, - 'degree': Field_Degree, - 'derivation': Field_Derivation, - 'echo': Field_Echo, - 'foreign': Field_Foreign, - 'gender': Field_Gender, - 'hyph': Field_Hyph, - 'inf_form': Field_InfForm, - 'mood': Field_Mood, - 'name_type': Field_NameType, - 'negative': Field_Negative, - 'noun_type': Field_NounType, - 'number': Field_Number, - 'num_form': Field_NumForm, - 'num_type': Field_NumType, - 'num_value': Field_NumValue, - 'part_form': Field_PartForm, - 'part_type': Field_PartType, - 'person': Field_Person, - 'polarity': Field_Polarity, - 'polite': Field_Polite, - 'poss': Field_Poss, - 'prefix': Field_Prefix, - 'prep_case': Field_PrepCase, - 'pron_type': Field_PronType, - 'punct_side': Field_PunctSide, - 'punct_type': Field_PunctType, - 'reflex': Field_Reflex, - 'style': Field_Style, - 'style_variant': Field_StyleVariant, - 'tense': Field_Tense, - 'typo': Field_Typo, - 'verb_form': Field_VerbForm, - 'verb_type': Field_VerbType, - 'voice': Field_Voice, -} - - -FEATURES = [ - "POS_ADJ", - "POS_ADP", - "POS_ADV", - "POS_AUX", - "POS_CONJ", - "POS_CCONJ", - "POS_DET", - "POS_INTJ", - "POS_NOUN", - "POS_NUM", - "POS_PART", - "POS_PRON", - "POS_PROPN", - "POS_PUNCT", - "POS_SCONJ", - "POS_SYM", - "POS_VERB", - "POS_X", - "POS_EOL", - "POS_SPACE", - "Abbr_yes", - "AdpType_circ", - "AdpType_comprep", - "AdpType_prep", - "AdpType_post", - "AdpType_voc", - "AdvType_adadj", - "AdvType_cau", - "AdvType_deg", - "AdvType_ex", - "AdvType_loc", - "AdvType_man", - "AdvType_mod", - "AdvType_sta", - "AdvType_tim", - "Animacy_anim", - "Animacy_hum", - "Animacy_inan", - "Animacy_nhum", - "Aspect_hab", - "Aspect_imp", - "Aspect_iter", - "Aspect_perf", - "Aspect_prog", - "Aspect_prosp", - "Aspect_none", - "Case_abe", - "Case_abl", - "Case_abs", - "Case_acc", - "Case_ade", - "Case_all", - "Case_cau", - "Case_com", - "Case_dat", - "Case_del", - "Case_dis", - "Case_ela", - "Case_ess", - "Case_gen", - "Case_ill", - "Case_ine", - "Case_ins", - "Case_loc", - "Case_lat", - "Case_nom", - "Case_par", - "Case_sub", - "Case_sup", - "Case_tem", - "Case_ter", - "Case_tra", - "Case_voc", - "ConjType_comp", - "ConjType_oper", - "Connegative_yes", - "Definite_cons", - "Definite_def", - "Definite_ind", - "Definite_red", - "Definite_two", - "Degree_abs", - "Degree_cmp", - "Degree_comp", - "Degree_none", - "Degree_pos", - "Degree_sup", - "Degree_com", - "Degree_dim", - "Derivation_minen", - "Derivation_sti", - "Derivation_inen", - "Derivation_lainen", - "Derivation_ja", - "Derivation_ton", - "Derivation_vs", - "Derivation_ttain", - "Derivation_ttaa", - "Echo_rdp", - "Echo_ech", - "Foreign_foreign", - "Foreign_fscript", - "Foreign_tscript", - "Foreign_yes", - "Gender_com", - "Gender_fem", - "Gender_masc", - "Gender_neut", - "Gender_dat_masc", - 
"Gender_dat_fem", - "Gender_erg_masc", - "Gender_erg_fem", - "Gender_psor_masc", - "Gender_psor_fem", - "Gender_psor_neut", - "Hyph_yes", - "InfForm_one", - "InfForm_two", - "InfForm_three", - "Mood_cnd", - "Mood_imp", - "Mood_ind", - "Mood_n", - "Mood_pot", - "Mood_sub", - "Mood_opt", - "NameType_geo", - "NameType_prs", - "NameType_giv", - "NameType_sur", - "NameType_nat", - "NameType_com", - "NameType_pro", - "NameType_oth", - "Negative_neg", - "Negative_pos", - "Negative_yes", - "NounType_com", - "NounType_prop", - "NounType_class", - "Number_com", - "Number_dual", - "Number_none", - "Number_plur", - "Number_sing", - "Number_ptan", - "Number_count", - "Number_abs_sing", - "Number_abs_plur", - "Number_dat_sing", - "Number_dat_plur", - "Number_erg_sing", - "Number_erg_plur", - "Number_psee_sing", - "Number_psee_plur", - "Number_psor_sing", - "Number_psor_plur", - "NumForm_digit", - "NumForm_roman", - "NumForm_word", - "NumForm_combi", - "NumType_card", - "NumType_dist", - "NumType_frac", - "NumType_gen", - "NumType_mult", - "NumType_none", - "NumType_ord", - "NumType_sets", - "NumType_dual", - "NumValue_one", - "NumValue_two", - "NumValue_three", - "PartForm_pres", - "PartForm_past", - "PartForm_agt", - "PartForm_neg", - "PartType_mod", - "PartType_emp", - "PartType_res", - "PartType_inf", - "PartType_vbp", - "Person_one", - "Person_two", - "Person_three", - "Person_none", - "Person_abs_one", - "Person_abs_two", - "Person_abs_three", - "Person_dat_one", - "Person_dat_two", - "Person_dat_three", - "Person_erg_one", - "Person_erg_two", - "Person_erg_three", - "Person_psor_one", - "Person_psor_two", - "Person_psor_three", - "Polarity_neg", - "Polarity_pos", - "Polite_inf", - "Polite_pol", - "Polite_abs_inf", - "Polite_abs_pol", - "Polite_erg_inf", - "Polite_erg_pol", - "Polite_dat_inf", - "Polite_dat_pol", - "Poss_yes", - "Prefix_yes", - "PrepCase_npr", - "PrepCase_pre", - "PronType_advPart", - "PronType_art", - "PronType_default", - "PronType_dem", - "PronType_ind", - "PronType_int", - "PronType_neg", - "PronType_prs", - "PronType_rcp", - "PronType_rel", - "PronType_tot", - "PronType_clit", - "PronType_exc", - "PunctSide_ini", - "PunctSide_fin", - "PunctType_peri", - "PunctType_qest", - "PunctType_excl", - "PunctType_quot", - "PunctType_brck", - "PunctType_comm", - "PunctType_colo", - "PunctType_semi", - "PunctType_dash", - "Reflex_yes", - "Style_arch", - "Style_rare", - "Style_poet", - "Style_norm", - "Style_coll", - "Style_vrnc", - "Style_sing", - "Style_expr", - "Style_derg", - "Style_vulg", - "Style_yes", - "StyleVariant_styleShort", - "StyleVariant_styleBound", - "Tense_fut", - "Tense_imp", - "Tense_past", - "Tense_pres", - "Typo_yes", - "VerbForm_fin", - "VerbForm_ger", - "VerbForm_inf", - "VerbForm_none", - "VerbForm_part", - "VerbForm_partFut", - "VerbForm_partPast", - "VerbForm_partPres", - "VerbForm_sup", - "VerbForm_trans", - "VerbForm_conv", - "VerbForm_gdv", - "VerbType_aux", - "VerbType_cop", - "VerbType_mod", - "VerbType_light", - "Voice_act", - "Voice_cau", - "Voice_pass", - "Voice_mid", - "Voice_int", -] - -FEATURE_NAMES = {get_string_id(f): f for f in FEATURES} -FEATURE_FIELDS = {f: FIELDS[f.split('_', 1)[0]] for f in FEATURES} +cdef int get_n_by_field(attr_t* results, const MorphAnalysisC* morph, attr_t field) nogil: + cdef int n_results = 0 + cdef int i + for i in range(morph.length): + if morph.fields[i] == field: + results[n_results] = morph.features[i] + n_results += 1 + return n_results diff --git a/spacy/parts_of_speech.pyx b/spacy/parts_of_speech.pyx index 
3925a6738..e71fb917f 100644 --- a/spacy/parts_of_speech.pyx +++ b/spacy/parts_of_speech.pyx @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - IDS = { "": NO_TAG, diff --git a/spacy/pipeline/__init__.py b/spacy/pipeline/__init__.py index 2f30fbbee..b2866bad2 100644 --- a/spacy/pipeline/__init__.py +++ b/spacy/pipeline/__init__.py @@ -1,10 +1,10 @@ -# coding: utf8 -from __future__ import unicode_literals - from .pipes import Tagger, DependencyParser, EntityRecognizer, EntityLinker from .pipes import TextCategorizer, Tensorizer, Pipe, Sentencizer +from .pipes import SentenceRecognizer +from .simple_ner import SimpleNER from .morphologizer import Morphologizer from .entityruler import EntityRuler +from .tok2vec import Tok2Vec from .hooks import SentenceSegmenter, SimilarityHook from .functions import merge_entities, merge_noun_chunks, merge_subtokens @@ -15,12 +15,15 @@ __all__ = [ "EntityLinker", "TextCategorizer", "Tensorizer", + "Tok2Vec", "Pipe", "Morphologizer", "EntityRuler", "Sentencizer", "SentenceSegmenter", + "SentenceRecognizer", "SimilarityHook", + "SimpleNER", "merge_entities", "merge_noun_chunks", "merge_subtokens", diff --git a/spacy/pipeline/defaults/__init__.py b/spacy/pipeline/defaults/__init__.py new file mode 100644 index 000000000..e17e2d3b4 --- /dev/null +++ b/spacy/pipeline/defaults/__init__.py @@ -0,0 +1,103 @@ +from pathlib import Path + +from ... import util + + +def default_nel_config(): + loc = Path(__file__).parent / "entity_linker_defaults.cfg" + return util.load_config(loc, create_objects=False) + + +def default_nel(): + loc = Path(__file__).parent / "entity_linker_defaults.cfg" + return util.load_config(loc, create_objects=True)["model"] + + +def default_morphologizer_config(): + loc = Path(__file__).parent / "morphologizer_defaults.cfg" + return util.load_config(loc, create_objects=False) + + +def default_morphologizer(): + loc = Path(__file__).parent / "morphologizer_defaults.cfg" + return util.load_config(loc, create_objects=True)["model"] + + +def default_parser_config(): + loc = Path(__file__).parent / "parser_defaults.cfg" + return util.load_config(loc, create_objects=False) + + +def default_parser(): + loc = Path(__file__).parent / "parser_defaults.cfg" + return util.load_config(loc, create_objects=True)["model"] + + +def default_ner_config(): + loc = Path(__file__).parent / "ner_defaults.cfg" + return util.load_config(loc, create_objects=False) + + +def default_ner(): + loc = Path(__file__).parent / "ner_defaults.cfg" + return util.load_config(loc, create_objects=True)["model"] + + +def default_senter_config(): + loc = Path(__file__).parent / "senter_defaults.cfg" + return util.load_config(loc, create_objects=False) + + +def default_senter(): + loc = Path(__file__).parent / "senter_defaults.cfg" + return util.load_config(loc, create_objects=True)["model"] + + +def default_tagger_config(): + loc = Path(__file__).parent / "tagger_defaults.cfg" + return util.load_config(loc, create_objects=False) + + +def default_tagger(): + loc = Path(__file__).parent / "tagger_defaults.cfg" + return util.load_config(loc, create_objects=True)["model"] + + +def default_tensorizer_config(): + loc = Path(__file__).parent / "tensorizer_defaults.cfg" + return util.load_config(loc, create_objects=False) + + +def default_tensorizer(): + loc = Path(__file__).parent / "tensorizer_defaults.cfg" + return util.load_config(loc, create_objects=True)["model"] + + +def default_textcat_config(): + loc = Path(__file__).parent / "textcat_defaults.cfg" + return 
util.load_config(loc, create_objects=False) + + +def default_textcat(): + loc = Path(__file__).parent / "textcat_defaults.cfg" + return util.load_config(loc, create_objects=True)["model"] + + +def default_tok2vec_config(): + loc = Path(__file__).parent / "tok2vec_defaults.cfg" + return util.load_config(loc, create_objects=False) + + +def default_tok2vec(): + loc = Path(__file__).parent / "tok2vec_defaults.cfg" + return util.load_config(loc, create_objects=True)["model"] + + +def default_simple_ner_config(): + loc = Path(__file__).parent / "simple_ner_defaults.cfg" + return util.load_config(loc, create_objects=False) + + +def default_simple_ner(): + loc = Path(__file__).parent / "simple_ner_defaults.cfg" + return util.load_config(loc, create_objects=True)["model"] diff --git a/spacy/pipeline/defaults/entity_linker_defaults.cfg b/spacy/pipeline/defaults/entity_linker_defaults.cfg new file mode 100644 index 000000000..6a591ec3e --- /dev/null +++ b/spacy/pipeline/defaults/entity_linker_defaults.cfg @@ -0,0 +1,12 @@ +[model] +@architectures = "spacy.EntityLinker.v1" + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 2 +embed_size = 300 +window_size = 1 +maxout_pieces = 3 +subword_features = true diff --git a/spacy/pipeline/defaults/morphologizer_defaults.cfg b/spacy/pipeline/defaults/morphologizer_defaults.cfg new file mode 100644 index 000000000..150eca507 --- /dev/null +++ b/spacy/pipeline/defaults/morphologizer_defaults.cfg @@ -0,0 +1,13 @@ +[model] +@architectures = "spacy.Tagger.v1" + +[model.tok2vec] +@architectures = "spacy.HashCharEmbedCNN.v1" +pretrained_vectors = null +width = 128 +depth = 4 +embed_size = 7000 +window_size = 1 +maxout_pieces = 3 +nM = 64 +nC = 8 diff --git a/spacy/pipeline/defaults/ner_defaults.cfg b/spacy/pipeline/defaults/ner_defaults.cfg new file mode 100644 index 000000000..db2c131f5 --- /dev/null +++ b/spacy/pipeline/defaults/ner_defaults.cfg @@ -0,0 +1,15 @@ +[model] +@architectures = "spacy.TransitionBasedParser.v1" +nr_feature_tokens = 6 +hidden_width = 64 +maxout_pieces = 2 + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 4 +embed_size = 2000 +window_size = 1 +maxout_pieces = 3 +subword_features = true diff --git a/spacy/pipeline/defaults/parser_defaults.cfg b/spacy/pipeline/defaults/parser_defaults.cfg new file mode 100644 index 000000000..9cbb6eadb --- /dev/null +++ b/spacy/pipeline/defaults/parser_defaults.cfg @@ -0,0 +1,15 @@ +[model] +@architectures = "spacy.TransitionBasedParser.v1" +nr_feature_tokens = 8 +hidden_width = 64 +maxout_pieces = 2 + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 4 +embed_size = 2000 +window_size = 1 +maxout_pieces = 3 +subword_features = true diff --git a/spacy/pipeline/defaults/senter_defaults.cfg b/spacy/pipeline/defaults/senter_defaults.cfg new file mode 100644 index 000000000..ffa2c6ce2 --- /dev/null +++ b/spacy/pipeline/defaults/senter_defaults.cfg @@ -0,0 +1,12 @@ +[model] +@architectures = "spacy.Tagger.v1" + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 12 +depth = 1 +embed_size = 2000 +window_size = 1 +maxout_pieces = 2 +subword_features = true diff --git a/spacy/pipeline/defaults/simple_ner_defaults.cfg b/spacy/pipeline/defaults/simple_ner_defaults.cfg new file mode 100644 index 000000000..4e3b640df --- /dev/null +++ b/spacy/pipeline/defaults/simple_ner_defaults.cfg @@ -0,0 +1,12 @@ +[model] 
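# --- Illustrative sketch (not part of the patch above) ------------------------
# How the per-component defaults added above are meant to be consumed, based
# only on the helpers defined in spacy/pipeline/defaults/__init__.py; whether
# an installed build exposes these exact names depends on the dev version.
from spacy.pipeline.defaults import default_ner, default_ner_config

cfg = default_ner_config()                 # plain config dict from ner_defaults.cfg
print(cfg["model"]["@architectures"])      # -> "spacy.TransitionBasedParser.v1"

model = default_ner()                      # same file, but with create_objects=True,
                                           # so the registered architecture is
                                           # resolved into a Thinc model for the pipe
# -------------------------------------------------------------------------------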
+@architectures = "spacy.BiluoTagger.v1" + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 128 +depth = 4 +embed_size = 7000 +window_size = 1 +maxout_pieces = 3 +subword_features = true diff --git a/spacy/pipeline/defaults/tagger_defaults.cfg b/spacy/pipeline/defaults/tagger_defaults.cfg new file mode 100644 index 000000000..5aea80a32 --- /dev/null +++ b/spacy/pipeline/defaults/tagger_defaults.cfg @@ -0,0 +1,12 @@ +[model] +@architectures = "spacy.Tagger.v1" + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 4 +embed_size = 2000 +window_size = 1 +maxout_pieces = 3 +subword_features = true diff --git a/spacy/pipeline/defaults/tensorizer_defaults.cfg b/spacy/pipeline/defaults/tensorizer_defaults.cfg new file mode 100644 index 000000000..81880a109 --- /dev/null +++ b/spacy/pipeline/defaults/tensorizer_defaults.cfg @@ -0,0 +1,4 @@ +[model] +@architectures = "spacy.Tensorizer.v1" +input_size=96 +output_size=300 diff --git a/spacy/pipeline/defaults/textcat_bow_defaults.cfg b/spacy/pipeline/defaults/textcat_bow_defaults.cfg new file mode 100644 index 000000000..84472ea10 --- /dev/null +++ b/spacy/pipeline/defaults/textcat_bow_defaults.cfg @@ -0,0 +1,5 @@ +[model] +@architectures = "spacy.TextCatBOW.v1" +exclusive_classes = false +ngram_size: 1 +no_output_layer: false diff --git a/spacy/pipeline/defaults/textcat_cnn_defaults.cfg b/spacy/pipeline/defaults/textcat_cnn_defaults.cfg new file mode 100644 index 000000000..cea1bfe54 --- /dev/null +++ b/spacy/pipeline/defaults/textcat_cnn_defaults.cfg @@ -0,0 +1,13 @@ +[model] +@architectures = "spacy.TextCatCNN.v1" +exclusive_classes = false + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 4 +embed_size = 2000 +window_size = 1 +maxout_pieces = 3 +subword_features = true diff --git a/spacy/pipeline/defaults/textcat_defaults.cfg b/spacy/pipeline/defaults/textcat_defaults.cfg new file mode 100644 index 000000000..9477b2995 --- /dev/null +++ b/spacy/pipeline/defaults/textcat_defaults.cfg @@ -0,0 +1,9 @@ +[model] +@architectures = "spacy.TextCat.v1" +exclusive_classes = false +pretrained_vectors = null +width = 64 +conv_depth = 2 +embed_size = 2000 +window_size = 1 +ngram_size = 1 diff --git a/spacy/pipeline/defaults/tok2vec_defaults.cfg b/spacy/pipeline/defaults/tok2vec_defaults.cfg new file mode 100644 index 000000000..9475d4aab --- /dev/null +++ b/spacy/pipeline/defaults/tok2vec_defaults.cfg @@ -0,0 +1,9 @@ +[model] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 4 +embed_size = 2000 +window_size = 1 +maxout_pieces = 3 +subword_features = true diff --git a/spacy/pipeline/entityruler.py b/spacy/pipeline/entityruler.py index 1786dda87..58160c2e9 100644 --- a/spacy/pipeline/entityruler.py +++ b/spacy/pipeline/entityruler.py @@ -1,12 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - -from collections import defaultdict, OrderedDict +from collections import defaultdict import srsly from ..language import component from ..errors import Errors -from ..compat import basestring_ from ..util import ensure_path, to_disk, from_disk from ..tokens import Doc, Span from ..matcher import Matcher, PhraseMatcher @@ -70,7 +66,7 @@ class EntityRuler(object): self.add_patterns(patterns) @classmethod - def from_nlp(cls, nlp, **cfg): + def from_nlp(cls, nlp, model=None, **cfg): return cls(nlp, **cfg) def __len__(self): @@ -204,14 +200,14 @@ class 
EntityRuler(object): ] except ValueError: subsequent_pipes = [] - with self.nlp.disable_pipes(subsequent_pipes): + with self.nlp.select_pipes(disable=subsequent_pipes): token_patterns = [] phrase_pattern_labels = [] phrase_pattern_texts = [] phrase_pattern_ids = [] for entry in patterns: - if isinstance(entry["pattern"], basestring_): + if isinstance(entry["pattern"], str): phrase_pattern_labels.append(entry["label"]) phrase_pattern_texts.append(entry["pattern"]) phrase_pattern_ids.append(entry.get("id")) @@ -272,8 +268,8 @@ class EntityRuler(object): RETURNS (str): The ent_label joined with configured `ent_id_sep` """ - if isinstance(ent_id, basestring_): - label = "{}{}{}".format(label, self.ent_id_sep, ent_id) + if isinstance(ent_id, str): + label = f"{label}{self.ent_id_sep}{ent_id}" return label def from_bytes(self, patterns_bytes, **kwargs): @@ -307,15 +303,12 @@ class EntityRuler(object): DOCS: https://spacy.io/api/entityruler#to_bytes """ - - serial = OrderedDict( - ( - ("overwrite", self.overwrite), - ("ent_id_sep", self.ent_id_sep), - ("phrase_matcher_attr", self.phrase_matcher_attr), - ("patterns", self.patterns), - ) - ) + serial = { + "overwrite": self.overwrite, + "ent_id_sep": self.ent_id_sep, + "phrase_matcher_attr": self.phrase_matcher_attr, + "patterns": self.patterns, + } return srsly.msgpack_dumps(serial) def from_disk(self, path, **kwargs): diff --git a/spacy/pipeline/functions.py b/spacy/pipeline/functions.py index 69e638da2..6e9d4197c 100644 --- a/spacy/pipeline/functions.py +++ b/spacy/pipeline/functions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..language import component from ..matcher import Matcher from ..util import filter_spans diff --git a/spacy/pipeline/hooks.py b/spacy/pipeline/hooks.py index b61a34c0e..351323ae9 100644 --- a/spacy/pipeline/hooks.py +++ b/spacy/pipeline/hooks.py @@ -1,12 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - -from thinc.t2v import Pooling, max_pool, mean_pool -from thinc.neural._classes.difference import Siamese, CauchySimilarity +from thinc.api import concatenate, reduce_max, reduce_mean, siamese, CauchySimilarity from .pipes import Pipe from ..language import component -from .._ml import link_vectors_to_models +from ..util import link_vectors_to_models @component("sentencizer_hook", assigns=["doc.user_hooks"]) @@ -66,7 +62,9 @@ class SimilarityHook(Pipe): @classmethod def Model(cls, length): - return Siamese(Pooling(max_pool, mean_pool), CauchySimilarity(length)) + return siamese( + concatenate(reduce_max(), reduce_mean()), CauchySimilarity(length * 2) + ) def __call__(self, doc): """Install similarity hook""" @@ -78,12 +76,10 @@ class SimilarityHook(Pipe): yield self(doc) def predict(self, doc1, doc2): - self.require_model() return self.model.predict([(doc1, doc2)]) def update(self, doc1_doc2, golds, sgd=None, drop=0.0): - self.require_model() - sims, bp_sims = self.model.begin_update(doc1_doc2, drop=drop) + sims, bp_sims = self.model.begin_update(doc1_doc2) def begin_training(self, _=tuple(), pipeline=None, sgd=None, **kwargs): """Allocate model, using width from tensorizer in pipeline. @@ -92,7 +88,7 @@ class SimilarityHook(Pipe): pipeline (list): The pipeline the model is part of. 
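# --- Illustrative sketch (not part of the patch above) ------------------------
# EntityRuler usage after the cleanup above: a plain `str` pattern becomes a
# phrase pattern, a list of dicts stays a token pattern, and an optional "id"
# ends up in ent.ent_id_ (joined with ent_id_sep under the hood). The blank
# pipeline and the add_pipe call reflect the API of this development branch
# and may differ in other versions.
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.blank("en")
ruler = EntityRuler(nlp)
ruler.add_patterns([
    {"label": "ORG", "pattern": "Explosion AI", "id": "explosion"},  # str -> PhraseMatcher
    {"label": "GPE", "pattern": [{"LOWER": "berlin"}]},              # list -> Matcher
])
nlp.add_pipe(ruler)
doc = nlp("Explosion AI is based in Berlin")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents])
# Note that add_patterns() itself now runs under nlp.select_pipes(disable=...)
# so that components after the ruler are not triggered while patterns are added.
# -------------------------------------------------------------------------------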
""" if self.model is True: - self.model = self.Model(pipeline[0].model.nO) + self.model = self.Model(pipeline[0].model.get_dim("nO")) link_vectors_to_models(self.vocab) if sgd is None: sgd = self.create_optimizer() diff --git a/spacy/pipeline/morphologizer.pyx b/spacy/pipeline/morphologizer.pyx index 72e31f120..c45a72b25 100644 --- a/spacy/pipeline/morphologizer.pyx +++ b/spacy/pipeline/morphologizer.pyx @@ -1,165 +1,170 @@ -from __future__ import unicode_literals -from collections import OrderedDict, defaultdict - -import numpy +# cython: infer_types=True, profile=True cimport numpy as np -from thinc.api import chain -from thinc.neural.util import to_categorical, copy_array, get_array_module -from .. import util -from .pipes import Pipe -from ..language import component -from .._ml import Tok2Vec, build_morphologizer_model -from .._ml import link_vectors_to_models, zero_init, flatten -from .._ml import create_default_optimizer -from ..errors import Errors, TempErrors -from ..compat import basestring_ +import numpy +import srsly +from thinc.api import to_categorical + from ..tokens.doc cimport Doc from ..vocab cimport Vocab from ..morphology cimport Morphology +from ..parts_of_speech import IDS as POS_IDS +from ..symbols import POS + +from .. import util +from ..language import component +from ..util import link_vectors_to_models, create_default_optimizer +from ..errors import Errors, TempErrors +from .pipes import Tagger, _load_cfg +from .. import util +from .defaults import default_morphologizer -@component("morphologizer", assigns=["token.morph", "token.pos"]) -class Morphologizer(Pipe): +@component("morphologizer", assigns=["token.morph", "token.pos"], default_model=default_morphologizer) +class Morphologizer(Tagger): - @classmethod - def Model(cls, **cfg): - if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'): - raise ValueError(TempErrors.T008) - class_map = Morphology.create_class_map() - return build_morphologizer_model(class_map.field_sizes, **cfg) - - def __init__(self, vocab, model=True, **cfg): + def __init__(self, vocab, model, **cfg): self.vocab = vocab self.model = model - self.cfg = OrderedDict(sorted(cfg.items())) - self.cfg.setdefault('cnn_maxout_pieces', 2) - self._class_map = self.vocab.morphology.create_class_map() + self._rehearsal_model = None + self.cfg = dict(sorted(cfg.items())) + self.cfg.setdefault("labels", {}) + self.cfg.setdefault("morph_pos", {}) @property def labels(self): - return self.vocab.morphology.tag_names + return tuple(self.cfg["labels"].keys()) - @property - def tok2vec(self): - if self.model in (None, True, False): - return None - else: - return chain(self.model.tok2vec, flatten) + def add_label(self, label): + if not isinstance(label, str): + raise ValueError(Errors.E187) + if label in self.labels: + return 0 + morph = Morphology.feats_to_dict(label) + norm_morph_pos = self.vocab.strings[self.vocab.morphology.add(morph)] + pos = morph.get("POS", "") + if norm_morph_pos not in self.cfg["labels"]: + self.cfg["labels"][norm_morph_pos] = norm_morph_pos + self.cfg["morph_pos"][norm_morph_pos] = POS_IDS[pos] + return 1 - def __call__(self, doc): - features, tokvecs = self.predict([doc]) - self.set_annotations([doc], features, tensors=tokvecs) - return doc + def begin_training(self, get_examples=lambda: [], pipeline=None, sgd=None, + **kwargs): + for example in get_examples(): + for i, morph in enumerate(example.token_annotation.morphs): + pos = example.token_annotation.get_pos(i) + morph = Morphology.feats_to_dict(morph) + norm_morph = 
self.vocab.strings[self.vocab.morphology.add(morph)] + if pos: + morph["POS"] = pos + norm_morph_pos = self.vocab.strings[self.vocab.morphology.add(morph)] + if norm_morph_pos not in self.cfg["labels"]: + self.cfg["labels"][norm_morph_pos] = norm_morph + self.cfg["morph_pos"][norm_morph_pos] = POS_IDS[pos] + self.set_output(len(self.labels)) + self.model.initialize() + link_vectors_to_models(self.vocab) + if sgd is None: + sgd = self.create_optimizer() + return sgd - def pipe(self, stream, batch_size=128, n_threads=-1): - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) - features, tokvecs = self.predict(docs) - self.set_annotations(docs, features, tensors=tokvecs) - yield from docs - - def predict(self, docs): - if not any(len(doc) for doc in docs): - # Handle case where there are no tokens in any docs. - n_labels = self.model.nO - guesses = [self.model.ops.allocate((0, n_labels)) for doc in docs] - tokvecs = self.model.ops.allocate((0, self.model.tok2vec.nO)) - return guesses, tokvecs - tokvecs = self.model.tok2vec(docs) - scores = self.model.softmax(tokvecs) - return scores, tokvecs - - def set_annotations(self, docs, batch_scores, tensors=None): + def set_annotations(self, docs, batch_tag_ids): if isinstance(docs, Doc): docs = [docs] cdef Doc doc cdef Vocab vocab = self.vocab - offsets = [self._class_map.get_field_offset(field) - for field in self._class_map.fields] for i, doc in enumerate(docs): - doc_scores = batch_scores[i] - doc_guesses = scores_to_guesses(doc_scores, self.model.softmax.out_sizes) - # Convert the neuron indices into feature IDs. - doc_feat_ids = numpy.zeros((len(doc), len(self._class_map.fields)), dtype='i') - for j in range(len(doc)): - for k, offset in enumerate(offsets): - if doc_guesses[j, k] == 0: - doc_feat_ids[j, k] = 0 - else: - doc_feat_ids[j, k] = offset + doc_guesses[j, k] - # Get the set of feature names. - feats = {self._class_map.col2info[f][2] for f in doc_feat_ids[j]} - if "NIL" in feats: - feats.remove("NIL") - # Now add the analysis, and set the hash. - doc.c[j].morph = self.vocab.morphology.add(feats) - if doc[j].morph.pos != 0: - doc.c[j].pos = doc[j].morph.pos + doc_tag_ids = batch_tag_ids[i] + if hasattr(doc_tag_ids, "get"): + doc_tag_ids = doc_tag_ids.get() + for j, tag_id in enumerate(doc_tag_ids): + morph = self.labels[tag_id] + doc.c[j].morph = self.vocab.morphology.add(self.cfg["labels"][morph]) + doc.c[j].pos = self.cfg["morph_pos"][morph] - def update(self, docs, golds, drop=0., sgd=None, losses=None): - if losses is not None and self.name not in losses: - losses[self.name] = 0. + doc.is_morphed = True - tag_scores, bp_tag_scores = self.model.begin_update(docs, drop=drop) - loss, d_tag_scores = self.get_loss(docs, golds, tag_scores) - bp_tag_scores(d_tag_scores, sgd=sgd) - - if losses is not None: - losses[self.name] += loss - - def get_loss(self, docs, golds, scores): - guesses = [] - for doc_scores in scores: - guesses.append(scores_to_guesses(doc_scores, self.model.softmax.out_sizes)) - guesses = self.model.ops.xp.vstack(guesses) - scores = self.model.ops.xp.vstack(scores) - if not isinstance(scores, numpy.ndarray): - scores = scores.get() - if not isinstance(guesses, numpy.ndarray): - guesses = guesses.get() + def get_loss(self, examples, scores): + scores = self.model.ops.flatten(scores) + tag_index = {tag: i for i, tag in enumerate(self.labels)} cdef int idx = 0 - # Do this on CPU, as we can't vectorize easily. 
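# --- Illustrative sketch (not part of the patch above) ------------------------
# How the new Morphologizer builds its labels in begin_training above: a UD
# FEATS string is parsed with Morphology.feats_to_dict, the coarse POS is
# folded into the same dict, and vocab.morphology.add()/vocab.strings turn it
# into the normalised key used in cfg["labels"] and cfg["morph_pos"]. The FEATS
# string here is just an example.
from spacy.morphology import Morphology

feats = Morphology.feats_to_dict("Case=Nom|Number=Sing")
print(feats)            # -> {"Case": "Nom", "Number": "Sing"}
feats["POS"] = "NOUN"   # the coarse tag rides along in the same feature dict
# In the pipeline: norm = vocab.strings[vocab.morphology.add(feats)], then
# cfg["labels"][norm] stores the label and cfg["morph_pos"][norm] the POS id.
# -------------------------------------------------------------------------------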
- target = numpy.zeros(scores.shape, dtype='f') - field_sizes = self.model.softmax.out_sizes - for doc, gold in zip(docs, golds): - for t, features in enumerate(gold.morphology): - if features is None: - target[idx] = scores[idx] + correct = numpy.zeros((scores.shape[0],), dtype="i") + guesses = scores.argmax(axis=1) + known_labels = numpy.ones((scores.shape[0], 1), dtype="f") + for ex in examples: + gold = ex.gold + for i in range(len(gold.morphs)): + pos = gold.pos[i] if i < len(gold.pos) else "" + morph = gold.morphs[i] + feats = Morphology.feats_to_dict(morph) + if pos: + feats["POS"] = pos + if len(feats) > 0: + morph = self.vocab.strings[self.vocab.morphology.add(feats)] + if morph == "": + morph = Morphology.EMPTY_MORPH + if morph is None: + correct[idx] = guesses[idx] + elif morph in tag_index: + correct[idx] = tag_index[morph] else: - gold_fields = {} - for feature in features: - field = self._class_map.feat2field[feature] - gold_fields[field] = self._class_map.feat2offset[feature] - for field in self._class_map.fields: - field_id = self._class_map.field2id[field] - col_offset = self._class_map.field2col[field] - if field_id in gold_fields: - target[idx, col_offset + gold_fields[field_id]] = 1. - else: - target[idx, col_offset] = 1. - #print(doc[t]) - #for col, info in enumerate(self._class_map.col2info): - # print(col, info, scores[idx, col], target[idx, col]) + correct[idx] = 0 + known_labels[idx] = 0. idx += 1 - target = self.model.ops.asarray(target, dtype='f') - scores = self.model.ops.asarray(scores, dtype='f') - d_scores = scores - target + correct = self.model.ops.xp.array(correct, dtype="i") + d_scores = scores - to_categorical(correct, n_classes=scores.shape[1]) + d_scores *= self.model.ops.asarray(known_labels) loss = (d_scores**2).sum() + docs = [ex.doc for ex in examples] d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs]) return float(loss), d_scores - def use_params(self, params): - with self.model.use_params(params): - yield + def to_bytes(self, exclude=tuple(), **kwargs): + serialize = {} + serialize["model"] = self.model.to_bytes + serialize["vocab"] = self.vocab.to_bytes + serialize["cfg"] = lambda: srsly.json_dumps(self.cfg) + exclude = util.get_serialization_exclude(serialize, exclude, kwargs) + return util.to_bytes(serialize, exclude) -def scores_to_guesses(scores, out_sizes): - xp = get_array_module(scores) - guesses = xp.zeros((scores.shape[0], len(out_sizes)), dtype='i') - offset = 0 - for i, size in enumerate(out_sizes): - slice_ = scores[:, offset : offset + size] - col_guesses = slice_.argmax(axis=1) - guesses[:, i] = col_guesses - offset += size - return guesses + def from_bytes(self, bytes_data, exclude=tuple(), **kwargs): + def load_model(b): + try: + self.model.from_bytes(b) + except AttributeError: + raise ValueError(Errors.E149) + + deserialize = { + "vocab": lambda b: self.vocab.from_bytes(b), + "cfg": lambda b: self.cfg.update(srsly.json_loads(b)), + "model": lambda b: load_model(b), + } + exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) + util.from_bytes(bytes_data, deserialize, exclude) + return self + + def to_disk(self, path, exclude=tuple(), **kwargs): + serialize = { + "vocab": lambda p: self.vocab.to_disk(p), + "model": lambda p: p.open("wb").write(self.model.to_bytes()), + "cfg": lambda p: srsly.write_json(p, self.cfg), + } + exclude = util.get_serialization_exclude(serialize, exclude, kwargs) + util.to_disk(path, serialize, exclude) + + def from_disk(self, path, exclude=tuple(), **kwargs): + def 
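# --- Illustrative sketch (not part of the patch above) ------------------------
# The to_bytes/from_bytes/to_disk/from_disk methods above all follow the same
# pattern: a plain dict of name -> getter (or setter) callables handed to
# spacy.util. A minimal component that serialises only its cfg dict, to show
# the shape of that pattern (TinyComponent is hypothetical):
import srsly
from spacy import util


class TinyComponent:
    def __init__(self, **cfg):
        self.cfg = dict(cfg)

    def to_bytes(self, exclude=tuple()):
        serialize = {"cfg": lambda: srsly.json_dumps(self.cfg)}
        return util.to_bytes(serialize, exclude)

    def from_bytes(self, bytes_data, exclude=tuple()):
        deserialize = {"cfg": lambda b: self.cfg.update(srsly.json_loads(b))}
        util.from_bytes(bytes_data, deserialize, exclude)
        return self


comp = TinyComponent(labels=["Case=Nom"], overwrite=False)
restored = TinyComponent().from_bytes(comp.to_bytes())
assert restored.cfg == comp.cfg
# -------------------------------------------------------------------------------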
load_model(p): + with p.open("rb") as file_: + try: + self.model.from_bytes(file_.read()) + except AttributeError: + raise ValueError(Errors.E149) + + deserialize = { + "vocab": lambda p: self.vocab.from_disk(p), + "cfg": lambda p: self.cfg.update(_load_cfg(p)), + "model": load_model, + } + exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) + util.from_disk(path, deserialize, exclude) + return self diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx index 982c058b4..6804a98c3 100644 --- a/spacy/pipeline/pipes.pyx +++ b/spacy/pipeline/pipes.pyx @@ -1,20 +1,13 @@ -# cython: infer_types=True -# cython: profile=True -# coding: utf8 -from __future__ import unicode_literals - +# cython: infer_types=True, profile=True import numpy import srsly import random -import warnings -from collections import OrderedDict -from thinc.api import chain -from thinc.v2v import Affine, Maxout, Softmax -from thinc.misc import LayerNorm -from thinc.neural.util import to_categorical -from thinc.neural.util import get_array_module +from ast import literal_eval + +from thinc.api import CosineDistance, to_categorical, get_array_module +from thinc.api import set_dropout_rate, SequenceCategoricalCrossentropy +import warnings -from ..compat import basestring_ from ..tokens.doc cimport Doc from ..syntax.nn_parser cimport Parser from ..syntax.ner cimport BiluoPushDown @@ -22,17 +15,16 @@ from ..syntax.arc_eager cimport ArcEager from ..morphology cimport Morphology from ..vocab cimport Vocab +from .defaults import default_tagger, default_parser, default_ner, default_textcat +from .defaults import default_nel, default_senter, default_tensorizer from .functions import merge_subtokens from ..language import Language, component from ..syntax import nonproj +from ..gold import Example from ..attrs import POS, ID +from ..util import link_vectors_to_models, create_default_optimizer from ..parts_of_speech import X from ..kb import KnowledgeBase -from .._ml import Tok2Vec, build_tagger_model, cosine, get_cossim_loss -from .._ml import build_text_classifier, build_simple_cnn_text_classifier -from .._ml import build_bow_text_classifier, build_nel_encoder -from .._ml import link_vectors_to_models, zero_init, flatten -from .._ml import masked_language_model, create_default_optimizer, get_cossim_loss from ..errors import Errors, TempErrors, Warnings from .. import util @@ -53,80 +45,86 @@ class Pipe(object): name = None @classmethod - def Model(cls, *shape, **kwargs): - """Initialize a model for the pipe.""" - raise NotImplementedError + def from_nlp(cls, nlp, model, **cfg): + return cls(nlp.vocab, model, **cfg) - @classmethod - def from_nlp(cls, nlp, **cfg): - return cls(nlp.vocab, **cfg) + def _get_doc(self, example): + """ Use this method if the `example` can be both a Doc or an Example """ + if isinstance(example, Doc): + return example + return example.doc - def __init__(self, vocab, model=True, **cfg): + def __init__(self, vocab, model, **cfg): """Create a new pipe instance.""" raise NotImplementedError - def __call__(self, doc): + def __call__(self, example): """Apply the pipe to one document. The document is modified in-place, and returned. Both __call__ and pipe should delegate to the `predict()` and `set_annotations()` methods. 
""" - self.require_model() + doc = self._get_doc(example) predictions = self.predict([doc]) if isinstance(predictions, tuple) and len(predictions) == 2: scores, tensors = predictions self.set_annotations([doc], scores, tensors=tensors) else: self.set_annotations([doc], predictions) + if isinstance(example, Example): + example.doc = doc + return example return doc - def require_model(self): - """Raise an error if the component's model is not initialized.""" - if getattr(self, "model", None) in (None, True, False): - raise ValueError(Errors.E109.format(name=self.name)) - - def pipe(self, stream, batch_size=128, n_threads=-1): + def pipe(self, stream, batch_size=128, n_threads=-1, as_example=False): """Apply the pipe to a stream of documents. Both __call__ and pipe should delegate to the `predict()` and `set_annotations()` methods. """ - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) + for examples in util.minibatch(stream, size=batch_size): + docs = [self._get_doc(ex) for ex in examples] predictions = self.predict(docs) if isinstance(predictions, tuple) and len(tuple) == 2: scores, tensors = predictions self.set_annotations(docs, scores, tensors=tensors) else: self.set_annotations(docs, predictions) - yield from docs + + if as_example: + for ex, doc in zip(examples, docs): + ex.doc = doc + yield ex + else: + yield from docs def predict(self, docs): """Apply the pipeline's model to a batch of docs, without modifying them. """ - self.require_model() raise NotImplementedError def set_annotations(self, docs, scores, tensors=None): """Modify a batch of documents, using pre-computed scores.""" raise NotImplementedError - def update(self, docs, golds, drop=0.0, sgd=None, losses=None): + def update(self, examples, set_annotations=False, drop=0.0, sgd=None, losses=None): """Learn from a batch of documents and gold-standard information, updating the pipe's model. Delegates to predict() and get_loss(). """ + if set_annotations: + docs = (self._get_doc(ex) for ex in examples) + docs = list(self.pipe(docs)) + + def rehearse(self, examples, sgd=None, losses=None, **config): pass - def rehearse(self, docs, sgd=None, losses=None, **config): - pass - - def get_loss(self, docs, golds, scores): + def get_loss(self, examples, scores): """Find the loss and gradient of loss for the batch of - documents and their predicted scores.""" + examples (with embedded docs) and their predicted scores.""" raise NotImplementedError def add_label(self, label): @@ -139,21 +137,43 @@ class Pipe(object): raise NotImplementedError def create_optimizer(self): - return create_default_optimizer(self.model.ops, **self.cfg.get("optimizer", {})) + return create_default_optimizer() def begin_training( - self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs + self, get_examples=lambda: [], pipeline=None, sgd=None, **kwargs ): """Initialize the pipe for training, using data exampes if available. If no model has been initialized yet, the model is added.""" - if self.model is True: - self.model = self.Model(**self.cfg) + self.model.initialize() if hasattr(self, "vocab"): link_vectors_to_models(self.vocab) if sgd is None: sgd = self.create_optimizer() return sgd + def set_output(self, nO): + if self.model.has_dim("nO") is not False: + self.model.set_dim("nO", nO) + if self.model.has_ref("output_layer"): + self.model.get_ref("output_layer").set_dim("nO", nO) + + def get_gradients(self): + """Get non-zero gradients of the model's parameters, as a dictionary + keyed by the parameter ID. 
The values are (weights, gradients) tuples. + """ + gradients = {} + queue = [self.model] + seen = set() + for node in queue: + if node.id in seen: + continue + seen.add(node.id) + if hasattr(node, "_mem") and node._mem.gradient.any(): + gradients[node.id] = [node._mem.weights, node._mem.gradient] + if hasattr(node, "_layers"): + queue.extend(node._layers) + return gradients + def use_params(self, params): """Modify the pipe's model, to use the given parameter values.""" with self.model.use_params(params): @@ -165,10 +185,9 @@ class Pipe(object): exclude (list): String names of serialization fields to exclude. RETURNS (bytes): The serialized object. """ - serialize = OrderedDict() + serialize = {} serialize["cfg"] = lambda: srsly.json_dumps(self.cfg) - if self.model not in (True, False, None): - serialize["model"] = self.model.to_bytes + serialize["model"] = self.model.to_bytes if hasattr(self, "vocab"): serialize["vocab"] = self.vocab.to_bytes exclude = util.get_serialization_exclude(serialize, exclude, kwargs) @@ -178,20 +197,15 @@ class Pipe(object): """Load the pipe from a bytestring.""" def load_model(b): - # TODO: Remove this once we don't have to handle previous models - if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg: - self.cfg["pretrained_vectors"] = self.vocab.vectors.name - if self.model is True: - self.model = self.Model(**self.cfg) try: self.model.from_bytes(b) except AttributeError: raise ValueError(Errors.E149) - deserialize = OrderedDict() - deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b)) + deserialize = {} if hasattr(self, "vocab"): deserialize["vocab"] = lambda b: self.vocab.from_bytes(b) + deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b)) deserialize["model"] = load_model exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) util.from_bytes(bytes_data, deserialize, exclude) @@ -199,11 +213,10 @@ class Pipe(object): def to_disk(self, path, exclude=tuple(), **kwargs): """Serialize the pipe to disk.""" - serialize = OrderedDict() + serialize = {} serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg) serialize["vocab"] = lambda p: self.vocab.to_disk(p) - if self.model not in (None, True, False): - serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes()) + serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes()) exclude = util.get_serialization_exclude(serialize, exclude, kwargs) util.to_disk(path, serialize, exclude) @@ -211,84 +224,70 @@ class Pipe(object): """Load the pipe from disk.""" def load_model(p): - # TODO: Remove this once we don't have to handle previous models - if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg: - self.cfg["pretrained_vectors"] = self.vocab.vectors.name - if self.model is True: - self.model = self.Model(**self.cfg) try: self.model.from_bytes(p.open("rb").read()) except AttributeError: raise ValueError(Errors.E149) - deserialize = OrderedDict() - deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p)) + deserialize = {} deserialize["vocab"] = lambda p: self.vocab.from_disk(p) + deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p)) deserialize["model"] = load_model exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) util.from_disk(path, deserialize, exclude) return self -@component("tensorizer", assigns=["doc.tensor"]) +@component("tensorizer", assigns=["doc.tensor"], default_model=default_tensorizer) class Tensorizer(Pipe): """Pre-train position-sensitive vectors for 
tokens.""" - @classmethod - def Model(cls, output_size=300, **cfg): - """Create a new statistical model for the class. - - width (int): Output size of the model. - embed_size (int): Number of vectors in the embedding table. - **cfg: Config parameters. - RETURNS (Model): A `thinc.neural.Model` or similar instance. - """ - input_size = util.env_opt("token_vector_width", cfg.get("input_size", 96)) - return zero_init(Affine(output_size, input_size, drop_factor=0.0)) - - def __init__(self, vocab, model=True, **cfg): + def __init__(self, vocab, model, **cfg): """Construct a new statistical model. Weights are not allocated on initialisation. vocab (Vocab): A `Vocab` instance. The model must share the same `Vocab` instance with the `Doc` objects it will process. - model (Model): A `Model` instance or `True` to allocate one later. **cfg: Config parameters. - - EXAMPLE: - >>> from spacy.pipeline import TokenVectorEncoder - >>> tok2vec = TokenVectorEncoder(nlp.vocab) - >>> tok2vec.model = tok2vec.Model(128, 5000) """ self.vocab = vocab self.model = model self.input_models = [] self.cfg = dict(cfg) - self.cfg.setdefault("cnn_maxout_pieces", 3) - def __call__(self, doc): + def __call__(self, example): """Add context-sensitive vectors to a `Doc`, e.g. from a CNN or LSTM model. Vectors are set to the `Doc.tensor` attribute. docs (Doc or iterable): One or more documents to add vectors to. RETURNS (dict or None): Intermediate computations. """ + doc = self._get_doc(example) tokvecses = self.predict([doc]) self.set_annotations([doc], tokvecses) + if isinstance(example, Example): + example.doc = doc + return example return doc - def pipe(self, stream, batch_size=128, n_threads=-1): + def pipe(self, stream, batch_size=128, n_threads=-1, as_example=False): """Process `Doc` objects as a stream. - stream (iterator): A sequence of `Doc` objects to process. - batch_size (int): Number of `Doc` objects to group. - YIELDS (iterator): A sequence of `Doc` objects, in order of input. + stream (iterator): A sequence of `Doc` or `Example` objects to process. + batch_size (int): Number of `Doc` or `Example` objects to group. + YIELDS (iterator): A sequence of `Doc` or `Example` objects, in order of input. """ - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) + for examples in util.minibatch(stream, size=batch_size): + docs = [self._get_doc(ex) for ex in examples] tensors = self.predict(docs) self.set_annotations(docs, tensors) - yield from docs + + if as_example: + for ex, doc in zip(examples, docs): + ex.doc = doc + yield ex + else: + yield from docs def predict(self, docs): """Return a single tensor for a batch of documents. @@ -296,7 +295,6 @@ class Tensorizer(Pipe): docs (iterable): A sequence of `Doc` objects. RETURNS (object): Vector representations for each token in the docs. """ - self.require_model() inputs = self.model.ops.flatten([doc.tensor for doc in docs]) outputs = self.model(inputs) return self.model.ops.unflatten(outputs, [len(d) for d in docs]) @@ -312,7 +310,7 @@ class Tensorizer(Pipe): raise ValueError(Errors.E076.format(rows=tensor.shape[0], words=len(doc))) doc.tensor = tensor - def update(self, docs, golds, state=None, drop=0.0, sgd=None, losses=None): + def update(self, examples, state=None, drop=0.0, set_annotations=False, sgd=None, losses=None): """Update the model. docs (iterable): A batch of `Doc` objects. @@ -321,109 +319,121 @@ class Tensorizer(Pipe): sgd (callable): An optimizer. RETURNS (dict): Results from the update. 
""" - self.require_model() - if isinstance(docs, Doc): - docs = [docs] + examples = Example.to_example_objects(examples) inputs = [] bp_inputs = [] + set_dropout_rate(self.model, drop) for tok2vec in self.input_models: - tensor, bp_tensor = tok2vec.begin_update(docs, drop=drop) + set_dropout_rate(tok2vec, drop) + tensor, bp_tensor = tok2vec.begin_update([ex.doc for ex in examples]) inputs.append(tensor) bp_inputs.append(bp_tensor) inputs = self.model.ops.xp.hstack(inputs) - scores, bp_scores = self.model.begin_update(inputs, drop=drop) - loss, d_scores = self.get_loss(docs, golds, scores) + scores, bp_scores = self.model.begin_update(inputs) + loss, d_scores = self.get_loss(examples, scores) d_inputs = bp_scores(d_scores, sgd=sgd) d_inputs = self.model.ops.xp.split(d_inputs, len(self.input_models), axis=1) for d_input, bp_input in zip(d_inputs, bp_inputs): - bp_input(d_input, sgd=sgd) + bp_input(d_input) + if sgd is not None: + for tok2vec in self.input_models: + tok2vec.finish_update(sgd) + self.model.finish_update(sgd) if losses is not None: losses.setdefault(self.name, 0.0) losses[self.name] += loss return loss - def get_loss(self, docs, golds, prediction): - ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs]) + def get_loss(self, examples, prediction): + examples = Example.to_example_objects(examples) + ids = self.model.ops.flatten([ex.doc.to_array(ID).ravel() for ex in examples]) target = self.vocab.vectors.data[ids] d_scores = (prediction - target) / prediction.shape[0] loss = (d_scores ** 2).sum() return loss, d_scores - def begin_training(self, gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): + def begin_training(self, get_examples=lambda: [], pipeline=None, sgd=None, **kwargs): """Allocate models, pre-process training data and acquire an optimizer. - gold_tuples (iterable): Gold-standard training data. + get_examples (iterable): Gold-standard training data. pipeline (list): The pipeline the model is part of. """ if pipeline is not None: for name, model in pipeline: - if getattr(model, "tok2vec", None): - self.input_models.append(model.tok2vec) - if self.model is True: - self.model = self.Model(**self.cfg) + if model.has_ref("tok2vec"): + self.input_models.append(model.get_ref("tok2vec")) + self.model.initialize() link_vectors_to_models(self.vocab) if sgd is None: sgd = self.create_optimizer() return sgd -@component("tagger", assigns=["token.tag", "token.pos", "token.lemma"]) +@component("tagger", assigns=["token.tag", "token.pos", "token.lemma"], default_model=default_tagger) class Tagger(Pipe): """Pipeline component for part-of-speech tagging. 
DOCS: https://spacy.io/api/tagger """ - def __init__(self, vocab, model=True, **cfg): + def __init__(self, vocab, model, **cfg): self.vocab = vocab self.model = model self._rehearsal_model = None - self.cfg = OrderedDict(sorted(cfg.items())) - self.cfg.setdefault("cnn_maxout_pieces", 2) + self.cfg = dict(sorted(cfg.items())) @property def labels(self): return tuple(self.vocab.morphology.tag_names) - @property - def tok2vec(self): - if self.model in (None, True, False): - return None - else: - return chain(self.model.tok2vec, flatten) - - def __call__(self, doc): - tags, tokvecs = self.predict([doc]) - self.set_annotations([doc], tags, tensors=tokvecs) + def __call__(self, example): + doc = self._get_doc(example) + tags = self.predict([doc]) + self.set_annotations([doc], tags) + if isinstance(example, Example): + example.doc = doc + return example return doc - def pipe(self, stream, batch_size=128, n_threads=-1): - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) - tag_ids, tokvecs = self.predict(docs) - self.set_annotations(docs, tag_ids, tensors=tokvecs) - yield from docs + def pipe(self, stream, batch_size=128, n_threads=-1, as_example=False): + for examples in util.minibatch(stream, size=batch_size): + docs = [self._get_doc(ex) for ex in examples] + tag_ids = self.predict(docs) + assert len(docs) == len(examples) + assert len(tag_ids) == len(examples) + self.set_annotations(docs, tag_ids) + + if as_example: + for ex, doc in zip(examples, docs): + ex.doc = doc + yield ex + else: + yield from docs def predict(self, docs): - self.require_model() if not any(len(doc) for doc in docs): # Handle cases where there are no tokens in any docs. n_labels = len(self.labels) - guesses = [self.model.ops.allocate((0, n_labels)) for doc in docs] - tokvecs = self.model.ops.allocate((0, self.model.tok2vec.nO)) - return guesses, tokvecs - tokvecs = self.model.tok2vec(docs) - scores = self.model.softmax(tokvecs) + guesses = [self.model.ops.alloc((0, n_labels)) for doc in docs] + assert len(guesses) == len(docs) + return guesses + scores = self.model.predict(docs) + assert len(scores) == len(docs), (len(scores), len(docs)) + guesses = self._scores2guesses(scores) + assert len(guesses) == len(docs) + return guesses + + def _scores2guesses(self, scores): guesses = [] for doc_scores in scores: doc_guesses = doc_scores.argmax(axis=1) if not isinstance(doc_guesses, numpy.ndarray): doc_guesses = doc_guesses.get() guesses.append(doc_guesses) - return guesses, tokvecs + return guesses - def set_annotations(self, docs, batch_tag_ids, tensors=None): + def set_annotations(self, docs, batch_tag_ids): if isinstance(docs, Doc): docs = [docs] cdef Doc doc @@ -446,114 +456,95 @@ class Tagger(Pipe): else: doc.c[j].tag = self.vocab.strings[self.labels[tag_id]] idx += 1 - if tensors is not None and len(tensors): - if isinstance(doc.tensor, numpy.ndarray) \ - and not isinstance(tensors[i], numpy.ndarray): - doc.extend_tensor(tensors[i].get()) - else: - doc.extend_tensor(tensors[i]) doc.is_tagged = True - def update(self, docs, golds, drop=0., sgd=None, losses=None): - self.require_model() + def update(self, examples, drop=0., sgd=None, losses=None, set_annotations=False): + examples = Example.to_example_objects(examples) if losses is not None and self.name not in losses: losses[self.name] = 0. - if not any(len(doc) for doc in docs): + if not any(len(ex.doc) if ex.doc else 0 for ex in examples): # Handle cases where there are no tokens in any docs. 
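# --- Illustrative sketch (not part of the patch above) ------------------------
# Tagger.predict above now returns per-doc tag ids instead of a
# (guesses, tokvecs) pair; _scores2guesses is a per-doc argmax, shown here in
# NumPy, including the empty-doc case the method has to handle.
import numpy

scores_per_doc = [
    numpy.array([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]], dtype="f"),  # doc with 2 tokens
    numpy.zeros((0, 3), dtype="f"),                              # doc with no tokens
]
guesses = [doc_scores.argmax(axis=1) for doc_scores in scores_per_doc]
assert [g.tolist() for g in guesses] == [[1, 0], []]
# -------------------------------------------------------------------------------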
return - - tag_scores, bp_tag_scores = self.model.begin_update(docs, drop=drop) - loss, d_tag_scores = self.get_loss(docs, golds, tag_scores) - bp_tag_scores(d_tag_scores, sgd=sgd) + set_dropout_rate(self.model, drop) + tag_scores, bp_tag_scores = self.model.begin_update([ex.doc for ex in examples]) + for sc in tag_scores: + if self.model.ops.xp.isnan(sc.sum()): + raise ValueError("nan value in scores") + loss, d_tag_scores = self.get_loss(examples, tag_scores) + bp_tag_scores(d_tag_scores) + if sgd not in (None, False): + self.model.finish_update(sgd) if losses is not None: losses[self.name] += loss + if set_annotations: + docs = [ex.doc for ex in examples] + self.set_annotations(docs, self._scores2guesses(tag_scores)) - def rehearse(self, docs, drop=0., sgd=None, losses=None): + def rehearse(self, examples, drop=0., sgd=None, losses=None): """Perform a 'rehearsal' update, where we try to match the output of an initial model. """ if self._rehearsal_model is None: return + examples = Example.to_example_objects(examples) + docs = [ex.doc for ex in examples] if not any(len(doc) for doc in docs): # Handle cases where there are no tokens in any docs. return - guesses, backprop = self.model.begin_update(docs, drop=drop) - target = self._rehearsal_model(docs) + set_dropout_rate(self.model, drop) + guesses, backprop = self.model.begin_update(docs) + target = self._rehearsal_model(examples) gradient = guesses - target - backprop(gradient, sgd=sgd) + backprop(gradient) + self.model.finish_update(sgd) if losses is not None: losses.setdefault(self.name, 0.0) losses[self.name] += (gradient**2).sum() - def get_loss(self, docs, golds, scores): - scores = self.model.ops.flatten(scores) - tag_index = {tag: i for i, tag in enumerate(self.labels)} - cdef int idx = 0 - correct = numpy.zeros((scores.shape[0],), dtype="i") - guesses = scores.argmax(axis=1) - known_labels = numpy.ones((scores.shape[0], 1), dtype="f") - for gold in golds: - for tag in gold.tags: - if tag is None: - correct[idx] = guesses[idx] - elif tag in tag_index: - correct[idx] = tag_index[tag] - else: - correct[idx] = 0 - known_labels[idx] = 0. 
- idx += 1 - correct = self.model.ops.xp.array(correct, dtype="i") - d_scores = scores - to_categorical(correct, nb_classes=scores.shape[1]) - d_scores *= self.model.ops.asarray(known_labels) - loss = (d_scores**2).sum() - d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs]) + def get_loss(self, examples, scores): + loss_func = SequenceCategoricalCrossentropy(names=self.labels) + truths = [eg.gold.tags for eg in examples] + d_scores, loss = loss_func(scores, truths) + if self.model.ops.xp.isnan(loss): + raise ValueError("nan value when computing loss") return float(loss), d_scores - def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, + def begin_training(self, get_examples=lambda: [], pipeline=None, sgd=None, **kwargs): lemma_tables = ["lemma_rules", "lemma_index", "lemma_exc", "lemma_lookup"] if not any(table in self.vocab.lookups for table in lemma_tables): warnings.warn(Warnings.W022) orig_tag_map = dict(self.vocab.morphology.tag_map) - new_tag_map = OrderedDict() - for raw_text, annots_brackets in get_gold_tuples(): - for annots, brackets in annots_brackets: - ids, words, tags, heads, deps, ents = annots - for tag in tags: - if tag in orig_tag_map: - new_tag_map[tag] = orig_tag_map[tag] - else: - new_tag_map[tag] = {POS: X} + new_tag_map = {} + for example in get_examples(): + for tag in example.token_annotation.tags: + if tag in orig_tag_map: + new_tag_map[tag] = orig_tag_map[tag] + else: + new_tag_map[tag] = {POS: X} + cdef Vocab vocab = self.vocab if new_tag_map: vocab.morphology = Morphology(vocab.strings, new_tag_map, vocab.morphology.lemmatizer, exc=vocab.morphology.exc) - self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors") - if self.model is True: - for hp in ["token_vector_width", "conv_depth"]: - if hp in kwargs: - self.cfg[hp] = kwargs[hp] - self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) + self.set_output(len(self.labels)) + self.model.initialize() + # Get batch of example docs, example outputs to call begin_training(). + # This lets the model infer shapes. link_vectors_to_models(self.vocab) if sgd is None: sgd = self.create_optimizer() return sgd - @classmethod - def Model(cls, n_tags, **cfg): - if cfg.get("pretrained_dims") and not cfg.get("pretrained_vectors"): - raise ValueError(TempErrors.T008) - return build_tagger_model(n_tags, **cfg) - def add_label(self, label, values=None): - if not isinstance(label, basestring_): + if not isinstance(label, str): raise ValueError(Errors.E187) if label in self.labels: return 0 - if self.model not in (True, False, None): + if self.model.has_dim("nO"): # Here's how the model resizing will work, once the # neuron-to-tag mapping is no longer controlled by # the Morphology class, which sorts the tag names. 
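
> Aside (not part of the patch): the tagger's new `get_loss` above hands the cross-entropy bookkeeping to thinc's `SequenceCategoricalCrossentropy`, which takes per-document score arrays plus string labels and returns the gradient together with a scalar loss, in that order. A minimal sketch of that call outside any pipeline, assuming the same thinc API the patch imports; the tag names and score values below are invented for illustration.

```python
import numpy
from thinc.api import SequenceCategoricalCrossentropy

labels = ("NOUN", "VERB")  # hypothetical tag set
loss_func = SequenceCategoricalCrossentropy(names=labels)

# One "document" of three tokens with made-up per-tag scores.
scores = [numpy.asarray([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], dtype="f")]
truths = [["NOUN", "VERB", "NOUN"]]

# Same calling convention as Tagger.get_loss above: gradient first, then loss.
d_scores, loss = loss_func(scores, truths)
assert d_scores[0].shape == scores[0].shape
assert loss >= 0
```
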
@@ -579,26 +570,17 @@ class Tagger(Pipe): yield def to_bytes(self, exclude=tuple(), **kwargs): - serialize = OrderedDict() - if self.model not in (None, True, False): - serialize["model"] = self.model.to_bytes + serialize = {} + serialize["model"] = self.model.to_bytes serialize["vocab"] = self.vocab.to_bytes serialize["cfg"] = lambda: srsly.json_dumps(self.cfg) - tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items())) + tag_map = dict(sorted(self.vocab.morphology.tag_map.items())) serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map) exclude = util.get_serialization_exclude(serialize, exclude, kwargs) return util.to_bytes(serialize, exclude) def from_bytes(self, bytes_data, exclude=tuple(), **kwargs): def load_model(b): - # TODO: Remove this once we don't have to handle previous models - if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg: - self.cfg["pretrained_vectors"] = self.vocab.vectors.name - if self.model is True: - token_vector_width = util.env_opt( - "token_vector_width", - self.cfg.get("token_vector_width", 96)) - self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) try: self.model.from_bytes(b) except AttributeError: @@ -611,34 +593,29 @@ class Tagger(Pipe): lemmatizer=self.vocab.morphology.lemmatizer, exc=self.vocab.morphology.exc) - deserialize = OrderedDict(( - ("vocab", lambda b: self.vocab.from_bytes(b)), - ("tag_map", load_tag_map), - ("cfg", lambda b: self.cfg.update(srsly.json_loads(b))), - ("model", lambda b: load_model(b)), - )) + deserialize = { + "vocab": lambda b: self.vocab.from_bytes(b), + "tag_map": load_tag_map, + "cfg": lambda b: self.cfg.update(srsly.json_loads(b)), + "model": lambda b: load_model(b), + } exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) util.from_bytes(bytes_data, deserialize, exclude) return self def to_disk(self, path, exclude=tuple(), **kwargs): - tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items())) - serialize = OrderedDict(( - ("vocab", lambda p: self.vocab.to_disk(p)), - ("tag_map", lambda p: srsly.write_msgpack(p, tag_map)), - ("model", lambda p: p.open("wb").write(self.model.to_bytes())), - ("cfg", lambda p: srsly.write_json(p, self.cfg)) - )) + tag_map = dict(sorted(self.vocab.morphology.tag_map.items())) + serialize = { + "vocab": lambda p: self.vocab.to_disk(p), + "tag_map": lambda p: srsly.write_msgpack(p, tag_map), + "model": lambda p: p.open("wb").write(self.model.to_bytes()), + "cfg": lambda p: srsly.write_json(p, self.cfg), + } exclude = util.get_serialization_exclude(serialize, exclude, kwargs) util.to_disk(path, serialize, exclude) def from_disk(self, path, exclude=tuple(), **kwargs): def load_model(p): - # TODO: Remove this once we don't have to handle previous models - if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg: - self.cfg["pretrained_vectors"] = self.vocab.vectors.name - if self.model is True: - self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) with p.open("rb") as file_: try: self.model.from_bytes(file_.read()) @@ -652,12 +629,137 @@ class Tagger(Pipe): lemmatizer=self.vocab.morphology.lemmatizer, exc=self.vocab.morphology.exc) - deserialize = OrderedDict(( - ("cfg", lambda p: self.cfg.update(_load_cfg(p))), - ("vocab", lambda p: self.vocab.from_disk(p)), - ("tag_map", load_tag_map), - ("model", load_model), - )) + deserialize = { + "vocab": lambda p: self.vocab.from_disk(p), + "cfg": lambda p: self.cfg.update(_load_cfg(p)), + "tag_map": load_tag_map, + "model": load_model, + } + 
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) + util.from_disk(path, deserialize, exclude) + return self + + +@component("senter", assigns=["token.is_sent_start"], default_model=default_senter) +class SentenceRecognizer(Tagger): + """Pipeline component for sentence segmentation. + + DOCS: https://spacy.io/api/sentencerecognizer + """ + + def __init__(self, vocab, model, **cfg): + self.vocab = vocab + self.model = model + self._rehearsal_model = None + self.cfg = dict(sorted(cfg.items())) + + @property + def labels(self): + # labels are numbered by index internally, so this matches GoldParse + # and Example where the sentence-initial tag is 1 and other positions + # are 0 + return tuple(["I", "S"]) + + def set_annotations(self, docs, batch_tag_ids): + if isinstance(docs, Doc): + docs = [docs] + cdef Doc doc + for i, doc in enumerate(docs): + doc_tag_ids = batch_tag_ids[i] + if hasattr(doc_tag_ids, "get"): + doc_tag_ids = doc_tag_ids.get() + for j, tag_id in enumerate(doc_tag_ids): + # Don't clobber existing sentence boundaries + if doc.c[j].sent_start == 0: + if tag_id == 1: + doc.c[j].sent_start = 1 + else: + doc.c[j].sent_start = -1 + + def get_loss(self, examples, scores): + scores = self.model.ops.flatten(scores) + tag_index = range(len(self.labels)) + cdef int idx = 0 + correct = numpy.zeros((scores.shape[0],), dtype="i") + guesses = scores.argmax(axis=1) + known_labels = numpy.ones((scores.shape[0], 1), dtype="f") + for ex in examples: + gold = ex.gold + for sent_start in gold.sent_starts: + if sent_start is None: + correct[idx] = guesses[idx] + elif sent_start in tag_index: + correct[idx] = sent_start + else: + correct[idx] = 0 + known_labels[idx] = 0. + idx += 1 + correct = self.model.ops.xp.array(correct, dtype="i") + d_scores = scores - to_categorical(correct, n_classes=scores.shape[1]) + d_scores *= self.model.ops.asarray(known_labels) + loss = (d_scores**2).sum() + docs = [ex.doc for ex in examples] + d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs]) + return float(loss), d_scores + + def begin_training(self, get_examples=lambda: [], pipeline=None, sgd=None, + **kwargs): + self.set_output(len(self.labels)) + self.model.initialize() + link_vectors_to_models(self.vocab) + if sgd is None: + sgd = self.create_optimizer() + return sgd + + def add_label(self, label, values=None): + raise NotImplementedError + + def to_bytes(self, exclude=tuple(), **kwargs): + serialize = {} + serialize["model"] = self.model.to_bytes + serialize["vocab"] = self.vocab.to_bytes + serialize["cfg"] = lambda: srsly.json_dumps(self.cfg) + exclude = util.get_serialization_exclude(serialize, exclude, kwargs) + return util.to_bytes(serialize, exclude) + + def from_bytes(self, bytes_data, exclude=tuple(), **kwargs): + def load_model(b): + try: + self.model.from_bytes(b) + except AttributeError: + raise ValueError(Errors.E149) + + deserialize = { + "vocab": lambda b: self.vocab.from_bytes(b), + "cfg": lambda b: self.cfg.update(srsly.json_loads(b)), + "model": lambda b: load_model(b), + } + exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) + util.from_bytes(bytes_data, deserialize, exclude) + return self + + def to_disk(self, path, exclude=tuple(), **kwargs): + serialize = { + "vocab": lambda p: self.vocab.to_disk(p), + "model": lambda p: p.open("wb").write(self.model.to_bytes()), + "cfg": lambda p: srsly.write_json(p, self.cfg), + } + exclude = util.get_serialization_exclude(serialize, exclude, kwargs) + util.to_disk(path, serialize, exclude) + + 
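
> Aside (not part of the patch): the `SentenceRecognizer` added above scores each token with two classes, index 1 for a sentence-initial token ("S") and index 0 for any other position ("I"), and its `set_annotations` only writes boundaries that are still unset. A small self-contained sketch of that mapping on plain Python data; the tag ids are invented for illustration, real values come from the model.

```python
import numpy

labels = ("I", "S")                     # index 1 marks a sentence-initial token
tag_ids = numpy.array([1, 0, 0, 1, 0])  # hypothetical predictions for a 5-token doc
sent_start = [0] * len(tag_ids)         # 0 = unset, as in doc.c[j].sent_start

for j, tag_id in enumerate(tag_ids):
    if sent_start[j] == 0:              # don't clobber boundaries set by earlier components
        sent_start[j] = 1 if tag_id == 1 else -1

assert [labels[t] for t in tag_ids] == ["S", "I", "I", "S", "I"]
assert sent_start == [1, -1, -1, 1, -1]
```
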
def from_disk(self, path, exclude=tuple(), **kwargs): + def load_model(p): + with p.open("rb") as file_: + try: + self.model.from_bytes(file_.read()) + except AttributeError: + raise ValueError(Errors.E149) + + deserialize = { + "vocab": lambda p: self.vocab.from_disk(p), + "cfg": lambda p: self.cfg.update(_load_cfg(p)), + "model": load_model, + } exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) util.from_disk(path, deserialize, exclude) return self @@ -669,7 +771,7 @@ class MultitaskObjective(Tagger): side-objective. """ - def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg): + def __init__(self, vocab, model, target='dep_tag_offset', **cfg): self.vocab = vocab self.model = model if target == "dep": @@ -689,7 +791,8 @@ class MultitaskObjective(Tagger): else: raise ValueError(Errors.E016) self.cfg = dict(cfg) - self.cfg.setdefault("cnn_maxout_pieces", 2) + # TODO: remove - put in config + self.cfg.setdefault("maxout_pieces", 2) @property def labels(self): @@ -702,99 +805,81 @@ class MultitaskObjective(Tagger): def set_annotations(self, docs, dep_ids, tensors=None): pass - def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, tok2vec=None, + def begin_training(self, get_examples=lambda: [], pipeline=None, tok2vec=None, sgd=None, **kwargs): - gold_tuples = nonproj.preprocess_training_data(get_gold_tuples()) - for raw_text, annots_brackets in gold_tuples: - for annots, brackets in annots_brackets: - ids, words, tags, heads, deps, ents = annots - for i in range(len(ids)): - label = self.make_label(i, words, tags, heads, deps, ents) - if label is not None and label not in self.labels: - self.labels[label] = len(self.labels) - if self.model is True: - token_vector_width = util.env_opt("token_vector_width") - self.model = self.Model(len(self.labels), tok2vec=tok2vec) + gold_examples = nonproj.preprocess_training_data(get_examples()) + # for raw_text, doc_annot in gold_tuples: + for example in gold_examples: + for i in range(len(example.token_annotation.ids)): + label = self.make_label(i, example.token_annotation) + if label is not None and label not in self.labels: + self.labels[label] = len(self.labels) + self.model.initialize() link_vectors_to_models(self.vocab) if sgd is None: sgd = self.create_optimizer() return sgd - @classmethod - def Model(cls, n_tags, tok2vec=None, **cfg): - token_vector_width = util.env_opt("token_vector_width", 96) - softmax = Softmax(n_tags, token_vector_width*2) - model = chain( - tok2vec, - LayerNorm(Maxout(token_vector_width*2, token_vector_width, pieces=3)), - softmax - ) - model.tok2vec = tok2vec - model.softmax = softmax - return model - def predict(self, docs): - self.require_model() - tokvecs = self.model.tok2vec(docs) - scores = self.model.softmax(tokvecs) + tokvecs = self.model.get_ref("tok2vec")(docs) + scores = self.model.get_ref("softmax")(tokvecs) return tokvecs, scores - def get_loss(self, docs, golds, scores): - if len(docs) != len(golds): - raise ValueError(Errors.E077.format(value="loss", n_docs=len(docs), - n_golds=len(golds))) + def get_loss(self, examples, scores): cdef int idx = 0 correct = numpy.zeros((scores.shape[0],), dtype="i") guesses = scores.argmax(axis=1) + golds = [ex.gold for ex in examples] + docs = [ex.doc for ex in examples] for i, gold in enumerate(golds): for j in range(len(docs[i])): - # Handes alignment for tokenization differences - label = self.make_label(j, gold.words, gold.tags, - gold.heads, gold.labels, gold.ents) + # Handels alignment for tokenization differences + 
token_annotation = gold.get_token_annotation() + label = self.make_label(j, token_annotation) if label is None or label not in self.labels: correct[idx] = guesses[idx] else: correct[idx] = self.labels[label] idx += 1 correct = self.model.ops.xp.array(correct, dtype="i") - d_scores = scores - to_categorical(correct, nb_classes=scores.shape[1]) + d_scores = scores - to_categorical(correct, n_classes=scores.shape[1]) loss = (d_scores**2).sum() return float(loss), d_scores @staticmethod - def make_dep(i, words, tags, heads, deps, ents): - if deps[i] is None or heads[i] is None: + def make_dep(i, token_annotation): + if token_annotation.deps[i] is None or token_annotation.heads[i] is None: return None - return deps[i] + return token_annotation.deps[i] @staticmethod - def make_tag(i, words, tags, heads, deps, ents): - return tags[i] + def make_tag(i, token_annotation): + return token_annotation.tags[i] @staticmethod - def make_ent(i, words, tags, heads, deps, ents): - if ents is None: + def make_ent(i, token_annotation): + if token_annotation.entities is None: return None - return ents[i] + return token_annotation.entities[i] @staticmethod - def make_dep_tag_offset(i, words, tags, heads, deps, ents): - if deps[i] is None or heads[i] is None: + def make_dep_tag_offset(i, token_annotation): + if token_annotation.deps[i] is None or token_annotation.heads[i] is None: return None - offset = heads[i] - i + offset = token_annotation.heads[i] - i offset = min(offset, 2) offset = max(offset, -2) - return "%s-%s:%d" % (deps[i], tags[i], offset) + return f"{token_annotation.deps[i]}-{token_annotation.tags[i]}:{offset}" @staticmethod - def make_ent_tag(i, words, tags, heads, deps, ents): - if ents is None or ents[i] is None: + def make_ent_tag(i, token_annotation): + if token_annotation.entities is None or token_annotation.entities[i] is None: return None else: - return "%s-%s" % (tags[i], ents[i]) + return f"{token_annotation.tags[i]}-{token_annotation.entities[i]}" @staticmethod - def make_sent_start(target, words, tags, heads, deps, ents, cache=True, _cache={}): + def make_sent_start(target, token_annotation, cache=True, _cache={}): """A multi-task objective for representing sentence boundaries, using BILU scheme. (O is impossible) @@ -803,6 +888,8 @@ class MultitaskObjective(Tagger): of gold data. You can pass cache=False if you know the cache will do the wrong thing. 
""" + words = token_annotation.words + heads = token_annotation.heads assert len(words) == len(heads) assert target < len(words), (target, len(words)) if cache: @@ -840,99 +927,66 @@ class MultitaskObjective(Tagger): class ClozeMultitask(Pipe): - @classmethod - def Model(cls, vocab, tok2vec, **cfg): - output_size = vocab.vectors.data.shape[1] - output_layer = chain( - LayerNorm(Maxout(output_size, tok2vec.nO, pieces=3)), - zero_init(Affine(output_size, output_size, drop_factor=0.0)) - ) - model = chain(tok2vec, output_layer) - model = masked_language_model(vocab, model) - model.tok2vec = tok2vec - model.output_layer = output_layer - return model - - def __init__(self, vocab, model=True, **cfg): + def __init__(self, vocab, model, **cfg): self.vocab = vocab self.model = model self.cfg = cfg + self.distance = CosineDistance(ignore_zeros=True, normalize=False) def set_annotations(self, docs, dep_ids, tensors=None): pass - def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, + def begin_training(self, get_examples=lambda: [], pipeline=None, tok2vec=None, sgd=None, **kwargs): link_vectors_to_models(self.vocab) - if self.model is True: - self.model = self.Model(self.vocab, tok2vec) - X = self.model.ops.allocate((5, self.model.tok2vec.nO)) + self.model.initialize() + X = self.model.ops.alloc((5, self.model.get_ref("tok2vec").get_dim("nO"))) self.model.output_layer.begin_training(X) if sgd is None: sgd = self.create_optimizer() return sgd def predict(self, docs): - self.require_model() - tokvecs = self.model.tok2vec(docs) - vectors = self.model.output_layer(tokvecs) + tokvecs = self.model.get_ref("tok2vec")(docs) + vectors = self.model.get_ref("output_layer")(tokvecs) return tokvecs, vectors - def get_loss(self, docs, vectors, prediction): + def get_loss(self, examples, vectors, prediction): # The simplest way to implement this would be to vstack the # token.vector values, but that's a bit inefficient, especially on GPU. # Instead we fetch the index into the vectors table for each of our tokens, # and look them up all at once. This prevents data copying. - ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs]) + ids = self.model.ops.flatten([ex.doc.to_array(ID).ravel() for ex in examples]) target = vectors[ids] - loss, gradient = get_cossim_loss(prediction, target, ignore_zeros=True) - return float(loss), gradient + gradient = self.distance.get_grad(prediction, target) + loss = self.distance.get_loss(prediction, target) + return loss, gradient - def update(self, docs, golds, drop=0., sgd=None, losses=None): + def update(self, examples, drop=0., set_annotations=False, sgd=None, losses=None): pass - def rehearse(self, docs, drop=0., sgd=None, losses=None): - self.require_model() + def rehearse(self, examples, drop=0., sgd=None, losses=None): + examples = Example.to_example_objects(examples) if losses is not None and self.name not in losses: losses[self.name] = 0. 
- predictions, bp_predictions = self.model.begin_update(docs, drop=drop) - loss, d_predictions = self.get_loss(docs, self.vocab.vectors.data, predictions) - bp_predictions(d_predictions, sgd=sgd) + set_dropout_rate(self.model, drop) + predictions, bp_predictions = self.model.begin_update([ex.doc for ex in examples]) + loss, d_predictions = self.get_loss(examples, self.vocab.vectors.data, predictions) + bp_predictions(d_predictions) + if sgd is not None: + self.model.finish_update(sgd) if losses is not None: losses[self.name] += loss -@component("textcat", assigns=["doc.cats"]) +@component("textcat", assigns=["doc.cats"], default_model=default_textcat) class TextCategorizer(Pipe): """Pipeline component for text classification. DOCS: https://spacy.io/api/textcategorizer """ - - @classmethod - def Model(cls, nr_class=1, **cfg): - embed_size = util.env_opt("embed_size", 2000) - if "token_vector_width" in cfg: - token_vector_width = cfg["token_vector_width"] - else: - token_vector_width = util.env_opt("token_vector_width", 96) - if cfg.get("architecture") == "simple_cnn": - tok2vec = Tok2Vec(token_vector_width, embed_size, **cfg) - return build_simple_cnn_text_classifier(tok2vec, nr_class, **cfg) - elif cfg.get("architecture") == "bow": - return build_bow_text_classifier(nr_class, **cfg) - else: - return build_text_classifier(nr_class, **cfg) - - @property - def tok2vec(self): - if self.model in (None, True, False): - return None - else: - return self.model.tok2vec - - def __init__(self, vocab, model=True, **cfg): + def __init__(self, vocab, model, **cfg): self.vocab = vocab self.model = model self._rehearsal_model = None @@ -951,15 +1005,20 @@ class TextCategorizer(Pipe): def labels(self, value): self.cfg["labels"] = tuple(value) - def pipe(self, stream, batch_size=128, n_threads=-1): - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) + def pipe(self, stream, batch_size=128, n_threads=-1, as_example=False): + for examples in util.minibatch(stream, size=batch_size): + docs = [self._get_doc(ex) for ex in examples] scores, tensors = self.predict(docs) self.set_annotations(docs, scores, tensors=tensors) - yield from docs + + if as_example: + for ex, doc in zip(examples, docs): + ex.doc = doc + yield ex + else: + yield from docs def predict(self, docs): - self.require_model() tensors = [doc.tensor for doc in docs] if not any(len(doc) for doc in docs): @@ -968,7 +1027,7 @@ class TextCategorizer(Pipe): scores = xp.zeros((len(docs), len(self.labels))) return scores, tensors - scores = self.model(docs) + scores = self.model.predict(docs) scores = self.model.ops.asarray(scores) return scores, tensors @@ -977,33 +1036,45 @@ class TextCategorizer(Pipe): for j, label in enumerate(self.labels): doc.cats[label] = float(scores[i, j]) - def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None): - self.require_model() - if not any(len(doc) for doc in docs): + def update(self, examples, state=None, drop=0., set_annotations=False, sgd=None, losses=None): + examples = Example.to_example_objects(examples) + if not any(len(ex.doc) if ex.doc else 0 for ex in examples): # Handle cases where there are no tokens in any docs. 
return - scores, bp_scores = self.model.begin_update(docs, drop=drop) - loss, d_scores = self.get_loss(docs, golds, scores) - bp_scores(d_scores, sgd=sgd) + set_dropout_rate(self.model, drop) + scores, bp_scores = self.model.begin_update([ex.doc for ex in examples]) + loss, d_scores = self.get_loss(examples, scores) + bp_scores(d_scores) + if sgd is not None: + self.model.finish_update(sgd) if losses is not None: losses.setdefault(self.name, 0.0) losses[self.name] += loss + if set_annotations: + docs = [ex.doc for ex in examples] + self.set_annotations(docs, scores=scores) - def rehearse(self, docs, drop=0., sgd=None, losses=None): + def rehearse(self, examples, drop=0., sgd=None, losses=None): if self._rehearsal_model is None: return + examples = Example.to_example_objects(examples) + docs=[ex.doc for ex in examples] if not any(len(doc) for doc in docs): # Handle cases where there are no tokens in any docs. return - scores, bp_scores = self.model.begin_update(docs, drop=drop) - target = self._rehearsal_model(docs) + set_dropout_rate(self.model, drop) + scores, bp_scores = self.model.begin_update(docs) + target = self._rehearsal_model(examples) gradient = scores - target - bp_scores(gradient, sgd=sgd) + bp_scores(gradient) + if sgd is not None: + self.model.finish_update(sgd) if losses is not None: losses.setdefault(self.name, 0.0) losses[self.name] += (gradient**2).sum() - def get_loss(self, docs, golds, scores): + def _examples_to_truth(self, examples): + golds = [ex.gold for ex in examples] truths = numpy.zeros((len(golds), len(self.labels)), dtype="f") not_missing = numpy.ones((len(golds), len(self.labels)), dtype="f") for i, gold in enumerate(golds): @@ -1013,6 +1084,10 @@ class TextCategorizer(Pipe): else: not_missing[i, j] = 0. truths = self.model.ops.asarray(truths) + return truths, not_missing + + def get_loss(self, examples, scores): + truths, not_missing = self._examples_to_truth(examples) not_missing = self.model.ops.asarray(not_missing) d_scores = (scores-truths) / scores.shape[0] d_scores *= not_missing @@ -1020,35 +1095,36 @@ class TextCategorizer(Pipe): return float(mean_square_error), d_scores def add_label(self, label): - if not isinstance(label, basestring_): + if not isinstance(label, str): raise ValueError(Errors.E187) if label in self.labels: return 0 - if self.model not in (None, True, False): + if self.model.has_dim("nO"): # This functionality was available previously, but was broken. # The problem is that we resize the last layer, but the last layer # is actually just an ensemble. We're not resizing the child layers # - a huge problem. 
raise ValueError(Errors.E116) # smaller = self.model._layers[-1] - # larger = Affine(len(self.labels)+1, smaller.nI) + # larger = Linear(len(self.labels)+1, smaller.nI) # copy_array(larger.W[:smaller.nO], smaller.W) # copy_array(larger.b[:smaller.nO], smaller.b) # self.model._layers[-1] = larger self.labels = tuple(list(self.labels) + [label]) return 1 - def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): - for raw_text, annot_brackets in get_gold_tuples(): - for _, (cats, _2) in annot_brackets: - for cat in cats: - self.add_label(cat) - if self.model is True: - self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors") - self.cfg["pretrained_dims"] = kwargs.get("pretrained_dims") - self.require_labels() - self.model = self.Model(len(self.labels), **self.cfg) - link_vectors_to_models(self.vocab) + def begin_training(self, get_examples=lambda: [], pipeline=None, sgd=None, **kwargs): + # TODO: begin_training is not guaranteed to see all data / labels ? + examples = list(get_examples()) + for example in examples: + for cat in example.doc_annotation.cats: + self.add_label(cat) + self.require_labels() + docs = [Doc(Vocab(), words=["hello"])] + truths, _ = self._examples_to_truth(examples) + self.set_output(len(self.labels)) + link_vectors_to_models(self.vocab) + self.model.initialize(X=docs, Y=truths) if sgd is None: sgd = self.create_optimizer() return sgd @@ -1081,14 +1157,20 @@ cdef class DependencyParser(Parser): labeller = MultitaskObjective(self.vocab, target=target) self._multitasks.append(labeller) - def init_multitask_objectives(self, get_gold_tuples, pipeline, sgd=None, **cfg): + def init_multitask_objectives(self, get_examples, pipeline, sgd=None, **cfg): for labeller in self._multitasks: - tok2vec = self.model.tok2vec - labeller.begin_training(get_gold_tuples, pipeline=pipeline, + tok2vec = self.model.get_ref("tok2vec") + labeller.begin_training(get_examples, pipeline=pipeline, tok2vec=tok2vec, sgd=sgd) def __reduce__(self): - return (DependencyParser, (self.vocab, self.moves, self.model), None, None) + return (DependencyParser, (self.vocab, self.model), self.moves) + + def __getstate__(self): + return self.moves + + def __setstate__(self, moves): + self.moves = moves @property def labels(self): @@ -1113,7 +1195,6 @@ cdef class EntityRecognizer(Parser): assigns = ["doc.ents", "token.ent_iob", "token.ent_type"] requires = [] TransitionSystem = BiluoPushDown - nr_feature = 6 def add_multitask_objective(self, target): if target == "cloze": @@ -1123,15 +1204,20 @@ cdef class EntityRecognizer(Parser): labeller = MultitaskObjective(self.vocab, target=target) self._multitasks.append(labeller) - def init_multitask_objectives(self, get_gold_tuples, pipeline, sgd=None, **cfg): + def init_multitask_objectives(self, get_examples, pipeline, sgd=None, **cfg): for labeller in self._multitasks: - tok2vec = self.model.tok2vec - labeller.begin_training(get_gold_tuples, pipeline=pipeline, + tok2vec = self.model.get_ref("tok2vec") + labeller.begin_training(get_examples, pipeline=pipeline, tok2vec=tok2vec) def __reduce__(self): - return (EntityRecognizer, (self.vocab, self.moves, self.model), - None, None) + return (EntityRecognizer, (self.vocab, self.model), self.moves) + + def __getstate__(self): + return self.moves + + def __setstate__(self, moves): + self.moves = moves @property def labels(self): @@ -1145,7 +1231,8 @@ cdef class EntityRecognizer(Parser): @component( "entity_linker", requires=["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"], - 
assigns=["token.ent_kb_id"] + assigns=["token.ent_kb_id"], + default_model=default_nel, ) class EntityLinker(Pipe): """Pipeline component for named entity linking. @@ -1154,65 +1241,49 @@ class EntityLinker(Pipe): """ NIL = "NIL" # string used to refer to a non-existing link - @classmethod - def Model(cls, **cfg): - embed_width = cfg.get("embed_width", 300) - hidden_width = cfg.get("hidden_width", 128) - type_to_int = cfg.get("type_to_int", dict()) - - model = build_nel_encoder(embed_width=embed_width, hidden_width=hidden_width, ner_types=len(type_to_int), **cfg) - return model - - def __init__(self, vocab, **cfg): + def __init__(self, vocab, model, **cfg): self.vocab = vocab - self.model = True + self.model = model self.kb = None + self.kb = cfg.get("kb", None) + if self.kb is None: + # create an empty KB that should be filled by calling from_disk + self.kb = KnowledgeBase(vocab=vocab) + else: + del cfg["kb"] # we don't want to duplicate its serialization + if not isinstance(self.kb, KnowledgeBase): + raise ValueError(Errors.E990.format(type=type(self.kb))) self.cfg = dict(cfg) - - def set_kb(self, kb): - self.kb = kb - - def require_model(self): - # Raise an error if the component's model is not initialized. - if getattr(self, "model", None) in (None, True, False): - raise ValueError(Errors.E109.format(name=self.name)) + self.distance = CosineDistance(normalize=False) def require_kb(self): # Raise an error if the knowledge base is not initialized. - if getattr(self, "kb", None) in (None, True, False): + if len(self.kb) == 0: raise ValueError(Errors.E139.format(name=self.name)) - def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): + def begin_training(self, get_examples=lambda: [], pipeline=None, sgd=None, **kwargs): self.require_kb() - self.cfg["entity_width"] = self.kb.entity_vector_length - - if self.model is True: - self.model = self.Model(**self.cfg) - + nO = self.kb.entity_vector_length + self.set_output(nO) + self.model.initialize() if sgd is None: sgd = self.create_optimizer() - return sgd - def update(self, docs, golds, state=None, drop=0.0, sgd=None, losses=None): - self.require_model() + def update(self, examples, state=None, set_annotations=False, drop=0.0, sgd=None, losses=None): self.require_kb() - if losses is not None: losses.setdefault(self.name, 0.0) - - if not docs or not golds: + if not examples: return 0 - - if len(docs) != len(golds): - raise ValueError(Errors.E077.format(value="EL training", n_docs=len(docs), - n_golds=len(golds))) - - if isinstance(docs, Doc): - docs = [docs] - golds = [golds] - + examples = Example.to_example_objects(examples) sentence_docs = [] + docs = [ex.doc for ex in examples] + if set_annotations: + # This seems simpler than other ways to get that exact output -- but + # it does run the model twice :( + predictions = self.model.predict(docs) + golds = [ex.gold for ex in examples] for doc, gold in zip(docs, golds): ents_by_offset = dict() @@ -1220,6 +1291,8 @@ class EntityLinker(Pipe): ents_by_offset[(ent.start_char, ent.end_char)] = ent for entity, kb_dict in gold.links.items(): + if isinstance(entity, str): + entity = literal_eval(entity) start, end = entity mention = doc.text[start:end] @@ -1229,23 +1302,27 @@ class EntityLinker(Pipe): ent = ents_by_offset[(start, end)] for kb_id, value in kb_dict.items(): - # Currently only training on the positive instances + # Currently only training on the positive instances - we assume there is at least 1 per doc/gold if value: try: 
sentence_docs.append(ent.sent.as_doc()) except AttributeError: # Catch the exception when ent.sent is None and provide a user-friendly warning raise RuntimeError(Errors.E030) - - sentence_encodings, bp_context = self.model.begin_update(sentence_docs, drop=drop) - loss, d_scores = self.get_similarity_loss(scores=sentence_encodings, golds=golds, docs=None) - bp_context(d_scores, sgd=sgd) + set_dropout_rate(self.model, drop) + sentence_encodings, bp_context = self.model.begin_update(sentence_docs) + loss, d_scores = self.get_similarity_loss(scores=sentence_encodings, golds=golds) + bp_context(d_scores) + if sgd is not None: + self.model.finish_update(sgd) if losses is not None: losses[self.name] += loss + if set_annotations: + self.set_annotations(docs, predictions) return loss - def get_similarity_loss(self, docs, golds, scores): + def get_similarity_loss(self, golds, scores): entity_encodings = [] for gold in golds: for entity, kb_dict in gold.links.items(): @@ -1258,16 +1335,17 @@ class EntityLinker(Pipe): entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32") if scores.shape != entity_encodings.shape: - raise RuntimeError(Errors.E147.format(method="get_loss", msg="gold entities do not match up")) + raise RuntimeError(Errors.E147.format(method="get_similarity_loss", msg="gold entities do not match up")) - loss, gradients = get_cossim_loss(yh=scores, y=entity_encodings) + gradients = self.distance.get_grad(scores, entity_encodings) + loss = self.distance.get_loss(scores, entity_encodings) loss = loss / len(entity_encodings) return loss, gradients - def get_loss(self, docs, golds, scores): + def get_loss(self, examples, scores): cats = [] - for gold in golds: - for entity, kb_dict in gold.links.items(): + for ex in examples: + for entity, kb_dict in ex.gold.links.items(): for kb_id, value in kb_dict.items(): cats.append([value]) @@ -1280,23 +1358,31 @@ class EntityLinker(Pipe): loss = loss / len(cats) return loss, d_scores - def __call__(self, doc): + def __call__(self, example): + doc = self._get_doc(example) kb_ids, tensors = self.predict([doc]) self.set_annotations([doc], kb_ids, tensors=tensors) + if isinstance(example, Example): + example.doc = doc + return example return doc - def pipe(self, stream, batch_size=128, n_threads=-1): - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) + def pipe(self, stream, batch_size=128, n_threads=-1, as_example=False): + for examples in util.minibatch(stream, size=batch_size): + docs = [self._get_doc(ex) for ex in examples] kb_ids, tensors = self.predict(docs) self.set_annotations(docs, kb_ids, tensors=tensors) - yield from docs + + if as_example: + for ex, doc in zip(examples, docs): + ex.doc = doc + yield ex + else: + yield from docs def predict(self, docs): """ Return the KB IDs for each entity in each doc, including NIL if there is no prediction """ - self.require_model() self.require_kb() - entity_count = 0 final_kb_ids = [] final_tensors = [] @@ -1314,7 +1400,7 @@ class EntityLinker(Pipe): for sent in doc.sents: sent_doc = sent.as_doc() # currently, the context is the same for each entity in a sentence (should be refined) - sentence_encoding = self.model([sent_doc])[0] + sentence_encoding = self.model.predict([sent_doc])[0] xp = get_array_module(sentence_encoding) sentence_encoding_t = sentence_encoding.T sentence_norm = xp.linalg.norm(sentence_encoding_t) @@ -1366,7 +1452,7 @@ class EntityLinker(Pipe): scores = prior_probs + sims - (prior_probs*sims) # TODO: thresholding - best_index = 
scores.argmax() + best_index = scores.argmax().item() best_candidate = candidates[best_index] final_kb_ids.append(best_candidate.entity_) final_tensors.append(sentence_encoding) @@ -1390,39 +1476,36 @@ class EntityLinker(Pipe): token.ent_kb_id_ = kb_id def to_disk(self, path, exclude=tuple(), **kwargs): - serialize = OrderedDict() + serialize = {} + self.cfg["entity_width"] = self.kb.entity_vector_length serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg) serialize["vocab"] = lambda p: self.vocab.to_disk(p) serialize["kb"] = lambda p: self.kb.dump(p) - if self.model not in (None, True, False): - serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes()) + serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes()) exclude = util.get_serialization_exclude(serialize, exclude, kwargs) util.to_disk(path, serialize, exclude) def from_disk(self, path, exclude=tuple(), **kwargs): def load_model(p): - if self.model is True: - self.model = self.Model(**self.cfg) try: self.model.from_bytes(p.open("rb").read()) except AttributeError: raise ValueError(Errors.E149) def load_kb(p): - kb = KnowledgeBase(vocab=self.vocab, entity_vector_length=self.cfg["entity_width"]) - kb.load_bulk(p) - self.set_kb(kb) + self.kb = KnowledgeBase(vocab=self.vocab, entity_vector_length=self.cfg["entity_width"]) + self.kb.load_bulk(p) - deserialize = OrderedDict() - deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p)) + deserialize = {} deserialize["vocab"] = lambda p: self.vocab.from_disk(p) + deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p)) deserialize["kb"] = load_kb deserialize["model"] = load_model exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) util.from_disk(path, deserialize, exclude) return self - def rehearse(self, docs, sgd=None, losses=None, **config): + def rehearse(self, examples, sgd=None, losses=None, **config): raise NotImplementedError def add_label(self, label): @@ -1430,7 +1513,7 @@ class EntityLinker(Pipe): @component("sentencizer", assigns=["token.is_sent_start", "doc.sents"]) -class Sentencizer(object): +class Sentencizer(Pipe): """Segment the Doc into sentences using a rule-based strategy. DOCS: https://spacy.io/api/sentencizer @@ -1463,27 +1546,56 @@ class Sentencizer(object): self.punct_chars = set(self.default_punct_chars) @classmethod - def from_nlp(cls, nlp, **cfg): + def from_nlp(cls, nlp, model=None, **cfg): return cls(**cfg) - def __call__(self, doc): + def begin_training( + self, get_examples=lambda: [], pipeline=None, sgd=None, **kwargs + ): + pass + + def __call__(self, example): """Apply the sentencizer to a Doc and set Token.is_sent_start. - doc (Doc): The document to process. - RETURNS (Doc): The processed Doc. + example (Doc or Example): The document to process. + RETURNS (Doc or Example): The processed Doc or Example. 
        DOCS: https://spacy.io/api/sentencizer#call
        """
-        tags = self.predict([doc])
-        self.set_annotations([doc], tags)
+        doc = self._get_doc(example)
+        start = 0
+        seen_period = False
+        for i, token in enumerate(doc):
+            is_in_punct_chars = token.text in self.punct_chars
+            token.is_sent_start = i == 0
+            if seen_period and not token.is_punct and not is_in_punct_chars:
+                doc[start].is_sent_start = True
+                start = token.i
+                seen_period = False
+            elif is_in_punct_chars:
+                seen_period = True
+        if start < len(doc):
+            doc[start].is_sent_start = True
+        if isinstance(example, Example):
+            example.doc = doc
+            return example
         return doc

-    def pipe(self, stream, batch_size=128, n_threads=-1):
-        for docs in util.minibatch(stream, size=batch_size):
-            docs = list(docs)
-            tag_ids = self.predict(docs)
-            self.set_annotations(docs, tag_ids)
-            yield from docs
+    def pipe(self, stream, batch_size=128, n_threads=-1, as_example=False):
+        for examples in util.minibatch(stream, size=batch_size):
+            docs = [self._get_doc(ex) for ex in examples]
+            predictions = self.predict(docs)
+            if isinstance(predictions, tuple) and len(predictions) == 2:
+                scores, tensors = predictions
+                self.set_annotations(docs, scores, tensors=tensors)
+            else:
+                self.set_annotations(docs, predictions)
+            if as_example:
+                for ex, doc in zip(examples, docs):
+                    ex.doc = doc
+                    yield ex
+            else:
+                yield from docs

     def predict(self, docs):
         """Apply the pipeline's model to a batch of docs, without
@@ -1572,8 +1684,19 @@ class Sentencizer(object):

 # Cython classes can't be decorated, so we need to add the factories here
-Language.factories["parser"] = lambda nlp, **cfg: DependencyParser.from_nlp(nlp, **cfg)
-Language.factories["ner"] = lambda nlp, **cfg: EntityRecognizer.from_nlp(nlp, **cfg)
+Language.factories["parser"] = lambda nlp, model, **cfg: parser_factory(nlp, model, **cfg)
+Language.factories["ner"] = lambda nlp, model, **cfg: ner_factory(nlp, model, **cfg)
+def parser_factory(nlp, model, **cfg):
+    if model is None:
+        model = default_parser()
+        warnings.warn(Warnings.W098.format(name="parser"))
+    return DependencyParser.from_nlp(nlp, model, **cfg)

-__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"]
+def ner_factory(nlp, model, **cfg):
+    if model is None:
+        model = default_ner()
+        warnings.warn(Warnings.W098.format(name="ner"))
+    return EntityRecognizer.from_nlp(nlp, model, **cfg)
+
+__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer", "SentenceRecognizer"]
diff --git a/spacy/pipeline/simple_ner.py b/spacy/pipeline/simple_ner.py
new file mode 100644
index 000000000..c674046af
--- /dev/null
+++ b/spacy/pipeline/simple_ner.py
@@ -0,0 +1,151 @@
+from typing import List
+from thinc.types import Floats2d
+from thinc.api import SequenceCategoricalCrossentropy, set_dropout_rate
+from thinc.util import to_numpy
+
+from .defaults import default_simple_ner
+from ..gold import Example, spans_from_biluo_tags, iob_to_biluo, biluo_to_iob
+from ..tokens import Doc
+from ..language import component
+from ..util import link_vectors_to_models
+from .pipes import Pipe
+
+
+@component("simple_ner", assigns=["doc.ents"], default_model=default_simple_ner)
+class SimpleNER(Pipe):
+    """Named entity recognition with a tagging model.
The model should include + validity constraints to ensure that only valid tag sequences are returned.""" + + def __init__(self, vocab, model): + self.vocab = vocab + self.model = model + self.cfg = {"labels": []} + self.loss_func = SequenceCategoricalCrossentropy( + names=self.get_tag_names(), + normalize=True, + missing_value=None + ) + assert self.model is not None + + @property + def labels(self): + return self.cfg["labels"] + + @property + def is_biluo(self): + return self.model.name.startswith("biluo") + + def add_label(self, label): + if label not in self.cfg["labels"]: + self.cfg["labels"].append(label) + + def get_tag_names(self): + if self.is_biluo: + return ( + [f"B-{label}" for label in self.labels] + + [f"I-{label}" for label in self.labels] + + [f"L-{label}" for label in self.labels] + + [f"U-{label}" for label in self.labels] + + ["O"] + ) + else: + return ( + [f"B-{label}" for label in self.labels] + + [f"I-{label}" for label in self.labels] + + ["O"] + ) + + def predict(self, docs: List[Doc]) -> List[Floats2d]: + scores = self.model.predict(docs) + return scores + + def set_annotations(self, docs: List[Doc], scores: List[Floats2d], tensors=None): + """Set entities on a batch of documents from a batch of scores.""" + tag_names = self.get_tag_names() + for i, doc in enumerate(docs): + actions = to_numpy(scores[i].argmax(axis=1)) + tags = [tag_names[actions[j]] for j in range(len(doc))] + if not self.is_biluo: + tags = iob_to_biluo(tags) + doc.ents = spans_from_biluo_tags(doc, tags) + + def update(self, examples, set_annotations=False, drop=0.0, sgd=None, losses=None): + if not any(_has_ner(eg) for eg in examples): + return 0 + examples = Example.to_example_objects(examples) + docs = [ex.doc for ex in examples] + set_dropout_rate(self.model, drop) + scores, bp_scores = self.model.begin_update(docs) + loss, d_scores = self.get_loss(examples, scores) + bp_scores(d_scores) + if set_annotations: + self.set_annotations(docs, scores) + if sgd is not None: + self.model.finish_update(sgd) + if losses is not None: + losses.setdefault("ner", 0.0) + losses["ner"] += loss + return loss + + def get_loss(self, examples, scores): + loss = 0 + d_scores = [] + truths = [] + for eg in examples: + gold_tags = [(tag if tag != "-" else None) for tag in eg.gold.ner] + if not self.is_biluo: + gold_tags = biluo_to_iob(gold_tags) + truths.append(gold_tags) + for i in range(len(scores)): + if len(scores[i]) != len(truths[i]): + raise ValueError( + f"Mismatched output and gold sizes.\n" + f"Output: {len(scores[i])}, gold: {len(truths[i])}." 
+ f"Input: {len(examples[i].doc)}" + ) + d_scores, loss = self.loss_func(scores, truths) + return loss, d_scores + + def begin_training(self, get_examples, pipeline=None, sgd=None, **kwargs): + self.cfg.update(kwargs) + if not hasattr(get_examples, '__call__'): + gold_tuples = get_examples + get_examples = lambda: gold_tuples + labels = _get_labels(get_examples()) + for label in _get_labels(get_examples()): + self.add_label(label) + labels = self.labels + n_actions = self.model.attrs["get_num_actions"](len(labels)) + self.model.set_dim("nO", n_actions) + self.model.initialize() + if pipeline is not None: + self.init_multitask_objectives(get_examples, pipeline, sgd=sgd, **self.cfg) + link_vectors_to_models(self.vocab) + self.loss_func = SequenceCategoricalCrossentropy( + names=self.get_tag_names(), + normalize=True, + missing_value=None + ) + + return sgd + + def init_multitask_objectives(self, *args, **kwargs): + pass + + +def _has_ner(eg): + for ner_tag in eg.gold.ner: + if ner_tag != "-" and ner_tag != None: + return True + else: + return False + + +def _get_labels(examples): + labels = set() + for eg in examples: + for ner_tag in eg.token_annotation.entities: + if ner_tag != 'O' and ner_tag != '-': + _, label = ner_tag.split('-', 1) + labels.add(label) + return list(sorted(labels)) diff --git a/spacy/pipeline/tok2vec.py b/spacy/pipeline/tok2vec.py new file mode 100644 index 000000000..5882fa266 --- /dev/null +++ b/spacy/pipeline/tok2vec.py @@ -0,0 +1,189 @@ +from thinc.api import Model, set_dropout_rate + +from .pipes import Pipe +from ..gold import Example +from ..tokens import Doc +from ..vocab import Vocab +from ..language import component +from ..util import link_vectors_to_models, minibatch, eg2doc +from .defaults import default_tok2vec + + +@component("tok2vec", assigns=["doc.tensor"], default_model=default_tok2vec) +class Tok2Vec(Pipe): + @classmethod + def from_nlp(cls, nlp, model, **cfg): + return cls(nlp.vocab, model, **cfg) + + def __init__(self, vocab, model, **cfg): + """Construct a new statistical model. Weights are not allocated on + initialisation. + vocab (Vocab): A `Vocab` instance. The model must share the same `Vocab` + instance with the `Doc` objects it will process. + **cfg: Config parameters. + """ + self.vocab = vocab + self.model = model + self.cfg = dict(cfg) + self.listeners = [] + + def create_listener(self): + listener = Tok2VecListener( + upstream_name="tok2vec", width=self.model.get_dim("nO") + ) + self.listeners.append(listener) + + def add_listener(self, listener): + self.listeners.append(listener) + + def find_listeners(self, model): + for node in model.walk(): + if isinstance(node, Tok2VecListener) and node.upstream_name == self.name: + self.add_listener(node) + + def __call__(self, doc): + """Add context-sensitive vectors to a `Doc`, e.g. from a CNN or LSTM + model. Vectors are set to the `Doc.tensor` attribute. + docs (Doc or iterable): One or more documents to add vectors to. + RETURNS (dict or None): Intermediate computations. + """ + tokvecses = self.predict([doc]) + self.set_annotations([doc], tokvecses) + return doc + + def pipe(self, stream, batch_size=128, n_threads=-1, as_example=False): + """Process `Doc` objects as a stream. + stream (iterator): A sequence of `Doc` objects to process. + batch_size (int): Number of `Doc` objects to group. + n_threads (int): Number of threads. + YIELDS (iterator): A sequence of `Doc` objects, in order of input. 
+ """ + for batch in minibatch(stream, batch_size): + batch = list(batch) + if as_example: + docs = [eg2doc(doc) for doc in batch] + else: + docs = batch + tokvecses = self.predict(docs) + self.set_annotations(docs, tokvecses) + yield from batch + + def predict(self, docs): + """Return a single tensor for a batch of documents. + docs (iterable): A sequence of `Doc` objects. + RETURNS (object): Vector representations for each token in the documents. + """ + tokvecs = self.model.predict(docs) + batch_id = Tok2VecListener.get_batch_id(docs) + for listener in self.listeners: + listener.receive(batch_id, tokvecs, None) + return tokvecs + + def set_annotations(self, docs, tokvecses): + """Set the tensor attribute for a batch of documents. + docs (iterable): A sequence of `Doc` objects. + tokvecs (object): Vector representation for each token in the documents. + """ + for doc, tokvecs in zip(docs, tokvecses): + assert tokvecs.shape[0] == len(doc) + doc.tensor = tokvecs + + def update(self, examples, drop=0.0, sgd=None, losses=None, set_annotations=False): + """Update the model. + examples (iterable): A batch of examples + drop (float): The droput rate. + sgd (callable): An optimizer. + RETURNS (dict): Results from the update. + """ + if losses is None: + losses = {} + examples = Example.to_example_objects(examples) + docs = [eg.doc for eg in examples] + if isinstance(docs, Doc): + docs = [docs] + set_dropout_rate(self.model, drop) + tokvecs, bp_tokvecs = self.model.begin_update(docs) + + d_tokvecs = [self.model.ops.alloc2f(*t2v.shape) for t2v in tokvecs] + losses.setdefault(self.name, 0.0) + + def accumulate_gradient(one_d_tokvecs): + """Accumulate tok2vec loss and gradient. This is passed as a callback + to all but the last listener. Only the last one does the backprop. + """ + nonlocal d_tokvecs + for i in range(len(one_d_tokvecs)): + d_tokvecs[i] += one_d_tokvecs[i] + losses[self.name] += float((one_d_tokvecs[i] ** 2).sum()) + + def backprop(one_d_tokvecs): + """Callback to actually do the backprop. Passed to last listener.""" + accumulate_gradient(one_d_tokvecs) + d_docs = bp_tokvecs(d_tokvecs) + if sgd is not None: + self.model.finish_update(sgd) + return d_docs + + batch_id = Tok2VecListener.get_batch_id(docs) + for listener in self.listeners[:-1]: + listener.receive(batch_id, tokvecs, accumulate_gradient) + self.listeners[-1].receive(batch_id, tokvecs, backprop) + if set_annotations: + self.set_annotations(docs, tokvecs) + + def get_loss(self, docs, golds, scores): + pass + + def begin_training( + self, get_examples=lambda: [], pipeline=None, sgd=None, **kwargs + ): + """Allocate models and pre-process training data + + get_examples (function): Function returning example training data. + pipeline (list): The pipeline the model is part of. + """ + docs = [Doc(Vocab(), words=["hello"])] + self.model.initialize(X=docs) + link_vectors_to_models(self.vocab) + + +class Tok2VecListener(Model): + """A layer that gets fed its answers from an upstream connection, + for instance from a component earlier in the pipeline. 
+ """ + + name = "tok2vec-listener" + + def __init__(self, upstream_name, width): + Model.__init__(self, name=self.name, forward=forward, dims={"nO": width}) + self.upstream_name = upstream_name + self._batch_id = None + self._outputs = None + self._backprop = None + + @classmethod + def get_batch_id(cls, inputs): + return sum(sum(token.orth for token in doc) for doc in inputs) + + def receive(self, batch_id, outputs, backprop): + self._batch_id = batch_id + self._outputs = outputs + self._backprop = backprop + + def verify_inputs(self, inputs): + if self._batch_id is None and self._outputs is None: + raise ValueError + else: + batch_id = self.get_batch_id(inputs) + if batch_id != self._batch_id: + raise ValueError(f"Mismatched IDs! {batch_id} vs {self._batch_id}") + else: + return True + + +def forward(model: Tok2VecListener, inputs, is_train): + if is_train: + model.verify_inputs(inputs) + return model._outputs, model._backprop + else: + return [doc.tensor for doc in inputs], lambda dX: [] diff --git a/spacy/schemas.py b/spacy/schemas.py new file mode 100644 index 000000000..3b6313db8 --- /dev/null +++ b/spacy/schemas.py @@ -0,0 +1,190 @@ +from typing import Dict, List, Union, Optional +from enum import Enum +from pydantic import BaseModel, Field, ValidationError, validator +from pydantic import StrictStr, StrictInt, StrictFloat, StrictBool +from collections import defaultdict + +from .attrs import NAMES + + +def validate(schema, obj): + """Validate data against a given pydantic schema. + + obj (dict): JSON-serializable data to validate. + schema (pydantic.BaseModel): The schema to validate against. + RETURNS (list): A list of error messages, if available. + """ + try: + schema(**obj) + return [] + except ValidationError as e: + errors = e.errors() + data = defaultdict(list) + for error in errors: + err_loc = " -> ".join([str(p) for p in error.get("loc", [])]) + data[err_loc].append(error.get("msg")) + return [f"[{loc}] {', '.join(msg)}" for loc, msg in data.items()] + + +# Matcher token patterns + + +def validate_token_pattern(obj): + # Try to convert non-string keys (e.g. 
{ORTH: "foo"} -> {"ORTH": "foo"}) + get_key = lambda k: NAMES[k] if isinstance(k, int) and k < len(NAMES) else k + if isinstance(obj, list): + converted = [] + for pattern in obj: + if isinstance(pattern, dict): + pattern = {get_key(k): v for k, v in pattern.items()} + converted.append(pattern) + obj = converted + return validate(TokenPatternSchema, {"pattern": obj}) + + +class TokenPatternString(BaseModel): + REGEX: Optional[StrictStr] + IN: Optional[List[StrictStr]] + NOT_IN: Optional[List[StrictStr]] + + class Config: + extra = "forbid" + + @validator("*", pre=True, whole=True) + def raise_for_none(cls, v): + if v is None: + raise ValueError("None / null is not allowed") + return v + + +class TokenPatternNumber(BaseModel): + REGEX: Optional[StrictStr] = None + IN: Optional[List[StrictInt]] = None + NOT_IN: Optional[List[StrictInt]] = None + EQ: Union[StrictInt, StrictFloat] = Field(None, alias="==") + GEQ: Union[StrictInt, StrictFloat] = Field(None, alias=">=") + LEQ: Union[StrictInt, StrictFloat] = Field(None, alias="<=") + GT: Union[StrictInt, StrictFloat] = Field(None, alias=">") + LT: Union[StrictInt, StrictFloat] = Field(None, alias="<") + + class Config: + extra = "forbid" + + @validator("*", pre=True, whole=True) + def raise_for_none(cls, v): + if v is None: + raise ValueError("None / null is not allowed") + return v + + +class TokenPatternOperator(str, Enum): + plus: StrictStr = "+" + start: StrictStr = "*" + question: StrictStr = "?" + exclamation: StrictStr = "!" + + +StringValue = Union[TokenPatternString, StrictStr] +NumberValue = Union[TokenPatternNumber, StrictInt, StrictFloat] +UnderscoreValue = Union[ + TokenPatternString, TokenPatternNumber, str, int, float, list, bool, +] + + +class TokenPattern(BaseModel): + orth: Optional[StringValue] = None + text: Optional[StringValue] = None + lower: Optional[StringValue] = None + pos: Optional[StringValue] = None + tag: Optional[StringValue] = None + dep: Optional[StringValue] = None + lemma: Optional[StringValue] = None + shape: Optional[StringValue] = None + ent_type: Optional[StringValue] = None + norm: Optional[StringValue] = None + length: Optional[NumberValue] = None + spacy: Optional[StrictBool] = None + is_alpha: Optional[StrictBool] = None + is_ascii: Optional[StrictBool] = None + is_digit: Optional[StrictBool] = None + is_lower: Optional[StrictBool] = None + is_upper: Optional[StrictBool] = None + is_title: Optional[StrictBool] = None + is_punct: Optional[StrictBool] = None + is_space: Optional[StrictBool] = None + is_bracket: Optional[StrictBool] = None + is_quote: Optional[StrictBool] = None + is_left_punct: Optional[StrictBool] = None + is_right_punct: Optional[StrictBool] = None + is_currency: Optional[StrictBool] = None + is_stop: Optional[StrictBool] = None + is_sent_start: Optional[StrictBool] = None + sent_start: Optional[StrictBool] = None + like_num: Optional[StrictBool] = None + like_url: Optional[StrictBool] = None + like_email: Optional[StrictBool] = None + op: Optional[TokenPatternOperator] = None + underscore: Optional[Dict[StrictStr, UnderscoreValue]] = Field(None, alias="_") + + class Config: + extra = "forbid" + allow_population_by_field_name = True + alias_generator = lambda value: value.upper() + + @validator("*", pre=True) + def raise_for_none(cls, v): + if v is None: + raise ValueError("None / null is not allowed") + return v + + +class TokenPatternSchema(BaseModel): + pattern: List[TokenPattern] = Field(..., minItems=1) + + class Config: + extra = "forbid" + + +# Model meta + + +class 
ModelMetaSchema(BaseModel):
+    # fmt: off
+    lang: StrictStr = Field(..., title="Two-letter language code, e.g. 'en'")
+    name: StrictStr = Field(..., title="Model name")
+    version: StrictStr = Field(..., title="Model version")
+    spacy_version: Optional[StrictStr] = Field(None, title="Compatible spaCy version identifier")
+    parent_package: Optional[StrictStr] = Field("spacy", title="Name of parent spaCy package, e.g. spacy or spacy-nightly")
+    pipeline: Optional[List[StrictStr]] = Field([], title="Names of pipeline components")
+    description: Optional[StrictStr] = Field(None, title="Model description")
+    license: Optional[StrictStr] = Field(None, title="Model license")
+    author: Optional[StrictStr] = Field(None, title="Model author name")
+    email: Optional[StrictStr] = Field(None, title="Model author email")
+    url: Optional[StrictStr] = Field(None, title="Model author URL")
+    sources: Optional[Union[List[StrictStr], Dict[str, str]]] = Field(None, title="Training data sources")
+    vectors: Optional[Dict[str, int]] = Field(None, title="Included word vectors")
+    accuracy: Optional[Dict[str, Union[float, int]]] = Field(None, title="Accuracy numbers")
+    speed: Optional[Dict[str, Union[float, int]]] = Field(None, title="Speed evaluation numbers")
+    # fmt: on
+
+
+# Training data object in "simple training style"
+
+
+class SimpleTrainingSchema(BaseModel):
+    # TODO: write
+
+    class Config:
+        title = "Schema for training data dict passed to nlp.update"
+        extra = "forbid"
+
+
+# JSON training format
+
+
+class TrainingSchema(BaseModel):
+    # TODO: write
+
+    class Config:
+        title = "Schema for training data in spaCy's JSON format"
+        extra = "forbid"
diff --git a/spacy/scorer.py b/spacy/scorer.py
index 25c660240..7e2466be7 100644
--- a/spacy/scorer.py
+++ b/spacy/scorer.py
@@ -1,9 +1,6 @@
-# coding: utf8
-from __future__ import division, print_function, unicode_literals
-
 import numpy as np
 
-from .gold import tags_to_entities, GoldParse
+from .gold import tags_to_entities, GoldParse, DocAnnotation
 from .errors import Errors
 
 
@@ -84,6 +81,10 @@ class Scorer(object):
         self.labelled = PRFScore()
         self.labelled_per_dep = dict()
         self.tags = PRFScore()
+        self.pos = PRFScore()
+        self.morphs = PRFScore()
+        self.morphs_per_feat = dict()
+        self.sent_starts = PRFScore()
         self.ner = PRFScore()
         self.ner_per_ents = dict()
         self.eval_punct = eval_punct
@@ -113,6 +114,50 @@ class Scorer(object):
         """
         return self.tags.fscore * 100
 
+    @property
+    def pos_acc(self):
+        """RETURNS (float): Part-of-speech tag accuracy (coarse-grained pos,
+        i.e. `Token.pos`).
+        """
+        return self.pos.fscore * 100
+
+    @property
+    def morphs_acc(self):
+        """RETURNS (float): Morph tag accuracy (morphological features,
+        i.e. `Token.morph`).
+        """
+        return self.morphs.fscore * 100
+
+    @property
+    def morphs_per_type(self):
+        """RETURNS (dict): Scores per morphological feature.
+        """
+        return {
+            k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100}
+            for k, v in self.morphs_per_feat.items()
+        }
+
+    @property
+    def sent_p(self):
+        """RETURNS (float): Precision for identification of sentence starts
+        (i.e. `Token.is_sent_start`).
+        """
+        return self.sent_starts.precision * 100
+
+    @property
+    def sent_r(self):
+        """RETURNS (float): Recall for identification of sentence starts
+        (i.e. `Token.is_sent_start`).
+        """
+        return self.sent_starts.recall * 100
+
+    @property
+    def sent_f(self):
+        """RETURNS (float): F-score for identification of sentence starts
+        (i.e. `Token.is_sent_start`).
+ """ + return self.sent_starts.fscore * 100 + @property def token_acc(self): """RETURNS (float): Tokenization accuracy.""" @@ -212,16 +257,21 @@ class Scorer(object): "ents_f": self.ents_f, "ents_per_type": self.ents_per_type, "tags_acc": self.tags_acc, + "pos_acc": self.pos_acc, + "morphs_acc": self.morphs_acc, + "morphs_per_type": self.morphs_per_type, + "sent_p": self.sent_p, + "sent_r": self.sent_r, + "sent_f": self.sent_f, "token_acc": self.token_acc, "textcat_score": self.textcat_score, "textcats_per_cat": self.textcats_per_cat, } - def score(self, doc, gold, verbose=False, punct_labels=("p", "punct")): + def score(self, example, verbose=False, punct_labels=("p", "punct")): """Update the evaluation scores from a single Doc / GoldParse pair. - doc (Doc): The predicted annotations. - gold (GoldParse): The correct annotations. + example (Example): The predicted annotations + correct annotations. verbose (bool): Print debugging information. punct_labels (tuple): Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is @@ -229,16 +279,39 @@ class Scorer(object): DOCS: https://spacy.io/api/scorer#score """ + if isinstance(example, tuple) and len(example) == 2: + doc, gold = example + else: + gold = example.gold + doc = example.doc + if len(doc) != len(gold): - gold = GoldParse.from_annot_tuples( - doc, zip(*gold.orig_annot), cats=gold.cats, - ) + doc_annotation = DocAnnotation(cats=gold.cats) + token_annotation = gold.orig + gold = GoldParse.from_annotation(doc, doc_annotation, token_annotation) + orig = gold.orig gold_deps = set() gold_deps_per_dep = {} gold_tags = set() - gold_ents = set(tags_to_entities([annot[-1] for annot in gold.orig_annot])) - for id_, word, tag, head, dep, ner in gold.orig_annot: + gold_pos = set() + gold_morphs = set() + gold_morphs_per_feat = {} + gold_sent_starts = set() + gold_ents = set(tags_to_entities(orig.entities)) + for id_, tag, pos, morph, head, dep, sent_start in zip(orig.ids, orig.tags, orig.pos, orig.morphs, orig.heads, orig.deps, orig.sent_starts): gold_tags.add((id_, tag)) + gold_pos.add((id_, pos)) + gold_morphs.add((id_, morph)) + if morph: + for feat in morph.split("|"): + field, values = feat.split("=") + if field not in self.morphs_per_feat: + self.morphs_per_feat[field] = PRFScore() + if field not in gold_morphs_per_feat: + gold_morphs_per_feat[field] = set() + gold_morphs_per_feat[field].add((id_, feat)) + if sent_start: + gold_sent_starts.add(id_) if dep not in (None, "") and dep.lower() not in punct_labels: gold_deps.add((id_, head, dep.lower())) if dep.lower() not in self.labelled_per_dep: @@ -249,6 +322,10 @@ class Scorer(object): cand_deps = set() cand_deps_per_dep = {} cand_tags = set() + cand_pos = set() + cand_morphs = set() + cand_morphs_per_feat = {} + cand_sent_starts = set() for token in doc: if token.orth_.isspace(): continue @@ -258,6 +335,18 @@ class Scorer(object): else: self.tokens.tp += 1 cand_tags.add((gold_i, token.tag_)) + cand_pos.add((gold_i, token.pos_)) + cand_morphs.add((gold_i, token.morph_)) + if token.morph_: + for feat in token.morph_.split("|"): + field, values = feat.split("=") + if field not in self.morphs_per_feat: + self.morphs_per_feat[field] = PRFScore() + if field not in cand_morphs_per_feat: + cand_morphs_per_feat[field] = set() + cand_morphs_per_feat[field].add((gold_i, feat)) + if token.is_sent_start: + cand_sent_starts.add(gold_i) if token.dep_.lower() not in punct_labels and token.orth_.strip(): gold_head = gold.cand_to_gold[token.head.i] # None 
is indistinct, so we can't just add it to the set @@ -274,7 +363,7 @@ class Scorer(object): cand_deps_per_dep[token.dep_.lower()].add( (gold_i, gold_head, token.dep_.lower()) ) - if "-" not in [token[-1] for token in gold.orig_annot]: + if "-" not in [token[-1] for token in orig.entities]: # Find all NER labels in gold and doc ent_labels = set([x[0] for x in gold_ents] + [k.label_ for k in doc.ents]) # Set up all labels for per type scoring and prepare gold per type @@ -304,6 +393,11 @@ class Scorer(object): # Score for all ents self.ner.score_set(cand_ents, gold_ents) self.tags.score_set(cand_tags, gold_tags) + self.pos.score_set(cand_pos, gold_pos) + self.morphs.score_set(cand_morphs, gold_morphs) + for field in self.morphs_per_feat: + self.morphs_per_feat[field].score_set(cand_morphs_per_feat.get(field, set()), gold_morphs_per_feat.get(field, set())) + self.sent_starts.score_set(cand_sent_starts, gold_sent_starts) self.labelled.score_set(cand_deps, gold_deps) for dep in self.labelled_per_dep: self.labelled_per_dep[dep].score_set( @@ -340,7 +434,7 @@ class Scorer(object): Errors.E162.format(model_labels=model_labels, eval_labels=eval_labels) ) if verbose: - gold_words = [item[1] for item in gold.orig_annot] + gold_words = orig.words for w_id, h_id, dep in cand_deps - gold_deps: print("F", gold_words[w_id], dep, gold_words[h_id]) for w_id, h_id, dep in gold_deps - cand_deps: diff --git a/spacy/strings.pxd b/spacy/strings.pxd index e436fb33b..ba2476ec7 100644 --- a/spacy/strings.pxd +++ b/spacy/strings.pxd @@ -1,7 +1,6 @@ from libc.stdint cimport int64_t from libcpp.vector cimport vector from libcpp.set cimport set - from cymem.cymem cimport Pool from preshed.maps cimport PreshMap from murmurhash.mrmr cimport hash64 diff --git a/spacy/strings.pyx b/spacy/strings.pyx index f3457e1a5..a30f11729 100644 --- a/spacy/strings.pyx +++ b/spacy/strings.pyx @@ -1,18 +1,16 @@ # cython: infer_types=True -# coding: utf8 -from __future__ import unicode_literals, absolute_import - cimport cython from libc.string cimport memcpy from libcpp.set cimport set from libc.stdint cimport uint32_t from murmurhash.mrmr cimport hash64, hash32 + import srsly -from .compat import basestring_ +from .typedefs cimport hash_t + from .symbols import IDS as SYMBOLS_BY_STR from .symbols import NAMES as SYMBOLS_BY_INT -from .typedefs cimport hash_t from .errors import Errors from . import util @@ -24,7 +22,7 @@ def get_string_id(key): This function optimises for convenience over performance, so shouldn't be used in tight loops. """ - if not isinstance(key, basestring_): + if not isinstance(key, str): return key elif key in SYMBOLS_BY_STR: return SYMBOLS_BY_STR[key] @@ -150,7 +148,7 @@ cdef class StringStore: return key else: return self[key] - + def add(self, string): """Add a string to the StringStore. 
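Note on the spacy/strings changes above: the only behavioural difference is that `get_string_id` now checks `isinstance(key, str)` instead of the removed `basestring_` compat alias. A minimal usage sketch of the (unchanged) public API follows, using the real `spacy.strings` module; the hash values themselves are illustrative only.

    from spacy.strings import StringStore, get_string_id

    store = StringStore()
    key = store.add("coffee")         # returns the 64-bit hash of the string
    assert store[key] == "coffee"     # hash -> string
    assert store["coffee"] == key     # string -> hash
    assert "coffee" in store

    # get_string_id resolves known symbol names to their fixed IDs; other
    # strings are hashed the same way the StringStore hashes them.
    orth_id = get_string_id("ORTH")
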
diff --git a/spacy/structs.pxd b/spacy/structs.pxd index 1f5f32675..a01244d7e 100644 --- a/spacy/structs.pxd +++ b/spacy/structs.pxd @@ -1,11 +1,9 @@ from libc.stdint cimport uint8_t, uint32_t, int32_t, uint64_t - -from .typedefs cimport flags_t, attr_t, hash_t -from .parts_of_speech cimport univ_pos_t - from libcpp.vector cimport vector from libc.stdint cimport int32_t, int64_t +from .typedefs cimport flags_t, attr_t, hash_t +from .parts_of_speech cimport univ_pos_t cdef struct LexemeC: @@ -59,7 +57,7 @@ cdef struct TokenC: cdef struct MorphAnalysisC: - univ_pos_t pos + hash_t key int length attr_t abbr @@ -105,6 +103,9 @@ cdef struct MorphAnalysisC: attr_t verb_form attr_t voice attr_t verb_type + attr_t* fields + attr_t* features + # Internal struct, for storage and disambiguation of entities. cdef struct KBEntryC: diff --git a/spacy/symbols.pxd b/spacy/symbols.pxd index ebb87c8d2..e516f3ed9 100644 --- a/spacy/symbols.pxd +++ b/spacy/symbols.pxd @@ -108,282 +108,282 @@ cdef enum symbol_t: EOL SPACE - Animacy_anim - Animacy_inan - Animacy_hum # U20 - Animacy_nhum - Aspect_freq - Aspect_imp - Aspect_mod - Aspect_none - Aspect_perf - Aspect_iter # U20 - Aspect_hab # U20 - Case_abe - Case_abl - Case_abs - Case_acc - Case_ade - Case_all - Case_cau - Case_com - Case_cmp # U20 - Case_dat - Case_del - Case_dis - Case_ela - Case_equ # U20 - Case_ess - Case_gen - Case_ill - Case_ine - Case_ins - Case_loc - Case_lat - Case_nom - Case_par - Case_sub - Case_sup - Case_tem - Case_ter - Case_tra - Case_voc - Definite_two - Definite_def - Definite_red - Definite_cons # U20 - Definite_ind - Definite_spec # U20 - Degree_cmp - Degree_comp - Degree_none - Degree_pos - Degree_sup - Degree_abs - Degree_com - Degree_dim # du - Degree_equ # U20 - Evident_nfh # U20 - Gender_com - Gender_fem - Gender_masc - Gender_neut - Mood_cnd - Mood_imp - Mood_ind - Mood_n - Mood_pot - Mood_sub - Mood_opt - Mood_prp # U20 - Mood_adm # U20 - Negative_neg - Negative_pos - Negative_yes - Polarity_neg # U20 - Polarity_pos # U20 - Number_com - Number_dual - Number_none - Number_plur - Number_sing - Number_ptan # bg - Number_count # bg, U20 - Number_tri # U20 - NumType_card - NumType_dist - NumType_frac - NumType_gen - NumType_mult - NumType_none - NumType_ord - NumType_sets - Person_one - Person_two - Person_three - Person_none - Poss_yes - PronType_advPart - PronType_art - PronType_default - PronType_dem - PronType_ind - PronType_int - PronType_neg - PronType_prs - PronType_rcp - PronType_rel - PronType_tot - PronType_clit - PronType_exc # es, ca, it, fa, U20 - PronType_emp # U20 - Reflex_yes - Tense_fut - Tense_imp - Tense_past - Tense_pres - VerbForm_fin - VerbForm_ger - VerbForm_inf - VerbForm_none - VerbForm_part - VerbForm_partFut - VerbForm_partPast - VerbForm_partPres - VerbForm_sup - VerbForm_trans - VerbForm_conv # U20 - VerbForm_gdv # la - VerbForm_vnoun # U20 - Voice_act - Voice_cau - Voice_pass - Voice_mid # gkc, U20 - Voice_int # hb - Voice_antip # U20 - Voice_dir # U20 - Voice_inv # U20 - Abbr_yes # cz, fi, sl, U - AdpType_prep # cz, U - AdpType_post # U - AdpType_voc # cz - AdpType_comprep # cz - AdpType_circ # U - AdvType_man - AdvType_loc - AdvType_tim - AdvType_deg - AdvType_cau - AdvType_mod - AdvType_sta - AdvType_ex - AdvType_adadj - ConjType_oper # cz, U - ConjType_comp # cz, U - Connegative_yes # fi - Derivation_minen # fi - Derivation_sti # fi - Derivation_inen # fi - Derivation_lainen # fi - Derivation_ja # fi - Derivation_ton # fi - Derivation_vs # fi - Derivation_ttain # fi - Derivation_ttaa # fi - 
Echo_rdp # U - Echo_ech # U - Foreign_foreign # cz, fi, U - Foreign_fscript # cz, fi, U - Foreign_tscript # cz, U - Foreign_yes # sl - Gender_dat_masc # bq, U - Gender_dat_fem # bq, U - Gender_erg_masc # bq - Gender_erg_fem # bq - Gender_psor_masc # cz, sl, U - Gender_psor_fem # cz, sl, U - Gender_psor_neut # sl - Hyph_yes # cz, U - InfForm_one # fi - InfForm_two # fi - InfForm_three # fi - NameType_geo # U, cz - NameType_prs # U, cz - NameType_giv # U, cz - NameType_sur # U, cz - NameType_nat # U, cz - NameType_com # U, cz - NameType_pro # U, cz - NameType_oth # U, cz - NounType_com # U - NounType_prop # U - NounType_class # U - Number_abs_sing # bq, U - Number_abs_plur # bq, U - Number_dat_sing # bq, U - Number_dat_plur # bq, U - Number_erg_sing # bq, U - Number_erg_plur # bq, U - Number_psee_sing # U - Number_psee_plur # U - Number_psor_sing # cz, fi, sl, U - Number_psor_plur # cz, fi, sl, U - Number_pauc # U20 - Number_grpa # U20 - Number_grpl # U20 - Number_inv # U20 - NumForm_digit # cz, sl, U - NumForm_roman # cz, sl, U - NumForm_word # cz, sl, U - NumValue_one # cz, U - NumValue_two # cz, U - NumValue_three # cz, U - PartForm_pres # fi - PartForm_past # fi - PartForm_agt # fi - PartForm_neg # fi - PartType_mod # U - PartType_emp # U - PartType_res # U - PartType_inf # U - PartType_vbp # U - Person_abs_one # bq, U - Person_abs_two # bq, U - Person_abs_three # bq, U - Person_dat_one # bq, U - Person_dat_two # bq, U - Person_dat_three # bq, U - Person_erg_one # bq, U - Person_erg_two # bq, U - Person_erg_three # bq, U - Person_psor_one # fi, U - Person_psor_two # fi, U - Person_psor_three # fi, U - Person_zero # U20 - Person_four # U20 - Polite_inf # bq, U - Polite_pol # bq, U - Polite_abs_inf # bq, U - Polite_abs_pol # bq, U - Polite_erg_inf # bq, U - Polite_erg_pol # bq, U - Polite_dat_inf # bq, U - Polite_dat_pol # bq, U - Polite_infm # U20 - Polite_form # U20 - Polite_form_elev # U20 - Polite_form_humb # U20 - Prefix_yes # U - PrepCase_npr # cz - PrepCase_pre # U - PunctSide_ini # U - PunctSide_fin # U - PunctType_peri # U - PunctType_qest # U - PunctType_excl # U - PunctType_quot # U - PunctType_brck # U - PunctType_comm # U - PunctType_colo # U - PunctType_semi # U - PunctType_dash # U - Style_arch # cz, fi, U - Style_rare # cz, fi, U - Style_poet # cz, U - Style_norm # cz, U - Style_coll # cz, U - Style_vrnc # cz, U - Style_sing # cz, U - Style_expr # cz, U - Style_derg # cz, U - Style_vulg # cz, U - Style_yes # fi, U - StyleVariant_styleShort # cz - StyleVariant_styleBound # cz, sl - VerbType_aux # U - VerbType_cop # U - VerbType_mod # U - VerbType_light # U + DEPRECATED001 + DEPRECATED002 + DEPRECATED003 + DEPRECATED004 + DEPRECATED005 + DEPRECATED006 + DEPRECATED007 + DEPRECATED008 + DEPRECATED009 + DEPRECATED010 + DEPRECATED011 + DEPRECATED012 + DEPRECATED013 + DEPRECATED014 + DEPRECATED015 + DEPRECATED016 + DEPRECATED017 + DEPRECATED018 + DEPRECATED019 + DEPRECATED020 + DEPRECATED021 + DEPRECATED022 + DEPRECATED023 + DEPRECATED024 + DEPRECATED025 + DEPRECATED026 + DEPRECATED027 + DEPRECATED028 + DEPRECATED029 + DEPRECATED030 + DEPRECATED031 + DEPRECATED032 + DEPRECATED033 + DEPRECATED034 + DEPRECATED035 + DEPRECATED036 + DEPRECATED037 + DEPRECATED038 + DEPRECATED039 + DEPRECATED040 + DEPRECATED041 + DEPRECATED042 + DEPRECATED043 + DEPRECATED044 + DEPRECATED045 + DEPRECATED046 + DEPRECATED047 + DEPRECATED048 + DEPRECATED049 + DEPRECATED050 + DEPRECATED051 + DEPRECATED052 + DEPRECATED053 + DEPRECATED054 + DEPRECATED055 + DEPRECATED056 + DEPRECATED057 + DEPRECATED058 + 
DEPRECATED059 + DEPRECATED060 + DEPRECATED061 + DEPRECATED062 + DEPRECATED063 + DEPRECATED064 + DEPRECATED065 + DEPRECATED066 + DEPRECATED067 + DEPRECATED068 + DEPRECATED069 + DEPRECATED070 + DEPRECATED071 + DEPRECATED072 + DEPRECATED073 + DEPRECATED074 + DEPRECATED075 + DEPRECATED076 + DEPRECATED077 + DEPRECATED078 + DEPRECATED079 + DEPRECATED080 + DEPRECATED081 + DEPRECATED082 + DEPRECATED083 + DEPRECATED084 + DEPRECATED085 + DEPRECATED086 + DEPRECATED087 + DEPRECATED088 + DEPRECATED089 + DEPRECATED090 + DEPRECATED091 + DEPRECATED092 + DEPRECATED093 + DEPRECATED094 + DEPRECATED095 + DEPRECATED096 + DEPRECATED097 + DEPRECATED098 + DEPRECATED099 + DEPRECATED100 + DEPRECATED101 + DEPRECATED102 + DEPRECATED103 + DEPRECATED104 + DEPRECATED105 + DEPRECATED106 + DEPRECATED107 + DEPRECATED108 + DEPRECATED109 + DEPRECATED110 + DEPRECATED111 + DEPRECATED112 + DEPRECATED113 + DEPRECATED114 + DEPRECATED115 + DEPRECATED116 + DEPRECATED117 + DEPRECATED118 + DEPRECATED119 + DEPRECATED120 + DEPRECATED121 + DEPRECATED122 + DEPRECATED123 + DEPRECATED124 + DEPRECATED125 + DEPRECATED126 + DEPRECATED127 + DEPRECATED128 + DEPRECATED129 + DEPRECATED130 + DEPRECATED131 + DEPRECATED132 + DEPRECATED133 + DEPRECATED134 + DEPRECATED135 + DEPRECATED136 + DEPRECATED137 + DEPRECATED138 + DEPRECATED139 + DEPRECATED140 + DEPRECATED141 + DEPRECATED142 + DEPRECATED143 + DEPRECATED144 + DEPRECATED145 + DEPRECATED146 + DEPRECATED147 + DEPRECATED148 + DEPRECATED149 + DEPRECATED150 + DEPRECATED151 + DEPRECATED152 + DEPRECATED153 + DEPRECATED154 + DEPRECATED155 + DEPRECATED156 + DEPRECATED157 + DEPRECATED158 + DEPRECATED159 + DEPRECATED160 + DEPRECATED161 + DEPRECATED162 + DEPRECATED163 + DEPRECATED164 + DEPRECATED165 + DEPRECATED166 + DEPRECATED167 + DEPRECATED168 + DEPRECATED169 + DEPRECATED170 + DEPRECATED171 + DEPRECATED172 + DEPRECATED173 + DEPRECATED174 + DEPRECATED175 + DEPRECATED176 + DEPRECATED177 + DEPRECATED178 + DEPRECATED179 + DEPRECATED180 + DEPRECATED181 + DEPRECATED182 + DEPRECATED183 + DEPRECATED184 + DEPRECATED185 + DEPRECATED186 + DEPRECATED187 + DEPRECATED188 + DEPRECATED189 + DEPRECATED190 + DEPRECATED191 + DEPRECATED192 + DEPRECATED193 + DEPRECATED194 + DEPRECATED195 + DEPRECATED196 + DEPRECATED197 + DEPRECATED198 + DEPRECATED199 + DEPRECATED200 + DEPRECATED201 + DEPRECATED202 + DEPRECATED203 + DEPRECATED204 + DEPRECATED205 + DEPRECATED206 + DEPRECATED207 + DEPRECATED208 + DEPRECATED209 + DEPRECATED210 + DEPRECATED211 + DEPRECATED212 + DEPRECATED213 + DEPRECATED214 + DEPRECATED215 + DEPRECATED216 + DEPRECATED217 + DEPRECATED218 + DEPRECATED219 + DEPRECATED220 + DEPRECATED221 + DEPRECATED222 + DEPRECATED223 + DEPRECATED224 + DEPRECATED225 + DEPRECATED226 + DEPRECATED227 + DEPRECATED228 + DEPRECATED229 + DEPRECATED230 + DEPRECATED231 + DEPRECATED232 + DEPRECATED233 + DEPRECATED234 + DEPRECATED235 + DEPRECATED236 + DEPRECATED237 + DEPRECATED238 + DEPRECATED239 + DEPRECATED240 + DEPRECATED241 + DEPRECATED242 + DEPRECATED243 + DEPRECATED244 + DEPRECATED245 + DEPRECATED246 + DEPRECATED247 + DEPRECATED248 + DEPRECATED249 + DEPRECATED250 + DEPRECATED251 + DEPRECATED252 + DEPRECATED253 + DEPRECATED254 + DEPRECATED255 + DEPRECATED256 + DEPRECATED257 + DEPRECATED258 + DEPRECATED259 + DEPRECATED260 + DEPRECATED261 + DEPRECATED262 + DEPRECATED263 + DEPRECATED264 + DEPRECATED265 + DEPRECATED266 + DEPRECATED267 + DEPRECATED268 + DEPRECATED269 + DEPRECATED270 + DEPRECATED271 + DEPRECATED272 + DEPRECATED273 + DEPRECATED274 + DEPRECATED275 + DEPRECATED276 PERSON NORP @@ -462,6 +462,7 @@ cdef enum symbol_t: acl ENT_KB_ID + 
MORPH ENT_ID IDX diff --git a/spacy/symbols.pyx b/spacy/symbols.pyx index 83a9d0482..28bbc9fc3 100644 --- a/spacy/symbols.pyx +++ b/spacy/symbols.pyx @@ -1,8 +1,4 @@ -# coding: utf8 -#cython: optimize.unpack_method_calls=False -from __future__ import unicode_literals - - +# cython: optimize.unpack_method_calls=False IDS = { "": NIL, "IS_ALPHA": IS_ALPHA, @@ -116,282 +112,282 @@ IDS = { "EOL": EOL, "SPACE": SPACE, - "Animacy_anim": Animacy_anim, - "Animacy_inam": Animacy_inan, - "Animacy_hum": Animacy_hum, # U20 - "Animacy_nhum": Animacy_nhum, - "Aspect_freq": Aspect_freq, - "Aspect_imp": Aspect_imp, - "Aspect_mod": Aspect_mod, - "Aspect_none": Aspect_none, - "Aspect_perf": Aspect_perf, - "Aspect_iter": Aspect_iter, # U20 - "Aspect_hab": Aspect_hab, # U20 - "Case_abe": Case_abe, - "Case_abl": Case_abl, - "Case_abs": Case_abs, - "Case_acc": Case_acc, - "Case_ade": Case_ade, - "Case_all": Case_all, - "Case_cau": Case_cau, - "Case_com": Case_com, - "Case_cmp": Case_cmp, # U20 - "Case_dat": Case_dat, - "Case_del": Case_del, - "Case_dis": Case_dis, - "Case_ela": Case_ela, - "Case_equ": Case_equ, # U20 - "Case_ess": Case_ess, - "Case_gen": Case_gen, - "Case_ill": Case_ill, - "Case_ine": Case_ine, - "Case_ins": Case_ins, - "Case_loc": Case_loc, - "Case_lat": Case_lat, - "Case_nom": Case_nom, - "Case_par": Case_par, - "Case_sub": Case_sub, - "Case_sup": Case_sup, - "Case_tem": Case_tem, - "Case_ter": Case_ter, - "Case_tra": Case_tra, - "Case_voc": Case_voc, - "Definite_two": Definite_two, - "Definite_def": Definite_def, - "Definite_red": Definite_red, - "Definite_cons": Definite_cons, # U20 - "Definite_ind": Definite_ind, - "Definite_spec": Definite_spec, # U20 - "Degree_cmp": Degree_cmp, - "Degree_comp": Degree_comp, - "Degree_none": Degree_none, - "Degree_pos": Degree_pos, - "Degree_sup": Degree_sup, - "Degree_abs": Degree_abs, - "Degree_com": Degree_com, - "Degree_dim": Degree_dim, # du - "Degree_equ": Degree_equ, # U20 - "Evident_nfh": Evident_nfh, # U20 - "Gender_com": Gender_com, - "Gender_fem": Gender_fem, - "Gender_masc": Gender_masc, - "Gender_neut": Gender_neut, - "Mood_cnd": Mood_cnd, - "Mood_imp": Mood_imp, - "Mood_ind": Mood_ind, - "Mood_n": Mood_n, - "Mood_pot": Mood_pot, - "Mood_sub": Mood_sub, - "Mood_opt": Mood_opt, - "Mood_prp": Mood_prp, # U20 - "Mood_adm": Mood_adm, # U20 - "Negative_neg": Negative_neg, - "Negative_pos": Negative_pos, - "Negative_yes": Negative_yes, - "Polarity_neg": Polarity_neg, # U20 - "Polarity_pos": Polarity_pos, # U20 - "Number_com": Number_com, - "Number_dual": Number_dual, - "Number_none": Number_none, - "Number_plur": Number_plur, - "Number_sing": Number_sing, - "Number_ptan": Number_ptan, # bg - "Number_count": Number_count, # bg, U20 - "Number_tri": Number_tri, # U20 - "NumType_card": NumType_card, - "NumType_dist": NumType_dist, - "NumType_frac": NumType_frac, - "NumType_gen": NumType_gen, - "NumType_mult": NumType_mult, - "NumType_none": NumType_none, - "NumType_ord": NumType_ord, - "NumType_sets": NumType_sets, - "Person_one": Person_one, - "Person_two": Person_two, - "Person_three": Person_three, - "Person_none": Person_none, - "Poss_yes": Poss_yes, - "PronType_advPart": PronType_advPart, - "PronType_art": PronType_art, - "PronType_default": PronType_default, - "PronType_dem": PronType_dem, - "PronType_ind": PronType_ind, - "PronType_int": PronType_int, - "PronType_neg": PronType_neg, - "PronType_prs": PronType_prs, - "PronType_rcp": PronType_rcp, - "PronType_rel": PronType_rel, - "PronType_tot": PronType_tot, - "PronType_clit": PronType_clit, - 
"PronType_exc": PronType_exc, # es, ca, it, fa, U20 - "PronType_emp": PronType_emp, # U20 - "Reflex_yes": Reflex_yes, - "Tense_fut": Tense_fut, - "Tense_imp": Tense_imp, - "Tense_past": Tense_past, - "Tense_pres": Tense_pres, - "VerbForm_fin": VerbForm_fin, - "VerbForm_ger": VerbForm_ger, - "VerbForm_inf": VerbForm_inf, - "VerbForm_none": VerbForm_none, - "VerbForm_part": VerbForm_part, - "VerbForm_partFut": VerbForm_partFut, - "VerbForm_partPast": VerbForm_partPast, - "VerbForm_partPres": VerbForm_partPres, - "VerbForm_sup": VerbForm_sup, - "VerbForm_trans": VerbForm_trans, - "VerbForm_conv": VerbForm_conv, # U20 - "VerbForm_gdv": VerbForm_gdv, # la, - "VerbForm_vnoun": VerbForm_vnoun, # U20 - "Voice_act": Voice_act, - "Voice_cau": Voice_cau, - "Voice_pass": Voice_pass, - "Voice_mid": Voice_mid, # gkc, U20 - "Voice_int": Voice_int, # hb, - "Voice_antip": Voice_antip, # U20 - "Voice_dir": Voice_dir, # U20 - "Voice_inv": Voice_inv, # U20 - "Abbr_yes": Abbr_yes, # cz, fi, sl, U, - "AdpType_prep": AdpType_prep, # cz, U, - "AdpType_post": AdpType_post, # U, - "AdpType_voc": AdpType_voc, # cz, - "AdpType_comprep": AdpType_comprep, # cz, - "AdpType_circ": AdpType_circ, # U, - "AdvType_man": AdvType_man, - "AdvType_loc": AdvType_loc, - "AdvType_tim": AdvType_tim, - "AdvType_deg": AdvType_deg, - "AdvType_cau": AdvType_cau, - "AdvType_mod": AdvType_mod, - "AdvType_sta": AdvType_sta, - "AdvType_ex": AdvType_ex, - "AdvType_adadj": AdvType_adadj, - "ConjType_oper": ConjType_oper, # cz, U, - "ConjType_comp": ConjType_comp, # cz, U, - "Connegative_yes": Connegative_yes, # fi, - "Derivation_minen": Derivation_minen, # fi, - "Derivation_sti": Derivation_sti, # fi, - "Derivation_inen": Derivation_inen, # fi, - "Derivation_lainen": Derivation_lainen, # fi, - "Derivation_ja": Derivation_ja, # fi, - "Derivation_ton": Derivation_ton, # fi, - "Derivation_vs": Derivation_vs, # fi, - "Derivation_ttain": Derivation_ttain, # fi, - "Derivation_ttaa": Derivation_ttaa, # fi, - "Echo_rdp": Echo_rdp, # U, - "Echo_ech": Echo_ech, # U, - "Foreign_foreign": Foreign_foreign, # cz, fi, U, - "Foreign_fscript": Foreign_fscript, # cz, fi, U, - "Foreign_tscript": Foreign_tscript, # cz, U, - "Foreign_yes": Foreign_yes, # sl, - "Gender_dat_masc": Gender_dat_masc, # bq, U, - "Gender_dat_fem": Gender_dat_fem, # bq, U, - "Gender_erg_masc": Gender_erg_masc, # bq, - "Gender_erg_fem": Gender_erg_fem, # bq, - "Gender_psor_masc": Gender_psor_masc, # cz, sl, U, - "Gender_psor_fem": Gender_psor_fem, # cz, sl, U, - "Gender_psor_neut": Gender_psor_neut, # sl, - "Hyph_yes": Hyph_yes, # cz, U, - "InfForm_one": InfForm_one, # fi, - "InfForm_two": InfForm_two, # fi, - "InfForm_three": InfForm_three, # fi, - "NameType_geo": NameType_geo, # U, cz, - "NameType_prs": NameType_prs, # U, cz, - "NameType_giv": NameType_giv, # U, cz, - "NameType_sur": NameType_sur, # U, cz, - "NameType_nat": NameType_nat, # U, cz, - "NameType_com": NameType_com, # U, cz, - "NameType_pro": NameType_pro, # U, cz, - "NameType_oth": NameType_oth, # U, cz, - "NounType_com": NounType_com, # U, - "NounType_prop": NounType_prop, # U, - "NounType_class": NounType_class, # U, - "Number_abs_sing": Number_abs_sing, # bq, U, - "Number_abs_plur": Number_abs_plur, # bq, U, - "Number_dat_sing": Number_dat_sing, # bq, U, - "Number_dat_plur": Number_dat_plur, # bq, U, - "Number_erg_sing": Number_erg_sing, # bq, U, - "Number_erg_plur": Number_erg_plur, # bq, U, - "Number_psee_sing": Number_psee_sing, # U, - "Number_psee_plur": Number_psee_plur, # U, - "Number_psor_sing": Number_psor_sing, 
# cz, fi, sl, U, - "Number_psor_plur": Number_psor_plur, # cz, fi, sl, U, - "Number_pauc": Number_pauc, # U20 - "Number_grpa": Number_grpa, # U20 - "Number_grpl": Number_grpl, # U20 - "Number_inv": Number_inv, # U20 - "NumForm_digit": NumForm_digit, # cz, sl, U, - "NumForm_roman": NumForm_roman, # cz, sl, U, - "NumForm_word": NumForm_word, # cz, sl, U, - "NumValue_one": NumValue_one, # cz, U, - "NumValue_two": NumValue_two, # cz, U, - "NumValue_three": NumValue_three, # cz, U, - "PartForm_pres": PartForm_pres, # fi, - "PartForm_past": PartForm_past, # fi, - "PartForm_agt": PartForm_agt, # fi, - "PartForm_neg": PartForm_neg, # fi, - "PartType_mod": PartType_mod, # U, - "PartType_emp": PartType_emp, # U, - "PartType_res": PartType_res, # U, - "PartType_inf": PartType_inf, # U, - "PartType_vbp": PartType_vbp, # U, - "Person_abs_one": Person_abs_one, # bq, U, - "Person_abs_two": Person_abs_two, # bq, U, - "Person_abs_three": Person_abs_three, # bq, U, - "Person_dat_one": Person_dat_one, # bq, U, - "Person_dat_two": Person_dat_two, # bq, U, - "Person_dat_three": Person_dat_three, # bq, U, - "Person_erg_one": Person_erg_one, # bq, U, - "Person_erg_two": Person_erg_two, # bq, U, - "Person_erg_three": Person_erg_three, # bq, U, - "Person_psor_one": Person_psor_one, # fi, U, - "Person_psor_two": Person_psor_two, # fi, U, - "Person_psor_three": Person_psor_three, # fi, U, - "Person_zero": Person_zero, # U20 - "Person_four": Person_four, # U20 - "Polite_inf": Polite_inf, # bq, U, - "Polite_pol": Polite_pol, # bq, U, - "Polite_abs_inf": Polite_abs_inf, # bq, U, - "Polite_abs_pol": Polite_abs_pol, # bq, U, - "Polite_erg_inf": Polite_erg_inf, # bq, U, - "Polite_erg_pol": Polite_erg_pol, # bq, U, - "Polite_dat_inf": Polite_dat_inf, # bq, U, - "Polite_dat_pol": Polite_dat_pol, # bq, U, - "Polite_infm": Polite_infm, # U20 - "Polite_form": Polite_form, # U20 - "Polite_form_elev": Polite_form_elev, # U20 - "Polite_form_humb": Polite_form_humb, # U20 - "Prefix_yes": Prefix_yes, # U, - "PrepCase_npr": PrepCase_npr, # cz, - "PrepCase_pre": PrepCase_pre, # U, - "PunctSide_ini": PunctSide_ini, # U, - "PunctSide_fin": PunctSide_fin, # U, - "PunctType_peri": PunctType_peri, # U, - "PunctType_qest": PunctType_qest, # U, - "PunctType_excl": PunctType_excl, # U, - "PunctType_quot": PunctType_quot, # U, - "PunctType_brck": PunctType_brck, # U, - "PunctType_comm": PunctType_comm, # U, - "PunctType_colo": PunctType_colo, # U, - "PunctType_semi": PunctType_semi, # U, - "PunctType_dash": PunctType_dash, # U, - "Style_arch": Style_arch, # cz, fi, U, - "Style_rare": Style_rare, # cz, fi, U, - "Style_poet": Style_poet, # cz, U, - "Style_norm": Style_norm, # cz, U, - "Style_coll": Style_coll, # cz, U, - "Style_vrnc": Style_vrnc, # cz, U, - "Style_sing": Style_sing, # cz, U, - "Style_expr": Style_expr, # cz, U, - "Style_derg": Style_derg, # cz, U, - "Style_vulg": Style_vulg, # cz, U, - "Style_yes": Style_yes, # fi, U, - "StyleVariant_styleShort": StyleVariant_styleShort, # cz, - "StyleVariant_styleBound": StyleVariant_styleBound, # cz, sl, - "VerbType_aux": VerbType_aux, # U, - "VerbType_cop": VerbType_cop, # U, - "VerbType_mod": VerbType_mod, # U, - "VerbType_light": VerbType_light, # U, + "DEPRECATED001": DEPRECATED001, + "DEPRECATED002": DEPRECATED002, + "DEPRECATED003": DEPRECATED003, + "DEPRECATED004": DEPRECATED004, + "DEPRECATED005": DEPRECATED005, + "DEPRECATED006": DEPRECATED006, + "DEPRECATED007": DEPRECATED007, + "DEPRECATED008": DEPRECATED008, + "DEPRECATED009": DEPRECATED009, + "DEPRECATED010": DEPRECATED010, + 
"DEPRECATED011": DEPRECATED011, + "DEPRECATED012": DEPRECATED012, + "DEPRECATED013": DEPRECATED013, + "DEPRECATED014": DEPRECATED014, + "DEPRECATED015": DEPRECATED015, + "DEPRECATED016": DEPRECATED016, + "DEPRECATED017": DEPRECATED017, + "DEPRECATED018": DEPRECATED018, + "DEPRECATED019": DEPRECATED019, + "DEPRECATED020": DEPRECATED020, + "DEPRECATED021": DEPRECATED021, + "DEPRECATED022": DEPRECATED022, + "DEPRECATED023": DEPRECATED023, + "DEPRECATED024": DEPRECATED024, + "DEPRECATED025": DEPRECATED025, + "DEPRECATED026": DEPRECATED026, + "DEPRECATED027": DEPRECATED027, + "DEPRECATED028": DEPRECATED028, + "DEPRECATED029": DEPRECATED029, + "DEPRECATED030": DEPRECATED030, + "DEPRECATED031": DEPRECATED031, + "DEPRECATED032": DEPRECATED032, + "DEPRECATED033": DEPRECATED033, + "DEPRECATED034": DEPRECATED034, + "DEPRECATED035": DEPRECATED035, + "DEPRECATED036": DEPRECATED036, + "DEPRECATED037": DEPRECATED037, + "DEPRECATED038": DEPRECATED038, + "DEPRECATED039": DEPRECATED039, + "DEPRECATED040": DEPRECATED040, + "DEPRECATED041": DEPRECATED041, + "DEPRECATED042": DEPRECATED042, + "DEPRECATED043": DEPRECATED043, + "DEPRECATED044": DEPRECATED044, + "DEPRECATED045": DEPRECATED045, + "DEPRECATED046": DEPRECATED046, + "DEPRECATED047": DEPRECATED047, + "DEPRECATED048": DEPRECATED048, + "DEPRECATED049": DEPRECATED049, + "DEPRECATED050": DEPRECATED050, + "DEPRECATED051": DEPRECATED051, + "DEPRECATED052": DEPRECATED052, + "DEPRECATED053": DEPRECATED053, + "DEPRECATED054": DEPRECATED054, + "DEPRECATED055": DEPRECATED055, + "DEPRECATED056": DEPRECATED056, + "DEPRECATED057": DEPRECATED057, + "DEPRECATED058": DEPRECATED058, + "DEPRECATED059": DEPRECATED059, + "DEPRECATED060": DEPRECATED060, + "DEPRECATED061": DEPRECATED061, + "DEPRECATED062": DEPRECATED062, + "DEPRECATED063": DEPRECATED063, + "DEPRECATED064": DEPRECATED064, + "DEPRECATED065": DEPRECATED065, + "DEPRECATED066": DEPRECATED066, + "DEPRECATED067": DEPRECATED067, + "DEPRECATED068": DEPRECATED068, + "DEPRECATED069": DEPRECATED069, + "DEPRECATED070": DEPRECATED070, + "DEPRECATED071": DEPRECATED071, + "DEPRECATED072": DEPRECATED072, + "DEPRECATED073": DEPRECATED073, + "DEPRECATED074": DEPRECATED074, + "DEPRECATED075": DEPRECATED075, + "DEPRECATED076": DEPRECATED076, + "DEPRECATED077": DEPRECATED077, + "DEPRECATED078": DEPRECATED078, + "DEPRECATED079": DEPRECATED079, + "DEPRECATED080": DEPRECATED080, + "DEPRECATED081": DEPRECATED081, + "DEPRECATED082": DEPRECATED082, + "DEPRECATED083": DEPRECATED083, + "DEPRECATED084": DEPRECATED084, + "DEPRECATED085": DEPRECATED085, + "DEPRECATED086": DEPRECATED086, + "DEPRECATED087": DEPRECATED087, + "DEPRECATED088": DEPRECATED088, + "DEPRECATED089": DEPRECATED089, + "DEPRECATED090": DEPRECATED090, + "DEPRECATED091": DEPRECATED091, + "DEPRECATED092": DEPRECATED092, + "DEPRECATED093": DEPRECATED093, + "DEPRECATED094": DEPRECATED094, + "DEPRECATED095": DEPRECATED095, + "DEPRECATED096": DEPRECATED096, + "DEPRECATED097": DEPRECATED097, + "DEPRECATED098": DEPRECATED098, + "DEPRECATED099": DEPRECATED099, + "DEPRECATED100": DEPRECATED100, + "DEPRECATED101": DEPRECATED101, + "DEPRECATED102": DEPRECATED102, + "DEPRECATED103": DEPRECATED103, + "DEPRECATED104": DEPRECATED104, + "DEPRECATED105": DEPRECATED105, + "DEPRECATED106": DEPRECATED106, + "DEPRECATED107": DEPRECATED107, + "DEPRECATED108": DEPRECATED108, + "DEPRECATED109": DEPRECATED109, + "DEPRECATED110": DEPRECATED110, + "DEPRECATED111": DEPRECATED111, + "DEPRECATED112": DEPRECATED112, + "DEPRECATED113": DEPRECATED113, + "DEPRECATED114": DEPRECATED114, + "DEPRECATED115": 
DEPRECATED115, + "DEPRECATED116": DEPRECATED116, + "DEPRECATED117": DEPRECATED117, + "DEPRECATED118": DEPRECATED118, + "DEPRECATED119": DEPRECATED119, + "DEPRECATED120": DEPRECATED120, + "DEPRECATED121": DEPRECATED121, + "DEPRECATED122": DEPRECATED122, + "DEPRECATED123": DEPRECATED123, + "DEPRECATED124": DEPRECATED124, + "DEPRECATED125": DEPRECATED125, + "DEPRECATED126": DEPRECATED126, + "DEPRECATED127": DEPRECATED127, + "DEPRECATED128": DEPRECATED128, + "DEPRECATED129": DEPRECATED129, + "DEPRECATED130": DEPRECATED130, + "DEPRECATED131": DEPRECATED131, + "DEPRECATED132": DEPRECATED132, + "DEPRECATED133": DEPRECATED133, + "DEPRECATED134": DEPRECATED134, + "DEPRECATED135": DEPRECATED135, + "DEPRECATED136": DEPRECATED136, + "DEPRECATED137": DEPRECATED137, + "DEPRECATED138": DEPRECATED138, + "DEPRECATED139": DEPRECATED139, + "DEPRECATED140": DEPRECATED140, + "DEPRECATED141": DEPRECATED141, + "DEPRECATED142": DEPRECATED142, + "DEPRECATED143": DEPRECATED143, + "DEPRECATED144": DEPRECATED144, + "DEPRECATED145": DEPRECATED145, + "DEPRECATED146": DEPRECATED146, + "DEPRECATED147": DEPRECATED147, + "DEPRECATED148": DEPRECATED148, + "DEPRECATED149": DEPRECATED149, + "DEPRECATED150": DEPRECATED150, + "DEPRECATED151": DEPRECATED151, + "DEPRECATED152": DEPRECATED152, + "DEPRECATED153": DEPRECATED153, + "DEPRECATED154": DEPRECATED154, + "DEPRECATED155": DEPRECATED155, + "DEPRECATED156": DEPRECATED156, + "DEPRECATED157": DEPRECATED157, + "DEPRECATED158": DEPRECATED158, + "DEPRECATED159": DEPRECATED159, + "DEPRECATED160": DEPRECATED160, + "DEPRECATED161": DEPRECATED161, + "DEPRECATED162": DEPRECATED162, + "DEPRECATED163": DEPRECATED163, + "DEPRECATED164": DEPRECATED164, + "DEPRECATED165": DEPRECATED165, + "DEPRECATED166": DEPRECATED166, + "DEPRECATED167": DEPRECATED167, + "DEPRECATED168": DEPRECATED168, + "DEPRECATED169": DEPRECATED169, + "DEPRECATED170": DEPRECATED170, + "DEPRECATED171": DEPRECATED171, + "DEPRECATED172": DEPRECATED172, + "DEPRECATED173": DEPRECATED173, + "DEPRECATED174": DEPRECATED174, + "DEPRECATED175": DEPRECATED175, + "DEPRECATED176": DEPRECATED176, + "DEPRECATED177": DEPRECATED177, + "DEPRECATED178": DEPRECATED178, + "DEPRECATED179": DEPRECATED179, + "DEPRECATED180": DEPRECATED180, + "DEPRECATED181": DEPRECATED181, + "DEPRECATED182": DEPRECATED182, + "DEPRECATED183": DEPRECATED183, + "DEPRECATED184": DEPRECATED184, + "DEPRECATED185": DEPRECATED185, + "DEPRECATED186": DEPRECATED186, + "DEPRECATED187": DEPRECATED187, + "DEPRECATED188": DEPRECATED188, + "DEPRECATED189": DEPRECATED189, + "DEPRECATED190": DEPRECATED190, + "DEPRECATED191": DEPRECATED191, + "DEPRECATED192": DEPRECATED192, + "DEPRECATED193": DEPRECATED193, + "DEPRECATED194": DEPRECATED194, + "DEPRECATED195": DEPRECATED195, + "DEPRECATED196": DEPRECATED196, + "DEPRECATED197": DEPRECATED197, + "DEPRECATED198": DEPRECATED198, + "DEPRECATED199": DEPRECATED199, + "DEPRECATED200": DEPRECATED200, + "DEPRECATED201": DEPRECATED201, + "DEPRECATED202": DEPRECATED202, + "DEPRECATED203": DEPRECATED203, + "DEPRECATED204": DEPRECATED204, + "DEPRECATED205": DEPRECATED205, + "DEPRECATED206": DEPRECATED206, + "DEPRECATED207": DEPRECATED207, + "DEPRECATED208": DEPRECATED208, + "DEPRECATED209": DEPRECATED209, + "DEPRECATED210": DEPRECATED210, + "DEPRECATED211": DEPRECATED211, + "DEPRECATED212": DEPRECATED212, + "DEPRECATED213": DEPRECATED213, + "DEPRECATED214": DEPRECATED214, + "DEPRECATED215": DEPRECATED215, + "DEPRECATED216": DEPRECATED216, + "DEPRECATED217": DEPRECATED217, + "DEPRECATED218": DEPRECATED218, + "DEPRECATED219": DEPRECATED219, + 
"DEPRECATED220": DEPRECATED220, + "DEPRECATED221": DEPRECATED221, + "DEPRECATED222": DEPRECATED222, + "DEPRECATED223": DEPRECATED223, + "DEPRECATED224": DEPRECATED224, + "DEPRECATED225": DEPRECATED225, + "DEPRECATED226": DEPRECATED226, + "DEPRECATED227": DEPRECATED227, + "DEPRECATED228": DEPRECATED228, + "DEPRECATED229": DEPRECATED229, + "DEPRECATED230": DEPRECATED230, + "DEPRECATED231": DEPRECATED231, + "DEPRECATED232": DEPRECATED232, + "DEPRECATED233": DEPRECATED233, + "DEPRECATED234": DEPRECATED234, + "DEPRECATED235": DEPRECATED235, + "DEPRECATED236": DEPRECATED236, + "DEPRECATED237": DEPRECATED237, + "DEPRECATED238": DEPRECATED238, + "DEPRECATED239": DEPRECATED239, + "DEPRECATED240": DEPRECATED240, + "DEPRECATED241": DEPRECATED241, + "DEPRECATED242": DEPRECATED242, + "DEPRECATED243": DEPRECATED243, + "DEPRECATED244": DEPRECATED244, + "DEPRECATED245": DEPRECATED245, + "DEPRECATED246": DEPRECATED246, + "DEPRECATED247": DEPRECATED247, + "DEPRECATED248": DEPRECATED248, + "DEPRECATED249": DEPRECATED249, + "DEPRECATED250": DEPRECATED250, + "DEPRECATED251": DEPRECATED251, + "DEPRECATED252": DEPRECATED252, + "DEPRECATED253": DEPRECATED253, + "DEPRECATED254": DEPRECATED254, + "DEPRECATED255": DEPRECATED255, + "DEPRECATED256": DEPRECATED256, + "DEPRECATED257": DEPRECATED257, + "DEPRECATED258": DEPRECATED258, + "DEPRECATED259": DEPRECATED259, + "DEPRECATED260": DEPRECATED260, + "DEPRECATED261": DEPRECATED261, + "DEPRECATED262": DEPRECATED262, + "DEPRECATED263": DEPRECATED263, + "DEPRECATED264": DEPRECATED264, + "DEPRECATED265": DEPRECATED265, + "DEPRECATED266": DEPRECATED266, + "DEPRECATED267": DEPRECATED267, + "DEPRECATED268": DEPRECATED268, + "DEPRECATED269": DEPRECATED269, + "DEPRECATED270": DEPRECATED270, + "DEPRECATED271": DEPRECATED271, + "DEPRECATED272": DEPRECATED272, + "DEPRECATED273": DEPRECATED273, + "DEPRECATED274": DEPRECATED274, + "DEPRECATED275": DEPRECATED275, + "DEPRECATED276": DEPRECATED276, "PERSON": PERSON, "NORP": NORP, @@ -468,6 +464,7 @@ IDS = { "acl": acl, "LAW": LAW, + "MORPH": MORPH, } diff --git a/spacy/syntax/_beam_utils.pxd b/spacy/syntax/_beam_utils.pxd index 36b0c05da..cf99ac3d1 100644 --- a/spacy/syntax/_beam_utils.pxd +++ b/spacy/syntax/_beam_utils.pxd @@ -1,4 +1,4 @@ -from thinc.typedefs cimport class_t, hash_t +from ..typedefs cimport hash_t, class_t # These are passed as callbacks to thinc.search.Beam cdef int transition_state(void* _dest, void* _src, class_t clas, void* _moves) except -1 diff --git a/spacy/syntax/_beam_utils.pyx b/spacy/syntax/_beam_utils.pyx index b1085c762..03702e54e 100644 --- a/spacy/syntax/_beam_utils.pyx +++ b/spacy/syntax/_beam_utils.pyx @@ -1,18 +1,19 @@ -# cython: infer_types=True -# cython: profile=True +# cython: infer_types=True, profile=True cimport numpy as np -import numpy from cpython.ref cimport PyObject, Py_XDECREF from thinc.extra.search cimport Beam -from thinc.extra.search import MaxViolation -from thinc.typedefs cimport hash_t, class_t from thinc.extra.search cimport MaxViolation +from thinc.extra.search import MaxViolation +import numpy + +from ..typedefs cimport hash_t, class_t from .transition_system cimport TransitionSystem, Transition from ..gold cimport GoldParse -from ..errors import Errors from .stateclass cimport StateC, StateClass +from ..errors import Errors + # These are passed as callbacks to thinc.search.Beam cdef int transition_state(void* _dest, void* _src, class_t clas, void* _moves) except -1: @@ -326,5 +327,3 @@ def cleanup_beam(Beam beam): seen.add(addr) else: raise 
ValueError(Errors.E023.format(addr=addr, i=i)) - - diff --git a/spacy/syntax/_parser_model.pxd b/spacy/syntax/_parser_model.pxd index 9c72f3415..15befb372 100644 --- a/spacy/syntax/_parser_model.pxd +++ b/spacy/syntax/_parser_model.pxd @@ -1,6 +1,6 @@ from libc.string cimport memset, memcpy from libc.stdlib cimport calloc, free, realloc -from thinc.typedefs cimport weight_t, class_t, hash_t +from ..typedefs cimport weight_t, class_t, hash_t from ._state cimport StateC diff --git a/spacy/syntax/_parser_model.pyx b/spacy/syntax/_parser_model.pyx index 8b6448a46..60d22a1ab 100644 --- a/spacy/syntax/_parser_model.pyx +++ b/spacy/syntax/_parser_model.pyx @@ -1,38 +1,29 @@ -# cython: infer_types=True -# cython: cdivision=True -# cython: boundscheck=False -# coding: utf-8 -from __future__ import unicode_literals, print_function - -from collections import OrderedDict -import numpy +# cython: infer_types=True, cdivision=True, boundscheck=False cimport cython.parallel -import numpy.random cimport numpy as np from libc.math cimport exp from libcpp.vector cimport vector from libc.string cimport memset, memcpy from libc.stdlib cimport calloc, free, realloc from cymem.cymem cimport Pool -from thinc.typedefs cimport weight_t, class_t, hash_t from thinc.extra.search cimport Beam -from thinc.api import chain, clone -from thinc.v2v import Model, Maxout, Affine -from thinc.misc import LayerNorm -from thinc.neural.ops import CupyOps, NumpyOps -from thinc.neural.util import get_array_module -from thinc.linalg cimport Vec, VecVec +from thinc.backends.linalg cimport Vec, VecVec cimport blis.cy -from .._ml import zero_init, PrecomputableAffine, Tok2Vec, flatten -from .._ml import link_vectors_to_models, create_default_optimizer -from ..compat import copy_array +import numpy +import numpy.random +from thinc.api import Linear, Model, CupyOps, NumpyOps, use_ops, noop + +from ..typedefs cimport weight_t, class_t, hash_t from ..tokens.doc cimport Doc from ..gold cimport GoldParse -from ..errors import Errors, TempErrors -from .. import util from .stateclass cimport StateClass from .transition_system cimport Transition + +from ..compat import copy_array +from ..errors import Errors, TempErrors +from ..util import link_vectors_to_models, create_default_optimizer +from .. import util from . import _beam_utils from . 
import nonproj @@ -48,8 +39,8 @@ cdef WeightsC get_c_weights(model) except *: output.hidden_weights = NULL output.hidden_bias = NULL else: - vec2scores_W = model.vec2scores.W - vec2scores_b = model.vec2scores.b + vec2scores_W = model.vec2scores.get_param("W") + vec2scores_b = model.vec2scores.get_param("b") output.hidden_weights = vec2scores_W.data output.hidden_bias = vec2scores_b.data cdef np.ndarray class_mask = model._class_mask @@ -61,12 +52,12 @@ cdef SizesC get_c_sizes(model, int batch_size) except *: cdef SizesC output output.states = batch_size if model.vec2scores is None: - output.classes = model.state2vec.nO + output.classes = model.state2vec.get_dim("nO") else: - output.classes = model.vec2scores.nO - output.hiddens = model.state2vec.nO - output.pieces = model.state2vec.nP - output.feats = model.state2vec.nF + output.classes = model.vec2scores.get_dim("nO") + output.hiddens = model.state2vec.get_dim("nO") + output.pieces = model.state2vec.get_dim("nP") + output.feats = model.state2vec.get_dim("nF") output.embed_width = model.tokvecs.shape[1] return output @@ -228,87 +219,27 @@ cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) no return best -class ParserModel(Model): - def __init__(self, tok2vec, lower_model, upper_model, unseen_classes=None): - Model.__init__(self) - self._layers = [tok2vec, lower_model] - if upper_model is not None: - self._layers.append(upper_model) - self.unseen_classes = set() - if unseen_classes: - for class_ in unseen_classes: - self.unseen_classes.add(class_) - - def begin_update(self, docs, drop=0.): - step_model = ParserStepModel(docs, self._layers, drop=drop, - unseen_classes=self.unseen_classes) - def finish_parser_update(golds, sgd=None): - step_model.make_updates(sgd) - return None - return step_model, finish_parser_update - - def resize_output(self, new_output): - if len(self._layers) == 2: - return - if new_output == self.upper.nO: - return - smaller = self.upper - - with Model.use_device('cpu'): - larger = Affine(new_output, smaller.nI) - larger.W.fill(0.0) - larger.b.fill(0.0) - # It seems very unhappy if I pass these as smaller.W? - # Seems to segfault. Maybe it's a descriptor protocol thing? - smaller_W = smaller.W - larger_W = larger.W - smaller_b = smaller.b - larger_b = larger.b - # Weights are stored in (nr_out, nr_in) format, so we're basically - # just adding rows here. 
- larger_W[:smaller.nO] = smaller_W - larger_b[:smaller.nO] = smaller_b - self._layers[-1] = larger - for i in range(smaller.nO, new_output): - self.unseen_classes.add(i) - - def begin_training(self, X, y=None): - self.lower.begin_training(X, y=y) - - @property - def tok2vec(self): - return self._layers[0] - - @property - def lower(self): - return self._layers[1] - - @property - def upper(self): - return self._layers[2] - class ParserStepModel(Model): - def __init__(self, docs, layers, unseen_classes=None, drop=0.): - self.tokvecs, self.bp_tokvecs = layers[0].begin_update(docs, drop=drop) - if layers[1].nP >= 2: + def __init__(self, docs, layers, *, has_upper, unseen_classes=None, train=True): + Model.__init__(self, name="parser_step_model", forward=step_forward) + self.attrs["has_upper"] = has_upper + self.tokvecs, self.bp_tokvecs = layers[0](docs, is_train=train) + if layers[1].get_dim("nP") >= 2: activation = "maxout" - elif len(layers) == 2: + elif has_upper: activation = None else: activation = "relu" self.state2vec = precompute_hiddens(len(docs), self.tokvecs, layers[1], - activation=activation, drop=drop) - if len(layers) == 3: + activation=activation, train=train) + if has_upper: self.vec2scores = layers[-1] else: self.vec2scores = None self.cuda_stream = util.get_cuda_stream(non_blocking=True) self.backprops = [] - if self.vec2scores is None: - self._class_mask = numpy.zeros((self.state2vec.nO,), dtype='f') - else: - self._class_mask = numpy.zeros((self.vec2scores.nO,), dtype='f') + self._class_mask = numpy.zeros((self.nO,), dtype='f') self._class_mask.fill(1) if unseen_classes is not None: for class_ in unseen_classes: @@ -316,7 +247,10 @@ class ParserStepModel(Model): @property def nO(self): - return self.state2vec.nO + if self.attrs["has_upper"]: + return self.vec2scores.get_dim("nO") + else: + return self.state2vec.get_dim("nO") def class_is_unseen(self, class_): return self._class_mask[class_] @@ -327,40 +261,6 @@ class ParserStepModel(Model): def mark_class_seen(self, class_): self._class_mask[class_] = 1 - def begin_update(self, states, drop=0.): - token_ids = self.get_token_ids(states) - vector, get_d_tokvecs = self.state2vec.begin_update(token_ids, drop=0.0) - if self.vec2scores is not None: - mask = self.vec2scores.ops.get_dropout_mask(vector.shape, drop) - if mask is not None: - vector *= mask - scores, get_d_vector = self.vec2scores.begin_update(vector, drop=drop) - else: - scores = NumpyOps().asarray(vector) - get_d_vector = lambda d_scores, sgd=None: d_scores - mask = None - # If the class is unseen, make sure its score is minimum - scores[:, self._class_mask == 0] = numpy.nanmin(scores) - - def backprop_parser_step(d_scores, sgd=None): - # Zero vectors for unseen classes - d_scores *= self._class_mask - d_vector = get_d_vector(d_scores, sgd=sgd) - if mask is not None: - d_vector *= mask - if isinstance(self.state2vec.ops, CupyOps) \ - and not isinstance(token_ids, self.state2vec.ops.xp.ndarray): - # Move token_ids and d_vector to GPU, asynchronously - self.backprops.append(( - util.get_async(self.cuda_stream, token_ids), - util.get_async(self.cuda_stream, d_vector), - get_d_tokvecs - )) - else: - self.backprops.append((token_ids, d_vector, get_d_tokvecs)) - return None - return scores, backprop_parser_step - def get_token_ids(self, batch): states = _beam_utils.collect_states(batch) cdef StateClass state @@ -374,25 +274,54 @@ class ParserStepModel(Model): c_ids += ids.shape[1] return ids - def make_updates(self, sgd): + def finish_steps(self, golds): # Add a padding 
vector to the d_tokvecs gradient, so that missing # values don't affect the real gradient. - d_tokvecs = self.ops.allocate((self.tokvecs.shape[0]+1, self.tokvecs.shape[1])) + d_tokvecs = self.ops.alloc((self.tokvecs.shape[0]+1, self.tokvecs.shape[1])) # Tells CUDA to block, so our async copies complete. if self.cuda_stream is not None: self.cuda_stream.synchronize() for ids, d_vector, bp_vector in self.backprops: - d_state_features = bp_vector((d_vector, ids), sgd=sgd) + d_state_features = bp_vector((d_vector, ids)) ids = ids.flatten() d_state_features = d_state_features.reshape( (ids.size, d_state_features.shape[2])) self.ops.scatter_add(d_tokvecs, ids, d_state_features) # Padded -- see update() - self.bp_tokvecs(d_tokvecs[:-1], sgd=sgd) + self.bp_tokvecs(d_tokvecs[:-1]) return d_tokvecs +def step_forward(model: ParserStepModel, states, is_train): + token_ids = model.get_token_ids(states) + vector, get_d_tokvecs = model.state2vec(token_ids, is_train) + if model.attrs["has_upper"]: + scores, get_d_vector = model.vec2scores(vector, is_train) + else: + scores = NumpyOps().asarray(vector) + get_d_vector = lambda d_scores: d_scores + # If the class is unseen, make sure its score is minimum + scores[:, model._class_mask == 0] = numpy.nanmin(scores) + + def backprop_parser_step(d_scores): + # Zero vectors for unseen classes + d_scores *= model._class_mask + d_vector = get_d_vector(d_scores) + if isinstance(model.state2vec.ops, CupyOps) \ + and not isinstance(token_ids, model.state2vec.ops.xp.ndarray): + # Move token_ids and d_vector to GPU, asynchronously + model.backprops.append(( + util.get_async(model.cuda_stream, token_ids), + util.get_async(model.cuda_stream, d_vector), + get_d_tokvecs + )) + else: + model.backprops.append((token_ids, d_vector, get_d_tokvecs)) + return None + return scores, backprop_parser_step + + cdef class precompute_hiddens: """Allow a model to be "primed" by pre-computing input features in bulk. 
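The `precompute_hiddens` docstring above is terse, so here is a rough NumPy sketch of the "priming" idea it implements. The shapes and names are made up for illustration; the real class does the summation in Cython via `sum_state_features` and then applies the maxout/relu nonlinearity.

    import numpy

    n_tokens, width, nF, nO = 10, 8, 3, 16            # hypothetical sizes
    tokvecs = numpy.random.rand(n_tokens, width).astype("f")
    W = numpy.random.rand(nF, nO, width).astype("f")  # stand-in lower-layer weights
    b = numpy.zeros((nO,), dtype="f")

    # "Prime" the layer once per batch: one (nF, nO) block per token.
    cached = numpy.einsum("tw,fow->tfo", tokvecs, W)

    def state_vector(token_ids):
        # token_ids[i] = index of the token filling feature slot i for one state.
        # Each parser transition then just gathers and sums nF cached rows,
        # instead of re-running the lower layer on the token vectors.
        return cached[token_ids, numpy.arange(nF)].sum(axis=0) + b

    print(state_vector(numpy.array([0, 3, 7])).shape)  # -> (nO,)
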
@@ -421,8 +350,8 @@ cdef class precompute_hiddens: cdef object activation def __init__(self, batch_size, tokvecs, lower_model, cuda_stream=None, - activation="maxout", drop=0.): - gpu_cached, bp_features = lower_model.begin_update(tokvecs, drop=drop) + activation="maxout", train=False): + gpu_cached, bp_features = lower_model(tokvecs, train) cdef np.ndarray cached if not isinstance(gpu_cached, numpy.ndarray): # Note the passing of cuda_stream here: it lets @@ -431,12 +360,15 @@ cdef class precompute_hiddens: cached = gpu_cached.get(stream=cuda_stream) else: cached = gpu_cached - if not isinstance(lower_model.b, numpy.ndarray): - self.bias = lower_model.b.get() + if not isinstance(lower_model.get_param("b"), numpy.ndarray): + self.bias = lower_model.get_param("b").get(stream=cuda_stream) else: - self.bias = lower_model.b + self.bias = lower_model.get_param("b") self.nF = cached.shape[1] - self.nP = getattr(lower_model, 'nP', 1) + if lower_model.has_dim("nP"): + self.nP = lower_model.get_dim("nP") + else: + self.nP = 1 self.nO = cached.shape[2] self.ops = lower_model.ops assert activation in (None, "relu", "maxout") @@ -452,10 +384,46 @@ cdef class precompute_hiddens: self._is_synchronized = True return self._cached.data - def __call__(self, X): - return self.begin_update(X, drop=None)[0] + def has_dim(self, name): + if name == "nF": + return self.nF if self.nF is not None else True + elif name == "nP": + return self.nP if self.nP is not None else True + elif name == "nO": + return self.nO if self.nO is not None else True + else: + return False - def begin_update(self, token_ids, drop=0.): + def get_dim(self, name): + if name == "nF": + return self.nF + elif name == "nP": + return self.nP + elif name == "nO": + return self.nO + else: + raise ValueError(f"Dimension {name} invalid -- only nO, nF, nP") + + def set_dim(self, name, value): + if name == "nF": + self.nF = value + elif name == "nP": + self.nP = value + elif name == "nO": + self.nO = value + else: + raise ValueError(f"Dimension {name} invalid -- only nO, nF, nP") + + def __call__(self, X, bint is_train): + if is_train: + return self.begin_update(X) + else: + return self.predict(X), lambda X: X + + def predict(self, X): + return self.begin_update(X)[0] + + def begin_update(self, token_ids): cdef np.ndarray state_vector = numpy.zeros( (token_ids.shape[0], self.nO, self.nP), dtype='f') # This is tricky, but (assuming GPU available); @@ -470,13 +438,13 @@ cdef class precompute_hiddens: sum_state_features(state_vector.data, feat_weights, &ids[0,0], token_ids.shape[0], self.nF, self.nO*self.nP) - state_vector += self.bias + state_vector = state_vector + self.bias state_vector, bp_nonlinearity = self._nonlinearity(state_vector) - def backward(d_state_vector_ids, sgd=None): + def backward(d_state_vector_ids): d_state_vector, token_ids = d_state_vector_ids - d_state_vector = bp_nonlinearity(d_state_vector, sgd) - d_tokens = bp_hiddens((d_state_vector, token_ids), sgd) + d_state_vector = bp_nonlinearity(d_state_vector) + d_tokens = bp_hiddens((d_state_vector, token_ids)) return d_tokens return state_vector, backward @@ -485,7 +453,7 @@ cdef class precompute_hiddens: ops = NumpyOps() else: ops = CupyOps() - + if self.activation == "maxout": state_vector, mask = ops.maxout(state_vector) else: @@ -496,7 +464,7 @@ cdef class precompute_hiddens: else: mask = None - def backprop_nonlinearity(d_best, sgd=None): + def backprop_nonlinearity(d_best): if isinstance(d_best, numpy.ndarray): ops = NumpyOps() else: @@ -506,7 +474,11 @@ cdef class 
precompute_hiddens: # This will usually be on GPU d_best = ops.asarray(d_best) # Fix nans (which can occur from unseen classes.) - d_best[ops.xp.isnan(d_best)] = 0. + try: + d_best[ops.xp.isnan(d_best)] = 0. + except: + print(ops.xp.isnan(d_best)) + raise if self.activation == "maxout": mask_ = ops.asarray(mask) return ops.backprop_maxout(d_best, mask_, self.nP) diff --git a/spacy/syntax/_state.pxd b/spacy/syntax/_state.pxd index 141d796a4..fef4f0c92 100644 --- a/spacy/syntax/_state.pxd +++ b/spacy/syntax/_state.pxd @@ -1,9 +1,7 @@ from libc.string cimport memcpy, memset, memmove from libc.stdlib cimport malloc, calloc, free from libc.stdint cimport uint32_t, uint64_t - from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno - from murmurhash.mrmr cimport hash64 from ..vocab cimport EMPTY_LEXEME diff --git a/spacy/syntax/arc_eager.pxd b/spacy/syntax/arc_eager.pxd index 972ad682a..14d706548 100644 --- a/spacy/syntax/arc_eager.pxd +++ b/spacy/syntax/arc_eager.pxd @@ -1,10 +1,7 @@ from cymem.cymem cimport Pool -from thinc.typedefs cimport weight_t - from .stateclass cimport StateClass -from ..typedefs cimport attr_t - +from ..typedefs cimport weight_t, attr_t from .transition_system cimport TransitionSystem, Transition from ..gold cimport GoldParseC @@ -15,4 +12,3 @@ cdef class ArcEager(TransitionSystem): cdef weight_t push_cost(StateClass stcls, const GoldParseC* gold, int target) nogil cdef weight_t arc_cost(StateClass stcls, const GoldParseC* gold, int head, int child) nogil - diff --git a/spacy/syntax/arc_eager.pyx b/spacy/syntax/arc_eager.pyx index efe8573c1..19be95f3f 100644 --- a/spacy/syntax/arc_eager.pyx +++ b/spacy/syntax/arc_eager.pyx @@ -1,31 +1,29 @@ -# cython: profile=True -# cython: cdivision=True -# cython: infer_types=True -# coding: utf-8 -from __future__ import unicode_literals - +# cython: profile=True, cdivision=True, infer_types=True from cpython.ref cimport Py_INCREF from cymem.cymem cimport Pool -from collections import OrderedDict, defaultdict, Counter from thinc.extra.search cimport Beam + +from collections import defaultdict, Counter import json -from .nonproj import is_nonproj_tree from ..typedefs cimport hash_t, attr_t from ..strings cimport hash_string -from .stateclass cimport StateClass -from ._state cimport StateC -from . import nonproj -from .transition_system cimport move_cost_func_t, label_cost_func_t from ..gold cimport GoldParse, GoldParseC from ..structs cimport TokenC -from ..errors import Errors from ..tokens.doc cimport Doc, set_children_from_heads +from .stateclass cimport StateClass +from ._state cimport StateC +from .transition_system cimport move_cost_func_t, label_cost_func_t + +from ..errors import Errors +from .nonproj import is_nonproj_tree +from . import nonproj + # Calculate cost as gold/not gold. We don't use scalar value anyway. 
cdef int BINARY_COSTS = 1 cdef weight_t MIN_SCORE = -90000 -cdef attr_t SUBTOK_LABEL = hash_string('subtok') +cdef attr_t SUBTOK_LABEL = hash_string(u'subtok') DEF NON_MONOTONIC = True DEF USE_BREAK = True @@ -347,20 +345,20 @@ cdef class ArcEager(TransitionSystem): for label in kwargs.get('right_labels', []): actions[RIGHT][label] = 1 actions[REDUCE][label] = 1 - for raw_text, sents in kwargs.get('gold_parses', []): - for (ids, words, tags, heads, labels, iob), ctnts in sents: - heads, labels = nonproj.projectivize(heads, labels) - for child, head, label in zip(ids, heads, labels): - if label.upper() == 'ROOT' : - label = 'ROOT' - if head == child: - actions[BREAK][label] += 1 - elif head < child: - actions[RIGHT][label] += 1 - actions[REDUCE][''] += 1 - elif head > child: - actions[LEFT][label] += 1 - actions[SHIFT][''] += 1 + for example in kwargs.get('gold_parses', []): + heads, labels = nonproj.projectivize(example.token_annotation.heads, + example.token_annotation.deps) + for child, head, label in zip(example.token_annotation.ids, heads, labels): + if label.upper() == 'ROOT' : + label = 'ROOT' + if head == child: + actions[BREAK][label] += 1 + elif head < child: + actions[RIGHT][label] += 1 + actions[REDUCE][''] += 1 + elif head > child: + actions[LEFT][label] += 1 + actions[SHIFT][''] += 1 if min_freq is not None: for action, label_freqs in actions.items(): for label, freq in list(label_freqs.items()): @@ -403,7 +401,9 @@ cdef class ArcEager(TransitionSystem): self.strings[state.safe_get(i).dep])) else: predicted.add((i, state.H(i), 'ROOT')) - id_, word, tag, head, dep, ner = gold.orig_annot[gold.cand_to_gold[i]] + id_ = gold.orig.ids[gold.cand_to_gold[i]] + head = gold.orig.heads[gold.cand_to_gold[i]] + dep = gold.orig.deps[gold.cand_to_gold[i]] truth.add((id_, head, dep)) return truth == predicted diff --git a/spacy/syntax/ner.pyx b/spacy/syntax/ner.pyx index 9f8ad418c..ff74be601 100644 --- a/spacy/syntax/ner.pyx +++ b/spacy/syntax/ner.pyx @@ -1,10 +1,8 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from thinc.typedefs cimport weight_t from thinc.extra.search cimport Beam -from collections import OrderedDict, Counter +from collections import Counter + +from ..typedefs cimport weight_t from .stateclass cimport StateClass from ._state cimport StateC from .transition_system cimport Transition @@ -12,6 +10,7 @@ from .transition_system cimport do_func_t from ..gold cimport GoldParseC, GoldParse from ..lexeme cimport Lexeme from ..attrs cimport IS_SPACE + from ..errors import Errors @@ -72,13 +71,12 @@ cdef class BiluoPushDown(TransitionSystem): for action in (BEGIN, IN, LAST, UNIT): actions[action][entity_type] = 1 moves = ('M', 'B', 'I', 'L', 'U') - for raw_text, sents in kwargs.get('gold_parses', []): - for (ids, words, tags, heads, labels, biluo), _ in sents: - for i, ner_tag in enumerate(biluo): - if ner_tag != 'O' and ner_tag != '-': - _, label = ner_tag.split('-', 1) - for action in (BEGIN, IN, LAST, UNIT): - actions[action][label] += 1 + for example in kwargs.get('gold_parses', []): + for i, ner_tag in enumerate(example.token_annotation.entities): + if ner_tag != 'O' and ner_tag != '-': + _, label = ner_tag.split('-', 1) + for action in (BEGIN, IN, LAST, UNIT): + actions[action][label] += 1 return actions @property diff --git a/spacy/syntax/nn_parser.pxd b/spacy/syntax/nn_parser.pxd index 707c9654c..d77a04420 100644 --- a/spacy/syntax/nn_parser.pxd +++ b/spacy/syntax/nn_parser.pxd @@ -1,5 +1,3 @@ -from thinc.typedefs cimport atom_t - from .stateclass cimport 
StateClass from .arc_eager cimport TransitionSystem from ..vocab cimport Vocab diff --git a/spacy/syntax/nn_parser.pyx b/spacy/syntax/nn_parser.pyx index fafa492c6..1437bdd98 100644 --- a/spacy/syntax/nn_parser.pyx +++ b/spacy/syntax/nn_parser.pyx @@ -1,13 +1,5 @@ -# cython: infer_types=True -# cython: cdivision=True -# cython: boundscheck=False -# coding: utf-8 -from __future__ import unicode_literals, print_function - -from collections import OrderedDict -import numpy +# cython: infer_types=True, cdivision=True, boundscheck=False cimport cython.parallel -import numpy.random cimport numpy as np from cpython.ref cimport PyObject, Py_XDECREF from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno @@ -16,32 +8,34 @@ from libcpp.vector cimport vector from libc.string cimport memset, memcpy from libc.stdlib cimport calloc, free from cymem.cymem cimport Pool -from thinc.typedefs cimport weight_t, class_t, hash_t from thinc.extra.search cimport Beam -from thinc.api import chain, clone -from thinc.v2v import Model, Maxout, Affine -from thinc.misc import LayerNorm -from thinc.neural.ops import NumpyOps, CupyOps -from thinc.neural.util import get_array_module -from thinc.linalg cimport Vec, VecVec -import srsly +from thinc.backends.linalg cimport Vec, VecVec +from thinc.api import chain, clone, Linear, list2array, NumpyOps, CupyOps, use_ops +from thinc.api import get_array_module, zero_init, set_dropout_rate +from itertools import islice +import srsly +import numpy.random +import numpy +import warnings + +from ..tokens.doc cimport Doc +from ..gold cimport GoldParse +from ..typedefs cimport weight_t, class_t, hash_t from ._parser_model cimport alloc_activations, free_activations from ._parser_model cimport predict_states, arg_max_if_valid from ._parser_model cimport WeightsC, ActivationsC, SizesC, cpu_log_loss from ._parser_model cimport get_c_weights, get_c_sizes -from ._parser_model import ParserModel -from .._ml import zero_init, PrecomputableAffine, Tok2Vec, flatten -from .._ml import link_vectors_to_models, create_default_optimizer -from ..compat import copy_array -from ..tokens.doc cimport Doc -from ..gold cimport GoldParse -from ..errors import Errors, TempErrors -from .. import util from .stateclass cimport StateClass from ._state cimport StateC from .transition_system cimport Transition from . cimport _beam_utils + +from ..gold import Example +from ..util import link_vectors_to_models, create_default_optimizer, registry +from ..compat import copy_array +from ..errors import Errors, Warnings +from .. import util from . import _beam_utils from . import nonproj @@ -50,107 +44,48 @@ cdef class Parser: """ Base class of the DependencyParser and EntityRecognizer. 
""" - @classmethod - def Model(cls, nr_class, **cfg): - depth = util.env_opt('parser_hidden_depth', cfg.get('hidden_depth', 1)) - subword_features = util.env_opt('subword_features', - cfg.get('subword_features', True)) - conv_depth = util.env_opt('conv_depth', cfg.get('conv_depth', 4)) - conv_window = util.env_opt('conv_window', cfg.get('conv_window', 1)) - t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3)) - bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0)) - self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0)) - nr_feature_tokens = cfg.get("nr_feature_tokens", cls.nr_feature) - if depth not in (0, 1): - raise ValueError(TempErrors.T004.format(value=depth)) - parser_maxout_pieces = util.env_opt('parser_maxout_pieces', - cfg.get('maxout_pieces', 2)) - token_vector_width = util.env_opt('token_vector_width', - cfg.get('token_vector_width', 96)) - hidden_width = util.env_opt('hidden_width', cfg.get('hidden_width', 64)) - if depth == 0: - hidden_width = nr_class - parser_maxout_pieces = 1 - embed_size = util.env_opt('embed_size', cfg.get('embed_size', 2000)) - pretrained_vectors = cfg.get('pretrained_vectors', None) - tok2vec = Tok2Vec(token_vector_width, embed_size, - conv_depth=conv_depth, - conv_window=conv_window, - cnn_maxout_pieces=t2v_pieces, - subword_features=subword_features, - pretrained_vectors=pretrained_vectors, - bilstm_depth=bilstm_depth) - tok2vec = chain(tok2vec, flatten) - tok2vec.nO = token_vector_width - lower = PrecomputableAffine(hidden_width, - nF=nr_feature_tokens, nI=token_vector_width, - nP=parser_maxout_pieces) - lower.nP = parser_maxout_pieces - if depth == 1: - with Model.use_device('cpu'): - upper = Affine(nr_class, hidden_width, drop_factor=0.0) - upper.W *= 0 - else: - upper = None - - cfg = { - 'nr_class': nr_class, - 'nr_feature_tokens': nr_feature_tokens, - 'hidden_depth': depth, - 'token_vector_width': token_vector_width, - 'hidden_width': hidden_width, - 'maxout_pieces': parser_maxout_pieces, - 'pretrained_vectors': pretrained_vectors, - 'bilstm_depth': bilstm_depth, - 'self_attn_depth': self_attn_depth, - 'conv_depth': conv_depth, - 'conv_window': conv_window, - 'embed_size': embed_size, - 'cnn_maxout_pieces': t2v_pieces - } - return ParserModel(tok2vec, lower, upper), cfg - name = 'base_parser' - def __init__(self, Vocab vocab, moves=True, model=True, **cfg): + + def __init__(self, Vocab vocab, model, **cfg): """Create a Parser. vocab (Vocab): The vocabulary object. Must be shared with documents to be processed. The value is set to the `.vocab` attribute. - moves (TransitionSystem): Defines how the parse-state is created, - updated and evaluated. The value is set to the .moves attribute - unless True (default), in which case a new instance is created with - `Parser.Moves()`. - model (object): Defines how the parse-state is created, updated and - evaluated. The value is set to the .model attribute. If set to True - (default), a new instance will be created with `Parser.Model()` - in parser.begin_training(), parser.from_disk() or parser.from_bytes(). - **cfg: Arbitrary configuration parameters. Set to the `.cfg` attribute + **cfg: Configuration parameters. Set to the `.cfg` attribute. + If it doesn't include a value for 'moves', a new instance is + created with `self.TransitionSystem()`. This defines how the + parse-state is created, updated and evaluated. 
""" self.vocab = vocab - if moves is True: - self.moves = self.TransitionSystem(self.vocab.strings) - else: - self.moves = moves - if 'beam_width' not in cfg: - cfg['beam_width'] = util.env_opt('beam_width', 1) - if 'beam_density' not in cfg: - cfg['beam_density'] = util.env_opt('beam_density', 0.0) - if 'beam_update_prob' not in cfg: - cfg['beam_update_prob'] = util.env_opt('beam_update_prob', 1.0) - cfg.setdefault('cnn_maxout_pieces', 3) - cfg.setdefault("nr_feature_tokens", self.nr_feature) - self.cfg = cfg + moves = cfg.get("moves", None) + if moves is None: + # defined by EntityRecognizer as a BiluoPushDown + moves = self.TransitionSystem(self.vocab.strings) + self.moves = moves + cfg.setdefault('min_action_freq', 30) + cfg.setdefault('learn_tokens', False) + cfg.setdefault('beam_width', 1) + cfg.setdefault('beam_update_prob', 1.0) # or 0.5 (both defaults were previously used) self.model = model + if self.moves.n_moves != 0: + self.set_output(self.moves.n_moves) + self.cfg = cfg self._multitasks = [] self._rehearsal_model = None @classmethod - def from_nlp(cls, nlp, **cfg): - return cls(nlp.vocab, **cfg) + def from_nlp(cls, nlp, model, **cfg): + return cls(nlp.vocab, model, **cfg) def __reduce__(self): - return (Parser, (self.vocab, self.moves, self.model), None, None) + return (Parser, (self.vocab, self.model), self.moves) + + def __getstate__(self): + return self.moves + + def __setstate__(self, moves): + self.moves = moves @property def move_names(self): @@ -162,8 +97,6 @@ cdef class Parser: names.append(name) return names - nr_feature = 8 - @property def labels(self): class_names = [self.moves.get_class_name(i) for i in range(self.moves.n_moves)] @@ -172,7 +105,7 @@ cdef class Parser: @property def tok2vec(self): '''Return the embedding and convolutional layer of the model.''' - return None if self.model in (None, True, False) else self.model.tok2vec + return self.model.get_ref("tok2vec") @property def postprocesses(self): @@ -189,18 +122,17 @@ cdef class Parser: self._resize() def _resize(self): - if "nr_class" in self.cfg: - self.cfg["nr_class"] = self.moves.n_moves - if self.model not in (True, False, None): - self.model.resize_output(self.moves.n_moves) + self.model.attrs["resize_output"](self.model, self.moves.n_moves) if self._rehearsal_model not in (True, False, None): - self._rehearsal_model.resize_output(self.moves.n_moves) + self._rehearsal_model.attrs["resize_output"]( + self._rehearsal_model, self.moves.n_moves + ) def add_multitask_objective(self, target): # Defined in subclasses, to avoid circular import raise NotImplementedError - def init_multitask_objectives(self, get_gold_tuples, pipeline, **cfg): + def init_multitask_objectives(self, get_examples, pipeline, **cfg): '''Setup models for secondary objectives, to benefit from multi-task learning. This method is intended to be overridden by subclasses. @@ -210,9 +142,9 @@ cdef class Parser: ''' pass - def preprocess_gold(self, docs_golds): - for doc, gold in docs_golds: - yield doc, gold + def preprocess_gold(self, examples): + for ex in examples: + yield ex def use_params(self, params): # Can't decorate cdef class :(. Workaround. @@ -226,14 +158,15 @@ cdef class Parser: doc (Doc): The document to be processed. """ if beam_width is None: - beam_width = self.cfg.get('beam_width', 1) + beam_width = self.cfg['beam_width'] beam_density = self.cfg.get('beam_density', 0.) 
states = self.predict([doc], beam_width=beam_width, beam_density=beam_density) self.set_annotations([doc], states, tensors=None) return doc - def pipe(self, docs, int batch_size=256, int n_threads=-1, beam_width=None): + def pipe(self, docs, int batch_size=256, int n_threads=-1, beam_width=None, + as_example=False): """Process a stream of documents. stream: The sequence of documents to process. @@ -241,27 +174,28 @@ cdef class Parser: YIELDS (Doc): Documents, in order. """ if beam_width is None: - beam_width = self.cfg.get('beam_width', 1) + beam_width = self.cfg['beam_width'] beam_density = self.cfg.get('beam_density', 0.) cdef Doc doc for batch in util.minibatch(docs, size=batch_size): batch_in_order = list(batch) - by_length = sorted(batch_in_order, key=lambda doc: len(doc)) + docs = [self._get_doc(ex) for ex in batch_in_order] + by_length = sorted(docs, key=lambda doc: len(doc)) for subbatch in util.minibatch(by_length, size=max(batch_size//4, 2)): subbatch = list(subbatch) parse_states = self.predict(subbatch, beam_width=beam_width, beam_density=beam_density) self.set_annotations(subbatch, parse_states, tensors=None) - for doc in batch_in_order: - yield doc - - def require_model(self): - """Raise an error if the component's model is not initialized.""" - if getattr(self, 'model', None) in (None, True, False): - raise ValueError(Errors.E109.format(name=self.name)) + if as_example: + annotated_examples = [] + for ex, doc in zip(batch_in_order, docs): + ex.doc = doc + annotated_examples.append(ex) + yield from annotated_examples + else: + yield from batch_in_order def predict(self, docs, beam_width=1, beam_density=0.0, drop=0.): - self.require_model() if isinstance(docs, Doc): docs = [docs] if not any(len(doc) for doc in docs): @@ -277,12 +211,13 @@ cdef class Parser: def greedy_parse(self, docs, drop=0.): cdef vector[StateC*] states cdef StateClass state + set_dropout_rate(self.model, drop) batch = self.moves.init_batch(docs) # This is pretty dirty, but the NER can resize itself in init_batch, # if labels are missing. We therefore have to check whether we need to # expand our model output. self._resize() - model = self.model(docs) + model = self.model.predict(docs) weights = get_c_weights(model) for state in batch: if not state.is_final(): @@ -297,18 +232,19 @@ cdef class Parser: cdef Beam beam cdef Doc doc cdef np.ndarray token_ids + set_dropout_rate(self.model, drop) beams = self.moves.init_beams(docs, beam_width, beam_density=beam_density) # This is pretty dirty, but the NER can resize itself in init_batch, # if labels are missing. We therefore have to check whether we need to # expand our model output. 
self._resize() - model = self.model(docs) - token_ids = numpy.zeros((len(docs) * beam_width, self.nr_feature), + cdef int nr_feature = self.model.get_ref("lower").get_dim("nF") + model = self.model.predict(docs) + token_ids = numpy.zeros((len(docs) * beam_width, nr_feature), dtype='i', order='C') cdef int* c_ids - cdef int nr_feature = self.cfg["nr_feature_tokens"] cdef int n_states - model = self.model(docs) + model = self.model.predict(docs) todo = [beam for beam in beams if not beam.is_done] while todo: token_ids.fill(-1) @@ -325,8 +261,8 @@ cdef class Parser: n_states += 1 if n_states == 0: break - vectors = model.state2vec(token_ids[:n_states]) - scores = model.vec2scores(vectors) + vectors = model.state2vec.predict(token_ids[:n_states]) + scores = model.vec2scores.predict(vectors) todo = self.transition_beams(todo, scores) return beams @@ -418,75 +354,82 @@ cdef class Parser: beam.check_done(_beam_utils.check_final_state, NULL) return [b for b in beams if not b.is_done] - def update(self, docs, golds, drop=0., sgd=None, losses=None): - self.require_model() - if isinstance(docs, Doc) and isinstance(golds, GoldParse): - docs = [docs] - golds = [golds] - if len(docs) != len(golds): - raise ValueError(Errors.E077.format(value='update', n_docs=len(docs), - n_golds=len(golds))) + def update(self, examples, drop=0., set_annotations=False, sgd=None, losses=None): + examples = Example.to_example_objects(examples) + if losses is None: losses = {} losses.setdefault(self.name, 0.) for multitask in self._multitasks: - multitask.update(docs, golds, drop=drop, sgd=sgd) + multitask.update(examples, drop=drop, sgd=sgd) # The probability we use beam update, instead of falling back to # a greedy update - beam_update_prob = self.cfg.get('beam_update_prob', 0.5) - if self.cfg.get('beam_width', 1) >= 2 and numpy.random.random() < beam_update_prob: - return self.update_beam(docs, golds, self.cfg.get('beam_width', 1), - drop=drop, sgd=sgd, losses=losses, + beam_update_prob = self.cfg['beam_update_prob'] + if self.cfg['beam_width'] >= 2 and numpy.random.random() < beam_update_prob: + return self.update_beam(examples, self.cfg['beam_width'], + drop=drop, sgd=sgd, losses=losses, set_annotations=set_annotations, beam_density=self.cfg.get('beam_density', 0.001)) - # Chop sequences into lengths of this many transitions, to make the - # batch uniform length. - cut_gold = numpy.random.choice(range(20, 100)) - states, golds, max_steps = self._init_gold_batch(docs, golds, max_length=cut_gold) + + set_dropout_rate(self.model, drop) + cut_gold = True + if cut_gold: + # Chop sequences into lengths of this many transitions, to make the + # batch uniform length. 
+ cut_gold = numpy.random.choice(range(20, 100)) + states, golds, max_steps = self._init_gold_batch(examples, max_length=cut_gold) + else: + states, golds, max_steps = self._init_gold_batch_no_cut(examples) states_golds = [(s, g) for (s, g) in zip(states, golds) if not s.is_final() and g is not None] - # Prepare the stepwise model, and get the callback for finishing the batch - model, finish_update = self.model.begin_update(docs, drop=drop) + model, backprop_tok2vec = self.model.begin_update([ex.doc for ex in examples]) + all_states = list(states) for _ in range(max_steps): if not states_golds: break states, golds = zip(*states_golds) - scores, backprop = model.begin_update(states, drop=drop) + scores, backprop = model.begin_update(states) d_scores = self.get_batch_loss(states, golds, scores, losses) - backprop(d_scores, sgd=sgd) + backprop(d_scores) # Follow the predicted action self.transition_states(states, scores) states_golds = [eg for eg in states_golds if not eg[0].is_final()] - # Do the backprop - finish_update(golds, sgd=sgd) + backprop_tok2vec(golds) + if sgd is not None: + self.model.finish_update(sgd) + if set_annotations: + docs = [ex.doc for ex in examples] + self.set_annotations(docs, all_states) return losses - def rehearse(self, docs, sgd=None, losses=None, **cfg): + def rehearse(self, examples, sgd=None, losses=None, **cfg): """Perform a "rehearsal" update, to prevent catastrophic forgetting.""" - if isinstance(docs, Doc): - docs = [docs] + examples = Example.to_example_objects(examples) if losses is None: losses = {} for multitask in self._multitasks: if hasattr(multitask, 'rehearse'): - multitask.rehearse(docs, losses=losses, sgd=sgd) + multitask.rehearse(examples, losses=losses, sgd=sgd) if self._rehearsal_model is None: return None losses.setdefault(self.name, 0.) + docs = [ex.doc for ex in examples] states = self.moves.init_batch(docs) # This is pretty dirty, but the NER can resize itself in init_batch, # if labels are missing. We therefore have to check whether we need to # expand our model output. self._resize() # Prepare the stepwise model, and get the callback for finishing the batch - tutor, _ = self._rehearsal_model.begin_update(docs, drop=0.0) - model, finish_update = self.model.begin_update(docs, drop=0.0) + set_dropout_rate(self._rehearsal_model, 0.0) + set_dropout_rate(self.model, 0.0) + tutor, _ = self._rehearsal_model.begin_update(docs) + model, finish_update = self.model.begin_update(docs) n_scores = 0. loss = 0. while states: - targets, _ = tutor.begin_update(states, drop=0.) - guesses, backprop = model.begin_update(states, drop=0.) + targets, _ = tutor.begin_update(states) + guesses, backprop = model.begin_update(states) d_scores = (guesses - targets) / targets.shape[0] # If all weights for an output are 0 in the original model, don't # supervise that output. This allows us to add classes. 
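The update() and rehearse() hunks above all switch to the same thinc v8 calling convention: dropout is set on the model once via set_dropout_rate(), begin_update() takes only the inputs, the backprop callback takes only the gradient, and the optimizer is applied afterwards with finish_update(). A minimal standalone sketch of that pattern (not part of the diff), using Linear and set_dropout_rate as imported in this diff and assuming thinc.api also provides Adam:

    import numpy
    from thinc.api import Linear, Adam, set_dropout_rate

    X = numpy.zeros((4, 10), dtype="f")          # toy inputs
    Y = numpy.zeros((4, 2), dtype="f")           # toy targets
    model = Linear(nO=2, nI=10)
    model.initialize(X=X, Y=Y)
    optimizer = Adam(0.001)

    set_dropout_rate(model, 0.2)                 # replaces the old per-call drop= argument
    guesses, backprop = model.begin_update(X)    # no drop= or sgd= parameters any more
    d_guesses = guesses - Y                      # toy gradient of the loss
    backprop(d_guesses)                          # previously backprop(d_guesses, sgd=sgd)
    model.finish_update(optimizer)               # optimizer is now applied explicitly
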
@@ -497,25 +440,41 @@ cdef class Parser: states = [state for state in states if not state.is_final()] n_scores += d_scores.size # Do the backprop - finish_update(docs, sgd=sgd) + finish_update(docs) + if sgd is not None: + self.model.finish_update(sgd) losses[self.name] += loss / n_scores return losses - def update_beam(self, docs, golds, width, drop=0., sgd=None, losses=None, - beam_density=0.0): + def update_beam(self, examples, width, drop=0., sgd=None, losses=None, + set_annotations=False, beam_density=0.0): + examples = Example.to_example_objects(examples) + docs = [ex.doc for ex in examples] + golds = [ex.gold for ex in examples] + new_golds = [] lengths = [len(d) for d in docs] states = self.moves.init_batch(docs) for gold in golds: self.moves.preprocess_gold(gold) - model, finish_update = self.model.begin_update(docs, drop=drop) + new_golds.append(gold) + set_dropout_rate(self.model, drop) + model, backprop_tok2vec = self.model.begin_update(docs) states_d_scores, backprops, beams = _beam_utils.update_beam( - self.moves, self.cfg["nr_feature_tokens"], 10000, states, golds, model.state2vec, - model.vec2scores, width, drop=drop, losses=losses, - beam_density=beam_density) + self.moves, + self.model.get_ref("lower").get_dim("nF"), + 10000, + states, + golds, + model.state2vec, + model.vec2scores, + width, + losses=losses, + beam_density=beam_density + ) for i, d_scores in enumerate(states_d_scores): losses[self.name] += (d_scores**2).mean() ids, bp_vectors, bp_scores = backprops[i] - d_vector = bp_scores(d_scores, sgd=sgd) + d_vector = bp_scores(d_scores) if isinstance(model.ops, CupyOps) \ and not isinstance(ids, model.state2vec.ops.xp.ndarray): model.backprops.append(( @@ -524,12 +483,51 @@ cdef class Parser: bp_vectors)) else: model.backprops.append((ids, d_vector, bp_vectors)) - model.make_updates(sgd) + backprop_tok2vec(golds) + if sgd is not None: + self.model.finish_update(sgd) + if set_annotations: + self.set_annotations(docs, beams) cdef Beam beam for beam in beams: _beam_utils.cleanup_beam(beam) - def _init_gold_batch(self, whole_docs, whole_golds, min_length=5, max_length=500): + def get_gradients(self): + """Get non-zero gradients of the model's parameters, as a dictionary + keyed by the parameter ID. The values are (weights, gradients) tuples. + """ + gradients = {} + queue = [self.model] + seen = set() + for node in queue: + if node.id in seen: + continue + seen.add(node.id) + if hasattr(node, "_mem") and node._mem.gradient.any(): + gradients[node.id] = [node._mem.weights, node._mem.gradient] + if hasattr(node, "_layers"): + queue.extend(node._layers) + return gradients + + def _init_gold_batch_no_cut(self, whole_examples): + states = self.moves.init_batch([eg.doc for eg in whole_examples]) + good_docs = [] + good_golds = [] + good_states = [] + for i, eg in enumerate(whole_examples): + doc = eg.doc + gold = self.moves.preprocess_gold(eg.gold) + if gold is not None and self.moves.has_gold(gold): + good_docs.append(doc) + good_golds.append(gold) + good_states.append(states[i]) + n_moves = [] + for doc, gold in zip(good_docs, good_golds): + oracle_actions = self.moves.get_oracle_sequence(doc, gold) + n_moves.append(len(oracle_actions)) + return good_states, good_golds, max(n_moves, default=0) * 2 + + def _init_gold_batch(self, whole_examples, min_length=5, max_length=500): """Make a square batch, of length equal to the shortest doc. A long doc will get multiple states. Let's say we have a doc of length 2*N, where N is the shortest doc. 
We'll make two states, one representing @@ -537,6 +535,8 @@ cdef class Parser: cdef: StateClass state Transition action + whole_docs = [ex.doc for ex in whole_examples] + whole_golds = [ex.gold for ex in whole_examples] whole_states = self.moves.init_batch(whole_docs) max_length = max(min_length, min(max_length, min([len(doc) for doc in whole_docs]))) max_moves = 0 @@ -580,65 +580,67 @@ cdef class Parser: cdef np.ndarray d_scores = numpy.zeros((len(states), self.moves.n_moves), dtype='f', order='C') c_d_scores = d_scores.data + unseen_classes = self.model.attrs["unseen_classes"] for i, (state, gold) in enumerate(zip(states, golds)): memset(is_valid, 0, self.moves.n_moves * sizeof(int)) memset(costs, 0, self.moves.n_moves * sizeof(float)) self.moves.set_costs(is_valid, costs, state, gold) for j in range(self.moves.n_moves): - if costs[j] <= 0.0 and j in self.model.unseen_classes: - self.model.unseen_classes.remove(j) + if costs[j] <= 0.0 and j in unseen_classes: + unseen_classes.remove(j) cpu_log_loss(c_d_scores, costs, is_valid, &scores[i, 0], d_scores.shape[1]) c_d_scores += d_scores.shape[1] + if len(states): + d_scores /= len(states) if losses is not None: losses.setdefault(self.name, 0.) losses[self.name] += (d_scores**2).sum() return d_scores def create_optimizer(self): - return create_default_optimizer(self.model.ops, - **self.cfg.get('optimizer', {})) + return create_default_optimizer() - def begin_training(self, get_gold_tuples, pipeline=None, sgd=None, **cfg): - if 'model' in cfg: - self.model = cfg['model'] - if not hasattr(get_gold_tuples, '__call__'): - gold_tuples = get_gold_tuples - get_gold_tuples = lambda: gold_tuples - actions = self.moves.get_actions(gold_parses=get_gold_tuples(), - min_freq=cfg.get('min_action_freq', 30), - learn_tokens=self.cfg.get("learn_tokens", False)) + def set_output(self, nO): + self.model.attrs["resize_output"](self.model, nO) + + def begin_training(self, get_examples, pipeline=None, sgd=None, **kwargs): + self.cfg.update(kwargs) + if not hasattr(get_examples, '__call__'): + gold_tuples = get_examples + get_examples = lambda: gold_tuples + actions = self.moves.get_actions(gold_parses=get_examples(), + min_freq=self.cfg['min_action_freq'], + learn_tokens=self.cfg["learn_tokens"]) for action, labels in self.moves.labels.items(): actions.setdefault(action, {}) for label, freq in labels.items(): if label not in actions[action]: actions[action][label] = freq self.moves.initialize_actions(actions) - if self.model is True: - cfg.setdefault('min_action_freq', 30) - cfg.setdefault('token_vector_width', 96) - self.model, cfg = self.Model(self.moves.n_moves, **cfg) - if sgd is None: - sgd = self.create_optimizer() - docs = [] - golds = [] - for raw_text, annots_brackets in get_gold_tuples(): - for annots, brackets in annots_brackets: - ids, words, tags, heads, deps, ents = annots - docs.append(Doc(self.vocab, words=words)) - golds.append(GoldParse(docs[-1], words=words, tags=tags, - heads=heads, deps=deps, entities=ents)) - self.model.begin_training(docs, golds) - if pipeline is not None: - self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg) - link_vectors_to_models(self.vocab) - self.cfg.update(cfg) - else: - if sgd is None: - sgd = self.create_optimizer() - self.model.begin_training([]) + # make sure we resize so we have an appropriate upper layer + self._resize() + if sgd is None: + sgd = self.create_optimizer() + doc_sample = [] + gold_sample = [] + for example in islice(get_examples(), 1000): + parses = 
example.get_gold_parses(merge=False, vocab=self.vocab) + for doc, gold in parses: + doc_sample.append(doc) + gold_sample.append(gold) + self.model.initialize(doc_sample, gold_sample) + if pipeline is not None: + self.init_multitask_objectives(get_examples, pipeline, sgd=sgd, **self.cfg) + link_vectors_to_models(self.vocab) return sgd + def _get_doc(self, example): + """ Use this method if the `example` can be both a Doc or an Example """ + if isinstance(example, Doc): + return example + return example.doc + def to_disk(self, path, exclude=tuple(), **kwargs): serializers = { 'model': lambda p: (self.model.to_disk(p) if self.model is not True else True), @@ -654,56 +656,44 @@ cdef class Parser: 'vocab': lambda p: self.vocab.from_disk(p), 'moves': lambda p: self.moves.from_disk(p, exclude=["strings"]), 'cfg': lambda p: self.cfg.update(srsly.read_json(p)), - 'model': lambda p: None + 'model': lambda p: None, } exclude = util.get_serialization_exclude(deserializers, exclude, kwargs) util.from_disk(path, deserializers, exclude) if 'model' not in exclude: path = util.ensure_path(path) - if self.model is True: - self.model, cfg = self.Model(**self.cfg) - else: - cfg = {} with (path / 'model').open('rb') as file_: bytes_data = file_.read() try: + self._resize() self.model.from_bytes(bytes_data) except AttributeError: raise ValueError(Errors.E149) - self.cfg.update(cfg) return self def to_bytes(self, exclude=tuple(), **kwargs): - serializers = OrderedDict(( - ('model', lambda: (self.model.to_bytes() if self.model is not True else True)), - ('vocab', lambda: self.vocab.to_bytes()), - ('moves', lambda: self.moves.to_bytes(exclude=["strings"])), - ('cfg', lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True)) - )) + serializers = { + "model": lambda: (self.model.to_bytes()), + "vocab": lambda: self.vocab.to_bytes(), + "moves": lambda: self.moves.to_bytes(exclude=["strings"]), + "cfg": lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True) + } exclude = util.get_serialization_exclude(serializers, exclude, kwargs) return util.to_bytes(serializers, exclude) def from_bytes(self, bytes_data, exclude=tuple(), **kwargs): - deserializers = OrderedDict(( - ('vocab', lambda b: self.vocab.from_bytes(b)), - ('moves', lambda b: self.moves.from_bytes(b, exclude=["strings"])), - ('cfg', lambda b: self.cfg.update(srsly.json_loads(b))), - ('model', lambda b: None) - )) + deserializers = { + "vocab": lambda b: self.vocab.from_bytes(b), + "moves": lambda b: self.moves.from_bytes(b, exclude=["strings"]), + "cfg": lambda b: self.cfg.update(srsly.json_loads(b)), + "model": lambda b: None, + } exclude = util.get_serialization_exclude(deserializers, exclude, kwargs) msg = util.from_bytes(bytes_data, deserializers, exclude) if 'model' not in exclude: - # TODO: Remove this once we don't have to handle previous models - if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg: - self.cfg['pretrained_vectors'] = self.vocab.vectors.name - if self.model is True: - self.model, cfg = self.Model(**self.cfg) - else: - cfg = {} if 'model' in msg: try: self.model.from_bytes(msg['model']) except AttributeError: raise ValueError(Errors.E149) - self.cfg.update(cfg) return self diff --git a/spacy/syntax/nonproj.pyx b/spacy/syntax/nonproj.pyx index 53e8a9cfe..1edb2e65c 100644 --- a/spacy/syntax/nonproj.pyx +++ b/spacy/syntax/nonproj.pyx @@ -1,15 +1,13 @@ -# coding: utf-8 -# cython: profile=True -# cython: infer_types=True +# cython: profile=True, infer_types=True """Implements the projectivize/deprojectivize 
mechanism in Nivre & Nilsson 2005 for doing pseudo-projective parsing implementation uses the HEAD decoration scheme. """ -from __future__ import unicode_literals - from copy import copy from ..tokens.doc cimport Doc, set_children_from_heads + +from ..gold import Example from ..errors import Errors @@ -77,39 +75,41 @@ def decompose(label): def is_decorated(label): return DELIMITER in label -def count_decorated_labels(gold_tuples): +def count_decorated_labels(gold_data): freqs = {} - for raw_text, sents in gold_tuples: - for (ids, words, tags, heads, labels, iob), ctnts in sents: - proj_heads, deco_labels = projectivize(heads, labels) - # set the label to ROOT for each root dependent - deco_labels = ['ROOT' if head == i else deco_labels[i] - for i, head in enumerate(proj_heads)] - # count label frequencies - for label in deco_labels: - if is_decorated(label): - freqs[label] = freqs.get(label, 0) + 1 + for example in gold_data: + proj_heads, deco_deps = projectivize(example.token_annotation.heads, + example.token_annotation.deps) + # set the label to ROOT for each root dependent + deco_deps = ['ROOT' if head == i else deco_deps[i] + for i, head in enumerate(proj_heads)] + # count label frequencies + for label in deco_deps: + if is_decorated(label): + freqs[label] = freqs.get(label, 0) + 1 return freqs -def preprocess_training_data(gold_tuples, label_freq_cutoff=30): +def preprocess_training_data(gold_data, label_freq_cutoff=30): preprocessed = [] freqs = {} - for raw_text, sents in gold_tuples: - prepro_sents = [] - for (ids, words, tags, heads, labels, iob), ctnts in sents: - proj_heads, deco_labels = projectivize(heads, labels) - # set the label to ROOT for each root dependent - deco_labels = ['ROOT' if head == i else deco_labels[i] - for i, head in enumerate(proj_heads)] - # count label frequencies - if label_freq_cutoff > 0: - for label in deco_labels: - if is_decorated(label): - freqs[label] = freqs.get(label, 0) + 1 - prepro_sents.append( - ((ids, words, tags, proj_heads, deco_labels, iob), ctnts)) - preprocessed.append((raw_text, prepro_sents)) + for example in gold_data: + new_example = Example(doc=example.doc) + proj_heads, deco_deps = projectivize(example.token_annotation.heads, + example.token_annotation.deps) + # set the label to ROOT for each root dependent + deco_deps = ['ROOT' if head == i else deco_deps[i] + for i, head in enumerate(proj_heads)] + # count label frequencies + if label_freq_cutoff > 0: + for label in deco_deps: + if is_decorated(label): + freqs[label] = freqs.get(label, 0) + 1 + proj_token_dict = example.token_annotation.to_dict() + proj_token_dict["heads"] = proj_heads + proj_token_dict["deps"] = deco_deps + new_example.set_token_annotation(**proj_token_dict) + preprocessed.append(new_example) if label_freq_cutoff > 0: return _filter_labels(preprocessed, label_freq_cutoff, freqs) return preprocessed @@ -154,8 +154,7 @@ def _decorate(heads, proj_heads, labels): deco_labels = [] for tokenid, head in enumerate(heads): if head != proj_heads[tokenid]: - deco_labels.append( - '%s%s%s' % (labels[tokenid], DELIMITER, labels[head])) + deco_labels.append(f"{labels[tokenid]}{DELIMITER}{labels[head]}") else: deco_labels.append(labels[tokenid]) return deco_labels @@ -203,20 +202,20 @@ def _find_new_head(token, headlabel): return token.head -def _filter_labels(gold_tuples, cutoff, freqs): +def _filter_labels(examples, cutoff, freqs): # throw away infrequent decorated labels # can't learn them reliably anyway and keeps label set smaller filtered = [] - for raw_text, sents in 
gold_tuples: - filtered_sents = [] - for (ids, words, tags, heads, labels, iob), ctnts in sents: - filtered_labels = [] - for label in labels: - if is_decorated(label) and freqs.get(label, 0) < cutoff: - filtered_labels.append(decompose(label)[0]) - else: - filtered_labels.append(label) - filtered_sents.append( - ((ids, words, tags, heads, filtered_labels, iob), ctnts)) - filtered.append((raw_text, filtered_sents)) + for example in examples: + new_example = Example(doc=example.doc) + filtered_labels = [] + for label in example.token_annotation.deps: + if is_decorated(label) and freqs.get(label, 0) < cutoff: + filtered_labels.append(decompose(label)[0]) + else: + filtered_labels.append(label) + filtered_token_dict = example.token_annotation.to_dict() + filtered_token_dict["deps"] = filtered_labels + new_example.set_token_annotation(**filtered_token_dict) + filtered.append(new_example) return filtered diff --git a/spacy/syntax/stateclass.pyx b/spacy/syntax/stateclass.pyx index 2a15a2de1..e472e9861 100644 --- a/spacy/syntax/stateclass.pyx +++ b/spacy/syntax/stateclass.pyx @@ -1,7 +1,4 @@ -# coding: utf-8 # cython: infer_types=True -from __future__ import unicode_literals - import numpy from ..tokens.doc cimport Doc @@ -49,9 +46,9 @@ cdef class StateClass: def print_state(self, words): words = list(words) + ['_'] - top = words[self.S(0)] + '_%d' % self.S_(0).head - second = words[self.S(1)] + '_%d' % self.S_(1).head - third = words[self.S(2)] + '_%d' % self.S_(2).head + top = f"{words[self.S(0)]}_{self.S_(0).head}" + second = f"{words[self.S(1)]}_{self.S_(1).head}" + third = f"{words[self.S(2)]}_{self.S_(2).head}" n0 = words[self.B(0)] n1 = words[self.B(1)] return ' '.join((third, second, top, '|', n0, n1)) diff --git a/spacy/syntax/transition_system.pxd b/spacy/syntax/transition_system.pxd index a5fe55918..5fd3b5c5f 100644 --- a/spacy/syntax/transition_system.pxd +++ b/spacy/syntax/transition_system.pxd @@ -1,12 +1,10 @@ from cymem.cymem cimport Pool -from thinc.typedefs cimport weight_t -from ..typedefs cimport attr_t +from ..typedefs cimport attr_t, weight_t from ..structs cimport TokenC from ..gold cimport GoldParse from ..gold cimport GoldParseC from ..strings cimport StringStore - from .stateclass cimport StateClass from ._state cimport StateC diff --git a/spacy/syntax/transition_system.pyx b/spacy/syntax/transition_system.pyx index 65097f114..78017c84a 100644 --- a/spacy/syntax/transition_system.pyx +++ b/spacy/syntax/transition_system.pyx @@ -1,19 +1,18 @@ # cython: infer_types=True -# coding: utf-8 -from __future__ import unicode_literals - from cpython.ref cimport Py_INCREF from cymem.cymem cimport Pool -from thinc.typedefs cimport weight_t from thinc.extra.search cimport Beam -from collections import OrderedDict, Counter + +from collections import Counter import srsly +from ..typedefs cimport weight_t from . cimport _beam_utils from ..tokens.doc cimport Doc from ..structs cimport TokenC from .stateclass cimport StateClass from ..typedefs cimport attr_t + from ..errors import Errors from .. import util diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py index 63bbf2e0a..d75db26b6 100644 --- a/spacy/tests/conftest.py +++ b/spacy/tests/conftest.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.util import get_lang_class @@ -17,11 +14,11 @@ def pytest_runtest_setup(item): # recognize the option we're asking about. To avoid this, we need to # pass a default value. 
We default to False, i.e., we act like all the # options weren't given. - return item.config.getoption("--%s" % opt, False) + return item.config.getoption(f"--{opt}", False) for opt in ["slow"]: if opt in item.keywords and not getopt(opt): - pytest.skip("need --%s option to run" % opt) + pytest.skip(f"need --{opt} option to run") # Fixtures for language tokenizers (languages sorted alphabetically) diff --git a/spacy/tests/doc/test_add_entities.py b/spacy/tests/doc/test_add_entities.py index 6c69e699a..c92fc1ff9 100644 --- a/spacy/tests/doc/test_add_entities.py +++ b/spacy/tests/doc/test_add_entities.py @@ -1,17 +1,15 @@ -# coding: utf-8 -from __future__ import unicode_literals - from spacy.pipeline import EntityRecognizer from spacy.tokens import Span import pytest from ..util import get_doc +from spacy.pipeline.defaults import default_ner def test_doc_add_entities_set_ents_iob(en_vocab): text = ["This", "is", "a", "lion"] doc = get_doc(en_vocab, text) - ner = EntityRecognizer(en_vocab) + ner = EntityRecognizer(en_vocab, default_ner()) ner.begin_training([]) ner(doc) assert len(list(doc.ents)) == 0 @@ -27,7 +25,7 @@ def test_doc_add_entities_set_ents_iob(en_vocab): def test_ents_reset(en_vocab): text = ["This", "is", "a", "lion"] doc = get_doc(en_vocab, text) - ner = EntityRecognizer(en_vocab) + ner = EntityRecognizer(en_vocab, default_ner()) ner.begin_training([]) ner(doc) assert [t.ent_iob_ for t in doc] == (["O"] * len(doc)) diff --git a/spacy/tests/doc/test_array.py b/spacy/tests/doc/test_array.py index 09a6f9c4b..f44ae1421 100644 --- a/spacy/tests/doc/test_array.py +++ b/spacy/tests/doc/test_array.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.tokens import Doc from spacy.attrs import ORTH, SHAPE, POS, DEP diff --git a/spacy/tests/doc/test_creation.py b/spacy/tests/doc/test_creation.py index 863a7c210..3ee833aa8 100644 --- a/spacy/tests/doc/test_creation.py +++ b/spacy/tests/doc/test_creation.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.vocab import Vocab from spacy.tokens import Doc diff --git a/spacy/tests/doc/test_doc_api.py b/spacy/tests/doc/test_doc_api.py index 6801d7844..018830d37 100644 --- a/spacy/tests/doc/test_doc_api.py +++ b/spacy/tests/doc/test_doc_api.py @@ -1,7 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - - import pytest import numpy from spacy.tokens import Doc, Span diff --git a/spacy/tests/doc/test_morphanalysis.py b/spacy/tests/doc/test_morphanalysis.py index 5d570af53..221b6f683 100644 --- a/spacy/tests/doc/test_morphanalysis.py +++ b/spacy/tests/doc/test_morphanalysis.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -12,22 +9,54 @@ def i_has(en_tokenizer): return doc -def test_token_morph_id(i_has): - assert i_has[0].morph.id - assert i_has[1].morph.id != 0 - assert i_has[0].morph.id != i_has[1].morph.id +def test_token_morph_eq(i_has): + assert i_has[0].morph is not i_has[0].morph + assert i_has[0].morph == i_has[0].morph + assert i_has[0].morph != i_has[1].morph + + +def test_token_morph_key(i_has): + assert i_has[0].morph.key != 0 + assert i_has[1].morph.key != 0 + assert i_has[0].morph.key == i_has[0].morph.key + assert i_has[0].morph.key != i_has[1].morph.key def test_morph_props(i_has): - assert i_has[0].morph.pron_type == i_has.vocab.strings["PronType_prs"] - assert i_has[0].morph.pron_type_ == "PronType_prs" - assert i_has[1].morph.pron_type == 0 + assert 
i_has[0].morph.get("PronType") == ["PronType=prs"] + assert i_has[1].morph.get("PronType") == [] def test_morph_iter(i_has): - assert list(i_has[0].morph) == ["PronType_prs"] - assert list(i_has[1].morph) == ["Number_sing", "Person_three", "VerbForm_fin"] + assert set(i_has[0].morph) == set(["PronType=prs"]) + assert set(i_has[1].morph) == set( + ["Number=sing", "Person=three", "Tense=pres", "VerbForm=fin"] + ) def test_morph_get(i_has): - assert i_has[0].morph.get("pron_type") == "PronType_prs" + assert i_has[0].morph.get("PronType") == ["PronType=prs"] + + +def test_morph_set(i_has): + assert i_has[0].morph.get("PronType") == ["PronType=prs"] + # set by string + i_has[0].morph_ = "PronType=unk" + assert i_has[0].morph.get("PronType") == ["PronType=unk"] + # set by string, fields are alphabetized + i_has[0].morph_ = "PronType=123|NounType=unk" + assert i_has[0].morph_ == "NounType=unk|PronType=123" + # set by dict + i_has[0].morph_ = {"AType": "123", "BType": "unk", "POS": "ADJ"} + assert i_has[0].morph_ == "AType=123|BType=unk|POS=ADJ" + # set by string with multiple values, fields and values are alphabetized + i_has[0].morph_ = "BType=c|AType=b,a" + assert i_has[0].morph_ == "AType=a,b|BType=c" + # set by dict with multiple values, fields and values are alphabetized + i_has[0].morph_ = {"AType": "b,a", "BType": "c"} + assert i_has[0].morph_ == "AType=a,b|BType=c" + + +def test_morph_str(i_has): + assert str(i_has[0].morph) == "PronType=prs" + assert str(i_has[1].morph) == "Number=sing|Person=three|Tense=pres|VerbForm=fin" diff --git a/spacy/tests/doc/test_pickle_doc.py b/spacy/tests/doc/test_pickle_doc.py index 2b6970a38..28cb66714 100644 --- a/spacy/tests/doc/test_pickle_doc.py +++ b/spacy/tests/doc/test_pickle_doc.py @@ -1,8 +1,5 @@ -# coding: utf-8 -from __future__ import unicode_literals - from spacy.language import Language -from spacy.compat import pickle, unicode_ +from spacy.compat import pickle def test_pickle_single_doc(): @@ -16,9 +13,9 @@ def test_pickle_single_doc(): def test_list_of_docs_pickles_efficiently(): nlp = Language() for i in range(10000): - _ = nlp.vocab[unicode_(i)] # noqa: F841 + _ = nlp.vocab[str(i)] # noqa: F841 one_pickled = pickle.dumps(nlp("0"), -1) - docs = list(nlp.pipe(unicode_(i) for i in range(100))) + docs = list(nlp.pipe(str(i) for i in range(100))) many_pickled = pickle.dumps(docs, -1) assert len(many_pickled) < (len(one_pickled) * 2) many_unpickled = pickle.loads(many_pickled) diff --git a/spacy/tests/doc/test_retokenize_merge.py b/spacy/tests/doc/test_retokenize_merge.py index 5bdf78f39..5e564d1f2 100644 --- a/spacy/tests/doc/test_retokenize_merge.py +++ b/spacy/tests/doc/test_retokenize_merge.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.attrs import LEMMA from spacy.vocab import Vocab @@ -11,7 +8,12 @@ from ..util import get_doc def test_doc_retokenize_merge(en_tokenizer): text = "WKRO played songs by the beach boys all night" - attrs = {"tag": "NAMED", "lemma": "LEMMA", "ent_type": "TYPE"} + attrs = { + "tag": "NAMED", + "lemma": "LEMMA", + "ent_type": "TYPE", + "morph": "Number=Plur", + } doc = en_tokenizer(text) assert len(doc) == 9 with doc.retokenize() as retokenizer: @@ -21,9 +23,11 @@ def test_doc_retokenize_merge(en_tokenizer): assert doc[4].text == "the beach boys" assert doc[4].text_with_ws == "the beach boys " assert doc[4].tag_ == "NAMED" + assert doc[4].morph_ == "Number=Plur" assert doc[5].text == "all night" assert doc[5].text_with_ws == "all night" assert doc[5].tag_ == 
"NAMED" + assert doc[5].morph_ == "Number=Plur" def test_doc_retokenize_merge_children(en_tokenizer): diff --git a/spacy/tests/doc/test_retokenize_split.py b/spacy/tests/doc/test_retokenize_split.py index d074fddc6..5f40da425 100644 --- a/spacy/tests/doc/test_retokenize_split.py +++ b/spacy/tests/doc/test_retokenize_split.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.vocab import Vocab from spacy.tokens import Doc, Token @@ -25,15 +22,18 @@ def test_doc_retokenize_split(en_vocab): "tag": ["NNP"] * 2, "lemma": ["Los", "Angeles"], "ent_type": ["GPE"] * 2, + "morph": ["Number=Sing"] * 2, }, ) assert len(doc) == 4 assert doc[0].text == "Los" assert doc[0].head.text == "Angeles" assert doc[0].idx == 0 + assert doc[0].morph_ == "Number=Sing" assert doc[1].idx == 3 assert doc[1].text == "Angeles" assert doc[1].head.text == "start" + assert doc[1].morph_ == "Number=Sing" assert doc[2].text == "start" assert doc[2].head.text == "." assert doc[3].text == "." diff --git a/spacy/tests/doc/test_span.py b/spacy/tests/doc/test_span.py index e76ca4697..43c699d21 100644 --- a/spacy/tests/doc/test_span.py +++ b/spacy/tests/doc/test_span.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.attrs import ORTH, LENGTH from spacy.tokens import Doc, Span diff --git a/spacy/tests/doc/test_to_json.py b/spacy/tests/doc/test_to_json.py index a063a6569..da3bc7dbb 100644 --- a/spacy/tests/doc/test_to_json.py +++ b/spacy/tests/doc/test_to_json.py @@ -1,9 +1,4 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest -from spacy.cli._schemas import TRAINING_SCHEMA -from spacy.util import get_json_validator, validate_json from spacy.tokens import Doc from ..util import get_doc @@ -58,10 +53,3 @@ def test_doc_to_json_underscore_error_serialize(doc): Doc.set_extension("json_test4", method=lambda doc: doc.text) with pytest.raises(ValueError): doc.to_json(underscore=["json_test4"]) - - -def test_doc_to_json_valid_training(doc): - json_doc = doc.to_json() - validator = get_json_validator(TRAINING_SCHEMA) - errors = validate_json([json_doc], validator) - assert not errors diff --git a/spacy/tests/doc/test_token_api.py b/spacy/tests/doc/test_token_api.py index 4dcd07ad9..be56c9b71 100644 --- a/spacy/tests/doc/test_token_api.py +++ b/spacy/tests/doc/test_token_api.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import numpy from spacy.attrs import IS_ALPHA, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_TITLE, IS_STOP diff --git a/spacy/tests/doc/test_underscore.py b/spacy/tests/doc/test_underscore.py index c1eff2c20..b934221af 100644 --- a/spacy/tests/doc/test_underscore.py +++ b/spacy/tests/doc/test_underscore.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from mock import Mock from spacy.tokens import Doc, Span, Token diff --git a/spacy/tests/lang/ar/test_exceptions.py b/spacy/tests/lang/ar/test_exceptions.py index 3cfc380d2..125220caf 100644 --- a/spacy/tests/lang/ar/test_exceptions.py +++ b/spacy/tests/lang/ar/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ar/test_text.py b/spacy/tests/lang/ar/test_text.py index 109c3721a..c5ab376f1 100644 --- a/spacy/tests/lang/ar/test_text.py +++ b/spacy/tests/lang/ar/test_text.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - def 
test_ar_tokenizer_handles_long_text(ar_tokenizer): text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين. ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها، diff --git a/spacy/tests/lang/bn/test_tokenizer.py b/spacy/tests/lang/bn/test_tokenizer.py index 62dd52778..5b18c5269 100644 --- a/spacy/tests/lang/bn/test_tokenizer.py +++ b/spacy/tests/lang/bn/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ca/test_exception.py b/spacy/tests/lang/ca/test_exception.py index 56156c328..71098f094 100644 --- a/spacy/tests/lang/ca/test_exception.py +++ b/spacy/tests/lang/ca/test_exception.py @@ -1,7 +1,3 @@ -# coding: utf-8 - -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ca/test_prefix_suffix_infix.py b/spacy/tests/lang/ca/test_prefix_suffix_infix.py index 4583a62b9..83a75f056 100644 --- a/spacy/tests/lang/ca/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/ca/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ca/test_text.py b/spacy/tests/lang/ca/test_text.py index 1506016d4..38f5fc708 100644 --- a/spacy/tests/lang/ca/test_text.py +++ b/spacy/tests/lang/ca/test_text.py @@ -1,10 +1,4 @@ -# coding: utf-8 - """Test that longer and mixed texts are tokenized correctly.""" - - -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/da/test_exceptions.py b/spacy/tests/lang/da/test_exceptions.py index 503399ee4..bd9f2710e 100644 --- a/spacy/tests/lang/da/test_exceptions.py +++ b/spacy/tests/lang/da/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/da/test_prefix_suffix_infix.py b/spacy/tests/lang/da/test_prefix_suffix_infix.py index 8b43bf360..e36b3cdb9 100644 --- a/spacy/tests/lang/da/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/da/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/da/test_text.py b/spacy/tests/lang/da/test_text.py index 07b134e2d..3c6cca5ac 100644 --- a/spacy/tests/lang/da/test_text.py +++ b/spacy/tests/lang/da/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.da.lex_attrs import like_num diff --git a/spacy/tests/lang/de/test_exceptions.py b/spacy/tests/lang/de/test_exceptions.py index 3b464e1ae..a1bbaf58b 100644 --- a/spacy/tests/lang/de/test_exceptions.py +++ b/spacy/tests/lang/de/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/de/test_parser.py b/spacy/tests/lang/de/test_parser.py index 5c8694da3..c897dcf2f 100644 --- a/spacy/tests/lang/de/test_parser.py +++ b/spacy/tests/lang/de/test_parser.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - from ...util import get_doc diff --git a/spacy/tests/lang/de/test_prefix_suffix_infix.py b/spacy/tests/lang/de/test_prefix_suffix_infix.py index 13e109395..82bd8ed69 100644 --- a/spacy/tests/lang/de/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/de/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/de/test_text.py b/spacy/tests/lang/de/test_text.py index 
b3fb1eaa5..22711763e 100644 --- a/spacy/tests/lang/de/test_text.py +++ b/spacy/tests/lang/de/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/el/test_exception.py b/spacy/tests/lang/el/test_exception.py index b8d10fb69..a4656ea98 100644 --- a/spacy/tests/lang/el/test_exception.py +++ b/spacy/tests/lang/el/test_exception.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/el/test_text.py b/spacy/tests/lang/el/test_text.py index a6395ab4a..1b3ef6182 100644 --- a/spacy/tests/lang/el/test_text.py +++ b/spacy/tests/lang/el/test_text.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/en/test_customized_tokenizer.py b/spacy/tests/lang/en/test_customized_tokenizer.py index 7f939011f..f5302cb31 100644 --- a/spacy/tests/lang/en/test_customized_tokenizer.py +++ b/spacy/tests/lang/en/test_customized_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import re from spacy.lang.en import English diff --git a/spacy/tests/lang/en/test_exceptions.py b/spacy/tests/lang/en/test_exceptions.py index a78e1815f..ce0dac50b 100644 --- a/spacy/tests/lang/en/test_exceptions.py +++ b/spacy/tests/lang/en/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/en/test_indices.py b/spacy/tests/lang/en/test_indices.py index 8a7bc0323..93daeec30 100644 --- a/spacy/tests/lang/en/test_indices.py +++ b/spacy/tests/lang/en/test_indices.py @@ -1,7 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - - def test_en_simple_punct(en_tokenizer): text = "to walk, do foo" tokens = en_tokenizer(text) diff --git a/spacy/tests/lang/en/test_noun_chunks.py b/spacy/tests/lang/en/test_noun_chunks.py index ff67986a5..2d3362317 100644 --- a/spacy/tests/lang/en/test_noun_chunks.py +++ b/spacy/tests/lang/en/test_noun_chunks.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import numpy from spacy.attrs import HEAD, DEP from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root diff --git a/spacy/tests/lang/en/test_parser.py b/spacy/tests/lang/en/test_parser.py index ce696bc25..057143696 100644 --- a/spacy/tests/lang/en/test_parser.py +++ b/spacy/tests/lang/en/test_parser.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - from ...util import get_doc diff --git a/spacy/tests/lang/en/test_prefix_suffix_infix.py b/spacy/tests/lang/en/test_prefix_suffix_infix.py index 3dccd6bcf..9efcc1015 100644 --- a/spacy/tests/lang/en/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/en/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -111,7 +108,6 @@ def test_en_tokenizer_splits_double_hyphen_infix(en_tokenizer): assert tokens[9].text == "people" -@pytest.mark.xfail def test_en_tokenizer_splits_period_abbr(en_tokenizer): text = "Today is Tuesday.Mr." 
tokens = en_tokenizer(text) diff --git a/spacy/tests/lang/en/test_punct.py b/spacy/tests/lang/en/test_punct.py index 61274cf14..1d10478a1 100644 --- a/spacy/tests/lang/en/test_punct.py +++ b/spacy/tests/lang/en/test_punct.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.util import compile_prefix_regex from spacy.lang.punctuation import TOKENIZER_PREFIXES @@ -82,7 +79,6 @@ def test_en_tokenizer_splits_open_appostrophe(en_tokenizer, text): assert tokens[0].text == "'" -@pytest.mark.xfail @pytest.mark.parametrize("text", ["Hello''"]) def test_en_tokenizer_splits_double_end_quote(en_tokenizer, text): tokens = en_tokenizer(text) diff --git a/spacy/tests/lang/en/test_sbd.py b/spacy/tests/lang/en/test_sbd.py index 40bd110e8..ba7b2f2cf 100644 --- a/spacy/tests/lang/en/test_sbd.py +++ b/spacy/tests/lang/en/test_sbd.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from ...util import get_doc, apply_transition_sequence diff --git a/spacy/tests/lang/en/test_tagger.py b/spacy/tests/lang/en/test_tagger.py index 567fd5a44..d9eced2ff 100644 --- a/spacy/tests/lang/en/test_tagger.py +++ b/spacy/tests/lang/en/test_tagger.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - from ...util import get_doc diff --git a/spacy/tests/lang/en/test_text.py b/spacy/tests/lang/en/test_text.py index a7ebde989..c5d56d885 100644 --- a/spacy/tests/lang/en/test_text.py +++ b/spacy/tests/lang/en/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.en.lex_attrs import like_num diff --git a/spacy/tests/lang/es/test_exception.py b/spacy/tests/lang/es/test_exception.py index 8d6164058..90d897a4c 100644 --- a/spacy/tests/lang/es/test_exception.py +++ b/spacy/tests/lang/es/test_exception.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/es/test_text.py b/spacy/tests/lang/es/test_text.py index 999e788dd..96f6bcab5 100644 --- a/spacy/tests/lang/es/test_text.py +++ b/spacy/tests/lang/es/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.es.lex_attrs import like_num diff --git a/spacy/tests/lang/eu/test_text.py b/spacy/tests/lang/eu/test_text.py index f448a7859..94d5ac91d 100644 --- a/spacy/tests/lang/eu/test_text.py +++ b/spacy/tests/lang/eu/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/fi/test_text.py b/spacy/tests/lang/fi/test_text.py index 2dd92597e..dbb67ad7a 100644 --- a/spacy/tests/lang/fi/test_text.py +++ b/spacy/tests/lang/fi/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/fi/test_tokenizer.py b/spacy/tests/lang/fi/test_tokenizer.py index 301b85d74..ae16c7eea 100644 --- a/spacy/tests/lang/fi/test_tokenizer.py +++ b/spacy/tests/lang/fi/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/fr/test_exceptions.py b/spacy/tests/lang/fr/test_exceptions.py index 93dbf0993..98d318f6e 100644 --- a/spacy/tests/lang/fr/test_exceptions.py +++ b/spacy/tests/lang/fr/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/fr/test_prefix_suffix_infix.py 
b/spacy/tests/lang/fr/test_prefix_suffix_infix.py index ca6bdbd87..01d50b0a6 100644 --- a/spacy/tests/lang/fr/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/fr/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.language import Language from spacy.lang.punctuation import TOKENIZER_INFIXES diff --git a/spacy/tests/lang/fr/test_text.py b/spacy/tests/lang/fr/test_text.py index 24b4c4532..01231f593 100644 --- a/spacy/tests/lang/fr/test_text.py +++ b/spacy/tests/lang/fr/test_text.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.lang.fr.lex_attrs import like_num diff --git a/spacy/tests/lang/ga/test_tokenizer.py b/spacy/tests/lang/ga/test_tokenizer.py index 29bc1c759..78127ef7c 100644 --- a/spacy/tests/lang/ga/test_tokenizer.py +++ b/spacy/tests/lang/ga/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/he/test_tokenizer.py b/spacy/tests/lang/he/test_tokenizer.py index f138ec6e7..3131014a3 100644 --- a/spacy/tests/lang/he/test_tokenizer.py +++ b/spacy/tests/lang/he/test_tokenizer.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/hu/test_tokenizer.py b/spacy/tests/lang/hu/test_tokenizer.py index 1ac6bfc76..fd3acd0a0 100644 --- a/spacy/tests/lang/hu/test_tokenizer.py +++ b/spacy/tests/lang/hu/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/id/test_prefix_suffix_infix.py b/spacy/tests/lang/id/test_prefix_suffix_infix.py index e86a98ee3..2a81dab01 100644 --- a/spacy/tests/lang/id/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/id/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/id/test_text.py b/spacy/tests/lang/id/test_text.py index 915d268ae..ed6487b68 100644 --- a/spacy/tests/lang/id/test_text.py +++ b/spacy/tests/lang/id/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.id.lex_attrs import like_num diff --git a/spacy/tests/lang/it/test_prefix_suffix_infix.py b/spacy/tests/lang/it/test_prefix_suffix_infix.py index f84351fd7..46f66b5e6 100644 --- a/spacy/tests/lang/it/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/it/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ja/test_lemmatization.py b/spacy/tests/lang/ja/test_lemmatization.py index cfff0fcfe..4cb3110b3 100644 --- a/spacy/tests/lang/ja/test_lemmatization.py +++ b/spacy/tests/lang/ja/test_lemmatization.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ja/test_tokenizer.py b/spacy/tests/lang/ja/test_tokenizer.py index ad8bfaa00..481f346bb 100644 --- a/spacy/tests/lang/ja/test_tokenizer.py +++ b/spacy/tests/lang/ja/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ko/test_lemmatization.py b/spacy/tests/lang/ko/test_lemmatization.py index 42c306c11..7782ca4bc 100644 --- a/spacy/tests/lang/ko/test_lemmatization.py +++ b/spacy/tests/lang/ko/test_lemmatization.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import 
pytest diff --git a/spacy/tests/lang/ko/test_tokenizer.py b/spacy/tests/lang/ko/test_tokenizer.py index b8fe7959c..eac309857 100644 --- a/spacy/tests/lang/ko/test_tokenizer.py +++ b/spacy/tests/lang/ko/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest # fmt: off diff --git a/spacy/tests/lang/lb/test_exceptions.py b/spacy/tests/lang/lb/test_exceptions.py index ebfab75cf..d941a854b 100644 --- a/spacy/tests/lang/lb/test_exceptions.py +++ b/spacy/tests/lang/lb/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/lb/test_prefix_suffix_infix.py b/spacy/tests/lang/lb/test_prefix_suffix_infix.py index d85f932be..3958d1543 100644 --- a/spacy/tests/lang/lb/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/lb/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/lb/test_text.py b/spacy/tests/lang/lb/test_text.py index 36464b379..b0ba76b6b 100644 --- a/spacy/tests/lang/lb/test_text.py +++ b/spacy/tests/lang/lb/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/lt/test_text.py b/spacy/tests/lang/lt/test_text.py index bb9c75383..9e2b612b9 100644 --- a/spacy/tests/lang/lt/test_text.py +++ b/spacy/tests/lang/lt/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/nb/test_tokenizer.py b/spacy/tests/lang/nb/test_tokenizer.py index f72d310e8..2da6e8d40 100644 --- a/spacy/tests/lang/nb/test_tokenizer.py +++ b/spacy/tests/lang/nb/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/nl/test_text.py b/spacy/tests/lang/nl/test_text.py index 4045b1c39..8bc72cc6d 100644 --- a/spacy/tests/lang/nl/test_text.py +++ b/spacy/tests/lang/nl/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.nl.lex_attrs import like_num diff --git a/spacy/tests/lang/pl/test_text.py b/spacy/tests/lang/pl/test_text.py index ec9b18084..e8654a498 100644 --- a/spacy/tests/lang/pl/test_text.py +++ b/spacy/tests/lang/pl/test_text.py @@ -1,9 +1,4 @@ -# coding: utf-8 """Words like numbers are recognized correctly.""" - - -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/pl/test_tokenizer.py b/spacy/tests/lang/pl/test_tokenizer.py index 9f4f5a38d..44b1be9a6 100644 --- a/spacy/tests/lang/pl/test_tokenizer.py +++ b/spacy/tests/lang/pl/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest DOT_TESTS = [ diff --git a/spacy/tests/lang/pt/test_text.py b/spacy/tests/lang/pt/test_text.py index 39dfff2c1..3a9162b80 100644 --- a/spacy/tests/lang/pt/test_text.py +++ b/spacy/tests/lang/pt/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.pt.lex_attrs import like_num diff --git a/spacy/tests/lang/ro/test_tokenizer.py b/spacy/tests/lang/ro/test_tokenizer.py index a327174e5..64c072470 100644 --- a/spacy/tests/lang/ro/test_tokenizer.py +++ b/spacy/tests/lang/ro/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ru/test_exceptions.py b/spacy/tests/lang/ru/test_exceptions.py index 
a8f0c3429..4fb417df8 100644 --- a/spacy/tests/lang/ru/test_exceptions.py +++ b/spacy/tests/lang/ru/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ru/test_lemmatizer.py b/spacy/tests/lang/ru/test_lemmatizer.py index b228fded8..40dcf4cf8 100644 --- a/spacy/tests/lang/ru/test_lemmatizer.py +++ b/spacy/tests/lang/ru/test_lemmatizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from ...util import get_doc diff --git a/spacy/tests/lang/ru/test_text.py b/spacy/tests/lang/ru/test_text.py index c5bff6973..b0eaf66bb 100644 --- a/spacy/tests/lang/ru/test_text.py +++ b/spacy/tests/lang/ru/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.ru.lex_attrs import like_num diff --git a/spacy/tests/lang/ru/test_tokenizer.py b/spacy/tests/lang/ru/test_tokenizer.py index 5507f9f09..1cfdc50ee 100644 --- a/spacy/tests/lang/ru/test_tokenizer.py +++ b/spacy/tests/lang/ru/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -80,7 +77,6 @@ def test_ru_tokenizer_splits_open_appostrophe(ru_tokenizer, text): assert tokens[0].text == "'" -@pytest.mark.xfail @pytest.mark.parametrize("text", ["Тест''"]) def test_ru_tokenizer_splits_double_end_quote(ru_tokenizer, text): tokens = ru_tokenizer(text) diff --git a/spacy/tests/lang/sr/test_exceptions.py b/spacy/tests/lang/sr/test_exceptions.py index 285e99996..fa92e5e2d 100644 --- a/spacy/tests/lang/sr/test_exceptions.py +++ b/spacy/tests/lang/sr/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/sr/test_tokenizer.py b/spacy/tests/lang/sr/test_tokenizer.py index c4672b3ef..fdcf790d8 100644 --- a/spacy/tests/lang/sr/test_tokenizer.py +++ b/spacy/tests/lang/sr/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -80,7 +77,6 @@ def test_sr_tokenizer_splits_open_appostrophe(sr_tokenizer, text): assert tokens[0].text == "'" -@pytest.mark.xfail @pytest.mark.parametrize("text", ["Тест''"]) def test_sr_tokenizer_splits_double_end_quote(sr_tokenizer, text): tokens = sr_tokenizer(text) diff --git a/spacy/tests/lang/sv/test_exceptions.py b/spacy/tests/lang/sv/test_exceptions.py index 7c6fd5464..e6cae4d2b 100644 --- a/spacy/tests/lang/sv/test_exceptions.py +++ b/spacy/tests/lang/sv/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/sv/test_noun_chunks.py b/spacy/tests/lang/sv/test_noun_chunks.py index a6283b65e..f352ca648 100644 --- a/spacy/tests/lang/sv/test_noun_chunks.py +++ b/spacy/tests/lang/sv/test_noun_chunks.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from ...util import get_doc diff --git a/spacy/tests/lang/sv/test_prefix_suffix_infix.py b/spacy/tests/lang/sv/test_prefix_suffix_infix.py index f3fdd9a9e..bbb0ff415 100644 --- a/spacy/tests/lang/sv/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/sv/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/sv/test_text.py b/spacy/tests/lang/sv/test_text.py index 9ea1851ae..1e26c45bc 100644 --- a/spacy/tests/lang/sv/test_text.py +++ b/spacy/tests/lang/sv/test_text.py @@ -1,7 +1,3 @@ -# coding: utf-8 -from 
__future__ import unicode_literals - - def test_sv_tokenizer_handles_long_text(sv_tokenizer): text = """Det var så härligt ute på landet. Det var sommar, majsen var gul, havren grön, höet var uppställt i stackar nere vid den gröna ängen, och där gick storken på sina långa, diff --git a/spacy/tests/lang/sv/test_tokenizer.py b/spacy/tests/lang/sv/test_tokenizer.py index 894b5aa6a..8871f4414 100644 --- a/spacy/tests/lang/sv/test_tokenizer.py +++ b/spacy/tests/lang/sv/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/test_attrs.py b/spacy/tests/lang/test_attrs.py index 4bb5aac70..b39109455 100644 --- a/spacy/tests/lang/test_attrs.py +++ b/spacy/tests/lang/test_attrs.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA from spacy.lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape diff --git a/spacy/tests/lang/test_initialize.py b/spacy/tests/lang/test_initialize.py index 5c701fc22..de1871e64 100644 --- a/spacy/tests/lang/test_initialize.py +++ b/spacy/tests/lang/test_initialize.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.util import get_lang_class diff --git a/spacy/tests/lang/th/test_tokenizer.py b/spacy/tests/lang/th/test_tokenizer.py index 265c7753d..1e1ba52dc 100644 --- a/spacy/tests/lang/th/test_tokenizer.py +++ b/spacy/tests/lang/th/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/tt/test_tokenizer.py b/spacy/tests/lang/tt/test_tokenizer.py index f6c68a401..246d2824d 100644 --- a/spacy/tests/lang/tt/test_tokenizer.py +++ b/spacy/tests/lang/tt/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/uk/test_tokenizer.py b/spacy/tests/lang/uk/test_tokenizer.py index f744b32b0..eb647a041 100644 --- a/spacy/tests/lang/uk/test_tokenizer.py +++ b/spacy/tests/lang/uk/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/uk/test_tokenizer_exc.py b/spacy/tests/lang/uk/test_tokenizer_exc.py index 328e1d287..4fb4a6b31 100644 --- a/spacy/tests/lang/uk/test_tokenizer_exc.py +++ b/spacy/tests/lang/uk/test_tokenizer_exc.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ur/test_prefix_suffix_infix.py b/spacy/tests/lang/ur/test_prefix_suffix_infix.py index de11c9b34..e9f3272f4 100644 --- a/spacy/tests/lang/ur/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/ur/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ur/test_text.py b/spacy/tests/lang/ur/test_text.py index 546e79182..5da831cf8 100644 --- a/spacy/tests/lang/ur/test_text.py +++ b/spacy/tests/lang/ur/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/yo/test_text.py b/spacy/tests/lang/yo/test_text.py index ce6408b67..48b689f3d 100644 --- a/spacy/tests/lang/yo/test_text.py +++ b/spacy/tests/lang/yo/test_text.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.lang.yo.lex_attrs import like_num diff --git 
a/spacy/tests/lang/zh/test_text.py b/spacy/tests/lang/zh/test_text.py index 3a3ccbdde..148257329 100644 --- a/spacy/tests/lang/zh/test_text.py +++ b/spacy/tests/lang/zh/test_text.py @@ -1,7 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - - import pytest diff --git a/spacy/tests/lang/zh/test_tokenizer.py b/spacy/tests/lang/zh/test_tokenizer.py index 28240b6a9..7af8a7604 100644 --- a/spacy/tests/lang/zh/test_tokenizer.py +++ b/spacy/tests/lang/zh/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.zh import _get_pkuseg_trie_data diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index 1112195da..98542e80f 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import re from mock import Mock @@ -183,7 +180,7 @@ def test_matcher_match_one_plus(matcher): doc = Doc(control.vocab, words=["Philippe", "Philippe"]) m = control(doc) assert len(m) == 2 - pattern = [{"ORTH": "Philippe", "OP": "1"}, {"ORTH": "Philippe", "OP": "+"}] + pattern = [{"ORTH": "Philippe"}, {"ORTH": "Philippe", "OP": "+"}] matcher.add("KleenePhilippe", [pattern]) m = matcher(doc) assert len(m) == 1 diff --git a/spacy/tests/matcher/test_matcher_logic.py b/spacy/tests/matcher/test_matcher_logic.py index 240ace537..a2b2cd83f 100644 --- a/spacy/tests/matcher/test_matcher_logic.py +++ b/spacy/tests/matcher/test_matcher_logic.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import re @@ -9,18 +6,18 @@ from spacy.matcher import Matcher from spacy.tokens import Doc, Span -pattern1 = [{"ORTH": "A", "OP": "1"}, {"ORTH": "A", "OP": "*"}] -pattern2 = [{"ORTH": "A", "OP": "*"}, {"ORTH": "A", "OP": "1"}] -pattern3 = [{"ORTH": "A", "OP": "1"}, {"ORTH": "A", "OP": "1"}] +pattern1 = [{"ORTH": "A"}, {"ORTH": "A", "OP": "*"}] +pattern2 = [{"ORTH": "A"}, {"ORTH": "A"}] +pattern3 = [{"ORTH": "A"}, {"ORTH": "A"}] pattern4 = [ - {"ORTH": "B", "OP": "1"}, + {"ORTH": "B"}, {"ORTH": "A", "OP": "*"}, - {"ORTH": "B", "OP": "1"}, + {"ORTH": "B"}, ] pattern5 = [ {"ORTH": "B", "OP": "*"}, {"ORTH": "A", "OP": "*"}, - {"ORTH": "B", "OP": "1"}, + {"ORTH": "B"}, ] re_pattern1 = "AA*" diff --git a/spacy/tests/matcher/test_pattern_validation.py b/spacy/tests/matcher/test_pattern_validation.py index c536698d0..5dea3dde2 100644 --- a/spacy/tests/matcher/test_pattern_validation.py +++ b/spacy/tests/matcher/test_pattern_validation.py @@ -1,11 +1,7 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.matcher import Matcher -from spacy.matcher._schemas import TOKEN_PATTERN_SCHEMA from spacy.errors import MatchPatternError -from spacy.util import get_json_validator, validate_json +from spacy.schemas import validate_token_pattern # (pattern, num errors with validation, num errors identified with minimal # checks) @@ -18,12 +14,12 @@ TEST_PATTERNS = [ ('[{"TEXT": "foo"}, {"LOWER": "bar"}]', 1, 1), ([1, 2, 3], 3, 1), # Bad patterns flagged outside of Matcher - ([{"_": {"foo": "bar", "baz": {"IN": "foo"}}}], 1, 0), + ([{"_": {"foo": "bar", "baz": {"IN": "foo"}}}], 2, 0), # prev: (1, 0) # Bad patterns not flagged with minimal checks ([{"LENGTH": "2", "TEXT": 2}, {"LOWER": "test"}], 2, 0), - ([{"LENGTH": {"IN": [1, 2, "3"]}}, {"POS": {"IN": "VERB"}}], 2, 0), - ([{"LENGTH": {"VALUE": 5}}], 1, 0), - ([{"TEXT": {"VALUE": "foo"}}], 1, 0), 
+ ([{"LENGTH": {"IN": [1, 2, "3"]}}, {"POS": {"IN": "VERB"}}], 4, 0), # prev: (2, 0) + ([{"LENGTH": {"VALUE": 5}}], 2, 0), # prev: (1, 0) + ([{"TEXT": {"VALUE": "foo"}}], 2, 0), # prev: (1, 0) ([{"IS_DIGIT": -1}], 1, 0), ([{"ORTH": -1}], 1, 0), # Good patterns @@ -34,17 +30,11 @@ TEST_PATTERNS = [ ([{"LOWER": {"REGEX": "^X", "NOT_IN": ["XXX", "XY"]}}], 0, 0), ([{"NORM": "a"}, {"POS": {"IN": ["NOUN"]}}], 0, 0), ([{"_": {"foo": {"NOT_IN": ["bar", "baz"]}, "a": 5, "b": {">": 10}}}], 0, 0), + ([{"orth": "foo"}], 0, 0), # prev: xfail ([{"IS_SENT_START": True}], 0, 0), ([{"SENT_START": True}], 0, 0), ] -XFAIL_TEST_PATTERNS = [([{"orth": "foo"}], 0, 0)] - - -@pytest.fixture -def validator(): - return get_json_validator(TOKEN_PATTERN_SCHEMA) - @pytest.mark.parametrize( "pattern", [[{"XX": "y"}, {"LENGTH": "2"}, {"TEXT": {"IN": 5}}]] @@ -56,15 +46,8 @@ def test_matcher_pattern_validation(en_vocab, pattern): @pytest.mark.parametrize("pattern,n_errors,_", TEST_PATTERNS) -def test_pattern_validation(validator, pattern, n_errors, _): - errors = validate_json(pattern, validator) - assert len(errors) == n_errors - - -@pytest.mark.xfail -@pytest.mark.parametrize("pattern,n_errors,_", XFAIL_TEST_PATTERNS) -def test_xfail_pattern_validation(validator, pattern, n_errors, _): - errors = validate_json(pattern, validator) +def test_pattern_validation(pattern, n_errors, _): + errors = validate_token_pattern(pattern) assert len(errors) == n_errors diff --git a/spacy/tests/matcher/test_phrase_matcher.py b/spacy/tests/matcher/test_phrase_matcher.py index 7a6585e06..23cd80d1d 100644 --- a/spacy/tests/matcher/test_phrase_matcher.py +++ b/spacy/tests/matcher/test_phrase_matcher.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from mock import Mock from spacy.matcher import PhraseMatcher diff --git a/spacy/tests/morphology/test_morph_converters.py b/spacy/tests/morphology/test_morph_converters.py new file mode 100644 index 000000000..9486cad45 --- /dev/null +++ b/spacy/tests/morphology/test_morph_converters.py @@ -0,0 +1,25 @@ +from spacy.morphology import Morphology + + +def test_feats_converters(): + feats = "Case=dat,gen|Number=sing" + feats_dict = {"Case": "dat,gen", "Number": "sing"} + feats_list = feats.split(Morphology.FEATURE_SEP) + + # simple conversions + assert Morphology.list_to_feats(feats_list) == feats + assert Morphology.dict_to_feats(feats_dict) == feats + assert Morphology.feats_to_dict(feats) == feats_dict + + # roundtrips + assert Morphology.dict_to_feats(Morphology.feats_to_dict(feats)) == feats + assert Morphology.feats_to_dict(Morphology.dict_to_feats(feats_dict)) == feats_dict + + # unsorted input is normalized + unsorted_feats = "Number=sing|Case=gen,dat" + unsorted_feats_dict = {"Case": "gen,dat", "Number": "sing"} + unsorted_feats_list = feats.split(Morphology.FEATURE_SEP) + assert Morphology.feats_to_dict(unsorted_feats) == feats_dict + assert Morphology.dict_to_feats(unsorted_feats_dict) == feats + assert Morphology.list_to_feats(unsorted_feats_list) == feats + assert Morphology.dict_to_feats(Morphology.feats_to_dict(unsorted_feats)) == feats diff --git a/spacy/tests/morphology/test_morph_features.py b/spacy/tests/morphology/test_morph_features.py index 41f807143..f644a5867 100644 --- a/spacy/tests/morphology/test_morph_features.py +++ b/spacy/tests/morphology/test_morph_features.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.morphology import Morphology from spacy.strings import 
StringStore, get_string_id @@ -19,32 +16,37 @@ def test_init(morphology): def test_add_morphology_with_string_names(morphology): - morphology.add({"Case_gen", "Number_sing"}) + morphology.add({"Case": "gen", "Number": "sing"}) def test_add_morphology_with_int_ids(morphology): - morphology.add({get_string_id("Case_gen"), get_string_id("Number_sing")}) + morphology.strings.add("Case") + morphology.strings.add("gen") + morphology.strings.add("Number") + morphology.strings.add("sing") + morphology.add( + { + get_string_id("Case"): get_string_id("gen"), + get_string_id("Number"): get_string_id("sing"), + } + ) def test_add_morphology_with_mix_strings_and_ints(morphology): - morphology.add({get_string_id("PunctSide_ini"), "VerbType_aux"}) + morphology.strings.add("PunctSide") + morphology.strings.add("ini") + morphology.add( + {get_string_id("PunctSide"): get_string_id("ini"), "VerbType": "aux"} + ) def test_morphology_tags_hash_distinctly(morphology): - tag1 = morphology.add({"PunctSide_ini", "VerbType_aux"}) - tag2 = morphology.add({"Case_gen", "Number_sing"}) + tag1 = morphology.add({"PunctSide": "ini", "VerbType": "aux"}) + tag2 = morphology.add({"Case": "gen", "Number": "sing"}) assert tag1 != tag2 def test_morphology_tags_hash_independent_of_order(morphology): - tag1 = morphology.add({"Case_gen", "Number_sing"}) - tag2 = morphology.add({"Number_sing", "Case_gen"}) + tag1 = morphology.add({"Case": "gen", "Number": "sing"}) + tag2 = morphology.add({"Number": "sing", "Case": "gen"}) assert tag1 == tag2 - - -def test_update_morphology_tag(morphology): - tag1 = morphology.add({"Case_gen"}) - tag2 = morphology.update(tag1, {"Number_sing"}) - assert tag1 != tag2 - tag3 = morphology.add({"Number_sing", "Case_gen"}) - assert tag2 == tag3 diff --git a/spacy/tests/package/test_requirements.py b/spacy/tests/package/test_requirements.py new file mode 100644 index 000000000..59a8569ee --- /dev/null +++ b/spacy/tests/package/test_requirements.py @@ -0,0 +1,76 @@ +import re +from pathlib import Path + + +def test_build_dependencies(): + # Check that library requirements are pinned exactly the same across different setup files. 
+ libs_ignore_requirements = [ + "pytest", + "pytest-timeout", + "mock", + "flake8", + "jsonschema", + ] + libs_ignore_setup = ["fugashi", "natto-py", "pythainlp"] + + # check requirements.txt + req_dict = {} + + root_dir = Path(__file__).parent + req_file = root_dir / "requirements.txt" + with req_file.open() as f: + lines = f.readlines() + for line in lines: + line = line.strip() + if not line.startswith("#"): + lib, v = _parse_req(line) + if lib and lib not in libs_ignore_requirements: + req_dict[lib] = v + # check setup.cfg and compare to requirements.txt + # also fails when there are missing or additional libs + setup_file = root_dir / "setup.cfg" + with setup_file.open() as f: + lines = f.readlines() + + setup_keys = set() + for line in lines: + line = line.strip() + if not line.startswith("#"): + lib, v = _parse_req(line) + if lib and not lib.startswith("cupy") and lib not in libs_ignore_setup: + req_v = req_dict.get(lib, None) + assert ( + req_v is not None + ), "{} in setup.cfg but not in requirements.txt".format(lib) + assert (lib + v) == (lib + req_v), ( + "{} has different version in setup.cfg and in requirements.txt: " + "{} and {} respectively".format(lib, v, req_v) + ) + setup_keys.add(lib) + assert sorted(setup_keys) == sorted( + req_dict.keys() + ) # if fail: requirements.txt contains a lib not in setup.cfg + + # check pyproject.toml and compare the versions of the libs to requirements.txt + # does not fail when there are missing or additional libs + toml_file = root_dir / "pyproject.toml" + with toml_file.open() as f: + lines = f.readlines() + for line in lines: + line = line.strip().strip(",").strip('"') + if not line.startswith("#"): + lib, v = _parse_req(line) + if lib: + req_v = req_dict.get(lib, None) + assert (lib + v) == (lib + req_v), ( + "{} has different version in pyproject.toml and in requirements.txt: " + "{} and {} respectively".format(lib, v, req_v) + ) + + +def _parse_req(line): + lib = re.match(r"^[a-z0-9\-]*", line).group(0) + v = line.replace(lib, "").strip() + if not re.match(r"^[<>=][<>=].*", v): + return None, None + return lib, v diff --git a/spacy/tests/parser/test_add_label.py b/spacy/tests/parser/test_add_label.py index 4ab9c1e70..ee1bba886 100644 --- a/spacy/tests/parser/test_add_label.py +++ b/spacy/tests/parser/test_add_label.py @@ -1,12 +1,10 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest -from thinc.neural.optimizers import Adam -from thinc.neural.ops import NumpyOps +from thinc.api import Adam, NumpyOps from spacy.attrs import NORM from spacy.gold import GoldParse from spacy.vocab import Vocab + +from spacy.pipeline.defaults import default_parser, default_ner from spacy.tokens import Doc from spacy.pipeline import DependencyParser, EntityRecognizer from spacy.util import fix_random_seed @@ -19,7 +17,7 @@ def vocab(): @pytest.fixture def parser(vocab): - parser = DependencyParser(vocab) + parser = DependencyParser(vocab, default_parser()) return parser @@ -31,27 +29,27 @@ def _train_parser(parser): fix_random_seed(1) parser.add_label("left") parser.begin_training([], **parser.cfg) - sgd = Adam(NumpyOps(), 0.001) + sgd = Adam(0.001) for i in range(5): losses = {} doc = Doc(parser.vocab, words=["a", "b", "c", "d"]) gold = GoldParse(doc, heads=[1, 1, 3, 3], deps=["left", "ROOT", "left", "ROOT"]) - parser.update([doc], [gold], sgd=sgd, losses=losses) + parser.update((doc, gold), sgd=sgd, losses=losses) return parser def test_add_label(parser): parser = _train_parser(parser) parser.add_label("right") - sgd = 
Adam(NumpyOps(), 0.001) - for i in range(10): + sgd = Adam(0.001) + for i in range(100): losses = {} doc = Doc(parser.vocab, words=["a", "b", "c", "d"]) gold = GoldParse( doc, heads=[1, 1, 3, 3], deps=["right", "ROOT", "left", "ROOT"] ) - parser.update([doc], [gold], sgd=sgd, losses=losses) + parser.update((doc, gold), sgd=sgd, losses=losses) doc = Doc(parser.vocab, words=["a", "b", "c", "d"]) doc = parser(doc) assert doc[0].dep_ == "right" @@ -59,27 +57,32 @@ def test_add_label(parser): def test_add_label_deserializes_correctly(): - ner1 = EntityRecognizer(Vocab()) + ner1 = EntityRecognizer(Vocab(), default_ner()) ner1.add_label("C") ner1.add_label("B") ner1.add_label("A") ner1.begin_training([]) - ner2 = EntityRecognizer(Vocab()).from_bytes(ner1.to_bytes()) + ner2 = EntityRecognizer(Vocab(), default_ner()) + + # the second model needs to be resized before we can call from_bytes + ner2.model.attrs["resize_output"](ner2.model, ner1.moves.n_moves) + ner2.from_bytes(ner1.to_bytes()) assert ner1.moves.n_moves == ner2.moves.n_moves for i in range(ner1.moves.n_moves): assert ner1.moves.get_class_name(i) == ner2.moves.get_class_name(i) @pytest.mark.parametrize( - "pipe_cls,n_moves", [(DependencyParser, 5), (EntityRecognizer, 4)] + "pipe_cls,n_moves,model", + [(DependencyParser, 5, default_parser()), (EntityRecognizer, 4, default_ner())], ) -def test_add_label_get_label(pipe_cls, n_moves): +def test_add_label_get_label(pipe_cls, n_moves, model): """Test that added labels are returned correctly. This test was added to test for a bug in DependencyParser.labels that'd cause it to fail when splitting the move names. """ labels = ["A", "B", "C"] - pipe = pipe_cls(Vocab()) + pipe = pipe_cls(Vocab(), model) for label in labels: pipe.add_label(label) assert len(pipe.move_names) == len(labels) * n_moves diff --git a/spacy/tests/parser/test_arc_eager_oracle.py b/spacy/tests/parser/test_arc_eager_oracle.py index 41b7a4861..30b4a6f6d 100644 --- a/spacy/tests/parser/test_arc_eager_oracle.py +++ b/spacy/tests/parser/test_arc_eager_oracle.py @@ -1,8 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.vocab import Vocab + +from spacy.pipeline.defaults import default_parser from spacy.pipeline import DependencyParser from spacy.tokens import Doc from spacy.gold import GoldParse @@ -130,18 +129,25 @@ annot_tuples = [ def test_get_oracle_actions(): + ids, words, tags, heads, deps, ents = [], [], [], [], [], [] + for id_, word, tag, head, dep, ent in annot_tuples: + ids.append(id_) + words.append(word) + tags.append(tag) + heads.append(head) + deps.append(dep) + ents.append(ent) doc = Doc(Vocab(), words=[t[1] for t in annot_tuples]) - parser = DependencyParser(doc.vocab) + parser = DependencyParser(doc.vocab, default_parser()) parser.moves.add_action(0, "") parser.moves.add_action(1, "") parser.moves.add_action(1, "") parser.moves.add_action(4, "ROOT") - for i, (id_, word, tag, head, dep, ent) in enumerate(annot_tuples): + for i, (head, dep) in enumerate(zip(heads, deps)): if head > i: parser.moves.add_action(2, dep) elif head < i: parser.moves.add_action(3, dep) - ids, words, tags, heads, deps, ents = zip(*annot_tuples) heads, deps = projectivize(heads, deps) gold = GoldParse(doc, words=words, tags=tags, heads=heads, deps=deps) parser.moves.preprocess_gold(gold) diff --git a/spacy/tests/parser/test_ner.py b/spacy/tests/parser/test_ner.py index 244e9fa25..e78cac757 100644 --- a/spacy/tests/parser/test_ner.py +++ b/spacy/tests/parser/test_ner.py @@ -1,15 +1,21 @@ -# coding: 
utf-8 -from __future__ import unicode_literals - import pytest +from spacy import util from spacy.lang.en import English - +from spacy.pipeline.defaults import default_ner from spacy.pipeline import EntityRecognizer, EntityRuler from spacy.vocab import Vocab from spacy.syntax.ner import BiluoPushDown -from spacy.gold import GoldParse, minibatch +from spacy.gold import GoldParse from spacy.tokens import Doc +from ..util import make_tempdir + + +TRAIN_DATA = [ + ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}), + ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}), +] + @pytest.fixture def vocab(): @@ -132,7 +138,7 @@ def test_accept_blocked_token(): # 1. test normal behaviour nlp1 = English() doc1 = nlp1("I live in New York") - ner1 = EntityRecognizer(doc1.vocab) + ner1 = EntityRecognizer(doc1.vocab, default_ner()) assert [token.ent_iob_ for token in doc1] == ["", "", "", "", ""] assert [token.ent_type_ for token in doc1] == ["", "", "", "", ""] @@ -150,7 +156,7 @@ def test_accept_blocked_token(): # 2. test blocking behaviour nlp2 = English() doc2 = nlp2("I live in New York") - ner2 = EntityRecognizer(doc2.vocab) + ner2 = EntityRecognizer(doc2.vocab, default_ner()) # set "New York" to a blocked entity doc2.ents = [(0, 3, 5)] @@ -189,7 +195,7 @@ def test_train_empty(): nlp.begin_training() for itn in range(2): losses = {} - batches = minibatch(train_data) + batches = util.minibatch(train_data) for batch in batches: texts, annotations = zip(*batch) nlp.update( @@ -211,7 +217,7 @@ def test_overwrite_token(): assert [token.ent_type_ for token in doc] == ["", "", "", "", ""] # Check that a new ner can overwrite O - ner2 = EntityRecognizer(doc.vocab) + ner2 = EntityRecognizer(doc.vocab, default_ner()) ner2.moves.add_action(5, "") ner2.add_label("GPE") state = ner2.moves.init_batch([doc])[0] @@ -222,6 +228,18 @@ def test_overwrite_token(): assert ner2.moves.is_valid(state, "L-GPE") +def test_empty_ner(): + nlp = English() + ner = nlp.create_pipe("ner") + ner.add_label("MY_LABEL") + nlp.add_pipe(ner) + nlp.begin_training() + doc = nlp("John is watching the news about Croatia's elections") + # if this goes wrong, the initialization of the parser's upper layer is probably broken + result = ["O", "O", "O", "O", "O", "O", "O", "O", "O"] + assert [token.ent_iob_ for token in doc] == result + + def test_ruler_before_ner(): """ Test that an NER works after an entity_ruler: the second can add annotations """ nlp = English() @@ -237,7 +255,6 @@ def test_ruler_before_ner(): untrained_ner.add_label("MY_LABEL") nlp.add_pipe(untrained_ner) nlp.begin_training() - doc = nlp("This is Antti Korhonen speaking in Finland") expected_iobs = ["B", "O", "O", "O", "O", "O", "O"] expected_types = ["THING", "", "", "", "", "", ""] @@ -284,25 +301,38 @@ def test_block_ner(): assert [token.ent_type_ for token in doc] == expected_types -def test_change_number_features(): - # Test the default number features +def test_overfitting_IO(): + # Simple test to try and quickly overfit the NER component - ensuring the ML models work correctly nlp = English() ner = nlp.create_pipe("ner") + for _, annotations in TRAIN_DATA: + for ent in annotations.get("entities"): + ner.add_label(ent[2]) nlp.add_pipe(ner) - ner.add_label("PERSON") - nlp.begin_training() - assert ner.model.lower.nF == ner.nr_feature - # Test we can change it - nlp = English() - ner = nlp.create_pipe("ner") - nlp.add_pipe(ner) - ner.add_label("PERSON") - nlp.begin_training( - component_cfg={"ner": {"nr_feature_tokens": 3, 
"token_vector_width": 128}} - ) - assert ner.model.lower.nF == 3 - # Test the model runs - nlp("hello world") + optimizer = nlp.begin_training() + + for i in range(50): + losses = {} + nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses) + assert losses["ner"] < 0.00001 + + # test the trained model + test_text = "I like London." + doc = nlp(test_text) + ents = doc.ents + assert len(ents) == 1 + assert ents[0].text == "London" + assert ents[0].label_ == "LOC" + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(test_text) + ents2 = doc2.ents + assert len(ents2) == 1 + assert ents2[0].text == "London" + assert ents2[0].label_ == "LOC" class BlockerComponent1(object): diff --git a/spacy/tests/parser/test_neural_parser.py b/spacy/tests/parser/test_neural_parser.py index 062c76ae3..b648e9a00 100644 --- a/spacy/tests/parser/test_neural_parser.py +++ b/spacy/tests/parser/test_neural_parser.py @@ -1,13 +1,11 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest -from spacy._ml import Tok2Vec +from spacy.pipeline.defaults import default_parser, default_tok2vec from spacy.vocab import Vocab from spacy.syntax.arc_eager import ArcEager from spacy.syntax.nn_parser import Parser from spacy.tokens.doc import Doc from spacy.gold import GoldParse +from thinc.api import Model @pytest.fixture @@ -23,17 +21,22 @@ def arc_eager(vocab): @pytest.fixture def tok2vec(): - return Tok2Vec(8, 100) + tok2vec = default_tok2vec() + tok2vec.initialize() + return tok2vec @pytest.fixture def parser(vocab, arc_eager): - return Parser(vocab, moves=arc_eager, model=None) + return Parser(vocab, model=default_parser(), moves=arc_eager) @pytest.fixture -def model(arc_eager, tok2vec): - return Parser.Model(arc_eager.n_moves, token_vector_width=tok2vec.nO)[0] +def model(arc_eager, tok2vec, vocab): + model = default_parser() + model.attrs["resize_output"](model, arc_eager.n_moves) + model.initialize() + return model @pytest.fixture @@ -47,16 +50,16 @@ def gold(doc): def test_can_init_nn_parser(parser): - assert parser.model is None + assert isinstance(parser.model, Model) -def test_build_model(parser): - parser.model = Parser.Model(parser.moves.n_moves, hist_size=0)[0] +def test_build_model(parser, vocab): + parser.model = Parser(vocab, model=default_parser(), moves=parser.moves).model assert parser.model is not None def test_predict_doc(parser, tok2vec, model, doc): - doc.tensor = tok2vec([doc])[0] + doc.tensor = tok2vec.predict([doc])[0] parser.model = model parser(doc) @@ -64,10 +67,11 @@ def test_predict_doc(parser, tok2vec, model, doc): def test_update_doc(parser, model, doc, gold): parser.model = model - def optimize(weights, gradient, key=None): + def optimize(key, weights, gradient): weights -= 0.001 * gradient + return weights, gradient - parser.update([doc], [gold], sgd=optimize) + parser.update((doc, gold), sgd=optimize) @pytest.mark.xfail @@ -83,4 +87,4 @@ def test_update_doc_beam(parser, model, doc, gold): def optimize(weights, gradient, key=None): weights -= 0.001 * gradient - parser.update_beam([doc], [gold], sgd=optimize) + parser.update_beam((doc, gold), sgd=optimize) diff --git a/spacy/tests/parser/test_nn_beam.py b/spacy/tests/parser/test_nn_beam.py index 9dca99255..db9eb5e6f 100644 --- a/spacy/tests/parser/test_nn_beam.py +++ b/spacy/tests/parser/test_nn_beam.py @@ -1,10 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest import numpy from 
spacy.vocab import Vocab from spacy.language import Language +from spacy.pipeline.defaults import default_parser from spacy.pipeline import DependencyParser from spacy.syntax.arc_eager import ArcEager from spacy.tokens import Doc @@ -96,7 +94,7 @@ def test_beam_advance_too_few_scores(beam, scores): def test_beam_parse(): nlp = Language() - nlp.add_pipe(DependencyParser(nlp.vocab), name="parser") + nlp.add_pipe(DependencyParser(nlp.vocab, default_parser()), name="parser") nlp.parser.add_label("nsubj") nlp.parser.begin_training([], token_vector_width=8, hidden_width=8) doc = nlp.make_doc("Australia is a country") diff --git a/spacy/tests/parser/test_nonproj.py b/spacy/tests/parser/test_nonproj.py index 8bf8111c1..86d9a0180 100644 --- a/spacy/tests/parser/test_nonproj.py +++ b/spacy/tests/parser/test_nonproj.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.syntax.nonproj import ancestors, contains_cycle, is_nonproj_arc from spacy.syntax.nonproj import is_nonproj_tree diff --git a/spacy/tests/parser/test_parse.py b/spacy/tests/parser/test_parse.py index fb5301718..6e13d3044 100644 --- a/spacy/tests/parser/test_parse.py +++ b/spacy/tests/parser/test_parse.py @@ -1,9 +1,25 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest -from ..util import get_doc, apply_transition_sequence +from spacy.lang.en import English +from ..util import get_doc, apply_transition_sequence, make_tempdir +from ... import util + +TRAIN_DATA = [ + ( + "They trade mortgage-backed securities.", + { + "heads": [1, 1, 4, 4, 5, 1, 1], + "deps": ["nsubj", "ROOT", "compound", "punct", "nmod", "dobj", "punct"], + }, + ), + ( + "I like London and Berlin.", + { + "heads": [1, 1, 1, 2, 2, 1], + "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"], + }, + ), +] def test_parser_root(en_tokenizer): @@ -165,3 +181,35 @@ def test_parser_set_sent_starts(en_vocab): for sent in doc.sents: for token in sent: assert token.head in sent + + +def test_overfitting_IO(): + # Simple test to try and quickly overfit the dependency parser - ensuring the ML models work correctly + nlp = English() + parser = nlp.create_pipe("parser") + for _, annotations in TRAIN_DATA: + for dep in annotations.get("deps", []): + parser.add_label(dep) + nlp.add_pipe(parser) + optimizer = nlp.begin_training() + + for i in range(50): + losses = {} + nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses) + assert losses["parser"] < 0.00001 + + # test the trained model + test_text = "I like securities." 
+ doc = nlp(test_text) + assert doc[0].dep_ is "nsubj" + assert doc[2].dep_ is "dobj" + assert doc[3].dep_ is "punct" + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(test_text) + assert doc2[0].dep_ is "nsubj" + assert doc2[2].dep_ is "dobj" + assert doc2[3].dep_ is "punct" diff --git a/spacy/tests/parser/test_parse_navigate.py b/spacy/tests/parser/test_parse_navigate.py index 41524d45e..f42601a85 100644 --- a/spacy/tests/parser/test_parse_navigate.py +++ b/spacy/tests/parser/test_parse_navigate.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from ..util import get_doc diff --git a/spacy/tests/parser/test_preset_sbd.py b/spacy/tests/parser/test_preset_sbd.py index 70beb2f60..dc13fcdf1 100644 --- a/spacy/tests/parser/test_preset_sbd.py +++ b/spacy/tests/parser/test_preset_sbd.py @@ -1,12 +1,10 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest -from thinc.neural.optimizers import Adam -from thinc.neural.ops import NumpyOps +from thinc.api import Adam from spacy.attrs import NORM from spacy.gold import GoldParse from spacy.vocab import Vocab + +from spacy.pipeline.defaults import default_parser from spacy.tokens import Doc from spacy.pipeline import DependencyParser @@ -18,19 +16,19 @@ def vocab(): @pytest.fixture def parser(vocab): - parser = DependencyParser(vocab) + parser = DependencyParser(vocab, default_parser()) parser.cfg["token_vector_width"] = 4 parser.cfg["hidden_width"] = 32 # parser.add_label('right') parser.add_label("left") parser.begin_training([], **parser.cfg) - sgd = Adam(NumpyOps(), 0.001) + sgd = Adam(0.001) for i in range(10): losses = {} doc = Doc(vocab, words=["a", "b", "c", "d"]) gold = GoldParse(doc, heads=[1, 1, 3, 3], deps=["left", "ROOT", "left", "ROOT"]) - parser.update([doc], [gold], sgd=sgd, losses=losses) + parser.update((doc, gold), sgd=sgd, losses=losses) return parser diff --git a/spacy/tests/parser/test_space_attachment.py b/spacy/tests/parser/test_space_attachment.py index 945173faf..59ae4e629 100644 --- a/spacy/tests/parser/test_space_attachment.py +++ b/spacy/tests/parser/test_space_attachment.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.tokens.doc import Doc diff --git a/spacy/tests/pipeline/test_analysis.py b/spacy/tests/pipeline/test_analysis.py index 198f11bcd..cda39f6ee 100644 --- a/spacy/tests/pipeline/test_analysis.py +++ b/spacy/tests/pipeline/test_analysis.py @@ -1,11 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals - import spacy.language from spacy.language import Language, component from spacy.analysis import print_summary, validate_attrs from spacy.analysis import get_assigns_for_attr, get_requires_for_attr -from spacy.compat import is_python2 from mock import Mock, ANY import pytest @@ -17,8 +13,7 @@ def test_component_decorator_function(): return doc assert test_component.name == "test" - if not is_python2: - assert test_component.__doc__ == "docstring" + assert test_component.__doc__ == "docstring" assert test_component("foo") == "foo" @@ -45,13 +40,12 @@ def test_component_decorator_class(): assert test_component("foo") == "foo" assert hasattr(test_component, "custom") assert test_component.custom("bar") == "bar" - if not is_python2: - assert TestComponent.__doc__ == "docstring1" - assert TestComponent.__call__.__doc__ == "docstring2" - assert TestComponent.custom.__doc__ 
== "docstring3" - assert test_component.__doc__ == "docstring1" - assert test_component.__call__.__doc__ == "docstring2" - assert test_component.custom.__doc__ == "docstring3" + assert TestComponent.__doc__ == "docstring1" + assert TestComponent.__call__.__doc__ == "docstring2" + assert TestComponent.custom.__doc__ == "docstring3" + assert test_component.__doc__ == "docstring1" + assert test_component.__call__.__doc__ == "docstring2" + assert test_component.custom.__doc__ == "docstring3" def test_component_decorator_assigns(): @@ -117,7 +111,8 @@ def test_component_factories_from_nlp(): nlp.add_pipe(pipe) assert nlp("hello world") # The first argument here is the class itself, so we're accepting any here - mock.assert_called_once_with(ANY, nlp, foo="bar") + # The model will be initialized to None by the factory + mock.assert_called_once_with(ANY, nlp, None, foo="bar") def test_analysis_validate_attrs_valid(): diff --git a/spacy/tests/pipeline/test_entity_linker.py b/spacy/tests/pipeline/test_entity_linker.py index 8023f72a6..32b434e04 100644 --- a/spacy/tests/pipeline/test_entity_linker.py +++ b/spacy/tests/pipeline/test_entity_linker.py @@ -1,11 +1,11 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.kb import KnowledgeBase + +from spacy import util from spacy.lang.en import English from spacy.pipeline import EntityRuler +from spacy.tests.util import make_tempdir from spacy.tokens import Span @@ -203,8 +203,8 @@ def test_preserving_links_asdoc(nlp): ruler.add_patterns(patterns) nlp.add_pipe(ruler) - el_pipe = nlp.create_pipe(name="entity_linker") - el_pipe.set_kb(mykb) + cfg = {"kb": mykb, "incl_prior": False} + el_pipe = nlp.create_pipe(name="entity_linker", config=cfg) el_pipe.begin_training() el_pipe.incl_context = False el_pipe.incl_prior = True @@ -248,3 +248,71 @@ def test_preserving_links_ents_2(nlp): assert len(list(doc.ents)) == 1 assert list(doc.ents)[0].label_ == "LOC" assert list(doc.ents)[0].kb_id_ == "Q1" + + +# fmt: off +TRAIN_DATA = [ + ("Russ Cochran captured his first major title with his son as caddie.", {"links": {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}}), + ("Russ Cochran his reprints include EC Comics.", {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}}), + ("Russ Cochran has been publishing comic art.", {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}}), + ("Russ Cochran was a member of University of Kentucky's golf team.", {"links": {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}}), +] +GOLD_entities = ["Q2146908", "Q7381115", "Q7381115", "Q2146908"] +# fmt: on + + +def test_overfitting_IO(): + # Simple test to try and quickly overfit the NEL component - ensuring the ML models work correctly + nlp = English() + nlp.add_pipe(nlp.create_pipe('sentencizer')) + + # Add a custom component to recognize "Russ Cochran" as an entity for the example training data + ruler = EntityRuler(nlp) + patterns = [{"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]}] + ruler.add_patterns(patterns) + nlp.add_pipe(ruler) + + # Convert the texts to docs to make sure we have doc.ents set for the training examples + TRAIN_DOCS = [] + for text, annotation in TRAIN_DATA: + doc = nlp(text) + annotation_clean = annotation + TRAIN_DOCS.append((doc, annotation_clean)) + + # create artificial KB - assign same prior weight to the two russ cochran's + # Q2146908 (Russ Cochran): American golfer + # Q7381115 (Russ Cochran): publisher + mykb = KnowledgeBase(nlp.vocab, entity_vector_length=3) + mykb.add_entity(entity="Q2146908", 
freq=12, entity_vector=[6, -4, 3]) + mykb.add_entity(entity="Q7381115", freq=12, entity_vector=[9, 1, -7]) + mykb.add_alias(alias="Russ Cochran", entities=["Q2146908", "Q7381115"], probabilities=[0.5, 0.5]) + + # Create the Entity Linker component and add it to the pipeline + entity_linker = nlp.create_pipe("entity_linker", config={"kb": mykb}) + nlp.add_pipe(entity_linker, last=True) + + # train the NEL pipe + optimizer = nlp.begin_training() + for i in range(50): + losses = {} + nlp.update(TRAIN_DOCS, sgd=optimizer, losses=losses) + assert losses["entity_linker"] < 0.001 + + # test the trained model + predictions = [] + for text, annotation in TRAIN_DATA: + doc = nlp(text) + for ent in doc.ents: + predictions.append(ent.kb_id_) + assert predictions == GOLD_entities + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + predictions = [] + for text, annotation in TRAIN_DATA: + doc2 = nlp2(text) + for ent in doc2.ents: + predictions.append(ent.kb_id_) + assert predictions == GOLD_entities diff --git a/spacy/tests/pipeline/test_entity_ruler.py b/spacy/tests/pipeline/test_entity_ruler.py index b6e3c40c9..b04569e22 100644 --- a/spacy/tests/pipeline/test_entity_ruler.py +++ b/spacy/tests/pipeline/test_entity_ruler.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.tokens import Span from spacy.language import Language diff --git a/spacy/tests/pipeline/test_factories.py b/spacy/tests/pipeline/test_factories.py index 5efcc319a..0a9a4d3c9 100644 --- a/spacy/tests/pipeline/test_factories.py +++ b/spacy/tests/pipeline/test_factories.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.language import Language from spacy.tokens import Span diff --git a/spacy/tests/pipeline/test_functions.py b/spacy/tests/pipeline/test_functions.py index 5b5fcd2fd..ca983267f 100644 --- a/spacy/tests/pipeline/test_functions.py +++ b/spacy/tests/pipeline/test_functions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.pipeline.functions import merge_subtokens from ..util import get_doc diff --git a/spacy/tests/pipeline/test_morphologizer.py b/spacy/tests/pipeline/test_morphologizer.py new file mode 100644 index 000000000..f9307afc2 --- /dev/null +++ b/spacy/tests/pipeline/test_morphologizer.py @@ -0,0 +1,49 @@ +import pytest + +from spacy import util +from spacy.lang.en import English +from spacy.language import Language +from spacy.tests.util import make_tempdir + + +def test_label_types(): + nlp = Language() + nlp.add_pipe(nlp.create_pipe("morphologizer")) + nlp.get_pipe("morphologizer").add_label("Feat=A") + with pytest.raises(ValueError): + nlp.get_pipe("morphologizer").add_label(9) + + +TRAIN_DATA = [ + ("I like green eggs", {"morphs": ["Feat=N", "Feat=V", "Feat=J", "Feat=N"], "pos": ["NOUN", "VERB", "ADJ", "NOUN"]}), + ("Eat blue ham", {"morphs": ["Feat=V", "Feat=J", "Feat=N"], "pos": ["VERB", "ADJ", "NOUN"]}), +] + + +def test_overfitting_IO(): + # Simple test to try and quickly overfit the morphologizer - ensuring the ML models work correctly + nlp = English() + morphologizer = nlp.create_pipe("morphologizer") + for inst in TRAIN_DATA: + for morph, pos in zip(inst[1]["morphs"], inst[1]["pos"]): + morphologizer.add_label(morph + "|POS=" + pos) + nlp.add_pipe(morphologizer) + optimizer = nlp.begin_training() + + for i in range(50): + losses = {} + 
nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses) + assert losses["morphologizer"] < 0.00001 + + # test the trained model + test_text = "I like blue eggs" + doc = nlp(test_text) + gold_morphs = ["Feat=N|POS=NOUN", "Feat=V|POS=VERB", "Feat=J|POS=ADJ", "Feat=N|POS=NOUN"] + assert gold_morphs == [t.morph_ for t in doc] + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(test_text) + assert gold_morphs == [t.morph_ for t in doc2] diff --git a/spacy/tests/pipeline/test_pipe_methods.py b/spacy/tests/pipeline/test_pipe_methods.py index 27fb57b18..d42216655 100644 --- a/spacy/tests/pipeline/test_pipe_methods.py +++ b/spacy/tests/pipeline/test_pipe_methods.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.language import Language @@ -91,7 +88,16 @@ def test_remove_pipe(nlp, name): def test_disable_pipes_method(nlp, name): nlp.add_pipe(new_pipe, name=name) assert nlp.has_pipe(name) - disabled = nlp.disable_pipes(name) + disabled = nlp.select_pipes(disable=name) + assert not nlp.has_pipe(name) + disabled.restore() + + +@pytest.mark.parametrize("name", ["my_component"]) +def test_enable_pipes_method(nlp, name): + nlp.add_pipe(new_pipe, name=name) + assert nlp.has_pipe(name) + disabled = nlp.select_pipes(enable=[]) assert not nlp.has_pipe(name) disabled.restore() @@ -100,25 +106,63 @@ def test_disable_pipes_method(nlp, name): def test_disable_pipes_context(nlp, name): nlp.add_pipe(new_pipe, name=name) assert nlp.has_pipe(name) - with nlp.disable_pipes(name): + with nlp.select_pipes(disable=name): assert not nlp.has_pipe(name) assert nlp.has_pipe(name) -def test_disable_pipes_list_arg(nlp): +def test_select_pipes_list_arg(nlp): for name in ["c1", "c2", "c3"]: nlp.add_pipe(new_pipe, name=name) assert nlp.has_pipe(name) - with nlp.disable_pipes(["c1", "c2"]): + with nlp.select_pipes(disable=["c1", "c2"]): assert not nlp.has_pipe("c1") assert not nlp.has_pipe("c2") assert nlp.has_pipe("c3") + with nlp.select_pipes(enable="c3"): + assert not nlp.has_pipe("c1") + assert not nlp.has_pipe("c2") + assert nlp.has_pipe("c3") + with nlp.select_pipes(enable=["c1", "c2"], disable="c3"): + assert nlp.has_pipe("c1") + assert nlp.has_pipe("c2") + assert not nlp.has_pipe("c3") + with nlp.select_pipes(enable=[]): + assert not nlp.has_pipe("c1") + assert not nlp.has_pipe("c2") + assert not nlp.has_pipe("c3") + with nlp.select_pipes(enable=["c1", "c2", "c3"], disable=[]): + assert nlp.has_pipe("c1") + assert nlp.has_pipe("c2") + assert nlp.has_pipe("c3") + with nlp.select_pipes(disable=["c1", "c2", "c3"], enable=[]): + assert not nlp.has_pipe("c1") + assert not nlp.has_pipe("c2") + assert not nlp.has_pipe("c3") + + +def test_select_pipes_errors(nlp): + for name in ["c1", "c2", "c3"]: + nlp.add_pipe(new_pipe, name=name) + assert nlp.has_pipe(name) + + with pytest.raises(ValueError): + nlp.select_pipes() + + with pytest.raises(ValueError): + nlp.select_pipes(enable=["c1", "c2"], disable=["c1"]) + + with pytest.raises(ValueError): + nlp.select_pipes(enable=["c1", "c2"], disable=[]) + + with pytest.raises(ValueError): + nlp.select_pipes(enable=[], disable=["c3"]) @pytest.mark.parametrize("n_pipes", [100]) def test_add_lots_of_pipes(nlp, n_pipes): for i in range(n_pipes): - nlp.add_pipe(lambda doc: doc, name="pipe_%d" % i) + nlp.add_pipe(lambda doc: doc, name=f"pipe_{i}") assert len(nlp.pipe_names) == n_pipes diff --git 
a/spacy/tests/pipeline/test_sentencizer.py b/spacy/tests/pipeline/test_sentencizer.py index ee9220a29..5c00b97ce 100644 --- a/spacy/tests/pipeline/test_sentencizer.py +++ b/spacy/tests/pipeline/test_sentencizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest import spacy from spacy.pipeline import Sentencizer @@ -29,6 +26,12 @@ def test_sentencizer_pipe(): sent_starts = [t.is_sent_start for t in doc] assert sent_starts == [True, False, True, False, False, False, False] assert len(list(doc.sents)) == 2 + for ex in nlp.pipe(texts, as_example=True): + doc = ex.doc + assert doc.is_sentenced + sent_starts = [t.is_sent_start for t in doc] + assert sent_starts == [True, False, True, False, False, False, False] + assert len(list(doc.sents)) == 2 def test_sentencizer_empty_docs(): diff --git a/spacy/tests/pipeline/test_senter.py b/spacy/tests/pipeline/test_senter.py new file mode 100644 index 000000000..197fdca6e --- /dev/null +++ b/spacy/tests/pipeline/test_senter.py @@ -0,0 +1,52 @@ +import pytest + +from spacy import util +from spacy.lang.en import English +from spacy.language import Language +from spacy.tests.util import make_tempdir + + +def test_label_types(): + nlp = Language() + nlp.add_pipe(nlp.create_pipe("senter")) + with pytest.raises(NotImplementedError): + nlp.get_pipe("senter").add_label("A") + +SENT_STARTS = [0] * 14 +SENT_STARTS[0] = 1 +SENT_STARTS[5] = 1 +SENT_STARTS[9] = 1 + +TRAIN_DATA = [ + ("I like green eggs. Eat blue ham. I like purple eggs.", {"sent_starts": SENT_STARTS}), + ("She likes purple eggs. They hate ham. You like yellow eggs.", {"sent_starts": SENT_STARTS}), +] + + +def test_overfitting_IO(): + # Simple test to try and quickly overfit the senter - ensuring the ML models work correctly + nlp = English() + senter = nlp.create_pipe("senter") + nlp.add_pipe(senter) + optimizer = nlp.begin_training() + + for i in range(200): + losses = {} + nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses) + assert losses["senter"] < 0.001 + + # test the trained model + test_text = "I like purple eggs. They eat ham. You like yellow eggs." 
+ doc = nlp(test_text) + gold_sent_starts = [0] * 14 + gold_sent_starts[0] = 1 + gold_sent_starts[5] = 1 + gold_sent_starts[9] = 1 + assert [int(t.is_sent_start) for t in doc] == gold_sent_starts + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(test_text) + assert [int(t.is_sent_start) for t in doc2] == gold_sent_starts diff --git a/spacy/tests/pipeline/test_simple_ner.py b/spacy/tests/pipeline/test_simple_ner.py new file mode 100644 index 000000000..9d4acf2fd --- /dev/null +++ b/spacy/tests/pipeline/test_simple_ner.py @@ -0,0 +1,417 @@ +import pytest +from collections import namedtuple + +from thinc.api import NumpyOps +from spacy.ml._biluo import BILUO, _get_transition_table +from spacy.pipeline.simple_ner import SimpleNER +import spacy + + +@pytest.fixture(params=[ + ["PER", "ORG", "LOC", "MISC"], + ["GPE", "PERSON", "NUMBER", "CURRENCY", "EVENT"] +]) +def labels(request): + return request.param + +@pytest.fixture +def ops(): + return NumpyOps() + +def _get_actions(labels): + action_names = ( + [f"B{label}" for label in labels] + \ + [f"I{label}" for label in labels] + \ + [f"L{label}" for label in labels] + \ + [f"U{label}" for label in labels] + \ + ["O"] + ) + A = namedtuple("actions", action_names) + return A(**{name: i for i, name in enumerate(action_names)}) + + +def test_init_biluo_layer(labels): + model = BILUO() + model.set_dim("nO", model.attrs["get_num_actions"](len(labels))) + model.initialize() + assert model.get_dim("nO") == len(labels) * 4 + 1 + + +def test_transition_table(ops): + labels = ["per", "loc", "org"] + table = _get_transition_table(len(labels)) + a = _get_actions(labels) + assert table.shape == (2, len(a), len(a)) + # Not last token, prev action was B + assert table[0, a.Bper, a.Bper] == 0 + assert table[0, a.Bper, a.Bloc] == 0 + assert table[0, a.Bper, a.Borg] == 0 + assert table[0, a.Bper, a.Iper] == 1 + assert table[0, a.Bper, a.Iloc] == 0 + assert table[0, a.Bper, a.Iorg] == 0 + assert table[0, a.Bper, a.Lper] == 1 + assert table[0, a.Bper, a.Lloc] == 0 + assert table[0, a.Bper, a.Lorg] == 0 + assert table[0, a.Bper, a.Uper] == 0 + assert table[0, a.Bper, a.Uloc] == 0 + assert table[0, a.Bper, a.Uorg] == 0 + assert table[0, a.Bper, a.O] == 0 + + assert table[0, a.Bloc, a.Bper] == 0 + assert table[0, a.Bloc, a.Bloc] == 0 + assert table[0, a.Bloc, a.Borg] == 0 + assert table[0, a.Bloc, a.Iper] == 0 + assert table[0, a.Bloc, a.Iloc] == 1 + assert table[0, a.Bloc, a.Iorg] == 0 + assert table[0, a.Bloc, a.Lper] == 0 + assert table[0, a.Bloc, a.Lloc] == 1 + assert table[0, a.Bloc, a.Lorg] == 0 + assert table[0, a.Bloc, a.Uper] == 0 + assert table[0, a.Bloc, a.Uloc] == 0 + assert table[0, a.Bloc, a.Uorg] == 0 + assert table[0, a.Bloc, a.O] == 0 + + assert table[0, a.Borg, a.Bper] == 0 + assert table[0, a.Borg, a.Bloc] == 0 + assert table[0, a.Borg, a.Borg] == 0 + assert table[0, a.Borg, a.Iper] == 0 + assert table[0, a.Borg, a.Iloc] == 0 + assert table[0, a.Borg, a.Iorg] == 1 + assert table[0, a.Borg, a.Lper] == 0 + assert table[0, a.Borg, a.Lloc] == 0 + assert table[0, a.Borg, a.Lorg] == 1 + assert table[0, a.Borg, a.Uper] == 0 + assert table[0, a.Borg, a.Uloc] == 0 + assert table[0, a.Borg, a.Uorg] == 0 + assert table[0, a.Borg, a.O] == 0 + + # Not last token, prev action was I + assert table[0, a.Iper, a.Bper] == 0 + assert table[0, a.Iper, a.Bloc] == 0 + assert table[0, a.Iper, a.Borg] == 0 + assert table[0, a.Iper, a.Iper] == 1 + assert 
table[0, a.Iper, a.Iloc] == 0 + assert table[0, a.Iper, a.Iorg] == 0 + assert table[0, a.Iper, a.Lper] == 1 + assert table[0, a.Iper, a.Lloc] == 0 + assert table[0, a.Iper, a.Lorg] == 0 + assert table[0, a.Iper, a.Uper] == 0 + assert table[0, a.Iper, a.Uloc] == 0 + assert table[0, a.Iper, a.Uorg] == 0 + assert table[0, a.Iper, a.O] == 0 + + assert table[0, a.Iloc, a.Bper] == 0 + assert table[0, a.Iloc, a.Bloc] == 0 + assert table[0, a.Iloc, a.Borg] == 0 + assert table[0, a.Iloc, a.Iper] == 0 + assert table[0, a.Iloc, a.Iloc] == 1 + assert table[0, a.Iloc, a.Iorg] == 0 + assert table[0, a.Iloc, a.Lper] == 0 + assert table[0, a.Iloc, a.Lloc] == 1 + assert table[0, a.Iloc, a.Lorg] == 0 + assert table[0, a.Iloc, a.Uper] == 0 + assert table[0, a.Iloc, a.Uloc] == 0 + assert table[0, a.Iloc, a.Uorg] == 0 + assert table[0, a.Iloc, a.O] == 0 + + assert table[0, a.Iorg, a.Bper] == 0 + assert table[0, a.Iorg, a.Bloc] == 0 + assert table[0, a.Iorg, a.Borg] == 0 + assert table[0, a.Iorg, a.Iper] == 0 + assert table[0, a.Iorg, a.Iloc] == 0 + assert table[0, a.Iorg, a.Iorg] == 1 + assert table[0, a.Iorg, a.Lper] == 0 + assert table[0, a.Iorg, a.Lloc] == 0 + assert table[0, a.Iorg, a.Lorg] == 1 + assert table[0, a.Iorg, a.Uper] == 0 + assert table[0, a.Iorg, a.Uloc] == 0 + assert table[0, a.Iorg, a.Uorg] == 0 + assert table[0, a.Iorg, a.O] == 0 + + # Not last token, prev action was L + assert table[0, a.Lper, a.Bper] == 1 + assert table[0, a.Lper, a.Bloc] == 1 + assert table[0, a.Lper, a.Borg] == 1 + assert table[0, a.Lper, a.Iper] == 0 + assert table[0, a.Lper, a.Iloc] == 0 + assert table[0, a.Lper, a.Iorg] == 0 + assert table[0, a.Lper, a.Lper] == 0 + assert table[0, a.Lper, a.Lloc] == 0 + assert table[0, a.Lper, a.Lorg] == 0 + assert table[0, a.Lper, a.Uper] == 1 + assert table[0, a.Lper, a.Uloc] == 1 + assert table[0, a.Lper, a.Uorg] == 1 + assert table[0, a.Lper, a.O] == 1 + + assert table[0, a.Lloc, a.Bper] == 1 + assert table[0, a.Lloc, a.Bloc] == 1 + assert table[0, a.Lloc, a.Borg] == 1 + assert table[0, a.Lloc, a.Iper] == 0 + assert table[0, a.Lloc, a.Iloc] == 0 + assert table[0, a.Lloc, a.Iorg] == 0 + assert table[0, a.Lloc, a.Lper] == 0 + assert table[0, a.Lloc, a.Lloc] == 0 + assert table[0, a.Lloc, a.Lorg] == 0 + assert table[0, a.Lloc, a.Uper] == 1 + assert table[0, a.Lloc, a.Uloc] == 1 + assert table[0, a.Lloc, a.Uorg] == 1 + assert table[0, a.Lloc, a.O] == 1 + + assert table[0, a.Lorg, a.Bper] == 1 + assert table[0, a.Lorg, a.Bloc] == 1 + assert table[0, a.Lorg, a.Borg] == 1 + assert table[0, a.Lorg, a.Iper] == 0 + assert table[0, a.Lorg, a.Iloc] == 0 + assert table[0, a.Lorg, a.Iorg] == 0 + assert table[0, a.Lorg, a.Lper] == 0 + assert table[0, a.Lorg, a.Lloc] == 0 + assert table[0, a.Lorg, a.Lorg] == 0 + assert table[0, a.Lorg, a.Uper] == 1 + assert table[0, a.Lorg, a.Uloc] == 1 + assert table[0, a.Lorg, a.Uorg] == 1 + assert table[0, a.Lorg, a.O] == 1 + + # Not last token, prev action was U + assert table[0, a.Uper, a.Bper] == 1 + assert table[0, a.Uper, a.Bloc] == 1 + assert table[0, a.Uper, a.Borg] == 1 + assert table[0, a.Uper, a.Iper] == 0 + assert table[0, a.Uper, a.Iloc] == 0 + assert table[0, a.Uper, a.Iorg] == 0 + assert table[0, a.Uper, a.Lper] == 0 + assert table[0, a.Uper, a.Lloc] == 0 + assert table[0, a.Uper, a.Lorg] == 0 + assert table[0, a.Uper, a.Uper] == 1 + assert table[0, a.Uper, a.Uloc] == 1 + assert table[0, a.Uper, a.Uorg] == 1 + assert table[0, a.Uper, a.O] == 1 + + assert table[0, a.Uloc, a.Bper] == 1 + assert table[0, a.Uloc, a.Bloc] == 1 + assert table[0, 
a.Uloc, a.Borg] == 1 + assert table[0, a.Uloc, a.Iper] == 0 + assert table[0, a.Uloc, a.Iloc] == 0 + assert table[0, a.Uloc, a.Iorg] == 0 + assert table[0, a.Uloc, a.Lper] == 0 + assert table[0, a.Uloc, a.Lloc] == 0 + assert table[0, a.Uloc, a.Lorg] == 0 + assert table[0, a.Uloc, a.Uper] == 1 + assert table[0, a.Uloc, a.Uloc] == 1 + assert table[0, a.Uloc, a.Uorg] == 1 + assert table[0, a.Uloc, a.O] == 1 + + assert table[0, a.Uorg, a.Bper] == 1 + assert table[0, a.Uorg, a.Bloc] == 1 + assert table[0, a.Uorg, a.Borg] == 1 + assert table[0, a.Uorg, a.Iper] == 0 + assert table[0, a.Uorg, a.Iloc] == 0 + assert table[0, a.Uorg, a.Iorg] == 0 + assert table[0, a.Uorg, a.Lper] == 0 + assert table[0, a.Uorg, a.Lloc] == 0 + assert table[0, a.Uorg, a.Lorg] == 0 + assert table[0, a.Uorg, a.Uper] == 1 + assert table[0, a.Uorg, a.Uloc] == 1 + assert table[0, a.Uorg, a.Uorg] == 1 + assert table[0, a.Uorg, a.O] == 1 + + # Not last token, prev action was O + assert table[0, a.O, a.Bper] == 1 + assert table[0, a.O, a.Bloc] == 1 + assert table[0, a.O, a.Borg] == 1 + assert table[0, a.O, a.Iper] == 0 + assert table[0, a.O, a.Iloc] == 0 + assert table[0, a.O, a.Iorg] == 0 + assert table[0, a.O, a.Lper] == 0 + assert table[0, a.O, a.Lloc] == 0 + assert table[0, a.O, a.Lorg] == 0 + assert table[0, a.O, a.Uper] == 1 + assert table[0, a.O, a.Uloc] == 1 + assert table[0, a.O, a.Uorg] == 1 + assert table[0, a.O, a.O] == 1 + + # Last token, prev action was B + assert table[1, a.Bper, a.Bper] == 0 + assert table[1, a.Bper, a.Bloc] == 0 + assert table[1, a.Bper, a.Borg] == 0 + assert table[1, a.Bper, a.Iper] == 0 + assert table[1, a.Bper, a.Iloc] == 0 + assert table[1, a.Bper, a.Iorg] == 0 + assert table[1, a.Bper, a.Lper] == 1 + assert table[1, a.Bper, a.Lloc] == 0 + assert table[1, a.Bper, a.Lorg] == 0 + assert table[1, a.Bper, a.Uper] == 0 + assert table[1, a.Bper, a.Uloc] == 0 + assert table[1, a.Bper, a.Uorg] == 0 + assert table[1, a.Bper, a.O] == 0 + + assert table[1, a.Bloc, a.Bper] == 0 + assert table[1, a.Bloc, a.Bloc] == 0 + assert table[0, a.Bloc, a.Borg] == 0 + assert table[1, a.Bloc, a.Iper] == 0 + assert table[1, a.Bloc, a.Iloc] == 0 + assert table[1, a.Bloc, a.Iorg] == 0 + assert table[1, a.Bloc, a.Lper] == 0 + assert table[1, a.Bloc, a.Lloc] == 1 + assert table[1, a.Bloc, a.Lorg] == 0 + assert table[1, a.Bloc, a.Uper] == 0 + assert table[1, a.Bloc, a.Uloc] == 0 + assert table[1, a.Bloc, a.Uorg] == 0 + assert table[1, a.Bloc, a.O] == 0 + + assert table[1, a.Borg, a.Bper] == 0 + assert table[1, a.Borg, a.Bloc] == 0 + assert table[1, a.Borg, a.Borg] == 0 + assert table[1, a.Borg, a.Iper] == 0 + assert table[1, a.Borg, a.Iloc] == 0 + assert table[1, a.Borg, a.Iorg] == 0 + assert table[1, a.Borg, a.Lper] == 0 + assert table[1, a.Borg, a.Lloc] == 0 + assert table[1, a.Borg, a.Lorg] == 1 + assert table[1, a.Borg, a.Uper] == 0 + assert table[1, a.Borg, a.Uloc] == 0 + assert table[1, a.Borg, a.Uorg] == 0 + assert table[1, a.Borg, a.O] == 0 + + # Last token, prev action was I + assert table[1, a.Iper, a.Bper] == 0 + assert table[1, a.Iper, a.Bloc] == 0 + assert table[1, a.Iper, a.Borg] == 0 + assert table[1, a.Iper, a.Iper] == 0 + assert table[1, a.Iper, a.Iloc] == 0 + assert table[1, a.Iper, a.Iorg] == 0 + assert table[1, a.Iper, a.Lper] == 1 + assert table[1, a.Iper, a.Lloc] == 0 + assert table[1, a.Iper, a.Lorg] == 0 + assert table[1, a.Iper, a.Uper] == 0 + assert table[1, a.Iper, a.Uloc] == 0 + assert table[1, a.Iper, a.Uorg] == 0 + assert table[1, a.Iper, a.O] == 0 + + assert table[1, a.Iloc, a.Bper] == 0 + 
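+ # note: the table is indexed as table[is_last_token, prev_action, next_action];
+ # 1 marks a transition that is valid at that position, 0 one that is not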
assert table[1, a.Iloc, a.Bloc] == 0 + assert table[1, a.Iloc, a.Borg] == 0 + assert table[1, a.Iloc, a.Iper] == 0 + assert table[1, a.Iloc, a.Iloc] == 0 + assert table[1, a.Iloc, a.Iorg] == 0 + assert table[1, a.Iloc, a.Lper] == 0 + assert table[1, a.Iloc, a.Lloc] == 1 + assert table[1, a.Iloc, a.Lorg] == 0 + assert table[1, a.Iloc, a.Uper] == 0 + assert table[1, a.Iloc, a.Uloc] == 0 + assert table[1, a.Iloc, a.Uorg] == 0 + assert table[1, a.Iloc, a.O] == 0 + + assert table[1, a.Iorg, a.Bper] == 0 + assert table[1, a.Iorg, a.Bloc] == 0 + assert table[1, a.Iorg, a.Borg] == 0 + assert table[1, a.Iorg, a.Iper] == 0 + assert table[1, a.Iorg, a.Iloc] == 0 + assert table[1, a.Iorg, a.Iorg] == 0 + assert table[1, a.Iorg, a.Lper] == 0 + assert table[1, a.Iorg, a.Lloc] == 0 + assert table[1, a.Iorg, a.Lorg] == 1 + assert table[1, a.Iorg, a.Uper] == 0 + assert table[1, a.Iorg, a.Uloc] == 0 + assert table[1, a.Iorg, a.Uorg] == 0 + assert table[1, a.Iorg, a.O] == 0 + + # Last token, prev action was L + assert table[1, a.Lper, a.Bper] == 0 + assert table[1, a.Lper, a.Bloc] == 0 + assert table[1, a.Lper, a.Borg] == 0 + assert table[1, a.Lper, a.Iper] == 0 + assert table[1, a.Lper, a.Iloc] == 0 + assert table[1, a.Lper, a.Iorg] == 0 + assert table[1, a.Lper, a.Lper] == 0 + assert table[1, a.Lper, a.Lloc] == 0 + assert table[1, a.Lper, a.Lorg] == 0 + assert table[1, a.Lper, a.Uper] == 1 + assert table[1, a.Lper, a.Uloc] == 1 + assert table[1, a.Lper, a.Uorg] == 1 + assert table[1, a.Lper, a.O] == 1 + + assert table[1, a.Lloc, a.Bper] == 0 + assert table[1, a.Lloc, a.Bloc] == 0 + assert table[1, a.Lloc, a.Borg] == 0 + assert table[1, a.Lloc, a.Iper] == 0 + assert table[1, a.Lloc, a.Iloc] == 0 + assert table[1, a.Lloc, a.Iorg] == 0 + assert table[1, a.Lloc, a.Lper] == 0 + assert table[1, a.Lloc, a.Lloc] == 0 + assert table[1, a.Lloc, a.Lorg] == 0 + assert table[1, a.Lloc, a.Uper] == 1 + assert table[1, a.Lloc, a.Uloc] == 1 + assert table[1, a.Lloc, a.Uorg] == 1 + assert table[1, a.Lloc, a.O] == 1 + + assert table[1, a.Lorg, a.Bper] == 0 + assert table[1, a.Lorg, a.Bloc] == 0 + assert table[1, a.Lorg, a.Borg] == 0 + assert table[1, a.Lorg, a.Iper] == 0 + assert table[1, a.Lorg, a.Iloc] == 0 + assert table[1, a.Lorg, a.Iorg] == 0 + assert table[1, a.Lorg, a.Lper] == 0 + assert table[1, a.Lorg, a.Lloc] == 0 + assert table[1, a.Lorg, a.Lorg] == 0 + assert table[1, a.Lorg, a.Uper] == 1 + assert table[1, a.Lorg, a.Uloc] == 1 + assert table[1, a.Lorg, a.Uorg] == 1 + assert table[1, a.Lorg, a.O] == 1 + + # Last token, prev action was U + assert table[1, a.Uper, a.Bper] == 0 + assert table[1, a.Uper, a.Bloc] == 0 + assert table[1, a.Uper, a.Borg] == 0 + assert table[1, a.Uper, a.Iper] == 0 + assert table[1, a.Uper, a.Iloc] == 0 + assert table[1, a.Uper, a.Iorg] == 0 + assert table[1, a.Uper, a.Lper] == 0 + assert table[1, a.Uper, a.Lloc] == 0 + assert table[1, a.Uper, a.Lorg] == 0 + assert table[1, a.Uper, a.Uper] == 1 + assert table[1, a.Uper, a.Uloc] == 1 + assert table[1, a.Uper, a.Uorg] == 1 + assert table[1, a.Uper, a.O] == 1 + + assert table[1, a.Uloc, a.Bper] == 0 + assert table[1, a.Uloc, a.Bloc] == 0 + assert table[1, a.Uloc, a.Borg] == 0 + assert table[1, a.Uloc, a.Iper] == 0 + assert table[1, a.Uloc, a.Iloc] == 0 + assert table[1, a.Uloc, a.Iorg] == 0 + assert table[1, a.Uloc, a.Lper] == 0 + assert table[1, a.Uloc, a.Lloc] == 0 + assert table[1, a.Uloc, a.Lorg] == 0 + assert table[1, a.Uloc, a.Uper] == 1 + assert table[1, a.Uloc, a.Uloc] == 1 + assert table[1, a.Uloc, a.Uorg] == 1 + assert table[1, 
a.Uloc, a.O] == 1 + + assert table[1, a.Uorg, a.Bper] == 0 + assert table[1, a.Uorg, a.Bloc] == 0 + assert table[1, a.Uorg, a.Borg] == 0 + assert table[1, a.Uorg, a.Iper] == 0 + assert table[1, a.Uorg, a.Iloc] == 0 + assert table[1, a.Uorg, a.Iorg] == 0 + assert table[1, a.Uorg, a.Lper] == 0 + assert table[1, a.Uorg, a.Lloc] == 0 + assert table[1, a.Uorg, a.Lorg] == 0 + assert table[1, a.Uorg, a.Uper] == 1 + assert table[1, a.Uorg, a.Uloc] == 1 + assert table[1, a.Uorg, a.Uorg] == 1 + assert table[1, a.Uorg, a.O] == 1 + + # Last token, prev action was O + assert table[1, a.O, a.Bper] == 0 + assert table[1, a.O, a.Bloc] == 0 + assert table[1, a.O, a.Borg] == 0 + assert table[1, a.O, a.Iper] == 0 + assert table[1, a.O, a.Iloc] == 0 + assert table[1, a.O, a.Iorg] == 0 + assert table[1, a.O, a.Lper] == 0 + assert table[1, a.O, a.Lloc] == 0 + assert table[1, a.O, a.Lorg] == 0 + assert table[1, a.O, a.Uper] == 1 + assert table[1, a.O, a.Uloc] == 1 + assert table[1, a.O, a.Uorg] == 1 + assert table[1, a.O, a.O] == 1 diff --git a/spacy/tests/pipeline/test_tagger.py b/spacy/tests/pipeline/test_tagger.py index a5bda9090..a90207a78 100644 --- a/spacy/tests/pipeline/test_tagger.py +++ b/spacy/tests/pipeline/test_tagger.py @@ -1,8 +1,9 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest + +from spacy import util +from spacy.lang.en import English from spacy.language import Language +from spacy.tests.util import make_tempdir def test_label_types(): @@ -11,3 +12,44 @@ def test_label_types(): nlp.get_pipe("tagger").add_label("A") with pytest.raises(ValueError): nlp.get_pipe("tagger").add_label(9) + + +TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}} + +TRAIN_DATA = [ + ("I like green eggs", {"tags": ["N", "V", "J", "N"]}), + ("Eat blue ham", {"tags": ["V", "J", "N"]}), +] + + +def test_overfitting_IO(): + # Simple test to try and quickly overfit the tagger - ensuring the ML models work correctly + nlp = English() + tagger = nlp.create_pipe("tagger") + for tag, values in TAG_MAP.items(): + tagger.add_label(tag, values) + nlp.add_pipe(tagger) + optimizer = nlp.begin_training() + + for i in range(50): + losses = {} + nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses) + assert losses["tagger"] < 0.00001 + + # test the trained model + test_text = "I like blue eggs" + doc = nlp(test_text) + assert doc[0].tag_ == "N" + assert doc[1].tag_ == "V" + assert doc[2].tag_ == "J" + assert doc[3].tag_ == "N" + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(test_text) + assert doc2[0].tag_ == "N" + assert doc2[1].tag_ == "V" + assert doc2[2].tag_ == "J" + assert doc2[3].tag_ == "N" diff --git a/spacy/tests/pipeline/test_textcat.py b/spacy/tests/pipeline/test_textcat.py index b7db85056..725a4fd69 100644 --- a/spacy/tests/pipeline/test_textcat.py +++ b/spacy/tests/pipeline/test_textcat.py @@ -1,13 +1,22 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest import random import numpy.random + +from spacy import util +from spacy.lang.en import English from spacy.language import Language from spacy.pipeline import TextCategorizer from spacy.tokens import Doc from spacy.gold import GoldParse +from spacy.util import fix_random_seed + +from ..util import make_tempdir +from spacy.pipeline.defaults import default_tok2vec + +TRAIN_DATA = [ + ("I'm so happy.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}), + ("I'm so angry", {"cats": 
{"POSITIVE": 0.0, "NEGATIVE": 1.0}}), +] @pytest.mark.skip(reason="Test is flakey when run with others") @@ -24,7 +33,7 @@ def test_simple_train(): ("bbbbbbbbb", 0.0), ("aaaaaa", 1), ]: - nlp.update([text], [{"cats": {"answer": answer}}]) + nlp.update((text, {"cats": {"answer": answer}})) doc = nlp("aaa") assert "answer" in doc.cats assert doc.cats["answer"] >= 0.5 @@ -70,3 +79,67 @@ def test_label_types(): nlp.get_pipe("textcat").add_label("answer") with pytest.raises(ValueError): nlp.get_pipe("textcat").add_label(9) + + +def test_overfitting_IO(): + # Simple test to try and quickly overfit the textcat component - ensuring the ML models work correctly + fix_random_seed(0) + nlp = English() + textcat = nlp.create_pipe("textcat") + for _, annotations in TRAIN_DATA: + for label, value in annotations.get("cats").items(): + textcat.add_label(label) + nlp.add_pipe(textcat) + optimizer = nlp.begin_training() + + for i in range(50): + losses = {} + nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses) + assert losses["textcat"] < 0.01 + + # test the trained model + test_text = "I am happy." + doc = nlp(test_text) + cats = doc.cats + # note that by default, exclusive_classes = false so we need a bigger error margin + assert cats["POSITIVE"] > 0.9 + assert cats["POSITIVE"] + cats["NEGATIVE"] == pytest.approx(1.0, 0.1) + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(test_text) + cats2 = doc2.cats + assert cats2["POSITIVE"] > 0.9 + assert cats2["POSITIVE"] + cats2["NEGATIVE"] == pytest.approx(1.0, 0.1) + + +# fmt: off +@pytest.mark.parametrize( + "textcat_config", + [ + {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}, + {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False}, + {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True}, + {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True}, + {"@architectures": "spacy.TextCat.v1", "exclusive_classes": False, "ngram_size": 1, "pretrained_vectors": False, "width": 64, "conv_depth": 2, "embed_size": 2000, "window_size": 2}, + {"@architectures": "spacy.TextCat.v1", "exclusive_classes": True, "ngram_size": 5, "pretrained_vectors": False, "width": 128, "conv_depth": 2, "embed_size": 2000, "window_size": 1}, + {"@architectures": "spacy.TextCat.v1", "exclusive_classes": True, "ngram_size": 2, "pretrained_vectors": False, "width": 32, "conv_depth": 3, "embed_size": 500, "window_size": 3}, + {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": default_tok2vec(), "exclusive_classes": True}, + {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": default_tok2vec(), "exclusive_classes": False}, + ], +) +# fmt: on +def test_textcat_configs(textcat_config): + pipe_config = {"model": textcat_config} + nlp = English() + textcat = nlp.create_pipe("textcat", pipe_config) + for _, annotations in TRAIN_DATA: + for label, value in annotations.get("cats").items(): + textcat.add_label(label) + nlp.add_pipe(textcat) + optimizer = nlp.begin_training() + for i in range(5): + losses = {} + nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses) diff --git a/spacy/tests/regression/test_issue1-1000.py b/spacy/tests/regression/test_issue1-1000.py index 6d88d68c2..bfca72853 100644 --- a/spacy/tests/regression/test_issue1-1000.py +++ 
b/spacy/tests/regression/test_issue1-1000.py @@ -1,11 +1,8 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import random from spacy.matcher import Matcher from spacy.attrs import IS_PUNCT, ORTH, LOWER -from spacy.symbols import POS, VERB, VerbForm_inf +from spacy.symbols import POS, VERB from spacy.vocab import Vocab from spacy.language import Language from spacy.lemmatizer import Lemmatizer @@ -167,7 +164,7 @@ def test_issue590(en_vocab): def test_issue595(): """Test lemmatization of base forms""" words = ["Do", "n't", "feed", "the", "dog"] - tag_map = {"VB": {POS: VERB, VerbForm_inf: True}} + tag_map = {"VB": {POS: VERB, "VerbForm": "inf"}} lookups = Lookups() lookups.add_table("lemma_rules", {"verb": [["ed", "e"]]}) lookups.add_table("lemma_index", {"verb": {}}) @@ -451,7 +448,7 @@ def test_issue999(train_data): for itn in range(100): random.shuffle(TRAIN_DATA) for raw_text, entity_offsets in TRAIN_DATA: - nlp.update([raw_text], [{"entities": entity_offsets}]) + nlp.update((raw_text, {"entities": entity_offsets})) with make_tempdir() as model_dir: nlp.to_disk(model_dir) diff --git a/spacy/tests/regression/test_issue1001-1500.py b/spacy/tests/regression/test_issue1001-1500.py index 924c5aa3e..aaff951e5 100644 --- a/spacy/tests/regression/test_issue1001-1500.py +++ b/spacy/tests/regression/test_issue1001-1500.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import re from spacy.tokens import Doc @@ -11,7 +8,7 @@ from spacy.matcher import Matcher from spacy.tokenizer import Tokenizer from spacy.lemmatizer import Lemmatizer from spacy.lookups import Lookups -from spacy.symbols import ORTH, LEMMA, POS, VERB, VerbForm_part +from spacy.symbols import ORTH, LEMMA, POS, VERB def test_issue1061(): @@ -91,7 +88,7 @@ def test_issue1375(): def test_issue1387(): - tag_map = {"VBG": {POS: VERB, VerbForm_part: True}} + tag_map = {"VBG": {POS: VERB, "VerbForm": "part"}} lookups = Lookups() lookups.add_table("lemma_index", {"verb": ("cope", "cop")}) lookups.add_table("lemma_exc", {"verb": {"coping": ("cope",)}}) diff --git a/spacy/tests/regression/test_issue1501-2000.py b/spacy/tests/regression/test_issue1501-2000.py index e498417d1..5a76697bc 100644 --- a/spacy/tests/regression/test_issue1501-2000.py +++ b/spacy/tests/regression/test_issue1501-2000.py @@ -1,16 +1,16 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest import gc import numpy import copy + +from spacy.gold import Example from spacy.lang.en import English from spacy.lang.en.stop_words import STOP_WORDS from spacy.lang.lex_attrs import is_stop from spacy.vectors import Vectors from spacy.vocab import Vocab from spacy.language import Language +from spacy.pipeline.defaults import default_ner, default_tagger from spacy.tokens import Doc, Span, Token from spacy.pipeline import Tagger, EntityRecognizer from spacy.attrs import HEAD, DEP @@ -124,7 +124,7 @@ def test_issue1727(): correctly after vectors are added.""" data = numpy.ones((3, 300), dtype="f") vectors = Vectors(data=data, keys=["I", "am", "Matt"]) - tagger = Tagger(Vocab()) + tagger = Tagger(Vocab(), default_tagger()) tagger.add_label("PRP") with pytest.warns(UserWarning): tagger.begin_training() @@ -132,7 +132,7 @@ def test_issue1727(): tagger.vocab.vectors = vectors with make_tempdir() as path: tagger.to_disk(path) - tagger = Tagger(Vocab()).from_disk(path) + tagger = Tagger(Vocab(), default_tagger()).from_disk(path) assert tagger.cfg.get("pretrained_dims", 0) == 0 @@ -237,6 +237,7 @@ def 
test_issue1889(word): assert is_stop(word, STOP_WORDS) == is_stop(word.upper(), STOP_WORDS) +@pytest.mark.skip(reason="obsolete with the config refactor of v.3") def test_issue1915(): cfg = {"hidden_depth": 2} # should error out nlp = Language() @@ -269,10 +270,12 @@ def test_issue1963(en_tokenizer): @pytest.mark.parametrize("label", ["U-JOB-NAME"]) def test_issue1967(label): - ner = EntityRecognizer(Vocab()) - entry = ([0], ["word"], ["tag"], [0], ["dep"], [label]) - gold_parses = [(None, [(entry, None)])] - ner.moves.get_actions(gold_parses=gold_parses) + ner = EntityRecognizer(Vocab(), default_ner()) + example = Example(doc=None) + example.set_token_annotation( + ids=[0], words=["word"], tags=["tag"], heads=[0], deps=["dep"], entities=[label] + ) + ner.moves.get_actions(gold_parses=[example]) def test_issue1971(en_vocab): diff --git a/spacy/tests/regression/test_issue2001-2500.py b/spacy/tests/regression/test_issue2001-2500.py index 01f0f905c..67966f70e 100644 --- a/spacy/tests/regression/test_issue2001-2500.py +++ b/spacy/tests/regression/test_issue2001-2500.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest import numpy from spacy.tokens import Doc @@ -35,6 +32,10 @@ def test_issue2179(): nlp.begin_training() nlp2 = Italian() nlp2.add_pipe(nlp2.create_pipe("ner")) + + assert len(nlp2.get_pipe("ner").labels) == 0 + model = nlp2.get_pipe("ner").model + model.attrs["resize_output"](model, nlp.get_pipe("ner").moves.n_moves) nlp2.from_bytes(nlp.to_bytes()) assert "extra_labels" not in nlp2.get_pipe("ner").cfg assert nlp2.get_pipe("ner").labels == ("CITIZENSHIP",) diff --git a/spacy/tests/regression/test_issue2501-3000.py b/spacy/tests/regression/test_issue2501-3000.py index 1f5e44499..033e4f83e 100644 --- a/spacy/tests/regression/test_issue2501-3000.py +++ b/spacy/tests/regression/test_issue2501-3000.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy import displacy from spacy.lang.en import English @@ -11,7 +8,7 @@ from spacy.matcher import Matcher from spacy.tokens import Doc, Span from spacy.vocab import Vocab from spacy.compat import pickle -from spacy._ml import link_vectors_to_models +from spacy.util import link_vectors_to_models import numpy import random @@ -157,7 +154,7 @@ def test_issue2800(): losses = {} random.shuffle(train_data) for statement, entities in train_data: - nlp.update([statement], [entities], sgd=optimizer, losses=losses, drop=0.5) + nlp.update((statement, entities), sgd=optimizer, losses=losses, drop=0.5) def test_issue2822(it_tokenizer): diff --git a/spacy/tests/regression/test_issue3001-3500.py b/spacy/tests/regression/test_issue3001-3500.py index effbebb92..9ff118a1f 100644 --- a/spacy/tests/regression/test_issue3001-3500.py +++ b/spacy/tests/regression/test_issue3001-3500.py @@ -1,19 +1,16 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.lang.en import English from spacy.lang.de import German +from spacy.pipeline.defaults import default_ner from spacy.pipeline import EntityRuler, EntityRecognizer from spacy.matcher import Matcher, PhraseMatcher from spacy.tokens import Doc from spacy.vocab import Vocab from spacy.attrs import ENT_IOB, ENT_TYPE -from spacy.compat import pickle, is_python2, unescape_unicode +from spacy.compat import pickle from spacy import displacy from spacy.util import decaying import numpy -import re from spacy.vectors import Vectors from ..util import get_doc @@ -107,6 +104,8 @@ def test_issue3209(): assert 
ner.move_names == move_names nlp2 = English() nlp2.add_pipe(nlp2.create_pipe("ner")) + model = nlp2.get_pipe("ner").model + model.attrs["resize_output"](model, ner.moves.n_moves) nlp2.from_bytes(nlp.to_bytes()) assert nlp2.get_pipe("ner").move_names == move_names @@ -197,7 +196,7 @@ def test_issue3345(): doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"]) doc[4].is_sent_start = True ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}]) - ner = EntityRecognizer(doc.vocab) + ner = EntityRecognizer(doc.vocab, default_ner()) # Add the OUT action. I wouldn't have thought this would be necessary... ner.moves.add_action(5, "") ner.add_label("GPE") @@ -211,73 +210,6 @@ def test_issue3345(): assert ner.moves.is_valid(state, "B-GPE") -if is_python2: - # If we have this test in Python 3, pytest chokes, as it can't print the - # string above in the xpass message. - prefix_search = ( - b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])" - b"|^\xe2\x80\xa6|^\xe2\x80\xa6\xe2\x80\xa6|^,|^:|^;|^\\!|^\\?" - b"|^\xc2\xbf|^\xd8\x9f|^\xc2\xa1|^\\(|^\\)|^\\[|^\\]|^\\{|^\\}" - b"|^<|^>|^_|^#|^\\*|^&|^\xe3\x80\x82|^\xef\xbc\x9f|^\xef\xbc\x81|" - b"^\xef\xbc\x8c|^\xe3\x80\x81|^\xef\xbc\x9b|^\xef\xbc\x9a|" - b"^\xef\xbd\x9e|^\xc2\xb7|^\xe0\xa5\xa4|^\xd8\x8c|^\xd8\x9b|" - b"^\xd9\xaa|^\\.\\.+|^\xe2\x80\xa6|^\\'|^\"|^\xe2\x80\x9d|" - b"^\xe2\x80\x9c|^`|^\xe2\x80\x98|^\xc2\xb4|^\xe2\x80\x99|" - b"^\xe2\x80\x9a|^,|^\xe2\x80\x9e|^\xc2\xbb|^\xc2\xab|^\xe3\x80\x8c|" - b"^\xe3\x80\x8d|^\xe3\x80\x8e|^\xe3\x80\x8f|^\xef\xbc\x88|" - b"^\xef\xbc\x89|^\xe3\x80\x94|^\xe3\x80\x95|^\xe3\x80\x90|" - b"^\xe3\x80\x91|^\xe3\x80\x8a|^\xe3\x80\x8b|^\xe3\x80\x88|" - b"^\xe3\x80\x89|^\\$|^\xc2\xa3|^\xe2\x82\xac|^\xc2\xa5|^\xe0\xb8\xbf|" - b"^US\\$|^C\\$|^A\\$|^\xe2\x82\xbd|^\xef\xb7\xbc|^\xe2\x82\xb4|" - b"^[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F" - b"\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8" - b"\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17" - b"\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC" - b"\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940" - b"\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103" - b"-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125" - b"\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F" - b"\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4" - b"\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5" - b"-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B" - b"-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440" - b"-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2" - b"-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800" - b"-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76" - b"-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\u2BFE\\u2CE5-\\u2CEA\\u2E80" - b"-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u2FF0-\\u2FFB\\u3004" - b"\\u3012\\u3013\\u3020\\u3036\\u3037\\u303E\\u303F\\u3190\\u3191" - b"\\u3196-\\u319F\\u31C0-\\u31E3\\u3200-\\u321E\\u322A-\\u3247\\u3250" - b"\\u3260-\\u327F\\u328A-\\u32B0\\u32C0-\\u32FE\\u3300-\\u33FF\\u4DC0" - b"-\\u4DFF\\uA490-\\uA4C6\\uA828-\\uA82B\\uA836\\uA837\\uA839\\uAA77" - b"-\\uAA79\\uFDFD\\uFFE4\\uFFE8\\uFFED\\uFFEE\\uFFFC\\uFFFD\\U00010137" - b"-\\U0001013F\\U00010179-\\U00010189\\U0001018C-\\U0001018E" - b"\\U00010190-\\U0001019B\\U000101A0\\U000101D0-\\U000101FC\\U00010877" - 
b"\\U00010878\\U00010AC8\\U0001173F\\U00016B3C-\\U00016B3F\\U00016B45" - b"\\U0001BC9C\\U0001D000-\\U0001D0F5\\U0001D100-\\U0001D126\\U0001D129" - b"-\\U0001D164\\U0001D16A-\\U0001D16C\\U0001D183\\U0001D184\\U0001D18C" - b"-\\U0001D1A9\\U0001D1AE-\\U0001D1E8\\U0001D200-\\U0001D241\\U0001D245" - b"\\U0001D300-\\U0001D356\\U0001D800-\\U0001D9FF\\U0001DA37-\\U0001DA3A" - b"\\U0001DA6D-\\U0001DA74\\U0001DA76-\\U0001DA83\\U0001DA85\\U0001DA86" - b"\\U0001ECAC\\U0001F000-\\U0001F02B\\U0001F030-\\U0001F093\\U0001F0A0" - b"-\\U0001F0AE\\U0001F0B1-\\U0001F0BF\\U0001F0C1-\\U0001F0CF\\U0001F0D1" - b"-\\U0001F0F5\\U0001F110-\\U0001F16B\\U0001F170-\\U0001F1AC\\U0001F1E6" - b"-\\U0001F202\\U0001F210-\\U0001F23B\\U0001F240-\\U0001F248\\U0001F250" - b"\\U0001F251\\U0001F260-\\U0001F265\\U0001F300-\\U0001F3FA\\U0001F400" - b"-\\U0001F6D4\\U0001F6E0-\\U0001F6EC\\U0001F6F0-\\U0001F6F9\\U0001F700" - b"-\\U0001F773\\U0001F780-\\U0001F7D8\\U0001F800-\\U0001F80B\\U0001F810" - b"-\\U0001F847\\U0001F850-\\U0001F859\\U0001F860-\\U0001F887\\U0001F890" - b"-\\U0001F8AD\\U0001F900-\\U0001F90B\\U0001F910-\\U0001F93E\\U0001F940" - b"-\\U0001F970\\U0001F973-\\U0001F976\\U0001F97A\\U0001F97C-\\U0001F9A2" - b"\\U0001F9B0-\\U0001F9B9\\U0001F9C0-\\U0001F9C2\\U0001F9D0-\\U0001F9FF" - b"\\U0001FA60-\\U0001FA6D]" - ) - - def test_issue3356(): - pattern = re.compile(unescape_unicode(prefix_search.decode("utf8"))) - assert not pattern.search("hello") - - def test_issue3410(): texts = ["Hello world", "This is a test"] nlp = English() diff --git a/spacy/tests/regression/test_issue3521.py b/spacy/tests/regression/test_issue3521.py index 35731ac12..3d8ee9922 100644 --- a/spacy/tests/regression/test_issue3521.py +++ b/spacy/tests/regression/test_issue3521.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/regression/test_issue3526.py b/spacy/tests/regression/test_issue3526.py index c6f513730..aa77028fb 100644 --- a/spacy/tests/regression/test_issue3526.py +++ b/spacy/tests/regression/test_issue3526.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.tokens import Span from spacy.language import Language diff --git a/spacy/tests/regression/test_issue3531.py b/spacy/tests/regression/test_issue3531.py index 7b9d0bd2a..4c65a5bfe 100644 --- a/spacy/tests/regression/test_issue3531.py +++ b/spacy/tests/regression/test_issue3531.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy import displacy diff --git a/spacy/tests/regression/test_issue3540.py b/spacy/tests/regression/test_issue3540.py index 19d89c797..be9e04b0b 100644 --- a/spacy/tests/regression/test_issue3540.py +++ b/spacy/tests/regression/test_issue3540.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.tokens import Doc import numpy as np diff --git a/spacy/tests/regression/test_issue3549.py b/spacy/tests/regression/test_issue3549.py index 587b3a857..b3af59c2e 100644 --- a/spacy/tests/regression/test_issue3549.py +++ b/spacy/tests/regression/test_issue3549.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.matcher import Matcher from spacy.errors import MatchPatternError diff --git a/spacy/tests/regression/test_issue3555.py b/spacy/tests/regression/test_issue3555.py index 8444f11f2..de047bcbc 100644 --- a/spacy/tests/regression/test_issue3555.py +++ b/spacy/tests/regression/test_issue3555.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ 
import unicode_literals - import pytest from spacy.tokens import Doc, Token from spacy.matcher import Matcher diff --git a/spacy/tests/regression/test_issue3611.py b/spacy/tests/regression/test_issue3611.py index 3c4836264..cab68793c 100644 --- a/spacy/tests/regression/test_issue3611.py +++ b/spacy/tests/regression/test_issue3611.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import spacy from spacy.util import minibatch, compounding @@ -34,18 +31,13 @@ def test_issue3611(): nlp.add_pipe(textcat, last=True) # training the network - with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]): - optimizer = nlp.begin_training() + with nlp.select_pipes(enable="textcat"): + optimizer = nlp.begin_training(X=x_train, Y=y_train) for i in range(3): losses = {} batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001)) for batch in batches: - texts, annotations = zip(*batch) nlp.update( - docs=texts, - golds=annotations, - sgd=optimizer, - drop=0.1, - losses=losses, + examples=batch, sgd=optimizer, drop=0.1, losses=losses, ) diff --git a/spacy/tests/regression/test_issue3625.py b/spacy/tests/regression/test_issue3625.py index d935db17f..51561b3ac 100644 --- a/spacy/tests/regression/test_issue3625.py +++ b/spacy/tests/regression/test_issue3625.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.hi import Hindi diff --git a/spacy/tests/regression/test_issue3803.py b/spacy/tests/regression/test_issue3803.py index 37d15a5cf..ab5250edf 100644 --- a/spacy/tests/regression/test_issue3803.py +++ b/spacy/tests/regression/test_issue3803.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.es import Spanish diff --git a/spacy/tests/regression/test_issue3830.py b/spacy/tests/regression/test_issue3830.py index 54ce10924..3d8e80847 100644 --- a/spacy/tests/regression/test_issue3830.py +++ b/spacy/tests/regression/test_issue3830.py @@ -1,10 +1,12 @@ from spacy.pipeline.pipes import DependencyParser from spacy.vocab import Vocab +from spacy.pipeline.defaults import default_parser + def test_issue3830_no_subtok(): """Test that the parser doesn't have subtok label if not learn_tokens""" - parser = DependencyParser(Vocab()) + parser = DependencyParser(Vocab(), default_parser()) parser.add_label("nsubj") assert "subtok" not in parser.labels parser.begin_training(lambda: []) @@ -13,7 +15,7 @@ def test_issue3830_no_subtok(): def test_issue3830_with_subtok(): """Test that the parser does have subtok label if learn_tokens=True.""" - parser = DependencyParser(Vocab(), learn_tokens=True) + parser = DependencyParser(Vocab(), default_parser(), learn_tokens=True) parser.add_label("nsubj") assert "subtok" not in parser.labels parser.begin_training(lambda: []) diff --git a/spacy/tests/regression/test_issue3839.py b/spacy/tests/regression/test_issue3839.py index fe722a681..27b1f5f29 100644 --- a/spacy/tests/regression/test_issue3839.py +++ b/spacy/tests/regression/test_issue3839.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.matcher import Matcher from spacy.tokens import Doc diff --git a/spacy/tests/regression/test_issue3869.py b/spacy/tests/regression/test_issue3869.py index 62e8eabd6..0a851e869 100644 --- a/spacy/tests/regression/test_issue3869.py +++ b/spacy/tests/regression/test_issue3869.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.attrs import IS_ALPHA from spacy.lang.en import English diff 
--git a/spacy/tests/regression/test_issue3879.py b/spacy/tests/regression/test_issue3879.py index 5cd245231..8500c09aa 100644 --- a/spacy/tests/regression/test_issue3879.py +++ b/spacy/tests/regression/test_issue3879.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.matcher import Matcher from spacy.tokens import Doc diff --git a/spacy/tests/regression/test_issue3880.py b/spacy/tests/regression/test_issue3880.py index c060473f5..6e8ab6f43 100644 --- a/spacy/tests/regression/test_issue3880.py +++ b/spacy/tests/regression/test_issue3880.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.en import English import pytest diff --git a/spacy/tests/regression/test_issue3882.py b/spacy/tests/regression/test_issue3882.py index 1b2dcea25..fa616db1d 100644 --- a/spacy/tests/regression/test_issue3882.py +++ b/spacy/tests/regression/test_issue3882.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.displacy import parse_deps from spacy.tokens import Doc diff --git a/spacy/tests/regression/test_issue3951.py b/spacy/tests/regression/test_issue3951.py index 33230112f..6e4c9eeaa 100644 --- a/spacy/tests/regression/test_issue3951.py +++ b/spacy/tests/regression/test_issue3951.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.matcher import Matcher from spacy.tokens import Doc diff --git a/spacy/tests/regression/test_issue3959.py b/spacy/tests/regression/test_issue3959.py index c1f7fe100..7db28a31f 100644 --- a/spacy/tests/regression/test_issue3959.py +++ b/spacy/tests/regression/test_issue3959.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.en import English from ..util import make_tempdir diff --git a/spacy/tests/regression/test_issue3962.py b/spacy/tests/regression/test_issue3962.py index ae60fa0fa..971c9b08e 100644 --- a/spacy/tests/regression/test_issue3962.py +++ b/spacy/tests/regression/test_issue3962.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from ..util import get_doc diff --git a/spacy/tests/regression/test_issue3972.py b/spacy/tests/regression/test_issue3972.py index 22b8d486e..fe5388950 100644 --- a/spacy/tests/regression/test_issue3972.py +++ b/spacy/tests/regression/test_issue3972.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.matcher import PhraseMatcher from spacy.tokens import Doc diff --git a/spacy/tests/regression/test_issue4002.py b/spacy/tests/regression/test_issue4002.py index d075128aa..3ac26d3ab 100644 --- a/spacy/tests/regression/test_issue4002.py +++ b/spacy/tests/regression/test_issue4002.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.matcher import PhraseMatcher from spacy.tokens import Doc diff --git a/spacy/tests/regression/test_issue4030.py b/spacy/tests/regression/test_issue4030.py index ed219573f..b641213ad 100644 --- a/spacy/tests/regression/test_issue4030.py +++ b/spacy/tests/regression/test_issue4030.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import spacy from spacy.util import minibatch, compounding @@ -34,20 +31,15 @@ def test_issue4030(): nlp.add_pipe(textcat, last=True) # training the network - with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]): + with nlp.select_pipes(enable="textcat"): optimizer = nlp.begin_training() for i in range(3): losses = {} batches = minibatch(train_data, 
size=compounding(4.0, 32.0, 1.001)) for batch in batches: - texts, annotations = zip(*batch) nlp.update( - docs=texts, - golds=annotations, - sgd=optimizer, - drop=0.1, - losses=losses, + examples=batch, sgd=optimizer, drop=0.1, losses=losses, ) # processing of an empty doc should result in 0.0 for all categories diff --git a/spacy/tests/regression/test_issue4042.py b/spacy/tests/regression/test_issue4042.py index 00a8882d3..30081543b 100644 --- a/spacy/tests/regression/test_issue4042.py +++ b/spacy/tests/regression/test_issue4042.py @@ -1,11 +1,9 @@ -# coding: utf8 -from __future__ import unicode_literals - import spacy from spacy.pipeline import EntityRecognizer, EntityRuler from spacy.lang.en import English from spacy.tokens import Span from spacy.util import ensure_path +from spacy.pipeline.defaults import default_ner from ..util import make_tempdir @@ -76,6 +74,6 @@ def test_issue4042_bug2(): output_dir.mkdir() ner1.to_disk(output_dir) - ner2 = EntityRecognizer(vocab) + ner2 = EntityRecognizer(vocab, default_ner()) ner2.from_disk(output_dir) assert len(ner2.labels) == 2 diff --git a/spacy/tests/regression/test_issue4054.py b/spacy/tests/regression/test_issue4054.py index cc84cebf8..c52ded395 100644 --- a/spacy/tests/regression/test_issue4054.py +++ b/spacy/tests/regression/test_issue4054.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.vocab import Vocab import spacy from spacy.lang.en import English diff --git a/spacy/tests/regression/test_issue4120.py b/spacy/tests/regression/test_issue4120.py index d288f46c4..4849aa238 100644 --- a/spacy/tests/regression/test_issue4120.py +++ b/spacy/tests/regression/test_issue4120.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.matcher import Matcher from spacy.tokens import Doc diff --git a/spacy/tests/regression/test_issue4133.py b/spacy/tests/regression/test_issue4133.py index 93262f8cf..a726806d7 100644 --- a/spacy/tests/regression/test_issue4133.py +++ b/spacy/tests/regression/test_issue4133.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.en import English from spacy.tokens import Doc from spacy.vocab import Vocab diff --git a/spacy/tests/regression/test_issue4190.py b/spacy/tests/regression/test_issue4190.py index eb4eb8648..97d532d2a 100644 --- a/spacy/tests/regression/test_issue4190.py +++ b/spacy/tests/regression/test_issue4190.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.en import English from spacy.tokenizer import Tokenizer from spacy import util diff --git a/spacy/tests/regression/test_issue4267.py b/spacy/tests/regression/test_issue4267.py index ef871bf9f..891f03b30 100644 --- a/spacy/tests/regression/test_issue4267.py +++ b/spacy/tests/regression/test_issue4267.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.en import English from spacy.pipeline import EntityRuler diff --git a/spacy/tests/regression/test_issue4272.py b/spacy/tests/regression/test_issue4272.py index c57704d71..4bac97a44 100644 --- a/spacy/tests/regression/test_issue4272.py +++ b/spacy/tests/regression/test_issue4272.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.el import Greek diff --git a/spacy/tests/regression/test_issue4278.py b/spacy/tests/regression/test_issue4278.py index cb09340ff..ffbc41226 100644 --- a/spacy/tests/regression/test_issue4278.py +++ b/spacy/tests/regression/test_issue4278.py @@ 
-1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.language import Language from spacy.pipeline import Pipe diff --git a/spacy/tests/regression/test_issue4313.py b/spacy/tests/regression/test_issue4313.py index c68f745a7..ba4d2deab 100644 --- a/spacy/tests/regression/test_issue4313.py +++ b/spacy/tests/regression/test_issue4313.py @@ -1,8 +1,6 @@ -# coding: utf8 -from __future__ import unicode_literals - from collections import defaultdict +from spacy.pipeline.defaults import default_ner from spacy.pipeline import EntityRecognizer from spacy.lang.en import English @@ -14,7 +12,7 @@ def test_issue4313(): beam_width = 16 beam_density = 0.0001 nlp = English() - ner = EntityRecognizer(nlp.vocab) + ner = EntityRecognizer(nlp.vocab, default_ner()) ner.add_label("SOME_LABEL") ner.begin_training([]) nlp.add_pipe(ner) diff --git a/spacy/tests/regression/test_issue4348.py b/spacy/tests/regression/test_issue4348.py index d2e27d563..4978e0c8e 100644 --- a/spacy/tests/regression/test_issue4348.py +++ b/spacy/tests/regression/test_issue4348.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.en import English from spacy.util import minibatch, compounding import pytest @@ -21,5 +18,4 @@ def test_issue4348(): losses = {} batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)) for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, losses=losses) + nlp.update(batch, sgd=optimizer, losses=losses) diff --git a/spacy/tests/regression/test_issue4367.py b/spacy/tests/regression/test_issue4367.py index ab6192744..917847a05 100644 --- a/spacy/tests/regression/test_issue4367.py +++ b/spacy/tests/regression/test_issue4367.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.tokens import DocBin diff --git a/spacy/tests/regression/test_issue4373.py b/spacy/tests/regression/test_issue4373.py index 57d7547da..dbde1624e 100644 --- a/spacy/tests/regression/test_issue4373.py +++ b/spacy/tests/regression/test_issue4373.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.matcher import Matcher, PhraseMatcher from spacy.vocab import Vocab diff --git a/spacy/tests/regression/test_issue4402.py b/spacy/tests/regression/test_issue4402.py index d3b4bdf9a..80d37b1e6 100644 --- a/spacy/tests/regression/test_issue4402.py +++ b/spacy/tests/regression/test_issue4402.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import srsly from spacy.gold import GoldCorpus from spacy.lang.en import English @@ -11,15 +8,14 @@ from ..util import make_tempdir def test_issue4402(): nlp = English() with make_tempdir() as tmpdir: - print("temp", tmpdir) json_path = tmpdir / "test4402.json" srsly.write_json(json_path, json_data) corpus = GoldCorpus(str(json_path), str(json_path)) - train_docs = list(corpus.train_docs(nlp, gold_preproc=True, max_length=0)) + train_data = list(corpus.train_dataset(nlp, gold_preproc=True, max_length=0)) # assert that the data got split into 4 sentences - assert len(train_docs) == 4 + assert len(train_data) == 4 json_data = [ diff --git a/spacy/tests/regression/test_issue4528.py b/spacy/tests/regression/test_issue4528.py index 460449003..6f96c9f2d 100644 --- a/spacy/tests/regression/test_issue4528.py +++ b/spacy/tests/regression/test_issue4528.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.tokens import Doc, DocBin diff --git 
a/spacy/tests/regression/test_issue4529.py b/spacy/tests/regression/test_issue4529.py index 381957be6..fa962c053 100644 --- a/spacy/tests/regression/test_issue4529.py +++ b/spacy/tests/regression/test_issue4529.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.gold import GoldParse diff --git a/spacy/tests/regression/test_issue4590.py b/spacy/tests/regression/test_issue4590.py index 3d01cd487..fc49c5117 100644 --- a/spacy/tests/regression/test_issue4590.py +++ b/spacy/tests/regression/test_issue4590.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - from mock import Mock from spacy.matcher import DependencyMatcher from ..util import get_doc diff --git a/spacy/tests/regression/test_issue4651.py b/spacy/tests/regression/test_issue4651.py index eb49f4a38..3f6c1a57c 100644 --- a/spacy/tests/regression/test_issue4651.py +++ b/spacy/tests/regression/test_issue4651.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - from spacy.lang.en import English from spacy.pipeline import EntityRuler diff --git a/spacy/tests/regression/test_issue4674.py b/spacy/tests/regression/test_issue4674.py index 8fa4f9259..149e1431b 100644 --- a/spacy/tests/regression/test_issue4674.py +++ b/spacy/tests/regression/test_issue4674.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.kb import KnowledgeBase from spacy.util import ensure_path diff --git a/spacy/tests/regression/test_issue4707.py b/spacy/tests/regression/test_issue4707.py index e710881d7..d9798ef84 100644 --- a/spacy/tests/regression/test_issue4707.py +++ b/spacy/tests/regression/test_issue4707.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.util import load_model_from_path from spacy.lang.en import English diff --git a/spacy/tests/regression/test_issue4725.py b/spacy/tests/regression/test_issue4725.py index 57675a202..967db5d67 100644 --- a/spacy/tests/regression/test_issue4725.py +++ b/spacy/tests/regression/test_issue4725.py @@ -1,6 +1,4 @@ -# coding: utf8 -from __future__ import unicode_literals - +import pytest import numpy from spacy.lang.en import English diff --git a/spacy/tests/regression/test_issue4849.py b/spacy/tests/regression/test_issue4849.py index 5c7ffc999..ddbf6f7a0 100644 --- a/spacy/tests/regression/test_issue4849.py +++ b/spacy/tests/regression/test_issue4849.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.en import English from spacy.pipeline import EntityRuler diff --git a/spacy/tests/regression/test_issue4903.py b/spacy/tests/regression/test_issue4903.py index d467b1cd6..a3dff16aa 100644 --- a/spacy/tests/regression/test_issue4903.py +++ b/spacy/tests/regression/test_issue4903.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.en import English from spacy.tokens import Span, Doc diff --git a/spacy/tests/regression/test_issue4924.py b/spacy/tests/regression/test_issue4924.py index 0e45291a9..b240f6d4a 100644 --- a/spacy/tests/regression/test_issue4924.py +++ b/spacy/tests/regression/test_issue4924.py @@ -1,16 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest - -import spacy +from spacy.language import Language -@pytest.fixture -def nlp(): - return spacy.blank("en") - - -def test_issue4924(nlp): +def test_issue4924(): + nlp = Language() docs_golds = [("", {})] nlp.evaluate(docs_golds) diff --git 
a/spacy/tests/regression/test_issue5048.py b/spacy/tests/regression/test_issue5048.py index 228322493..bc52ae82f 100644 --- a/spacy/tests/regression/test_issue5048.py +++ b/spacy/tests/regression/test_issue5048.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import numpy from spacy.tokens import Doc from spacy.attrs import DEP, POS, TAG diff --git a/spacy/tests/regression/test_issue5082.py b/spacy/tests/regression/test_issue5082.py index efa5d39f2..52a52b177 100644 --- a/spacy/tests/regression/test_issue5082.py +++ b/spacy/tests/regression/test_issue5082.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import numpy as np from spacy.lang.en import English from spacy.pipeline import EntityRuler diff --git a/spacy/tests/regression/test_issue5141.py b/spacy/tests/regression/test_issue5141.py new file mode 100644 index 000000000..845454583 --- /dev/null +++ b/spacy/tests/regression/test_issue5141.py @@ -0,0 +1,11 @@ +from spacy.tokens import DocBin + + +def test_issue5141(en_vocab): + """ Ensure an empty DocBin does not crash on serialization """ + doc_bin = DocBin(attrs=["DEP", "HEAD"]) + assert list(doc_bin.get_docs(en_vocab)) == [] + doc_bin_bytes = doc_bin.to_bytes() + + doc_bin_2 = DocBin().from_bytes(doc_bin_bytes) + assert list(doc_bin_2.get_docs(en_vocab)) == [] diff --git a/spacy/tests/serialize/test_serialize_config.py b/spacy/tests/serialize/test_serialize_config.py new file mode 100644 index 000000000..ba63adfa4 --- /dev/null +++ b/spacy/tests/serialize/test_serialize_config.py @@ -0,0 +1,136 @@ +from thinc.api import Config + +import spacy +from spacy import util +from spacy.lang.en import English +from spacy.util import registry + +from ..util import make_tempdir +from ...ml.models import build_Tok2Vec_model, build_tb_parser_model + +nlp_config_string = """ +[nlp] +lang = "en" + +[nlp.pipeline.tok2vec] +factory = "tok2vec" + +[nlp.pipeline.tok2vec.model] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 342 +depth = 4 +window_size = 1 +embed_size = 2000 +maxout_pieces = 3 +subword_features = true + +[nlp.pipeline.tagger] +factory = "tagger" + +[nlp.pipeline.tagger.model] +@architectures = "spacy.Tagger.v1" + +[nlp.pipeline.tagger.model.tok2vec] +@architectures = "spacy.Tok2VecTensors.v1" +width = ${nlp.pipeline.tok2vec.model:width} +""" + + +parser_config_string = """ +[model] +@architectures = "spacy.TransitionBasedParser.v1" +nr_feature_tokens = 99 +hidden_width = 66 +maxout_pieces = 2 + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 333 +depth = 4 +embed_size = 5555 +window_size = 1 +maxout_pieces = 7 +subword_features = false +""" + + +@registry.architectures.register("my_test_parser") +def my_parser(): + tok2vec = build_Tok2Vec_model( + width=321, + embed_size=5432, + pretrained_vectors=None, + window_size=3, + maxout_pieces=4, + subword_features=True, + char_embed=True, + nM=64, + nC=8, + conv_depth=2, + bilstm_depth=0, + ) + parser = build_tb_parser_model( + tok2vec=tok2vec, nr_feature_tokens=7, hidden_width=65, maxout_pieces=5 + ) + return parser + + +def test_serialize_nlp(): + """ Create a custom nlp pipeline from config and ensure it serializes it correctly """ + nlp_config = Config().from_str(nlp_config_string) + nlp = util.load_model_from_config(nlp_config["nlp"]) + nlp.begin_training() + assert "tok2vec" in nlp.pipe_names + assert "tagger" in nlp.pipe_names + assert "parser" not in nlp.pipe_names + assert 
nlp.get_pipe("tagger").model.get_ref("tok2vec").get_dim("nO") == 342 + + with make_tempdir() as d: + nlp.to_disk(d) + nlp2 = spacy.load(d) + assert "tok2vec" in nlp2.pipe_names + assert "tagger" in nlp2.pipe_names + assert "parser" not in nlp2.pipe_names + assert nlp2.get_pipe("tagger").model.get_ref("tok2vec").get_dim("nO") == 342 + + +def test_serialize_custom_nlp(): + """ Create a custom nlp pipeline and ensure it serializes it correctly""" + nlp = English() + parser_cfg = dict() + parser_cfg["model"] = {"@architectures": "my_test_parser"} + parser = nlp.create_pipe("parser", parser_cfg) + nlp.add_pipe(parser) + nlp.begin_training() + + with make_tempdir() as d: + nlp.to_disk(d) + nlp2 = spacy.load(d) + model = nlp2.get_pipe("parser").model + tok2vec = model.get_ref("tok2vec") + upper = model.get_ref("upper") + + # check that we have the correct settings, not the default ones + assert upper.get_dim("nI") == 65 + + +def test_serialize_parser(): + """ Create a non-default parser config to check nlp serializes it correctly """ + nlp = English() + model_config = Config().from_str(parser_config_string) + parser = nlp.create_pipe("parser", config=model_config) + parser.add_label("nsubj") + nlp.add_pipe(parser) + nlp.begin_training() + + with make_tempdir() as d: + nlp.to_disk(d) + nlp2 = spacy.load(d) + model = nlp2.get_pipe("parser").model + tok2vec = model.get_ref("tok2vec") + upper = model.get_ref("upper") + + # check that we have the correct settings, not the default ones + assert upper.get_dim("nI") == 66 diff --git a/spacy/tests/serialize/test_serialize_doc.py b/spacy/tests/serialize/test_serialize_doc.py index ef2b1ee89..615bb1cd9 100644 --- a/spacy/tests/serialize/test_serialize_doc.py +++ b/spacy/tests/serialize/test_serialize_doc.py @@ -1,13 +1,7 @@ -# coding: utf-8 -from __future__ import unicode_literals - import spacy - import pytest - from spacy.lang.en import English from spacy.tokens import Doc, DocBin -from spacy.compat import path2str from ..util import make_tempdir @@ -43,7 +37,7 @@ def test_serialize_doc_roundtrip_disk_str_path(en_vocab): doc = Doc(en_vocab, words=["hello", "world"]) with make_tempdir() as d: file_path = d / "doc" - file_path = path2str(file_path) + file_path = str(file_path) doc.to_disk(file_path) doc_d = Doc(en_vocab).from_disk(file_path) assert doc.to_bytes() == doc_d.to_bytes() diff --git a/spacy/tests/serialize/test_serialize_extension_attrs.py b/spacy/tests/serialize/test_serialize_extension_attrs.py index 45c2e3909..9cfa1a552 100644 --- a/spacy/tests/serialize/test_serialize_extension_attrs.py +++ b/spacy/tests/serialize/test_serialize_extension_attrs.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.tokens import Doc, Token from spacy.vocab import Vocab @@ -10,9 +7,7 @@ from spacy.vocab import Vocab def doc_w_attrs(en_tokenizer): Doc.set_extension("_test_attr", default=False) Doc.set_extension("_test_prop", getter=lambda doc: len(doc.text)) - Doc.set_extension( - "_test_method", method=lambda doc, arg: "{}{}".format(len(doc.text), arg) - ) + Doc.set_extension("_test_method", method=lambda doc, arg: f"{len(doc.text)}{arg}") doc = en_tokenizer("This is a test.") doc._._test_attr = "test" @@ -28,8 +23,7 @@ def test_serialize_ext_attrs_from_bytes(doc_w_attrs): assert doc._.has("_test_attr") assert doc._._test_attr == "test" assert doc._._test_prop == len(doc.text) - assert doc._._test_method("test") == "{}{}".format(len(doc.text), "test") - + assert doc._._test_method("test") == 
f"{len(doc.text)}test" assert doc[0]._._test_token == "t0" assert doc[1]._._test_token == "t1" assert doc[2]._._test_token == "t0" diff --git a/spacy/tests/serialize/test_serialize_kb.py b/spacy/tests/serialize/test_serialize_kb.py index b19c11864..91036a496 100644 --- a/spacy/tests/serialize/test_serialize_kb.py +++ b/spacy/tests/serialize/test_serialize_kb.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - from spacy.util import ensure_path from spacy.kb import KnowledgeBase diff --git a/spacy/tests/serialize/test_serialize_language.py b/spacy/tests/serialize/test_serialize_language.py index efc5d181c..0e3b7c59f 100644 --- a/spacy/tests/serialize/test_serialize_language.py +++ b/spacy/tests/serialize/test_serialize_language.py @@ -1,8 +1,6 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import re + from spacy.language import Language from spacy.tokenizer import Tokenizer @@ -59,7 +57,7 @@ def test_serialize_language_exclude(meta_data): nlp = Language(meta=meta_data) assert nlp.meta["name"] == name new_nlp = Language().from_bytes(nlp.to_bytes()) - assert nlp.meta["name"] == name + assert new_nlp.meta["name"] == name new_nlp = Language().from_bytes(nlp.to_bytes(), exclude=["meta"]) assert not new_nlp.meta["name"] == name new_nlp = Language().from_bytes(nlp.to_bytes(exclude=["meta"])) diff --git a/spacy/tests/serialize/test_serialize_pipeline.py b/spacy/tests/serialize/test_serialize_pipeline.py index efa7ef625..4fc277c4f 100644 --- a/spacy/tests/serialize/test_serialize_pipeline.py +++ b/spacy/tests/serialize/test_serialize_pipeline.py @@ -1,9 +1,8 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.pipeline import Tagger, DependencyParser, EntityRecognizer -from spacy.pipeline import Tensorizer, TextCategorizer +from spacy.pipeline import Tensorizer, TextCategorizer, SentenceRecognizer +from spacy.pipeline.defaults import default_parser, default_tensorizer, default_tagger +from spacy.pipeline.defaults import default_textcat, default_senter from ..util import make_tempdir @@ -13,58 +12,58 @@ test_parsers = [DependencyParser, EntityRecognizer] @pytest.fixture def parser(en_vocab): - parser = DependencyParser(en_vocab) + parser = DependencyParser(en_vocab, default_parser()) parser.add_label("nsubj") - parser.model, cfg = parser.Model(parser.moves.n_moves) - parser.cfg.update(cfg) return parser @pytest.fixture def blank_parser(en_vocab): - parser = DependencyParser(en_vocab) + parser = DependencyParser(en_vocab, default_parser()) return parser @pytest.fixture def taggers(en_vocab): - tagger1 = Tagger(en_vocab) - tagger2 = Tagger(en_vocab) - tagger1.model = tagger1.Model(8) - tagger2.model = tagger1.model - return (tagger1, tagger2) + model = default_tagger() + tagger1 = Tagger(en_vocab, model) + tagger2 = Tagger(en_vocab, model) + return tagger1, tagger2 @pytest.mark.parametrize("Parser", test_parsers) def test_serialize_parser_roundtrip_bytes(en_vocab, Parser): - parser = Parser(en_vocab) - parser.model, _ = parser.Model(10) - new_parser = Parser(en_vocab) - new_parser.model, _ = new_parser.Model(10) + parser = Parser(en_vocab, default_parser()) + new_parser = Parser(en_vocab, default_parser()) new_parser = new_parser.from_bytes(parser.to_bytes(exclude=["vocab"])) - assert new_parser.to_bytes(exclude=["vocab"]) == parser.to_bytes(exclude=["vocab"]) + bytes_2 = new_parser.to_bytes(exclude=["vocab"]) + bytes_3 = parser.to_bytes(exclude=["vocab"]) + assert len(bytes_2) == len(bytes_3) + assert bytes_2 
== bytes_3 @pytest.mark.parametrize("Parser", test_parsers) def test_serialize_parser_roundtrip_disk(en_vocab, Parser): - parser = Parser(en_vocab) - parser.model, _ = parser.Model(0) + parser = Parser(en_vocab, default_parser()) with make_tempdir() as d: file_path = d / "parser" parser.to_disk(file_path) - parser_d = Parser(en_vocab) - parser_d.model, _ = parser_d.Model(0) + parser_d = Parser(en_vocab, default_parser()) parser_d = parser_d.from_disk(file_path) parser_bytes = parser.to_bytes(exclude=["model", "vocab"]) parser_d_bytes = parser_d.to_bytes(exclude=["model", "vocab"]) + assert len(parser_bytes) == len(parser_d_bytes) assert parser_bytes == parser_d_bytes def test_to_from_bytes(parser, blank_parser): assert parser.model is not True - assert blank_parser.model is True + assert blank_parser.model is not True assert blank_parser.moves.n_moves != parser.moves.n_moves bytes_data = parser.to_bytes(exclude=["vocab"]) + + # the blank parser needs to be resized before we can call from_bytes + blank_parser.model.attrs["resize_output"](blank_parser.model, parser.moves.n_moves) blank_parser.from_bytes(bytes_data) assert blank_parser.model is not True assert blank_parser.moves.n_moves == parser.moves.n_moves @@ -78,8 +77,10 @@ def test_serialize_tagger_roundtrip_bytes(en_vocab, taggers): tagger1_b = tagger1.to_bytes() tagger1 = tagger1.from_bytes(tagger1_b) assert tagger1.to_bytes() == tagger1_b - new_tagger1 = Tagger(en_vocab).from_bytes(tagger1_b) - assert new_tagger1.to_bytes() == tagger1_b + new_tagger1 = Tagger(en_vocab, default_tagger()).from_bytes(tagger1_b) + new_tagger1_b = new_tagger1.to_bytes() + assert len(new_tagger1_b) == len(tagger1_b) + assert new_tagger1_b == tagger1_b def test_serialize_tagger_roundtrip_disk(en_vocab, taggers): @@ -89,26 +90,24 @@ def test_serialize_tagger_roundtrip_disk(en_vocab, taggers): file_path2 = d / "tagger2" tagger1.to_disk(file_path1) tagger2.to_disk(file_path2) - tagger1_d = Tagger(en_vocab).from_disk(file_path1) - tagger2_d = Tagger(en_vocab).from_disk(file_path2) + tagger1_d = Tagger(en_vocab, default_tagger()).from_disk(file_path1) + tagger2_d = Tagger(en_vocab, default_tagger()).from_disk(file_path2) assert tagger1_d.to_bytes() == tagger2_d.to_bytes() def test_serialize_tensorizer_roundtrip_bytes(en_vocab): - tensorizer = Tensorizer(en_vocab) - tensorizer.model = tensorizer.Model() + tensorizer = Tensorizer(en_vocab, default_tensorizer()) tensorizer_b = tensorizer.to_bytes(exclude=["vocab"]) - new_tensorizer = Tensorizer(en_vocab).from_bytes(tensorizer_b) + new_tensorizer = Tensorizer(en_vocab, default_tensorizer()).from_bytes(tensorizer_b) assert new_tensorizer.to_bytes(exclude=["vocab"]) == tensorizer_b def test_serialize_tensorizer_roundtrip_disk(en_vocab): - tensorizer = Tensorizer(en_vocab) - tensorizer.model = tensorizer.Model() + tensorizer = Tensorizer(en_vocab, default_tensorizer()) with make_tempdir() as d: file_path = d / "tensorizer" tensorizer.to_disk(file_path) - tensorizer_d = Tensorizer(en_vocab).from_disk(file_path) + tensorizer_d = Tensorizer(en_vocab, default_tensorizer()).from_disk(file_path) assert tensorizer.to_bytes(exclude=["vocab"]) == tensorizer_d.to_bytes( exclude=["vocab"] ) @@ -116,19 +115,19 @@ def test_serialize_tensorizer_roundtrip_disk(en_vocab): def test_serialize_textcat_empty(en_vocab): # See issue #1105 - textcat = TextCategorizer(en_vocab, labels=["ENTITY", "ACTION", "MODIFIER"]) + textcat = TextCategorizer( + en_vocab, default_textcat(), labels=["ENTITY", "ACTION", "MODIFIER"] + ) 
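The serialization tests in this hunk all repeat the same round-trip pattern: write the component (or pipeline) to a temporary directory, reload it into a freshly constructed instance, and compare the byte representations (the added `len(...)` comparisons make a size mismatch visible before the full byte-equality check). A minimal sketch of that pattern for a `Doc`, assuming a blank English pipeline rather than the fixtures used in these tests:

    import tempfile
    from pathlib import Path
    from spacy.lang.en import English
    from spacy.tokens import Doc

    nlp = English()
    doc = nlp("hello world")
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "doc"
        doc.to_disk(path)                       # serialize to disk
        doc2 = Doc(nlp.vocab).from_disk(path)   # reload into a fresh Doc
    assert doc.to_bytes() == doc2.to_bytes()    # byte-for-byte round trip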
textcat.to_bytes(exclude=["vocab"]) @pytest.mark.parametrize("Parser", test_parsers) def test_serialize_pipe_exclude(en_vocab, Parser): def get_new_parser(): - new_parser = Parser(en_vocab) - new_parser.model, _ = new_parser.Model(0) + new_parser = Parser(en_vocab, default_parser()) return new_parser - parser = Parser(en_vocab) - parser.model, _ = parser.Model(0) + parser = Parser(en_vocab, default_parser()) parser.cfg["foo"] = "bar" new_parser = get_new_parser().from_bytes(parser.to_bytes(exclude=["vocab"])) assert "foo" in new_parser.cfg @@ -144,3 +143,10 @@ def test_serialize_pipe_exclude(en_vocab, Parser): parser.to_bytes(cfg=False, exclude=["vocab"]) with pytest.raises(ValueError): get_new_parser().from_bytes(parser.to_bytes(exclude=["vocab"]), cfg=False) + + +def test_serialize_sentencerecognizer(en_vocab): + sr = SentenceRecognizer(en_vocab, default_senter()) + sr_b = sr.to_bytes() + sr_d = SentenceRecognizer(en_vocab, default_senter()).from_bytes(sr_b) + assert sr.to_bytes() == sr_d.to_bytes() diff --git a/spacy/tests/serialize/test_serialize_tokenizer.py b/spacy/tests/serialize/test_serialize_tokenizer.py index cbe119225..a0c36c2a6 100644 --- a/spacy/tests/serialize/test_serialize_tokenizer.py +++ b/spacy/tests/serialize/test_serialize_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.util import get_lang_class from spacy.tokenizer import Tokenizer diff --git a/spacy/tests/serialize/test_serialize_vocab_strings.py b/spacy/tests/serialize/test_serialize_vocab_strings.py index 3be0a75b3..f44426a1a 100644 --- a/spacy/tests/serialize/test_serialize_vocab_strings.py +++ b/spacy/tests/serialize/test_serialize_vocab_strings.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import pickle from spacy.vocab import Vocab diff --git a/spacy/tests/test_architectures.py b/spacy/tests/test_architectures.py index 77f1af020..31b2a2d2f 100644 --- a/spacy/tests/test_architectures.py +++ b/spacy/tests/test_architectures.py @@ -1,15 +1,12 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy import registry -from thinc.v2v import Affine +from thinc.api import Linear from catalogue import RegistryError @registry.architectures.register("my_test_function") def create_model(nr_in, nr_out): - return Affine(nr_in, nr_out) + return Linear(nr_in, nr_out) def test_get_architecture(): diff --git a/spacy/tests/test_cli.py b/spacy/tests/test_cli.py index 6dce649a9..132f7ac9f 100644 --- a/spacy/tests/test_cli.py +++ b/spacy/tests/test_cli.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.en import English @@ -9,7 +6,7 @@ from spacy.cli.pretrain import make_docs def test_cli_converters_conllu2json(): - # https://raw.githubusercontent.com/ohenrik/nb_news_ud_sm/master/original_data/no-ud-dev-ner.conllu + # from NorNE: https://github.com/ltgoslo/norne/blob/3d23274965f513f23aa48455b28b1878dad23c05/ud/nob/no_bokmaal-ud-dev.conllu lines = [ "1\tDommer\tdommer\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t2\tappos\t_\tO", "2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tB-PER", @@ -32,6 +29,86 @@ def test_cli_converters_conllu2json(): assert [t["ner"] for t in tokens] == ["O", "B-PER", "L-PER", "O"] +@pytest.mark.parametrize( + "lines", + [ + ( + "1\tDommer\tdommer\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t2\tappos\t_\tname=O", + "2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tSpaceAfter=No|name=B-PER", + 
"3\tEilertsen\tEilertsen\tPROPN\t_\t_\t2\tname\t_\tname=I-PER", + "4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No|name=O", + "5\t.\t$.\tPUNCT\t_\t_\t4\tpunct\t_\tname=B-BAD", + ), + ( + "1\tDommer\tdommer\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t2\tappos\t_\t_", + "2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tSpaceAfter=No|NE=B-PER", + "3\tEilertsen\tEilertsen\tPROPN\t_\t_\t2\tname\t_\tNE=L-PER", + "4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No", + "5\t.\t$.\tPUNCT\t_\t_\t4\tpunct\t_\tNE=B-BAD", + ), + ], +) +def test_cli_converters_conllu2json_name_ner_map(lines): + input_data = "\n".join(lines) + converted = conllu2json(input_data, n_sents=1, ner_map={"PER": "PERSON", "BAD": ""}) + assert len(converted) == 1 + assert converted[0]["id"] == 0 + assert len(converted[0]["paragraphs"]) == 1 + assert converted[0]["paragraphs"][0]["raw"] == "Dommer FinnEilertsen avstår." + assert len(converted[0]["paragraphs"][0]["sentences"]) == 1 + sent = converted[0]["paragraphs"][0]["sentences"][0] + assert len(sent["tokens"]) == 5 + tokens = sent["tokens"] + assert [t["orth"] for t in tokens] == ["Dommer", "Finn", "Eilertsen", "avstår", "."] + assert [t["tag"] for t in tokens] == ["NOUN", "PROPN", "PROPN", "VERB", "PUNCT"] + assert [t["head"] for t in tokens] == [1, 2, -1, 0, -1] + assert [t["dep"] for t in tokens] == ["appos", "nsubj", "name", "ROOT", "punct"] + assert [t["ner"] for t in tokens] == ["O", "B-PERSON", "L-PERSON", "O", "O"] + + +def test_cli_converters_conllu2json_subtokens(): + # https://raw.githubusercontent.com/ohenrik/nb_news_ud_sm/master/original_data/no-ud-dev-ner.conllu + lines = [ + "1\tDommer\tdommer\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t2\tappos\t_\tname=O", + "2-3\tFE\t_\t_\t_\t_\t_\t_\t_\t_", + "2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tname=B-PER", + "3\tEilertsen\tEilertsen\tX\t_\tGender=Fem|Tense=past\t2\tname\t_\tname=I-PER", + "4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No|name=O", + "5\t.\t$.\tPUNCT\t_\t_\t4\tpunct\t_\tname=O", + ] + input_data = "\n".join(lines) + converted = conllu2json( + input_data, n_sents=1, merge_subtokens=True, append_morphology=True + ) + assert len(converted) == 1 + assert converted[0]["id"] == 0 + assert len(converted[0]["paragraphs"]) == 1 + assert converted[0]["paragraphs"][0]["raw"] == "Dommer FE avstår." 
+ assert len(converted[0]["paragraphs"][0]["sentences"]) == 1 + sent = converted[0]["paragraphs"][0]["sentences"][0] + assert len(sent["tokens"]) == 4 + tokens = sent["tokens"] + print(tokens) + assert [t["orth"] for t in tokens] == ["Dommer", "FE", "avstår", "."] + assert [t["tag"] for t in tokens] == [ + "NOUN__Definite=Ind|Gender=Masc|Number=Sing", + "PROPN_X__Gender=Fem,Masc|Tense=past", + "VERB__Mood=Ind|Tense=Pres|VerbForm=Fin", + "PUNCT", + ] + assert [t["pos"] for t in tokens] == ["NOUN", "PROPN", "VERB", "PUNCT"] + assert [t["morph"] for t in tokens] == [ + "Definite=Ind|Gender=Masc|Number=Sing", + "Gender=Fem,Masc|Tense=past", + "Mood=Ind|Tense=Pres|VerbForm=Fin", + "", + ] + assert [t["lemma"] for t in tokens] == ["dommer", "Finn Eilertsen", "avstå", "$."] + assert [t["head"] for t in tokens] == [1, 1, 0, -1] + assert [t["dep"] for t in tokens] == ["appos", "nsubj", "ROOT", "punct"] + assert [t["ner"] for t in tokens] == ["O", "U-PER", "O", "O"] + + def test_cli_converters_iob2json(): lines = [ "I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O", @@ -106,7 +183,6 @@ def test_cli_converters_conll_ner2json(): ] input_data = "\n".join(lines) converted = conll_ner2json(input_data, n_sents=10) - print(converted) assert len(converted) == 1 assert converted[0]["id"] == 0 assert len(converted[0]["paragraphs"]) == 1 diff --git a/spacy/tests/test_displacy.py b/spacy/tests/test_displacy.py index 539714e0c..adac0f7c3 100644 --- a/spacy/tests/test_displacy.py +++ b/spacy/tests/test_displacy.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy import displacy from spacy.displacy.render import DependencyRenderer @@ -80,10 +77,10 @@ def test_displacy_rtl(): html = displacy.render(doc, page=True, style="dep") assert "direction: rtl" in html assert 'direction="rtl"' in html - assert 'lang="{}"'.format(nlp.lang) in html + assert f'lang="{nlp.lang}"' in html html = displacy.render(doc, page=True, style="ent") assert "direction: rtl" in html - assert 'lang="{}"'.format(nlp.lang) in html + assert f'lang="{nlp.lang}"' in html def test_displacy_render_wrapper(en_vocab): diff --git a/spacy/tests/test_gold.py b/spacy/tests/test_gold.py index 53665d852..982c0d910 100644 --- a/spacy/tests/test_gold.py +++ b/spacy/tests/test_gold.py @@ -1,16 +1,98 @@ -# coding: utf-8 -from __future__ import unicode_literals - +from spacy.errors import AlignmentError from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags -from spacy.gold import spans_from_biluo_tags, GoldParse, iob_to_biluo -from spacy.gold import GoldCorpus, docs_to_json, align +from spacy.gold import spans_from_biluo_tags, GoldParse, iob_to_biluo, align +from spacy.gold import GoldCorpus, docs_to_json, Example, DocAnnotation from spacy.lang.en import English +from spacy.syntax.nonproj import is_nonproj_tree from spacy.tokens import Doc -from spacy.util import get_words_and_spaces -from .util import make_tempdir +from spacy.util import get_words_and_spaces, compounding, minibatch import pytest import srsly +from .util import make_tempdir + + +@pytest.fixture +def doc(): + text = "Sarah's sister flew to Silicon Valley via London." 
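The gold-standard tests that follow build entity annotations in BILUO form (Begin/In/Last/Unit/Out), converting from the token-level IOB tags stored on the Doc. For well-formed IOB2 input the conversion can be written out by hand as below; this is a toy illustration, and `spacy.gold.iob_to_biluo` is the helper the tests actually use:

    def iob2_to_biluo(tags):
        biluo = []
        for i, tag in enumerate(tags):
            if tag == "O":
                biluo.append("O")
                continue
            label = tag.split("-", 1)[1]
            starts = tag.startswith("B-")
            nxt = tags[i + 1] if i + 1 < len(tags) else "O"
            ends = nxt != f"I-{label}"
            if starts and ends:
                biluo.append(f"U-{label}")   # single-token entity
            elif starts:
                biluo.append(f"B-{label}")
            elif ends:
                biluo.append(f"L-{label}")
            else:
                biluo.append(f"I-{label}")
        return biluo

    assert iob2_to_biluo(["O", "B-PER", "I-PER", "O", "B-GPE"]) == [
        "O", "B-PER", "L-PER", "O", "U-GPE"
    ]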
+ tags = ["NNP", "POS", "NN", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."] + pos = [ + "PROPN", + "PART", + "NOUN", + "VERB", + "ADP", + "PROPN", + "PROPN", + "ADP", + "PROPN", + "PUNCT", + ] + morphs = [ + "NounType=prop|Number=sing", + "Poss=yes", + "Number=sing", + "Tense=past|VerbForm=fin", + "", + "NounType=prop|Number=sing", + "NounType=prop|Number=sing", + "", + "NounType=prop|Number=sing", + "PunctType=peri", + ] + # head of '.' is intentionally nonprojective for testing + heads = [2, 0, 3, 3, 3, 6, 4, 3, 7, 5] + deps = [ + "poss", + "case", + "nsubj", + "ROOT", + "prep", + "compound", + "pobj", + "prep", + "pobj", + "punct", + ] + lemmas = [ + "Sarah", + "'s", + "sister", + "fly", + "to", + "Silicon", + "Valley", + "via", + "London", + ".", + ] + biluo_tags = ["U-PERSON", "O", "O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] + cats = {"TRAVEL": 1.0, "BAKING": 0.0} + nlp = English() + doc = nlp(text) + for i in range(len(tags)): + doc[i].tag_ = tags[i] + doc[i].pos_ = pos[i] + doc[i].morph_ = morphs[i] + doc[i].lemma_ = lemmas[i] + doc[i].dep_ = deps[i] + doc[i].head = doc[heads[i]] + doc.ents = spans_from_biluo_tags(doc, biluo_tags) + doc.cats = cats + doc.is_tagged = True + doc.is_parsed = True + return doc + + +@pytest.fixture() +def merged_dict(): + return { + "ids": [1, 2, 3, 4, 5, 6, 7], + "words": ["Hi", "there", "everyone", "It", "is", "just", "me"], + "tags": ["INTJ", "ADV", "PRON", "PRON", "AUX", "ADV", "PRON"], + "sent_starts": [1, 0, 0, 1, 0, 0, 0, 0], + } + def test_gold_biluo_U(en_vocab): words = ["I", "flew", "to", "London", "."] @@ -168,35 +250,35 @@ def test_iob_to_biluo(): iob_to_biluo(bad_iob) -def test_roundtrip_docs_to_json(): - text = "I flew to Silicon Valley via London." - tags = ["PRP", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."] - heads = [1, 1, 1, 4, 2, 1, 5, 1] - deps = ["nsubj", "ROOT", "prep", "compound", "pobj", "prep", "pobj", "punct"] - biluo_tags = ["O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] - cats = {"TRAVEL": 1.0, "BAKING": 0.0} +def test_roundtrip_docs_to_json(doc): nlp = English() - doc = nlp(text) - for i in range(len(tags)): - doc[i].tag_ = tags[i] - doc[i].dep_ = deps[i] - doc[i].head = doc[heads[i]] - doc.ents = spans_from_biluo_tags(doc, biluo_tags) - doc.cats = cats - doc.is_tagged = True - doc.is_parsed = True + text = doc.text + tags = [t.tag_ for t in doc] + pos = [t.pos_ for t in doc] + morphs = [t.morph_ for t in doc] + lemmas = [t.lemma_ for t in doc] + deps = [t.dep_ for t in doc] + heads = [t.head.i for t in doc] + biluo_tags = iob_to_biluo( + [t.ent_iob_ + "-" + t.ent_type_ if t.ent_type_ else "O" for t in doc] + ) + cats = doc.cats # roundtrip to JSON with make_tempdir() as tmpdir: json_file = tmpdir / "roundtrip.json" srsly.write_json(json_file, [docs_to_json(doc)]) - goldcorpus = GoldCorpus(str(json_file), str(json_file)) + goldcorpus = GoldCorpus(train=str(json_file), dev=str(json_file)) - reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp)) + reloaded_example = next(goldcorpus.dev_dataset(nlp)) + goldparse = reloaded_example.gold assert len(doc) == goldcorpus.count_train() - assert text == reloaded_doc.text + assert text == reloaded_example.text assert tags == goldparse.tags + assert pos == goldparse.pos + assert morphs == goldparse.morphs + assert lemmas == goldparse.lemmas assert deps == goldparse.labels assert heads == goldparse.heads assert biluo_tags == goldparse.ner @@ -211,11 +293,15 @@ def test_roundtrip_docs_to_json(): srsly.write_jsonl(jsonl_file, [docs_to_json(doc)]) goldcorpus = 
GoldCorpus(str(jsonl_file), str(jsonl_file)) - reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp)) + reloaded_example = next(goldcorpus.dev_dataset(nlp)) + goldparse = reloaded_example.gold assert len(doc) == goldcorpus.count_train() - assert text == reloaded_doc.text + assert text == reloaded_example.text assert tags == goldparse.tags + assert pos == goldparse.pos + assert morphs == goldparse.morphs + assert lemmas == goldparse.lemmas assert deps == goldparse.labels assert heads == goldparse.heads assert biluo_tags == goldparse.ner @@ -231,16 +317,18 @@ def test_roundtrip_docs_to_json(): srsly.write_jsonl(jsonl_file, [docs_to_json(doc)]) goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) # load and rewrite as JSONL tuples - srsly.write_jsonl(jsonl_file, goldcorpus.train_tuples) + srsly.write_jsonl(jsonl_file, goldcorpus.train_examples) goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) - reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp)) + reloaded_example = next(goldcorpus.dev_dataset(nlp)) + goldparse = reloaded_example.gold assert len(doc) == goldcorpus.count_train() - assert text == reloaded_doc.text + assert text == reloaded_example.text assert tags == goldparse.tags assert deps == goldparse.labels assert heads == goldparse.heads + assert lemmas == goldparse.lemmas assert biluo_tags == goldparse.ner assert "TRAVEL" in goldparse.cats assert "BAKING" in goldparse.cats @@ -248,6 +336,75 @@ def test_roundtrip_docs_to_json(): assert cats["BAKING"] == goldparse.cats["BAKING"] +def test_projective_train_vs_nonprojective_dev(doc): + nlp = English() + deps = [t.dep_ for t in doc] + heads = [t.head.i for t in doc] + + with make_tempdir() as tmpdir: + jsonl_file = tmpdir / "test.jsonl" + # write to JSONL train dicts + srsly.write_jsonl(jsonl_file, [docs_to_json(doc)]) + goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) + + train_reloaded_example = next(goldcorpus.train_dataset(nlp)) + train_goldparse = train_reloaded_example.gold + + dev_reloaded_example = next(goldcorpus.dev_dataset(nlp)) + dev_goldparse = dev_reloaded_example.gold + + assert is_nonproj_tree([t.head.i for t in doc]) is True + assert is_nonproj_tree(train_goldparse.heads) is False + assert heads[:-1] == train_goldparse.heads[:-1] + assert heads[-1] != train_goldparse.heads[-1] + assert deps[:-1] == train_goldparse.labels[:-1] + assert deps[-1] != train_goldparse.labels[-1] + + assert heads == dev_goldparse.heads + assert deps == dev_goldparse.labels + + +def test_ignore_misaligned(doc): + nlp = English() + text = doc.text + with make_tempdir() as tmpdir: + jsonl_file = tmpdir / "test.jsonl" + data = [docs_to_json(doc)] + data[0]["paragraphs"][0]["raw"] = text.replace("Sarah", "Jane") + # write to JSONL train dicts + srsly.write_jsonl(jsonl_file, data) + goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) + + with pytest.raises(AlignmentError): + train_reloaded_example = next(goldcorpus.train_dataset(nlp)) + + with make_tempdir() as tmpdir: + jsonl_file = tmpdir / "test.jsonl" + data = [docs_to_json(doc)] + data[0]["paragraphs"][0]["raw"] = text.replace("Sarah", "Jane") + # write to JSONL train dicts + srsly.write_jsonl(jsonl_file, data) + goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) + + # doesn't raise an AlignmentError, but there is nothing to iterate over + # because the only example can't be aligned + train_reloaded_example = list(goldcorpus.train_dataset(nlp, ignore_misaligned=True)) + assert len(train_reloaded_example) == 0 + + +def test_make_orth_variants(doc): + nlp 
= English() + with make_tempdir() as tmpdir: + jsonl_file = tmpdir / "test.jsonl" + # write to JSONL train dicts + srsly.write_jsonl(jsonl_file, [docs_to_json(doc)]) + goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) + + # due to randomness, test only that this runs with no errors for now + train_reloaded_example = next(goldcorpus.train_dataset(nlp, orth_variant_level=0.2)) + train_goldparse = train_reloaded_example.gold # noqa: F841 + + @pytest.mark.parametrize( "tokens_a,tokens_b,expected", [ @@ -286,3 +443,118 @@ def test_goldparse_startswith_space(en_tokenizer): assert g.words == [" ", "a"] assert g.ner == [None, "U-DATE"] assert g.labels == [None, "ROOT"] + + +def test_gold_constructor(): + """Test that the GoldParse constructor works fine""" + nlp = English() + doc = nlp("This is a sentence") + gold = GoldParse(doc, cats={"cat1": 1.0, "cat2": 0.0}) + + assert gold.cats["cat1"] + assert not gold.cats["cat2"] + assert gold.words == ["This", "is", "a", "sentence"] + + +def test_gold_orig_annot(): + nlp = English() + doc = nlp("This is a sentence") + gold = GoldParse(doc, cats={"cat1": 1.0, "cat2": 0.0}) + + assert gold.orig.words == ["This", "is", "a", "sentence"] + assert gold.cats["cat1"] + + doc_annotation = DocAnnotation(cats={"cat1": 0.0, "cat2": 1.0}) + gold2 = GoldParse.from_annotation(doc, doc_annotation, gold.orig) + assert gold2.orig.words == ["This", "is", "a", "sentence"] + assert not gold2.cats["cat1"] + + +def test_tuple_format_implicit(): + """Test tuple format with implicit GoldParse creation""" + + train_data = [ + ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}), + ( + "Spotify steps up Asia expansion", + {"entities": [(0, 8, "ORG"), (17, 21, "LOC")]}, + ), + ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}), + ] + + _train(train_data) + + +def test_tuple_format_implicit_invalid(): + """Test that an error is thrown for an implicit invalid GoldParse field""" + + train_data = [ + ("Uber blew through $1 million a week", {"frumble": [(0, 4, "ORG")]}), + ( + "Spotify steps up Asia expansion", + {"entities": [(0, 8, "ORG"), (17, 21, "LOC")]}, + ), + ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}), + ] + + with pytest.raises(TypeError): + _train(train_data) + + +def _train(train_data): + nlp = English() + ner = nlp.create_pipe("ner") + ner.add_label("ORG") + ner.add_label("LOC") + nlp.add_pipe(ner) + + optimizer = nlp.begin_training() + for i in range(5): + losses = {} + batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001)) + for batch in batches: + nlp.update(batch, sgd=optimizer, losses=losses) + + +def test_split_sents(merged_dict): + nlp = English() + example = Example() + example.set_token_annotation(**merged_dict) + assert len(example.get_gold_parses(merge=False, vocab=nlp.vocab)) == 2 + assert len(example.get_gold_parses(merge=True, vocab=nlp.vocab)) == 1 + + split_examples = example.split_sents() + assert len(split_examples) == 2 + + token_annotation_1 = split_examples[0].token_annotation + assert token_annotation_1.ids == [1, 2, 3] + assert token_annotation_1.words == ["Hi", "there", "everyone"] + assert token_annotation_1.tags == ["INTJ", "ADV", "PRON"] + assert token_annotation_1.sent_starts == [1, 0, 0] + + token_annotation_2 = split_examples[1].token_annotation + assert token_annotation_2.ids == [4, 5, 6, 7] + assert token_annotation_2.words == ["It", "is", "just", "me"] + assert token_annotation_2.tags == ["PRON", "AUX", "ADV", "PRON"] + assert token_annotation_2.sent_starts == 
[1, 0, 0, 0] + + +def test_tuples_to_example(merged_dict): + ex = Example() + ex.set_token_annotation(**merged_dict) + cats = {"TRAVEL": 1.0, "BAKING": 0.0} + ex.set_doc_annotation(cats=cats) + ex_dict = ex.to_dict() + + assert ex_dict["token_annotation"]["ids"] == merged_dict["ids"] + assert ex_dict["token_annotation"]["words"] == merged_dict["words"] + assert ex_dict["token_annotation"]["tags"] == merged_dict["tags"] + assert ex_dict["token_annotation"]["sent_starts"] == merged_dict["sent_starts"] + assert ex_dict["doc_annotation"]["cats"] == cats + + +def test_empty_example_goldparse(): + nlp = English() + doc = nlp("") + example = Example(doc=doc) + assert len(example.get_gold_parses()) == 1 diff --git a/spacy/tests/test_json_schemas.py b/spacy/tests/test_json_schemas.py deleted file mode 100644 index 89e797c1a..000000000 --- a/spacy/tests/test_json_schemas.py +++ /dev/null @@ -1,50 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from spacy.util import get_json_validator, validate_json, validate_schema -from spacy.cli._schemas import META_SCHEMA, TRAINING_SCHEMA -from spacy.matcher._schemas import TOKEN_PATTERN_SCHEMA -import pytest - - -@pytest.fixture(scope="session") -def training_schema_validator(): - return get_json_validator(TRAINING_SCHEMA) - - -def test_validate_schema(): - validate_schema({"type": "object"}) - with pytest.raises(Exception): - validate_schema({"type": lambda x: x}) - - -@pytest.mark.parametrize("schema", [TRAINING_SCHEMA, META_SCHEMA, TOKEN_PATTERN_SCHEMA]) -def test_schemas(schema): - validate_schema(schema) - - -@pytest.mark.parametrize( - "data", - [ - {"text": "Hello world"}, - {"text": "Hello", "ents": [{"start": 0, "end": 5, "label": "TEST"}]}, - ], -) -def test_json_schema_training_valid(data, training_schema_validator): - errors = validate_json([data], training_schema_validator) - assert not errors - - -@pytest.mark.parametrize( - "data,n_errors", - [ - ({"spans": []}, 1), - ({"text": "Hello", "ents": [{"start": "0", "end": "5", "label": "TEST"}]}, 2), - ({"text": "Hello", "ents": [{"start": 0, "end": 5}]}, 1), - ({"text": "Hello", "ents": [{"start": 0, "end": 5, "label": "test"}]}, 1), - ({"text": "spaCy", "tokens": [{"pos": "PROPN"}]}, 2), - ], -) -def test_json_schema_training_invalid(data, n_errors, training_schema_validator): - errors = validate_json([data], training_schema_validator) - assert len(errors) == n_errors diff --git a/spacy/tests/test_language.py b/spacy/tests/test_language.py index 7106cef74..58db0a040 100644 --- a/spacy/tests/test_language.py +++ b/spacy/tests/test_language.py @@ -1,10 +1,5 @@ -# coding: utf-8 -from __future__ import unicode_literals - import itertools - import pytest -from spacy.compat import is_python2 from spacy.gold import GoldParse from spacy.language import Language from spacy.tokens import Doc, Span @@ -31,20 +26,20 @@ def test_language_update(nlp): doc = Doc(nlp.vocab, words=text.split(" ")) gold = GoldParse(doc, **annots) # Update with doc and gold objects - nlp.update([doc], [gold]) + nlp.update((doc, gold)) # Update with text and dict - nlp.update([text], [annots]) + nlp.update((text, annots)) # Update with doc object and dict - nlp.update([doc], [annots]) + nlp.update((doc, annots)) # Update with text and gold object - nlp.update([text], [gold]) + nlp.update((text, gold)) + # Update with empty doc and gold object + nlp.update((None, gold)) # Update badly - with pytest.raises(IndexError): - nlp.update([doc], []) - with pytest.raises(IndexError): - nlp.update([], [gold]) with 
pytest.raises(ValueError): - nlp.update([text], [wrongkeyannots]) + nlp.update((doc, None)) + with pytest.raises(TypeError): + nlp.update((text, wrongkeyannots)) def test_language_evaluate(nlp): @@ -134,9 +129,6 @@ def test_language_pipe(nlp2, n_process, texts): assert_docs_equal(doc, expected_doc) -@pytest.mark.skipif( - is_python2, reason="python2 seems to be unable to handle iterator properly" -) @pytest.mark.parametrize("n_process", [1, 2]) def test_language_pipe_stream(nlp2, n_process, texts): # check if nlp.pipe can handle infinite length iterator properly. diff --git a/spacy/tests/test_lemmatizer.py b/spacy/tests/test_lemmatizer.py index bcda2999a..1779ff933 100644 --- a/spacy/tests/test_lemmatizer.py +++ b/spacy/tests/test_lemmatizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.tokens import Doc from spacy.language import Language diff --git a/spacy/tests/test_misc.py b/spacy/tests/test_misc.py index 4075ccf64..c320b19c0 100644 --- a/spacy/tests/test_misc.py +++ b/spacy/tests/test_misc.py @@ -1,41 +1,10 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import os import ctypes from pathlib import Path from spacy import util from spacy import prefer_gpu, require_gpu -from spacy.compat import symlink_to, symlink_remove, path2str, is_windows -from spacy._ml import PrecomputableAffine -from subprocess import CalledProcessError - - -@pytest.fixture -def symlink_target(): - return Path("./foo-target") - - -@pytest.fixture -def symlink(): - return Path("./foo-symlink") - - -@pytest.fixture(scope="function") -def symlink_setup_target(request, symlink_target, symlink): - if not symlink_target.exists(): - os.mkdir(path2str(symlink_target)) - # yield -- need to cleanup even if assertion fails - # https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240 - - def cleanup(): - # Remove symlink only if it was created - if symlink.exists(): - symlink_remove(symlink) - os.rmdir(path2str(symlink_target)) - - request.addfinalizer(cleanup) +from spacy.ml._precomputable_affine import PrecomputableAffine, _backprop_precomputable_affine_padding @pytest.fixture @@ -69,29 +38,31 @@ def test_util_get_package_path(package): def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2): - model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP) - assert model.W.shape == (nF, nO, nP, nI) - tensor = model.ops.allocate((10, nI)) + model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP).initialize() + assert model.get_param("W").shape == (nF, nO, nP, nI) + tensor = model.ops.alloc((10, nI)) Y, get_dX = model.begin_update(tensor) assert Y.shape == (tensor.shape[0] + 1, nF, nO, nP) - assert model.d_pad.shape == (1, nF, nO, nP) - dY = model.ops.allocate((15, nO, nP)) - ids = model.ops.allocate((15, nF)) + dY = model.ops.alloc((15, nO, nP)) + ids = model.ops.alloc((15, nF)) ids[1, 2] = -1 dY[1] = 1 - assert model.d_pad[0, 2, 0, 0] == 0.0 - model._backprop_padding(dY, ids) - assert model.d_pad[0, 2, 0, 0] == 1.0 - model.d_pad.fill(0.0) + assert not model.has_grad("pad") + d_pad = _backprop_precomputable_affine_padding(model, dY, ids) + assert d_pad[0, 2, 0, 0] == 1.0 ids.fill(0.0) dY.fill(0.0) - ids[1, 2] = -1 + dY[0] = 0 + ids[1, 2] = 0 ids[1, 1] = -1 ids[1, 0] = -1 dY[1] = 1 - assert model.d_pad[0, 2, 0, 0] == 0.0 - model._backprop_padding(dY, ids) - assert model.d_pad[0, 2, 0, 0] == 3.0 + ids[2, 0] = -1 + dY[2] = 5 + d_pad = _backprop_precomputable_affine_padding(model, dY, ids) + assert d_pad[0, 0, 0, 0] == 6 + assert d_pad[0, 
1, 0, 0] == 1 + assert d_pad[0, 2, 0, 0] == 0 def test_prefer_gpu(): @@ -109,25 +80,6 @@ def test_require_gpu(): require_gpu() -def test_create_symlink_windows( - symlink_setup_target, symlink_target, symlink, is_admin -): - """Test the creation of symlinks on windows. If run as admin or not on windows it should succeed, otherwise a CalledProcessError should be raised.""" - assert symlink_target.exists() - - if is_admin or not is_windows: - try: - symlink_to(symlink, symlink_target) - assert symlink.exists() - except CalledProcessError as e: - pytest.fail(e) - else: - with pytest.raises(CalledProcessError): - symlink_to(symlink, symlink_target) - - assert not symlink.exists() - - def test_ascii_filenames(): """Test that all filenames in the project are ASCII. See: https://twitter.com/_inesmontani/status/1177941471632211968 diff --git a/spacy/tests/test_pickles.py b/spacy/tests/test_pickles.py index 65288527a..e4c67b672 100644 --- a/spacy/tests/test_pickles.py +++ b/spacy/tests/test_pickles.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import numpy import srsly diff --git a/spacy/tests/test_scorer.py b/spacy/tests/test_scorer.py index 2a4ef0f40..d750a8202 100644 --- a/spacy/tests/test_scorer.py +++ b/spacy/tests/test_scorer.py @@ -1,13 +1,11 @@ -# coding: utf-8 -from __future__ import unicode_literals - from numpy.testing import assert_almost_equal, assert_array_almost_equal import pytest from pytest import approx -from spacy.gold import GoldParse +from spacy.gold import Example, GoldParse from spacy.scorer import Scorer, ROCAUCScore from spacy.scorer import _roc_auc_score, _roc_curve from .util import get_doc +from spacy.lang.en import English test_las_apple = [ [ @@ -42,6 +40,43 @@ test_ner_apple = [ ] ] +@pytest.fixture +def tagged_doc(): + text = "Sarah's sister flew to Silicon Valley via London." 
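The padding-gradient assertions in the updated `test_PrecomputableAffine` above follow one rule: for each feature column, the padding gradient accumulates the output gradient of every row whose feature id is -1 (i.e. rows that fell back on the learned padding vector). A rough numpy sketch of that rule, reproducing the numbers the test expects; the real helper is `_backprop_precomputable_affine_padding` in `spacy.ml._precomputable_affine` and differs in implementation:

    import numpy

    def backprop_padding(dY, ids, nF, nO, nP):
        # d_pad[0, f] sums dY[i] over all rows i with ids[i, f] == -1
        d_pad = numpy.zeros((1, nF, nO, nP), dtype="f")
        for f in range(nF):
            mask = ids[:, f] == -1
            d_pad[0, f] += dY[mask].sum(axis=0)
        return d_pad

    nO, nP, nF = 4, 2, 3
    dY = numpy.zeros((15, nO, nP), dtype="f")
    ids = numpy.zeros((15, nF), dtype="f")
    ids[1, 0] = ids[1, 1] = -1
    dY[1] = 1
    ids[2, 0] = -1
    dY[2] = 5
    d_pad = backprop_padding(dY, ids, nF, nO, nP)
    assert d_pad[0, 0, 0, 0] == 6   # rows 1 and 2 pad feature 0
    assert d_pad[0, 1, 0, 0] == 1   # only row 1 pads feature 1
    assert d_pad[0, 2, 0, 0] == 0   # feature 2 never uses padding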
+ tags = ["NNP", "POS", "NN", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."] + pos = [ + "PROPN", + "PART", + "NOUN", + "VERB", + "ADP", + "PROPN", + "PROPN", + "ADP", + "PROPN", + "PUNCT", + ] + morphs = [ + "NounType=prop|Number=sing", + "Poss=yes", + "Number=sing", + "Tense=past|VerbForm=fin", + "", + "NounType=prop|Number=sing", + "NounType=prop|Number=sing", + "", + "NounType=prop|Number=sing", + "PunctType=peri", + ] + nlp = English() + doc = nlp(text) + for i in range(len(tags)): + doc[i].tag_ = tags[i] + doc[i].pos_ = pos[i] + doc[i].morph_ = morphs[i] + doc.is_tagged = True + return doc + def test_las_per_type(en_vocab): # Gold and Doc are identical @@ -54,7 +89,7 @@ def test_las_per_type(en_vocab): deps=annot["deps"], ) gold = GoldParse(doc, heads=annot["heads"], deps=annot["deps"]) - scorer.score(doc, gold) + scorer.score((doc, gold)) results = scorer.scores assert results["uas"] == 100 @@ -77,7 +112,7 @@ def test_las_per_type(en_vocab): ) gold = GoldParse(doc, heads=annot["heads"], deps=annot["deps"]) doc[0].dep_ = "compound" - scorer.score(doc, gold) + scorer.score((doc, gold)) results = scorer.scores assert results["uas"] == 100 @@ -99,8 +134,9 @@ def test_ner_per_type(en_vocab): words=input_.split(" "), ents=[[0, 1, "CARDINAL"], [2, 3, "CARDINAL"]], ) - gold = GoldParse(doc, entities=annot["entities"]) - scorer.score(doc, gold) + ex = Example(doc=doc) + ex.set_token_annotation(entities=annot["entities"]) + scorer.score(ex) results = scorer.scores assert results["ents_p"] == 100 @@ -119,8 +155,9 @@ def test_ner_per_type(en_vocab): words=input_.split(" "), ents=[[0, 1, "ORG"], [5, 6, "GPE"], [6, 7, "ORG"]], ) - gold = GoldParse(doc, entities=annot["entities"]) - scorer.score(doc, gold) + ex = Example(doc=doc) + ex.set_token_annotation(entities=annot["entities"]) + scorer.score(ex) results = scorer.scores assert results["ents_p"] == approx(66.66666) @@ -140,6 +177,43 @@ def test_ner_per_type(en_vocab): assert results["ents_per_type"]["ORG"]["f"] == approx(66.66666) +def test_tag_score(tagged_doc): + # Gold and Doc are identical + scorer = Scorer() + gold = GoldParse( + tagged_doc, + tags=[t.tag_ for t in tagged_doc], + pos=[t.pos_ for t in tagged_doc], + morphs=[t.morph_ for t in tagged_doc] + ) + scorer.score((tagged_doc, gold)) + results = scorer.scores + + assert results["tags_acc"] == 100 + assert results["pos_acc"] == 100 + assert results["morphs_acc"] == 100 + assert results["morphs_per_type"]["NounType"]["f"] == 100 + + # Gold and Doc are identical + scorer = Scorer() + tags = [t.tag_ for t in tagged_doc] + tags[0] = "NN" + pos = [t.pos_ for t in tagged_doc] + pos[1] = "X" + morphs = [t.morph_ for t in tagged_doc] + morphs[1] = "Number=sing" + morphs[2] = "Number=plur" + gold = GoldParse(tagged_doc, tags=tags, pos=pos, morphs=morphs) + scorer.score((tagged_doc, gold)) + results = scorer.scores + + assert results["tags_acc"] == 90 + assert results["pos_acc"] == 90 + assert results["morphs_acc"] == approx(80) + assert results["morphs_per_type"]["Poss"]["f"] == 0.0 + assert results["morphs_per_type"]["Number"]["f"] == approx(72.727272) + + def test_roc_auc_score(): # Binary classification, toy tests from scikit-learn test suite y_true = [0, 1] diff --git a/spacy/tests/test_tok2vec.py b/spacy/tests/test_tok2vec.py index ddaa71059..9c2e9004b 100644 --- a/spacy/tests/test_tok2vec.py +++ b/spacy/tests/test_tok2vec.py @@ -1,25 +1,10 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest -from spacy._ml import Tok2Vec +from spacy.ml.models.tok2vec import 
build_Tok2Vec_model from spacy.vocab import Vocab from spacy.tokens import Doc -from spacy.compat import unicode_ - -def get_batch(batch_size): - vocab = Vocab() - docs = [] - start = 0 - for size in range(1, batch_size + 1): - # Make the words numbers, so that they're distnct - # across the batch, and easy to track. - numbers = [unicode_(i) for i in range(start, start + size)] - docs.append(Doc(vocab, words=numbers)) - start += size - return docs +from .util import get_batch # This fails in Thinc v7.3.1. Need to push patch @@ -29,7 +14,8 @@ def test_empty_doc(): embed_size = 2000 vocab = Vocab() doc = Doc(vocab, words=[]) - tok2vec = Tok2Vec(width, embed_size) + # TODO: fix tok2vec arguments + tok2vec = build_Tok2Vec_model(width, embed_size) vectors, backprop = tok2vec.begin_update([doc]) assert len(vectors) == 1 assert vectors[0].shape == (0, width) @@ -40,26 +26,45 @@ def test_empty_doc(): ) def test_tok2vec_batch_sizes(batch_size, width, embed_size): batch = get_batch(batch_size) - tok2vec = Tok2Vec(width, embed_size) + tok2vec = build_Tok2Vec_model( + width, + embed_size, + pretrained_vectors=None, + conv_depth=4, + bilstm_depth=0, + window_size=1, + maxout_pieces=3, + subword_features=True, + char_embed=False, + nM=64, + nC=8, + ) + tok2vec.initialize() vectors, backprop = tok2vec.begin_update(batch) assert len(vectors) == len(batch) for doc_vec, doc in zip(vectors, batch): assert doc_vec.shape == (len(doc), width) +# fmt: off @pytest.mark.parametrize( "tok2vec_config", [ - {"width": 8, "embed_size": 100, "char_embed": False}, - {"width": 8, "embed_size": 100, "char_embed": True}, - {"width": 8, "embed_size": 100, "conv_depth": 6}, - {"width": 8, "embed_size": 100, "conv_depth": 6}, - {"width": 8, "embed_size": 100, "subword_features": False}, + {"width": 8, "embed_size": 100, "char_embed": False, "nM": 64, "nC": 8, "pretrained_vectors": None, "window_size": 1, "conv_depth": 2, "bilstm_depth": 0, "maxout_pieces": 3, "subword_features": True}, + {"width": 8, "embed_size": 100, "char_embed": True, "nM": 64, "nC": 8, "pretrained_vectors": None, "window_size": 1, "conv_depth": 2, "bilstm_depth": 0, "maxout_pieces": 3, "subword_features": True}, + {"width": 8, "embed_size": 100, "char_embed": False, "nM": 64, "nC": 8, "pretrained_vectors": None, "window_size": 1, "conv_depth": 6, "bilstm_depth": 0, "maxout_pieces": 3, "subword_features": True}, + {"width": 8, "embed_size": 100, "char_embed": False, "nM": 64, "nC": 8, "pretrained_vectors": None, "window_size": 1, "conv_depth": 6, "bilstm_depth": 0, "maxout_pieces": 3, "subword_features": True}, + {"width": 8, "embed_size": 100, "char_embed": False, "nM": 64, "nC": 8, "pretrained_vectors": None, "window_size": 1, "conv_depth": 2, "bilstm_depth": 0, "maxout_pieces": 3, "subword_features": False}, + {"width": 8, "embed_size": 100, "char_embed": False, "nM": 64, "nC": 8, "pretrained_vectors": None, "window_size": 3, "conv_depth": 2, "bilstm_depth": 0, "maxout_pieces": 3, "subword_features": False}, + {"width": 8, "embed_size": 100, "char_embed": True, "nM": 81, "nC": 8, "pretrained_vectors": None, "window_size": 3, "conv_depth": 2, "bilstm_depth": 0, "maxout_pieces": 3, "subword_features": False}, + {"width": 8, "embed_size": 100, "char_embed": True, "nM": 81, "nC": 9, "pretrained_vectors": None, "window_size": 3, "conv_depth": 2, "bilstm_depth": 0, "maxout_pieces": 3, "subword_features": False}, ], ) +# fmt: on def test_tok2vec_configs(tok2vec_config): docs = get_batch(3) - tok2vec = Tok2Vec(**tok2vec_config) + tok2vec = 
build_Tok2Vec_model(**tok2vec_config) + tok2vec.initialize(docs) vectors, backprop = tok2vec.begin_update(docs) assert len(vectors) == len(docs) assert vectors[0].shape == (len(docs[0]), tok2vec_config["width"]) diff --git a/spacy/tests/tokenizer/test_exceptions.py b/spacy/tests/tokenizer/test_exceptions.py index a79363abb..9a98e049e 100644 --- a/spacy/tests/tokenizer/test_exceptions.py +++ b/spacy/tests/tokenizer/test_exceptions.py @@ -1,13 +1,12 @@ -# coding: utf-8 -from __future__ import unicode_literals - import sys import pytest def test_tokenizer_handles_emoticons(tokenizer): # Tweebo challenge (CMU) - text = """:o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| ") :> ....""" + text = ( + """:o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| :> ....""" + ) tokens = tokenizer(text) assert tokens[0].text == ":o" assert tokens[1].text == ":/" @@ -28,12 +27,11 @@ def test_tokenizer_handles_emoticons(tokenizer): assert tokens[16].text == ">:(" assert tokens[17].text == ":D" assert tokens[18].text == "=|" - assert tokens[19].text == '")' - assert tokens[20].text == ":>" - assert tokens[21].text == "...." + assert tokens[19].text == ":>" + assert tokens[20].text == "...." -@pytest.mark.parametrize("text,length", [("example:)", 3), ("108)", 2), ("XDN", 1)]) +@pytest.mark.parametrize("text,length", [("108)", 2), ("XDN", 1)]) def test_tokenizer_excludes_false_pos_emoticons(tokenizer, text, length): tokens = tokenizer(text) assert len(tokens) == length diff --git a/spacy/tests/tokenizer/test_explain.py b/spacy/tests/tokenizer/test_explain.py index 2d71588cc..3e7681234 100644 --- a/spacy/tests/tokenizer/test_explain.py +++ b/spacy/tests/tokenizer/test_explain.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.util import get_lang_class @@ -58,7 +55,7 @@ LANGUAGES = [ @pytest.mark.parametrize("lang", LANGUAGES) def test_tokenizer_explain(lang): tokenizer = get_lang_class(lang).Defaults.create_tokenizer() - examples = pytest.importorskip("spacy.lang.{}.examples".format(lang)) + examples = pytest.importorskip(f"spacy.lang.{lang}.examples") for sentence in examples.sentences: tokens = [t.text for t in tokenizer(sentence) if not t.is_space] debug_tokens = [t[1] for t in tokenizer.explain(sentence)] diff --git a/spacy/tests/tokenizer/test_naughty_strings.py b/spacy/tests/tokenizer/test_naughty_strings.py index 36c69611e..e93d5654f 100644 --- a/spacy/tests/tokenizer/test_naughty_strings.py +++ b/spacy/tests/tokenizer/test_naughty_strings.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest # Examples taken from the "Big List of Naughty Strings" diff --git a/spacy/tests/tokenizer/test_tokenizer.py b/spacy/tests/tokenizer/test_tokenizer.py index 803c31abf..c035559b4 100644 --- a/spacy/tests/tokenizer/test_tokenizer.py +++ b/spacy/tests/tokenizer/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.vocab import Vocab from spacy.tokenizer import Tokenizer @@ -108,6 +105,12 @@ def test_tokenizer_add_special_case(tokenizer, text, tokens): assert doc[1].text == tokens[1]["orth"] +@pytest.mark.parametrize("text,tokens", [("lorem", [{"orth": "lo"}, {"orth": "re"}])]) +def test_tokenizer_validate_special_case(tokenizer, text, tokens): + with pytest.raises(ValueError): + tokenizer.add_special_case(text, tokens) + + @pytest.mark.parametrize( "text,tokens", [("lorem", [{"orth": "lo", "tag": "NN"}, {"orth": "rem"}])] ) @@ -120,3 
+123,30 @@ def test_tokenizer_add_special_case_tag(text, tokens): assert doc[0].tag_ == tokens[0]["tag"] assert doc[0].pos_ == "NOUN" assert doc[1].text == tokens[1]["orth"] + + +def test_tokenizer_special_cases_with_affixes(tokenizer): + text = '(((_SPECIAL_ A/B, A/B-A/B")' + tokenizer.add_special_case("_SPECIAL_", [{"orth": "_SPECIAL_"}]) + tokenizer.add_special_case("A/B", [{"orth": "A/B"}]) + doc = tokenizer(text) + assert [token.text for token in doc] == [ + "(", + "(", + "(", + "_SPECIAL_", + "A/B", + ",", + "A/B", + "-", + "A/B", + '"', + ")", + ] + + +def test_tokenizer_special_cases_with_period(tokenizer): + text = "_SPECIAL_." + tokenizer.add_special_case("_SPECIAL_", [{"orth": "_SPECIAL_"}]) + doc = tokenizer(text) + assert [token.text for token in doc] == ["_SPECIAL_", "."] diff --git a/spacy/tests/tokenizer/test_urls.py b/spacy/tests/tokenizer/test_urls.py index 58e9d73f3..87211ab95 100644 --- a/spacy/tests/tokenizer/test_urls.py +++ b/spacy/tests/tokenizer/test_urls.py @@ -1,8 +1,7 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest +from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS + URLS_BASIC = [ "http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region®ion=top-news&WT.nav=top-news&_r=0", @@ -196,7 +195,12 @@ def test_tokenizer_handles_two_prefix_url(tokenizer, prefix1, prefix2, url): @pytest.mark.parametrize("url", URLS_FULL) def test_tokenizer_handles_two_suffix_url(tokenizer, suffix1, suffix2, url): tokens = tokenizer(url + suffix1 + suffix2) - assert len(tokens) == 3 - assert tokens[0].text == url - assert tokens[1].text == suffix1 - assert tokens[2].text == suffix2 + if suffix1 + suffix2 in BASE_EXCEPTIONS: + assert len(tokens) == 2 + assert tokens[0].text == url + assert tokens[1].text == suffix1 + suffix2 + else: + assert len(tokens) == 3 + assert tokens[0].text == url + assert tokens[1].text == suffix1 + assert tokens[2].text == suffix2 diff --git a/spacy/tests/tokenizer/test_whitespace.py b/spacy/tests/tokenizer/test_whitespace.py index 74c9b369b..c7b9d7c6d 100644 --- a/spacy/tests/tokenizer/test_whitespace.py +++ b/spacy/tests/tokenizer/test_whitespace.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/util.py b/spacy/tests/util.py index 4e1c50398..e29342268 100644 --- a/spacy/tests/util.py +++ b/spacy/tests/util.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import numpy import tempfile import shutil @@ -11,7 +8,8 @@ from pathlib import Path from spacy import Errors from spacy.tokens import Doc, Span from spacy.attrs import POS, TAG, HEAD, DEP, LEMMA -from spacy.compat import path2str + +from spacy.vocab import Vocab @contextlib.contextmanager @@ -25,7 +23,7 @@ def make_tempfile(mode="r"): def make_tempdir(): d = Path(tempfile.mkdtemp()) yield d - shutil.rmtree(path2str(d)) + shutil.rmtree(str(d)) def get_doc( @@ -81,6 +79,19 @@ def get_doc( return doc +def get_batch(batch_size): + vocab = Vocab() + docs = [] + start = 0 + for size in range(1, batch_size + 1): + # Make the words numbers, so that they're distinct + # across the batch, and easy to track. 
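The two new tokenizer tests above pin down the behaviour that the tokenizer changes later in this diff implement: special-case rules now apply even when the special string is glued to surrounding affixes or punctuation. A minimal sketch of the same check, assuming a blank English pipeline rather than the suite's `tokenizer` fixture:

    from spacy.lang.en import English

    nlp = English()
    nlp.tokenizer.add_special_case("_SPECIAL_", [{"orth": "_SPECIAL_"}])
    doc = nlp.tokenizer("_SPECIAL_.")
    assert [t.text for t in doc] == ["_SPECIAL_", "."]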
+ numbers = [str(i) for i in range(start, start + size)] + docs.append(Doc(vocab, words=numbers)) + start += size + return docs + + def apply_transition_sequence(parser, doc, sequence): """Perform a series of pre-specified transitions, to put the parser in a desired state.""" diff --git a/spacy/tests/vocab_vectors/test_lexeme.py b/spacy/tests/vocab_vectors/test_lexeme.py index af73a79bf..4288f427c 100644 --- a/spacy/tests/vocab_vectors/test_lexeme.py +++ b/spacy/tests/vocab_vectors/test_lexeme.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import numpy from spacy.attrs import IS_ALPHA, IS_DIGIT diff --git a/spacy/tests/vocab_vectors/test_lookups.py b/spacy/tests/vocab_vectors/test_lookups.py index af15e9e91..d8c7651e4 100644 --- a/spacy/tests/vocab_vectors/test_lookups.py +++ b/spacy/tests/vocab_vectors/test_lookups.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lookups import Lookups, Table from spacy.strings import get_string_id diff --git a/spacy/tests/vocab_vectors/test_similarity.py b/spacy/tests/vocab_vectors/test_similarity.py index f98f0e6e0..b5f7303b5 100644 --- a/spacy/tests/vocab_vectors/test_similarity.py +++ b/spacy/tests/vocab_vectors/test_similarity.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import numpy from spacy.tokens import Doc diff --git a/spacy/tests/vocab_vectors/test_stringstore.py b/spacy/tests/vocab_vectors/test_stringstore.py index 75b1116dd..c71d5f3f2 100644 --- a/spacy/tests/vocab_vectors/test_stringstore.py +++ b/spacy/tests/vocab_vectors/test_stringstore.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.strings import StringStore diff --git a/spacy/tests/vocab_vectors/test_vectors.py b/spacy/tests/vocab_vectors/test_vectors.py index 1821f8abc..cc95252a6 100644 --- a/spacy/tests/vocab_vectors/test_vectors.py +++ b/spacy/tests/vocab_vectors/test_vectors.py @@ -1,17 +1,13 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import numpy from numpy.testing import assert_allclose, assert_equal -from spacy._ml import cosine from spacy.vocab import Vocab from spacy.vectors import Vectors from spacy.tokenizer import Tokenizer from spacy.strings import hash_string from spacy.tokens import Doc -from ..util import add_vecs_to_vocab, make_tempdir +from ..util import add_vecs_to_vocab, get_cosine, make_tempdir @pytest.fixture @@ -336,7 +332,7 @@ def test_vocab_prune_vectors(): assert list(remap.keys()) == ["kitten"] neighbour, similarity = list(remap.values())[0] assert neighbour == "cat", remap - assert_allclose(similarity, cosine(data[0], data[2]), atol=1e-4, rtol=1e-3) + assert_allclose(similarity, get_cosine(data[0], data[2]), atol=1e-4, rtol=1e-3) def test_vectors_serialize(): diff --git a/spacy/tests/vocab_vectors/test_vocab_api.py b/spacy/tests/vocab_vectors/test_vocab_api.py index d22db2d8b..a687059be 100644 --- a/spacy/tests/vocab_vectors/test_vocab_api.py +++ b/spacy/tests/vocab_vectors/test_vocab_api.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.attrs import LEMMA, ORTH, PROB, IS_ALPHA from spacy.parts_of_speech import NOUN, VERB diff --git a/spacy/tokenizer.pxd b/spacy/tokenizer.pxd index dadbad7bd..e82833701 100644 --- a/spacy/tokenizer.pxd +++ b/spacy/tokenizer.pxd @@ -1,13 +1,13 @@ from libcpp.vector cimport vector - from preshed.maps cimport PreshMap from cymem.cymem 
cimport Pool from .typedefs cimport hash_t -from .structs cimport LexemeC, TokenC +from .structs cimport LexemeC, SpanC, TokenC from .strings cimport StringStore from .tokens.doc cimport Doc from .vocab cimport Vocab, LexemesOrTokens, _Cached +from .matcher.phrasematcher cimport PhraseMatcher cdef class Tokenizer: @@ -21,15 +21,32 @@ cdef class Tokenizer: cdef object _suffix_search cdef object _infix_finditer cdef object _rules + cdef PhraseMatcher _special_matcher + cdef int _property_init_count + cdef int _property_init_max cpdef Doc tokens_from_list(self, list strings) + cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases) + cdef int _apply_special_cases(self, Doc doc) except -1 + cdef void _filter_special_spans(self, vector[SpanC] &original, + vector[SpanC] &filtered, int doc_len) nogil + cdef object _prepare_special_spans(self, Doc doc, + vector[SpanC] &filtered) + cdef int _retokenize_special_spans(self, Doc doc, TokenC* tokens, + object span_data) cdef int _try_cache(self, hash_t key, Doc tokens) except -1 - cdef int _tokenize(self, Doc tokens, unicode span, hash_t key) except -1 - cdef unicode _split_affixes(self, Pool mem, unicode string, vector[LexemeC*] *prefixes, - vector[LexemeC*] *suffixes, int* has_special) + cdef int _try_specials(self, hash_t key, Doc tokens, + int* has_special) except -1 + cdef int _tokenize(self, Doc tokens, unicode span, hash_t key, + int* has_special, bint with_special_cases) except -1 + cdef unicode _split_affixes(self, Pool mem, unicode string, + vector[LexemeC*] *prefixes, + vector[LexemeC*] *suffixes, int* has_special, + bint with_special_cases) cdef int _attach_tokens(self, Doc tokens, unicode string, - vector[LexemeC*] *prefixes, vector[LexemeC*] *suffixes) except -1 - - cdef int _save_cached(self, const TokenC* tokens, hash_t key, int has_special, - int n) except -1 + vector[LexemeC*] *prefixes, + vector[LexemeC*] *suffixes, int* has_special, + bint with_special_cases) except -1 + cdef int _save_cached(self, const TokenC* tokens, hash_t key, + int* has_special, int n) except -1 diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx index 69d6285e1..7e75052f7 100644 --- a/spacy/tokenizer.pyx +++ b/spacy/tokenizer.pyx @@ -1,26 +1,27 @@ -# cython: embedsignature=True -# cython: profile=True -# coding: utf8 +# cython: embedsignature=True, profile=True from __future__ import unicode_literals from cython.operator cimport dereference as deref from cython.operator cimport preincrement as preinc +from libc.string cimport memcpy, memset +from libcpp.set cimport set as stdset from cymem.cymem cimport Pool from preshed.maps cimport PreshMap cimport cython -from collections import OrderedDict import re import warnings from .tokens.doc cimport Doc from .strings cimport hash_string -from .compat import unescape_unicode, basestring_ +from .lexeme cimport EMPTY_LEXEME + from .attrs import intify_attrs from .symbols import ORTH - from .errors import Errors, Warnings from . 
import util +from .attrs import intify_attrs +from .symbols import ORTH cdef class Tokenizer: @@ -60,7 +61,10 @@ cdef class Tokenizer: self.infix_finditer = infix_finditer self.vocab = vocab self._rules = {} - self._load_special_tokenization(rules) + self._special_matcher = PhraseMatcher(self.vocab) + self._load_special_cases(rules) + self._property_init_count = 0 + self._property_init_max = 4 property token_match: def __get__(self): @@ -68,7 +72,9 @@ cdef class Tokenizer: def __set__(self, token_match): self._token_match = token_match - self._flush_cache() + self._reload_special_cases() + if self._property_init_count <= self._property_init_max: + self._property_init_count += 1 property prefix_search: def __get__(self): @@ -76,7 +82,9 @@ cdef class Tokenizer: def __set__(self, prefix_search): self._prefix_search = prefix_search - self._flush_cache() + self._reload_special_cases() + if self._property_init_count <= self._property_init_max: + self._property_init_count += 1 property suffix_search: def __get__(self): @@ -84,7 +92,9 @@ cdef class Tokenizer: def __set__(self, suffix_search): self._suffix_search = suffix_search - self._flush_cache() + self._reload_special_cases() + if self._property_init_count <= self._property_init_max: + self._property_init_count += 1 property infix_finditer: def __get__(self): @@ -92,7 +102,9 @@ cdef class Tokenizer: def __set__(self, infix_finditer): self._infix_finditer = infix_finditer - self._flush_cache() + self._reload_special_cases() + if self._property_init_count <= self._property_init_max: + self._property_init_count += 1 property rules: def __get__(self): @@ -101,10 +113,10 @@ cdef class Tokenizer: def __set__(self, rules): self._rules = {} self._reset_cache([key for key in self._cache]) - self._reset_specials() + self._flush_specials() self._cache = PreshMap() self._specials = PreshMap() - self._load_special_tokenization(rules) + self._load_special_cases(rules) def __reduce__(self): args = (self.vocab, @@ -119,7 +131,6 @@ cdef class Tokenizer: warnings.warn(Warnings.W002, DeprecationWarning) return Doc(self.vocab, words=strings) - @cython.boundscheck(False) def __call__(self, unicode string): """Tokenize a string. @@ -128,6 +139,17 @@ cdef class Tokenizer: DOCS: https://spacy.io/api/tokenizer#call """ + doc = self._tokenize_affixes(string, True) + self._apply_special_cases(doc) + return doc + + @cython.boundscheck(False) + cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases): + """Tokenize according to affix and token_match settings. + + string (unicode): The string to tokenize. + RETURNS (Doc): A container for linguistic annotations. + """ if len(string) >= (2 ** 30): raise ValueError(Errors.E025.format(length=len(string))) cdef int length = len(string) @@ -136,7 +158,9 @@ cdef class Tokenizer: return doc cdef int i = 0 cdef int start = 0 - cdef bint cache_hit + cdef int has_special = 0 + cdef bint specials_hit = 0 + cdef bint cache_hit = 0 cdef bint in_ws = string[0].isspace() cdef unicode span # The task here is much like string.split, but not quite @@ -152,9 +176,14 @@ cdef class Tokenizer: # we don't have to create the slice when we hit the cache. 
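As the rewritten `__call__` above shows, tokenization is now a two-pass process: an affix/whitespace pass builds the Doc (with caching), and a second pass retokenizes any special-case spans found by the tokenizer's internal PhraseMatcher. In rough pseudo-Python (the real implementation is the Cython in this file, and these are cdef methods, not public API):

    def tokenize(tokenizer, string):
        # pass 1: split on whitespace, prefixes, suffixes and infixes
        doc = tokenizer._tokenize_affixes(string, with_special_cases=True)
        # pass 2: match special-case spans with the PhraseMatcher, filter
        # overlapping matches, and retokenize the surviving spans
        tokenizer._apply_special_cases(doc)
        return doc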
span = string[start:i] key = hash_string(span) - cache_hit = self._try_cache(key, doc) - if not cache_hit: - self._tokenize(doc, span, key) + specials_hit = 0 + cache_hit = 0 + if with_special_cases: + specials_hit = self._try_specials(key, doc, &has_special) + if not specials_hit: + cache_hit = self._try_cache(key, doc) + if not specials_hit and not cache_hit: + self._tokenize(doc, span, key, &has_special, with_special_cases) if uc == ' ': doc.c[doc.length - 1].spacy = True start = i + 1 @@ -165,13 +194,18 @@ cdef class Tokenizer: if start < i: span = string[start:] key = hash_string(span) - cache_hit = self._try_cache(key, doc) - if not cache_hit: - self._tokenize(doc, span, key) + specials_hit = 0 + cache_hit = 0 + if with_special_cases: + specials_hit = self._try_specials(key, doc, &has_special) + if not specials_hit: + cache_hit = self._try_cache(key, doc) + if not specials_hit and not cache_hit: + self._tokenize(doc, span, key, &has_special, with_special_cases) doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws return doc - def pipe(self, texts, batch_size=1000, n_threads=-1): + def pipe(self, texts, batch_size=1000, n_threads=-1, as_example=False): """Tokenize a stream of texts. texts: A sequence of unicode texts. @@ -187,23 +221,141 @@ cdef class Tokenizer: yield self(text) def _flush_cache(self): - self._reset_cache([key for key in self._cache if not key in self._specials]) + self._reset_cache([key for key in self._cache]) def _reset_cache(self, keys): for k in keys: + cached = <_Cached*>self._cache.get(k) del self._cache[k] - if not k in self._specials: - cached = <_Cached*>self._cache.get(k) - if cached is not NULL: - self.mem.free(cached) + if cached is not NULL: + self.mem.free(cached) - def _reset_specials(self): + def _flush_specials(self): for k in self._specials: cached = <_Cached*>self._specials.get(k) del self._specials[k] if cached is not NULL: self.mem.free(cached) + cdef int _apply_special_cases(self, Doc doc) except -1: + """Retokenize doc according to special cases. + + doc (Doc): Document. 
+ """ + cdef int i + cdef int max_length = 0 + cdef bint modify_in_place + cdef Pool mem = Pool() + cdef vector[SpanC] c_matches + cdef vector[SpanC] c_filtered + cdef int offset + cdef int modified_doc_length + # Find matches for special cases + self._special_matcher.find_matches(doc, &c_matches) + # Skip processing if no matches + if c_matches.size() == 0: + return True + self._filter_special_spans(c_matches, c_filtered, doc.length) + # Put span info in span.start-indexed dict and calculate maximum + # intermediate document size + (span_data, max_length, modify_in_place) = self._prepare_special_spans(doc, c_filtered) + # If modifications never increase doc length, can modify in place + if modify_in_place: + tokens = doc.c + # Otherwise create a separate array to store modified tokens + else: + tokens = mem.alloc(max_length, sizeof(TokenC)) + # Modify tokenization according to filtered special cases + offset = self._retokenize_special_spans(doc, tokens, span_data) + # Allocate more memory for doc if needed + modified_doc_length = doc.length + offset + while modified_doc_length >= doc.max_length: + doc._realloc(doc.max_length * 2) + # If not modified in place, copy tokens back to doc + if not modify_in_place: + memcpy(doc.c, tokens, max_length * sizeof(TokenC)) + for i in range(doc.length + offset, doc.length): + memset(&doc.c[i], 0, sizeof(TokenC)) + doc.c[i].lex = &EMPTY_LEXEME + doc.length = doc.length + offset + return True + + cdef void _filter_special_spans(self, vector[SpanC] &original, vector[SpanC] &filtered, int doc_len) nogil: + + cdef int seen_i + cdef SpanC span + cdef stdset[int] seen_tokens + stdsort(original.begin(), original.end(), len_start_cmp) + cdef int orig_i = original.size() - 1 + while orig_i >= 0: + span = original[orig_i] + if not seen_tokens.count(span.start) and not seen_tokens.count(span.end - 1): + filtered.push_back(span) + for seen_i in range(span.start, span.end): + seen_tokens.insert(seen_i) + orig_i -= 1 + stdsort(filtered.begin(), filtered.end(), start_cmp) + + cdef object _prepare_special_spans(self, Doc doc, vector[SpanC] &filtered): + spans = [doc[match.start:match.end] for match in filtered] + cdef bint modify_in_place = True + cdef int curr_length = doc.length + cdef int max_length + cdef int span_length_diff = 0 + span_data = {} + for span in spans: + rule = self._rules.get(span.text, None) + span_length_diff = 0 + if rule: + span_length_diff = len(rule) - (span.end - span.start) + if span_length_diff > 0: + modify_in_place = False + curr_length += span_length_diff + if curr_length > max_length: + max_length = curr_length + span_data[span.start] = (span.text, span.start, span.end, span_length_diff) + return (span_data, max_length, modify_in_place) + + cdef int _retokenize_special_spans(self, Doc doc, TokenC* tokens, object span_data): + cdef int i = 0 + cdef int j = 0 + cdef int offset = 0 + cdef _Cached* cached + cdef int idx_offset = 0 + cdef int orig_final_spacy + cdef int orig_idx + cdef int span_start + cdef int span_end + while i < doc.length: + if not i in span_data: + tokens[i + offset] = doc.c[i] + i += 1 + else: + span = span_data[i] + span_start = span[1] + span_end = span[2] + cached = <_Cached*>self._specials.get(hash_string(span[0])) + if cached == NULL: + # Copy original tokens if no rule found + for j in range(span_end - span_start): + tokens[i + offset + j] = doc.c[i + j] + i += span_end - span_start + else: + # Copy special case tokens into doc and adjust token and + # character offsets + idx_offset = 0 + orig_final_spacy = 
doc.c[span_end + offset - 1].spacy + orig_idx = doc.c[i].idx + for j in range(cached.length): + tokens[i + offset + j] = cached.data.tokens[j] + tokens[i + offset + j].idx = orig_idx + idx_offset + idx_offset += cached.data.tokens[j].lex.length + \ + 1 if cached.data.tokens[j].spacy else 0 + tokens[i + offset + cached.length - 1].spacy = orig_final_spacy + i += span_end - span_start + offset += span[3] + return offset + cdef int _try_cache(self, hash_t key, Doc tokens) except -1: cached = <_Cached*>self._cache.get(key) if cached == NULL: @@ -217,22 +369,33 @@ cdef class Tokenizer: tokens.push_back(&cached.data.tokens[i], False) return True - cdef int _tokenize(self, Doc tokens, unicode span, hash_t orig_key) except -1: + cdef int _try_specials(self, hash_t key, Doc tokens, int* has_special) except -1: + cached = <_Cached*>self._specials.get(key) + if cached == NULL: + return False + cdef int i + for i in range(cached.length): + tokens.push_back(&cached.data.tokens[i], False) + has_special[0] = 1 + return True + + cdef int _tokenize(self, Doc tokens, unicode span, hash_t orig_key, int* has_special, bint with_special_cases) except -1: cdef vector[LexemeC*] prefixes cdef vector[LexemeC*] suffixes cdef int orig_size - cdef int has_special = 0 orig_size = tokens.length span = self._split_affixes(tokens.mem, span, &prefixes, &suffixes, - &has_special) - self._attach_tokens(tokens, span, &prefixes, &suffixes) + has_special, with_special_cases) + self._attach_tokens(tokens, span, &prefixes, &suffixes, has_special, + with_special_cases) self._save_cached(&tokens.c[orig_size], orig_key, has_special, tokens.length - orig_size) cdef unicode _split_affixes(self, Pool mem, unicode string, vector[const LexemeC*] *prefixes, vector[const LexemeC*] *suffixes, - int* has_special): + int* has_special, + bint with_special_cases): cdef size_t i cdef unicode prefix cdef unicode suffix @@ -240,29 +403,28 @@ cdef class Tokenizer: cdef unicode minus_suf cdef size_t last_size = 0 while string and len(string) != last_size: - if self._specials.get(hash_string(string)) != NULL: - has_special[0] = 1 + if self.token_match and self.token_match(string) \ + and not self.find_prefix(string) \ + and not self.find_suffix(string): + break + if with_special_cases and self._specials.get(hash_string(string)) != NULL: break last_size = len(string) pre_len = self.find_prefix(string) if pre_len != 0: prefix = string[:pre_len] minus_pre = string[pre_len:] - # Check whether we've hit a special-case - if minus_pre and self._specials.get(hash_string(minus_pre)) != NULL: + if minus_pre and with_special_cases and self._specials.get(hash_string(minus_pre)) != NULL: string = minus_pre prefixes.push_back(self.vocab.get(mem, prefix)) - has_special[0] = 1 break suf_len = self.find_suffix(string) if suf_len != 0: suffix = string[-suf_len:] minus_suf = string[:-suf_len] - # Check whether we've hit a special-case - if minus_suf and (self._specials.get(hash_string(minus_suf)) != NULL): + if minus_suf and with_special_cases and self._specials.get(hash_string(minus_suf)) != NULL: string = minus_suf suffixes.push_back(self.vocab.get(mem, suffix)) - has_special[0] = 1 break if pre_len and suf_len and (pre_len + suf_len) <= len(string): string = string[pre_len:-suf_len] @@ -274,15 +436,15 @@ cdef class Tokenizer: elif suf_len: string = minus_suf suffixes.push_back(self.vocab.get(mem, suffix)) - if string and (self._specials.get(hash_string(string)) != NULL): - has_special[0] = 1 - break return string cdef int _attach_tokens(self, Doc tokens, unicode 
string, vector[const LexemeC*] *prefixes, - vector[const LexemeC*] *suffixes) except -1: - cdef bint cache_hit + vector[const LexemeC*] *suffixes, + int* has_special, + bint with_special_cases) except -1: + cdef bint specials_hit = 0 + cdef bint cache_hit = 0 cdef int split, end cdef const LexemeC* const* lexemes cdef const LexemeC* lexeme @@ -292,8 +454,12 @@ cdef class Tokenizer: for i in range(prefixes.size()): tokens.push_back(prefixes[0][i], False) if string: - cache_hit = self._try_cache(hash_string(string), tokens) - if cache_hit: + if with_special_cases: + specials_hit = self._try_specials(hash_string(string), tokens, + has_special) + if not specials_hit: + cache_hit = self._try_cache(hash_string(string), tokens) + if specials_hit or cache_hit: pass elif self.token_match and self.token_match(string): # We're always saying 'no' to spaces here -- the caller will @@ -338,7 +504,7 @@ cdef class Tokenizer: tokens.push_back(lexeme, False) cdef int _save_cached(self, const TokenC* tokens, hash_t key, - int has_special, int n) except -1: + int* has_special, int n) except -1: cdef int i if n <= 0: # avoid mem alloc of zero length @@ -347,7 +513,7 @@ cdef class Tokenizer: if self.vocab._by_orth.get(tokens[i].lex.orth) == NULL: return 0 # See #1250 - if has_special: + if has_special[0]: return 0 cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached)) cached.length = n @@ -400,12 +566,25 @@ cdef class Tokenizer: match = self.suffix_search(string) return (match.end() - match.start()) if match is not None else 0 - def _load_special_tokenization(self, special_cases): + def _load_special_cases(self, special_cases): """Add special-case tokenization rules.""" if special_cases is not None: for chunk, substrings in sorted(special_cases.items()): + self._validate_special_case(chunk, substrings) self.add_special_case(chunk, substrings) + def _validate_special_case(self, chunk, substrings): + """Check whether the `ORTH` fields match the string. + + string (unicode): The string to specially tokenize. + substrings (iterable): A sequence of dicts, where each dict describes + a token and its attributes. + """ + attrs = [intify_attrs(spec, _do_deprecated=True) for spec in substrings] + orth = "".join([spec[ORTH] for spec in attrs]) + if chunk != orth: + raise ValueError(Errors.E997.format(chunk=chunk, orth=orth, token_attrs=substrings)) + def add_special_case(self, unicode string, substrings): """Add a special-case tokenization rule. 
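A quick, user-level illustration of the `_validate_special_case` check added above, assuming the behaviour in this diff: the `ORTH` values of a special case must concatenate back to the original string, otherwise a `ValueError` (E997) is raised. The exact error wording may differ.

```python
from spacy.lang.en import English
from spacy.symbols import ORTH

nlp = English()
# Valid: "do" + "n't" == "don't", so the rule is accepted.
nlp.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "n't"}])
print([t.text for t in nlp("don't")])   # ['do', "n't"]

# Invalid: "do" + "n't" != "dont", so the new validation rejects the rule.
try:
    nlp.tokenizer.add_special_case("dont", [{ORTH: "do"}, {ORTH: "n't"}])
except ValueError as err:
    print("rejected:", err)
```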
@@ -416,6 +595,7 @@ cdef class Tokenizer: DOCS: https://spacy.io/api/tokenizer#add_special_case """ + self._validate_special_case(string, substrings) substrings = list(substrings) cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached)) cached.length = len(substrings) @@ -423,15 +603,25 @@ cdef class Tokenizer: cached.data.tokens = self.vocab.make_fused_token(substrings) key = hash_string(string) stale_special = <_Cached*>self._specials.get(key) - stale_cached = <_Cached*>self._cache.get(key) - self._flush_cache() self._specials.set(key, cached) - self._cache.set(key, cached) if stale_special is not NULL: self.mem.free(stale_special) - if stale_special != stale_cached and stale_cached is not NULL: - self.mem.free(stale_cached) self._rules[string] = substrings + self._flush_cache() + if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string): + self._special_matcher.add(string, None, self._tokenize_affixes(string, False)) + + def _reload_special_cases(self): + try: + self._property_init_count + except AttributeError: + return + # only reload if all 4 of prefix, suffix, infix, token_match have + # have been initialized + if self.vocab is not None and self._property_init_count >= self._property_init_max: + self._flush_cache() + self._flush_specials() + self._load_special_cases(self._rules) def explain(self, text): """A debugging tokenizer that provides information about which @@ -537,14 +727,14 @@ cdef class Tokenizer: DOCS: https://spacy.io/api/tokenizer#to_bytes """ - serializers = OrderedDict(( - ("vocab", lambda: self.vocab.to_bytes()), - ("prefix_search", lambda: _get_regex_pattern(self.prefix_search)), - ("suffix_search", lambda: _get_regex_pattern(self.suffix_search)), - ("infix_finditer", lambda: _get_regex_pattern(self.infix_finditer)), - ("token_match", lambda: _get_regex_pattern(self.token_match)), - ("exceptions", lambda: OrderedDict(sorted(self._rules.items()))) - )) + serializers = { + "vocab": lambda: self.vocab.to_bytes(), + "prefix_search": lambda: _get_regex_pattern(self.prefix_search), + "suffix_search": lambda: _get_regex_pattern(self.suffix_search), + "infix_finditer": lambda: _get_regex_pattern(self.infix_finditer), + "token_match": lambda: _get_regex_pattern(self.token_match), + "exceptions": lambda: dict(sorted(self._rules.items())) + } exclude = util.get_serialization_exclude(serializers, exclude, kwargs) return util.to_bytes(serializers, exclude) @@ -557,40 +747,50 @@ cdef class Tokenizer: DOCS: https://spacy.io/api/tokenizer#from_bytes """ - data = OrderedDict() - deserializers = OrderedDict(( - ("vocab", lambda b: self.vocab.from_bytes(b)), - ("prefix_search", lambda b: data.setdefault("prefix_search", b)), - ("suffix_search", lambda b: data.setdefault("suffix_search", b)), - ("infix_finditer", lambda b: data.setdefault("infix_finditer", b)), - ("token_match", lambda b: data.setdefault("token_match", b)), - ("exceptions", lambda b: data.setdefault("rules", b)) - )) + data = {} + deserializers = { + "vocab": lambda b: self.vocab.from_bytes(b), + "prefix_search": lambda b: data.setdefault("prefix_search", b), + "suffix_search": lambda b: data.setdefault("suffix_search", b), + "infix_finditer": lambda b: data.setdefault("infix_finditer", b), + "token_match": lambda b: data.setdefault("token_match", b), + "exceptions": lambda b: data.setdefault("rules", b) + } exclude = util.get_serialization_exclude(deserializers, exclude, kwargs) msg = util.from_bytes(bytes_data, deserializers, exclude) - for key in ["prefix_search", "suffix_search", 
"infix_finditer", "token_match"]: - if key in data: - data[key] = unescape_unicode(data[key]) - if "prefix_search" in data and isinstance(data["prefix_search"], basestring_): + if "prefix_search" in data and isinstance(data["prefix_search"], str): self.prefix_search = re.compile(data["prefix_search"]).search - if "suffix_search" in data and isinstance(data["suffix_search"], basestring_): + if "suffix_search" in data and isinstance(data["suffix_search"], str): self.suffix_search = re.compile(data["suffix_search"]).search - if "infix_finditer" in data and isinstance(data["infix_finditer"], basestring_): + if "infix_finditer" in data and isinstance(data["infix_finditer"], str): self.infix_finditer = re.compile(data["infix_finditer"]).finditer - if "token_match" in data and isinstance(data["token_match"], basestring_): + if "token_match" in data and isinstance(data["token_match"], str): self.token_match = re.compile(data["token_match"]).match if "rules" in data and isinstance(data["rules"], dict): # make sure to hard reset the cache to remove data from the default exceptions self._rules = {} - self._reset_cache([key for key in self._cache]) - self._reset_specials() - self._cache = PreshMap() - self._specials = PreshMap() - self._load_special_tokenization(data["rules"]) - + self._flush_cache() + self._flush_specials() + self._load_special_cases(data["rules"]) return self def _get_regex_pattern(regex): """Get a pattern string for a regex, or None if the pattern is None.""" return None if regex is None else regex.__self__.pattern + + +cdef extern from "" namespace "std" nogil: + void stdsort "sort"(vector[SpanC].iterator, + vector[SpanC].iterator, + bint (*)(SpanC, SpanC)) + + +cdef bint len_start_cmp(SpanC a, SpanC b) nogil: + if a.end - a.start == b.end - b.start: + return b.start < a.start + return a.end - a.start < b.end - b.start + + +cdef bint start_cmp(SpanC a, SpanC b) nogil: + return a.start < b.start diff --git a/spacy/tokens/__init__.py b/spacy/tokens/__init__.py index 536ec8349..1aefa2b7c 100644 --- a/spacy/tokens/__init__.py +++ b/spacy/tokens/__init__.py @@ -1,9 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals - from .doc import Doc from .token import Token from .span import Span from ._serialize import DocBin +from .morphanalysis import MorphAnalysis -__all__ = ["Doc", "Token", "Span", "DocBin"] +__all__ = ["Doc", "Token", "Span", "DocBin", "MorphAnalysis"] diff --git a/spacy/tokens/_retokenize.pyx b/spacy/tokens/_retokenize.pyx index 512ad73bc..dd0b2b820 100644 --- a/spacy/tokens/_retokenize.pyx +++ b/spacy/tokens/_retokenize.pyx @@ -1,14 +1,9 @@ -# coding: utf8 -# cython: infer_types=True -# cython: bounds_check=False -# cython: profile=True -from __future__ import unicode_literals - +# cython: infer_types=True, bounds_check=False, profile=True from libc.string cimport memcpy, memset from libc.stdlib cimport malloc, free from cymem.cymem cimport Pool -from thinc.neural.util import get_array_module +from thinc.api import get_array_module import numpy from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end @@ -16,7 +11,7 @@ from .span cimport Span from .token cimport Token from ..lexeme cimport Lexeme, EMPTY_LEXEME from ..structs cimport LexemeC, TokenC -from ..attrs cimport TAG +from ..attrs cimport TAG, MORPH from .underscore import is_writable_attr from ..attrs import intify_attrs @@ -68,6 +63,8 @@ cdef class Retokenizer: attrs["_"] = extensions else: attrs = intify_attrs(attrs, strings_map=self.doc.vocab.strings) + if MORPH in attrs: + 
self.doc.vocab.morphology.add(self.doc.vocab.strings.as_string(attrs[MORPH])) self.merges.append((span, attrs)) def split(self, Token token, orths, heads, attrs=SimpleFrozenDict()): @@ -99,6 +96,9 @@ cdef class Retokenizer: # NB: Since we support {"KEY": [value, value]} syntax here, this # will only "intify" the keys, not the values attrs = intify_attrs(attrs, strings_map=self.doc.vocab.strings) + if MORPH in attrs: + for morph in attrs[MORPH]: + self.doc.vocab.morphology.add(self.doc.vocab.strings.as_string(morph)) head_offsets = [] for head in heads: if isinstance(head, Token): diff --git a/spacy/tokens/_serialize.py b/spacy/tokens/_serialize.py index b60a6d7b3..d3f49550c 100644 --- a/spacy/tokens/_serialize.py +++ b/spacy/tokens/_serialize.py @@ -1,10 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals - import numpy import zlib import srsly -from thinc.neural.ops import NumpyOps +from thinc.api import NumpyOps from ..compat import copy_reg from ..tokens import Doc @@ -138,10 +135,13 @@ class DocBin(object): for tokens in self.tokens: assert len(tokens.shape) == 2, tokens.shape # this should never happen lengths = [len(tokens) for tokens in self.tokens] + tokens = numpy.vstack(self.tokens) if self.tokens else numpy.asarray([]) + spaces = numpy.vstack(self.spaces) if self.spaces else numpy.asarray([]) + msg = { "attrs": self.attrs, - "tokens": numpy.vstack(self.tokens).tobytes("C"), - "spaces": numpy.vstack(self.spaces).tobytes("C"), + "tokens": tokens.tobytes("C"), + "spaces": spaces.tobytes("C"), "lengths": numpy.asarray(lengths, dtype="int32").tobytes("C"), "strings": list(self.strings), "cats": self.cats, diff --git a/spacy/tokens/doc.pxd b/spacy/tokens/doc.pxd index 6536d271d..42918ab6d 100644 --- a/spacy/tokens/doc.pxd +++ b/spacy/tokens/doc.pxd @@ -51,6 +51,7 @@ cdef class Doc: cdef public bint is_tagged cdef public bint is_parsed + cdef public bint is_morphed cdef public float sentiment diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index 25a147208..e6841eb80 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -1,21 +1,16 @@ - -# coding: utf8 -# cython: infer_types=True -# cython: bounds_check=False -# cython: profile=True -from __future__ import unicode_literals - +# cython: infer_types=True, bounds_check=False, profile=True cimport cython cimport numpy as np from libc.string cimport memcpy, memset from libc.math cimport sqrt -from collections import Counter +from collections import Counter import numpy import numpy.linalg import struct import srsly -from thinc.neural.util import get_array_module, copy_array +from thinc.api import get_array_module +from thinc.util import copy_array import warnings from .span cimport Span @@ -29,7 +24,7 @@ from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t from ..attrs import intify_attrs, IDS from ..util import normalize_slice -from ..compat import is_config, copy_reg, pickle, basestring_ +from ..compat import copy_reg, pickle from ..errors import Errors, Warnings from .. 
import util from .underscore import Underscore, get_ext_args @@ -341,9 +336,7 @@ cdef class Doc: return "".join([t.text_with_ws for t in self]).encode("utf-8") def __str__(self): - if is_config(python3=True): - return self.__unicode__() - return self.__bytes__() + return self.__unicode__() def __repr__(self): return self.__str__() @@ -412,7 +405,9 @@ cdef class Doc: return 0.0 vector = self.vector xp = get_array_module(vector) - return xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm) + result = xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm) + # ensure we get a scalar back (numpy does this automatically but cupy doesn't) + return result.item() @property def has_vector(self): @@ -520,7 +515,7 @@ cdef class Doc: token = &self.c[i] if token.ent_iob == 1: if start == -1: - seq = ["%s|%s" % (t.text, t.ent_iob_) for t in self[i-5:i+5]] + seq = [f"{t.text}|{t.ent_iob_}" for t in self[i-5:i+5]] raise ValueError(Errors.E093.format(seq=" ".join(seq))) elif token.ent_iob == 2 or token.ent_iob == 0: if start != -1: @@ -597,7 +592,7 @@ cdef class Doc: DOCS: https://spacy.io/api/doc#noun_chunks """ - + # Accumulate the result before beginning to iterate over it. This # prevents the tokenisation from being changed out from under us # during the iteration. The tricky thing here is that Span accepts @@ -697,7 +692,7 @@ cdef class Doc: cdef np.ndarray[attr_t, ndim=2] output # Handle scalar/list inputs of strings/ints for py_attr_ids # See also #3064 - if isinstance(py_attr_ids, basestring_): + if isinstance(py_attr_ids, str): # Handle inputs like doc.to_array('ORTH') py_attr_ids = [py_attr_ids] elif not hasattr(py_attr_ids, "__iter__"): @@ -786,7 +781,7 @@ cdef class Doc: """ # Handle scalar/list inputs of strings/ints for py_attr_ids # See also #3064 - if isinstance(attrs, basestring_): + if isinstance(attrs, str): # Handle inputs like doc.to_array('ORTH') attrs = [attrs] elif not hasattr(attrs, "__iter__"): diff --git a/spacy/tokens/morphanalysis.pxd b/spacy/tokens/morphanalysis.pxd index 22844454a..9510875c9 100644 --- a/spacy/tokens/morphanalysis.pxd +++ b/spacy/tokens/morphanalysis.pxd @@ -5,5 +5,5 @@ from ..structs cimport MorphAnalysisC cdef class MorphAnalysis: cdef readonly Vocab vocab - cdef hash_t key + cdef readonly hash_t key cdef MorphAnalysisC c diff --git a/spacy/tokens/morphanalysis.pyx b/spacy/tokens/morphanalysis.pyx index e09870741..ed987f4e4 100644 --- a/spacy/tokens/morphanalysis.pyx +++ b/spacy/tokens/morphanalysis.pyx @@ -1,15 +1,14 @@ from libc.string cimport memset +cimport numpy as np from ..vocab cimport Vocab from ..typedefs cimport hash_t, attr_t -from ..morphology cimport list_features, check_feature, get_field, tag_to_json - -from ..strings import get_string_id +from ..morphology cimport list_features, check_feature, get_by_field cdef class MorphAnalysis: """Control access to morphological features for a token.""" - def __init__(self, Vocab vocab, features=tuple()): + def __init__(self, Vocab vocab, features=dict()): self.vocab = vocab self.key = self.vocab.morphology.add(features) analysis = self.vocab.morphology.tags.get(self.key) @@ -33,7 +32,7 @@ cdef class MorphAnalysis: def __contains__(self, feature): """Test whether the morphological analysis contains some feature.""" - cdef attr_t feat_id = get_string_id(feature) + cdef attr_t feat_id = self.vocab.strings.as_int(feature) return check_feature(&self.c, feat_id) def __iter__(self): @@ -55,369 +54,28 @@ cdef class MorphAnalysis: def __hash__(self): return self.key - def 
get(self, unicode field): + def __eq__(self, other): + return self.key == other.key + + def __ne__(self, other): + return self.key != other.key + + def get(self, field): """Retrieve a feature by field.""" - cdef int field_id = self.vocab.morphology._feat_map.attr2field[field] - return self.vocab.strings[get_field(&self.c, field_id)] + cdef attr_t field_id = self.vocab.strings.as_int(field) + cdef np.ndarray results = get_by_field(&self.c, field_id) + return [self.vocab.strings[result] for result in results] def to_json(self): - """Produce a json serializable representation, which will be a list of - strings. + """Produce a json serializable representation as a UD FEATS-style + string. """ - return tag_to_json(&self.c) - - @property - def is_base_form(self): - raise NotImplementedError - - @property - def pos(self): - return self.c.pos - - @property - def pos_(self): - return self.vocab.strings[self.c.pos] - - property id: - def __get__(self): - return self.key - - property abbr: - def __get__(self): - return self.c.abbr - - property adp_type: - def __get__(self): - return self.c.adp_type - - property adv_type: - def __get__(self): - return self.c.adv_type - - property animacy: - def __get__(self): - return self.c.animacy - - property aspect: - def __get__(self): - return self.c.aspect - - property case: - def __get__(self): - return self.c.case - - property conj_type: - def __get__(self): - return self.c.conj_type - - property connegative: - def __get__(self): - return self.c.connegative - - property definite: - def __get__(self): - return self.c.definite - - property degree: - def __get__(self): - return self.c.degree - - property derivation: - def __get__(self): - return self.c.derivation - - property echo: - def __get__(self): - return self.c.echo - - property foreign: - def __get__(self): - return self.c.foreign - - property gender: - def __get__(self): - return self.c.gender - - property hyph: - def __get__(self): - return self.c.hyph - - property inf_form: - def __get__(self): - return self.c.inf_form - - property mood: - def __get__(self): - return self.c.mood - - property name_type: - def __get__(self): - return self.c.name_type - - property negative: - def __get__(self): - return self.c.negative - - property noun_type: - def __get__(self): - return self.c.noun_type - - property number: - def __get__(self): - return self.c.number - - property num_form: - def __get__(self): - return self.c.num_form - - property num_type: - def __get__(self): - return self.c.num_type - - property num_value: - def __get__(self): - return self.c.num_value - - property part_form: - def __get__(self): - return self.c.part_form - - property part_type: - def __get__(self): - return self.c.part_type - - property person: - def __get__(self): - return self.c.person - - property polite: - def __get__(self): - return self.c.polite - - property polarity: - def __get__(self): - return self.c.polarity - - property poss: - def __get__(self): - return self.c.poss - - property prefix: - def __get__(self): - return self.c.prefix - - property prep_case: - def __get__(self): - return self.c.prep_case - - property pron_type: - def __get__(self): - return self.c.pron_type - - property punct_side: - def __get__(self): - return self.c.punct_side - - property punct_type: - def __get__(self): - return self.c.punct_type - - property reflex: - def __get__(self): - return self.c.reflex - - property style: - def __get__(self): - return self.c.style - - property style_variant: - def __get__(self): - return self.c.style_variant - - 
property tense: - def __get__(self): - return self.c.tense - - property typo: - def __get__(self): - return self.c.typo - - property verb_form: - def __get__(self): - return self.c.verb_form - - property voice: - def __get__(self): - return self.c.voice - - property verb_type: - def __get__(self): - return self.c.verb_type - - property abbr_: - def __get__(self): - return self.vocab.strings[self.c.abbr] - - property adp_type_: - def __get__(self): - return self.vocab.strings[self.c.adp_type] - - property adv_type_: - def __get__(self): - return self.vocab.strings[self.c.adv_type] - - property animacy_: - def __get__(self): - return self.vocab.strings[self.c.animacy] - - property aspect_: - def __get__(self): - return self.vocab.strings[self.c.aspect] - - property case_: - def __get__(self): - return self.vocab.strings[self.c.case] - - property conj_type_: - def __get__(self): - return self.vocab.strings[self.c.conj_type] - - property connegative_: - def __get__(self): - return self.vocab.strings[self.c.connegative] - - property definite_: - def __get__(self): - return self.vocab.strings[self.c.definite] - - property degree_: - def __get__(self): - return self.vocab.strings[self.c.degree] - - property derivation_: - def __get__(self): - return self.vocab.strings[self.c.derivation] - - property echo_: - def __get__(self): - return self.vocab.strings[self.c.echo] - - property foreign_: - def __get__(self): - return self.vocab.strings[self.c.foreign] - - property gender_: - def __get__(self): - return self.vocab.strings[self.c.gender] - - property hyph_: - def __get__(self): - return self.vocab.strings[self.c.hyph] - - property inf_form_: - def __get__(self): - return self.vocab.strings[self.c.inf_form] - - property name_type_: - def __get__(self): - return self.vocab.strings[self.c.name_type] - - property negative_: - def __get__(self): - return self.vocab.strings[self.c.negative] - - property mood_: - def __get__(self): - return self.vocab.strings[self.c.mood] - - property number_: - def __get__(self): - return self.vocab.strings[self.c.number] - - property num_form_: - def __get__(self): - return self.vocab.strings[self.c.num_form] - - property num_type_: - def __get__(self): - return self.vocab.strings[self.c.num_type] - - property num_value_: - def __get__(self): - return self.vocab.strings[self.c.num_value] - - property part_form_: - def __get__(self): - return self.vocab.strings[self.c.part_form] - - property part_type_: - def __get__(self): - return self.vocab.strings[self.c.part_type] - - property person_: - def __get__(self): - return self.vocab.strings[self.c.person] - - property polite_: - def __get__(self): - return self.vocab.strings[self.c.polite] - - property polarity_: - def __get__(self): - return self.vocab.strings[self.c.polarity] - - property poss_: - def __get__(self): - return self.vocab.strings[self.c.poss] - - property prefix_: - def __get__(self): - return self.vocab.strings[self.c.prefix] - - property prep_case_: - def __get__(self): - return self.vocab.strings[self.c.prep_case] - - property pron_type_: - def __get__(self): - return self.vocab.strings[self.c.pron_type] - - property punct_side_: - def __get__(self): - return self.vocab.strings[self.c.punct_side] - - property punct_type_: - def __get__(self): - return self.vocab.strings[self.c.punct_type] - - property reflex_: - def __get__(self): - return self.vocab.strings[self.c.reflex] - - property style_: - def __get__(self): - return self.vocab.strings[self.c.style] - - property style_variant_: - def 
__get__(self): - return self.vocab.strings[self.c.style_variant] - - property tense_: - def __get__(self): - return self.vocab.strings[self.c.tense] - - property typo_: - def __get__(self): - return self.vocab.strings[self.c.typo] - - property verb_form_: - def __get__(self): - return self.vocab.strings[self.c.verb_form] - - property voice_: - def __get__(self): - return self.vocab.strings[self.c.voice] - - property verb_type_: - def __get__(self): - return self.vocab.strings[self.c.verb_type] + morph_string = self.vocab.strings[self.c.key] + if morph_string == self.vocab.morphology.EMPTY_MORPH: + return "" + return morph_string + + def to_dict(self): + """Produce a dict representation. + """ + return self.vocab.morphology.feats_to_dict(self.to_json()) diff --git a/spacy/tokens/span.pyx b/spacy/tokens/span.pyx index 2f1418a5b..e9b151985 100644 --- a/spacy/tokens/span.pyx +++ b/spacy/tokens/span.pyx @@ -1,4 +1,3 @@ -# coding: utf8 from __future__ import unicode_literals cimport numpy as np @@ -6,9 +5,9 @@ from libc.math cimport sqrt import numpy import numpy.linalg -import warnings -from thinc.neural.util import get_array_module +from thinc.api import get_array_module from collections import defaultdict +import warnings from .doc cimport token_by_start, token_by_end, get_token_attr, _get_lca_matrix from .token cimport TokenC @@ -21,7 +20,6 @@ from ..lexeme cimport Lexeme from ..symbols cimport dep from ..util import normalize_slice -from ..compat import is_config, basestring_ from ..errors import Errors, TempErrors, Warnings from .underscore import Underscore, get_ext_args @@ -110,9 +108,9 @@ cdef class Span: self.end_char = self.doc[end - 1].idx + len(self.doc[end - 1]) else: self.end_char = 0 - if isinstance(label, basestring_): + if isinstance(label, str): label = doc.vocab.strings.add(label) - if isinstance(kb_id, basestring_): + if isinstance(kb_id, str): kb_id = doc.vocab.strings.add(kb_id) if label not in doc.vocab.strings: raise ValueError(Errors.E084.format(label=label)) @@ -162,9 +160,7 @@ cdef class Span: return self.end - self.start def __repr__(self): - if is_config(python3=True): - return self.text - return self.text.encode("utf-8") + return self.text def __getitem__(self, object i): """Get a `Token` or a `Span` object @@ -475,7 +471,7 @@ cdef class Span: @property def tensor(self): """The span's slice of the doc's tensor. - + RETURNS (ndarray[ndim=2, dtype='float32']): A 2D numpy or cupy array representing the span's semantics. 
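To make the new MorphAnalysis serialization above concrete: `to_json()` now returns a UD FEATS-style string and `to_dict()` a mapping of fields to values. The helpers below are a hypothetical pure-Python sketch of that round trip, not spaCy functions.

```python
# Hypothetical helpers mirroring the FEATS string <-> dict round trip.
def feats_to_dict(feats):
    if not feats:
        return {}
    return dict(feat.split("=", 1) for feat in feats.split("|"))

def dict_to_feats(d):
    return "|".join(f"{field}={value}" for field, value in sorted(d.items()))

feats = "Case=Nom|Number=Sing|Person=3"
print(feats_to_dict(feats))                          # {'Case': 'Nom', 'Number': 'Sing', 'Person': '3'}
print(dict_to_feats(feats_to_dict(feats)) == feats)  # True
```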
""" diff --git a/spacy/tokens/token.pxd b/spacy/tokens/token.pxd index cbca55c40..45c906a82 100644 --- a/spacy/tokens/token.pxd +++ b/spacy/tokens/token.pxd @@ -6,6 +6,7 @@ from ..typedefs cimport attr_t, flags_t from ..parts_of_speech cimport univ_pos_t from .doc cimport Doc from ..lexeme cimport Lexeme + from ..errors import Errors @@ -43,6 +44,8 @@ cdef class Token: return token.pos elif feat_name == TAG: return token.tag + elif feat_name == MORPH: + return token.morph elif feat_name == DEP: return token.dep elif feat_name == HEAD: @@ -73,6 +76,8 @@ cdef class Token: token.pos = value elif feat_name == TAG: token.tag = value + elif feat_name == MORPH: + token.morph = value elif feat_name == DEP: token.dep = value elif feat_name == HEAD: diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx index 45deebc93..58e9196ea 100644 --- a/spacy/tokens/token.pyx +++ b/spacy/tokens/token.pyx @@ -1,7 +1,4 @@ # cython: infer_types=True -# coding: utf8 -from __future__ import unicode_literals - from libc.string cimport memcpy from cpython.mem cimport PyMem_Malloc, PyMem_Free # Compiler crashes on memory view coercion without this. Should report bug. @@ -10,8 +7,8 @@ cimport numpy as np np.import_array() import numpy +from thinc.api import get_array_module import warnings -from thinc.neural.util import get_array_module from ..typedefs cimport hash_t from ..lexeme cimport Lexeme @@ -21,13 +18,12 @@ from ..attrs cimport IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM, LIKE_E from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP from ..symbols cimport conj +from .morphanalysis cimport MorphAnalysis from .. import parts_of_speech from .. import util -from ..compat import is_config from ..errors import Errors, Warnings from .underscore import Underscore, get_ext_args -from .morphanalysis cimport MorphAnalysis cdef class Token: @@ -123,9 +119,7 @@ cdef class Token: return self.text.encode('utf8') def __str__(self): - if is_config(python3=True): - return self.__unicode__() - return self.__bytes__() + return self.__unicode__() def __repr__(self): return self.__str__() @@ -224,6 +218,14 @@ cdef class Token: def morph(self): return MorphAnalysis.from_id(self.vocab, self.c.morph) + property morph_: + def __get__(self): + return str(MorphAnalysis.from_id(self.vocab, self.c.morph)) + + def __set__(self, features): + cdef hash_t key = self.vocab.morphology.add(features) + self.c.morph = key + @property def lex_id(self): """RETURNS (int): Sequential ID of the token's lexical type.""" diff --git a/spacy/tokens/underscore.py b/spacy/tokens/underscore.py index 8dac8526e..fab10b94d 100644 --- a/spacy/tokens/underscore.py +++ b/spacy/tokens/underscore.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import functools import copy diff --git a/spacy/typedefs.pxd b/spacy/typedefs.pxd index bd5b38958..b43814268 100644 --- a/spacy/typedefs.pxd +++ b/spacy/typedefs.pxd @@ -2,7 +2,9 @@ from libc.stdint cimport uint16_t, uint32_t, uint64_t, uintptr_t, int32_t from libc.stdint cimport uint8_t +ctypedef float weight_t ctypedef uint64_t hash_t +ctypedef uint64_t class_t ctypedef char* utf8_t ctypedef uint64_t attr_t ctypedef uint64_t flags_t diff --git a/spacy/util.py b/spacy/util.py index 419c99bc0..a6ccae075 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -1,14 +1,12 @@ -# coding: utf8 -from __future__ import unicode_literals, print_function - import os import importlib +import importlib.util import re from 
pathlib import Path import random -from collections import OrderedDict -from thinc.neural._classes.model import Model -from thinc.neural.ops import NumpyOps +from typing import List +import thinc +from thinc.api import NumpyOps, get_current_ops, Adam, require_gpu, Config import functools import itertools import numpy.random @@ -18,10 +16,6 @@ import catalogue import sys import warnings -try: - import jsonschema -except ImportError: - jsonschema = None try: import cupy.random @@ -29,22 +23,20 @@ except ImportError: cupy = None from .symbols import ORTH -from .compat import cupy, CudaStream, path2str, basestring_, unicode_ -from .compat import import_file +from .compat import cupy, CudaStream from .errors import Errors, Warnings - -_data_path = Path(__file__).parent / "data" _PRINT_ENV = False OOV_RANK = numpy.iinfo(numpy.uint64).max -class registry(object): +class registry(thinc.registry): languages = catalogue.create("spacy", "languages", entry_points=True) architectures = catalogue.create("spacy", "architectures", entry_points=True) lookups = catalogue.create("spacy", "lookups", entry_points=True) factories = catalogue.create("spacy", "factories", entry_points=True) displacy_colors = catalogue.create("spacy", "displacy_colors", entry_points=True) + assets = catalogue.create("spacy", "assets", entry_points=True) def set_env_log(value): @@ -74,7 +66,7 @@ def get_lang_class(lang): return registry.languages.get(lang) else: try: - module = importlib.import_module(".lang.%s" % lang, "spacy") + module = importlib.import_module(f".lang.{lang}", "spacy") except ImportError as err: raise ImportError(Errors.E048.format(lang=lang, err=err)) set_lang_class(lang, getattr(module, module.__all__[0])) @@ -90,39 +82,13 @@ def set_lang_class(name, cls): registry.languages.register(name, func=cls) -def get_data_path(require_exists=True): - """Get path to spaCy data directory. - - require_exists (bool): Only return path if it exists, otherwise None. - RETURNS (Path or None): Data path or None. - """ - if not require_exists: - return _data_path - else: - return _data_path if _data_path.exists() else None - - -def set_data_path(path): - """Set path to spaCy data directory. - - path (unicode or Path): Path to new data directory. - """ - global _data_path - _data_path = ensure_path(path) - - -def make_layer(arch_config): - arch_func = registry.architectures.get(arch_config["arch"]) - return arch_func(arch_config["config"]) - - def ensure_path(path): """Ensure string is converted to a Path. path: Anything. If string, it's converted to Path. RETURNS: Path or original argument. """ - if isinstance(path, basestring_): + if isinstance(path, str): return Path(path) else: return path @@ -141,7 +107,7 @@ def load_language_data(path): path = path.with_suffix(path.suffix + ".gz") if path.exists(): return srsly.read_gzip_json(path) - raise ValueError(Errors.E160.format(path=path2str(path))) + raise ValueError(Errors.E160.format(path=path)) def get_module_path(module): @@ -151,18 +117,13 @@ def get_module_path(module): def load_model(name, **overrides): - """Load a model from a shortcut link, package or data path. + """Load a model from a package or data path. - name (unicode): Package name, shortcut link or model path. + name (unicode): Package name or model path. **overrides: Specific overrides, like pipeline components to disable. RETURNS (Language): `Language` class with the loaded model. 
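Since `registry` now extends `thinc.registry` and gains an `assets` table (added above), custom functions can be registered by name and looked up later, for example from a config. A minimal sketch, assuming this diff is applied; `my_asset` is a made-up entry name, and the decorator form follows standard `catalogue` usage.

```python
from spacy import util

@util.registry.assets.register("my_asset")   # hypothetical entry name
def make_asset():
    return {"source": "local"}

# Later, the callable can be resolved by name (e.g. when filling in a config):
print(util.registry.assets.get("my_asset")())   # {'source': 'local'}
```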
""" - data_path = get_data_path() - if not data_path or not data_path.exists(): - raise IOError(Errors.E049.format(path=path2str(data_path))) - if isinstance(name, basestring_): # in data dir / shortcut - if name in set([d.name for d in data_path.iterdir()]): - return load_model_from_link(name, **overrides) + if isinstance(name, str): # name or string path if is_package(name): # installed as package return load_model_from_package(name, **overrides) if Path(name).exists(): # path to model data directory @@ -172,16 +133,6 @@ def load_model(name, **overrides): raise IOError(Errors.E050.format(name=name)) -def load_model_from_link(name, **overrides): - """Load a model from a shortcut link, or directory in spaCy data path.""" - path = get_data_path() / name / "__init__.py" - try: - cls = import_file(name, path) - except AttributeError: - raise IOError(Errors.E051.format(name=name)) - return cls.load(**overrides) - - def load_model_from_package(name, **overrides): """Load a model from an installed package.""" cls = importlib.import_module(name) @@ -193,6 +144,10 @@ def load_model_from_path(model_path, meta=False, **overrides): pipeline from meta.json and then calls from_disk() with path.""" if not meta: meta = get_model_meta(model_path) + nlp_config = get_model_config(model_path) + if nlp_config.get("nlp", None): + return load_model_from_config(nlp_config["nlp"]) + # Support language factories registered via entry points (e.g. custom # language subclass) while keeping top-level language identifier "lang" lang = meta.get("lang_factory", meta["lang"]) @@ -210,11 +165,30 @@ def load_model_from_path(model_path, meta=False, **overrides): config = meta.get("pipeline_args", {}).get(name, {}) config.update(overrides) factory = factories.get(name, name) + if nlp_config.get(name, None): + model_config = nlp_config[name]["model"] + config["model"] = model_config component = nlp.create_pipe(factory, config=config) nlp.add_pipe(component, name=name) return nlp.from_disk(model_path, exclude=disable) +def load_model_from_config(nlp_config): + if "name" in nlp_config: + nlp = load_model(**nlp_config) + elif "lang" in nlp_config: + lang_class = get_lang_class(nlp_config["lang"]) + nlp = lang_class() + else: + raise ValueError(Errors.E993) + if "pipeline" in nlp_config: + for name, component_cfg in nlp_config["pipeline"].items(): + factory = component_cfg.pop("factory") + component = nlp.create_pipe(factory, config=component_cfg) + nlp.add_pipe(component, name=name) + return nlp + + def load_model_from_init_py(init_file, **overrides): """Helper function to use in the `load()` method of a model package's __init__.py. @@ -225,13 +199,47 @@ def load_model_from_init_py(init_file, **overrides): """ model_path = Path(init_file).parent meta = get_model_meta(model_path) - data_dir = "%s_%s-%s" % (meta["lang"], meta["name"], meta["version"]) + data_dir = f"{meta['lang']}_{meta['name']}-{meta['version']}" data_path = model_path / data_dir if not model_path.exists(): - raise IOError(Errors.E052.format(path=path2str(data_path))) + raise IOError(Errors.E052.format(path=data_path)) return load_model_from_path(data_path, meta, **overrides) +def load_config(path, create_objects=False): + """Load a Thinc-formatted config file, optionally filling in objects where + the config references registry entries. See "Thinc config files" for details. + + path (unicode or Path): Path to the config file + create_objects (bool): Whether to automatically create objects when the config + references registry entries. Defaults to False. 
+ + RETURNS (dict): The objects from the config file. + """ + config = thinc.config.Config().from_disk(path) + if create_objects: + return registry.make_from_config(config, validate=True) + else: + return config + + +def load_config_from_str(string, create_objects=False): + """Load a Thinc-formatted config, optionally filling in objects where + the config references registry entries. See "Thinc config files" for details. + + string (unicode or Path): Text contents of the config file. + create_objects (bool): Whether to automatically create objects when the config + references registry entries. Defaults to False. + + RETURNS (dict): The objects from the config file. + """ + config = thinc.config.Config().from_str(string) + if create_objects: + return registry.make_from_config(config, validate=True) + else: + return config + + def get_model_meta(path): """Get model meta.json from a directory path and validate its contents. @@ -240,10 +248,10 @@ def get_model_meta(path): """ model_path = ensure_path(path) if not model_path.exists(): - raise IOError(Errors.E052.format(path=path2str(model_path))) + raise IOError(Errors.E052.format(path=model_path)) meta_path = model_path / "meta.json" if not meta_path.is_file(): - raise IOError(Errors.E053.format(path=meta_path)) + raise IOError(Errors.E053.format(path=meta_path, name="meta.json")) meta = srsly.read_json(meta_path) for setting in ["lang", "name", "version"]: if setting not in meta or not meta[setting]: @@ -251,6 +259,23 @@ def get_model_meta(path): return meta +def get_model_config(path): + """Get the model's config from a directory path. + + path (unicode or Path): Path to model directory. + RETURNS (Config): The model's config data. + """ + model_path = ensure_path(path) + if not model_path.exists(): + raise IOError(Errors.E052.format(path=model_path)) + config_path = model_path / "config.cfg" + # model directories are allowed not to have config files ? + if not config_path.is_file(): + return Config({}) + # raise IOError(Errors.E053.format(path=config_path, name="config.cfg")) + return Config().from_disk(config_path) + + def is_package(name): """Check if string maps to a package installed via pip. 
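A small sketch of the new config loading helpers defined above. The section layout mirrors the `nlp_config` structure read by `load_model_from_path`/`load_model_from_config`, but the exact schema is still in flux, so treat the field names as illustrative.

```python
from spacy import util

CONFIG = """
[nlp]
lang = "en"

[nlp.pipeline]

[nlp.pipeline.tagger]
factory = "tagger"
"""

config = util.load_config_from_str(CONFIG, create_objects=False)
print(config["nlp"]["lang"])                   # 'en'
print(list(config["nlp"]["pipeline"].keys()))  # ['tagger']
```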
@@ -306,9 +331,10 @@ def get_component_name(component): def get_cuda_stream(require=False, non_blocking=True): + ops = get_current_ops() if CudaStream is None: return None - elif isinstance(Model.ops, NumpyOps): + elif isinstance(ops, NumpyOps): return None else: return CudaStream(non_blocking=non_blocking) @@ -323,6 +349,14 @@ def get_async(stream, numpy_array): return array +def eg2doc(example): + """Get a Doc object from an Example (or if it's a Doc, use it directly)""" + # Put the import here to avoid circular import problems + from .tokens.doc import Doc + + return example if isinstance(example, Doc) else example.doc + + def env_opt(name, default=None): if type(default) is float: type_convert = float @@ -421,7 +455,7 @@ def update_exc(base_exceptions, *addition_dicts): exc = dict(base_exceptions) for additions in addition_dicts: for orth, token_attrs in additions.items(): - if not all(isinstance(attr[ORTH], unicode_) for attr in token_attrs): + if not all(isinstance(attr[ORTH], str) for attr in token_attrs): raise ValueError(Errors.E055.format(key=orth, orths=token_attrs)) described_orth = "".join(attr[ORTH] for attr in token_attrs) if orth != described_orth: @@ -541,31 +575,40 @@ def decaying(start, stop, decay): curr -= decay -def minibatch_by_words(items, size, tuples=True, count_words=len): - """Create minibatches of a given number of words.""" +def minibatch_by_words(examples, size, tuples=True, count_words=len, tolerance=0.2): + """Create minibatches of roughly a given number of words. If any examples + are longer than the specified batch length, they will appear in a batch by + themselves.""" if isinstance(size, int): size_ = itertools.repeat(size) + elif isinstance(size, List): + size_ = iter(size) else: size_ = size - items = iter(items) + examples = iter(examples) + oversize = [] while True: batch_size = next(size_) + tol_size = batch_size * 0.2 batch = [] - while batch_size >= 0: + if oversize: + example = oversize.pop(0) + n_words = count_words(example.doc) + batch.append(example) + batch_size -= n_words + while batch_size >= 1: try: - if tuples: - doc, gold = next(items) - else: - doc = next(items) + example = next(examples) except StopIteration: if batch: yield batch return - batch_size -= count_words(doc) - if tuples: - batch.append((doc, gold)) + n_words = count_words(example.doc) + if n_words < (batch_size + tol_size): + batch_size -= n_words + batch.append(example) else: - batch.append(doc) + oversize.append(example) if batch: yield batch @@ -622,7 +665,7 @@ def filter_spans(spans): def to_bytes(getters, exclude): - serialized = OrderedDict() + serialized = {} for key, getter in getters.items(): # Split to support file names like meta.json if key.split(".")[0] not in exclude: @@ -659,6 +702,20 @@ def from_disk(path, readers, exclude): return path +def import_file(name, loc): + """Import module from a file. Used to load models from a directory. + + name (unicode): Name of module to load. + loc (unicode / Path): Path to the file. + RETURNS: The loaded module. + """ + loc = str(loc) + spec = importlib.util.spec_from_file_location(name, str(loc)) + module = importlib.util.module_from_spec(spec) + spec.loader.exec_module(module) + return module + + def minify_html(html): """Perform a template-specific, rudimentary HTML minification for displaCy. 
Disclaimer: NOT a general-purpose solution, only removes indentation and @@ -685,17 +742,7 @@ def escape_html(text): def use_gpu(gpu_id): - try: - import cupy.cuda.device - except ImportError: - return None - from thinc.neural.ops import CupyOps - - device = cupy.cuda.device.Device(gpu_id) - device.use() - Model.ops = CupyOps() - Model.Ops = CupyOps - return device + return require_gpu(gpu_id) def fix_random_seed(seed=0): @@ -705,43 +752,6 @@ def fix_random_seed(seed=0): cupy.random.seed(seed) -def get_json_validator(schema): - # We're using a helper function here to make it easier to change the - # validator that's used (e.g. different draft implementation), without - # having to change it all across the codebase. - # TODO: replace with (stable) Draft6Validator, if available - if jsonschema is None: - raise ValueError(Errors.E136) - return jsonschema.Draft4Validator(schema) - - -def validate_schema(schema): - """Validate a given schema. This just checks if the schema itself is valid.""" - validator = get_json_validator(schema) - validator.check_schema(schema) - - -def validate_json(data, validator): - """Validate data against a given JSON schema (see https://json-schema.org). - - data: JSON-serializable data to validate. - validator (jsonschema.DraftXValidator): The validator. - RETURNS (list): A list of error messages, if available. - """ - errors = [] - for err in sorted(validator.iter_errors(data), key=lambda e: e.path): - if err.path: - err_path = "[{}]".format(" -> ".join([str(p) for p in err.path])) - else: - err_path = "" - msg = err.message + " " + err_path - if err.context: # Error has suberrors, e.g. if schema uses anyOf - suberrs = [" - {}".format(suberr.message) for suberr in err.context] - msg += ":\n{}".format("".join(suberrs)) - errors.append(msg) - return errors - - def get_serialization_exclude(serializers, exclude, kwargs): """Helper function to validate serialization args and manage transition from keyword arguments (pre v2.1) to exclude argument. 
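The `minibatch_by_words` rewrite above targets a word budget per batch, allows a small tolerance, and emits examples that are longer than the budget on their own. A simplified pure-Python sketch of that behaviour (not the exact implementation, and it applies the `tolerance` argument rather than a hard-coded 0.2):

```python
def batch_by_words(texts, size, tolerance=0.2, count_words=len):
    """Yield batches of roughly `size` words; oversized texts go out alone."""
    budget, batch = size, []
    for text in texts:
        n_words = count_words(text)
        if n_words > size * (1 + tolerance):
            if batch:
                yield batch
                batch, budget = [], size
            yield [text]                              # too big for any batch
        elif n_words <= budget + size * tolerance:
            batch.append(text)
            budget -= n_words
        else:
            yield batch
            batch, budget = [text], size - n_words
    if batch:
        yield batch

texts = ["one two three", "four five", "six " * 40, "seven eight nine ten"]
for batch in batch_by_words(texts, size=6, count_words=lambda t: len(t.split())):
    print([len(t.split()) for t in batch])   # [3, 2] then [40] then [4]
```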
@@ -819,3 +829,39 @@ class DummyTokenizer(object): def from_disk(self, _path, **kwargs): return self + + +def link_vectors_to_models(vocab): + vectors = vocab.vectors + if vectors.name is None: + vectors.name = VECTORS_KEY + if vectors.data.size != 0: + warnings.warn(Warnings.W020.format(shape=vectors.data.shape)) + for word in vocab: + if word.orth in vectors.key2row: + word.rank = vectors.key2row[word.orth] + else: + word.rank = 0 + + +VECTORS_KEY = "spacy_pretrained_vectors" + + +def create_default_optimizer(): + learn_rate = env_opt("learn_rate", 0.001) + beta1 = env_opt("optimizer_B1", 0.9) + beta2 = env_opt("optimizer_B2", 0.999) + eps = env_opt("optimizer_eps", 1e-8) + L2 = env_opt("L2_penalty", 1e-6) + grad_clip = env_opt("grad_norm_clip", 10.0) + L2_is_weight_decay = env_opt("L2_is_weight_decay", False) + optimizer = Adam( + learn_rate, + L2=L2, + beta1=beta1, + beta2=beta2, + eps=eps, + grad_clip=grad_clip, + L2_is_weight_decay=L2_is_weight_decay, + ) + return optimizer diff --git a/spacy/vectors.pyx b/spacy/vectors.pyx index 3da3b01d7..471c6463f 100644 --- a/spacy/vectors.pyx +++ b/spacy/vectors.pyx @@ -1,22 +1,15 @@ -# coding: utf8 -from __future__ import unicode_literals - cimport numpy as np from cython.operator cimport dereference as deref from libcpp.set cimport set as cppset import functools import numpy -from collections import OrderedDict import srsly -import warnings -from thinc.neural.util import get_array_module -from thinc.neural._classes.model import Model +from thinc.api import get_array_module, get_current_ops from .strings cimport StringStore from .strings import get_string_id -from .compat import basestring_, path2str from .errors import Errors from . import util @@ -75,7 +68,7 @@ cdef class Vectors: shape = (0,0) data = numpy.zeros(shape, dtype="f") self.data = data - self.key2row = OrderedDict() + self.key2row = {} if self.data is not None: self._unset = cppset[int]({i for i in range(self.data.shape[0])}) else: @@ -357,7 +350,7 @@ cdef class Vectors: sorted_index = xp.arange(scores.shape[0])[:,None][i:i+batch_size],xp.argsort(scores[i:i+batch_size], axis=1)[:,::-1] scores[i:i+batch_size] = scores[sorted_index] best_rows[i:i+batch_size] = best_rows[sorted_index] - + for i, j in numpy.ndindex(best_rows.shape): best_rows[i, j] = filled[best_rows[i, j]] # Round values really close to 1 or -1 @@ -366,7 +359,7 @@ cdef class Vectors: scores = xp.clip(scores, a_min=-1, a_max=1, out=scores) row2key = {row: key for key, row in self.key2row.items()} keys = xp.asarray( - [[row2key[row] for row in best_rows[i] if row in row2key] + [[row2key[row] for row in best_rows[i] if row in row2key] for i in range(len(queries)) ], dtype="uint64") return (keys, best_rows, scores) @@ -383,10 +376,10 @@ cdef class Vectors: save_array = lambda arr, file_: xp.save(file_, arr, allow_pickle=False) else: save_array = lambda arr, file_: xp.save(file_, arr) - serializers = OrderedDict(( - ("vectors", lambda p: save_array(self.data, p.open("wb"))), - ("key2row", lambda p: srsly.write_msgpack(p, self.key2row)) - )) + serializers = { + "vectors": lambda p: save_array(self.data, p.open("wb")), + "key2row": lambda p: srsly.write_msgpack(p, self.key2row) + } return util.to_disk(path, serializers, []) def from_disk(self, path, **kwargs): @@ -412,15 +405,15 @@ cdef class Vectors: self.add(key, row=i) def load_vectors(path): - xp = Model.ops.xp + ops = get_current_ops() if path.exists(): - self.data = xp.load(str(path)) + self.data = ops.xp.load(str(path)) - serializers = OrderedDict(( - ("key2row", 
load_key2row), - ("keys", load_keys), - ("vectors", load_vectors), - )) + serializers = { + "key2row": load_key2row, + "keys": load_keys, + "vectors": load_vectors, + } util.from_disk(path, serializers, []) self._sync_unset() return self @@ -439,10 +432,10 @@ cdef class Vectors: else: return srsly.msgpack_dumps(self.data) - serializers = OrderedDict(( - ("key2row", lambda: srsly.msgpack_dumps(self.key2row)), - ("vectors", serialize_weights) - )) + serializers = { + "key2row": lambda: srsly.msgpack_dumps(self.key2row), + "vectors": serialize_weights + } return util.to_bytes(serializers, []) def from_bytes(self, data, **kwargs): @@ -460,10 +453,10 @@ cdef class Vectors: else: self.data = srsly.msgpack_loads(b) - deserializers = OrderedDict(( - ("key2row", lambda b: self.key2row.update(srsly.msgpack_loads(b))), - ("vectors", deserialize_weights) - )) + deserializers = { + "key2row": lambda b: self.key2row.update(srsly.msgpack_loads(b)), + "vectors": deserialize_weights + } util.from_bytes(data, deserializers, []) self._sync_unset() return self diff --git a/spacy/vocab.pxd b/spacy/vocab.pxd index 73754eb02..49f5bf415 100644 --- a/spacy/vocab.pxd +++ b/spacy/vocab.pxd @@ -1,5 +1,4 @@ from libcpp.vector cimport vector - from preshed.maps cimport PreshMap from cymem.cymem cimport Pool from murmurhash.mrmr cimport hash64 diff --git a/spacy/vocab.pyx b/spacy/vocab.pyx index 68f0ac0db..ab240df90 100644 --- a/spacy/vocab.pyx +++ b/spacy/vocab.pyx @@ -1,11 +1,8 @@ -# coding: utf8 # cython: profile=True -from __future__ import unicode_literals from libc.string cimport memcpy import srsly -from collections import OrderedDict -from thinc.neural.util import get_array_module +from thinc.api import get_array_module from .lexeme cimport EMPTY_LEXEME, OOV_RANK from .lexeme cimport Lexeme @@ -13,12 +10,12 @@ from .typedefs cimport attr_t from .tokens.token cimport Token from .attrs cimport LANG, ORTH, TAG, POS -from .compat import copy_reg, basestring_ +from .compat import copy_reg from .errors import Errors from .lemmatizer import Lemmatizer from .attrs import intify_attrs, NORM from .vectors import Vectors -from ._ml import link_vectors_to_models +from .util import link_vectors_to_models from .lookups import Lookups from . import util from .lang.norm_exceptions import BASE_NORMS @@ -340,14 +337,14 @@ cdef class Vocab: """Retrieve a vector for a word in the vocabulary. Words can be looked up by string or int ID. If no vectors data is loaded, ValueError is raised. - - If `minn` is defined, then the resulting vector uses Fasttext's + + If `minn` is defined, then the resulting vector uses Fasttext's subword features by average over ngrams of `orth`. orth (int / unicode): The hash value of a word, or its unicode string. - minn (int): Minimum n-gram length used for Fasttext's ngram computation. + minn (int): Minimum n-gram length used for Fasttext's ngram computation. Defaults to the length of `orth`. - maxn (int): Maximum n-gram length used for Fasttext's ngram computation. + maxn (int): Maximum n-gram length used for Fasttext's ngram computation. Defaults to the length of `orth`. RETURNS (numpy.ndarray): A word vector. Size and shape determined by the `vocab.vectors` instance. 
Usually, a @@ -355,7 +352,7 @@ cdef class Vocab: DOCS: https://spacy.io/api/vocab#get_vector """ - if isinstance(orth, basestring_): + if isinstance(orth, str): orth = self.strings.add(orth) word = self[orth].orth_ if orth in self.vectors.key2row: @@ -402,7 +399,7 @@ cdef class Vocab: DOCS: https://spacy.io/api/vocab#set_vector """ - if isinstance(orth, basestring_): + if isinstance(orth, str): orth = self.strings.add(orth) if self.vectors.is_full and orth not in self.vectors: new_rows = max(100, int(self.vectors.shape[0]*1.3)) @@ -424,7 +421,7 @@ cdef class Vocab: DOCS: https://spacy.io/api/vocab#has_vector """ - if isinstance(orth, basestring_): + if isinstance(orth, str): orth = self.strings.add(orth) return orth in self.vectors @@ -497,12 +494,13 @@ cdef class Vocab: else: return self.vectors.to_bytes() - getters = OrderedDict(( - ("strings", lambda: self.strings.to_bytes()), - ("vectors", deserialize_vectors), - ("lookups", lambda: self.lookups.to_bytes()), - ("lookups_extra", lambda: self.lookups_extra.to_bytes()) - )) + getters = { + "strings": lambda: self.strings.to_bytes(), + "lexemes": lambda: self.lexemes_to_bytes(), + "vectors": deserialize_vectors, + "lookups": lambda: self.lookups.to_bytes(), + "lookups_extra": lambda: self.lookups_extra.to_bytes() + } exclude = util.get_serialization_exclude(getters, exclude, kwargs) return util.to_bytes(getters, exclude) @@ -521,12 +519,13 @@ cdef class Vocab: else: return self.vectors.from_bytes(b) - setters = OrderedDict(( - ("strings", lambda b: self.strings.from_bytes(b)), - ("vectors", lambda b: serialize_vectors(b)), - ("lookups", lambda b: self.lookups.from_bytes(b)), - ("lookups_extra", lambda b: self.lookups_extra.from_bytes(b)) - )) + setters = { + "strings": lambda b: self.strings.from_bytes(b), + "lexemes": lambda b: self.lexemes_from_bytes(b), + "vectors": lambda b: serialize_vectors(b), + "lookups": lambda b: self.lookups.from_bytes(b), + "lookups_extra": lambda b: self.lookups_extra.from_bytes(b) + } exclude = util.get_serialization_exclude(setters, exclude, kwargs) util.from_bytes(bytes_data, setters, exclude) if "lexeme_norm" in self.lookups: diff --git a/website/docs/api/language.md b/website/docs/api/language.md index 97dfbf100..50689a7ef 100644 --- a/website/docs/api/language.md +++ b/website/docs/api/language.md @@ -314,45 +314,47 @@ component function. | `name` | unicode | Name of the component to remove. | | **RETURNS** | tuple | A `(name, component)` tuple of the removed component. | -## Language.disable_pipes {#disable_pipes tag="contextmanager, method" new="2"} +## Language.select_pipes {#select_pipes tag="contextmanager, method" new="3"} Disable one or more pipeline components. If used as a context manager, the pipeline will be restored to the initial state at the end of the block. Otherwise, a `DisabledPipes` object is returned, that has a `.restore()` method you can use to undo your changes. +You can specify either `disable` (as a list or string), or `enable`. In the +latter case, all components not in the `enable` list, will be disabled. 
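The equivalence between the two keywords described above can be illustrated with a short sketch. This snippet is editorial and not part of the patch; it assumes a standard small English pipeline (`tagger`, `parser`, `ner`) is installed and uses `nlp.pipe_names` only to show which components are active inside the block:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed to be installed
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner']

# Keeping only "ner" enabled is the same as disabling everything else:
with nlp.select_pipes(enable="ner"):
    print(nlp.pipe_names)  # ['ner'] - tagger and parser are temporarily removed

with nlp.select_pipes(disable=["tagger", "parser"]):
    print(nlp.pipe_names)  # ['ner'] - identical effect

print(nlp.pipe_names)  # ['tagger', 'parser', 'ner'] - restored after the blocks
```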
+ > #### Example > > ```python -> # New API as of v2.2.2 -> with nlp.disable_pipes(["tagger", "parser"]): +> # New API as of v3.0 +> with nlp.select_pipes(disable=["tagger", "parser"]): > nlp.begin_training() > -> with nlp.disable_pipes("tagger", "parser"): +> with nlp.select_pipes(enable="ner"): > nlp.begin_training() > -> disabled = nlp.disable_pipes("tagger", "parser") +> disabled = nlp.select_pipes(disable=["tagger", "parser"]) > nlp.begin_training() > disabled.restore() > ``` -| Name | Type | Description | -| ----------------------------------------- | --------------- | ------------------------------------------------------------------------------------ | -| `disabled` 2.2.2 | list | Names of pipeline components to disable. | -| `*disabled` | unicode | Names of pipeline components to disable. | -| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. | +| Name | Type | Description | +| ----------- | --------------- | ------------------------------------------------------------------------------------ | +| `disable` | list | Names of pipeline components to disable. | +| `disable` | unicode | Name of pipeline component to disable. | +| `enable` | list | Names of pipeline components that will not be disabled. | +| `enable` | unicode | Name of pipeline component that will not be disabled. | +| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. | - -As of spaCy v2.2.2, the `Language.disable_pipes` method can also take a list of -component names as its first argument (instead of a variable number of -arguments). This is especially useful if you're generating the component names -to disable programmatically. The new syntax will become the default in the -future. + + +As of spaCy v3.0, the `disable_pipes` method has been renamed to `select_pipes`: ```diff -- disabled = nlp.disable_pipes("tagger", "parser") -+ disabled = nlp.disable_pipes(["tagger", "parser"]) +- nlp.disable_pipes(["tagger", "parser"]) ++ nlp.select_pipes(disable=["tagger", "parser"]) ``` diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 217c51794..2360ad472 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -678,50 +678,3 @@ of one entity) or when merging spans with | ----------- | -------- | -------------------- | | `spans` | iterable | The spans to filter. | | **RETURNS** | list | The filtered spans. | - -## Compatibility functions {#compat source="spacy/compaty.py"} - -All Python code is written in an **intersection of Python 2 and Python 3**. This -is easy in Cython, but somewhat ugly in Python. Logic that deals with Python or -platform compatibility only lives in `spacy.compat`. To distinguish them from -the builtin functions, replacement functions are suffixed with an underscore, -e.g. `unicode_`. 
- -> #### Example -> -> ```python -> from spacy.compat import unicode_ -> -> compatible_unicode = unicode_("hello world") -> ``` - -| Name | Python 2 | Python 3 | -| -------------------- | ---------------------------------- | ----------- | -| `compat.bytes_` | `str` | `bytes` | -| `compat.unicode_` | `unicode` | `str` | -| `compat.basestring_` | `basestring` | `str` | -| `compat.input_` | `raw_input` | `input` | -| `compat.path2str` | `str(path)` with `.decode('utf8')` | `str(path)` | - -### compat.is_config {#compat.is_config tag="function"} - -Check if a specific configuration of Python version and operating system matches -the user's setup. Mostly used to display targeted error messages. - -> #### Example -> -> ```python -> from spacy.compat import is_config -> -> if is_config(python2=True, windows=True): -> print("You are using Python 2 on Windows.") -> ``` - -| Name | Type | Description | -| ----------- | ---- | ---------------------------------------------------------------- | -| `python2` | bool | spaCy is executed with Python 2.x. | -| `python3` | bool | spaCy is executed with Python 3.x. | -| `windows` | bool | spaCy is executed on Windows. | -| `linux` | bool | spaCy is executed on Linux. | -| `osx` | bool | spaCy is executed on OS X or macOS. | -| **RETURNS** | bool | Whether the specified configuration matches the user's platform. | diff --git a/website/docs/usage/index.md b/website/docs/usage/index.md index d0172104b..473ffded8 100644 --- a/website/docs/usage/index.md +++ b/website/docs/usage/index.md @@ -8,9 +8,9 @@ menu: - ['Changelog', 'changelog'] --- -spaCy is compatible with **64-bit CPython 2.7 / 3.5+** and runs on -**Unix/Linux**, **macOS/OS X** and **Windows**. The latest spaCy releases are -available over [pip](https://pypi.python.org/pypi/spacy) and +spaCy is compatible with **64-bit CPython 3.6+** and runs on **Unix/Linux**, +**macOS/OS X** and **Windows**. The latest spaCy releases are available over +[pip](https://pypi.python.org/pypi/spacy) and [conda](https://anaconda.org/conda-forge/spacy). > #### 📖 Looking for the old docs? @@ -20,6 +20,17 @@ available over [pip](https://pypi.python.org/pypi/spacy) and > possible, the new docs also include notes on features that have changed in > v2.0, and features that were introduced in the new version. + + +We can't yet ship pre-compiled binary wheels for spaCy that work on Python 3.8, +as we're still waiting for our CI providers and other tooling to support it. +This means that in order to run spaCy on Python 3.8, you'll need +[a compiler installed](#source) and compile the library and its Cython +dependencies locally. If this is causing problems for you, the easiest solution +is to **use Python 3.7** in the meantime. + + + ## Quickstart {hidden="true"} import QuickstartInstall from 'widgets/quickstart-install.js' @@ -195,14 +206,7 @@ Install a version of the [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) or [Visual Studio Express](https://www.visualstudio.com/vs/visual-studio-express/) -that matches the version that was used to compile your Python interpreter. For -official distributions these are: - -| Distribution | Version | -| ------------ | ------------------ | -| Python 2.7 | Visual Studio 2008 | -| Python 3.4 | Visual Studio 2010 | -| Python 3.5+ | Visual Studio 2015 | +that matches the version that was used to compile your Python interpreter. 
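With the `spacy.compat` helpers removed by this patch, the platform and version checks they used to provide can be written with the standard library alone. The following is a minimal sketch for orientation only, not spaCy API; the variable names are made up here:

```python
import platform
import sys

# Rough stand-ins for the removed is_config() helper: plain stdlib checks.
is_windows = sys.platform.startswith("win")
is_linux = sys.platform.startswith("linux")
is_osx = sys.platform == "darwin"
is_64bit = sys.maxsize > 2 ** 32

# spaCy now requires 64-bit Python 3.6+.
if sys.version_info < (3, 6) or not is_64bit:
    raise RuntimeError("64-bit Python 3.6+ is required, found " + platform.python_version())

if is_windows:
    # Building from source on Windows needs a C++ toolchain matching the interpreter.
    print("Install the Visual C++ Build Tools before compiling spaCy from source.")
```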
### Run tests {#run-tests} diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md index b7b840999..696e11106 100644 --- a/website/docs/usage/processing-pipelines.md +++ b/website/docs/usage/processing-pipelines.md @@ -252,9 +252,9 @@ for doc in nlp.pipe(texts, disable=["tagger", "parser"]): If you need to **execute more code** with components disabled – e.g. to reset the weights or update only some components during training – you can use the -[`nlp.disable_pipes`](/api/language#disable_pipes) contextmanager. At the end of +[`nlp.select_pipes`](/api/language#select_pipes) contextmanager. At the end of the `with` block, the disabled pipeline components will be restored -automatically. Alternatively, `disable_pipes` returns an object that lets you +automatically. Alternatively, `select_pipes` returns an object that lets you call its `restore()` method to restore the disabled components when needed. This can be useful if you want to prevent unnecessary code indentation of large blocks. @@ -262,16 +262,26 @@ blocks. ```python ### Disable for block # 1. Use as a contextmanager -with nlp.disable_pipes("tagger", "parser"): +with nlp.select_pipes(disable=["tagger", "parser"]): doc = nlp("I won't be tagged and parsed") doc = nlp("I will be tagged and parsed") # 2. Restore manually -disabled = nlp.disable_pipes("ner") +disabled = nlp.select_pipes(disable="ner") doc = nlp("I won't have named entities") disabled.restore() ``` +If you want to disable all pipes except for one or a few, you can use the `enable` +keyword. Just like the `disable` keyword, it takes a list of pipe names, or a string +defining just one pipe. +```python +# Enable only the parser +with nlp.select_pipes(enable="parser"): + doc = nlp("I will only be parsed") +``` + + Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method to remove pipeline components from an existing pipeline, the [`rename_pipe`](/api/language#rename_pipe) method to rename them, or the @@ -367,7 +377,7 @@ tokens and a conditional message based on the document length. import spacy def my_component(doc): - print("After tokenization, this doc has {} tokens.".format(len(doc))) + print(f"After tokenization, this doc has {len(doc)} tokens.") print("The part-of-speech tags are:", [token.pos_ for token in doc]) if len(doc) < 10: print("This is a pretty short document.") @@ -602,7 +612,7 @@ There are three main types of extensions, which can be defined using the [these examples](/usage/examples#custom-components-attr-methods). ```python - Doc.set_extension("hello", method=lambda doc, name: "Hi {}!".format(name)) + Doc.set_extension("hello", method=lambda doc, name: f"Hi {name}!") assert doc._.hello("Bob") == "Hi Bob!" ``` diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index 1db2405d1..5f47bd2e3 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -906,7 +906,7 @@ pipeline component, **make sure that the pipeline component runs** when you create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc` objects need to have part-of-speech tags set by the `tagger`. You can either call the `nlp` object on your pattern texts instead of `nlp.make_doc`, or use -[`nlp.disable_pipes`](/api/language#disable_pipes) to disable components +[`nlp.select_pipes`](/api/language#select_pipes) to disable components selectively. @@ -1121,8 +1121,7 @@ while adding the phrase patterns. 
entityruler = EntityRuler(nlp) patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)] -other_pipes = [p for p in nlp.pipe_names if p != "tagger"] -with nlp.disable_pipes(*other_pipes): +with nlp.select_pipes(enable="tagger"): entityruler.add_patterns(patterns) ``` diff --git a/website/docs/usage/saving-loading.md b/website/docs/usage/saving-loading.md index 76a9773f6..c94c79360 100644 --- a/website/docs/usage/saving-loading.md +++ b/website/docs/usage/saving-loading.md @@ -131,7 +131,7 @@ shared vocab it depends on. If you need to pickle multiple objects, try to pickle them **together** instead of separately. For instance, instead of pickling all pipeline components, pickle the entire pipeline once. And instead of pickling several `Doc` objects -separately, pickle a list of `Doc` objects. Since the all share a reference to +separately, pickle a list of `Doc` objects. Since they all share a reference to the _same_ `Vocab` object, it will only be included once. ```python diff --git a/website/docs/usage/spacy-101.md b/website/docs/usage/spacy-101.md index 5a3a95a53..39d732724 100644 --- a/website/docs/usage/spacy-101.md +++ b/website/docs/usage/spacy-101.md @@ -304,12 +304,6 @@ print(doc.vocab.strings["coffee"]) # 3197928453018144401 print(doc.vocab.strings[3197928453018144401]) # 'coffee' ``` -> #### What does 'L' at the end of a hash mean? -> -> If you return a hash value in the **Python 2 interpreter**, it'll show up as -> `3197928453018144401L`. The `L` just means "long integer" – it's **not** -> actually a part of the hash value. - Now that all strings are encoded, the entries in the vocabulary **don't need to include the word text** themselves. Instead, they can look it up in the `StringStore` via its hash value. Each entry in the vocabulary, also called @@ -653,8 +647,7 @@ import random nlp = spacy.load("en_core_web_sm") train_data = [("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]})] -other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"] -with nlp.disable_pipes(*other_pipes): +with nlp.select_pipes(enable="ner"): optimizer = nlp.begin_training() for i in range(10): random.shuffle(train_data) @@ -857,17 +850,16 @@ def put_spans_around_tokens(doc): and you can calculate what you need, e.g.
<br />, <strong> etc.)
     """
     output = []
-    html = '<span class="{classes}">{word}</span>{space}'
     for token in doc:
         if token.is_space:
             output.append(token.text)
         else:
-            classes = "pos-{} dep-{}".format(token.pos_, token.dep_)
-            output.append(html.format(classes=classes, word=token.text, space=token.whitespace_))
+            classes = f"pos-{token.pos_} dep-{token.dep_}"
+            output.append(f'<span class="{classes}">{token.text}</span>{token.whitespace_}')
     string = "".join(output)
     string = string.replace("\\n", "")
     string = string.replace("\\t", "    ")
-    return "<pre>{}</pre>".format(string)
+    return f"<pre>{string}</pre>
" nlp = spacy.load("en_core_web_sm") diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 0be14df69..55d4accba 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -362,7 +362,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_ner.py you're using a blank model, don't forget to add the entity recognizer to the pipeline. If you're using an existing model, make sure to disable all other pipeline components during training using - [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be + [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be training the entity recognizer. 2. **Shuffle and loop over** the examples. For each example, **update the model** by calling [`nlp.update`](/api/language#update), which steps through @@ -403,7 +403,7 @@ referred to as the "catastrophic forgetting" problem. you're using a blank model, don't forget to add the entity recognizer to the pipeline. If you're using an existing model, make sure to disable all other pipeline components during training using - [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be + [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be training the entity recognizer. 2. **Add the new entity label** to the entity recognizer using the [`add_label`](/api/entityrecognizer#add_label) method. You can access the @@ -436,7 +436,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_parser.py you're using a blank model, don't forget to add the parser to the pipeline. If you're using an existing model, make sure to disable all other pipeline components during training using - [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be + [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be training the parser. 2. **Add the dependency labels** to the parser using the [`add_label`](/api/dependencyparser#add_label) method. If you're starting off @@ -470,7 +470,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_tagger.py you're using a blank model, don't forget to add the tagger to the pipeline. If you're using an existing model, make sure to disable all other pipeline components during training using - [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be + [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be training the tagger. 2. **Add the tag map** to the tagger using the [`add_label`](/api/tagger#add_label) method. The first argument is the new @@ -544,7 +544,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_intent_pa you're using a blank model, don't forget to add the custom parser to the pipeline. If you're using an existing model, make sure to **remove the old parser** from the pipeline, and disable all other pipeline components during - training using [`nlp.disable_pipes`](/api/language#disable_pipes). This way, + training using [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be training the parser. 3. **Add the dependency labels** to the parser using the [`add_label`](/api/dependencyparser#add_label) method. @@ -576,7 +576,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.p [`spacy.blank`](/api/top-level#spacy.blank) with the ID of your language. 
If you're using an existing model, make sure to disable all other pipeline components during training using - [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be + [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be training the text classifier. 2. **Add the text classifier** to the pipeline, and add the labels you want to train – for example, `POSITIVE`. @@ -654,7 +654,7 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_li linker component, add the KB to it, and then add the entity linker to the pipeline. If you're using a model with additional components, make sure to disable all other pipeline components during training using - [`nlp.disable_pipes`](/api/language#disable_pipes). This way, you'll only be + [`nlp.select_pipes`](/api/language#select_pipes). This way, you'll only be training the entity linker. 2. **Shuffle and loop over** the examples. For each example, **update the model** by calling [`nlp.update`](/api/language#update), which steps through diff --git a/website/meta/universe.json b/website/meta/universe.json index 857e26813..7bd954c2f 100644 --- a/website/meta/universe.json +++ b/website/meta/universe.json @@ -367,15 +367,16 @@ "from spacy_lookup import Entity", "", "nlp = spacy.load('en')", - "entity = Entity(keywords_list=['python', 'java platform'])", + "entity = Entity(keywords_list=['python', 'product manager', 'java platform'])", "nlp.add_pipe(entity, last=True)", "", "doc = nlp(\"I am a product manager for a java and python.\")", "assert doc._.has_entities == True", - "assert doc[2:5]._.has_entities == True", "assert doc[0]._.is_entity == False", + "assert doc[3]._.entity_desc == 'product manager'", "assert doc[3]._.is_entity == True", - "print(doc._.entities)" + "", + "print([(token.text, token._.canonical) for token in doc if token._.is_entity])" ], "author": "Marc Puig", "author_links": {