diff --git a/.github/contributors/GiorgioPorgio.md b/.github/contributors/GiorgioPorgio.md new file mode 100644 index 000000000..ffa1f693e --- /dev/null +++ b/.github/contributors/GiorgioPorgio.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. 
Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | George Ketsopoulos | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 23 October 2019 | +| GitHub username | GiorgioPorgio | +| Website (optional) | | diff --git a/.github/contributors/zhuorulin.md b/.github/contributors/zhuorulin.md new file mode 100644 index 000000000..8fef7577a --- /dev/null +++ b/.github/contributors/zhuorulin.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | ------------------------ | +| Name | Zhuoru Lin | +| Company name (if applicable) | Bombora Inc. | +| Title or role (if applicable) | Data Scientist | +| Date | 2017-11-13 | +| GitHub username | ZhuoruLin | +| Website (optional) | | diff --git a/Makefile b/Makefile index 0f5c31ca6..5d15bccec 100644 --- a/Makefile +++ b/Makefile @@ -9,7 +9,7 @@ dist/spacy.pex : dist/spacy-$(sha).pex dist/spacy-$(sha).pex : dist/$(wheel) env3.6/bin/python -m pip install pex==1.5.3 - env3.6/bin/pex pytest dist/$(wheel) -e spacy -o dist/spacy-$(sha).pex + env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py* python3.6 -m venv env3.6 diff --git a/README.md b/README.md index 473533422..99d66bb31 100644 --- a/README.md +++ b/README.md @@ -135,8 +135,7 @@ Thanks to our great community, we've finally re-added conda support. You can now install spaCy via `conda-forge`: ```bash -conda config --add channels conda-forge -conda install spacy +conda install -c conda-forge spacy ``` For the feedstock including the build recipe and configuration, check out @@ -214,16 +213,6 @@ doc = nlp("This is a sentence.") 📖 **For more info and examples, check out the [models documentation](https://spacy.io/docs/usage/models).** -### Support for older versions - -If you're using an older version (`v1.6.0` or below), you can still download and -install the old models from within spaCy using `python -m spacy.en.download all` -or `python -m spacy.de.download all`. The `.tar.gz` archives are also -[attached to the v1.6.0 release](https://github.com/explosion/spaCy/tree/v1.6.0). -To download and install the models manually, unpack the archive, drop the -contained directory into `spacy/data` and load the model via `spacy.load('en')` -or `spacy.load('de')`. - ## Compile from source The other way to install spaCy is to clone its diff --git a/bin/ud/ud_run_test.py b/bin/ud/ud_run_test.py index de01cf350..7cb270d84 100644 --- a/bin/ud/ud_run_test.py +++ b/bin/ud/ud_run_test.py @@ -84,7 +84,7 @@ def read_conllu(file_): def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None): if text_loc.parts[-1].endswith(".conllu"): docs = [] - with text_loc.open() as file_: + with text_loc.open(encoding="utf8") as file_: for conllu_doc in read_conllu(file_): for conllu_sent in conllu_doc: words = [line[1] for line in conllu_sent] diff --git a/bin/ud/ud_train.py b/bin/ud/ud_train.py index 5d4f20d6e..945bf57eb 100644 --- a/bin/ud/ud_train.py +++ b/bin/ud/ud_train.py @@ -203,7 +203,7 @@ def golds_to_gold_tuples(docs, golds): def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None): if text_loc.parts[-1].endswith(".conllu"): docs = [] - with text_loc.open() as file_: + with text_loc.open(encoding="utf8") as file_: for conllu_doc in read_conllu(file_): for conllu_sent in conllu_doc: words = [line[1] for line in conllu_sent] @@ -378,7 +378,7 @@ def _load_pretrained_tok2vec(nlp, loc): """Load pretrained weights for the 'token-to-vector' part of the component models, which is typically a CNN. See 'spacy pretrain'. Experimental. 
""" - with Path(loc).open("rb") as file_: + with Path(loc).open("rb", encoding="utf8") as file_: weights_data = file_.read() loaded = [] for name, component in nlp.pipeline: @@ -519,8 +519,8 @@ def main( for i in range(config.nr_epoch): docs, golds = read_data( nlp, - paths.train.conllu.open(), - paths.train.text.open(), + paths.train.conllu.open(encoding="utf8"), + paths.train.text.open(encoding="utf8"), max_doc_length=config.max_doc_length, limit=limit, oracle_segments=use_oracle_segments, @@ -560,7 +560,7 @@ def main( def _render_parses(i, to_render): to_render[0].user_data["title"] = "Batch %d" % i - with Path("/tmp/parses.html").open("w") as file_: + with Path("/tmp/parses.html").open("w", encoding="utf8") as file_: html = displacy.render(to_render[:5], style="dep", page=True) file_.write(html) diff --git a/bin/wiki_entity_linking/wikidata_train_entity_linker.py b/bin/wiki_entity_linking/wikidata_train_entity_linker.py index 20c5fe91b..8635ae547 100644 --- a/bin/wiki_entity_linking/wikidata_train_entity_linker.py +++ b/bin/wiki_entity_linking/wikidata_train_entity_linker.py @@ -77,6 +77,8 @@ def main( if labels_discard: labels_discard = [x.strip() for x in labels_discard.split(",")] logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard)) + else: + labels_discard = [] train_data = wikipedia_processor.read_training( nlp=nlp, diff --git a/examples/training/ner_multitask_objective.py b/examples/training/ner_multitask_objective.py index 5d44ed649..4bf7a008f 100644 --- a/examples/training/ner_multitask_objective.py +++ b/examples/training/ner_multitask_objective.py @@ -18,19 +18,21 @@ during training. We discard the auxiliary model before run-time. The specific example here is not necessarily a good idea --- but it shows how an arbitrary objective function for some word can be used. -Developed and tested for spaCy 2.0.6 +Developed and tested for spaCy 2.0.6. 
Updated for v2.2.2 """ import random import plac import spacy import os.path +from spacy.tokens import Doc from spacy.gold import read_json_file, GoldParse random.seed(0) PWD = os.path.dirname(__file__) -TRAIN_DATA = list(read_json_file(os.path.join(PWD, "training-data.json"))) +TRAIN_DATA = list(read_json_file( + os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json"))) def get_position_label(i, words, tags, heads, labels, ents): @@ -55,6 +57,7 @@ def main(n_iter=10): ner = nlp.create_pipe("ner") ner.add_multitask_objective(get_position_label) nlp.add_pipe(ner) + print(nlp.pipeline) print("Create data", len(TRAIN_DATA)) optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA) @@ -62,23 +65,24 @@ def main(n_iter=10): random.shuffle(TRAIN_DATA) losses = {} for text, annot_brackets in TRAIN_DATA: - annotations, _ = annot_brackets - doc = nlp.make_doc(text) - gold = GoldParse.from_annot_tuples(doc, annotations[0]) - nlp.update( - [doc], # batch of texts - [gold], # batch of annotations - drop=0.2, # dropout - make it harder to memorise data - sgd=optimizer, # callable to update weights - losses=losses, - ) + for annotations, _ in annot_brackets: + doc = Doc(nlp.vocab, words=annotations[1]) + gold = GoldParse.from_annot_tuples(doc, annotations) + nlp.update( + [doc], # batch of texts + [gold], # batch of annotations + drop=0.2, # dropout - make it harder to memorise data + sgd=optimizer, # callable to update weights + losses=losses, + ) print(losses.get("nn_labeller", 0.0), losses["ner"]) # test the trained model for text, _ in TRAIN_DATA: - doc = nlp(text) - print("Entities", [(ent.text, ent.label_) for ent in doc.ents]) - print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc]) + if text is not None: + doc = nlp(text) + print("Entities", [(ent.text, ent.label_) for ent in doc.ents]) + print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc]) if __name__ == "__main__": diff --git a/requirements.txt b/requirements.txt index 68e29f6ab..ad7059f3a 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,10 +1,10 @@ # Our libraries cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 -thinc>=7.2.0,<7.3.0 +thinc>=7.3.0,<7.4.0 blis>=0.4.0,<0.5.0 murmurhash>=0.28.0,<1.1.0 -wasabi>=0.2.0,<1.1.0 +wasabi>=0.3.0,<1.1.0 srsly>=0.1.0,<1.1.0 # Third party dependencies numpy>=1.15.0 diff --git a/setup.cfg b/setup.cfg index 796f4176a..51e722354 100644 --- a/setup.cfg +++ b/setup.cfg @@ -38,18 +38,18 @@ setup_requires = cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 murmurhash>=0.28.0,<1.1.0 - thinc>=7.2.0,<7.3.0 + thinc>=7.3.0,<7.4.0 install_requires = setuptools numpy>=1.15.0 murmurhash>=0.28.0,<1.1.0 cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 - thinc>=7.2.0,<7.3.0 + thinc>=7.3.0,<7.4.0 blis>=0.4.0,<0.5.0 plac>=0.9.6,<1.2.0 requests>=2.13.0,<3.0.0 - wasabi>=0.2.0,<1.1.0 + wasabi>=0.3.0,<1.1.0 srsly>=0.1.0,<1.1.0 pathlib==1.0.1; python_version < "3.4" importlib_metadata>=0.20; python_version < "3.8" diff --git a/spacy/__init__.py b/spacy/__init__.py index 8930b1d4e..57701179f 100644 --- a/spacy/__init__.py +++ b/spacy/__init__.py @@ -9,12 +9,14 @@ warnings.filterwarnings("ignore", message="numpy.ufunc size changed") # These are imported as part of the API from thinc.neural.util import prefer_gpu, require_gpu +from . import pipeline from .cli.info import info as cli_info from .glossary import explain from .about import __version__ from .errors import Errors, Warnings, deprecation_warning from . 
import util from .util import register_architecture, get_architecture +from .language import component if sys.maxunicode == 65535: diff --git a/spacy/_ml.py b/spacy/_ml.py index 86dac6c7a..8695a88cc 100644 --- a/spacy/_ml.py +++ b/spacy/_ml.py @@ -3,16 +3,14 @@ from __future__ import unicode_literals import numpy from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu -from thinc.i2v import HashEmbed, StaticVectors from thinc.t2t import ExtractWindow, ParametricAttention from thinc.t2v import Pooling, sum_pool, mean_pool -from thinc.misc import Residual +from thinc.i2v import HashEmbed +from thinc.misc import Residual, FeatureExtracter from thinc.misc import LayerNorm as LN -from thinc.misc import FeatureExtracter from thinc.api import add, layerize, chain, clone, concatenate, with_flatten from thinc.api import with_getitem, flatten_add_lengths from thinc.api import uniqued, wrap, noop -from thinc.api import with_square_sequences from thinc.linear.linear import LinearModel from thinc.neural.ops import NumpyOps, CupyOps from thinc.neural.util import get_array_module, copy_array @@ -26,14 +24,13 @@ import thinc.extra.load_nlp from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE from .errors import Errors, user_warning, Warnings from . import util +from . import ml as new_ml +from .ml import _legacy_tok2vec -try: - import torch.nn - from thinc.extra.wrappers import PyTorchWrapperRNN -except ImportError: - torch = None VECTORS_KEY = "spacy_pretrained_vectors" +# Backwards compatibility with <2.2.2 +USE_MODEL_REGISTRY_TOK2VEC = False def cosine(vec1, vec2): @@ -310,6 +307,10 @@ def link_vectors_to_models(vocab): def PyTorchBiLSTM(nO, nI, depth, dropout=0.2): + import torch.nn + from thinc.api import with_square_sequences + from thinc.extra.wrappers import PyTorchWrapperRNN + if depth == 0: return layerize(noop()) model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout) @@ -317,81 +318,89 @@ def PyTorchBiLSTM(nO, nI, depth, dropout=0.2): def Tok2Vec(width, embed_size, **kwargs): + if not USE_MODEL_REGISTRY_TOK2VEC: + # Preserve prior tok2vec for backwards compat, in v2.2.2 + return _legacy_tok2vec.Tok2Vec(width, embed_size, **kwargs) pretrained_vectors = kwargs.get("pretrained_vectors", None) cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3) subword_features = kwargs.get("subword_features", True) char_embed = kwargs.get("char_embed", False) - if char_embed: - subword_features = False conv_depth = kwargs.get("conv_depth", 4) bilstm_depth = kwargs.get("bilstm_depth", 0) - cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH] - with Model.define_operators( - {">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply} - ): - norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm") - if subword_features: - prefix = HashEmbed( - width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix" - ) - suffix = HashEmbed( - width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix" - ) - shape = HashEmbed( - width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape" - ) - else: - prefix, suffix, shape = (None, None, None) - if pretrained_vectors is not None: - glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID)) + conv_window = kwargs.get("conv_window", 1) - if subword_features: - embed = uniqued( - (glove | norm | prefix | suffix | shape) - >> LN(Maxout(width, width * 5, pieces=3)), - column=cols.index(ORTH), - ) - else: - embed = uniqued( - (glove | norm) >> LN(Maxout(width, width * 2, 
pieces=3)), - column=cols.index(ORTH), - ) - elif subword_features: - embed = uniqued( - (norm | prefix | suffix | shape) - >> LN(Maxout(width, width * 4, pieces=3)), - column=cols.index(ORTH), - ) - elif char_embed: - embed = concatenate_lists( - CharacterEmbed(nM=64, nC=8), - FeatureExtracter(cols) >> with_flatten(norm), - ) - reduce_dimensions = LN( - Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces) - ) - else: - embed = norm + cols = ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"] - convolution = Residual( - ExtractWindow(nW=1) - >> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces)) - ) - if char_embed: - tok2vec = embed >> with_flatten( - reduce_dimensions >> convolution ** conv_depth, pad=conv_depth - ) - else: - tok2vec = FeatureExtracter(cols) >> with_flatten( - embed >> convolution ** conv_depth, pad=conv_depth - ) - - if bilstm_depth >= 1: - tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth) - # Work around thinc API limitations :(. TODO: Revise in Thinc 7 - tok2vec.nO = width - tok2vec.embed = embed - return tok2vec + doc2feats_cfg = {"arch": "spacy.Doc2Feats.v1", "config": {"columns": cols}} + if char_embed: + embed_cfg = { + "arch": "spacy.CharacterEmbed.v1", + "config": { + "width": 64, + "chars": 6, + "@mix": { + "arch": "spacy.LayerNormalizedMaxout.v1", + "config": {"width": width, "pieces": 3}, + }, + "@embed_features": None, + }, + } + else: + embed_cfg = { + "arch": "spacy.MultiHashEmbed.v1", + "config": { + "width": width, + "rows": embed_size, + "columns": cols, + "use_subwords": subword_features, + "@pretrained_vectors": None, + "@mix": { + "arch": "spacy.LayerNormalizedMaxout.v1", + "config": {"width": width, "pieces": 3}, + }, + }, + } + if pretrained_vectors: + embed_cfg["config"]["@pretrained_vectors"] = { + "arch": "spacy.PretrainedVectors.v1", + "config": { + "vectors_name": pretrained_vectors, + "width": width, + "column": cols.index("ID"), + }, + } + if cnn_maxout_pieces >= 2: + cnn_cfg = { + "arch": "spacy.MaxoutWindowEncoder.v1", + "config": { + "width": width, + "window_size": conv_window, + "pieces": cnn_maxout_pieces, + "depth": conv_depth, + }, + } + else: + cnn_cfg = { + "arch": "spacy.MishWindowEncoder.v1", + "config": {"width": width, "window_size": conv_window, "depth": conv_depth}, + } + bilstm_cfg = { + "arch": "spacy.TorchBiLSTMEncoder.v1", + "config": {"width": width, "depth": bilstm_depth}, + } + if conv_depth == 0 and bilstm_depth == 0: + encode_cfg = {} + elif conv_depth >= 1 and bilstm_depth >= 1: + encode_cfg = { + "arch": "thinc.FeedForward.v1", + "config": {"children": [cnn_cfg, bilstm_cfg]}, + } + elif conv_depth >= 1: + encode_cfg = cnn_cfg + else: + encode_cfg = bilstm_cfg + config = {"@doc2feats": doc2feats_cfg, "@embed": embed_cfg, "@encode": encode_cfg} + return new_ml.Tok2Vec(config) def reapply(layer, n_times): diff --git a/spacy/about.py b/spacy/about.py index 086e53242..c6db9700f 100644 --- a/spacy/about.py +++ b/spacy/about.py @@ -1,6 +1,6 @@ # fmt: off __title__ = "spacy" -__version__ = "2.2.2.dev1" +__version__ = "2.2.2" __release__ = True __download_url__ = "https://github.com/explosion/spacy-models/releases/download" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" diff --git a/spacy/analysis.py b/spacy/analysis.py new file mode 100644 index 000000000..761be3de9 --- /dev/null +++ b/spacy/analysis.py @@ -0,0 +1,179 @@ +# coding: utf8 +from __future__ import unicode_literals + +from collections import OrderedDict +from wasabi import Printer + 
+from .tokens import Doc, Token, Span +from .errors import Errors, Warnings, user_warning + + +def analyze_pipes(pipeline, name, pipe, index, warn=True): + """Analyze a pipeline component with respect to its position in the current + pipeline and the other components. Will check whether requirements are + fulfilled (e.g. if previous components assign the attributes). + + pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline. + name (unicode): The name of the pipeline component to analyze. + pipe (callable): The pipeline component function to analyze. + index (int): The index of the component in the pipeline. + warn (bool): Show user warning if problem is found. + RETURNS (list): The problems found for the given pipeline component. + """ + assert pipeline[index][0] == name + prev_pipes = pipeline[:index] + pipe_requires = getattr(pipe, "requires", []) + requires = OrderedDict([(annot, False) for annot in pipe_requires]) + if requires: + for prev_name, prev_pipe in prev_pipes: + prev_assigns = getattr(prev_pipe, "assigns", []) + for annot in prev_assigns: + requires[annot] = True + problems = [] + for annot, fulfilled in requires.items(): + if not fulfilled: + problems.append(annot) + if warn: + user_warning(Warnings.W025.format(name=name, attr=annot)) + return problems + + +def analyze_all_pipes(pipeline, warn=True): + """Analyze all pipes in the pipeline in order. + + pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline. + warn (bool): Show user warning if problem is found. + RETURNS (dict): The problems found, keyed by component name. + """ + problems = {} + for i, (name, pipe) in enumerate(pipeline): + problems[name] = analyze_pipes(pipeline, name, pipe, i, warn=warn) + return problems + + +def dot_to_dict(values): + """Convert dot notation to a dict. For example: ["token.pos", "token._.xyz"] + become {"token": {"pos": True, "_": {"xyz": True }}}. + + values (iterable): The values to convert. + RETURNS (dict): The converted values. + """ + result = {} + for value in values: + path = result + parts = value.lower().split(".") + for i, item in enumerate(parts): + is_last = i == len(parts) - 1 + path = path.setdefault(item, True if is_last else {}) + return result + + +def validate_attrs(values): + """Validate component attributes provided to "assigns", "requires" etc. + Raises error for invalid attributes and formatting. Doesn't check if + custom extension attributes are registered, since this is something the + user might want to do themselves later in the component. + + values (iterable): The string attributes to check, e.g. `["token.pos"]`. + RETURNS (iterable): The checked attributes. 
+ """ + data = dot_to_dict(values) + objs = {"doc": Doc, "token": Token, "span": Span} + for obj_key, attrs in data.items(): + if obj_key == "span": + # Support Span only for custom extension attributes + span_attrs = [attr for attr in values if attr.startswith("span.")] + span_attrs = [attr for attr in span_attrs if not attr.startswith("span._.")] + if span_attrs: + raise ValueError(Errors.E180.format(attrs=", ".join(span_attrs))) + if obj_key not in objs: # first element is not doc/token/span + invalid_attrs = ", ".join(a for a in values if a.startswith(obj_key)) + raise ValueError(Errors.E181.format(obj=obj_key, attrs=invalid_attrs)) + if not isinstance(attrs, dict): # attr is something like "doc" + raise ValueError(Errors.E182.format(attr=obj_key)) + for attr, value in attrs.items(): + if attr == "_": + if value is True: # attr is something like "doc._" + raise ValueError(Errors.E182.format(attr="{}._".format(obj_key))) + for ext_attr, ext_value in value.items(): + # We don't check whether the attribute actually exists + if ext_value is not True: # attr is something like doc._.x.y + good = "{}._.{}".format(obj_key, ext_attr) + bad = "{}.{}".format(good, ".".join(ext_value)) + raise ValueError(Errors.E183.format(attr=bad, solution=good)) + continue # we can't validate those further + if attr.endswith("_"): # attr is something like "token.pos_" + raise ValueError(Errors.E184.format(attr=attr, solution=attr[:-1])) + if value is not True: # attr is something like doc.x.y + good = "{}.{}".format(obj_key, attr) + bad = "{}.{}".format(good, ".".join(value)) + raise ValueError(Errors.E183.format(attr=bad, solution=good)) + obj = objs[obj_key] + if not hasattr(obj, attr): + raise ValueError(Errors.E185.format(obj=obj_key, attr=attr)) + return values + + +def _get_feature_for_attr(pipeline, attr, feature): + assert feature in ["assigns", "requires"] + result = [] + for pipe_name, pipe in pipeline: + pipe_assigns = getattr(pipe, feature, []) + if attr in pipe_assigns: + result.append((pipe_name, pipe)) + return result + + +def get_assigns_for_attr(pipeline, attr): + """Get all pipeline components that assign an attr, e.g. "doc.tensor". + + pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline. + attr (unicode): The attribute to check. + RETURNS (list): (name, pipeline) tuples of components that assign the attr. + """ + return _get_feature_for_attr(pipeline, attr, "assigns") + + +def get_requires_for_attr(pipeline, attr): + """Get all pipeline components that require an attr, e.g. "doc.tensor". + + pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline. + attr (unicode): The attribute to check. + RETURNS (list): (name, pipeline) tuples of components that require the attr. + """ + return _get_feature_for_attr(pipeline, attr, "requires") + + +def print_summary(nlp, pretty=True, no_print=False): + """Print a formatted summary for the current nlp object's pipeline. Shows + a table with the pipeline components and why they assign and require, as + well as any problems if available. + + nlp (Language): The nlp object. + pretty (bool): Pretty-print the results (color etc). + no_print (bool): Don't print anything, just return the data. + RETURNS (dict): A dict with "overview" and "problems". 
+ """ + msg = Printer(pretty=pretty, no_print=no_print) + overview = [] + problems = {} + for i, (name, pipe) in enumerate(nlp.pipeline): + requires = getattr(pipe, "requires", []) + assigns = getattr(pipe, "assigns", []) + retok = getattr(pipe, "retokenizes", False) + overview.append((i, name, requires, assigns, retok)) + problems[name] = analyze_pipes(nlp.pipeline, name, pipe, i, warn=False) + msg.divider("Pipeline Overview") + header = ("#", "Component", "Requires", "Assigns", "Retokenizes") + msg.table(overview, header=header, divider=True, multiline=True) + n_problems = sum(len(p) for p in problems.values()) + if any(p for p in problems.values()): + msg.divider("Problems ({})".format(n_problems)) + for name, problem in problems.items(): + if problem: + problem = ", ".join(problem) + msg.warn("'{}' requirements not met: {}".format(name, problem)) + else: + msg.good("No problems found.") + if no_print: + return {"overview": overview, "problems": problems} diff --git a/spacy/cli/convert.py b/spacy/cli/convert.py index 2d8661339..fa867fa04 100644 --- a/spacy/cli/convert.py +++ b/spacy/cli/convert.py @@ -57,7 +57,7 @@ def convert( is written to stdout, so you can pipe them forward to a JSON file: $ spacy convert some_file.conllu > some_file.json """ - no_print = (output_dir == "-") + no_print = output_dir == "-" msg = Printer(no_print=no_print) input_path = Path(input_file) if file_type not in FILE_TYPES: diff --git a/spacy/cli/converters/conll_ner2json.py b/spacy/cli/converters/conll_ner2json.py index 3d0e4d308..46489ad7c 100644 --- a/spacy/cli/converters/conll_ner2json.py +++ b/spacy/cli/converters/conll_ner2json.py @@ -9,7 +9,9 @@ from ...tokens.doc import Doc from ...util import load_model -def conll_ner2json(input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs): +def conll_ner2json( + input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs +): """ Convert files in the CoNLL-2003 NER format and similar whitespace-separated columns into JSON format for use with train cli. diff --git a/spacy/cli/pretrain.py b/spacy/cli/pretrain.py index 9b63b31f0..f7236f7de 100644 --- a/spacy/cli/pretrain.py +++ b/spacy/cli/pretrain.py @@ -35,6 +35,10 @@ from .train import _load_pretrained_tok2vec output_dir=("Directory to write models to on each epoch", "positional", None, str), width=("Width of CNN layers", "option", "cw", int), depth=("Depth of CNN layers", "option", "cd", int), + cnn_window=("Window size for CNN layers", "option", "cW", int), + cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int), + use_chars=("Whether to use character-based embedding", "flag", "chr", bool), + sa_depth=("Depth of self-attention layers", "option", "sa", int), bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int), embed_rows=("Number of embedding rows", "option", "er", int), loss_func=( @@ -81,7 +85,11 @@ def pretrain( output_dir, width=96, depth=4, - bilstm_depth=2, + bilstm_depth=0, + cnn_pieces=3, + sa_depth=0, + use_chars=False, + cnn_window=1, embed_rows=2000, loss_func="cosine", use_vectors=False, @@ -158,8 +166,8 @@ def pretrain( conv_depth=depth, pretrained_vectors=pretrained_vectors, bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental. - cnn_maxout_pieces=3, # You can try setting this higher - subword_features=True, # Set to False for Chinese etc + subword_features=not use_chars, # Set to False for Chinese etc + cnn_maxout_pieces=cnn_pieces, # If set to 1, use Mish activation. 
), ) # Load in pretrained weights diff --git a/spacy/cli/train.py b/spacy/cli/train.py index 6538994c0..13fcae37f 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -156,8 +156,7 @@ def train( "`lang` argument ('{}') ".format(nlp.lang, lang), exits=1, ) - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipeline] - nlp.disable_pipes(*other_pipes) + nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline]) for pipe in pipeline: if pipe not in nlp.pipe_names: if pipe == "parser": @@ -263,7 +262,11 @@ def train( exits=1, ) train_docs = corpus.train_docs( - nlp, noise_level=noise_level, gold_preproc=gold_preproc, max_length=0 + nlp, + noise_level=noise_level, + gold_preproc=gold_preproc, + max_length=0, + ignore_misaligned=True, ) train_labels = set() if textcat_multilabel: @@ -344,6 +347,7 @@ def train( orth_variant_level=orth_variant_level, gold_preproc=gold_preproc, max_length=0, + ignore_misaligned=True, ) if raw_text: random.shuffle(raw_text) @@ -382,7 +386,11 @@ def train( if hasattr(component, "cfg"): component.cfg["beam_width"] = beam_width dev_docs = list( - corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc) + corpus.dev_docs( + nlp_loaded, + gold_preproc=gold_preproc, + ignore_misaligned=True, + ) ) nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs) start_time = timer() @@ -399,7 +407,11 @@ def train( if hasattr(component, "cfg"): component.cfg["beam_width"] = beam_width dev_docs = list( - corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc) + corpus.dev_docs( + nlp_loaded, + gold_preproc=gold_preproc, + ignore_misaligned=True, + ) ) start_time = timer() scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose) diff --git a/spacy/compat.py b/spacy/compat.py index 3a19e9423..5bff28815 100644 --- a/spacy/compat.py +++ b/spacy/compat.py @@ -12,6 +12,7 @@ import os import sys import itertools import ast +import types from thinc.neural.util import copy_array @@ -67,6 +68,7 @@ if is_python2: basestring_ = basestring # noqa: F821 input_ = raw_input # noqa: F821 path2str = lambda path: str(path).decode("utf8") + class_types = (type, types.ClassType) elif is_python3: bytes_ = bytes @@ -74,6 +76,7 @@ elif is_python3: basestring_ = str input_ = input path2str = lambda path: str(path) + class_types = (type, types.ClassType) if is_python_pre_3_5 else type def b_to_str(b_str): diff --git a/spacy/displacy/templates.py b/spacy/displacy/templates.py index 4a7c596d8..ade75d1d6 100644 --- a/spacy/displacy/templates.py +++ b/spacy/displacy/templates.py @@ -44,14 +44,14 @@ TPL_ENTS = """ TPL_ENT = """ - + {text} {label} """ TPL_ENT_RTL = """ - + {text} {label} diff --git a/spacy/errors.py b/spacy/errors.py index 23203d98a..c708f0a5b 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -99,6 +99,8 @@ class Warnings(object): "'n_process' will be set to 1.") W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in " "the Knowledge Base.") + W025 = ("'{name}' requires '{attr}' to be assigned, but none of the " + "previous components in the pipeline declare that they assign it.") @add_codes @@ -504,6 +506,29 @@ class Errors(object): E175 = ("Can't remove rule for unknown match pattern ID: {key}") E176 = ("Alias '{alias}' is not defined in the Knowledge Base.") E177 = ("Ill-formed IOB input detected: {tag}") + E178 = ("Invalid pattern. Expected list of dicts but got: {pat}. Maybe you " + "accidentally passed a single pattern to Matcher.add instead of a " + "list of patterns? If you only want to add one pattern, make sure " + "to wrap it in a list. 
For example: matcher.add('{key}', [pattern])") + E179 = ("Invalid pattern. Expected a list of Doc objects but got a single " + "Doc. If you only want to add one pattern, make sure to wrap it " + "in a list. For example: matcher.add('{key}', [doc])") + E180 = ("Span attributes can't be declared as required or assigned by " + "components, since spans are only views of the Doc. Use Doc and " + "Token attributes (or custom extension attributes) only and remove " + "the following: {attrs}") + E181 = ("Received invalid attributes for unkown object {obj}: {attrs}. " + "Only Doc and Token attributes are supported.") + E182 = ("Received invalid attribute declaration: {attr}\nDid you forget " + "to define the attribute? For example: {attr}.???") + E183 = ("Received invalid attribute declaration: {attr}\nOnly top-level " + "attributes are supported, for example: {solution}") + E184 = ("Only attributes without underscores are supported in component " + "attribute declarations (because underscore and non-underscore " + "attributes are connected anyways): {attr} -> {solution}") + E185 = ("Received invalid attribute in component attribute declaration: " + "{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.") + E186 = ("'{tok_a}' and '{tok_b}' are different texts.") @add_codes @@ -536,6 +561,10 @@ class MatchPatternError(ValueError): ValueError.__init__(self, msg) +class AlignmentError(ValueError): + pass + + class ModelsWarning(UserWarning): pass diff --git a/spacy/glossary.py b/spacy/glossary.py index 52abc7bb5..44a8277da 100644 --- a/spacy/glossary.py +++ b/spacy/glossary.py @@ -80,7 +80,7 @@ GLOSSARY = { "RBR": "adverb, comparative", "RBS": "adverb, superlative", "RP": "adverb, particle", - "TO": "infinitival to", + "TO": 'infinitival "to"', "UH": "interjection", "VB": "verb, base form", "VBD": "verb, past tense", @@ -279,6 +279,12 @@ GLOSSARY = { "re": "repeated element", "rs": "reported speech", "sb": "subject", + "sb": "subject", + "sbp": "passivized subject (PP)", + "sp": "subject or predicate", + "svp": "separable verb prefix", + "uc": "unit component", + "vo": "vocative", # Named Entity Recognition # OntoNotes 5 # https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf diff --git a/spacy/gold.pyx b/spacy/gold.pyx index 7bf89c84a..5aecc2584 100644 --- a/spacy/gold.pyx +++ b/spacy/gold.pyx @@ -11,10 +11,9 @@ import itertools from pathlib import Path import srsly -from . import _align from .syntax import nonproj from .tokens import Doc, Span -from .errors import Errors +from .errors import Errors, AlignmentError from .compat import path2str from . 
import util from .util import minibatch, itershuffle @@ -22,6 +21,7 @@ from .util import minibatch, itershuffle from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek +USE_NEW_ALIGN = False punct_re = re.compile(r"\W") @@ -56,10 +56,10 @@ def tags_to_entities(tags): def merge_sents(sents): m_deps = [[], [], [], [], [], []] + m_cats = {} m_brackets = [] - m_cats = sents.pop() i = 0 - for (ids, words, tags, heads, labels, ner), brackets in sents: + for (ids, words, tags, heads, labels, ner), (cats, brackets) in sents: m_deps[0].extend(id_ + i for id_ in ids) m_deps[1].extend(words) m_deps[2].extend(tags) @@ -68,12 +68,26 @@ def merge_sents(sents): m_deps[5].extend(ner) m_brackets.extend((b["first"] + i, b["last"] + i, b["label"]) for b in brackets) + m_cats.update(cats) i += len(ids) - m_deps.append(m_cats) - return [(m_deps, m_brackets)] + return [(m_deps, (m_cats, m_brackets))] -def align(tokens_a, tokens_b): +_ALIGNMENT_NORM_MAP = [("``", "'"), ("''", "'"), ('"', "'"), ("`", "'")] + + +def _normalize_for_alignment(tokens): + tokens = [w.replace(" ", "").lower() for w in tokens] + output = [] + for token in tokens: + token = token.replace(" ", "").lower() + for before, after in _ALIGNMENT_NORM_MAP: + token = token.replace(before, after) + output.append(token) + return output + + +def _align_before_v2_2_2(tokens_a, tokens_b): """Calculate alignment tables between two tokenizations, using the Levenshtein algorithm. The alignment is case-insensitive. @@ -92,6 +106,7 @@ def align(tokens_a, tokens_b): * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other direction. """ + from . import _align if tokens_a == tokens_b: alignment = numpy.arange(len(tokens_a)) return 0, alignment, alignment, {}, {} @@ -111,6 +126,82 @@ def align(tokens_a, tokens_b): return cost, i2j, j2i, i2j_multi, j2i_multi +def align(tokens_a, tokens_b): + """Calculate alignment tables between two tokenizations. + + tokens_a (List[str]): The candidate tokenization. + tokens_b (List[str]): The reference tokenization. + RETURNS: (tuple): A 5-tuple consisting of the following information: + * cost (int): The number of misaligned tokens. + * a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`. + For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns + to `tokens_b[6]`. If there's no one-to-one alignment for a token, + it has the value -1. + * b2a (List[int]): The same as `a2b`, but mapping the other direction. + * a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a` + to indices in `tokens_b`, where multiple tokens of `tokens_a` align to + the same token of `tokens_b`. + * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other + direction. 
+ """ + if not USE_NEW_ALIGN: + return _align_before_v2_2_2(tokens_a, tokens_b) + tokens_a = _normalize_for_alignment(tokens_a) + tokens_b = _normalize_for_alignment(tokens_b) + cost = 0 + a2b = numpy.empty(len(tokens_a), dtype="i") + b2a = numpy.empty(len(tokens_b), dtype="i") + a2b_multi = {} + b2a_multi = {} + i = 0 + j = 0 + offset_a = 0 + offset_b = 0 + while i < len(tokens_a) and j < len(tokens_b): + a = tokens_a[i][offset_a:] + b = tokens_b[j][offset_b:] + a2b[i] = b2a[j] = -1 + if a == b: + if offset_a == offset_b == 0: + a2b[i] = j + b2a[j] = i + elif offset_a == 0: + cost += 2 + a2b_multi[i] = j + elif offset_b == 0: + cost += 2 + b2a_multi[j] = i + offset_a = offset_b = 0 + i += 1 + j += 1 + elif a == "": + assert offset_a == 0 + cost += 1 + i += 1 + elif b == "": + assert offset_b == 0 + cost += 1 + j += 1 + elif b.startswith(a): + cost += 1 + if offset_a == 0: + a2b_multi[i] = j + i += 1 + offset_a = 0 + offset_b += len(a) + elif a.startswith(b): + cost += 1 + if offset_b == 0: + b2a_multi[j] = i + j += 1 + offset_b = 0 + offset_a += len(b) + else: + assert "".join(tokens_a) != "".join(tokens_b) + raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b)) + return cost, a2b, b2a, a2b_multi, b2a_multi + + class GoldCorpus(object): """An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER. @@ -176,6 +267,11 @@ class GoldCorpus(object): gold_tuples = read_json_file(loc) elif loc.parts[-1].endswith("jsonl"): gold_tuples = srsly.read_jsonl(loc) + first_gold_tuple = next(gold_tuples) + gold_tuples = itertools.chain([first_gold_tuple], gold_tuples) + # TODO: proper format checks with schemas + if isinstance(first_gold_tuple, dict): + gold_tuples = read_json_object(gold_tuples) elif loc.parts[-1].endswith("msg"): gold_tuples = srsly.read_msgpack(loc) else: @@ -201,7 +297,6 @@ class GoldCorpus(object): n = 0 i = 0 for raw_text, paragraph_tuples in self.train_tuples: - cats = paragraph_tuples.pop() for sent_tuples, brackets in paragraph_tuples: n += len(sent_tuples[1]) if self.limit and i >= self.limit: @@ -210,7 +305,8 @@ class GoldCorpus(object): return n def train_docs(self, nlp, gold_preproc=False, max_length=None, - noise_level=0.0, orth_variant_level=0.0): + noise_level=0.0, orth_variant_level=0.0, + ignore_misaligned=False): locs = list((self.tmp_dir / 'train').iterdir()) random.shuffle(locs) train_tuples = self.read_tuples(locs, limit=self.limit) @@ -218,20 +314,23 @@ class GoldCorpus(object): max_length=max_length, noise_level=noise_level, orth_variant_level=orth_variant_level, - make_projective=True) + make_projective=True, + ignore_misaligned=ignore_misaligned) yield from gold_docs def train_docs_without_preprocessing(self, nlp, gold_preproc=False): gold_docs = self.iter_gold_docs(nlp, self.train_tuples, gold_preproc=gold_preproc) yield from gold_docs - def dev_docs(self, nlp, gold_preproc=False): - gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc) + def dev_docs(self, nlp, gold_preproc=False, ignore_misaligned=False): + gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc, + ignore_misaligned=ignore_misaligned) yield from gold_docs @classmethod def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None, - noise_level=0.0, orth_variant_level=0.0, make_projective=False): + noise_level=0.0, orth_variant_level=0.0, make_projective=False, + ignore_misaligned=False): for raw_text, paragraph_tuples in tuples: if gold_preproc: raw_text = None @@ 
-240,10 +339,12 @@ class GoldCorpus(object): docs, paragraph_tuples = cls._make_docs(nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=noise_level, orth_variant_level=orth_variant_level) - golds = cls._make_golds(docs, paragraph_tuples, make_projective) + golds = cls._make_golds(docs, paragraph_tuples, make_projective, + ignore_misaligned=ignore_misaligned) for doc, gold in zip(docs, golds): - if (not max_length) or len(doc) < max_length: - yield doc, gold + if gold is not None: + if (not max_length) or len(doc) < max_length: + yield doc, gold @classmethod def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0, orth_variant_level=0.0): @@ -259,14 +360,22 @@ class GoldCorpus(object): @classmethod - def _make_golds(cls, docs, paragraph_tuples, make_projective): + def _make_golds(cls, docs, paragraph_tuples, make_projective, ignore_misaligned=False): if len(docs) != len(paragraph_tuples): n_annots = len(paragraph_tuples) raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots)) - return [GoldParse.from_annot_tuples(doc, sent_tuples, - make_projective=make_projective) - for doc, (sent_tuples, brackets) - in zip(docs, paragraph_tuples)] + golds = [] + for doc, (sent_tuples, (cats, brackets)) in zip(docs, paragraph_tuples): + try: + gold = GoldParse.from_annot_tuples(doc, sent_tuples, cats=cats, + make_projective=make_projective) + except AlignmentError: + if ignore_misaligned: + gold = None + else: + raise + golds.append(gold) + return golds def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): @@ -281,7 +390,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): # modify words in paragraph_tuples variant_paragraph_tuples = [] for sent_tuples, brackets in paragraph_tuples: - ids, words, tags, heads, labels, ner, cats = sent_tuples + ids, words, tags, heads, labels, ner = sent_tuples if lower: words = [w.lower() for w in words] # single variants @@ -310,7 +419,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): pair_idx = pair.index(words[word_idx]) words[word_idx] = punct_choices[punct_idx][pair_idx] - variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner, cats), brackets)) + variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner), brackets)) # modify raw to match variant_paragraph_tuples if raw is not None: variants = [] @@ -329,7 +438,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): variant_raw += raw[raw_idx] raw_idx += 1 for sent_tuples, brackets in variant_paragraph_tuples: - ids, words, tags, heads, labels, ner, cats = sent_tuples + ids, words, tags, heads, labels, ner = sent_tuples for word in words: match_found = False # add identical word @@ -400,6 +509,9 @@ def json_to_tuple(doc): paragraphs = [] for paragraph in doc["paragraphs"]: sents = [] + cats = {} + for cat in paragraph.get("cats", {}): + cats[cat["label"]] = cat["value"] for sent in paragraph["sentences"]: words = [] ids = [] @@ -419,11 +531,7 @@ def json_to_tuple(doc): ner.append(token.get("ner", "-")) sents.append([ [ids, words, tags, heads, labels, ner], - sent.get("brackets", [])]) - cats = {} - for cat in paragraph.get("cats", {}): - cats[cat["label"]] = cat["value"] - sents.append(cats) + [cats, sent.get("brackets", [])]]) if sents: yield [paragraph.get("raw", None), sents] @@ -537,8 +645,8 @@ cdef class GoldParse: DOCS: https://spacy.io/api/goldparse """ @classmethod - def from_annot_tuples(cls, doc, annot_tuples, 
make_projective=False): - _, words, tags, heads, deps, entities, cats = annot_tuples + def from_annot_tuples(cls, doc, annot_tuples, cats=None, make_projective=False): + _, words, tags, heads, deps, entities = annot_tuples return cls(doc, words=words, tags=tags, heads=heads, deps=deps, entities=entities, cats=cats, make_projective=make_projective) @@ -595,9 +703,9 @@ cdef class GoldParse: if morphology is None: morphology = [None for _ in words] if entities is None: - entities = ["-" for _ in doc] + entities = ["-" for _ in words] elif len(entities) == 0: - entities = ["O" for _ in doc] + entities = ["O" for _ in words] else: # Translate the None values to '-', to make processing easier. # See Issue #2603 @@ -660,7 +768,9 @@ cdef class GoldParse: self.heads[i] = i+1 self.labels[i] = "subtok" else: - self.heads[i] = self.gold_to_cand[heads[i2j_multi[i]]] + head_i = heads[i2j_multi[i]] + if head_i: + self.heads[i] = self.gold_to_cand[head_i] self.labels[i] = deps[i2j_multi[i]] # Now set NER...This is annoying because if we've split # got an entity word split into two, we need to adjust the @@ -748,7 +858,7 @@ def docs_to_json(docs, id=0): docs (iterable / Doc): The Doc object(s) to convert. id (int): Id for the JSON. - RETURNS (dict): The data in spaCy's JSON format + RETURNS (dict): The data in spaCy's JSON format - each input doc will be treated as a paragraph in the output doc """ if isinstance(docs, Doc): @@ -804,7 +914,7 @@ def biluo_tags_from_offsets(doc, entities, missing="O"): """ # Ensure no overlapping entity labels exist tokens_in_ents = {} - + starts = {token.idx: token.i for token in doc} ends = {token.idx + len(token): token.i for token in doc} biluo = ["-" for _ in doc] diff --git a/spacy/lang/de/tag_map.py b/spacy/lang/de/tag_map.py index 394478145..c169501a9 100644 --- a/spacy/lang/de/tag_map.py +++ b/spacy/lang/de/tag_map.py @@ -1,8 +1,8 @@ # coding: utf8 from __future__ import unicode_literals -from ...symbols import POS, PUNCT, ADJ, CONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX +from ...symbols import POS, PUNCT, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X +from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, VERB TAG_MAP = { @@ -20,8 +20,8 @@ TAG_MAP = { "CARD": {POS: NUM, "NumType": "card"}, "FM": {POS: X, "Foreign": "yes"}, "ITJ": {POS: INTJ}, - "KOKOM": {POS: CONJ, "ConjType": "comp"}, - "KON": {POS: CONJ}, + "KOKOM": {POS: CCONJ, "ConjType": "comp"}, + "KON": {POS: CCONJ}, "KOUI": {POS: SCONJ}, "KOUS": {POS: SCONJ}, "NE": {POS: PROPN}, @@ -43,7 +43,7 @@ TAG_MAP = { "PTKA": {POS: PART}, "PTKANT": {POS: PART, "PartType": "res"}, "PTKNEG": {POS: PART, "Polarity": "neg"}, - "PTKVZ": {POS: PART, "PartType": "vbp"}, + "PTKVZ": {POS: ADP, "PartType": "vbp"}, "PTKZU": {POS: PART, "PartType": "inf"}, "PWAT": {POS: DET, "PronType": "int"}, "PWAV": {POS: ADV, "PronType": "int"}, diff --git a/spacy/lang/en/tag_map.py b/spacy/lang/en/tag_map.py index 9bd884a3a..ecb3103cc 100644 --- a/spacy/lang/en/tag_map.py +++ b/spacy/lang/en/tag_map.py @@ -2,7 +2,7 @@ from __future__ import unicode_literals from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX +from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON TAG_MAP = { @@ -28,8 +28,8 @@ TAG_MAP = { "JJR": {POS: ADJ, "Degree": "comp"}, "JJS": {POS: ADJ, "Degree": "sup"}, "LS": {POS: X, "NumType": "ord"}, - "MD": {POS: AUX, "VerbType": "mod"}, - "NIL": {POS: ""}, + "MD": 
{POS: VERB, "VerbType": "mod"}, + "NIL": {POS: X}, "NN": {POS: NOUN, "Number": "sing"}, "NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"}, "NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"}, @@ -37,7 +37,7 @@ TAG_MAP = { "PDT": {POS: DET}, "POS": {POS: PART, "Poss": "yes"}, "PRP": {POS: PRON, "PronType": "prs"}, - "PRP$": {POS: PRON, "PronType": "prs", "Poss": "yes"}, + "PRP$": {POS: DET, "PronType": "prs", "Poss": "yes"}, "RB": {POS: ADV, "Degree": "pos"}, "RBR": {POS: ADV, "Degree": "comp"}, "RBS": {POS: ADV, "Degree": "sup"}, @@ -58,9 +58,9 @@ TAG_MAP = { "Number": "sing", "Person": "three", }, - "WDT": {POS: PRON}, + "WDT": {POS: DET}, "WP": {POS: PRON}, - "WP$": {POS: PRON, "Poss": "yes"}, + "WP$": {POS: DET, "Poss": "yes"}, "WRB": {POS: ADV}, "ADD": {POS: X}, "NFP": {POS: PUNCT}, diff --git a/spacy/language.py b/spacy/language.py index 330852741..d53710f58 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -18,13 +18,8 @@ from .tokenizer import Tokenizer from .vocab import Vocab from .lemmatizer import Lemmatizer from .lookups import Lookups -from .pipeline import DependencyParser, Tagger -from .pipeline import Tensorizer, EntityRecognizer, EntityLinker -from .pipeline import SimilarityHook, TextCategorizer, Sentencizer -from .pipeline import merge_noun_chunks, merge_entities, merge_subtokens -from .pipeline import EntityRuler -from .pipeline import Morphologizer -from .compat import izip, basestring_, is_python2 +from .analysis import analyze_pipes, analyze_all_pipes, validate_attrs +from .compat import izip, basestring_, is_python2, class_types from .gold import GoldParse from .scorer import Scorer from ._ml import link_vectors_to_models, create_default_optimizer @@ -40,6 +35,9 @@ from . import util from . import about +ENABLE_PIPELINE_ANALYSIS = False + + class BaseDefaults(object): @classmethod def create_lemmatizer(cls, nlp=None, lookups=None): @@ -133,22 +131,7 @@ class Language(object): Defaults = BaseDefaults lang = None - factories = { - "tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp), - "tensorizer": lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg), - "tagger": lambda nlp, **cfg: Tagger(nlp.vocab, **cfg), - "morphologizer": lambda nlp, **cfg: Morphologizer(nlp.vocab, **cfg), - "parser": lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg), - "ner": lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg), - "entity_linker": lambda nlp, **cfg: EntityLinker(nlp.vocab, **cfg), - "similarity": lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg), - "textcat": lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg), - "sentencizer": lambda nlp, **cfg: Sentencizer(**cfg), - "merge_noun_chunks": lambda nlp, **cfg: merge_noun_chunks, - "merge_entities": lambda nlp, **cfg: merge_entities, - "merge_subtokens": lambda nlp, **cfg: merge_subtokens, - "entity_ruler": lambda nlp, **cfg: EntityRuler(nlp, **cfg), - } + factories = {"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp)} def __init__( self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs @@ -218,6 +201,7 @@ class Language(object): "name": self.vocab.vectors.name, } self._meta["pipeline"] = self.pipe_names + self._meta["factories"] = self.pipe_factories self._meta["labels"] = self.pipe_labels return self._meta @@ -259,6 +243,17 @@ class Language(object): """ return [pipe_name for pipe_name, _ in self.pipeline] + @property + def pipe_factories(self): + """Get the component factories for the available pipeline components. 
+ + RETURNS (dict): Factory names, keyed by component names. + """ + factories = {} + for pipe_name, pipe in self.pipeline: + factories[pipe_name] = getattr(pipe, "factory", pipe_name) + return factories + @property def pipe_labels(self): """Get the labels set by the pipeline components, if available (if @@ -327,33 +322,30 @@ class Language(object): msg += Errors.E004.format(component=component) raise ValueError(msg) if name is None: - if hasattr(component, "name"): - name = component.name - elif hasattr(component, "__name__"): - name = component.__name__ - elif hasattr(component, "__class__") and hasattr( - component.__class__, "__name__" - ): - name = component.__class__.__name__ - else: - name = repr(component) + name = util.get_component_name(component) if name in self.pipe_names: raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names)) if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2: raise ValueError(Errors.E006) + pipe_index = 0 pipe = (name, component) if last or not any([first, before, after]): + pipe_index = len(self.pipeline) self.pipeline.append(pipe) elif first: self.pipeline.insert(0, pipe) elif before and before in self.pipe_names: + pipe_index = self.pipe_names.index(before) self.pipeline.insert(self.pipe_names.index(before), pipe) elif after and after in self.pipe_names: + pipe_index = self.pipe_names.index(after) + 1 self.pipeline.insert(self.pipe_names.index(after) + 1, pipe) else: raise ValueError( Errors.E001.format(name=before or after, opts=self.pipe_names) ) + if ENABLE_PIPELINE_ANALYSIS: + analyze_pipes(self.pipeline, name, component, pipe_index) def has_pipe(self, name): """Check if a component name is present in the pipeline. Equivalent to @@ -382,6 +374,8 @@ class Language(object): msg += Errors.E135.format(name=name) raise ValueError(msg) self.pipeline[self.pipe_names.index(name)] = (name, component) + if ENABLE_PIPELINE_ANALYSIS: + analyze_all_pipes(self.pipeline) def rename_pipe(self, old_name, new_name): """Rename a pipeline component. @@ -408,7 +402,10 @@ class Language(object): """ if name not in self.pipe_names: raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names)) - return self.pipeline.pop(self.pipe_names.index(name)) + removed = self.pipeline.pop(self.pipe_names.index(name)) + if ENABLE_PIPELINE_ANALYSIS: + analyze_all_pipes(self.pipeline) + return removed def __call__(self, text, disable=[], component_cfg=None): """Apply the pipeline to some text. The text can span multiple sentences, @@ -448,6 +445,8 @@ class Language(object): DOCS: https://spacy.io/api/language#disable_pipes """ + if len(names) == 1 and isinstance(names[0], (list, tuple)): + names = names[0] # support list of names instead of spread return DisabledPipes(self, *names) def make_doc(self, text): @@ -999,6 +998,52 @@ class Language(object): return self +class component(object): + """Decorator for pipeline components. Can decorate both function components + and class components and will automatically register components in the + Language.factories. If the component is a class and needs access to the + nlp object or config parameters, it can expose a from_nlp classmethod + that takes the nlp object and **cfg arguments and returns the initialized + component. + """ + + # NB: This decorator needs to live here, because it needs to write to + # Language.factories. All other solutions would cause circular import. + + def __init__(self, name=None, assigns=tuple(), requires=tuple(), retokenizes=False): + """Decorate a pipeline component. 
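# A minimal usage sketch (not part of the diff) of the @component decorator, the
# new pipe_factories property and the list form of disable_pipes described above.
# The component name "print_length" is a hypothetical example.
import spacy
from spacy.language import component

@component("print_length")
def print_length(doc):
    # a trivial function component: print the token count and pass the doc on
    print(len(doc))
    return doc

nlp = spacy.blank("en")
nlp.add_pipe(print_length)
assert nlp.pipe_names == ["print_length"]
assert nlp.pipe_factories == {"print_length": "print_length"}
with nlp.disable_pipes(["print_length"]):   # a list now works as well as *names
    assert nlp.pipe_names == []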
+ + name (unicode): Default component and factory name. + assigns (list): Attributes assigned by component, e.g. `["token.pos"]`. + requires (list): Attributes required by component, e.g. `["token.dep"]`. + retokenizes (bool): Whether the component changes the tokenization. + """ + self.name = name + self.assigns = validate_attrs(assigns) + self.requires = validate_attrs(requires) + self.retokenizes = retokenizes + + def __call__(self, *args, **kwargs): + obj = args[0] + args = args[1:] + factory_name = self.name or util.get_component_name(obj) + obj.name = factory_name + obj.factory = factory_name + obj.assigns = self.assigns + obj.requires = self.requires + obj.retokenizes = self.retokenizes + + def factory(nlp, **cfg): + if hasattr(obj, "from_nlp"): + return obj.from_nlp(nlp, **cfg) + elif isinstance(obj, class_types): + return obj() + return obj + + Language.factories[obj.factory] = factory + return obj + + def _fix_pretrained_vectors_name(nlp): # TODO: Replace this once we handle vectors consistently as static # data diff --git a/spacy/matcher/dependencymatcher.pyx b/spacy/matcher/dependencymatcher.pyx index b58d36d62..ae2ad3ca6 100644 --- a/spacy/matcher/dependencymatcher.pyx +++ b/spacy/matcher/dependencymatcher.pyx @@ -102,7 +102,10 @@ cdef class DependencyMatcher: visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True idx = idx + 1 - def add(self, key, on_match, *patterns): + def add(self, key, patterns, *_patterns, on_match=None): + if patterns is None or hasattr(patterns, "__call__"): # old API + on_match = patterns + patterns = _patterns for pattern in patterns: if len(pattern) == 0: raise ValueError(Errors.E012.format(key=key)) diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx index af0450592..6f6848102 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -74,7 +74,7 @@ cdef class Matcher: """ return self._normalize_key(key) in self._patterns - def add(self, key, on_match, *patterns): + def add(self, key, patterns, *_patterns, on_match=None): """Add a match-rule to the matcher. A match-rule consists of: an ID key, an on_match callback, and one or more patterns. @@ -98,16 +98,29 @@ cdef class Matcher: operator will behave non-greedily. This quirk in the semantics makes the matcher more efficient, by avoiding the need for back-tracking. + As of spaCy v2.2.2, Matcher.add supports the future API, which makes + the patterns the second argument and a list (instead of a variable + number of arguments). The on_match callback becomes an optional keyword + argument. + key (unicode): The match ID. - on_match (callable): Callback executed on match. - *patterns (list): List of token descriptions. + patterns (list): The patterns to add for the given key. + on_match (callable): Optional callback executed on match. + *_patterns (list): For backwards compatibility: list of patterns to add + as variable arguments. Will be ignored if a list of patterns is + provided as the second argument. 
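# A usage sketch of the add() signature change described above (v2.2.2+): the
# patterns become a list passed as the second argument, and on_match an optional
# keyword. The "GREETING" key and pattern are hypothetical examples.
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
patterns = [[{"LOWER": "hello"}, {"LOWER": "world"}]]
matcher.add("GREETING", patterns)              # new API
# matcher.add("GREETING", None, *patterns)     # old API, still accepted
matches = matcher(nlp("hello world"))          # [(match_id, 0, 2)]
# PhraseMatcher.add follows the same convention, with a list of Doc patterns:
# phrase_matcher.add("OBAMA", [nlp("Barack Obama")], on_match=callback)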
""" errors = {} if on_match is not None and not hasattr(on_match, "__call__"): raise ValueError(Errors.E171.format(arg_type=type(on_match))) + if patterns is None or hasattr(patterns, "__call__"): # old API + on_match = patterns + patterns = _patterns for i, pattern in enumerate(patterns): if len(pattern) == 0: raise ValueError(Errors.E012.format(key=key)) + if not isinstance(pattern, list): + raise ValueError(Errors.E178.format(pat=pattern, key=key)) if self.validator: errors[i] = validate_json(pattern, self.validator) if any(err for err in errors.values()): diff --git a/spacy/matcher/phrasematcher.pyx b/spacy/matcher/phrasematcher.pyx index 135e81efe..4de5782f9 100644 --- a/spacy/matcher/phrasematcher.pyx +++ b/spacy/matcher/phrasematcher.pyx @@ -152,16 +152,27 @@ cdef class PhraseMatcher: del self._callbacks[key] del self._docs[key] - def add(self, key, on_match, *docs): + def add(self, key, docs, *_docs, on_match=None): """Add a match-rule to the phrase-matcher. A match-rule consists of: an ID key, an on_match callback, and one or more patterns. + As of spaCy v2.2.2, PhraseMatcher.add supports the future API, which + makes the patterns the second argument and a list (instead of a variable + number of arguments). The on_match callback becomes an optional keyword + argument. + key (unicode): The match ID. + docs (list): List of `Doc` objects representing match patterns. on_match (callable): Callback executed on match. - *docs (Doc): `Doc` objects representing match patterns. + *_docs (Doc): For backwards compatibility: list of patterns to add + as variable arguments. Will be ignored if a list of patterns is + provided as the second argument. DOCS: https://spacy.io/api/phrasematcher#add """ + if docs is None or hasattr(docs, "__call__"): # old API + on_match = docs + docs = _docs _ = self.vocab[key] self._callbacks[key] = on_match @@ -171,6 +182,8 @@ cdef class PhraseMatcher: cdef MapStruct* internal_node cdef void* result + if isinstance(docs, Doc): + raise ValueError(Errors.E179.format(key=key)) for doc in docs: if len(doc) == 0: continue diff --git a/spacy/ml/__init__.py b/spacy/ml/__init__.py new file mode 100644 index 000000000..57e7ef571 --- /dev/null +++ b/spacy/ml/__init__.py @@ -0,0 +1,5 @@ +# coding: utf8 +from __future__ import unicode_literals + +from .tok2vec import Tok2Vec # noqa: F401 +from .common import FeedForward, LayerNormalizedMaxout # noqa: F401 diff --git a/spacy/ml/_legacy_tok2vec.py b/spacy/ml/_legacy_tok2vec.py new file mode 100644 index 000000000..b077a46b7 --- /dev/null +++ b/spacy/ml/_legacy_tok2vec.py @@ -0,0 +1,131 @@ +# coding: utf8 +from __future__ import unicode_literals +from thinc.v2v import Model, Maxout +from thinc.i2v import HashEmbed, StaticVectors +from thinc.t2t import ExtractWindow +from thinc.misc import Residual +from thinc.misc import LayerNorm as LN +from thinc.misc import FeatureExtracter +from thinc.api import layerize, chain, clone, concatenate, with_flatten +from thinc.api import uniqued, wrap, noop + +from ..attrs import ID, ORTH, NORM, PREFIX, SUFFIX, SHAPE + + +def Tok2Vec(width, embed_size, **kwargs): + # Circular imports :( + from .._ml import CharacterEmbed + from .._ml import PyTorchBiLSTM + + pretrained_vectors = kwargs.get("pretrained_vectors", None) + cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3) + subword_features = kwargs.get("subword_features", True) + char_embed = kwargs.get("char_embed", False) + if char_embed: + subword_features = False + conv_depth = kwargs.get("conv_depth", 4) + bilstm_depth = 
kwargs.get("bilstm_depth", 0) + cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH] + with Model.define_operators({">>": chain, "|": concatenate, "**": clone}): + norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm") + if subword_features: + prefix = HashEmbed( + width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix" + ) + suffix = HashEmbed( + width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix" + ) + shape = HashEmbed( + width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape" + ) + else: + prefix, suffix, shape = (None, None, None) + if pretrained_vectors is not None: + glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID)) + + if subword_features: + embed = uniqued( + (glove | norm | prefix | suffix | shape) + >> LN(Maxout(width, width * 5, pieces=3)), + column=cols.index(ORTH), + ) + else: + embed = uniqued( + (glove | norm) >> LN(Maxout(width, width * 2, pieces=3)), + column=cols.index(ORTH), + ) + elif subword_features: + embed = uniqued( + (norm | prefix | suffix | shape) + >> LN(Maxout(width, width * 4, pieces=3)), + column=cols.index(ORTH), + ) + elif char_embed: + embed = concatenate_lists( + CharacterEmbed(nM=64, nC=8), + FeatureExtracter(cols) >> with_flatten(norm), + ) + reduce_dimensions = LN( + Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces) + ) + else: + embed = norm + + convolution = Residual( + ExtractWindow(nW=1) + >> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces)) + ) + if char_embed: + tok2vec = embed >> with_flatten( + reduce_dimensions >> convolution ** conv_depth, pad=conv_depth + ) + else: + tok2vec = FeatureExtracter(cols) >> with_flatten( + embed >> convolution ** conv_depth, pad=conv_depth + ) + + if bilstm_depth >= 1: + tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth) + # Work around thinc API limitations :(. TODO: Revise in Thinc 7 + tok2vec.nO = width + tok2vec.embed = embed + return tok2vec + + +@layerize +def flatten(seqs, drop=0.0): + ops = Model.ops + lengths = ops.asarray([len(seq) for seq in seqs], dtype="i") + + def finish_update(d_X, sgd=None): + return ops.unflatten(d_X, lengths, pad=0) + + X = ops.flatten(seqs, pad=0) + return X, finish_update + + +def concatenate_lists(*layers, **kwargs): # pragma: no cover + """Compose two or more models `f`, `g`, etc, such that their outputs are + concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))` + """ + if not layers: + return noop() + drop_factor = kwargs.get("drop_factor", 1.0) + ops = layers[0].ops + layers = [chain(layer, flatten) for layer in layers] + concat = concatenate(*layers) + + def concatenate_lists_fwd(Xs, drop=0.0): + if drop is not None: + drop *= drop_factor + lengths = ops.asarray([len(X) for X in Xs], dtype="i") + flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop) + ys = ops.unflatten(flat_y, lengths) + + def concatenate_lists_bwd(d_ys, sgd=None): + return bp_flat_y(ops.flatten(d_ys), sgd=sgd) + + return ys, concatenate_lists_bwd + + model = wrap(concatenate_lists_fwd, concat) + return model diff --git a/spacy/ml/_wire.py b/spacy/ml/_wire.py new file mode 100644 index 000000000..fa271b37c --- /dev/null +++ b/spacy/ml/_wire.py @@ -0,0 +1,42 @@ +from __future__ import unicode_literals +from thinc.api import layerize, wrap, noop, chain, concatenate +from thinc.v2v import Model + + +def concatenate_lists(*layers, **kwargs): # pragma: no cover + """Compose two or more models `f`, `g`, etc, such that their outputs are + concatenated, i.e. 
`concatenate(f, g)(x)` computes `hstack(f(x), g(x))` + """ + if not layers: + return layerize(noop()) + drop_factor = kwargs.get("drop_factor", 1.0) + ops = layers[0].ops + layers = [chain(layer, flatten) for layer in layers] + concat = concatenate(*layers) + + def concatenate_lists_fwd(Xs, drop=0.0): + if drop is not None: + drop *= drop_factor + lengths = ops.asarray([len(X) for X in Xs], dtype="i") + flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop) + ys = ops.unflatten(flat_y, lengths) + + def concatenate_lists_bwd(d_ys, sgd=None): + return bp_flat_y(ops.flatten(d_ys), sgd=sgd) + + return ys, concatenate_lists_bwd + + model = wrap(concatenate_lists_fwd, concat) + return model + + +@layerize +def flatten(seqs, drop=0.0): + ops = Model.ops + lengths = ops.asarray([len(seq) for seq in seqs], dtype="i") + + def finish_update(d_X, sgd=None): + return ops.unflatten(d_X, lengths, pad=0) + + X = ops.flatten(seqs, pad=0) + return X, finish_update diff --git a/spacy/ml/common.py b/spacy/ml/common.py new file mode 100644 index 000000000..963d4dc35 --- /dev/null +++ b/spacy/ml/common.py @@ -0,0 +1,23 @@ +from __future__ import unicode_literals + +from thinc.api import chain +from thinc.v2v import Maxout +from thinc.misc import LayerNorm +from ..util import register_architecture, make_layer + + +@register_architecture("thinc.FeedForward.v1") +def FeedForward(config): + layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]] + model = chain(*layers) + model.cfg = config + return model + + +@register_architecture("spacy.LayerNormalizedMaxout.v1") +def LayerNormalizedMaxout(config): + width = config["width"] + pieces = config["pieces"] + layer = LayerNorm(Maxout(width, pieces=pieces)) + layer.nO = width + return layer diff --git a/spacy/ml/tok2vec.py b/spacy/ml/tok2vec.py new file mode 100644 index 000000000..0b30551b5 --- /dev/null +++ b/spacy/ml/tok2vec.py @@ -0,0 +1,176 @@ +from __future__ import unicode_literals + +from thinc.api import chain, layerize, clone, concatenate, with_flatten, uniqued +from thinc.api import noop, with_square_sequences +from thinc.v2v import Maxout, Model +from thinc.i2v import HashEmbed, StaticVectors +from thinc.t2t import ExtractWindow +from thinc.misc import Residual, LayerNorm, FeatureExtracter +from ..util import make_layer, register_architecture +from ._wire import concatenate_lists + + +@register_architecture("spacy.Tok2Vec.v1") +def Tok2Vec(config): + doc2feats = make_layer(config["@doc2feats"]) + embed = make_layer(config["@embed"]) + encode = make_layer(config["@encode"]) + field_size = getattr(encode, "receptive_field", 0) + tok2vec = chain(doc2feats, with_flatten(chain(embed, encode), pad=field_size)) + tok2vec.cfg = config + tok2vec.nO = encode.nO + tok2vec.embed = embed + tok2vec.encode = encode + return tok2vec + + +@register_architecture("spacy.Doc2Feats.v1") +def Doc2Feats(config): + columns = config["columns"] + return FeatureExtracter(columns) + + +@register_architecture("spacy.MultiHashEmbed.v1") +def MultiHashEmbed(config): + # For backwards compatibility with models before the architecture registry, + # we have to be careful to get exactly the same model structure. One subtle + # trick is that when we define concatenation with the operator, the operator + # is actually binary associative. So when we write (a | b | c), we're actually + # getting concatenate(concatenate(a, b), c). That's why the implementation + # is a bit ugly here. 
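# A hedged sketch (assumed, not from the diff) of how the architecture registry
# used above works: a function is registered under a string name and later
# instantiated from an {"arch": ..., "config": ...} dict via make_layer. The
# "custom.Affine.v1" name is hypothetical and only used for illustration.
from thinc.v2v import Affine
from spacy.util import register_architecture, make_layer

@register_architecture("custom.Affine.v1")
def CustomAffine(config):
    # build a plain affine layer from the config values
    return Affine(config["nO"], config["nI"])

layer = make_layer({"arch": "custom.Affine.v1", "config": {"nO": 96, "nI": 96}})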
+ cols = config["columns"] + width = config["width"] + rows = config["rows"] + + norm = HashEmbed(width, rows, column=cols.index("NORM"), name="embed_norm") + if config["use_subwords"]: + prefix = HashEmbed( + width, rows // 2, column=cols.index("PREFIX"), name="embed_prefix" + ) + suffix = HashEmbed( + width, rows // 2, column=cols.index("SUFFIX"), name="embed_suffix" + ) + shape = HashEmbed( + width, rows // 2, column=cols.index("SHAPE"), name="embed_shape" + ) + if config.get("@pretrained_vectors"): + glove = make_layer(config["@pretrained_vectors"]) + mix = make_layer(config["@mix"]) + + with Model.define_operators({">>": chain, "|": concatenate}): + if config["use_subwords"] and config["@pretrained_vectors"]: + mix._layers[0].nI = width * 5 + layer = uniqued( + (glove | norm | prefix | suffix | shape) >> mix, + column=cols.index("ORTH"), + ) + elif config["use_subwords"]: + mix._layers[0].nI = width * 4 + layer = uniqued( + (norm | prefix | suffix | shape) >> mix, column=cols.index("ORTH") + ) + elif config["@pretrained_vectors"]: + mix._layers[0].nI = width * 2 + layer = uniqued((glove | norm) >> mix, column=cols.index("ORTH"),) + else: + layer = norm + layer.cfg = config + return layer + + +@register_architecture("spacy.CharacterEmbed.v1") +def CharacterEmbed(config): + from .. import _ml + + width = config["width"] + chars = config["chars"] + + chr_embed = _ml.CharacterEmbedModel(nM=width, nC=chars) + other_tables = make_layer(config["@embed_features"]) + mix = make_layer(config["@mix"]) + + model = chain(concatenate_lists(chr_embed, other_tables), mix) + model.cfg = config + return model + + +@register_architecture("spacy.MaxoutWindowEncoder.v1") +def MaxoutWindowEncoder(config): + nO = config["width"] + nW = config["window_size"] + nP = config["pieces"] + depth = config["depth"] + + cnn = chain( + ExtractWindow(nW=nW), LayerNorm(Maxout(nO, nO * ((nW * 2) + 1), pieces=nP)) + ) + model = clone(Residual(cnn), depth) + model.nO = nO + model.receptive_field = nW * depth + return model + + +@register_architecture("spacy.MishWindowEncoder.v1") +def MishWindowEncoder(config): + from thinc.v2v import Mish + + nO = config["width"] + nW = config["window_size"] + depth = config["depth"] + + cnn = chain(ExtractWindow(nW=nW), LayerNorm(Mish(nO, nO * ((nW * 2) + 1)))) + model = clone(Residual(cnn), depth) + model.nO = nO + return model + + +@register_architecture("spacy.PretrainedVectors.v1") +def PretrainedVectors(config): + return StaticVectors(config["vectors_name"], config["width"], config["column"]) + + +@register_architecture("spacy.TorchBiLSTMEncoder.v1") +def TorchBiLSTMEncoder(config): + import torch.nn + from thinc.extra.wrappers import PyTorchWrapperRNN + + width = config["width"] + depth = config["depth"] + if depth == 0: + return layerize(noop()) + return with_square_sequences( + PyTorchWrapperRNN(torch.nn.LSTM(width, width // 2, depth, bidirectional=True)) + ) + + +_EXAMPLE_CONFIG = { + "@doc2feats": { + "arch": "Doc2Feats", + "config": {"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]}, + }, + "@embed": { + "arch": "spacy.MultiHashEmbed.v1", + "config": { + "width": 96, + "rows": 2000, + "columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"], + "use_subwords": True, + "@pretrained_vectors": { + "arch": "TransformedStaticVectors", + "config": { + "vectors_name": "en_vectors_web_lg.vectors", + "width": 96, + "column": 0, + }, + }, + "@mix": { + "arch": "LayerNormalizedMaxout", + "config": {"width": 96, "pieces": 3}, + }, + }, + }, + "@encode": { + "arch": 
"MaxoutWindowEncode", + "config": {"width": 96, "window_size": 1, "depth": 4, "pieces": 3}, + }, +} diff --git a/spacy/pipeline/entityruler.py b/spacy/pipeline/entityruler.py index e59bbc666..d926b987b 100644 --- a/spacy/pipeline/entityruler.py +++ b/spacy/pipeline/entityruler.py @@ -4,6 +4,7 @@ from __future__ import unicode_literals from collections import defaultdict, OrderedDict import srsly +from ..language import component from ..errors import Errors from ..compat import basestring_ from ..util import ensure_path, to_disk, from_disk @@ -13,6 +14,7 @@ from ..matcher import Matcher, PhraseMatcher DEFAULT_ENT_ID_SEP = "||" +@component("entity_ruler", assigns=["doc.ents", "token.ent_type", "token.ent_iob"]) class EntityRuler(object): """The EntityRuler lets you add spans to the `Doc.ents` using token-based rules or exact phrase matches. It can be combined with the statistical @@ -24,8 +26,6 @@ class EntityRuler(object): USAGE: https://spacy.io/usage/rule-based-matching#entityruler """ - name = "entity_ruler" - def __init__(self, nlp, phrase_matcher_attr=None, validate=False, **cfg): """Initialize the entitiy ruler. If patterns are supplied here, they need to be a list of dictionaries with a `"label"` and `"pattern"` @@ -64,10 +64,15 @@ class EntityRuler(object): self.phrase_matcher_attr = None self.phrase_matcher = PhraseMatcher(nlp.vocab, validate=validate) self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP) + self._ent_ids = defaultdict(dict) patterns = cfg.get("patterns") if patterns is not None: self.add_patterns(patterns) + @classmethod + def from_nlp(cls, nlp, **cfg): + return cls(nlp, **cfg) + def __len__(self): """The number of all patterns added to the entity ruler.""" n_token_patterns = sum(len(p) for p in self.token_patterns.values()) @@ -100,10 +105,9 @@ class EntityRuler(object): continue # check for end - 1 here because boundaries are inclusive if start not in seen_tokens and end - 1 not in seen_tokens: - if self.ent_ids: - label_ = self.nlp.vocab.strings[match_id] - ent_label, ent_id = self._split_label(label_) - span = Span(doc, start, end, label=ent_label) + if match_id in self._ent_ids: + label, ent_id = self._ent_ids[match_id] + span = Span(doc, start, end, label=label) if ent_id: for token in span: token.ent_id_ = ent_id @@ -131,11 +135,11 @@ class EntityRuler(object): @property def ent_ids(self): - """All entity ids present in the match patterns meta dicts. + """All entity ids present in the match patterns `id` properties. RETURNS (set): The string entity ids. - DOCS: https://spacy.io/api/entityruler#labels + DOCS: https://spacy.io/api/entityruler#ent_ids """ all_ent_ids = set() for l in self.labels: @@ -147,7 +151,6 @@ class EntityRuler(object): @property def patterns(self): """Get all patterns that were added to the entity ruler. - RETURNS (list): The original patterns, one dictionary per pattern. 
DOCS: https://spacy.io/api/entityruler#patterns @@ -188,11 +191,15 @@ class EntityRuler(object): ] except ValueError: subsequent_pipes = [] - with self.nlp.disable_pipes(*subsequent_pipes): + with self.nlp.disable_pipes(subsequent_pipes): for entry in patterns: label = entry["label"] if "id" in entry: + ent_label = label label = self._create_label(label, entry["id"]) + key = self.matcher._normalize_key(label) + self._ent_ids[key] = (ent_label, entry["id"]) + pattern = entry["pattern"] if isinstance(pattern, basestring_): self.phrase_patterns[label].append(self.nlp(pattern)) @@ -201,9 +208,9 @@ class EntityRuler(object): else: raise ValueError(Errors.E097.format(pattern=pattern)) for label, patterns in self.token_patterns.items(): - self.matcher.add(label, None, *patterns) + self.matcher.add(label, patterns) for label, patterns in self.phrase_patterns.items(): - self.phrase_matcher.add(label, None, *patterns) + self.phrase_matcher.add(label, patterns) def _split_label(self, label): """Split Entity label into ent_label and ent_id if it contains self.ent_id_sep diff --git a/spacy/pipeline/functions.py b/spacy/pipeline/functions.py index 0f7d94df2..69e638da2 100644 --- a/spacy/pipeline/functions.py +++ b/spacy/pipeline/functions.py @@ -1,9 +1,16 @@ # coding: utf8 from __future__ import unicode_literals +from ..language import component from ..matcher import Matcher +from ..util import filter_spans +@component( + "merge_noun_chunks", + requires=["token.dep", "token.tag", "token.pos"], + retokenizes=True, +) def merge_noun_chunks(doc): """Merge noun chunks into a single token. @@ -21,6 +28,11 @@ def merge_noun_chunks(doc): return doc +@component( + "merge_entities", + requires=["doc.ents", "token.ent_iob", "token.ent_type"], + retokenizes=True, +) def merge_entities(doc): """Merge entities into a single token. @@ -36,6 +48,7 @@ def merge_entities(doc): return doc +@component("merge_subtokens", requires=["token.dep"], retokenizes=True) def merge_subtokens(doc, label="subtok"): """Merge subtokens into a single token. @@ -48,7 +61,7 @@ def merge_subtokens(doc, label="subtok"): merger = Matcher(doc.vocab) merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}]) matches = merger(doc) - spans = [doc[start : end + 1] for _, start, end in matches] + spans = filter_spans([doc[start : end + 1] for _, start, end in matches]) with doc.retokenize() as retokenizer: for span in spans: retokenizer.merge(span) diff --git a/spacy/pipeline/hooks.py b/spacy/pipeline/hooks.py index 38672cde0..b61a34c0e 100644 --- a/spacy/pipeline/hooks.py +++ b/spacy/pipeline/hooks.py @@ -5,9 +5,11 @@ from thinc.t2v import Pooling, max_pool, mean_pool from thinc.neural._classes.difference import Siamese, CauchySimilarity from .pipes import Pipe +from ..language import component from .._ml import link_vectors_to_models +@component("sentencizer_hook", assigns=["doc.user_hooks"]) class SentenceSegmenter(object): """A simple spaCy hook, to allow custom sentence boundary detection logic (that doesn't require the dependency parse). To change the sentence @@ -17,8 +19,6 @@ class SentenceSegmenter(object): and yield `Span` objects for each sentence. 
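# The merge_subtokens change above now drops overlapping matches via
# filter_spans, which keeps the longest non-overlapping spans - a small sketch:
from spacy.lang.en import English
from spacy.util import filter_spans

nlp = English()
doc = nlp("New York City is big")
spans = [doc[0:3], doc[1:3], doc[3:4]]
print(filter_spans(spans))   # expected: [New York City, is]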
""" - name = "sentencizer" - def __init__(self, vocab, strategy=None): self.vocab = vocab if strategy is None or strategy == "on_punct": @@ -44,6 +44,7 @@ class SentenceSegmenter(object): yield doc[start : len(doc)] +@component("similarity", assigns=["doc.user_hooks"]) class SimilarityHook(Pipe): """ Experimental: A pipeline component to install a hook for supervised @@ -58,8 +59,6 @@ class SimilarityHook(Pipe): Where W is a vector of dimension weights, initialized to 1. """ - name = "similarity" - def __init__(self, vocab, model=True, **cfg): self.vocab = vocab self.model = model diff --git a/spacy/pipeline/morphologizer.pyx b/spacy/pipeline/morphologizer.pyx index b14e2bec7..72e31f120 100644 --- a/spacy/pipeline/morphologizer.pyx +++ b/spacy/pipeline/morphologizer.pyx @@ -8,6 +8,7 @@ from thinc.api import chain from thinc.neural.util import to_categorical, copy_array, get_array_module from .. import util from .pipes import Pipe +from ..language import component from .._ml import Tok2Vec, build_morphologizer_model from .._ml import link_vectors_to_models, zero_init, flatten from .._ml import create_default_optimizer @@ -18,9 +19,9 @@ from ..vocab cimport Vocab from ..morphology cimport Morphology +@component("morphologizer", assigns=["token.morph", "token.pos"]) class Morphologizer(Pipe): - name = 'morphologizer' - + @classmethod def Model(cls, **cfg): if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'): diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx index 0607ac43d..d29cf9ce9 100644 --- a/spacy/pipeline/pipes.pyx +++ b/spacy/pipeline/pipes.pyx @@ -13,7 +13,6 @@ from thinc.misc import LayerNorm from thinc.neural.util import to_categorical from thinc.neural.util import get_array_module -from .functions import merge_subtokens from ..tokens.doc cimport Doc from ..syntax.nn_parser cimport Parser from ..syntax.ner cimport BiluoPushDown @@ -21,6 +20,8 @@ from ..syntax.arc_eager cimport ArcEager from ..morphology cimport Morphology from ..vocab cimport Vocab +from .functions import merge_subtokens +from ..language import Language, component from ..syntax import nonproj from ..attrs import POS, ID from ..parts_of_speech import X @@ -54,6 +55,10 @@ class Pipe(object): """Initialize a model for the pipe.""" raise NotImplementedError + @classmethod + def from_nlp(cls, nlp, **cfg): + return cls(nlp.vocab, **cfg) + def __init__(self, vocab, model=True, **cfg): """Create a new pipe instance.""" raise NotImplementedError @@ -223,11 +228,10 @@ class Pipe(object): return self +@component("tensorizer", assigns=["doc.tensor"]) class Tensorizer(Pipe): """Pre-train position-sensitive vectors for tokens.""" - name = "tensorizer" - @classmethod def Model(cls, output_size=300, **cfg): """Create a new statistical model for the class. @@ -362,14 +366,13 @@ class Tensorizer(Pipe): return sgd +@component("tagger", assigns=["token.tag", "token.pos"]) class Tagger(Pipe): """Pipeline component for part-of-speech tagging. 
DOCS: https://spacy.io/api/tagger """ - name = "tagger" - def __init__(self, vocab, model=True, **cfg): self.vocab = vocab self.model = model @@ -514,7 +517,6 @@ class Tagger(Pipe): orig_tag_map = dict(self.vocab.morphology.tag_map) new_tag_map = OrderedDict() for raw_text, annots_brackets in get_gold_tuples(): - _ = annots_brackets.pop() for annots, brackets in annots_brackets: ids, words, tags, heads, deps, ents = annots for tag in tags: @@ -657,13 +659,12 @@ class Tagger(Pipe): return self +@component("nn_labeller") class MultitaskObjective(Tagger): """Experimental: Assist training of a parser or tagger, by training a side-objective. """ - name = "nn_labeller" - def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg): self.vocab = vocab self.model = model @@ -898,12 +899,12 @@ class ClozeMultitask(Pipe): losses[self.name] += loss +@component("textcat", assigns=["doc.cats"]) class TextCategorizer(Pipe): """Pipeline component for text classification. DOCS: https://spacy.io/api/textcategorizer """ - name = 'textcat' @classmethod def Model(cls, nr_class=1, **cfg): @@ -1032,10 +1033,10 @@ class TextCategorizer(Pipe): return 1 def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): - for raw_text, annots_brackets in get_gold_tuples(): - cats = annots_brackets.pop() - for cat in cats: - self.add_label(cat) + for raw_text, annot_brackets in get_gold_tuples(): + for _, (cats, _2) in annot_brackets: + for cat in cats: + self.add_label(cat) if self.model is True: self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors") self.require_labels() @@ -1051,8 +1052,11 @@ cdef class DependencyParser(Parser): DOCS: https://spacy.io/api/dependencyparser """ - + # cdef classes can't have decorators, so we're defining this here name = "parser" + factory = "parser" + assigns = ["token.dep", "token.is_sent_start", "doc.sents"] + requires = [] TransitionSystem = ArcEager @property @@ -1097,8 +1101,10 @@ cdef class EntityRecognizer(Parser): DOCS: https://spacy.io/api/entityrecognizer """ - name = "ner" + factory = "ner" + assigns = ["doc.ents", "token.ent_iob", "token.ent_type"] + requires = [] TransitionSystem = BiluoPushDown nr_feature = 6 @@ -1129,12 +1135,16 @@ cdef class EntityRecognizer(Parser): return tuple(sorted(labels)) +@component( + "entity_linker", + requires=["doc.ents", "token.ent_iob", "token.ent_type"], + assigns=["token.ent_kb_id"] +) class EntityLinker(Pipe): """Pipeline component for named entity linking. DOCS: https://spacy.io/api/entitylinker """ - name = 'entity_linker' NIL = "NIL" # string used to refer to a non-existing link @classmethod @@ -1298,7 +1308,8 @@ class EntityLinker(Pipe): for ent in sent_doc.ents: entity_count += 1 - if ent.label_ in self.cfg.get("labels_discard", []): + to_discard = self.cfg.get("labels_discard", []) + if to_discard and ent.label_ in to_discard: # ignoring this entity - setting to NIL final_kb_ids.append(self.NIL) final_tensors.append(sentence_encoding) @@ -1404,13 +1415,13 @@ class EntityLinker(Pipe): raise NotImplementedError +@component("sentencizer", assigns=["token.is_sent_start", "doc.sents"]) class Sentencizer(object): """Segment the Doc into sentences using a rule-based strategy. 
DOCS: https://spacy.io/api/sentencizer """ - name = "sentencizer" default_punct_chars = ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', @@ -1436,6 +1447,10 @@ class Sentencizer(object): else: self.punct_chars = set(self.default_punct_chars) + @classmethod + def from_nlp(cls, nlp, **cfg): + return cls(**cfg) + def __call__(self, doc): """Apply the sentencizer to a Doc and set Token.is_sent_start. @@ -1502,4 +1517,9 @@ class Sentencizer(object): return self +# Cython classes can't be decorated, so we need to add the factories here +Language.factories["parser"] = lambda nlp, **cfg: DependencyParser.from_nlp(nlp, **cfg) +Language.factories["ner"] = lambda nlp, **cfg: EntityRecognizer.from_nlp(nlp, **cfg) + + __all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"] diff --git a/spacy/syntax/_parser_model.pyx b/spacy/syntax/_parser_model.pyx index ce3dcbfa5..77bd43ed7 100644 --- a/spacy/syntax/_parser_model.pyx +++ b/spacy/syntax/_parser_model.pyx @@ -19,7 +19,7 @@ from thinc.extra.search cimport Beam from thinc.api import chain, clone from thinc.v2v import Model, Maxout, Affine from thinc.misc import LayerNorm -from thinc.neural.ops import CupyOps +from thinc.neural.ops import CupyOps, NumpyOps from thinc.neural.util import get_array_module from thinc.linalg cimport Vec, VecVec cimport blis.cy @@ -440,28 +440,38 @@ cdef class precompute_hiddens: def backward(d_state_vector_ids, sgd=None): d_state_vector, token_ids = d_state_vector_ids d_state_vector = bp_nonlinearity(d_state_vector, sgd) - # This will usually be on GPU - if not isinstance(d_state_vector, self.ops.xp.ndarray): - d_state_vector = self.ops.xp.array(d_state_vector) d_tokens = bp_hiddens((d_state_vector, token_ids), sgd) return d_tokens return state_vector, backward def _nonlinearity(self, state_vector): + if isinstance(state_vector, numpy.ndarray): + ops = NumpyOps() + else: + ops = CupyOps() + if self.nP == 1: state_vector = state_vector.reshape(state_vector.shape[:-1]) mask = state_vector >= 0. state_vector *= mask else: - state_vector, mask = self.ops.maxout(state_vector) + state_vector, mask = ops.maxout(state_vector) def backprop_nonlinearity(d_best, sgd=None): + if isinstance(d_best, numpy.ndarray): + ops = NumpyOps() + else: + ops = CupyOps() + mask_ = ops.asarray(mask) + + # This will usually be on GPU + d_best = ops.asarray(d_best) # Fix nans (which can occur from unseen classes.) - d_best[self.ops.xp.isnan(d_best)] = 0. + d_best[ops.xp.isnan(d_best)] = 0. 
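# The nonlinearity above now chooses its ops backend from the array type rather
# than assuming self.ops; a hedged sketch of that CPU/GPU dispatch pattern:
import numpy
from thinc.neural.ops import NumpyOps, CupyOps

def get_ops_for(array):
    # numpy arrays stay on the CPU backend, everything else goes to cupy
    return NumpyOps() if isinstance(array, numpy.ndarray) else CupyOps()

ops = get_ops_for(numpy.zeros((2, 3), dtype="f"))
assert isinstance(ops, NumpyOps)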
if self.nP == 1: - d_best *= mask + d_best *= mask_ d_best = d_best.reshape((d_best.shape + (1,))) return d_best else: - return self.ops.backprop_maxout(d_best, mask, self.nP) + return ops.backprop_maxout(d_best, mask_, self.nP) return state_vector, backprop_nonlinearity diff --git a/spacy/syntax/arc_eager.pyx b/spacy/syntax/arc_eager.pyx index 5a7355061..eb39124ce 100644 --- a/spacy/syntax/arc_eager.pyx +++ b/spacy/syntax/arc_eager.pyx @@ -342,7 +342,6 @@ cdef class ArcEager(TransitionSystem): actions[RIGHT][label] = 1 actions[REDUCE][label] = 1 for raw_text, sents in kwargs.get('gold_parses', []): - _ = sents.pop() for (ids, words, tags, heads, labels, iob), ctnts in sents: heads, labels = nonproj.projectivize(heads, labels) for child, head, label in zip(ids, heads, labels): diff --git a/spacy/syntax/ner.pyx b/spacy/syntax/ner.pyx index 3bd096463..9f8ad418c 100644 --- a/spacy/syntax/ner.pyx +++ b/spacy/syntax/ner.pyx @@ -73,7 +73,6 @@ cdef class BiluoPushDown(TransitionSystem): actions[action][entity_type] = 1 moves = ('M', 'B', 'I', 'L', 'U') for raw_text, sents in kwargs.get('gold_parses', []): - _ = sents.pop() for (ids, words, tags, heads, labels, biluo), _ in sents: for i, ner_tag in enumerate(biluo): if ner_tag != 'O' and ner_tag != '-': diff --git a/spacy/syntax/nn_parser.pyx b/spacy/syntax/nn_parser.pyx index dd19b0e43..0ed7e6952 100644 --- a/spacy/syntax/nn_parser.pyx +++ b/spacy/syntax/nn_parser.pyx @@ -57,7 +57,10 @@ cdef class Parser: subword_features = util.env_opt('subword_features', cfg.get('subword_features', True)) conv_depth = util.env_opt('conv_depth', cfg.get('conv_depth', 4)) + conv_window = util.env_opt('conv_window', cfg.get('conv_depth', 1)) + t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3)) bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0)) + self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0)) if depth != 1: raise ValueError(TempErrors.T004.format(value=depth)) parser_maxout_pieces = util.env_opt('parser_maxout_pieces', @@ -69,6 +72,8 @@ cdef class Parser: pretrained_vectors = cfg.get('pretrained_vectors', None) tok2vec = Tok2Vec(token_vector_width, embed_size, conv_depth=conv_depth, + conv_window=conv_window, + cnn_maxout_pieces=t2v_pieces, subword_features=subword_features, pretrained_vectors=pretrained_vectors, bilstm_depth=bilstm_depth) @@ -90,7 +95,12 @@ cdef class Parser: 'hidden_width': hidden_width, 'maxout_pieces': parser_maxout_pieces, 'pretrained_vectors': pretrained_vectors, - 'bilstm_depth': bilstm_depth + 'bilstm_depth': bilstm_depth, + 'self_attn_depth': self_attn_depth, + 'conv_depth': conv_depth, + 'conv_window': conv_window, + 'embed_size': embed_size, + 'cnn_maxout_pieces': t2v_pieces } return ParserModel(tok2vec, lower, upper), cfg @@ -128,6 +138,10 @@ cdef class Parser: self._multitasks = [] self._rehearsal_model = None + @classmethod + def from_nlp(cls, nlp, **cfg): + return cls(nlp.vocab, **cfg) + def __reduce__(self): return (Parser, (self.vocab, self.moves, self.model), None, None) @@ -602,12 +616,11 @@ cdef class Parser: doc_sample = [] gold_sample = [] for raw_text, annots_brackets in islice(get_gold_tuples(), 1000): - _ = annots_brackets.pop() for annots, brackets in annots_brackets: ids, words, tags, heads, deps, ents = annots doc_sample.append(Doc(self.vocab, words=words)) gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags, - heads=heads, deps=deps, ents=ents)) + heads=heads, deps=deps, entities=ents)) 
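# The fix above passes NER annotations to GoldParse via the "entities" keyword;
# a hedged usage sketch with hypothetical words, tags and BILUO labels:
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.gold import GoldParse

vocab = Vocab()
words = ["London", "calling"]
doc = Doc(vocab, words=words)
gold = GoldParse(doc, words=words, tags=["NNP", "VBG"], entities=["U-GPE", "O"])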
self.model.begin_training(doc_sample, gold_sample) if pipeline is not None: self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg) diff --git a/spacy/tests/lang/sv/test_noun_chunks.py b/spacy/tests/lang/sv/test_noun_chunks.py index aafe69139..ac7c066ba 100644 --- a/spacy/tests/lang/sv/test_noun_chunks.py +++ b/spacy/tests/lang/sv/test_noun_chunks.py @@ -2,9 +2,9 @@ from __future__ import unicode_literals import pytest -from spacy.lang.sv.syntax_iterators import SYNTAX_ITERATORS from ...util import get_doc + SV_NP_TEST_EXAMPLES = [ ( "En student läste en bok", # A student read a book @@ -45,4 +45,3 @@ def test_sv_noun_chunks(sv_tokenizer, text, pos, deps, heads, expected_noun_chun assert len(noun_chunks) == len(expected_noun_chunks) for i, np in enumerate(noun_chunks): assert np.text == expected_noun_chunks[i] - diff --git a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index 730756524..e4584d03a 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -17,7 +17,7 @@ def matcher(en_vocab): } matcher = Matcher(en_vocab) for key, patterns in rules.items(): - matcher.add(key, None, *patterns) + matcher.add(key, patterns) return matcher @@ -25,11 +25,11 @@ def test_matcher_from_api_docs(en_vocab): matcher = Matcher(en_vocab) pattern = [{"ORTH": "test"}] assert len(matcher) == 0 - matcher.add("Rule", None, pattern) + matcher.add("Rule", [pattern]) assert len(matcher) == 1 matcher.remove("Rule") assert "Rule" not in matcher - matcher.add("Rule", None, pattern) + matcher.add("Rule", [pattern]) assert "Rule" in matcher on_match, patterns = matcher.get("Rule") assert len(patterns[0]) @@ -52,7 +52,7 @@ def test_matcher_from_usage_docs(en_vocab): token.vocab[token.text].norm_ = "happy emoji" matcher = Matcher(en_vocab) - matcher.add("HAPPY", label_sentiment, *pos_patterns) + matcher.add("HAPPY", pos_patterns, on_match=label_sentiment) matcher(doc) assert doc.sentiment != 0 assert doc[1].norm_ == "happy emoji" @@ -60,11 +60,33 @@ def test_matcher_from_usage_docs(en_vocab): def test_matcher_len_contains(matcher): assert len(matcher) == 3 - matcher.add("TEST", None, [{"ORTH": "test"}]) + matcher.add("TEST", [[{"ORTH": "test"}]]) assert "TEST" in matcher assert "TEST2" not in matcher +def test_matcher_add_new_old_api(en_vocab): + doc = Doc(en_vocab, words=["a", "b"]) + patterns = [[{"TEXT": "a"}], [{"TEXT": "a"}, {"TEXT": "b"}]] + matcher = Matcher(en_vocab) + matcher.add("OLD_API", None, *patterns) + assert len(matcher(doc)) == 2 + matcher = Matcher(en_vocab) + on_match = Mock() + matcher.add("OLD_API_CALLBACK", on_match, *patterns) + assert len(matcher(doc)) == 2 + assert on_match.call_count == 2 + # New API: add(key: str, patterns: List[List[dict]], on_match: Callable) + matcher = Matcher(en_vocab) + matcher.add("NEW_API", patterns) + assert len(matcher(doc)) == 2 + matcher = Matcher(en_vocab) + on_match = Mock() + matcher.add("NEW_API_CALLBACK", patterns, on_match=on_match) + assert len(matcher(doc)) == 2 + assert on_match.call_count == 2 + + def test_matcher_no_match(matcher): doc = Doc(matcher.vocab, words=["I", "like", "cheese", "."]) assert matcher(doc) == [] @@ -100,12 +122,12 @@ def test_matcher_empty_dict(en_vocab): """Test matcher allows empty token specs, meaning match on any token.""" matcher = Matcher(en_vocab) doc = Doc(matcher.vocab, words=["a", "b", "c"]) - matcher.add("A.C", None, [{"ORTH": "a"}, {}, {"ORTH": "c"}]) + matcher.add("A.C", [[{"ORTH": "a"}, {}, {"ORTH": "c"}]]) matches = 
matcher(doc) assert len(matches) == 1 assert matches[0][1:] == (0, 3) matcher = Matcher(en_vocab) - matcher.add("A.", None, [{"ORTH": "a"}, {}]) + matcher.add("A.", [[{"ORTH": "a"}, {}]]) matches = matcher(doc) assert matches[0][1:] == (0, 2) @@ -114,7 +136,7 @@ def test_matcher_operator_shadow(en_vocab): matcher = Matcher(en_vocab) doc = Doc(matcher.vocab, words=["a", "b", "c"]) pattern = [{"ORTH": "a"}, {"IS_ALPHA": True, "OP": "+"}, {"ORTH": "c"}] - matcher.add("A.C", None, pattern) + matcher.add("A.C", [pattern]) matches = matcher(doc) assert len(matches) == 1 assert matches[0][1:] == (0, 3) @@ -136,12 +158,12 @@ def test_matcher_match_zero(matcher): {"IS_PUNCT": True}, {"ORTH": '"'}, ] - matcher.add("Quote", None, pattern1) + matcher.add("Quote", [pattern1]) doc = Doc(matcher.vocab, words=words1) assert len(matcher(doc)) == 1 doc = Doc(matcher.vocab, words=words2) assert len(matcher(doc)) == 0 - matcher.add("Quote", None, pattern2) + matcher.add("Quote", [pattern2]) assert len(matcher(doc)) == 0 @@ -149,7 +171,7 @@ def test_matcher_match_zero_plus(matcher): words = 'He said , " some words " ...'.split() pattern = [{"ORTH": '"'}, {"OP": "*", "IS_PUNCT": False}, {"ORTH": '"'}] matcher = Matcher(matcher.vocab) - matcher.add("Quote", None, pattern) + matcher.add("Quote", [pattern]) doc = Doc(matcher.vocab, words=words) assert len(matcher(doc)) == 1 @@ -160,11 +182,8 @@ def test_matcher_match_one_plus(matcher): doc = Doc(control.vocab, words=["Philippe", "Philippe"]) m = control(doc) assert len(m) == 2 - matcher.add( - "KleenePhilippe", - None, - [{"ORTH": "Philippe", "OP": "1"}, {"ORTH": "Philippe", "OP": "+"}], - ) + pattern = [{"ORTH": "Philippe", "OP": "1"}, {"ORTH": "Philippe", "OP": "+"}] + matcher.add("KleenePhilippe", [pattern]) m = matcher(doc) assert len(m) == 1 @@ -172,7 +191,7 @@ def test_matcher_match_one_plus(matcher): def test_matcher_any_token_operator(en_vocab): """Test that patterns with "any token" {} work with operators.""" matcher = Matcher(en_vocab) - matcher.add("TEST", None, [{"ORTH": "test"}, {"OP": "*"}]) + matcher.add("TEST", [[{"ORTH": "test"}, {"OP": "*"}]]) doc = Doc(en_vocab, words=["test", "hello", "world"]) matches = [doc[start:end].text for _, start, end in matcher(doc)] assert len(matches) == 3 @@ -186,7 +205,7 @@ def test_matcher_extension_attribute(en_vocab): get_is_fruit = lambda token: token.text in ("apple", "banana") Token.set_extension("is_fruit", getter=get_is_fruit, force=True) pattern = [{"ORTH": "an"}, {"_": {"is_fruit": True}}] - matcher.add("HAVING_FRUIT", None, pattern) + matcher.add("HAVING_FRUIT", [pattern]) doc = Doc(en_vocab, words=["an", "apple"]) matches = matcher(doc) assert len(matches) == 1 @@ -198,7 +217,7 @@ def test_matcher_extension_attribute(en_vocab): def test_matcher_set_value(en_vocab): matcher = Matcher(en_vocab) pattern = [{"ORTH": {"IN": ["an", "a"]}}] - matcher.add("A_OR_AN", None, pattern) + matcher.add("A_OR_AN", [pattern]) doc = Doc(en_vocab, words=["an", "a", "apple"]) matches = matcher(doc) assert len(matches) == 2 @@ -210,7 +229,7 @@ def test_matcher_set_value(en_vocab): def test_matcher_set_value_operator(en_vocab): matcher = Matcher(en_vocab) pattern = [{"ORTH": {"IN": ["a", "the"]}, "OP": "?"}, {"ORTH": "house"}] - matcher.add("DET_HOUSE", None, pattern) + matcher.add("DET_HOUSE", [pattern]) doc = Doc(en_vocab, words=["In", "a", "house"]) matches = matcher(doc) assert len(matches) == 2 @@ -222,7 +241,7 @@ def test_matcher_set_value_operator(en_vocab): def test_matcher_regex(en_vocab): matcher = Matcher(en_vocab) 
pattern = [{"ORTH": {"REGEX": r"(?:a|an)"}}] - matcher.add("A_OR_AN", None, pattern) + matcher.add("A_OR_AN", [pattern]) doc = Doc(en_vocab, words=["an", "a", "hi"]) matches = matcher(doc) assert len(matches) == 2 @@ -234,7 +253,7 @@ def test_matcher_regex(en_vocab): def test_matcher_regex_shape(en_vocab): matcher = Matcher(en_vocab) pattern = [{"SHAPE": {"REGEX": r"^[^x]+$"}}] - matcher.add("NON_ALPHA", None, pattern) + matcher.add("NON_ALPHA", [pattern]) doc = Doc(en_vocab, words=["99", "problems", "!"]) matches = matcher(doc) assert len(matches) == 2 @@ -246,7 +265,7 @@ def test_matcher_regex_shape(en_vocab): def test_matcher_compare_length(en_vocab): matcher = Matcher(en_vocab) pattern = [{"LENGTH": {">=": 2}}] - matcher.add("LENGTH_COMPARE", None, pattern) + matcher.add("LENGTH_COMPARE", [pattern]) doc = Doc(en_vocab, words=["a", "aa", "aaa"]) matches = matcher(doc) assert len(matches) == 2 @@ -260,7 +279,7 @@ def test_matcher_extension_set_membership(en_vocab): get_reversed = lambda token: "".join(reversed(token.text)) Token.set_extension("reversed", getter=get_reversed, force=True) pattern = [{"_": {"reversed": {"IN": ["eyb", "ih"]}}}] - matcher.add("REVERSED", None, pattern) + matcher.add("REVERSED", [pattern]) doc = Doc(en_vocab, words=["hi", "bye", "hello"]) matches = matcher(doc) assert len(matches) == 2 @@ -328,9 +347,9 @@ def dependency_matcher(en_vocab): ] matcher = DependencyMatcher(en_vocab) - matcher.add("pattern1", None, pattern1) - matcher.add("pattern2", None, pattern2) - matcher.add("pattern3", None, pattern3) + matcher.add("pattern1", [pattern1]) + matcher.add("pattern2", [pattern2]) + matcher.add("pattern3", [pattern3]) return matcher @@ -347,6 +366,14 @@ def test_dependency_matcher_compile(dependency_matcher): # assert matches[2][1] == [[4, 3, 2]] +def test_matcher_basic_check(en_vocab): + matcher = Matcher(en_vocab) + # Potential mistake: pass in pattern instead of list of patterns + pattern = [{"TEXT": "hello"}, {"TEXT": "world"}] + with pytest.raises(ValueError): + matcher.add("TEST", pattern) + + def test_attr_pipeline_checks(en_vocab): doc1 = Doc(en_vocab, words=["Test"]) doc1.is_parsed = True @@ -355,7 +382,7 @@ def test_attr_pipeline_checks(en_vocab): doc3 = Doc(en_vocab, words=["Test"]) # DEP requires is_parsed matcher = Matcher(en_vocab) - matcher.add("TEST", None, [{"DEP": "a"}]) + matcher.add("TEST", [[{"DEP": "a"}]]) matcher(doc1) with pytest.raises(ValueError): matcher(doc2) @@ -364,7 +391,7 @@ def test_attr_pipeline_checks(en_vocab): # TAG, POS, LEMMA require is_tagged for attr in ("TAG", "POS", "LEMMA"): matcher = Matcher(en_vocab) - matcher.add("TEST", None, [{attr: "a"}]) + matcher.add("TEST", [[{attr: "a"}]]) matcher(doc2) with pytest.raises(ValueError): matcher(doc1) @@ -372,12 +399,12 @@ def test_attr_pipeline_checks(en_vocab): matcher(doc3) # TEXT/ORTH only require tokens matcher = Matcher(en_vocab) - matcher.add("TEST", None, [{"ORTH": "a"}]) + matcher.add("TEST", [[{"ORTH": "a"}]]) matcher(doc1) matcher(doc2) matcher(doc3) matcher = Matcher(en_vocab) - matcher.add("TEST", None, [{"TEXT": "a"}]) + matcher.add("TEST", [[{"TEXT": "a"}]]) matcher(doc1) matcher(doc2) matcher(doc3) @@ -407,7 +434,7 @@ def test_attr_pipeline_checks(en_vocab): def test_matcher_schema_token_attributes(en_vocab, pattern, text): matcher = Matcher(en_vocab) doc = Doc(en_vocab, words=text.split(" ")) - matcher.add("Rule", None, pattern) + matcher.add("Rule", [pattern]) assert len(matcher) == 1 matches = matcher(doc) assert len(matches) == 1 @@ -417,7 +444,7 @@ def 
test_matcher_valid_callback(en_vocab): """Test that on_match can only be None or callable.""" matcher = Matcher(en_vocab) with pytest.raises(ValueError): - matcher.add("TEST", [], [{"TEXT": "test"}]) + matcher.add("TEST", [[{"TEXT": "test"}]], on_match=[]) matcher(Doc(en_vocab, words=["test"])) @@ -425,7 +452,7 @@ def test_matcher_callback(en_vocab): mock = Mock() matcher = Matcher(en_vocab) pattern = [{"ORTH": "test"}] - matcher.add("Rule", mock, pattern) + matcher.add("Rule", [pattern], on_match=mock) doc = Doc(en_vocab, words=["This", "is", "a", "test", "."]) matches = matcher(doc) mock.assert_called_once_with(matcher, doc, 0, matches) diff --git a/spacy/tests/matcher/test_matcher_logic.py b/spacy/tests/matcher/test_matcher_logic.py index b9c435c17..240ace537 100644 --- a/spacy/tests/matcher/test_matcher_logic.py +++ b/spacy/tests/matcher/test_matcher_logic.py @@ -55,7 +55,7 @@ def test_greedy_matching(doc, text, pattern, re_pattern): """Test that the greedy matching behavior of the * op is consistant with other re implementations.""" matcher = Matcher(doc.vocab) - matcher.add(re_pattern, None, pattern) + matcher.add(re_pattern, [pattern]) matches = matcher(doc) re_matches = [m.span() for m in re.finditer(re_pattern, text)] for match, re_match in zip(matches, re_matches): @@ -77,7 +77,7 @@ def test_match_consuming(doc, text, pattern, re_pattern): """Test that matcher.__call__ consumes tokens on a match similar to re.findall.""" matcher = Matcher(doc.vocab) - matcher.add(re_pattern, None, pattern) + matcher.add(re_pattern, [pattern]) matches = matcher(doc) re_matches = [m.span() for m in re.finditer(re_pattern, text)] assert len(matches) == len(re_matches) @@ -111,7 +111,7 @@ def test_operator_combos(en_vocab): pattern.append({"ORTH": part[0], "OP": "+"}) else: pattern.append({"ORTH": part}) - matcher.add("PATTERN", None, pattern) + matcher.add("PATTERN", [pattern]) matches = matcher(doc) if result: assert matches, (string, pattern_str) @@ -123,7 +123,7 @@ def test_matcher_end_zero_plus(en_vocab): """Test matcher works when patterns end with * operator. 
(issue 1450)""" matcher = Matcher(en_vocab) pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}] - matcher.add("TSTEND", None, pattern) + matcher.add("TSTEND", [pattern]) nlp = lambda string: Doc(matcher.vocab, words=string.split()) assert len(matcher(nlp("a"))) == 1 assert len(matcher(nlp("a b"))) == 2 @@ -140,7 +140,7 @@ def test_matcher_sets_return_correct_tokens(en_vocab): [{"LOWER": {"IN": ["one"]}}], [{"LOWER": {"IN": ["two"]}}], ] - matcher.add("TEST", None, *patterns) + matcher.add("TEST", patterns) doc = Doc(en_vocab, words="zero one two three".split()) matches = matcher(doc) texts = [Span(doc, s, e, label=L).text for L, s, e in matches] @@ -154,7 +154,7 @@ def test_matcher_remove(): pattern = [{"ORTH": "test"}, {"OP": "?"}] assert len(matcher) == 0 - matcher.add("Rule", None, pattern) + matcher.add("Rule", [pattern]) assert "Rule" in matcher # should give two matches diff --git a/spacy/tests/matcher/test_pattern_validation.py b/spacy/tests/matcher/test_pattern_validation.py index 80f08e40c..2db2f9eb3 100644 --- a/spacy/tests/matcher/test_pattern_validation.py +++ b/spacy/tests/matcher/test_pattern_validation.py @@ -50,7 +50,7 @@ def validator(): def test_matcher_pattern_validation(en_vocab, pattern): matcher = Matcher(en_vocab, validate=True) with pytest.raises(MatchPatternError): - matcher.add("TEST", None, pattern) + matcher.add("TEST", [pattern]) @pytest.mark.parametrize("pattern,n_errors,_", TEST_PATTERNS) @@ -71,6 +71,6 @@ def test_minimal_pattern_validation(en_vocab, pattern, n_errors, n_min_errors): matcher = Matcher(en_vocab) if n_min_errors > 0: with pytest.raises(ValueError): - matcher.add("TEST", None, pattern) + matcher.add("TEST", [pattern]) elif n_errors == 0: - matcher.add("TEST", None, pattern) + matcher.add("TEST", [pattern]) diff --git a/spacy/tests/matcher/test_phrase_matcher.py b/spacy/tests/matcher/test_phrase_matcher.py index 2a7532e85..7a6585e06 100644 --- a/spacy/tests/matcher/test_phrase_matcher.py +++ b/spacy/tests/matcher/test_phrase_matcher.py @@ -13,53 +13,75 @@ def test_matcher_phrase_matcher(en_vocab): # intermediate phrase pattern = Doc(en_vocab, words=["Google", "Now"]) matcher = PhraseMatcher(en_vocab) - matcher.add("COMPANY", None, pattern) + matcher.add("COMPANY", [pattern]) assert len(matcher(doc)) == 1 # initial token pattern = Doc(en_vocab, words=["I"]) matcher = PhraseMatcher(en_vocab) - matcher.add("I", None, pattern) + matcher.add("I", [pattern]) assert len(matcher(doc)) == 1 # initial phrase pattern = Doc(en_vocab, words=["I", "like"]) matcher = PhraseMatcher(en_vocab) - matcher.add("ILIKE", None, pattern) + matcher.add("ILIKE", [pattern]) assert len(matcher(doc)) == 1 # final token pattern = Doc(en_vocab, words=["best"]) matcher = PhraseMatcher(en_vocab) - matcher.add("BEST", None, pattern) + matcher.add("BEST", [pattern]) assert len(matcher(doc)) == 1 # final phrase pattern = Doc(en_vocab, words=["Now", "best"]) matcher = PhraseMatcher(en_vocab) - matcher.add("NOWBEST", None, pattern) + matcher.add("NOWBEST", [pattern]) assert len(matcher(doc)) == 1 def test_phrase_matcher_length(en_vocab): matcher = PhraseMatcher(en_vocab) assert len(matcher) == 0 - matcher.add("TEST", None, Doc(en_vocab, words=["test"])) + matcher.add("TEST", [Doc(en_vocab, words=["test"])]) assert len(matcher) == 1 - matcher.add("TEST2", None, Doc(en_vocab, words=["test2"])) + matcher.add("TEST2", [Doc(en_vocab, words=["test2"])]) assert len(matcher) == 2 def test_phrase_matcher_contains(en_vocab): matcher = PhraseMatcher(en_vocab) - matcher.add("TEST", None, 
Doc(en_vocab, words=["test"])) + matcher.add("TEST", [Doc(en_vocab, words=["test"])]) assert "TEST" in matcher assert "TEST2" not in matcher +def test_phrase_matcher_add_new_api(en_vocab): + doc = Doc(en_vocab, words=["a", "b"]) + patterns = [Doc(en_vocab, words=["a"]), Doc(en_vocab, words=["a", "b"])] + matcher = PhraseMatcher(en_vocab) + matcher.add("OLD_API", None, *patterns) + assert len(matcher(doc)) == 2 + matcher = PhraseMatcher(en_vocab) + on_match = Mock() + matcher.add("OLD_API_CALLBACK", on_match, *patterns) + assert len(matcher(doc)) == 2 + assert on_match.call_count == 2 + # New API: add(key: str, patterns: List[List[dict]], on_match: Callable) + matcher = PhraseMatcher(en_vocab) + matcher.add("NEW_API", patterns) + assert len(matcher(doc)) == 2 + matcher = PhraseMatcher(en_vocab) + on_match = Mock() + matcher.add("NEW_API_CALLBACK", patterns, on_match=on_match) + assert len(matcher(doc)) == 2 + assert on_match.call_count == 2 + + def test_phrase_matcher_repeated_add(en_vocab): matcher = PhraseMatcher(en_vocab) # match ID only gets added once - matcher.add("TEST", None, Doc(en_vocab, words=["like"])) - matcher.add("TEST", None, Doc(en_vocab, words=["like"])) - matcher.add("TEST", None, Doc(en_vocab, words=["like"])) - matcher.add("TEST", None, Doc(en_vocab, words=["like"])) + matcher.add("TEST", [Doc(en_vocab, words=["like"])]) + matcher.add("TEST", [Doc(en_vocab, words=["like"])]) + matcher.add("TEST", [Doc(en_vocab, words=["like"])]) + matcher.add("TEST", [Doc(en_vocab, words=["like"])]) doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) assert "TEST" in matcher assert "TEST2" not in matcher @@ -68,8 +90,8 @@ def test_phrase_matcher_repeated_add(en_vocab): def test_phrase_matcher_remove(en_vocab): matcher = PhraseMatcher(en_vocab) - matcher.add("TEST1", None, Doc(en_vocab, words=["like"])) - matcher.add("TEST2", None, Doc(en_vocab, words=["best"])) + matcher.add("TEST1", [Doc(en_vocab, words=["like"])]) + matcher.add("TEST2", [Doc(en_vocab, words=["best"])]) doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) assert "TEST1" in matcher assert "TEST2" in matcher @@ -95,9 +117,9 @@ def test_phrase_matcher_remove(en_vocab): def test_phrase_matcher_overlapping_with_remove(en_vocab): matcher = PhraseMatcher(en_vocab) - matcher.add("TEST", None, Doc(en_vocab, words=["like"])) + matcher.add("TEST", [Doc(en_vocab, words=["like"])]) # TEST2 is added alongside TEST - matcher.add("TEST2", None, Doc(en_vocab, words=["like"])) + matcher.add("TEST2", [Doc(en_vocab, words=["like"])]) doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) assert "TEST" in matcher assert len(matcher) == 2 @@ -122,7 +144,7 @@ def test_phrase_matcher_string_attrs(en_vocab): pos2 = ["INTJ", "PUNCT", "PRON", "VERB", "NOUN", "ADV", "ADV"] pattern = get_doc(en_vocab, words=words1, pos=pos1) matcher = PhraseMatcher(en_vocab, attr="POS") - matcher.add("TEST", None, pattern) + matcher.add("TEST", [pattern]) doc = get_doc(en_vocab, words=words2, pos=pos2) matches = matcher(doc) assert len(matches) == 1 @@ -140,7 +162,7 @@ def test_phrase_matcher_string_attrs_negative(en_vocab): pos2 = ["X", "X", "X"] pattern = get_doc(en_vocab, words=words1, pos=pos1) matcher = PhraseMatcher(en_vocab, attr="POS") - matcher.add("TEST", None, pattern) + matcher.add("TEST", [pattern]) doc = get_doc(en_vocab, words=words2, pos=pos2) matches = matcher(doc) assert len(matches) == 0 @@ -151,7 +173,7 @@ def test_phrase_matcher_bool_attrs(en_vocab): words2 = ["No", "problem", ",", "he", "said", "."] 
pattern = Doc(en_vocab, words=words1) matcher = PhraseMatcher(en_vocab, attr="IS_PUNCT") - matcher.add("TEST", None, pattern) + matcher.add("TEST", [pattern]) doc = Doc(en_vocab, words=words2) matches = matcher(doc) assert len(matches) == 2 @@ -173,15 +195,15 @@ def test_phrase_matcher_validation(en_vocab): doc3 = Doc(en_vocab, words=["Test"]) matcher = PhraseMatcher(en_vocab, validate=True) with pytest.warns(UserWarning): - matcher.add("TEST1", None, doc1) + matcher.add("TEST1", [doc1]) with pytest.warns(UserWarning): - matcher.add("TEST2", None, doc2) + matcher.add("TEST2", [doc2]) with pytest.warns(None) as record: - matcher.add("TEST3", None, doc3) + matcher.add("TEST3", [doc3]) assert not record.list matcher = PhraseMatcher(en_vocab, attr="POS", validate=True) with pytest.warns(None) as record: - matcher.add("TEST4", None, doc2) + matcher.add("TEST4", [doc2]) assert not record.list @@ -198,24 +220,24 @@ def test_attr_pipeline_checks(en_vocab): doc3 = Doc(en_vocab, words=["Test"]) # DEP requires is_parsed matcher = PhraseMatcher(en_vocab, attr="DEP") - matcher.add("TEST1", None, doc1) + matcher.add("TEST1", [doc1]) with pytest.raises(ValueError): - matcher.add("TEST2", None, doc2) + matcher.add("TEST2", [doc2]) with pytest.raises(ValueError): - matcher.add("TEST3", None, doc3) + matcher.add("TEST3", [doc3]) # TAG, POS, LEMMA require is_tagged for attr in ("TAG", "POS", "LEMMA"): matcher = PhraseMatcher(en_vocab, attr=attr) - matcher.add("TEST2", None, doc2) + matcher.add("TEST2", [doc2]) with pytest.raises(ValueError): - matcher.add("TEST1", None, doc1) + matcher.add("TEST1", [doc1]) with pytest.raises(ValueError): - matcher.add("TEST3", None, doc3) + matcher.add("TEST3", [doc3]) # TEXT/ORTH only require tokens matcher = PhraseMatcher(en_vocab, attr="ORTH") - matcher.add("TEST3", None, doc3) + matcher.add("TEST3", [doc3]) matcher = PhraseMatcher(en_vocab, attr="TEXT") - matcher.add("TEST3", None, doc3) + matcher.add("TEST3", [doc3]) def test_phrase_matcher_callback(en_vocab): @@ -223,7 +245,7 @@ def test_phrase_matcher_callback(en_vocab): doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) pattern = Doc(en_vocab, words=["Google", "Now"]) matcher = PhraseMatcher(en_vocab) - matcher.add("COMPANY", mock, pattern) + matcher.add("COMPANY", [pattern], on_match=mock) matches = matcher(doc) mock.assert_called_once_with(matcher, doc, 0, matches) @@ -234,5 +256,13 @@ def test_phrase_matcher_remove_overlapping_patterns(en_vocab): pattern2 = Doc(en_vocab, words=["this", "is"]) pattern3 = Doc(en_vocab, words=["this", "is", "a"]) pattern4 = Doc(en_vocab, words=["this", "is", "a", "word"]) - matcher.add("THIS", None, pattern1, pattern2, pattern3, pattern4) + matcher.add("THIS", [pattern1, pattern2, pattern3, pattern4]) matcher.remove("THIS") + + +def test_phrase_matcher_basic_check(en_vocab): + matcher = PhraseMatcher(en_vocab) + # Potential mistake: pass in pattern instead of list of patterns + pattern = Doc(en_vocab, words=["hello", "world"]) + with pytest.raises(ValueError): + matcher.add("TEST", pattern) diff --git a/spacy/tests/pipeline/test_analysis.py b/spacy/tests/pipeline/test_analysis.py new file mode 100644 index 000000000..198f11bcd --- /dev/null +++ b/spacy/tests/pipeline/test_analysis.py @@ -0,0 +1,168 @@ +# coding: utf8 +from __future__ import unicode_literals + +import spacy.language +from spacy.language import Language, component +from spacy.analysis import print_summary, validate_attrs +from spacy.analysis import get_assigns_for_attr, get_requires_for_attr +from 
spacy.compat import is_python2 +from mock import Mock, ANY +import pytest + + +def test_component_decorator_function(): + @component(name="test") + def test_component(doc): + """docstring""" + return doc + + assert test_component.name == "test" + if not is_python2: + assert test_component.__doc__ == "docstring" + assert test_component("foo") == "foo" + + +def test_component_decorator_class(): + @component(name="test") + class TestComponent(object): + """docstring1""" + + foo = "bar" + + def __call__(self, doc): + """docstring2""" + return doc + + def custom(self, x): + """docstring3""" + return x + + assert TestComponent.name == "test" + assert TestComponent.foo == "bar" + assert hasattr(TestComponent, "custom") + test_component = TestComponent() + assert test_component.foo == "bar" + assert test_component("foo") == "foo" + assert hasattr(test_component, "custom") + assert test_component.custom("bar") == "bar" + if not is_python2: + assert TestComponent.__doc__ == "docstring1" + assert TestComponent.__call__.__doc__ == "docstring2" + assert TestComponent.custom.__doc__ == "docstring3" + assert test_component.__doc__ == "docstring1" + assert test_component.__call__.__doc__ == "docstring2" + assert test_component.custom.__doc__ == "docstring3" + + +def test_component_decorator_assigns(): + spacy.language.ENABLE_PIPELINE_ANALYSIS = True + + @component("c1", assigns=["token.tag", "doc.tensor"]) + def test_component1(doc): + return doc + + @component( + "c2", requires=["token.tag", "token.pos"], assigns=["token.lemma", "doc.tensor"] + ) + def test_component2(doc): + return doc + + @component("c3", requires=["token.lemma"], assigns=["token._.custom_lemma"]) + def test_component3(doc): + return doc + + assert "c1" in Language.factories + assert "c2" in Language.factories + assert "c3" in Language.factories + + nlp = Language() + nlp.add_pipe(test_component1) + with pytest.warns(UserWarning): + nlp.add_pipe(test_component2) + nlp.add_pipe(test_component3) + assigns_tensor = get_assigns_for_attr(nlp.pipeline, "doc.tensor") + assert [name for name, _ in assigns_tensor] == ["c1", "c2"] + test_component4 = nlp.create_pipe("c1") + assert test_component4.name == "c1" + assert test_component4.factory == "c1" + nlp.add_pipe(test_component4, name="c4") + assert nlp.pipe_names == ["c1", "c2", "c3", "c4"] + assert "c4" not in Language.factories + assert nlp.pipe_factories["c1"] == "c1" + assert nlp.pipe_factories["c4"] == "c1" + assigns_tensor = get_assigns_for_attr(nlp.pipeline, "doc.tensor") + assert [name for name, _ in assigns_tensor] == ["c1", "c2", "c4"] + requires_pos = get_requires_for_attr(nlp.pipeline, "token.pos") + assert [name for name, _ in requires_pos] == ["c2"] + assert print_summary(nlp, no_print=True) + assert nlp("hello world") + + +def test_component_factories_from_nlp(): + """Test that class components can implement a from_nlp classmethod that + gives them access to the nlp object and config via the factory.""" + + class TestComponent5(object): + def __call__(self, doc): + return doc + + mock = Mock() + mock.return_value = TestComponent5() + TestComponent5.from_nlp = classmethod(mock) + TestComponent5 = component("c5")(TestComponent5) + + assert "c5" in Language.factories + nlp = Language() + pipe = nlp.create_pipe("c5", config={"foo": "bar"}) + nlp.add_pipe(pipe) + assert nlp("hello world") + # The first argument here is the class itself, so we're accepting any here + mock.assert_called_once_with(ANY, nlp, foo="bar") + + +def test_analysis_validate_attrs_valid(): + attrs = 
["doc.sents", "doc.ents", "token.tag", "token._.xyz", "span._.xyz"] + assert validate_attrs(attrs) + for attr in attrs: + assert validate_attrs([attr]) + with pytest.raises(ValueError): + validate_attrs(["doc.sents", "doc.xyz"]) + + +@pytest.mark.parametrize( + "attr", + [ + "doc", + "doc_ents", + "doc.xyz", + "token.xyz", + "token.tag_", + "token.tag.xyz", + "token._.xyz.abc", + "span.label", + ], +) +def test_analysis_validate_attrs_invalid(attr): + with pytest.raises(ValueError): + validate_attrs([attr]) + + +def test_analysis_validate_attrs_remove_pipe(): + """Test that attributes are validated correctly on remove.""" + spacy.language.ENABLE_PIPELINE_ANALYSIS = True + + @component("c1", assigns=["token.tag"]) + def c1(doc): + return doc + + @component("c2", requires=["token.pos"]) + def c2(doc): + return doc + + nlp = Language() + nlp.add_pipe(c1) + with pytest.warns(UserWarning): + nlp.add_pipe(c2) + with pytest.warns(None) as record: + nlp.remove_pipe("c2") + assert not record.list diff --git a/spacy/tests/pipeline/test_entity_linker.py b/spacy/tests/pipeline/test_entity_linker.py index bc76d1e47..8023f72a6 100644 --- a/spacy/tests/pipeline/test_entity_linker.py +++ b/spacy/tests/pipeline/test_entity_linker.py @@ -154,7 +154,8 @@ def test_append_alias(nlp): assert len(mykb.get_candidates("douglas")) == 3 # append the same alias-entity pair again should not work (will throw a warning) - mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.3) + with pytest.warns(UserWarning): + mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.3) # test the size of the relevant candidates remained unchanged assert len(mykb.get_candidates("douglas")) == 3 diff --git a/spacy/tests/pipeline/test_functions.py b/spacy/tests/pipeline/test_functions.py new file mode 100644 index 000000000..5b5fcd2fd --- /dev/null +++ b/spacy/tests/pipeline/test_functions.py @@ -0,0 +1,34 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from spacy.pipeline.functions import merge_subtokens +from ..util import get_doc + + +@pytest.fixture +def doc(en_tokenizer): + # fmt: off + text = "This is a sentence. This is another sentence. And a third." + heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 1, 1, 1, 0] + deps = ["nsubj", "ROOT", "subtok", "attr", "punct", "nsubj", "ROOT", + "subtok", "attr", "punct", "subtok", "subtok", "subtok", "ROOT"] + # fmt: on + tokens = en_tokenizer(text) + return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) + + +def test_merge_subtokens(doc): + doc = merge_subtokens(doc) + # get_doc() doesn't set spaces, so the result is "And a third ." 
+ assert [t.text for t in doc] == [ + "This", + "is", + "a sentence", + ".", + "This", + "is", + "another sentence", + ".", + "And a third .", + ] diff --git a/spacy/tests/pipeline/test_pipe_methods.py b/spacy/tests/pipeline/test_pipe_methods.py index 5f1fa5cfe..27fb57b18 100644 --- a/spacy/tests/pipeline/test_pipe_methods.py +++ b/spacy/tests/pipeline/test_pipe_methods.py @@ -105,6 +105,16 @@ def test_disable_pipes_context(nlp, name): assert nlp.has_pipe(name) +def test_disable_pipes_list_arg(nlp): + for name in ["c1", "c2", "c3"]: + nlp.add_pipe(new_pipe, name=name) + assert nlp.has_pipe(name) + with nlp.disable_pipes(["c1", "c2"]): + assert not nlp.has_pipe("c1") + assert not nlp.has_pipe("c2") + assert nlp.has_pipe("c3") + + @pytest.mark.parametrize("n_pipes", [100]) def test_add_lots_of_pipes(nlp, n_pipes): for i in range(n_pipes): diff --git a/spacy/tests/regression/test_issue1-1000.py b/spacy/tests/regression/test_issue1-1000.py index 989eba805..6d88d68c2 100644 --- a/spacy/tests/regression/test_issue1-1000.py +++ b/spacy/tests/regression/test_issue1-1000.py @@ -30,7 +30,7 @@ def test_issue118(en_tokenizer, patterns): doc = en_tokenizer(text) ORG = doc.vocab.strings["ORG"] matcher = Matcher(doc.vocab) - matcher.add("BostonCeltics", None, *patterns) + matcher.add("BostonCeltics", patterns) assert len(list(doc.ents)) == 0 matches = [(ORG, start, end) for _, start, end in matcher(doc)] assert matches == [(ORG, 9, 11), (ORG, 10, 11)] @@ -57,7 +57,7 @@ def test_issue118_prefix_reorder(en_tokenizer, patterns): doc = en_tokenizer(text) ORG = doc.vocab.strings["ORG"] matcher = Matcher(doc.vocab) - matcher.add("BostonCeltics", None, *patterns) + matcher.add("BostonCeltics", patterns) assert len(list(doc.ents)) == 0 matches = [(ORG, start, end) for _, start, end in matcher(doc)] doc.ents += tuple(matches)[1:] @@ -78,7 +78,7 @@ def test_issue242(en_tokenizer): ] doc = en_tokenizer(text) matcher = Matcher(doc.vocab) - matcher.add("FOOD", None, *patterns) + matcher.add("FOOD", patterns) matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)] match1, match2 = matches assert match1[1] == 3 @@ -127,17 +127,13 @@ def test_issue587(en_tokenizer): """Test that Matcher doesn't segfault on particular input""" doc = en_tokenizer("a b; c") matcher = Matcher(doc.vocab) - matcher.add("TEST1", None, [{ORTH: "a"}, {ORTH: "b"}]) + matcher.add("TEST1", [[{ORTH: "a"}, {ORTH: "b"}]]) matches = matcher(doc) assert len(matches) == 1 - matcher.add( - "TEST2", None, [{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "c"}] - ) + matcher.add("TEST2", [[{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "c"}]]) matches = matcher(doc) assert len(matches) == 2 - matcher.add( - "TEST3", None, [{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "d"}] - ) + matcher.add("TEST3", [[{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "d"}]]) matches = matcher(doc) assert len(matches) == 2 @@ -145,7 +141,7 @@ def test_issue587(en_tokenizer): def test_issue588(en_vocab): matcher = Matcher(en_vocab) with pytest.raises(ValueError): - matcher.add("TEST", None, []) + matcher.add("TEST", [[]]) @pytest.mark.xfail @@ -161,11 +157,9 @@ def test_issue590(en_vocab): doc = Doc(en_vocab, words=["n", "=", "1", ";", "a", ":", "5", "%"]) matcher = Matcher(en_vocab) matcher.add( - "ab", - None, - [{"IS_ALPHA": True}, {"ORTH": ":"}, {"LIKE_NUM": True}, {"ORTH": "%"}], + "ab", [[{"IS_ALPHA": True}, {"ORTH": ":"}, {"LIKE_NUM": True}, {"ORTH": "%"}]] ) - matcher.add("ab", None, [{"IS_ALPHA": True}, {"ORTH": "="}, {"LIKE_NUM": 
True}]) + matcher.add("ab", [[{"IS_ALPHA": True}, {"ORTH": "="}, {"LIKE_NUM": True}]]) matches = matcher(doc) assert len(matches) == 2 @@ -221,7 +215,7 @@ def test_issue615(en_tokenizer): label = "Sport_Equipment" doc = en_tokenizer(text) matcher = Matcher(doc.vocab) - matcher.add(label, merge_phrases, pattern) + matcher.add(label, [pattern], on_match=merge_phrases) matcher(doc) entities = list(doc.ents) assert entities != [] @@ -339,7 +333,7 @@ def test_issue850(): vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()}) matcher = Matcher(vocab) pattern = [{"LOWER": "bob"}, {"OP": "*"}, {"LOWER": "frank"}] - matcher.add("FarAway", None, pattern) + matcher.add("FarAway", [pattern]) doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"]) match = matcher(doc) assert len(match) == 1 @@ -353,7 +347,7 @@ def test_issue850_basic(): vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()}) matcher = Matcher(vocab) pattern = [{"LOWER": "bob"}, {"OP": "*", "LOWER": "and"}, {"LOWER": "frank"}] - matcher.add("FarAway", None, pattern) + matcher.add("FarAway", [pattern]) doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"]) match = matcher(doc) assert len(match) == 1 diff --git a/spacy/tests/regression/test_issue1001-1500.py b/spacy/tests/regression/test_issue1001-1500.py index 889a5dc71..924c5aa3e 100644 --- a/spacy/tests/regression/test_issue1001-1500.py +++ b/spacy/tests/regression/test_issue1001-1500.py @@ -111,7 +111,7 @@ def test_issue1434(): hello_world = Doc(vocab, words=["Hello", "World"]) hello = Doc(vocab, words=["Hello"]) matcher = Matcher(vocab) - matcher.add("MyMatcher", None, pattern) + matcher.add("MyMatcher", [pattern]) matches = matcher(hello_world) assert matches matches = matcher(hello) @@ -133,7 +133,7 @@ def test_issue1450(string, start, end): """Test matcher works when patterns end with * operator.""" pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}] matcher = Matcher(Vocab()) - matcher.add("TSTEND", None, pattern) + matcher.add("TSTEND", [pattern]) doc = Doc(Vocab(), words=string.split()) matches = matcher(doc) if start is None or end is None: diff --git a/spacy/tests/regression/test_issue1501-2000.py b/spacy/tests/regression/test_issue1501-2000.py index a9cf070cd..e498417d1 100644 --- a/spacy/tests/regression/test_issue1501-2000.py +++ b/spacy/tests/regression/test_issue1501-2000.py @@ -224,7 +224,7 @@ def test_issue1868(): def test_issue1883(): matcher = Matcher(Vocab()) - matcher.add("pat1", None, [{"orth": "hello"}]) + matcher.add("pat1", [[{"orth": "hello"}]]) doc = Doc(matcher.vocab, words=["hello"]) assert len(matcher(doc)) == 1 new_matcher = copy.deepcopy(matcher) @@ -249,7 +249,7 @@ def test_issue1915(): def test_issue1945(): """Test regression in Matcher introduced in v2.0.6.""" matcher = Matcher(Vocab()) - matcher.add("MWE", None, [{"orth": "a"}, {"orth": "a"}]) + matcher.add("MWE", [[{"orth": "a"}, {"orth": "a"}]]) doc = Doc(matcher.vocab, words=["a", "a", "a"]) matches = matcher(doc) # we should see two overlapping matches here assert len(matches) == 2 @@ -285,7 +285,7 @@ def test_issue1971(en_vocab): {"ORTH": "!", "OP": "?"}, ] Token.set_extension("optional", default=False) - matcher.add("TEST", None, pattern) + matcher.add("TEST", [pattern]) doc = Doc(en_vocab, words=["Hello", "John", "Doe", "!"]) # We could also assert length 1 here, but this is more conclusive, because # the real problem here is that it returns a duplicate match for a match_id @@ -299,7 +299,7 @@ def test_issue_1971_2(en_vocab): pattern1 = [{"ORTH": 
"EUR", "LOWER": {"IN": ["eur"]}}, {"LIKE_NUM": True}] pattern2 = [{"LIKE_NUM": True}, {"ORTH": "EUR"}] # {"IN": ["EUR"]}}] doc = Doc(en_vocab, words=["EUR", "10", "is", "10", "EUR"]) - matcher.add("TEST1", None, pattern1, pattern2) + matcher.add("TEST1", [pattern1, pattern2]) matches = matcher(doc) assert len(matches) == 2 @@ -310,8 +310,8 @@ def test_issue_1971_3(en_vocab): Token.set_extension("b", default=2, force=True) doc = Doc(en_vocab, words=["hello", "world"]) matcher = Matcher(en_vocab) - matcher.add("A", None, [{"_": {"a": 1}}]) - matcher.add("B", None, [{"_": {"b": 2}}]) + matcher.add("A", [[{"_": {"a": 1}}]]) + matcher.add("B", [[{"_": {"b": 2}}]]) matches = sorted((en_vocab.strings[m_id], s, e) for m_id, s, e in matcher(doc)) assert len(matches) == 4 assert matches == sorted([("A", 0, 1), ("A", 1, 2), ("B", 0, 1), ("B", 1, 2)]) @@ -326,7 +326,7 @@ def test_issue_1971_4(en_vocab): matcher = Matcher(en_vocab) doc = Doc(en_vocab, words=["this", "is", "text"]) pattern = [{"_": {"ext_a": "str_a", "ext_b": "str_b"}}] * 3 - matcher.add("TEST", None, pattern) + matcher.add("TEST", [pattern]) matches = matcher(doc) # Uncommenting this caused a segmentation fault assert len(matches) == 1 diff --git a/spacy/tests/regression/test_issue2001-2500.py b/spacy/tests/regression/test_issue2001-2500.py index 4292c8d23..e95c1a9b9 100644 --- a/spacy/tests/regression/test_issue2001-2500.py +++ b/spacy/tests/regression/test_issue2001-2500.py @@ -128,7 +128,7 @@ def test_issue2464(en_vocab): """Test problem with successive ?. This is the same bug, so putting it here.""" matcher = Matcher(en_vocab) doc = Doc(en_vocab, words=["a", "b"]) - matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}]) + matcher.add("4", [[{"OP": "?"}, {"OP": "?"}]]) matches = matcher(doc) assert len(matches) == 3 diff --git a/spacy/tests/regression/test_issue2501-3000.py b/spacy/tests/regression/test_issue2501-3000.py index e26ccbf4b..73ff7376a 100644 --- a/spacy/tests/regression/test_issue2501-3000.py +++ b/spacy/tests/regression/test_issue2501-3000.py @@ -37,7 +37,7 @@ def test_issue2569(en_tokenizer): doc = en_tokenizer("It is May 15, 1993.") doc.ents = [Span(doc, 2, 6, label=doc.vocab.strings["DATE"])] matcher = Matcher(doc.vocab) - matcher.add("RULE", None, [{"ENT_TYPE": "DATE", "OP": "+"}]) + matcher.add("RULE", [[{"ENT_TYPE": "DATE", "OP": "+"}]]) matched = [doc[start:end] for _, start, end in matcher(doc)] matched = sorted(matched, key=len, reverse=True) assert len(matched) == 10 @@ -89,7 +89,7 @@ def test_issue2671(): {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "adrenaline"}, ] - matcher.add(pattern_id, None, pattern) + matcher.add(pattern_id, [pattern]) doc1 = nlp("This is a high-adrenaline situation.") doc2 = nlp("This is a high adrenaline situation.") matches1 = matcher(doc1) diff --git a/spacy/tests/regression/test_issue3001-3500.py b/spacy/tests/regression/test_issue3001-3500.py index 8ed243051..b883ae67a 100644 --- a/spacy/tests/regression/test_issue3001-3500.py +++ b/spacy/tests/regression/test_issue3001-3500.py @@ -52,7 +52,7 @@ def test_issue3009(en_vocab): doc = get_doc(en_vocab, words=words, tags=tags) matcher = Matcher(en_vocab) for i, pattern in enumerate(patterns): - matcher.add(str(i), None, pattern) + matcher.add(str(i), [pattern]) matches = matcher(doc) assert matches @@ -116,8 +116,8 @@ def test_issue3248_1(): total number of patterns.""" nlp = English() matcher = PhraseMatcher(nlp.vocab) - matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c")) - matcher.add("TEST2", None, nlp("d")) + matcher.add("TEST1", 
[nlp("a"), nlp("b"), nlp("c")]) + matcher.add("TEST2", [nlp("d")]) assert len(matcher) == 2 @@ -125,8 +125,8 @@ def test_issue3248_2(): """Test that the PhraseMatcher can be pickled correctly.""" nlp = English() matcher = PhraseMatcher(nlp.vocab) - matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c")) - matcher.add("TEST2", None, nlp("d")) + matcher.add("TEST1", [nlp("a"), nlp("b"), nlp("c")]) + matcher.add("TEST2", [nlp("d")]) data = pickle.dumps(matcher) new_matcher = pickle.loads(data) assert len(new_matcher) == len(matcher) @@ -170,7 +170,7 @@ def test_issue3328(en_vocab): [{"LOWER": {"IN": ["hello", "how"]}}], [{"LOWER": {"IN": ["you", "doing"]}}], ] - matcher.add("TEST", None, *patterns) + matcher.add("TEST", patterns) matches = matcher(doc) assert len(matches) == 4 matched_texts = [doc[start:end].text for _, start, end in matches] @@ -183,8 +183,8 @@ def test_issue3331(en_vocab): matches, one per rule. """ matcher = PhraseMatcher(en_vocab) - matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"])) - matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"])) + matcher.add("A", [Doc(en_vocab, words=["Barack", "Obama"])]) + matcher.add("B", [Doc(en_vocab, words=["Barack", "Obama"])]) doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"]) matches = matcher(doc) assert len(matches) == 2 @@ -297,8 +297,10 @@ def test_issue3410(): def test_issue3412(): data = numpy.asarray([[0, 0, 0], [1, 2, 3], [9, 8, 7]], dtype="f") vectors = Vectors(data=data) - keys, best_rows, scores = vectors.most_similar(numpy.asarray([[9, 8, 7], [0, 0, 0]], dtype="f")) - assert(best_rows[0] == 2) + keys, best_rows, scores = vectors.most_similar( + numpy.asarray([[9, 8, 7], [0, 0, 0]], dtype="f") + ) + assert best_rows[0] == 2 def test_issue3447(): diff --git a/spacy/tests/regression/test_issue3549.py b/spacy/tests/regression/test_issue3549.py index 3932bf19c..587b3a857 100644 --- a/spacy/tests/regression/test_issue3549.py +++ b/spacy/tests/regression/test_issue3549.py @@ -10,6 +10,6 @@ def test_issue3549(en_vocab): """Test that match pattern validation doesn't raise on empty errors.""" matcher = Matcher(en_vocab, validate=True) pattern = [{"LOWER": "hello"}, {"LOWER": "world"}] - matcher.add("GOOD", None, pattern) + matcher.add("GOOD", [pattern]) with pytest.raises(MatchPatternError): - matcher.add("BAD", None, [{"X": "Y"}]) + matcher.add("BAD", [[{"X": "Y"}]]) diff --git a/spacy/tests/regression/test_issue3555.py b/spacy/tests/regression/test_issue3555.py index 096b33367..8444f11f2 100644 --- a/spacy/tests/regression/test_issue3555.py +++ b/spacy/tests/regression/test_issue3555.py @@ -12,6 +12,6 @@ def test_issue3555(en_vocab): Token.set_extension("issue3555", default=None) matcher = Matcher(en_vocab) pattern = [{"LEMMA": "have"}, {"_": {"issue3555": True}}] - matcher.add("TEST", None, pattern) + matcher.add("TEST", [pattern]) doc = Doc(en_vocab, words=["have", "apple"]) matcher(doc) diff --git a/spacy/tests/regression/test_issue3611.py b/spacy/tests/regression/test_issue3611.py index c0ee83e1b..3c4836264 100644 --- a/spacy/tests/regression/test_issue3611.py +++ b/spacy/tests/regression/test_issue3611.py @@ -34,8 +34,7 @@ def test_issue3611(): nlp.add_pipe(textcat, last=True) # training the network - other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"] - with nlp.disable_pipes(*other_pipes): + with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]): optimizer = nlp.begin_training() for i in range(3): losses = {} diff --git 
a/spacy/tests/regression/test_issue3839.py b/spacy/tests/regression/test_issue3839.py index c24c60b6d..fe722a681 100644 --- a/spacy/tests/regression/test_issue3839.py +++ b/spacy/tests/regression/test_issue3839.py @@ -12,10 +12,10 @@ def test_issue3839(en_vocab): match_id = "PATTERN" pattern1 = [{"LOWER": "terrific"}, {"OP": "?"}, {"LOWER": "group"}] pattern2 = [{"LOWER": "terrific"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "group"}] - matcher.add(match_id, None, pattern1) + matcher.add(match_id, [pattern1]) matches = matcher(doc) assert matches[0][0] == en_vocab.strings[match_id] matcher = Matcher(en_vocab) - matcher.add(match_id, None, pattern2) + matcher.add(match_id, [pattern2]) matches = matcher(doc) assert matches[0][0] == en_vocab.strings[match_id] diff --git a/spacy/tests/regression/test_issue3879.py b/spacy/tests/regression/test_issue3879.py index 123e9fce3..5cd245231 100644 --- a/spacy/tests/regression/test_issue3879.py +++ b/spacy/tests/regression/test_issue3879.py @@ -10,5 +10,5 @@ def test_issue3879(en_vocab): assert len(doc) == 5 pattern = [{"ORTH": "This", "OP": "?"}, {"OP": "?"}, {"ORTH": "test"}] matcher = Matcher(en_vocab) - matcher.add("TEST", None, pattern) + matcher.add("TEST", [pattern]) assert len(matcher(doc)) == 2 # fails because of a FP match 'is a test' diff --git a/spacy/tests/regression/test_issue3951.py b/spacy/tests/regression/test_issue3951.py index e07ffd36e..33230112f 100644 --- a/spacy/tests/regression/test_issue3951.py +++ b/spacy/tests/regression/test_issue3951.py @@ -14,7 +14,7 @@ def test_issue3951(en_vocab): {"OP": "?"}, {"LOWER": "world"}, ] - matcher.add("TEST", None, pattern) + matcher.add("TEST", [pattern]) doc = Doc(en_vocab, words=["Hello", "my", "new", "world"]) matches = matcher(doc) assert len(matches) == 0 diff --git a/spacy/tests/regression/test_issue3972.py b/spacy/tests/regression/test_issue3972.py index a7f76e4d7..22b8d486e 100644 --- a/spacy/tests/regression/test_issue3972.py +++ b/spacy/tests/regression/test_issue3972.py @@ -9,8 +9,8 @@ def test_issue3972(en_vocab): """Test that the PhraseMatcher returns duplicates for duplicate match IDs. 
""" matcher = PhraseMatcher(en_vocab) - matcher.add("A", None, Doc(en_vocab, words=["New", "York"])) - matcher.add("B", None, Doc(en_vocab, words=["New", "York"])) + matcher.add("A", [Doc(en_vocab, words=["New", "York"])]) + matcher.add("B", [Doc(en_vocab, words=["New", "York"])]) doc = Doc(en_vocab, words=["I", "live", "in", "New", "York"]) matches = matcher(doc) diff --git a/spacy/tests/regression/test_issue4002.py b/spacy/tests/regression/test_issue4002.py index 37e054b3e..d075128aa 100644 --- a/spacy/tests/regression/test_issue4002.py +++ b/spacy/tests/regression/test_issue4002.py @@ -11,7 +11,7 @@ def test_issue4002(en_vocab): matcher = PhraseMatcher(en_vocab, attr="NORM") pattern1 = Doc(en_vocab, words=["c", "d"]) assert [t.norm_ for t in pattern1] == ["c", "d"] - matcher.add("TEST", None, pattern1) + matcher.add("TEST", [pattern1]) doc = Doc(en_vocab, words=["a", "b", "c", "d"]) assert [t.norm_ for t in doc] == ["a", "b", "c", "d"] matches = matcher(doc) @@ -21,6 +21,6 @@ def test_issue4002(en_vocab): pattern2[0].norm_ = "c" pattern2[1].norm_ = "d" assert [t.norm_ for t in pattern2] == ["c", "d"] - matcher.add("TEST", None, pattern2) + matcher.add("TEST", [pattern2]) matches = matcher(doc) assert len(matches) == 1 diff --git a/spacy/tests/regression/test_issue4030.py b/spacy/tests/regression/test_issue4030.py index c331fa1d2..ed219573f 100644 --- a/spacy/tests/regression/test_issue4030.py +++ b/spacy/tests/regression/test_issue4030.py @@ -34,8 +34,7 @@ def test_issue4030(): nlp.add_pipe(textcat, last=True) # training the network - other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"] - with nlp.disable_pipes(*other_pipes): + with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]): optimizer = nlp.begin_training() for i in range(3): losses = {} diff --git a/spacy/tests/regression/test_issue4120.py b/spacy/tests/regression/test_issue4120.py index 2ce5aec6a..d288f46c4 100644 --- a/spacy/tests/regression/test_issue4120.py +++ b/spacy/tests/regression/test_issue4120.py @@ -8,7 +8,7 @@ from spacy.tokens import Doc def test_issue4120(en_vocab): """Test that matches without a final {OP: ?} token are returned.""" matcher = Matcher(en_vocab) - matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}]) + matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}]]) doc1 = Doc(en_vocab, words=["a"]) assert len(matcher(doc1)) == 1 # works @@ -16,11 +16,11 @@ def test_issue4120(en_vocab): assert len(matcher(doc2)) == 2 # fixed matcher = Matcher(en_vocab) - matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b"}]) + matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b"}]]) doc3 = Doc(en_vocab, words=["a", "b", "b", "c"]) assert len(matcher(doc3)) == 2 # works matcher = Matcher(en_vocab) - matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b", "OP": "?"}]) + matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b", "OP": "?"}]]) doc4 = Doc(en_vocab, words=["a", "b", "b", "c"]) assert len(matcher(doc4)) == 3 # fixed diff --git a/spacy/tests/regression/test_issue4402.py b/spacy/tests/regression/test_issue4402.py new file mode 100644 index 000000000..2e1b69000 --- /dev/null +++ b/spacy/tests/regression/test_issue4402.py @@ -0,0 +1,96 @@ +# coding: utf8 +from __future__ import unicode_literals + +import srsly +from spacy.gold import GoldCorpus + +from spacy.lang.en import English +from spacy.tests.util import make_tempdir + + +def test_issue4402(): + nlp = English() + with make_tempdir() as tmpdir: + print("temp", tmpdir) + json_path = tmpdir / 
"test4402.json" + srsly.write_json(json_path, json_data) + + corpus = GoldCorpus(str(json_path), str(json_path)) + + train_docs = list(corpus.train_docs(nlp, gold_preproc=True, max_length=0)) + # assert that the data got split into 4 sentences + assert len(train_docs) == 4 + + +json_data = [ + { + "id": 0, + "paragraphs": [ + { + "raw": "How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven.", + "sentences": [ + { + "tokens": [ + {"id": 0, "orth": "How", "ner": "O"}, + {"id": 1, "orth": "should", "ner": "O"}, + {"id": 2, "orth": "I", "ner": "O"}, + {"id": 3, "orth": "cook", "ner": "O"}, + {"id": 4, "orth": "bacon", "ner": "O"}, + {"id": 5, "orth": "in", "ner": "O"}, + {"id": 6, "orth": "an", "ner": "O"}, + {"id": 7, "orth": "oven", "ner": "O"}, + {"id": 8, "orth": "?", "ner": "O"}, + ], + "brackets": [], + }, + { + "tokens": [ + {"id": 9, "orth": "\n", "ner": "O"}, + {"id": 10, "orth": "I", "ner": "O"}, + {"id": 11, "orth": "'ve", "ner": "O"}, + {"id": 12, "orth": "heard", "ner": "O"}, + {"id": 13, "orth": "of", "ner": "O"}, + {"id": 14, "orth": "people", "ner": "O"}, + {"id": 15, "orth": "cooking", "ner": "O"}, + {"id": 16, "orth": "bacon", "ner": "O"}, + {"id": 17, "orth": "in", "ner": "O"}, + {"id": 18, "orth": "an", "ner": "O"}, + {"id": 19, "orth": "oven", "ner": "O"}, + {"id": 20, "orth": ".", "ner": "O"}, + ], + "brackets": [], + }, + ], + "cats": [ + {"label": "baking", "value": 1.0}, + {"label": "not_baking", "value": 0.0}, + ], + }, + { + "raw": "What is the difference between white and brown eggs?\n", + "sentences": [ + { + "tokens": [ + {"id": 0, "orth": "What", "ner": "O"}, + {"id": 1, "orth": "is", "ner": "O"}, + {"id": 2, "orth": "the", "ner": "O"}, + {"id": 3, "orth": "difference", "ner": "O"}, + {"id": 4, "orth": "between", "ner": "O"}, + {"id": 5, "orth": "white", "ner": "O"}, + {"id": 6, "orth": "and", "ner": "O"}, + {"id": 7, "orth": "brown", "ner": "O"}, + {"id": 8, "orth": "eggs", "ner": "O"}, + {"id": 9, "orth": "?", "ner": "O"}, + ], + "brackets": [], + }, + {"tokens": [{"id": 10, "orth": "\n", "ner": "O"}], "brackets": []}, + ], + "cats": [ + {"label": "baking", "value": 0.0}, + {"label": "not_baking", "value": 1.0}, + ], + }, + ], + } +] diff --git a/spacy/tests/regression/test_issue4528.py b/spacy/tests/regression/test_issue4528.py new file mode 100644 index 000000000..460449003 --- /dev/null +++ b/spacy/tests/regression/test_issue4528.py @@ -0,0 +1,19 @@ +# coding: utf8 +from __future__ import unicode_literals + +from spacy.tokens import Doc, DocBin + + +def test_issue4528(en_vocab): + """Test that user_data is correctly serialized in DocBin.""" + doc = Doc(en_vocab, words=["hello", "world"]) + doc.user_data["foo"] = "bar" + # This is how extension attribute values are stored in the user data + doc.user_data[("._.", "foo", None, None)] = "bar" + doc_bin = DocBin(store_user_data=True) + doc_bin.add(doc) + doc_bin_bytes = doc_bin.to_bytes() + new_doc_bin = DocBin(store_user_data=True).from_bytes(doc_bin_bytes) + new_doc = list(new_doc_bin.get_docs(en_vocab))[0] + assert new_doc.user_data["foo"] == "bar" + assert new_doc.user_data[("._.", "foo", None, None)] == "bar" diff --git a/spacy/tests/regression/test_issue4529.py b/spacy/tests/regression/test_issue4529.py new file mode 100644 index 000000000..381957be6 --- /dev/null +++ b/spacy/tests/regression/test_issue4529.py @@ -0,0 +1,13 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest +from spacy.gold import GoldParse + + +@pytest.mark.parametrize( + 
"text,words", [("A'B C", ["A", "'", "B", "C"]), ("A-B", ["A-B"])] +) +def test_gold_misaligned(en_tokenizer, text, words): + doc = en_tokenizer(text) + GoldParse(doc, words=words) diff --git a/spacy/tests/test_gold.py b/spacy/tests/test_gold.py index 234a91443..731a1b5c2 100644 --- a/spacy/tests/test_gold.py +++ b/spacy/tests/test_gold.py @@ -3,7 +3,7 @@ from __future__ import unicode_literals from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags from spacy.gold import spans_from_biluo_tags, GoldParse, iob_to_biluo -from spacy.gold import GoldCorpus, docs_to_json +from spacy.gold import GoldCorpus, docs_to_json, align from spacy.lang.en import English from spacy.tokens import Doc from .util import make_tempdir @@ -90,7 +90,7 @@ def test_gold_ner_missing_tags(en_tokenizer): def test_iob_to_biluo(): good_iob = ["O", "O", "B-LOC", "I-LOC", "O", "B-PERSON"] good_biluo = ["O", "O", "B-LOC", "L-LOC", "O", "U-PERSON"] - bad_iob = ["O", "O", "\"", "B-LOC", "I-LOC"] + bad_iob = ["O", "O", '"', "B-LOC", "I-LOC"] converted_biluo = iob_to_biluo(good_iob) assert good_biluo == converted_biluo with pytest.raises(ValueError): @@ -99,14 +99,23 @@ def test_iob_to_biluo(): def test_roundtrip_docs_to_json(): text = "I flew to Silicon Valley via London." + tags = ["PRP", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."] + heads = [1, 1, 1, 4, 2, 1, 5, 1] + deps = ["nsubj", "ROOT", "prep", "compound", "pobj", "prep", "pobj", "punct"] + biluo_tags = ["O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] cats = {"TRAVEL": 1.0, "BAKING": 0.0} nlp = English() doc = nlp(text) + for i in range(len(tags)): + doc[i].tag_ = tags[i] + doc[i].dep_ = deps[i] + doc[i].head = doc[heads[i]] + doc.ents = spans_from_biluo_tags(doc, biluo_tags) doc.cats = cats - doc[0].is_sent_start = True - for i in range(1, len(doc)): - doc[i].is_sent_start = False + doc.is_tagged = True + doc.is_parsed = True + # roundtrip to JSON with make_tempdir() as tmpdir: json_file = tmpdir / "roundtrip.json" srsly.write_json(json_file, [docs_to_json(doc)]) @@ -116,7 +125,95 @@ def test_roundtrip_docs_to_json(): assert len(doc) == goldcorpus.count_train() assert text == reloaded_doc.text + assert tags == goldparse.tags + assert deps == goldparse.labels + assert heads == goldparse.heads + assert biluo_tags == goldparse.ner assert "TRAVEL" in goldparse.cats assert "BAKING" in goldparse.cats assert cats["TRAVEL"] == goldparse.cats["TRAVEL"] assert cats["BAKING"] == goldparse.cats["BAKING"] + + # roundtrip to JSONL train dicts + with make_tempdir() as tmpdir: + jsonl_file = tmpdir / "roundtrip.jsonl" + srsly.write_jsonl(jsonl_file, [docs_to_json(doc)]) + goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) + + reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp)) + + assert len(doc) == goldcorpus.count_train() + assert text == reloaded_doc.text + assert tags == goldparse.tags + assert deps == goldparse.labels + assert heads == goldparse.heads + assert biluo_tags == goldparse.ner + assert "TRAVEL" in goldparse.cats + assert "BAKING" in goldparse.cats + assert cats["TRAVEL"] == goldparse.cats["TRAVEL"] + assert cats["BAKING"] == goldparse.cats["BAKING"] + + # roundtrip to JSONL tuples + with make_tempdir() as tmpdir: + jsonl_file = tmpdir / "roundtrip.jsonl" + # write to JSONL train dicts + srsly.write_jsonl(jsonl_file, [docs_to_json(doc)]) + goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) + # load and rewrite as JSONL tuples + srsly.write_jsonl(jsonl_file, goldcorpus.train_tuples) + goldcorpus = GoldCorpus(str(jsonl_file), 
str(jsonl_file)) + + reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp)) + + assert len(doc) == goldcorpus.count_train() + assert text == reloaded_doc.text + assert tags == goldparse.tags + assert deps == goldparse.labels + assert heads == goldparse.heads + assert biluo_tags == goldparse.ner + assert "TRAVEL" in goldparse.cats + assert "BAKING" in goldparse.cats + assert cats["TRAVEL"] == goldparse.cats["TRAVEL"] + assert cats["BAKING"] == goldparse.cats["BAKING"] + + +# xfail while we have backwards-compatible alignment +@pytest.mark.xfail +@pytest.mark.parametrize( + "tokens_a,tokens_b,expected", + [ + (["a", "b", "c"], ["ab", "c"], (3, [-1, -1, 1], [-1, 2], {0: 0, 1: 0}, {})), + ( + ["a", "b", "``", "c"], + ['ab"', "c"], + (4, [-1, -1, -1, 1], [-1, 3], {0: 0, 1: 0, 2: 0}, {}), + ), + (["a", "bc"], ["ab", "c"], (4, [-1, -1], [-1, -1], {0: 0}, {1: 1})), + ( + ["ab", "c", "d"], + ["a", "b", "cd"], + (6, [-1, -1, -1], [-1, -1, -1], {1: 2, 2: 2}, {0: 0, 1: 0}), + ), + ( + ["a", "b", "cd"], + ["a", "b", "c", "d"], + (3, [0, 1, -1], [0, 1, -1, -1], {}, {2: 2, 3: 2}), + ), + ([" ", "a"], ["a"], (1, [-1, 0], [1], {}, {})), + ], +) +def test_align(tokens_a, tokens_b, expected): + cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_a, tokens_b) + assert (cost, list(a2b), list(b2a), a2b_multi, b2a_multi) == expected + # check symmetry + cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_b, tokens_a) + assert (cost, list(b2a), list(a2b), b2a_multi, a2b_multi) == expected + + +def test_goldparse_startswith_space(en_tokenizer): + text = " a" + doc = en_tokenizer(text) + g = GoldParse(doc, words=["a"], entities=["U-DATE"], deps=["ROOT"], heads=[0]) + assert g.words == [" ", "a"] + assert g.ner == [None, "U-DATE"] + assert g.labels == [None, "ROOT"] diff --git a/spacy/tests/test_misc.py b/spacy/tests/test_misc.py index a033b6dd0..4075ccf64 100644 --- a/spacy/tests/test_misc.py +++ b/spacy/tests/test_misc.py @@ -95,12 +95,18 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2): def test_prefer_gpu(): - assert not prefer_gpu() + try: + import cupy # noqa: F401 + except ImportError: + assert not prefer_gpu() def test_require_gpu(): - with pytest.raises(ValueError): - require_gpu() + try: + import cupy # noqa: F401 + except ImportError: + with pytest.raises(ValueError): + require_gpu() def test_create_symlink_windows( diff --git a/spacy/tests/test_tok2vec.py b/spacy/tests/test_tok2vec.py new file mode 100644 index 000000000..ddaa71059 --- /dev/null +++ b/spacy/tests/test_tok2vec.py @@ -0,0 +1,66 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + +from spacy._ml import Tok2Vec +from spacy.vocab import Vocab +from spacy.tokens import Doc +from spacy.compat import unicode_ + + +def get_batch(batch_size): + vocab = Vocab() + docs = [] + start = 0 + for size in range(1, batch_size + 1): + # Make the words numbers, so that they're distnct + # across the batch, and easy to track. + numbers = [unicode_(i) for i in range(start, start + size)] + docs.append(Doc(vocab, words=numbers)) + start += size + return docs + + +# This fails in Thinc v7.3.1. 
Need to push patch +@pytest.mark.xfail +def test_empty_doc(): + width = 128 + embed_size = 2000 + vocab = Vocab() + doc = Doc(vocab, words=[]) + tok2vec = Tok2Vec(width, embed_size) + vectors, backprop = tok2vec.begin_update([doc]) + assert len(vectors) == 1 + assert vectors[0].shape == (0, width) + + +@pytest.mark.parametrize( + "batch_size,width,embed_size", [[1, 128, 2000], [2, 128, 2000], [3, 8, 63]] +) +def test_tok2vec_batch_sizes(batch_size, width, embed_size): + batch = get_batch(batch_size) + tok2vec = Tok2Vec(width, embed_size) + vectors, backprop = tok2vec.begin_update(batch) + assert len(vectors) == len(batch) + for doc_vec, doc in zip(vectors, batch): + assert doc_vec.shape == (len(doc), width) + + +@pytest.mark.parametrize( + "tok2vec_config", + [ + {"width": 8, "embed_size": 100, "char_embed": False}, + {"width": 8, "embed_size": 100, "char_embed": True}, + {"width": 8, "embed_size": 100, "conv_depth": 6}, + {"width": 8, "embed_size": 100, "conv_depth": 6}, + {"width": 8, "embed_size": 100, "subword_features": False}, + ], +) +def test_tok2vec_configs(tok2vec_config): + docs = get_batch(3) + tok2vec = Tok2Vec(**tok2vec_config) + vectors, backprop = tok2vec.begin_update(docs) + assert len(vectors) == len(docs) + assert vectors[0].shape == (len(docs[0]), tok2vec_config["width"]) + backprop(vectors) diff --git a/spacy/tokens/_serialize.py b/spacy/tokens/_serialize.py index 67ad9a21a..18cb8a234 100644 --- a/spacy/tokens/_serialize.py +++ b/spacy/tokens/_serialize.py @@ -103,7 +103,8 @@ class DocBin(object): doc = Doc(vocab, words=words, spaces=spaces) doc = doc.from_array(self.attrs, tokens) if self.store_user_data: - doc.user_data.update(srsly.msgpack_loads(self.user_data[i])) + user_data = srsly.msgpack_loads(self.user_data[i], use_list=False) + doc.user_data.update(user_data) yield doc def merge(self, other): @@ -155,9 +156,9 @@ class DocBin(object): msg = srsly.msgpack_loads(zlib.decompress(bytes_data)) self.attrs = msg["attrs"] self.strings = set(msg["strings"]) - lengths = numpy.fromstring(msg["lengths"], dtype="int32") - flat_spaces = numpy.fromstring(msg["spaces"], dtype=bool) - flat_tokens = numpy.fromstring(msg["tokens"], dtype="uint64") + lengths = numpy.frombuffer(msg["lengths"], dtype="int32") + flat_spaces = numpy.frombuffer(msg["spaces"], dtype=bool) + flat_tokens = numpy.frombuffer(msg["tokens"], dtype="uint64") shape = (flat_tokens.size // len(self.attrs), len(self.attrs)) flat_tokens = flat_tokens.reshape(shape) flat_spaces = flat_spaces.reshape((flat_spaces.size, 1)) diff --git a/spacy/util.py b/spacy/util.py index fa8111d67..74e4cc1c6 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -142,6 +142,11 @@ def register_architecture(name, arch=None): return do_registration +def make_layer(arch_config): + arch_func = get_architecture(arch_config["arch"]) + return arch_func(arch_config["config"]) + + def get_architecture(name): """Get a model architecture function by name. Raises a KeyError if the architecture is not found. 
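
Side note on the new `util.make_layer` helper added above: it only resolves a registered architecture by name and calls it with the nested `config` dict. A minimal sketch of how it fits together with the registry, assuming `register_architecture(name, arch=...)` registers the function directly (that part of the body isn't shown in this diff); the `"custom_tok2vec"` name, config keys and return value are made up for illustration:

```python
from spacy import util


def custom_tok2vec(config):
    # Hypothetical architecture function: a real one would build and return
    # a Thinc model; this just echoes the width so the sketch is runnable.
    return {"width": config["width"]}


# Assumption: passing arch= registers the function under the given name.
util.register_architecture("custom_tok2vec", arch=custom_tok2vec)

# make_layer looks the name up via get_architecture and calls it with the
# nested "config" dict; an unregistered name would raise KeyError.
layer = util.make_layer({"arch": "custom_tok2vec", "config": {"width": 96}})
assert layer == {"width": 96}
```
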
@@ -242,6 +247,7 @@ def load_model_from_path(model_path, meta=False, **overrides): cls = get_lang_class(lang) nlp = cls(meta=meta, **overrides) pipeline = meta.get("pipeline", []) + factories = meta.get("factories", {}) disable = overrides.get("disable", []) if pipeline is True: pipeline = nlp.Defaults.pipe_names @@ -250,7 +256,8 @@ def load_model_from_path(model_path, meta=False, **overrides): for name in pipeline: if name not in disable: config = meta.get("pipeline_args", {}).get(name, {}) - component = nlp.create_pipe(name, config=config) + factory = factories.get(name, name) + component = nlp.create_pipe(factory, config=config) nlp.add_pipe(component, name=name) return nlp.from_disk(model_path) @@ -363,6 +370,16 @@ def is_in_jupyter(): return False +def get_component_name(component): + if hasattr(component, "name"): + return component.name + if hasattr(component, "__name__"): + return component.__name__ + if hasattr(component, "__class__") and hasattr(component.__class__, "__name__"): + return component.__class__.__name__ + return repr(component) + + def get_cuda_stream(require=False): if CudaStream is None: return None @@ -404,7 +421,7 @@ def env_opt(name, default=None): def read_regex(path): path = ensure_path(path) - with path.open() as file_: + with path.open(encoding="utf8") as file_: entries = file_.read().split("\n") expression = "|".join( ["^" + re.escape(piece) for piece in entries if piece.strip()] diff --git a/website/docs/api/annotation.md b/website/docs/api/annotation.md index fac7e79b6..5ca5e91d9 100644 --- a/website/docs/api/annotation.md +++ b/website/docs/api/annotation.md @@ -48,14 +48,14 @@ be installed if needed via `pip install spacy[lookups]`. Some languages provide full lemmatization rules and exceptions, while other languages currently only rely on simple lookup tables. - + -spaCy adds a **special case for pronouns**: all pronouns are lemmatized to the -special token `-PRON-`. Unlike verbs and common nouns, there's no clear base -form of a personal pronoun. Should the lemma of "me" be "I", or should we -normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to -introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal -pronouns. +spaCy adds a **special case for English pronouns**: all English pronouns are +lemmatized to the special token `-PRON-`. Unlike verbs and common nouns, +there's no clear base form of a personal pronoun. Should the lemma of "me" be +"I", or should we normalize person as well, giving "it" — or maybe "he"? +spaCy's solution is to introduce a novel symbol, `-PRON-`, which is used as the +lemma for all personal pronouns. @@ -117,76 +117,72 @@ type. They're available as the [`Token.pos`](/api/token#attributes) and The English part-of-speech tagger uses the [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) version of the Penn -Treebank tag set. We also map the tags to the simpler Google Universal POS tag -set. 
- -| Tag |  POS | Morphology | Description | -| ----------------------------------- | ------- | ---------------------------------------------- | ----------------------------------------- | -| `-LRB-` | `PUNCT` | `PunctType=brck PunctSide=ini` | left round bracket | -| `-RRB-` | `PUNCT` | `PunctType=brck PunctSide=fin` | right round bracket | -| `,` | `PUNCT` | `PunctType=comm` | punctuation mark, comma | -| `:` | `PUNCT` | | punctuation mark, colon or ellipsis | -| `.` | `PUNCT` | `PunctType=peri` | punctuation mark, sentence closer | -| `''` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark | -| `""` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark | -| `` | `PUNCT` | `PunctType=quot PunctSide=ini` | opening quotation mark | -| `#` | `SYM` | `SymType=numbersign` | symbol, number sign | -| `$` | `SYM` | `SymType=currency` | symbol, currency | -| `ADD` | `X` | | email | -| `AFX` | `ADJ` | `Hyph=yes` | affix | -| `BES` | `VERB` | | auxiliary "be" | -| `CC` | `CONJ` | `ConjType=coor` | conjunction, coordinating | -| `CD` | `NUM` | `NumType=card` | cardinal number | -| `DT` | `DET` | | determiner | -| `EX` | `ADV` | `AdvType=ex` | existential there | -| `FW` | `X` | `Foreign=yes` | foreign word | -| `GW` | `X` | | additional word in multi-word expression | -| `HVS` | `VERB` | | forms of "have" | -| `HYPH` | `PUNCT` | `PunctType=dash` | punctuation mark, hyphen | -| `IN` | `ADP` | | conjunction, subordinating or preposition | -| `JJ` | `ADJ` | `Degree=pos` | adjective | -| `JJR` | `ADJ` | `Degree=comp` | adjective, comparative | -| `JJS` | `ADJ` | `Degree=sup` | adjective, superlative | -| `LS` | `PUNCT` | `NumType=ord` | list item marker | -| `MD` | `VERB` | `VerbType=mod` | verb, modal auxiliary | -| `NFP` | `PUNCT` | | superfluous punctuation | -| `NIL` | | | missing tag | -| `NN` | `NOUN` | `Number=sing` | noun, singular or mass | -| `NNP` | `PROPN` | `NounType=prop Number=sign` | noun, proper singular | -| `NNPS` | `PROPN` | `NounType=prop Number=plur` | noun, proper plural | -| `NNS` | `NOUN` | `Number=plur` | noun, plural | -| `PDT` | `ADJ` | `AdjType=pdt PronType=prn` | predeterminer | -| `POS` | `PART` | `Poss=yes` | possessive ending | -| `PRP` | `PRON` | `PronType=prs` | pronoun, personal | -| `PRP$` | `ADJ` | `PronType=prs Poss=yes` | pronoun, possessive | -| `RB` | `ADV` | `Degree=pos` | adverb | -| `RBR` | `ADV` | `Degree=comp` | adverb, comparative | -| `RBS` | `ADV` | `Degree=sup` | adverb, superlative | -| `RP` | `PART` | | adverb, particle | -| `_SP` | `SPACE` | | space | -| `SYM` | `SYM` | | symbol | -| `TO` | `PART` | `PartType=inf VerbForm=inf` | infinitival "to" | -| `UH` | `INTJ` | | interjection | -| `VB` | `VERB` | `VerbForm=inf` | verb, base form | -| `VBD` | `VERB` | `VerbForm=fin Tense=past` | verb, past tense | -| `VBG` | `VERB` | `VerbForm=part Tense=pres Aspect=prog` | verb, gerund or present participle | -| `VBN` | `VERB` | `VerbForm=part Tense=past Aspect=perf` | verb, past participle | -| `VBP` | `VERB` | `VerbForm=fin Tense=pres` | verb, non-3rd person singular present | -| `VBZ` | `VERB` | `VerbForm=fin Tense=pres Number=sing Person=3` | verb, 3rd person singular present | -| `WDT` | `ADJ` | `PronType=int|rel` | wh-determiner | -| `WP` | `NOUN` | `PronType=int|rel` | wh-pronoun, personal | -| `WP$` | `ADJ` | `Poss=yes PronType=int|rel` | wh-pronoun, possessive | -| `WRB` | `ADV` | `PronType=int|rel` | wh-adverb | -| `XX` | `X` | | unknown | +Treebank tag set. 
We also map the tags to the simpler Universal Dependencies v2 +POS tag set. +| Tag |  POS | Morphology | Description | +| ------------------------------------- | ------- | --------------------------------------- | ----------------------------------------- | +| `$` | `SYM` | | symbol, currency | +| `` | `PUNCT` | `PunctType=quot PunctSide=ini` | opening quotation mark | +| `''` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark | +| `,` | `PUNCT` | `PunctType=comm` | punctuation mark, comma | +| `-LRB-` | `PUNCT` | `PunctType=brck PunctSide=ini` | left round bracket | +| `-RRB-` | `PUNCT` | `PunctType=brck PunctSide=fin` | right round bracket | +| `.` | `PUNCT` | `PunctType=peri` | punctuation mark, sentence closer | +| `:` | `PUNCT` | | punctuation mark, colon or ellipsis | +| `ADD` | `X` | | email | +| `AFX` | `ADJ` | `Hyph=yes` | affix | +| `CC` | `CCONJ` | `ConjType=comp` | conjunction, coordinating | +| `CD` | `NUM` | `NumType=card` | cardinal number | +| `DT` | `DET` | | determiner | +| `EX` | `PRON` | `AdvType=ex` | existential there | +| `FW` | `X` | `Foreign=yes` | foreign word | +| `GW` | `X` | | additional word in multi-word expression | +| `HYPH` | `PUNCT` | `PunctType=dash` | punctuation mark, hyphen | +| `IN` | `ADP` | | conjunction, subordinating or preposition | +| `JJ` | `ADJ` | `Degree=pos` | adjective | +| `JJR` | `ADJ` | `Degree=comp` | adjective, comparative | +| `JJS` | `ADJ` | `Degree=sup` | adjective, superlative | +| `LS` | `X` | `NumType=ord` | list item marker | +| `MD` | `VERB` | `VerbType=mod` | verb, modal auxiliary | +| `NFP` | `PUNCT` | | superfluous punctuation | +| `NIL` | `X` | | missing tag | +| `NN` | `NOUN` | `Number=sing` | noun, singular or mass | +| `NNP` | `PROPN` | `NounType=prop Number=sing` | noun, proper singular | +| `NNPS` | `PROPN` | `NounType=prop Number=plur` | noun, proper plural | +| `NNS` | `NOUN` | `Number=plur` | noun, plural | +| `PDT` | `DET` | | predeterminer | +| `POS` | `PART` | `Poss=yes` | possessive ending | +| `PRP` | `PRON` | `PronType=prs` | pronoun, personal | +| `PRP$` | `DET` | `PronType=prs Poss=yes` | pronoun, possessive | +| `RB` | `ADV` | `Degree=pos` | adverb | +| `RBR` | `ADV` | `Degree=comp` | adverb, comparative | +| `RBS` | `ADV` | `Degree=sup` | adverb, superlative | +| `RP` | `ADP` | | adverb, particle | +| `SP` | `SPACE` | | space | +| `SYM` | `SYM` | | symbol | +| `TO` | `PART` | `PartType=inf VerbForm=inf` | infinitival "to" | +| `UH` | `INTJ` | | interjection | +| `VB` | `VERB` | `VerbForm=inf` | verb, base form | +| `VBD` | `VERB` | `VerbForm=fin Tense=past` | verb, past tense | +| `VBG` | `VERB` | `VerbForm=part Tense=pres Aspect=prog` | verb, gerund or present participle | +| `VBN` | `VERB` | `VerbForm=part Tense=past Aspect=perf` | verb, past participle | +| `VBP` | `VERB` | `VerbForm=fin Tense=pres` | verb, non-3rd person singular present | +| `VBZ` | `VERB` | `VerbForm=fin Tense=pres Number=sing Person=three` | verb, 3rd person singular present | +| `WDT` | `DET` | | wh-determiner | +| `WP` | `PRON` | | wh-pronoun, personal | +| `WP$` | `DET` | `Poss=yes` | wh-pronoun, possessive | +| `WRB` | `ADV` | | wh-adverb | +| `XX` | `X` | | unknown | +| `_SP` | `SPACE` | | | The German part-of-speech tagger uses the [TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html) -annotation scheme. We also map the tags to the simpler Google Universal POS tag -set. +annotation scheme. 
We also map the tags to the simpler Universal Dependencies +v2 POS tag set. | Tag |  POS | Morphology | Description | | --------- | ------- | ---------------------------------------- | ------------------------------------------------- | @@ -194,7 +190,7 @@ set. | `$,` | `PUNCT` | `PunctType=comm` | comma | | `$.` | `PUNCT` | `PunctType=peri` | sentence-final punctuation mark | | `ADJA` | `ADJ` | | adjective, attributive | -| `ADJD` | `ADJ` | `Variant=short` | adjective, adverbial or predicative | +| `ADJD` | `ADJ` | | adjective, adverbial or predicative | | `ADV` | `ADV` | | adverb | | `APPO` | `ADP` | `AdpType=post` | postposition | | `APPR` | `ADP` | `AdpType=prep` | preposition; circumposition left | @@ -204,28 +200,28 @@ set. | `CARD` | `NUM` | `NumType=card` | cardinal number | | `FM` | `X` | `Foreign=yes` | foreign language material | | `ITJ` | `INTJ` | | interjection | -| `KOKOM` | `CONJ` | `ConjType=comp` | comparative conjunction | -| `KON` | `CONJ` | | coordinate conjunction | +| `KOKOM` | `CCONJ` | `ConjType=comp` | comparative conjunction | +| `KON` | `CCONJ` | | coordinate conjunction | | `KOUI` | `SCONJ` | | subordinate conjunction with "zu" and infinitive | | `KOUS` | `SCONJ` | | subordinate conjunction with sentence | | `NE` | `PROPN` | | proper noun | -| `NNE` | `PROPN` | | proper noun | | `NN` | `NOUN` | | noun, singular or mass | -| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb | +| `NNE` | `PROPN` | | proper noun | | `PDAT` | `DET` | `PronType=dem` | attributive demonstrative pronoun | | `PDS` | `PRON` | `PronType=dem` | substituting demonstrative pronoun | -| `PIAT` | `DET` | `PronType=ind\|neg\|tot` | attributive indefinite pronoun without determiner | -| `PIS` | `PRON` | `PronType=ind\|neg\|tot` | substituting indefinite pronoun | +| `PIAT` | `DET` | `PronType=ind|neg|tot` | attributive indefinite pronoun without determiner | +| `PIS` | `PRON` | `PronType=ind|neg|tot` | substituting indefinite pronoun | | `PPER` | `PRON` | `PronType=prs` | non-reflexive personal pronoun | | `PPOSAT` | `DET` | `Poss=yes PronType=prs` | attributive possessive pronoun | -| `PPOSS` | `PRON` | `PronType=rel` | substituting possessive pronoun | +| `PPOSS` | `PRON` | `Poss=yes PronType=prs` | substituting possessive pronoun | | `PRELAT` | `DET` | `PronType=rel` | attributive relative pronoun | | `PRELS` | `PRON` | `PronType=rel` | substituting relative pronoun | | `PRF` | `PRON` | `PronType=prs Reflex=yes` | reflexive personal pronoun | +| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb | | `PTKA` | `PART` | | particle with adjective or adverb | | `PTKANT` | `PART` | `PartType=res` | answer particle | -| `PTKNEG` | `PART` | `Negative=yes` | negative particle | -| `PTKVZ` | `PART` | `PartType=vbp` | separable verbal particle | +| `PTKNEG` | `PART` | `Polarity=neg` | negative particle | +| `PTKVZ` | `ADP` | `PartType=vbp` | separable verbal particle | | `PTKZU` | `PART` | `PartType=inf` | "zu" before infinitive | | `PWAT` | `DET` | `PronType=int` | attributive interrogative pronoun | | `PWAV` | `ADV` | `PronType=int` | adverbial interrogative or relative pronoun | @@ -234,9 +230,9 @@ set. 
| `VAFIN` | `AUX` | `Mood=ind VerbForm=fin` | finite verb, auxiliary | | `VAIMP` | `AUX` | `Mood=imp VerbForm=fin` | imperative, auxiliary | | `VAINF` | `AUX` | `VerbForm=inf` | infinitive, auxiliary | -| `VAPP` | `AUX` | `Aspect=perf VerbForm=fin` | perfect participle, auxiliary | +| `VAPP` | `AUX` | `Aspect=perf VerbForm=part` | perfect participle, auxiliary | | `VMFIN` | `VERB` | `Mood=ind VerbForm=fin VerbType=mod` | finite verb, modal | -| `VMINF` | `VERB` | `VerbForm=fin VerbType=mod` | infinitive, modal | +| `VMINF` | `VERB` | `VerbForm=inf VerbType=mod` | infinitive, modal | | `VMPP` | `VERB` | `Aspect=perf VerbForm=part VerbType=mod` | perfect participle, modal | | `VVFIN` | `VERB` | `Mood=ind VerbForm=fin` | finite verb, full | | `VVIMP` | `VERB` | `Mood=imp VerbForm=fin` | imperative, full | @@ -244,8 +240,7 @@ set. | `VVIZU` | `VERB` | `VerbForm=inf` | infinitive with "zu", full | | `VVPP` | `VERB` | `Aspect=perf VerbForm=part` | perfect participle, full | | `XY` | `X` | | non-word containing non-letter | -| `SP` | `SPACE` | | space | - +| `_SP` | `SPACE` | | | --- diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index aa28a14d1..a37921f3c 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -155,21 +155,14 @@ $ python -m spacy convert [input_file] [output_dir] [--file-type] [--converter] ### Output file types {new="2.1"} -> #### Which format should I choose? -> -> If you're not sure, go with the default `jsonl`. Newline-delimited JSON means -> that there's one JSON object per line. Unlike a regular JSON file, it can also -> be read in line-by-line and you won't have to parse the _entire file_ first. -> This makes it a very convenient format for larger corpora. - All output files generated by this command are compatible with [`spacy train`](/api/cli#train). -| ID | Description | -| ------- | --------------------------------- | -| `jsonl` | Newline-delimited JSON (default). | -| `json` | Regular JSON. | -| `msg` | Binary MessagePack format. | +| ID | Description | +| ------- | -------------------------- | +| `json` | Regular JSON (default). | +| `jsonl` | Newline-delimited JSON. | +| `msg` | Binary MessagePack format. | ### Converter options @@ -453,8 +446,10 @@ improvement. ```bash $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] -[--width] [--depth] [--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length] [--min-length] -[--seed] [--n-iter] [--use-vectors] [--n-save_every] [--init-tok2vec] [--epoch-start] +[--width] [--depth] [--cnn-window] [--cnn-pieces] [--use-chars] [--sa-depth] +[--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length] +[--min-length] [--seed] [--n-iter] [--use-vectors] [--n-save_every] +[--init-tok2vec] [--epoch-start] ``` | Argument | Type | Description | @@ -464,6 +459,10 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] | `output_dir` | positional | Directory to write models to on each epoch. | | `--width`, `-cw` | option | Width of CNN layers. | | `--depth`, `-cd` | option | Depth of CNN layers. | +| `--cnn-window`, `-cW` 2.2.2 | option | Window size for CNN layers. | +| `--cnn-pieces`, `-cP` 2.2.2 | option | Maxout size for CNN layers. `1` for [Mish](https://github.com/digantamisra98/Mish). | +| `--use-chars`, `-chr` 2.2.2 | flag | Whether to use character-based embedding. | +| `--sa-depth`, `-sa` 2.2.2 | option | Depth of self-attention layers. | | `--embed-rows`, `-er` | option | Number of embedding rows. 
| | `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"L2"` or `"cosine"`. | | `--dropout`, `-d` | option | Dropout rate. | @@ -476,7 +475,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] | `--n-save-every`, `-se` | option | Save model every X batches. | | `--init-tok2vec`, `-t2v` 2.1 | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. | | `--epoch-start`, `-es` 2.1.5 | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files. | -| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. | +| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. | ### JSONL format for raw text {#pretrain-jsonl} diff --git a/website/docs/api/entityruler.md b/website/docs/api/entityruler.md index 607cb28ce..af3db0dcb 100644 --- a/website/docs/api/entityruler.md +++ b/website/docs/api/entityruler.md @@ -202,6 +202,14 @@ All labels present in the match patterns. | ----------- | ----- | ------------------ | | **RETURNS** | tuple | The string labels. | +## EntityRuler.ent_ids {#ent_ids tag="property" new="2.2.2"} + +All entity IDs present in the match patterns' `id` properties. + +| Name | Type | Description | +| ----------- | ----- | ------------------- | +| **RETURNS** | tuple | The string ent_ids. | + ## EntityRuler.patterns {#patterns tag="property"} Get all patterns that were added to the entity ruler. diff --git a/website/docs/api/language.md b/website/docs/api/language.md index c44339ff5..6e7f6be3e 100644 --- a/website/docs/api/language.md +++ b/website/docs/api/language.md @@ -323,18 +323,38 @@ you can use to undo your changes. > #### Example > > ```python -> with nlp.disable_pipes('tagger', 'parser'): +> # New API as of v2.2.2 +> with nlp.disable_pipes(["tagger", "parser"]): +> nlp.begin_training() +> +> with nlp.disable_pipes("tagger", "parser"): > nlp.begin_training() > -> disabled = nlp.disable_pipes('tagger', 'parser') +> disabled = nlp.disable_pipes("tagger", "parser") > nlp.begin_training() > disabled.restore() > ``` -| Name | Type | Description | | ----------- | --------------- | ------------------------------------------------------------------------------------ | -| `*disabled` | unicode | Names of pipeline components to disable. | -| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. | +| Name | Type | Description | +| ----------------------------------------- | --------------- | ------------------------------------------------------------------------------------ | +| `disabled` 2.2.2 | list | Names of pipeline components to disable. | +| `*disabled` | unicode | Names of pipeline components to disable. | +| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. | + + + +As of spaCy v2.2.2, the `Language.disable_pipes` method can also take a list of +component names as its first argument (instead of a variable number of +arguments). This is especially useful if you're generating the component names +to disable programmatically. The new syntax will become the default in the +future.
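For example, here is a minimal sketch of the list-based call (assuming the `en_core_web_sm` model is installed; the model name and the "keep only the entity recognizer" scenario are purely illustrative). It derives the component names from `nlp.pipe_names` and passes them in as a list; the migration diff below shows how an existing call changes:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any pipeline works

# Build the list of component names to disable programmatically,
# e.g. keep only the named entity recognizer.
to_disable = [name for name in nlp.pipe_names if name != "ner"]

with nlp.disable_pipes(to_disable):  # list argument, new in v2.2.2
    doc = nlp("Apple is opening its first big office in San Francisco.")
    print([(ent.text, ent.label_) for ent in doc.ents])
# The disabled components are restored when the block exits.
```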
+ +```diff +- disabled = nlp.disable_pipes("tagger", "parser") ++ disabled = nlp.disable_pipes(["tagger", "parser"]) +``` + + ## Language.to_disk {#to_disk tag="method" new="2"} diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md index 84d9ed888..bfd4fb0ec 100644 --- a/website/docs/api/matcher.md +++ b/website/docs/api/matcher.md @@ -157,16 +157,19 @@ overwritten. | `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | | `*patterns` | list | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. | - + -As of spaCy 2.0, `Matcher.add_pattern` and `Matcher.add_entity` are deprecated -and have been replaced with a simpler [`Matcher.add`](/api/matcher#add) that -lets you add a list of patterns and a callback for a given match ID. +As of spaCy 2.2.2, `Matcher.add` also supports the new API, which will become +the default in the future. The patterns are now the second argument and a list +(instead of a variable number of arguments). The `on_match` callback becomes an +optional keyword argument. ```diff -- matcher.add_entity("GoogleNow", on_match=merge_phrases) -- matcher.add_pattern("GoogleNow", [{ORTH: "Google"}, {ORTH: "Now"}]) -+ matcher.add('GoogleNow', merge_phrases, [{"ORTH": "Google"}, {"ORTH": "Now"}]) +patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]] +- matcher.add("GoogleNow", None, *patterns) ++ matcher.add("GoogleNow", patterns) +- matcher.add("GoogleNow", on_match, *patterns) ++ matcher.add("GoogleNow", patterns, on_match=on_match) ``` diff --git a/website/docs/api/phrasematcher.md b/website/docs/api/phrasematcher.md index 9d95522ac..c7311a401 100644 --- a/website/docs/api/phrasematcher.md +++ b/website/docs/api/phrasematcher.md @@ -153,6 +153,23 @@ overwritten. | `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | | `*docs` | `Doc` | `Doc` objects of the phrases to match. | + + +As of spaCy 2.2.2, `PhraseMatcher.add` also supports the new API, which will +become the default in the future. The `Doc` patterns are now the second argument +and a list (instead of a variable number of arguments). The `on_match` callback +becomes an optional keyword argument. + +```diff +patterns = [nlp("health care reform"), nlp("healthcare reform")] +- matcher.add("HEALTH", None, *patterns) ++ matcher.add("HEALTH", patterns) +- matcher.add("HEALTH", on_match, *patterns) ++ matcher.add("HEALTH", patterns, on_match=on_match) +``` + + + ## PhraseMatcher.remove {#remove tag="method" new="2.2"} Remove a rule from the matcher by match ID. A `KeyError` is raised if the key diff --git a/website/docs/images/displacy-ent-custom.html b/website/docs/images/displacy-ent-custom.html index 15294db49..709c6f631 100644 --- a/website/docs/images/displacy-ent-custom.html +++ b/website/docs/images/displacy-ent-custom.html @@ -1,9 +1,33 @@ -
[displaCy HTML snapshots (website/docs/images/displacy-ent-custom.html, displacy-ent-snek.html, displacy-ent1.html, displacy-ent2.html): these hunks appear to only reindent the snapshot markup. The rendered examples are unchanged: "But Google [ORG] is starting from behind ...", the 🐍 SNEK / 👨‍🌾 HUMAN demo, "Apple [ORG] is looking at buying U.K. [GPE] startup for $1 billion [MONEY]", and "When Sebastian Thrun [PERSON] started working on self-driving cars at Google [ORG] in 2007 [DATE] ...". The raw HTML markup did not survive extraction and is omitted here.]
diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index a375f416c..663ac5e5a 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -986,6 +986,37 @@ doc = nlp("Apple is opening its first big office in San Francisco.") print([(ent.text, ent.label_) for ent in doc.ents]) ``` +### Adding IDs to patterns {#entityruler-ent-ids new="2.2.2"} + +The [`EntityRuler`](/api/entityruler) can also accept an `id` attribute for each +pattern. Using the `id` attribute allows multiple patterns to be associated with +the same entity. + +```python +### {executable="true"} +from spacy.lang.en import English +from spacy.pipeline import EntityRuler + +nlp = English() +ruler = EntityRuler(nlp) +patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple"}, + {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"}, + {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"}] +ruler.add_patterns(patterns) +nlp.add_pipe(ruler) + +doc1 = nlp("Apple is opening its first big office in San Francisco.") +print([(ent.text, ent.label_, ent.ent_id_) for ent in doc1.ents]) + +doc2 = nlp("Apple is opening its first big office in San Fran.") +print([(ent.text, ent.label_, ent.ent_id_) for ent in doc2.ents]) +``` + +If the `id` attribute is included in the [`EntityRuler`](/api/entityruler) +patterns, the `ent_id_` property of the matched entity is set to the `id` given +in the patterns. So in the example above it's easy to identify that "San +Francisco" and "San Fran" are both the same entity. + The entity ruler is designed to integrate with spaCy's existing statistical models and enhance the named entity recognizer. If it's added **before the `"ner"` component**, the entity recognizer will respect the existing entity diff --git a/website/meta/languages.json b/website/meta/languages.json index 364b2ef6a..dbb300fbf 100644 --- a/website/meta/languages.json +++ b/website/meta/languages.json @@ -127,6 +127,7 @@ { "code": "sr", "name": "Serbian" }, { "code": "sk", "name": "Slovak" }, { "code": "sl", "name": "Slovenian" }, + { "code": "lb", "name": "Luxembourgish" }, { "code": "sq", "name": "Albanian",