Merge branch 'master' into spacy.io

Ines Montani 2019-10-31 17:30:42 +01:00
commit 07ba9b4aa2
88 changed files with 2370 additions and 565 deletions

.github/contributors/GiorgioPorgio.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | George Ketsopoulos |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 23 October 2019 |
| GitHub username | GiorgioPorgio |
| Website (optional) | |

.github/contributors/zhuorulin.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Zhuoru Lin |
| Company name (if applicable) | Bombora Inc. |
| Title or role (if applicable) | Data Scientist |
| Date | 2017-11-13 |
| GitHub username | ZhuoruLin |
| Website (optional) | |

View File

@@ -9,7 +9,7 @@ dist/spacy.pex : dist/spacy-$(sha).pex
 dist/spacy-$(sha).pex : dist/$(wheel)
 	env3.6/bin/python -m pip install pex==1.5.3
-	env3.6/bin/pex pytest dist/$(wheel) -e spacy -o dist/spacy-$(sha).pex
+	env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex
 dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
 	python3.6 -m venv env3.6

View File

@@ -135,8 +135,7 @@ Thanks to our great community, we've finally re-added conda support. You can now
 install spaCy via `conda-forge`:
 ```bash
-conda config --add channels conda-forge
-conda install spacy
+conda install -c conda-forge spacy
 ```
 For the feedstock including the build recipe and configuration, check out
@@ -214,16 +213,6 @@ doc = nlp("This is a sentence.")
 📖 **For more info and examples, check out the
 [models documentation](https://spacy.io/docs/usage/models).**
-### Support for older versions
-If you're using an older version (`v1.6.0` or below), you can still download and
-install the old models from within spaCy using `python -m spacy.en.download all`
-or `python -m spacy.de.download all`. The `.tar.gz` archives are also
-[attached to the v1.6.0 release](https://github.com/explosion/spaCy/tree/v1.6.0).
-To download and install the models manually, unpack the archive, drop the
-contained directory into `spacy/data` and load the model via `spacy.load('en')`
-or `spacy.load('de')`.
 ## Compile from source
 The other way to install spaCy is to clone its

View File

@@ -84,7 +84,7 @@ def read_conllu(file_):
 def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
     if text_loc.parts[-1].endswith(".conllu"):
         docs = []
-        with text_loc.open() as file_:
+        with text_loc.open(encoding="utf8") as file_:
             for conllu_doc in read_conllu(file_):
                 for conllu_sent in conllu_doc:
                     words = [line[1] for line in conllu_sent]
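
Review note: without an explicit `encoding`, `Path.open()` in text mode falls back to the platform's default encoding, which is not always UTF-8. The sketch below shows the pattern this hunk adopts, using a temporary file with a hypothetical name rather than real CoNLL-U data.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample file; the name and content are for illustration only.
    text_loc = Path(tmp_dir) / "sample.conllu"
    text_loc.write_text("1\tnaïve\t_\n", encoding="utf8")

    # Passing the encoding explicitly makes the read independent of the
    # platform's default text encoding.
    with text_loc.open(encoding="utf8") as file_:
        first_word = file_.readline().split("\t")[1]

print(first_word)
```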

View File

@@ -203,7 +203,7 @@ def golds_to_gold_tuples(docs, golds):
 def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
     if text_loc.parts[-1].endswith(".conllu"):
         docs = []
-        with text_loc.open() as file_:
+        with text_loc.open(encoding="utf8") as file_:
             for conllu_doc in read_conllu(file_):
                 for conllu_sent in conllu_doc:
                     words = [line[1] for line in conllu_sent]
@@ -378,7 +378,7 @@ def _load_pretrained_tok2vec(nlp, loc):
     """Load pretrained weights for the 'token-to-vector' part of the component
     models, which is typically a CNN. See 'spacy pretrain'. Experimental.
     """
-    with Path(loc).open("rb") as file_:
+    with Path(loc).open("rb", encoding="utf8") as file_:
         weights_data = file_.read()
     loaded = []
     for name, component in nlp.pipeline:
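
Review note: this hunk passes `encoding="utf8"` together with mode `"rb"`. In Python 3, opening a file in binary mode with an `encoding` argument raises `ValueError`, so the changed line looks likely to fail when `_load_pretrained_tok2vec` is called. A minimal reproduction, using a temporary file with a hypothetical name in place of the real weights file:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for the pretrained weights file.
    loc = Path(tmp_dir) / "weights.bin"
    loc.write_bytes(b"\x00\x01")

    # Binary mode combined with an encoding is rejected by open().
    try:
        with loc.open("rb", encoding="utf8") as file_:
            file_.read()
        error_message = None
    except ValueError as err:
        error_message = str(err)

    # Binary mode on its own reads the raw bytes as intended.
    with loc.open("rb") as file_:
        weights_data = file_.read()

print(error_message)
```

Dropping the `encoding` argument for the `"rb"` call would keep the original behaviour; the text-mode calls in the other hunks are unaffected.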
@@ -519,8 +519,8 @@ def main(
     for i in range(config.nr_epoch):
         docs, golds = read_data(
             nlp,
-            paths.train.conllu.open(),
-            paths.train.text.open(),
+            paths.train.conllu.open(encoding="utf8"),
+            paths.train.text.open(encoding="utf8"),
             max_doc_length=config.max_doc_length,
             limit=limit,
             oracle_segments=use_oracle_segments,
@@ -560,7 +560,7 @@ def main(
 def _render_parses(i, to_render):
     to_render[0].user_data["title"] = "Batch %d" % i
-    with Path("/tmp/parses.html").open("w") as file_:
+    with Path("/tmp/parses.html").open("w", encoding="utf8") as file_:
         html = displacy.render(to_render[:5], style="dep", page=True)
         file_.write(html)

View File

@@ -77,6 +77,8 @@ def main(
     if labels_discard:
         labels_discard = [x.strip() for x in labels_discard.split(",")]
         logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard))
+    else:
+        labels_discard = []
     train_data = wikipedia_processor.read_training(
         nlp=nlp,
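
Review note: the added `else` branch means `labels_discard` is always a list by the time it is passed on, presumably so that later code can iterate over it or test membership without a `None` check. A small sketch of the same normalisation; `parse_labels_discard` is a hypothetical helper, not a function in the script:

```python
def parse_labels_discard(labels_discard=None):
    """Turn a comma-separated option string into a list (hypothetical helper)."""
    if labels_discard:
        return [x.strip() for x in labels_discard.split(",")]
    # Mirrors the new else branch: fall back to an empty list, not None.
    return []


print(parse_labels_discard("PERSON, ORG"))
print("ORG" in parse_labels_discard(None))
```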

View File

@@ -18,19 +18,21 @@ during training. We discard the auxiliary model before run-time.
 The specific example here is not necessarily a good idea --- but it shows
 how an arbitrary objective function for some word can be used.
-Developed and tested for spaCy 2.0.6
+Developed and tested for spaCy 2.0.6. Updated for v2.2.2
 """
 import random
 import plac
 import spacy
 import os.path
+from spacy.tokens import Doc
 from spacy.gold import read_json_file, GoldParse
 random.seed(0)
 PWD = os.path.dirname(__file__)
-TRAIN_DATA = list(read_json_file(os.path.join(PWD, "training-data.json")))
+TRAIN_DATA = list(read_json_file(
+    os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json")))
 def get_position_label(i, words, tags, heads, labels, ents):
@@ -55,6 +57,7 @@ def main(n_iter=10):
     ner = nlp.create_pipe("ner")
     ner.add_multitask_objective(get_position_label)
     nlp.add_pipe(ner)
+    print(nlp.pipeline)
     print("Create data", len(TRAIN_DATA))
     optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA)
@@ -62,23 +65,24 @@ def main(n_iter=10):
         random.shuffle(TRAIN_DATA)
         losses = {}
         for text, annot_brackets in TRAIN_DATA:
-            annotations, _ = annot_brackets
-            doc = nlp.make_doc(text)
-            gold = GoldParse.from_annot_tuples(doc, annotations[0])
-            nlp.update(
-                [doc],  # batch of texts
-                [gold],  # batch of annotations
-                drop=0.2,  # dropout - make it harder to memorise data
-                sgd=optimizer,  # callable to update weights
-                losses=losses,
-            )
+            for annotations, _ in annot_brackets:
+                doc = Doc(nlp.vocab, words=annotations[1])
+                gold = GoldParse.from_annot_tuples(doc, annotations)
+                nlp.update(
+                    [doc],  # batch of texts
+                    [gold],  # batch of annotations
+                    drop=0.2,  # dropout - make it harder to memorise data
+                    sgd=optimizer,  # callable to update weights
+                    losses=losses,
+                )
         print(losses.get("nn_labeller", 0.0), losses["ner"])
     # test the trained model
     for text, _ in TRAIN_DATA:
-        doc = nlp(text)
-        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
-        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
+        if text is not None:
+            doc = nlp(text)
+            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
+            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
 if __name__ == "__main__":

View File

@@ -1,10 +1,10 @@
 # Our libraries
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=7.2.0,<7.3.0
+thinc>=7.3.0,<7.4.0
 blis>=0.4.0,<0.5.0
 murmurhash>=0.28.0,<1.1.0
-wasabi>=0.2.0,<1.1.0
+wasabi>=0.3.0,<1.1.0
 srsly>=0.1.0,<1.1.0
 # Third party dependencies
 numpy>=1.15.0

View File

@@ -38,18 +38,18 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc>=7.2.0,<7.3.0
+    thinc>=7.3.0,<7.4.0
 install_requires =
     setuptools
     numpy>=1.15.0
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc>=7.2.0,<7.3.0
+    thinc>=7.3.0,<7.4.0
     blis>=0.4.0,<0.5.0
     plac>=0.9.6,<1.2.0
     requests>=2.13.0,<3.0.0
-    wasabi>=0.2.0,<1.1.0
+    wasabi>=0.3.0,<1.1.0
     srsly>=0.1.0,<1.1.0
     pathlib==1.0.1; python_version < "3.4"
     importlib_metadata>=0.20; python_version < "3.8"

View File

@@ -9,12 +9,14 @@ warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
 # These are imported as part of the API
 from thinc.neural.util import prefer_gpu, require_gpu
+from . import pipeline
 from .cli.info import info as cli_info
 from .glossary import explain
 from .about import __version__
 from .errors import Errors, Warnings, deprecation_warning
 from . import util
 from .util import register_architecture, get_architecture
+from .language import component
 if sys.maxunicode == 65535:

View File

@@ -3,16 +3,14 @@ from __future__ import unicode_literals
 import numpy
 from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu
-from thinc.i2v import HashEmbed, StaticVectors
 from thinc.t2t import ExtractWindow, ParametricAttention
 from thinc.t2v import Pooling, sum_pool, mean_pool
-from thinc.misc import Residual
+from thinc.i2v import HashEmbed
+from thinc.misc import Residual, FeatureExtracter
 from thinc.misc import LayerNorm as LN
-from thinc.misc import FeatureExtracter
 from thinc.api import add, layerize, chain, clone, concatenate, with_flatten
 from thinc.api import with_getitem, flatten_add_lengths
 from thinc.api import uniqued, wrap, noop
-from thinc.api import with_square_sequences
 from thinc.linear.linear import LinearModel
 from thinc.neural.ops import NumpyOps, CupyOps
 from thinc.neural.util import get_array_module, copy_array
@@ -26,14 +24,13 @@ import thinc.extra.load_nlp
 from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
 from .errors import Errors, user_warning, Warnings
 from . import util
+from . import ml as new_ml
+from .ml import _legacy_tok2vec
-try:
-    import torch.nn
-    from thinc.extra.wrappers import PyTorchWrapperRNN
-except ImportError:
-    torch = None
 VECTORS_KEY = "spacy_pretrained_vectors"
+# Backwards compatibility with <2.2.2
+USE_MODEL_REGISTRY_TOK2VEC = False
 def cosine(vec1, vec2):
@@ -310,6 +307,10 @@ def link_vectors_to_models(vocab):
 def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
+    import torch.nn
+    from thinc.api import with_square_sequences
+    from thinc.extra.wrappers import PyTorchWrapperRNN
     if depth == 0:
         return layerize(noop())
     model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout)
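
Review note: together with the removal of the module-level `try`/`except ImportError` block, this hunk moves the PyTorch imports into the function body, so the optional dependency is only resolved when `PyTorchBiLSTM` is actually called. The sketch below illustrates that deferred-import pattern in isolation. The module name is deliberately hypothetical and assumed not to be installed, and `make_encoder` is not part of spaCy.

```python
def make_encoder(depth):
    # Deferred import: resolved when the function is called, not when the
    # enclosing module is imported. The module name is hypothetical.
    import hypothetical_backend_that_is_not_installed as backend

    return backend.Encoder(depth)


# Defining make_encoder succeeded even though the backend is missing;
# the ImportError only surfaces at call time.
try:
    make_encoder(2)
    call_failed = False
except ImportError:
    call_failed = True

print(call_failed)
```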
@@ -317,81 +318,89 @@ def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
 def Tok2Vec(width, embed_size, **kwargs):
+    if not USE_MODEL_REGISTRY_TOK2VEC:
+        # Preserve prior tok2vec for backwards compat, in v2.2.2
+        return _legacy_tok2vec.Tok2Vec(width, embed_size, **kwargs)
     pretrained_vectors = kwargs.get("pretrained_vectors", None)
     cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
     subword_features = kwargs.get("subword_features", True)
     char_embed = kwargs.get("char_embed", False)
-    if char_embed:
-        subword_features = False
     conv_depth = kwargs.get("conv_depth", 4)
     bilstm_depth = kwargs.get("bilstm_depth", 0)
-    cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
-    with Model.define_operators(
-        {">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply}
-    ):
-        norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm")
-        if subword_features:
-            prefix = HashEmbed(
-                width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix"
-            )
-            suffix = HashEmbed(
-                width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix"
-            )
-            shape = HashEmbed(
-                width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape"
-            )
-        else:
-            prefix, suffix, shape = (None, None, None)
-        if pretrained_vectors is not None:
-            glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
-            if subword_features:
-                embed = uniqued(
-                    (glove | norm | prefix | suffix | shape)
-                    >> LN(Maxout(width, width * 5, pieces=3)),
-                    column=cols.index(ORTH),
-                )
-            else:
-                embed = uniqued(
-                    (glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
-                    column=cols.index(ORTH),
-                )
-        elif subword_features:
-            embed = uniqued(
-                (norm | prefix | suffix | shape)
-                >> LN(Maxout(width, width * 4, pieces=3)),
-                column=cols.index(ORTH),
-            )
-        elif char_embed:
-            embed = concatenate_lists(
-                CharacterEmbed(nM=64, nC=8),
-                FeatureExtracter(cols) >> with_flatten(norm),
-            )
-            reduce_dimensions = LN(
-                Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces)
-            )
-        else:
-            embed = norm
-        convolution = Residual(
-            ExtractWindow(nW=1)
-            >> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces))
-        )
-        if char_embed:
-            tok2vec = embed >> with_flatten(
-                reduce_dimensions >> convolution ** conv_depth, pad=conv_depth
-            )
-        else:
-            tok2vec = FeatureExtracter(cols) >> with_flatten(
-                embed >> convolution ** conv_depth, pad=conv_depth
-            )
-        if bilstm_depth >= 1:
-            tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth)
-        # Work around thinc API limitations :(. TODO: Revise in Thinc 7
-        tok2vec.nO = width
-        tok2vec.embed = embed
-    return tok2vec
+    conv_window = kwargs.get("conv_window", 1)
+    cols = ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]
+    doc2feats_cfg = {"arch": "spacy.Doc2Feats.v1", "config": {"columns": cols}}
+    if char_embed:
+        embed_cfg = {
+            "arch": "spacy.CharacterEmbed.v1",
+            "config": {
+                "width": 64,
+                "chars": 6,
+                "@mix": {
+                    "arch": "spacy.LayerNormalizedMaxout.v1",
+                    "config": {"width": width, "pieces": 3},
+                },
+                "@embed_features": None,
+            },
+        }
+    else:
+        embed_cfg = {
+            "arch": "spacy.MultiHashEmbed.v1",
+            "config": {
+                "width": width,
+                "rows": embed_size,
+                "columns": cols,
+                "use_subwords": subword_features,
+                "@pretrained_vectors": None,
+                "@mix": {
+                    "arch": "spacy.LayerNormalizedMaxout.v1",
+                    "config": {"width": width, "pieces": 3},
+                },
+            },
+        }
+        if pretrained_vectors:
+            embed_cfg["config"]["@pretrained_vectors"] = {
+                "arch": "spacy.PretrainedVectors.v1",
+                "config": {
+                    "vectors_name": pretrained_vectors,
+                    "width": width,
+                    "column": cols.index("ID"),
+                },
+            }
+    if cnn_maxout_pieces >= 2:
+        cnn_cfg = {
+            "arch": "spacy.MaxoutWindowEncoder.v1",
+            "config": {
+                "width": width,
+                "window_size": conv_window,
+                "pieces": cnn_maxout_pieces,
+                "depth": conv_depth,
+            },
+        }
+    else:
+        cnn_cfg = {
+            "arch": "spacy.MishWindowEncoder.v1",
+            "config": {"width": width, "window_size": conv_window, "depth": conv_depth},
+        }
+    bilstm_cfg = {
+        "arch": "spacy.TorchBiLSTMEncoder.v1",
+        "config": {"width": width, "depth": bilstm_depth},
+    }
+    if conv_depth == 0 and bilstm_depth == 0:
+        encode_cfg = {}
+    elif conv_depth >= 1 and bilstm_depth >= 1:
+        encode_cfg = {
+            "arch": "thinc.FeedForward.v1",
+            "config": {"children": [cnn_cfg, bilstm_cfg]},
+        }
+    elif conv_depth >= 1:
+        encode_cfg = cnn_cfg
+    else:
+        encode_cfg = bilstm_cfg
+    config = {"@doc2feats": doc2feats_cfg, "@embed": embed_cfg, "@encode": encode_cfg}
+    return new_ml.Tok2Vec(config)
 def reapply(layer, n_times):
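
Review note: the rewritten `Tok2Vec` no longer composes layers directly. It builds nested dictionaries of the form `{"arch": name, "config": {...}}`, where keys prefixed with `@` hold sub-configurations, and hands the result to `new_ml.Tok2Vec`. How `spacy.ml` resolves these dictionaries is not shown in this diff. The toy resolver below is only an assumption about the general idea, with a hypothetical registry and hypothetical `toy.*` architecture names.

```python
REGISTRY = {}  # hypothetical registry mapping architecture names to factories


def register(name):
    def decorator(func):
        REGISTRY[name] = func
        return func

    return decorator


def resolve(cfg):
    """Build the object described by {"arch": name, "config": {...}}."""
    if not cfg:
        return None  # empty or missing sub-config, e.g. "@pretrained_vectors": None
    kwargs = {}
    for key, value in cfg.get("config", {}).items():
        if key.startswith("@"):
            # "@"-prefixed keys are treated as nested configs and resolved first.
            kwargs[key[1:]] = resolve(value)
        else:
            kwargs[key] = value
    return REGISTRY[cfg["arch"]](**kwargs)


@register("toy.Mix.v1")
def make_mix(width, pieces):
    return ("mix", width, pieces)


@register("toy.Embed.v1")
def make_embed(width, rows, mix, pretrained_vectors=None):
    return ("embed", width, rows, mix, pretrained_vectors)


embed_cfg = {
    "arch": "toy.Embed.v1",
    "config": {
        "width": 96,
        "rows": 2000,
        "@pretrained_vectors": None,
        "@mix": {"arch": "toy.Mix.v1", "config": {"width": 96, "pieces": 3}},
    },
}
embed = resolve(embed_cfg)
print(embed)
```

One plausible benefit of this layout is that swapping an encoder only changes a string and a small config dictionary, which is what the `cnn_cfg` and `encode_cfg` branches in the hunk do.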

View File

@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "2.2.2.dev1"
+__version__ = "2.2.2"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

spacy/analysis.py (new file)

@@ -0,0 +1,179 @@
# coding: utf8
from __future__ import unicode_literals
from collections import OrderedDict
from wasabi import Printer
from .tokens import Doc, Token, Span
from .errors import Errors, Warnings, user_warning
def analyze_pipes(pipeline, name, pipe, index, warn=True):
    """Analyze a pipeline component with respect to its position in the current
    pipeline and the other components. Will check whether requirements are
    fulfilled (e.g. if previous components assign the attributes).

    pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
    name (unicode): The name of the pipeline component to analyze.
    pipe (callable): The pipeline component function to analyze.
    index (int): The index of the component in the pipeline.
    warn (bool): Show user warning if problem is found.
    RETURNS (list): The problems found for the given pipeline component.
    """
    assert pipeline[index][0] == name
    prev_pipes = pipeline[:index]
    pipe_requires = getattr(pipe, "requires", [])
    requires = OrderedDict([(annot, False) for annot in pipe_requires])
    if requires:
        for prev_name, prev_pipe in prev_pipes:
            prev_assigns = getattr(prev_pipe, "assigns", [])
            for annot in prev_assigns:
                requires[annot] = True
    problems = []
    for annot, fulfilled in requires.items():
        if not fulfilled:
            problems.append(annot)
            if warn:
                user_warning(Warnings.W025.format(name=name, attr=annot))
    return problems
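
Review note: the check in `analyze_pipes` boils down to "is every attribute this component requires assigned by some earlier component?". The standalone sketch below mirrors that logic without importing spaCy. The `Component` class, the `find_unmet_requirements` name and the example pipeline are hypothetical stand-ins chosen for illustration.

```python
from collections import OrderedDict


class Component:
    """Hypothetical stand-in for a pipeline component with metadata."""

    def __init__(self, requires=(), assigns=()):
        self.requires = list(requires)
        self.assigns = list(assigns)

    def __call__(self, doc):
        return doc


def find_unmet_requirements(pipeline, index):
    """Return the required attributes that no earlier component assigns."""
    _, pipe = pipeline[index]
    requires = OrderedDict((annot, False) for annot in getattr(pipe, "requires", []))
    for _, prev_pipe in pipeline[:index]:
        for annot in getattr(prev_pipe, "assigns", []):
            if annot in requires:
                requires[annot] = True
    return [annot for annot, fulfilled in requires.items() if not fulfilled]


pipeline = [
    ("tagger", Component(assigns=["token.tag"])),
    ("linker", Component(requires=["doc.ents", "token.tag"])),
]
problems = find_unmet_requirements(pipeline, 1)
print(problems)
```

Here nothing before `linker` assigns `doc.ents`, so it is the only attribute reported.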
def analyze_all_pipes(pipeline, warn=True):
    """Analyze all pipes in the pipeline in order.

    pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
    warn (bool): Show user warning if problem is found.
    RETURNS (dict): The problems found, keyed by component name.
    """
    problems = {}
    for i, (name, pipe) in enumerate(pipeline):
        problems[name] = analyze_pipes(pipeline, name, pipe, i, warn=warn)
    return problems
def dot_to_dict(values):
    """Convert dot notation to a dict. For example: ["token.pos", "token._.xyz"]
    become {"token": {"pos": True, "_": {"xyz": True }}}.

    values (iterable): The values to convert.
    RETURNS (dict): The converted values.
    """
    result = {}
    for value in values:
        path = result
        parts = value.lower().split(".")
        for i, item in enumerate(parts):
            is_last = i == len(parts) - 1
            path = path.setdefault(item, True if is_last else {})
    return result
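
Review note: `dot_to_dict` depends only on built-ins, so the docstring example can be checked directly. The block below repeats the function body from this file and runs it on the docstring's input.

```python
def dot_to_dict(values):
    """Convert dot notation strings into a nested dict (copied from the diff)."""
    result = {}
    for value in values:
        path = result
        parts = value.lower().split(".")
        for i, item in enumerate(parts):
            is_last = i == len(parts) - 1
            # Intermediate parts become dicts, the final part becomes True.
            path = path.setdefault(item, True if is_last else {})
    return result


converted = dot_to_dict(["token.pos", "token._.xyz"])
print(converted)  # {'token': {'pos': True, '_': {'xyz': True}}}
```

Because the input is lower-cased first, `"Token.POS"` and `"token.pos"` map to the same entry.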


def validate_attrs(values):
    """Validate component attributes provided to "assigns", "requires" etc.
    Raises error for invalid attributes and formatting. Doesn't check if
    custom extension attributes are registered, since this is something the
    user might want to do themselves later in the component.

    values (iterable): The string attributes to check, e.g. `["token.pos"]`.
    RETURNS (iterable): The checked attributes.
    """
    data = dot_to_dict(values)
    objs = {"doc": Doc, "token": Token, "span": Span}
    for obj_key, attrs in data.items():
        if obj_key == "span":
            # Support Span only for custom extension attributes
            span_attrs = [attr for attr in values if attr.startswith("span.")]
            span_attrs = [attr for attr in span_attrs if not attr.startswith("span._.")]
            if span_attrs:
                raise ValueError(Errors.E180.format(attrs=", ".join(span_attrs)))
        if obj_key not in objs:  # first element is not doc/token/span
            invalid_attrs = ", ".join(a for a in values if a.startswith(obj_key))
            raise ValueError(Errors.E181.format(obj=obj_key, attrs=invalid_attrs))
        if not isinstance(attrs, dict):  # attr is something like "doc"
            raise ValueError(Errors.E182.format(attr=obj_key))
        for attr, value in attrs.items():
            if attr == "_":
                if value is True:  # attr is something like "doc._"
                    raise ValueError(Errors.E182.format(attr="{}._".format(obj_key)))
                for ext_attr, ext_value in value.items():
                    # We don't check whether the attribute actually exists
                    if ext_value is not True:  # attr is something like doc._.x.y
                        good = "{}._.{}".format(obj_key, ext_attr)
                        bad = "{}.{}".format(good, ".".join(ext_value))
                        raise ValueError(Errors.E183.format(attr=bad, solution=good))
                continue  # we can't validate those further
            if attr.endswith("_"):  # attr is something like "token.pos_"
                raise ValueError(Errors.E184.format(attr=attr, solution=attr[:-1]))
            if value is not True:  # attr is something like doc.x.y
                good = "{}.{}".format(obj_key, attr)
                bad = "{}.{}".format(good, ".".join(value))
                raise ValueError(Errors.E183.format(attr=bad, solution=good))
            obj = objs[obj_key]
            if not hasattr(obj, attr):
                raise ValueError(Errors.E185.format(obj=obj_key, attr=attr))
    return values
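
To make the order of the checks easier to follow, here is a reduced sketch that runs without spaCy installed. `FakeDoc`, `FakeToken` and `check` are hypothetical stand-ins invented for this example; the `span` branch and the nested extension check are left out, and short strings replace the `Errors.E18x` messages.

```python
def dot_to_dict(values):
    result = {}
    for value in values:
        path = result
        parts = value.lower().split(".")
        for i, item in enumerate(parts):
            path = path.setdefault(item, True if i == len(parts) - 1 else {})
    return result


class FakeDoc(object):  # hypothetical stand-in for spacy.tokens.Doc
    tensor = None


class FakeToken(object):  # hypothetical stand-in for spacy.tokens.Token
    pos = None


def check(values):
    """Reduced version of the checks above. Returns a short reason or "ok"."""
    objs = {"doc": FakeDoc, "token": FakeToken}
    for obj_key, attrs in dot_to_dict(values).items():
        if obj_key not in objs:
            return "unknown object"
        if not isinstance(attrs, dict):
            return "no attribute given"
        for attr, value in attrs.items():
            if attr == "_":
                continue  # custom extensions are not checked further here
            if attr.endswith("_"):
                return "underscore attribute"
            if value is not True:
                return "nested attribute"
            if not hasattr(objs[obj_key], attr):
                return "missing attribute"
    return "ok"


examples = (["token.pos"], ["token.pos_"], ["doc"], ["doc.x.y"], ["foo.bar"], ["doc._.custom"])
for declaration in examples:
    print(declaration, "->", check(declaration))
```

The real function raises a `ValueError` at the first failing check instead of returning a reason string.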


def _get_feature_for_attr(pipeline, attr, feature):
    assert feature in ["assigns", "requires"]
    result = []
    for pipe_name, pipe in pipeline:
        pipe_assigns = getattr(pipe, feature, [])
        if attr in pipe_assigns:
            result.append((pipe_name, pipe))
    return result


def get_assigns_for_attr(pipeline, attr):
    """Get all pipeline components that assign an attr, e.g. "doc.tensor".

    pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
    attr (unicode): The attribute to check.
    RETURNS (list): (name, pipe) tuples of components that assign the attr.
    """
    return _get_feature_for_attr(pipeline, attr, "assigns")


def get_requires_for_attr(pipeline, attr):
    """Get all pipeline components that require an attr, e.g. "doc.tensor".

    pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
    attr (unicode): The attribute to check.
    RETURNS (list): (name, pipe) tuples of components that require the attr.
    """
    return _get_feature_for_attr(pipeline, attr, "requires")


def print_summary(nlp, pretty=True, no_print=False):
    """Print a formatted summary for the current nlp object's pipeline. Shows
    a table with the pipeline components and what they assign and require, as
    well as any problems if available.

    nlp (Language): The nlp object.
    pretty (bool): Pretty-print the results (color etc).
    no_print (bool): Don't print anything, just return the data.
    RETURNS (dict): A dict with "overview" and "problems".
    """
    msg = Printer(pretty=pretty, no_print=no_print)
    overview = []
    problems = {}
    for i, (name, pipe) in enumerate(nlp.pipeline):
        requires = getattr(pipe, "requires", [])
        assigns = getattr(pipe, "assigns", [])
        retok = getattr(pipe, "retokenizes", False)
        overview.append((i, name, requires, assigns, retok))
        problems[name] = analyze_pipes(nlp.pipeline, name, pipe, i, warn=False)
    msg.divider("Pipeline Overview")
    header = ("#", "Component", "Requires", "Assigns", "Retokenizes")
    msg.table(overview, header=header, divider=True, multiline=True)
    n_problems = sum(len(p) for p in problems.values())
    if any(p for p in problems.values()):
        msg.divider("Problems ({})".format(n_problems))
        for name, problem in problems.items():
            if problem:
                problem = ", ".join(problem)
                msg.warn("'{}' requirements not met: {}".format(name, problem))
    else:
        msg.good("No problems found.")
    if no_print:
        return {"overview": overview, "problems": problems}

@@ -57,7 +57,7 @@ def convert(
     is written to stdout, so you can pipe them forward to a JSON file:
     $ spacy convert some_file.conllu > some_file.json
     """
-    no_print = (output_dir == "-")
+    no_print = output_dir == "-"
     msg = Printer(no_print=no_print)
     input_path = Path(input_file)
     if file_type not in FILE_TYPES:

@@ -9,7 +9,9 @@ from ...tokens.doc import Doc
 from ...util import load_model


-def conll_ner2json(input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs):
+def conll_ner2json(
+    input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs
+):
     """
     Convert files in the CoNLL-2003 NER format and similar
     whitespace-separated columns into JSON format for use with train cli.

@@ -35,6 +35,10 @@ from .train import _load_pretrained_tok2vec
     output_dir=("Directory to write models to on each epoch", "positional", None, str),
     width=("Width of CNN layers", "option", "cw", int),
     depth=("Depth of CNN layers", "option", "cd", int),
+    cnn_window=("Window size for CNN layers", "option", "cW", int),
+    cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int),
+    use_chars=("Whether to use character-based embedding", "flag", "chr", bool),
+    sa_depth=("Depth of self-attention layers", "option", "sa", int),
     bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int),
     embed_rows=("Number of embedding rows", "option", "er", int),
     loss_func=(
@@ -81,7 +85,11 @@ def pretrain(
     output_dir,
     width=96,
     depth=4,
-    bilstm_depth=2,
+    bilstm_depth=0,
+    cnn_pieces=3,
+    sa_depth=0,
+    use_chars=False,
+    cnn_window=1,
     embed_rows=2000,
     loss_func="cosine",
     use_vectors=False,
@@ -158,8 +166,8 @@ def pretrain(
             conv_depth=depth,
             pretrained_vectors=pretrained_vectors,
             bilstm_depth=bilstm_depth,  # Requires PyTorch. Experimental.
-            cnn_maxout_pieces=3,  # You can try setting this higher
-            subword_features=True,  # Set to False for Chinese etc
+            subword_features=not use_chars,  # Set to False for Chinese etc
+            cnn_maxout_pieces=cnn_pieces,  # If set to 1, use Mish activation.
         ),
     )
     # Load in pretrained weights

@@ -156,8 +156,7 @@ def train(
                 "`lang` argument ('{}') ".format(nlp.lang, lang),
                 exits=1,
             )
-        other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipeline]
-        nlp.disable_pipes(*other_pipes)
+        nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline])
         for pipe in pipeline:
             if pipe not in nlp.pipe_names:
                 if pipe == "parser":
@@ -263,7 +262,11 @@ def train(
             exits=1,
         )
     train_docs = corpus.train_docs(
-        nlp, noise_level=noise_level, gold_preproc=gold_preproc, max_length=0
+        nlp,
+        noise_level=noise_level,
+        gold_preproc=gold_preproc,
+        max_length=0,
+        ignore_misaligned=True,
     )
     train_labels = set()
     if textcat_multilabel:
@@ -344,6 +347,7 @@ def train(
                 orth_variant_level=orth_variant_level,
                 gold_preproc=gold_preproc,
                 max_length=0,
+                ignore_misaligned=True,
             )
             if raw_text:
                 random.shuffle(raw_text)
@@ -382,7 +386,11 @@ def train(
                             if hasattr(component, "cfg"):
                                 component.cfg["beam_width"] = beam_width
                         dev_docs = list(
-                            corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
+                            corpus.dev_docs(
+                                nlp_loaded,
+                                gold_preproc=gold_preproc,
+                                ignore_misaligned=True,
+                            )
                         )
                         nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
                         start_time = timer()
@@ -399,7 +407,11 @@ def train(
                                     if hasattr(component, "cfg"):
                                         component.cfg["beam_width"] = beam_width
                                 dev_docs = list(
-                                    corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
+                                    corpus.dev_docs(
+                                        nlp_loaded,
+                                        gold_preproc=gold_preproc,
+                                        ignore_misaligned=True,
+                                    )
                                 )
                                 start_time = timer()
                                 scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose)

@@ -12,6 +12,7 @@ import os
 import sys
 import itertools
 import ast
+import types

 from thinc.neural.util import copy_array

@@ -67,6 +68,7 @@ if is_python2:
     basestring_ = basestring  # noqa: F821
     input_ = raw_input  # noqa: F821
     path2str = lambda path: str(path).decode("utf8")
+    class_types = (type, types.ClassType)

 elif is_python3:
     bytes_ = bytes
@@ -74,6 +76,7 @@ elif is_python3:
     basestring_ = str
     input_ = input
     path2str = lambda path: str(path)
+    class_types = (type, types.ClassType) if is_python_pre_3_5 else type


 def b_to_str(b_str):

@@ -44,14 +44,14 @@ TPL_ENTS = """
 TPL_ENT = """
-<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
+<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
     {text}
     <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">{label}</span>
 </mark>
 """

 TPL_ENT_RTL = """
-<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
+<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em">
     {text}
     <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
 </mark>

@@ -99,6 +99,8 @@ class Warnings(object):
             "'n_process' will be set to 1.")
     W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in "
             "the Knowledge Base.")
+    W025 = ("'{name}' requires '{attr}' to be assigned, but none of the "
+            "previous components in the pipeline declare that they assign it.")


 @add_codes
@@ -504,6 +506,29 @@ class Errors(object):
     E175 = ("Can't remove rule for unknown match pattern ID: {key}")
     E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
     E177 = ("Ill-formed IOB input detected: {tag}")
+    E178 = ("Invalid pattern. Expected list of dicts but got: {pat}. Maybe you "
+            "accidentally passed a single pattern to Matcher.add instead of a "
+            "list of patterns? If you only want to add one pattern, make sure "
+            "to wrap it in a list. For example: matcher.add('{key}', [pattern])")
+    E179 = ("Invalid pattern. Expected a list of Doc objects but got a single "
+            "Doc. If you only want to add one pattern, make sure to wrap it "
+            "in a list. For example: matcher.add('{key}', [doc])")
+    E180 = ("Span attributes can't be declared as required or assigned by "
+            "components, since spans are only views of the Doc. Use Doc and "
+            "Token attributes (or custom extension attributes) only and remove "
+            "the following: {attrs}")
+    E181 = ("Received invalid attributes for unknown object {obj}: {attrs}. "
+            "Only Doc and Token attributes are supported.")
+    E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
+            "to define the attribute? For example: {attr}.???")
+    E183 = ("Received invalid attribute declaration: {attr}\nOnly top-level "
+            "attributes are supported, for example: {solution}")
+    E184 = ("Only attributes without underscores are supported in component "
+            "attribute declarations (because underscore and non-underscore "
+            "attributes are connected anyway): {attr} -> {solution}")
+    E185 = ("Received invalid attribute in component attribute declaration: "
+            "{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
+    E186 = ("'{tok_a}' and '{tok_b}' are different texts.")


 @add_codes
@@ -536,6 +561,10 @@ class MatchPatternError(ValueError):
         ValueError.__init__(self, msg)


+class AlignmentError(ValueError):
+    pass
+
+
 class ModelsWarning(UserWarning):
     pass

@@ -80,7 +80,7 @@ GLOSSARY = {
     "RBR": "adverb, comparative",
     "RBS": "adverb, superlative",
     "RP": "adverb, particle",
-    "TO": "infinitival to",
+    "TO": 'infinitival "to"',
     "UH": "interjection",
     "VB": "verb, base form",
     "VBD": "verb, past tense",
@@ -279,6 +279,12 @@ GLOSSARY = {
     "re": "repeated element",
     "rs": "reported speech",
     "sb": "subject",
+    "sb": "subject",
+    "sbp": "passivized subject (PP)",
+    "sp": "subject or predicate",
+    "svp": "separable verb prefix",
+    "uc": "unit component",
+    "vo": "vocative",
     # Named Entity Recognition
     # OntoNotes 5
     # https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf

@@ -11,10 +11,9 @@ import itertools
 from pathlib import Path
 import srsly

-from . import _align
 from .syntax import nonproj
 from .tokens import Doc, Span
-from .errors import Errors
+from .errors import Errors, AlignmentError
 from .compat import path2str
 from . import util
 from .util import minibatch, itershuffle
@@ -22,6 +21,7 @@ from .util import minibatch, itershuffle
 from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek

+USE_NEW_ALIGN = False
 punct_re = re.compile(r"\W")

@@ -56,10 +56,10 @@ def tags_to_entities(tags):

 def merge_sents(sents):
     m_deps = [[], [], [], [], [], []]
+    m_cats = {}
     m_brackets = []
-    m_cats = sents.pop()
     i = 0
-    for (ids, words, tags, heads, labels, ner), brackets in sents:
+    for (ids, words, tags, heads, labels, ner), (cats, brackets) in sents:
         m_deps[0].extend(id_ + i for id_ in ids)
         m_deps[1].extend(words)
         m_deps[2].extend(tags)
@@ -68,12 +68,26 @@ def merge_sents(sents):
         m_deps[5].extend(ner)
         m_brackets.extend((b["first"] + i, b["last"] + i, b["label"])
                           for b in brackets)
+        m_cats.update(cats)
         i += len(ids)
-    m_deps.append(m_cats)
-    return [(m_deps, m_brackets)]
+    return [(m_deps, (m_cats, m_brackets))]


-def align(tokens_a, tokens_b):
+_ALIGNMENT_NORM_MAP = [("``", "'"), ("''", "'"), ('"', "'"), ("`", "'")]
+
+
+def _normalize_for_alignment(tokens):
+    tokens = [w.replace(" ", "").lower() for w in tokens]
+    output = []
+    for token in tokens:
+        token = token.replace(" ", "").lower()
+        for before, after in _ALIGNMENT_NORM_MAP:
+            token = token.replace(before, after)
+        output.append(token)
+    return output
+
+
+def _align_before_v2_2_2(tokens_a, tokens_b):
     """Calculate alignment tables between two tokenizations, using the Levenshtein
     algorithm. The alignment is case-insensitive.
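
The normalization step added in the hunk above can be tried on its own. This is a standalone sketch of the same logic with a public function name chosen for the example; the token list is invented.

```python
_ALIGNMENT_NORM_MAP = [("``", "'"), ("''", "'"), ('"', "'"), ("`", "'")]


def normalize_for_alignment(tokens):
    """Strip spaces, lowercase, and map all quote variants to a single quote."""
    output = []
    for token in tokens:
        token = token.replace(" ", "").lower()
        for before, after in _ALIGNMENT_NORM_MAP:
            token = token.replace(before, after)
        output.append(token)
    return output


# Example tokens (assumed for this sketch)
normalized = normalize_for_alignment(["``", "Hello", "New York", "''", '"'])
print(normalized)
```

After this step, PTB-style quotes and plain quotes compare equal, so they no longer count as a text mismatch during alignment.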
@@ -92,6 +106,7 @@ def align(tokens_a, tokens_b):
     * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
         direction.
     """
+    from . import _align
     if tokens_a == tokens_b:
         alignment = numpy.arange(len(tokens_a))
         return 0, alignment, alignment, {}, {}
@@ -111,6 +126,82 @@ def align(tokens_a, tokens_b):
     return cost, i2j, j2i, i2j_multi, j2i_multi


+def align(tokens_a, tokens_b):
+    """Calculate alignment tables between two tokenizations.
+
+    tokens_a (List[str]): The candidate tokenization.
+    tokens_b (List[str]): The reference tokenization.
+    RETURNS: (tuple): A 5-tuple consisting of the following information:
+      * cost (int): The number of misaligned tokens.
+      * a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`.
+        For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns
+        to `tokens_b[6]`. If there's no one-to-one alignment for a token,
+        it has the value -1.
+      * b2a (List[int]): The same as `a2b`, but mapping the other direction.
+      * a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a`
+        to indices in `tokens_b`, where multiple tokens of `tokens_a` align to
+        the same token of `tokens_b`.
+      * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
+        direction.
+    """
+    if not USE_NEW_ALIGN:
+        return _align_before_v2_2_2(tokens_a, tokens_b)
+    tokens_a = _normalize_for_alignment(tokens_a)
+    tokens_b = _normalize_for_alignment(tokens_b)
+    cost = 0
+    a2b = numpy.empty(len(tokens_a), dtype="i")
+    b2a = numpy.empty(len(tokens_b), dtype="i")
+    a2b_multi = {}
+    b2a_multi = {}
+    i = 0
+    j = 0
+    offset_a = 0
+    offset_b = 0
+    while i < len(tokens_a) and j < len(tokens_b):
+        a = tokens_a[i][offset_a:]
+        b = tokens_b[j][offset_b:]
+        a2b[i] = b2a[j] = -1
+        if a == b:
+            if offset_a == offset_b == 0:
+                a2b[i] = j
+                b2a[j] = i
+            elif offset_a == 0:
+                cost += 2
+                a2b_multi[i] = j
+            elif offset_b == 0:
+                cost += 2
+                b2a_multi[j] = i
+            offset_a = offset_b = 0
+            i += 1
+            j += 1
+        elif a == "":
+            assert offset_a == 0
+            cost += 1
+            i += 1
+        elif b == "":
+            assert offset_b == 0
+            cost += 1
+            j += 1
+        elif b.startswith(a):
+            cost += 1
+            if offset_a == 0:
+                a2b_multi[i] = j
+            i += 1
+            offset_a = 0
+            offset_b += len(a)
+        elif a.startswith(b):
+            cost += 1
+            if offset_b == 0:
+                b2a_multi[j] = i
+            j += 1
+            offset_b = 0
+            offset_a += len(b)
+        else:
+            assert "".join(tokens_a) != "".join(tokens_b)
+            raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b))
+    return cost, a2b, b2a, a2b_multi, b2a_multi
+
+
 class GoldCorpus(object):
     """An annotated corpus, using the JSON file format. Manages
     annotations for tagging, dependency parsing and NER.
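
To see what the new character-offset alignment in the hunk above produces, here is a pure-Python sketch of the same loop. It is an adaptation, not the committed code: plain lists pre-filled with -1 replace the `numpy.empty` arrays, the normalization step and the internal assertions are omitted, and a generic `ValueError` stands in for `AlignmentError`. The function name `align_sketch` and the example sentence are assumptions for this illustration.

```python
def align_sketch(tokens_a, tokens_b):
    cost = 0
    a2b = [-1] * len(tokens_a)
    b2a = [-1] * len(tokens_b)
    a2b_multi = {}
    b2a_multi = {}
    i = j = 0
    offset_a = offset_b = 0  # characters already consumed in the current tokens
    while i < len(tokens_a) and j < len(tokens_b):
        a = tokens_a[i][offset_a:]
        b = tokens_b[j][offset_b:]
        if a == b:
            if offset_a == offset_b == 0:
                a2b[i] = j  # clean one-to-one match
                b2a[j] = i
            elif offset_a == 0:
                cost += 2
                a2b_multi[i] = j  # last piece of a many-to-one match
            elif offset_b == 0:
                cost += 2
                b2a_multi[j] = i
            offset_a = offset_b = 0
            i += 1
            j += 1
        elif a == "":
            cost += 1
            i += 1
        elif b == "":
            cost += 1
            j += 1
        elif b.startswith(a):
            cost += 1
            if offset_a == 0:
                a2b_multi[i] = j
            i += 1
            offset_a = 0
            offset_b += len(a)
        elif a.startswith(b):
            cost += 1
            if offset_b == 0:
                b2a_multi[j] = i
            j += 1
            offset_b = 0
            offset_a += len(b)
        else:
            raise ValueError("The two tokenizations are different texts")
    return cost, a2b, b2a, a2b_multi, b2a_multi


tokens_a = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
tokens_b = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
cost, a2b, b2a, a2b_multi, b2a_multi = align_sketch(tokens_a, tokens_b)
print(cost, a2b, b2a, a2b_multi, b2a_multi)
```

Here the two pieces `'` and `s` in `tokens_a` both map to the single token `'s` in `tokens_b`, so they appear in `a2b_multi` and get -1 in `a2b`. Unlike the Levenshtein-based version, this loop cannot skip over differing characters, which is why mismatched texts raise an error instead of yielding a high cost.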
@@ -176,6 +267,11 @@ class GoldCorpus(object):
                 gold_tuples = read_json_file(loc)
             elif loc.parts[-1].endswith("jsonl"):
                 gold_tuples = srsly.read_jsonl(loc)
+                first_gold_tuple = next(gold_tuples)
+                gold_tuples = itertools.chain([first_gold_tuple], gold_tuples)
+                # TODO: proper format checks with schemas
+                if isinstance(first_gold_tuple, dict):
+                    gold_tuples = read_json_object(gold_tuples)
             elif loc.parts[-1].endswith("msg"):
                 gold_tuples = srsly.read_msgpack(loc)
             else:
@@ -201,7 +297,6 @@ class GoldCorpus(object):
         n = 0
         i = 0
         for raw_text, paragraph_tuples in self.train_tuples:
-            cats = paragraph_tuples.pop()
             for sent_tuples, brackets in paragraph_tuples:
                 n += len(sent_tuples[1])
                 if self.limit and i >= self.limit:
@@ -210,7 +305,8 @@ class GoldCorpus(object):
         return n

     def train_docs(self, nlp, gold_preproc=False, max_length=None,
-                   noise_level=0.0, orth_variant_level=0.0):
+                   noise_level=0.0, orth_variant_level=0.0,
+                   ignore_misaligned=False):
         locs = list((self.tmp_dir / 'train').iterdir())
         random.shuffle(locs)
         train_tuples = self.read_tuples(locs, limit=self.limit)
@@ -218,20 +314,23 @@ class GoldCorpus(object):
                                         max_length=max_length,
                                         noise_level=noise_level,
                                         orth_variant_level=orth_variant_level,
-                                        make_projective=True)
+                                        make_projective=True,
+                                        ignore_misaligned=ignore_misaligned)
         yield from gold_docs

     def train_docs_without_preprocessing(self, nlp, gold_preproc=False):
         gold_docs = self.iter_gold_docs(nlp, self.train_tuples, gold_preproc=gold_preproc)
         yield from gold_docs

-    def dev_docs(self, nlp, gold_preproc=False):
-        gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc)
+    def dev_docs(self, nlp, gold_preproc=False, ignore_misaligned=False):
+        gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc,
+                                        ignore_misaligned=ignore_misaligned)
         yield from gold_docs

     @classmethod
     def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None,
-                       noise_level=0.0, orth_variant_level=0.0, make_projective=False):
+                       noise_level=0.0, orth_variant_level=0.0, make_projective=False,
+                       ignore_misaligned=False):
         for raw_text, paragraph_tuples in tuples:
             if gold_preproc:
                 raw_text = None
@@ -240,10 +339,12 @@ class GoldCorpus(object):
             docs, paragraph_tuples = cls._make_docs(nlp, raw_text,
                     paragraph_tuples, gold_preproc, noise_level=noise_level,
                     orth_variant_level=orth_variant_level)
-            golds = cls._make_golds(docs, paragraph_tuples, make_projective)
+            golds = cls._make_golds(docs, paragraph_tuples, make_projective,
+                                    ignore_misaligned=ignore_misaligned)
             for doc, gold in zip(docs, golds):
-                if (not max_length) or len(doc) < max_length:
-                    yield doc, gold
+                if gold is not None:
+                    if (not max_length) or len(doc) < max_length:
+                        yield doc, gold

     @classmethod
     def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0, orth_variant_level=0.0):
@@ -259,14 +360,22 @@ class GoldCorpus(object):

     @classmethod
-    def _make_golds(cls, docs, paragraph_tuples, make_projective):
+    def _make_golds(cls, docs, paragraph_tuples, make_projective, ignore_misaligned=False):
         if len(docs) != len(paragraph_tuples):
             n_annots = len(paragraph_tuples)
             raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots))
-        return [GoldParse.from_annot_tuples(doc, sent_tuples,
-                                            make_projective=make_projective)
-                for doc, (sent_tuples, brackets)
-                in zip(docs, paragraph_tuples)]
+        golds = []
+        for doc, (sent_tuples, (cats, brackets)) in zip(docs, paragraph_tuples):
+            try:
+                gold = GoldParse.from_annot_tuples(doc, sent_tuples, cats=cats,
+                                                   make_projective=make_projective)
+            except AlignmentError:
+                if ignore_misaligned:
+                    gold = None
+                else:
+                    raise
+            golds.append(gold)
+        return golds


 def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
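
The `ignore_misaligned` handling in the hunk above follows a common pattern: record `None` for items that fail, then filter them out where the results are consumed. The sketch below shows that pattern in isolation. `make_gold` is a hypothetical stand-in for `GoldParse.from_annot_tuples`, and the local `AlignmentError` class only mirrors the one added to the errors module.

```python
class AlignmentError(ValueError):
    pass


def make_gold(doc_words, gold_words):
    # Hypothetical stand-in: fails when the two tokenizations spell different texts
    if "".join(doc_words) != "".join(gold_words):
        raise AlignmentError("texts differ")
    return {"words": gold_words}


def make_golds(docs, annots, ignore_misaligned=False):
    golds = []
    for doc, annot in zip(docs, annots):
        try:
            gold = make_gold(doc, annot)
        except AlignmentError:
            if ignore_misaligned:
                gold = None  # keep the position so docs and golds stay parallel
            else:
                raise
        golds.append(gold)
    return golds


docs = [["New", "York"], ["hello"]]
annots = [["NewYork"], ["goodbye"]]
golds = make_golds(docs, annots, ignore_misaligned=True)
# Mirrors the `if gold is not None` filter in iter_gold_docs
kept = [(doc, gold) for doc, gold in zip(docs, golds) if gold is not None]
print(golds, kept)
```

Appending `None` rather than skipping the item keeps `golds` the same length as `docs`, which the later `zip(docs, golds)` relies on.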
@@ -281,7 +390,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
     # modify words in paragraph_tuples
     variant_paragraph_tuples = []
     for sent_tuples, brackets in paragraph_tuples:
-        ids, words, tags, heads, labels, ner, cats = sent_tuples
+        ids, words, tags, heads, labels, ner = sent_tuples
         if lower:
             words = [w.lower() for w in words]
         # single variants
@@ -310,7 +419,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
                             pair_idx = pair.index(words[word_idx])
                     words[word_idx] = punct_choices[punct_idx][pair_idx]

-        variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner, cats), brackets))
+        variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner), brackets))
     # modify raw to match variant_paragraph_tuples
     if raw is not None:
         variants = []
@@ -329,7 +438,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
             variant_raw += raw[raw_idx]
             raw_idx += 1
         for sent_tuples, brackets in variant_paragraph_tuples:
-            ids, words, tags, heads, labels, ner, cats = sent_tuples
+            ids, words, tags, heads, labels, ner = sent_tuples
             for word in words:
                 match_found = False
                 # add identical word
@@ -400,6 +509,9 @@ def json_to_tuple(doc):
     paragraphs = []
     for paragraph in doc["paragraphs"]:
         sents = []
+        cats = {}
+        for cat in paragraph.get("cats", {}):
+            cats[cat["label"]] = cat["value"]
         for sent in paragraph["sentences"]:
             words = []
             ids = []
@@ -419,11 +531,7 @@ def json_to_tuple(doc):
                 ner.append(token.get("ner", "-"))
             sents.append([
                 [ids, words, tags, heads, labels, ner],
-                sent.get("brackets", [])])
-        cats = {}
-        for cat in paragraph.get("cats", {}):
-            cats[cat["label"]] = cat["value"]
-        sents.append(cats)
+                [cats, sent.get("brackets", [])]])
         if sents:
             yield [paragraph.get("raw", None), sents]
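
The change above moves the paragraph's categories out of a trailing list element and into each sentence entry, next to the brackets. The reduced sketch below shows the resulting shape. `paragraph_to_tuple` is a hypothetical helper that keeps only `ids` and `words` from the six annotation lists, and the sample paragraph is invented.

```python
def paragraph_to_tuple(paragraph):
    sents = []
    # Categories are collected once per paragraph
    cats = {}
    for cat in paragraph.get("cats", []):
        cats[cat["label"]] = cat["value"]
    for sent in paragraph["sentences"]:
        ids = [token["id"] for token in sent["tokens"]]
        words = [token["orth"] for token in sent["tokens"]]
        # New layout: [annotations, [cats, brackets]] for every sentence
        sents.append([[ids, words], [cats, sent.get("brackets", [])]])
    return [paragraph.get("raw", None), sents]


paragraph = {
    "raw": "I like London.",
    "cats": [{"label": "POSITIVE", "value": 1.0}],
    "sentences": [
        {
            "tokens": [
                {"id": 0, "orth": "I"},
                {"id": 1, "orth": "like"},
                {"id": 2, "orth": "London"},
                {"id": 3, "orth": "."},
            ],
            "brackets": [],
        }
    ],
}
raw, sents = paragraph_to_tuple(paragraph)
# Consumers can now unpack both parts directly, as merge_sents does
for (ids, words), (cats, brackets) in sents:
    print(words, cats, brackets)
```

Because every entry in `sents` now has the same two-part shape, callers no longer need to `pop()` a trailing categories element before iterating.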
@@ -537,8 +645,8 @@ cdef class GoldParse:
     DOCS: https://spacy.io/api/goldparse
     """
     @classmethod
-    def from_annot_tuples(cls, doc, annot_tuples, make_projective=False):
-        _, words, tags, heads, deps, entities, cats = annot_tuples
+    def from_annot_tuples(cls, doc, annot_tuples, cats=None, make_projective=False):
+        _, words, tags, heads, deps, entities = annot_tuples
         return cls(doc, words=words, tags=tags, heads=heads, deps=deps,
                    entities=entities, cats=cats,
                    make_projective=make_projective)
@@ -595,9 +703,9 @@ cdef class GoldParse:
         if morphology is None:
             morphology = [None for _ in words]
         if entities is None:
-            entities = ["-" for _ in doc]
+            entities = ["-" for _ in words]
         elif len(entities) == 0:
-            entities = ["O" for _ in doc]
+            entities = ["O" for _ in words]
         else:
             # Translate the None values to '-', to make processing easier.
             # See Issue #2603
@@ -660,7 +768,9 @@ cdef class GoldParse:
                     self.heads[i] = i+1
                     self.labels[i] = "subtok"
                 else:
-                    self.heads[i] = self.gold_to_cand[heads[i2j_multi[i]]]
+                    head_i = heads[i2j_multi[i]]
+                    if head_i:
+                        self.heads[i] = self.gold_to_cand[head_i]
                     self.labels[i] = deps[i2j_multi[i]]
         # Now set NER...This is annoying because if we've split
         # got an entity word split into two, we need to adjust the

@@ -1,8 +1,8 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...symbols import POS, PUNCT, ADJ, CONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB
-from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
+from ...symbols import POS, PUNCT, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X
+from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, VERB


 TAG_MAP = {
@@ -20,8 +20,8 @@ TAG_MAP = {
     "CARD": {POS: NUM, "NumType": "card"},
     "FM": {POS: X, "Foreign": "yes"},
     "ITJ": {POS: INTJ},
-    "KOKOM": {POS: CONJ, "ConjType": "comp"},
-    "KON": {POS: CONJ},
+    "KOKOM": {POS: CCONJ, "ConjType": "comp"},
+    "KON": {POS: CCONJ},
     "KOUI": {POS: SCONJ},
     "KOUS": {POS: SCONJ},
     "NE": {POS: PROPN},
@@ -43,7 +43,7 @@ TAG_MAP = {
     "PTKA": {POS: PART},
     "PTKANT": {POS: PART, "PartType": "res"},
     "PTKNEG": {POS: PART, "Polarity": "neg"},
-    "PTKVZ": {POS: PART, "PartType": "vbp"},
+    "PTKVZ": {POS: ADP, "PartType": "vbp"},
     "PTKZU": {POS: PART, "PartType": "inf"},
     "PWAT": {POS: DET, "PronType": "int"},
     "PWAV": {POS: ADV, "PronType": "int"},

@@ -2,7 +2,7 @@
 from __future__ import unicode_literals

 from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
-from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
+from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON


 TAG_MAP = {
@@ -28,8 +28,8 @@ TAG_MAP = {
     "JJR": {POS: ADJ, "Degree": "comp"},
     "JJS": {POS: ADJ, "Degree": "sup"},
     "LS": {POS: X, "NumType": "ord"},
-    "MD": {POS: AUX, "VerbType": "mod"},
-    "NIL": {POS: ""},
+    "MD": {POS: VERB, "VerbType": "mod"},
+    "NIL": {POS: X},
     "NN": {POS: NOUN, "Number": "sing"},
     "NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
     "NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
@@ -37,7 +37,7 @@ TAG_MAP = {
     "PDT": {POS: DET},
     "POS": {POS: PART, "Poss": "yes"},
     "PRP": {POS: PRON, "PronType": "prs"},
-    "PRP$": {POS: PRON, "PronType": "prs", "Poss": "yes"},
+    "PRP$": {POS: DET, "PronType": "prs", "Poss": "yes"},
     "RB": {POS: ADV, "Degree": "pos"},
     "RBR": {POS: ADV, "Degree": "comp"},
     "RBS": {POS: ADV, "Degree": "sup"},
@@ -58,9 +58,9 @@ TAG_MAP = {
         "Number": "sing",
         "Person": "three",
     },
-    "WDT": {POS: PRON},
+    "WDT": {POS: DET},
     "WP": {POS: PRON},
-    "WP$": {POS: PRON, "Poss": "yes"},
+    "WP$": {POS: DET, "Poss": "yes"},
     "WRB": {POS: ADV},
     "ADD": {POS: X},
     "NFP": {POS: PUNCT},


@ -18,13 +18,8 @@ from .tokenizer import Tokenizer
from .vocab import Vocab from .vocab import Vocab
from .lemmatizer import Lemmatizer from .lemmatizer import Lemmatizer
from .lookups import Lookups from .lookups import Lookups
from .pipeline import DependencyParser, Tagger from .analysis import analyze_pipes, analyze_all_pipes, validate_attrs
from .pipeline import Tensorizer, EntityRecognizer, EntityLinker from .compat import izip, basestring_, is_python2, class_types
from .pipeline import SimilarityHook, TextCategorizer, Sentencizer
from .pipeline import merge_noun_chunks, merge_entities, merge_subtokens
from .pipeline import EntityRuler
from .pipeline import Morphologizer
from .compat import izip, basestring_, is_python2
from .gold import GoldParse from .gold import GoldParse
from .scorer import Scorer from .scorer import Scorer
from ._ml import link_vectors_to_models, create_default_optimizer from ._ml import link_vectors_to_models, create_default_optimizer
@ -40,6 +35,9 @@ from . import util
from . import about from . import about
ENABLE_PIPELINE_ANALYSIS = False
class BaseDefaults(object): class BaseDefaults(object):
@classmethod @classmethod
def create_lemmatizer(cls, nlp=None, lookups=None): def create_lemmatizer(cls, nlp=None, lookups=None):
@ -133,22 +131,7 @@ class Language(object):
Defaults = BaseDefaults Defaults = BaseDefaults
lang = None lang = None
factories = { factories = {"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp)}
"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp),
"tensorizer": lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg),
"tagger": lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
"morphologizer": lambda nlp, **cfg: Morphologizer(nlp.vocab, **cfg),
"parser": lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
"ner": lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
"entity_linker": lambda nlp, **cfg: EntityLinker(nlp.vocab, **cfg),
"similarity": lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
"textcat": lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg),
"sentencizer": lambda nlp, **cfg: Sentencizer(**cfg),
"merge_noun_chunks": lambda nlp, **cfg: merge_noun_chunks,
"merge_entities": lambda nlp, **cfg: merge_entities,
"merge_subtokens": lambda nlp, **cfg: merge_subtokens,
"entity_ruler": lambda nlp, **cfg: EntityRuler(nlp, **cfg),
}
def __init__( def __init__(
self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs
@ -218,6 +201,7 @@ class Language(object):
"name": self.vocab.vectors.name, "name": self.vocab.vectors.name,
} }
self._meta["pipeline"] = self.pipe_names self._meta["pipeline"] = self.pipe_names
self._meta["factories"] = self.pipe_factories
self._meta["labels"] = self.pipe_labels self._meta["labels"] = self.pipe_labels
return self._meta return self._meta
@ -259,6 +243,17 @@ class Language(object):
""" """
return [pipe_name for pipe_name, _ in self.pipeline] return [pipe_name for pipe_name, _ in self.pipeline]
@property
def pipe_factories(self):
"""Get the component factories for the available pipeline components.
RETURNS (dict): Factory names, keyed by component names.
"""
factories = {}
for pipe_name, pipe in self.pipeline:
factories[pipe_name] = getattr(pipe, "factory", pipe_name)
return factories
@property @property
def pipe_labels(self): def pipe_labels(self):
"""Get the labels set by the pipeline components, if available (if """Get the labels set by the pipeline components, if available (if
@ -327,33 +322,30 @@ class Language(object):
msg += Errors.E004.format(component=component) msg += Errors.E004.format(component=component)
raise ValueError(msg) raise ValueError(msg)
if name is None: if name is None:
if hasattr(component, "name"): name = util.get_component_name(component)
name = component.name
elif hasattr(component, "__name__"):
name = component.__name__
elif hasattr(component, "__class__") and hasattr(
component.__class__, "__name__"
):
name = component.__class__.__name__
else:
name = repr(component)
if name in self.pipe_names: if name in self.pipe_names:
raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names)) raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names))
if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2: if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
raise ValueError(Errors.E006) raise ValueError(Errors.E006)
pipe_index = 0
pipe = (name, component) pipe = (name, component)
if last or not any([first, before, after]): if last or not any([first, before, after]):
pipe_index = len(self.pipeline)
self.pipeline.append(pipe) self.pipeline.append(pipe)
elif first: elif first:
self.pipeline.insert(0, pipe) self.pipeline.insert(0, pipe)
elif before and before in self.pipe_names: elif before and before in self.pipe_names:
pipe_index = self.pipe_names.index(before)
self.pipeline.insert(self.pipe_names.index(before), pipe) self.pipeline.insert(self.pipe_names.index(before), pipe)
elif after and after in self.pipe_names: elif after and after in self.pipe_names:
pipe_index = self.pipe_names.index(after) + 1
self.pipeline.insert(self.pipe_names.index(after) + 1, pipe) self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
else: else:
raise ValueError( raise ValueError(
Errors.E001.format(name=before or after, opts=self.pipe_names) Errors.E001.format(name=before or after, opts=self.pipe_names)
) )
if ENABLE_PIPELINE_ANALYSIS:
analyze_pipes(self.pipeline, name, component, pipe_index)
def has_pipe(self, name): def has_pipe(self, name):
"""Check if a component name is present in the pipeline. Equivalent to """Check if a component name is present in the pipeline. Equivalent to
@ -382,6 +374,8 @@ class Language(object):
msg += Errors.E135.format(name=name) msg += Errors.E135.format(name=name)
raise ValueError(msg) raise ValueError(msg)
self.pipeline[self.pipe_names.index(name)] = (name, component) self.pipeline[self.pipe_names.index(name)] = (name, component)
if ENABLE_PIPELINE_ANALYSIS:
analyze_all_pipes(self.pipeline)
def rename_pipe(self, old_name, new_name): def rename_pipe(self, old_name, new_name):
"""Rename a pipeline component. """Rename a pipeline component.
@ -408,7 +402,10 @@ class Language(object):
""" """
if name not in self.pipe_names: if name not in self.pipe_names:
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names)) raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
return self.pipeline.pop(self.pipe_names.index(name)) removed = self.pipeline.pop(self.pipe_names.index(name))
if ENABLE_PIPELINE_ANALYSIS:
analyze_all_pipes(self.pipeline)
return removed
def __call__(self, text, disable=[], component_cfg=None): def __call__(self, text, disable=[], component_cfg=None):
"""Apply the pipeline to some text. The text can span multiple sentences, """Apply the pipeline to some text. The text can span multiple sentences,
@ -448,6 +445,8 @@ class Language(object):
DOCS: https://spacy.io/api/language#disable_pipes DOCS: https://spacy.io/api/language#disable_pipes
""" """
if len(names) == 1 and isinstance(names[0], (list, tuple)):
names = names[0] # support list of names instead of spread
return DisabledPipes(self, *names) return DisabledPipes(self, *names)
def make_doc(self, text): def make_doc(self, text):
@ -999,6 +998,52 @@ class Language(object):
return self return self
class component(object):
"""Decorator for pipeline components. Can decorate both function components
and class components and will automatically register components in the
Language.factories. If the component is a class and needs access to the
nlp object or config parameters, it can expose a from_nlp classmethod
that takes the nlp object and **cfg arguments and returns the initialized
component.
"""
# NB: This decorator needs to live here, because it needs to write to
# Language.factories. All other solutions would cause circular import.
def __init__(self, name=None, assigns=tuple(), requires=tuple(), retokenizes=False):
"""Decorate a pipeline component.
name (unicode): Default component and factory name.
assigns (list): Attributes assigned by component, e.g. `["token.pos"]`.
requires (list): Attributes required by component, e.g. `["token.dep"]`.
retokenizes (bool): Whether the component changes the tokenization.
"""
self.name = name
self.assigns = validate_attrs(assigns)
self.requires = validate_attrs(requires)
self.retokenizes = retokenizes
def __call__(self, *args, **kwargs):
obj = args[0]
args = args[1:]
factory_name = self.name or util.get_component_name(obj)
obj.name = factory_name
obj.factory = factory_name
obj.assigns = self.assigns
obj.requires = self.requires
obj.retokenizes = self.retokenizes
def factory(nlp, **cfg):
if hasattr(obj, "from_nlp"):
return obj.from_nlp(nlp, **cfg)
elif isinstance(obj, class_types):
return obj()
return obj
Language.factories[obj.factory] = factory
return obj
def _fix_pretrained_vectors_name(nlp): def _fix_pretrained_vectors_name(nlp):
# TODO: Replace this once we handle vectors consistently as static # TODO: Replace this once we handle vectors consistently as static
# data # data
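The `component` decorator introduced in this file stores metadata on the decorated object and registers a factory for it. The following is a minimal sketch of that registration pattern only, not the spaCy implementation: a plain dict (`FACTORIES`) stands in for `Language.factories`, attribute validation is skipped, and the names `register_component`, `upper` and `Counter` are hypothetical.

```python
FACTORIES = {}  # hypothetical stand-in for Language.factories


class register_component(object):
    """Simplified decorator: attach metadata and register a factory."""

    def __init__(self, name=None, assigns=(), requires=(), retokenizes=False):
        self.name = name
        self.assigns = list(assigns)
        self.requires = list(requires)
        self.retokenizes = retokenizes

    def __call__(self, obj):
        # Fall back to the function or class name if no name was given.
        factory_name = self.name or getattr(obj, "__name__", repr(obj))
        obj.name = factory_name
        obj.factory = factory_name
        obj.assigns = self.assigns
        obj.requires = self.requires
        obj.retokenizes = self.retokenizes

        def factory(nlp, **cfg):
            if hasattr(obj, "from_nlp"):
                # Classes that need nlp or config expose from_nlp.
                return obj.from_nlp(nlp, **cfg)
            elif isinstance(obj, type):
                # Plain classes are instantiated without arguments.
                return obj()
            # Function components are returned as they are.
            return obj

        FACTORIES[factory_name] = factory
        return obj


@register_component("upper")
def upper(doc):
    return doc.upper()


@register_component("counter")
class Counter(object):
    def __init__(self, start=0):
        self.count = start

    @classmethod
    def from_nlp(cls, nlp, **cfg):
        return cls(**cfg)

    def __call__(self, doc):
        self.count += 1
        return doc


function_component = FACTORIES["upper"](nlp=None)
class_component = FACTORIES["counter"](nlp=None, start=5)
```

In this sketch the function component comes back unchanged from its factory, while the class component is built through `from_nlp`, which is the hook the docstring above describes for components that need the `nlp` object or config values.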


@ -102,7 +102,10 @@ cdef class DependencyMatcher:
visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True
idx = idx + 1 idx = idx + 1
def add(self, key, on_match, *patterns): def add(self, key, patterns, *_patterns, on_match=None):
if patterns is None or hasattr(patterns, "__call__"): # old API
on_match = patterns
patterns = _patterns
for pattern in patterns: for pattern in patterns:
if len(pattern) == 0: if len(pattern) == 0:
raise ValueError(Errors.E012.format(key=key)) raise ValueError(Errors.E012.format(key=key))


@ -74,7 +74,7 @@ cdef class Matcher:
""" """
return self._normalize_key(key) in self._patterns return self._normalize_key(key) in self._patterns
def add(self, key, on_match, *patterns): def add(self, key, patterns, *_patterns, on_match=None):
"""Add a match-rule to the matcher. A match-rule consists of: an ID """Add a match-rule to the matcher. A match-rule consists of: an ID
key, an on_match callback, and one or more patterns. key, an on_match callback, and one or more patterns.
@ -98,16 +98,29 @@ cdef class Matcher:
operator will behave non-greedily. This quirk in the semantics makes operator will behave non-greedily. This quirk in the semantics makes
the matcher more efficient, by avoiding the need for back-tracking. the matcher more efficient, by avoiding the need for back-tracking.
As of spaCy v2.2.2, Matcher.add supports the future API, which makes
the patterns the second argument and a list (instead of a variable
number of arguments). The on_match callback becomes an optional keyword
argument.
key (unicode): The match ID. key (unicode): The match ID.
on_match (callable): Callback executed on match. patterns (list): The patterns to add for the given key.
*patterns (list): List of token descriptions. on_match (callable): Optional callback executed on match.
*_patterns (list): For backwards compatibility: list of patterns to add
as variable arguments. Will be ignored if a list of patterns is
provided as the second argument.
""" """
errors = {} errors = {}
if on_match is not None and not hasattr(on_match, "__call__"): if on_match is not None and not hasattr(on_match, "__call__"):
raise ValueError(Errors.E171.format(arg_type=type(on_match))) raise ValueError(Errors.E171.format(arg_type=type(on_match)))
if patterns is None or hasattr(patterns, "__call__"): # old API
on_match = patterns
patterns = _patterns
for i, pattern in enumerate(patterns): for i, pattern in enumerate(patterns):
if len(pattern) == 0: if len(pattern) == 0:
raise ValueError(Errors.E012.format(key=key)) raise ValueError(Errors.E012.format(key=key))
if not isinstance(pattern, list):
raise ValueError(Errors.E178.format(pat=pattern, key=key))
if self.validator: if self.validator:
errors[i] = validate_json(pattern, self.validator) errors[i] = validate_json(pattern, self.validator)
if any(err for err in errors.values()): if any(err for err in errors.values()):
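The argument handling in the new `Matcher.add` signature can be tried out in isolation. The sketch below reproduces only the old-versus-new API detection and the two pattern checks from the hunk above as a standalone Python 3 function; `normalize_add_args` is a hypothetical name and the error messages are placeholders, not spaCy's `Errors.E012` or `Errors.E178` texts.

```python
def normalize_add_args(key, patterns, *_patterns, on_match=None):
    """Return (patterns, on_match) for both the old and the new call style."""
    if on_match is not None and not callable(on_match):
        raise ValueError("on_match must be callable or None")
    if patterns is None or callable(patterns):
        # Old API: add(key, on_match, *patterns)
        on_match = patterns
        patterns = _patterns
    for pattern in patterns:
        if len(pattern) == 0:
            raise ValueError("empty pattern for key %r" % key)
        if not isinstance(pattern, list):
            # Typical cause: a single pattern passed without the outer list.
            raise ValueError("pattern for key %r is not a list" % key)
    return list(patterns), on_match


pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
old_style = normalize_add_args("HELLO", None, pattern)
new_style = normalize_add_args("HELLO", [pattern])
```

Both calls yield the same list of patterns and no callback. Passing `pattern` directly as the second argument in the new style fails the list check, because iterating over it yields the token dicts rather than patterns.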


@ -152,16 +152,27 @@ cdef class PhraseMatcher:
del self._callbacks[key] del self._callbacks[key]
del self._docs[key] del self._docs[key]
def add(self, key, on_match, *docs): def add(self, key, docs, *_docs, on_match=None):
"""Add a match-rule to the phrase-matcher. A match-rule consists of: an ID """Add a match-rule to the phrase-matcher. A match-rule consists of: an ID
key, an on_match callback, and one or more patterns. key, an on_match callback, and one or more patterns.
As of spaCy v2.2.2, PhraseMatcher.add supports the future API, which
makes the patterns the second argument and a list (instead of a variable
number of arguments). The on_match callback becomes an optional keyword
argument.
key (unicode): The match ID. key (unicode): The match ID.
docs (list): List of `Doc` objects representing match patterns.
on_match (callable): Callback executed on match. on_match (callable): Callback executed on match.
*docs (Doc): `Doc` objects representing match patterns. *_docs (Doc): For backwards compatibility: list of patterns to add
as variable arguments. Will be ignored if a list of patterns is
provided as the second argument.
DOCS: https://spacy.io/api/phrasematcher#add DOCS: https://spacy.io/api/phrasematcher#add
""" """
if docs is None or hasattr(docs, "__call__"): # old API
on_match = docs
docs = _docs
_ = self.vocab[key] _ = self.vocab[key]
self._callbacks[key] = on_match self._callbacks[key] = on_match
@ -171,6 +182,8 @@ cdef class PhraseMatcher:
cdef MapStruct* internal_node cdef MapStruct* internal_node
cdef void* result cdef void* result
if isinstance(docs, Doc):
raise ValueError(Errors.E179.format(key=key))
for doc in docs: for doc in docs:
if len(doc) == 0: if len(doc) == 0:
continue continue

5
spacy/ml/__init__.py Normal file

@ -0,0 +1,5 @@
# coding: utf8
from __future__ import unicode_literals
from .tok2vec import Tok2Vec # noqa: F401
from .common import FeedForward, LayerNormalizedMaxout # noqa: F401

131
spacy/ml/_legacy_tok2vec.py Normal file
View File

@ -0,0 +1,131 @@
# coding: utf8
from __future__ import unicode_literals
from thinc.v2v import Model, Maxout
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow
from thinc.misc import Residual
from thinc.misc import LayerNorm as LN
from thinc.misc import FeatureExtracter
from thinc.api import layerize, chain, clone, concatenate, with_flatten
from thinc.api import uniqued, wrap, noop
from ..attrs import ID, ORTH, NORM, PREFIX, SUFFIX, SHAPE
def Tok2Vec(width, embed_size, **kwargs):
# Circular imports :(
from .._ml import CharacterEmbed
from .._ml import PyTorchBiLSTM
pretrained_vectors = kwargs.get("pretrained_vectors", None)
cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
subword_features = kwargs.get("subword_features", True)
char_embed = kwargs.get("char_embed", False)
if char_embed:
subword_features = False
conv_depth = kwargs.get("conv_depth", 4)
bilstm_depth = kwargs.get("bilstm_depth", 0)
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
with Model.define_operators({">>": chain, "|": concatenate, "**": clone}):
norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm")
if subword_features:
prefix = HashEmbed(
width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix"
)
suffix = HashEmbed(
width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix"
)
shape = HashEmbed(
width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape"
)
else:
prefix, suffix, shape = (None, None, None)
if pretrained_vectors is not None:
glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
if subword_features:
embed = uniqued(
(glove | norm | prefix | suffix | shape)
>> LN(Maxout(width, width * 5, pieces=3)),
column=cols.index(ORTH),
)
else:
embed = uniqued(
(glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
column=cols.index(ORTH),
)
elif subword_features:
embed = uniqued(
(norm | prefix | suffix | shape)
>> LN(Maxout(width, width * 4, pieces=3)),
column=cols.index(ORTH),
)
elif char_embed:
embed = concatenate_lists(
CharacterEmbed(nM=64, nC=8),
FeatureExtracter(cols) >> with_flatten(norm),
)
reduce_dimensions = LN(
Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces)
)
else:
embed = norm
convolution = Residual(
ExtractWindow(nW=1)
>> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces))
)
if char_embed:
tok2vec = embed >> with_flatten(
reduce_dimensions >> convolution ** conv_depth, pad=conv_depth
)
else:
tok2vec = FeatureExtracter(cols) >> with_flatten(
embed >> convolution ** conv_depth, pad=conv_depth
)
if bilstm_depth >= 1:
tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth)
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
tok2vec.nO = width
tok2vec.embed = embed
return tok2vec
@layerize
def flatten(seqs, drop=0.0):
ops = Model.ops
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
def finish_update(d_X, sgd=None):
return ops.unflatten(d_X, lengths, pad=0)
X = ops.flatten(seqs, pad=0)
return X, finish_update
def concatenate_lists(*layers, **kwargs): # pragma: no cover
"""Compose two or more models `f`, `g`, etc, such that their outputs are
concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
"""
if not layers:
return noop()
drop_factor = kwargs.get("drop_factor", 1.0)
ops = layers[0].ops
layers = [chain(layer, flatten) for layer in layers]
concat = concatenate(*layers)
def concatenate_lists_fwd(Xs, drop=0.0):
if drop is not None:
drop *= drop_factor
lengths = ops.asarray([len(X) for X in Xs], dtype="i")
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
ys = ops.unflatten(flat_y, lengths)
def concatenate_lists_bwd(d_ys, sgd=None):
return bp_flat_y(ops.flatten(d_ys), sgd=sgd)
return ys, concatenate_lists_bwd
model = wrap(concatenate_lists_fwd, concat)
return model

42
spacy/ml/_wire.py Normal file
View File

@ -0,0 +1,42 @@
from __future__ import unicode_literals
from thinc.api import layerize, wrap, noop, chain, concatenate
from thinc.v2v import Model
def concatenate_lists(*layers, **kwargs): # pragma: no cover
"""Compose two or more models `f`, `g`, etc, such that their outputs are
concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
"""
if not layers:
return layerize(noop())
drop_factor = kwargs.get("drop_factor", 1.0)
ops = layers[0].ops
layers = [chain(layer, flatten) for layer in layers]
concat = concatenate(*layers)
def concatenate_lists_fwd(Xs, drop=0.0):
if drop is not None:
drop *= drop_factor
lengths = ops.asarray([len(X) for X in Xs], dtype="i")
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
ys = ops.unflatten(flat_y, lengths)
def concatenate_lists_bwd(d_ys, sgd=None):
return bp_flat_y(ops.flatten(d_ys), sgd=sgd)
return ys, concatenate_lists_bwd
model = wrap(concatenate_lists_fwd, concat)
return model
@layerize
def flatten(seqs, drop=0.0):
ops = Model.ops
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
def finish_update(d_X, sgd=None):
return ops.unflatten(d_X, lengths, pad=0)
X = ops.flatten(seqs, pad=0)
return X, finish_update

23
spacy/ml/common.py Normal file
View File

@ -0,0 +1,23 @@
from __future__ import unicode_literals
from thinc.api import chain
from thinc.v2v import Maxout
from thinc.misc import LayerNorm
from ..util import register_architecture, make_layer
@register_architecture("thinc.FeedForward.v1")
def FeedForward(config):
layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]]
model = chain(*layers)
model.cfg = config
return model
@register_architecture("spacy.LayerNormalizedMaxout.v1")
def LayerNormalizedMaxout(config):
width = config["width"]
pieces = config["pieces"]
layer = LayerNorm(Maxout(width, pieces=pieces))
layer.nO = width
return layer

176
spacy/ml/tok2vec.py Normal file
View File

@ -0,0 +1,176 @@
from __future__ import unicode_literals
from thinc.api import chain, layerize, clone, concatenate, with_flatten, uniqued
from thinc.api import noop, with_square_sequences
from thinc.v2v import Maxout, Model
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow
from thinc.misc import Residual, LayerNorm, FeatureExtracter
from ..util import make_layer, register_architecture
from ._wire import concatenate_lists
@register_architecture("spacy.Tok2Vec.v1")
def Tok2Vec(config):
doc2feats = make_layer(config["@doc2feats"])
embed = make_layer(config["@embed"])
encode = make_layer(config["@encode"])
field_size = getattr(encode, "receptive_field", 0)
tok2vec = chain(doc2feats, with_flatten(chain(embed, encode), pad=field_size))
tok2vec.cfg = config
tok2vec.nO = encode.nO
tok2vec.embed = embed
tok2vec.encode = encode
return tok2vec
@register_architecture("spacy.Doc2Feats.v1")
def Doc2Feats(config):
columns = config["columns"]
return FeatureExtracter(columns)
@register_architecture("spacy.MultiHashEmbed.v1")
def MultiHashEmbed(config):
# For backwards compatibility with models before the architecture registry,
# we have to be careful to get exactly the same model structure. One subtle
# trick is that when we define concatenation with the operator, the operator
# is actually binary associative. So when we write (a | b | c), we're actually
# getting concatenate(concatenate(a, b), c). That's why the implementation
# is a bit ugly here.
cols = config["columns"]
width = config["width"]
rows = config["rows"]
norm = HashEmbed(width, rows, column=cols.index("NORM"), name="embed_norm")
if config["use_subwords"]:
prefix = HashEmbed(
width, rows // 2, column=cols.index("PREFIX"), name="embed_prefix"
)
suffix = HashEmbed(
width, rows // 2, column=cols.index("SUFFIX"), name="embed_suffix"
)
shape = HashEmbed(
width, rows // 2, column=cols.index("SHAPE"), name="embed_shape"
)
if config.get("@pretrained_vectors"):
glove = make_layer(config["@pretrained_vectors"])
mix = make_layer(config["@mix"])
with Model.define_operators({">>": chain, "|": concatenate}):
if config["use_subwords"] and config["@pretrained_vectors"]:
mix._layers[0].nI = width * 5
layer = uniqued(
(glove | norm | prefix | suffix | shape) >> mix,
column=cols.index("ORTH"),
)
elif config["use_subwords"]:
mix._layers[0].nI = width * 4
layer = uniqued(
(norm | prefix | suffix | shape) >> mix, column=cols.index("ORTH")
)
elif config["@pretrained_vectors"]:
mix._layers[0].nI = width * 2
layer = uniqued((glove | norm) >> mix, column=cols.index("ORTH"),)
else:
layer = norm
layer.cfg = config
return layer
@register_architecture("spacy.CharacterEmbed.v1")
def CharacterEmbed(config):
from .. import _ml
width = config["width"]
chars = config["chars"]
chr_embed = _ml.CharacterEmbedModel(nM=width, nC=chars)
other_tables = make_layer(config["@embed_features"])
mix = make_layer(config["@mix"])
model = chain(concatenate_lists(chr_embed, other_tables), mix)
model.cfg = config
return model
@register_architecture("spacy.MaxoutWindowEncoder.v1")
def MaxoutWindowEncoder(config):
nO = config["width"]
nW = config["window_size"]
nP = config["pieces"]
depth = config["depth"]
cnn = chain(
ExtractWindow(nW=nW), LayerNorm(Maxout(nO, nO * ((nW * 2) + 1), pieces=nP))
)
model = clone(Residual(cnn), depth)
model.nO = nO
model.receptive_field = nW * depth
return model
@register_architecture("spacy.MishWindowEncoder.v1")
def MishWindowEncoder(config):
from thinc.v2v import Mish
nO = config["width"]
nW = config["window_size"]
depth = config["depth"]
cnn = chain(ExtractWindow(nW=nW), LayerNorm(Mish(nO, nO * ((nW * 2) + 1))))
model = clone(Residual(cnn), depth)
model.nO = nO
return model
@register_architecture("spacy.PretrainedVectors.v1")
def PretrainedVectors(config):
return StaticVectors(config["vectors_name"], config["width"], config["column"])
@register_architecture("spacy.TorchBiLSTMEncoder.v1")
def TorchBiLSTMEncoder(config):
import torch.nn
from thinc.extra.wrappers import PyTorchWrapperRNN
width = config["width"]
depth = config["depth"]
if depth == 0:
return layerize(noop())
return with_square_sequences(
PyTorchWrapperRNN(torch.nn.LSTM(width, width // 2, depth, bidirectional=True))
)
_EXAMPLE_CONFIG = {
"@doc2feats": {
"arch": "Doc2Feats",
"config": {"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]},
},
"@embed": {
"arch": "spacy.MultiHashEmbed.v1",
"config": {
"width": 96,
"rows": 2000,
"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"],
"use_subwords": True,
"@pretrained_vectors": {
"arch": "TransformedStaticVectors",
"config": {
"vectors_name": "en_vectors_web_lg.vectors",
"width": 96,
"column": 0,
},
},
"@mix": {
"arch": "LayerNormalizedMaxout",
"config": {"width": 96, "pieces": 3},
},
},
},
"@encode": {
"arch": "MaxoutWindowEncode",
"config": {"width": 96, "window_size": 1, "depth": 4, "pieces": 3},
},
}
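The sizes in `MaxoutWindowEncoder` follow directly from its config values. As a worked example, the sketch below recomputes them for the `"@encode"` settings in `_EXAMPLE_CONFIG`; `maxout_window_shapes` is a hypothetical helper, and the comments describe the window behaviour as I understand it from the expressions in the diff rather than from thinc's documentation.

```python
def maxout_window_shapes(width, window_size, depth):
    """Derive the layer sizes used by the window encoder from its config."""
    # The Maxout layer takes width * ((nW * 2) + 1) inputs: the token itself
    # plus window_size neighbours on each side.
    window_input = width * (window_size * 2 + 1)
    # Stacking `depth` residual layers gives receptive_field = nW * depth,
    # which Tok2Vec passes on as the padding for with_flatten.
    receptive_field = window_size * depth
    return {"nI": window_input, "nO": width, "receptive_field": receptive_field}


shapes = maxout_window_shapes(width=96, window_size=1, depth=4)
# shapes == {"nI": 288, "nO": 96, "receptive_field": 4}
```

Note that `MishWindowEncoder` in the diff does not set `receptive_field`, so `Tok2Vec` falls back to a padding of 0 for it via `getattr(encode, "receptive_field", 0)`.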


@ -4,6 +4,7 @@ from __future__ import unicode_literals
from collections import defaultdict, OrderedDict from collections import defaultdict, OrderedDict
import srsly import srsly
from ..language import component
from ..errors import Errors from ..errors import Errors
from ..compat import basestring_ from ..compat import basestring_
from ..util import ensure_path, to_disk, from_disk from ..util import ensure_path, to_disk, from_disk
@ -13,6 +14,7 @@ from ..matcher import Matcher, PhraseMatcher
DEFAULT_ENT_ID_SEP = "||" DEFAULT_ENT_ID_SEP = "||"
@component("entity_ruler", assigns=["doc.ents", "token.ent_type", "token.ent_iob"])
class EntityRuler(object): class EntityRuler(object):
"""The EntityRuler lets you add spans to the `Doc.ents` using token-based """The EntityRuler lets you add spans to the `Doc.ents` using token-based
rules or exact phrase matches. It can be combined with the statistical rules or exact phrase matches. It can be combined with the statistical
@ -24,8 +26,6 @@ class EntityRuler(object):
USAGE: https://spacy.io/usage/rule-based-matching#entityruler USAGE: https://spacy.io/usage/rule-based-matching#entityruler
""" """
name = "entity_ruler"
def __init__(self, nlp, phrase_matcher_attr=None, validate=False, **cfg): def __init__(self, nlp, phrase_matcher_attr=None, validate=False, **cfg):
"""Initialize the entity ruler. If patterns are supplied here, they """Initialize the entity ruler. If patterns are supplied here, they
need to be a list of dictionaries with a `"label"` and `"pattern"` need to be a list of dictionaries with a `"label"` and `"pattern"`
@ -64,10 +64,15 @@ class EntityRuler(object):
self.phrase_matcher_attr = None self.phrase_matcher_attr = None
self.phrase_matcher = PhraseMatcher(nlp.vocab, validate=validate) self.phrase_matcher = PhraseMatcher(nlp.vocab, validate=validate)
self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP) self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
self._ent_ids = defaultdict(dict)
patterns = cfg.get("patterns") patterns = cfg.get("patterns")
if patterns is not None: if patterns is not None:
self.add_patterns(patterns) self.add_patterns(patterns)
@classmethod
def from_nlp(cls, nlp, **cfg):
return cls(nlp, **cfg)
def __len__(self): def __len__(self):
"""The number of all patterns added to the entity ruler.""" """The number of all patterns added to the entity ruler."""
n_token_patterns = sum(len(p) for p in self.token_patterns.values()) n_token_patterns = sum(len(p) for p in self.token_patterns.values())
@ -100,10 +105,9 @@ class EntityRuler(object):
continue continue
# check for end - 1 here because boundaries are inclusive # check for end - 1 here because boundaries are inclusive
if start not in seen_tokens and end - 1 not in seen_tokens: if start not in seen_tokens and end - 1 not in seen_tokens:
if self.ent_ids: if match_id in self._ent_ids:
label_ = self.nlp.vocab.strings[match_id] label, ent_id = self._ent_ids[match_id]
ent_label, ent_id = self._split_label(label_) span = Span(doc, start, end, label=label)
span = Span(doc, start, end, label=ent_label)
if ent_id: if ent_id:
for token in span: for token in span:
token.ent_id_ = ent_id token.ent_id_ = ent_id
@ -131,11 +135,11 @@ class EntityRuler(object):
@property @property
def ent_ids(self): def ent_ids(self):
"""All entity ids present in the match patterns meta dicts. """All entity ids present in the match patterns `id` properties.
RETURNS (set): The string entity ids. RETURNS (set): The string entity ids.
DOCS: https://spacy.io/api/entityruler#labels DOCS: https://spacy.io/api/entityruler#ent_ids
""" """
all_ent_ids = set() all_ent_ids = set()
for l in self.labels: for l in self.labels:
@ -147,7 +151,6 @@ class EntityRuler(object):
@property @property
def patterns(self): def patterns(self):
"""Get all patterns that were added to the entity ruler. """Get all patterns that were added to the entity ruler.
RETURNS (list): The original patterns, one dictionary per pattern. RETURNS (list): The original patterns, one dictionary per pattern.
DOCS: https://spacy.io/api/entityruler#patterns DOCS: https://spacy.io/api/entityruler#patterns
@ -188,11 +191,15 @@ class EntityRuler(object):
] ]
except ValueError: except ValueError:
subsequent_pipes = [] subsequent_pipes = []
with self.nlp.disable_pipes(*subsequent_pipes): with self.nlp.disable_pipes(subsequent_pipes):
for entry in patterns: for entry in patterns:
label = entry["label"] label = entry["label"]
if "id" in entry: if "id" in entry:
ent_label = label
label = self._create_label(label, entry["id"]) label = self._create_label(label, entry["id"])
key = self.matcher._normalize_key(label)
self._ent_ids[key] = (ent_label, entry["id"])
pattern = entry["pattern"] pattern = entry["pattern"]
if isinstance(pattern, basestring_): if isinstance(pattern, basestring_):
self.phrase_patterns[label].append(self.nlp(pattern)) self.phrase_patterns[label].append(self.nlp(pattern))
@ -201,9 +208,9 @@ class EntityRuler(object):
else: else:
raise ValueError(Errors.E097.format(pattern=pattern)) raise ValueError(Errors.E097.format(pattern=pattern))
for label, patterns in self.token_patterns.items(): for label, patterns in self.token_patterns.items():
self.matcher.add(label, None, *patterns) self.matcher.add(label, patterns)
for label, patterns in self.phrase_patterns.items(): for label, patterns in self.phrase_patterns.items():
self.phrase_matcher.add(label, None, *patterns) self.phrase_matcher.add(label, patterns)
def _split_label(self, label): def _split_label(self, label):
"""Split Entity label into ent_label and ent_id if it contains self.ent_id_sep """Split Entity label into ent_label and ent_id if it contains self.ent_id_sep
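The `_ent_ids` table added in this file replaces splitting the label string at match time with a lookup that is filled once when patterns are added. A rough, self-contained sketch of that idea follows. It makes two assumptions that are not shown in the diff: that the combined label is the label and id joined by the separator, and that a plain string can stand in for the hashed key returned by `matcher._normalize_key`. The helper names are hypothetical.

```python
ENT_ID_SEP = "||"  # same value as DEFAULT_ENT_ID_SEP in the diff


def build_ent_id_table(patterns, sep=ENT_ID_SEP):
    """Map each combined label to its (label, ent_id) pair."""
    table = {}
    for entry in patterns:
        label = entry["label"]
        if "id" in entry:
            # Assumption: the combined label is "<label><sep><id>".
            key = label + sep + entry["id"]
            table[key] = (label, entry["id"])
    return table


def resolve(match_key, table):
    """Return (label, ent_id); ent_id is None for patterns without an id."""
    if match_key in table:
        return table[match_key]
    return match_key, None


patterns = [
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "GPE", "pattern": "Paris"},
]
table = build_ent_id_table(patterns)
```

Only patterns that carry an `id` end up in the table, so matches for all other patterns skip the id handling entirely.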


@ -1,9 +1,16 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from ..language import component
from ..matcher import Matcher from ..matcher import Matcher
from ..util import filter_spans
@component(
"merge_noun_chunks",
requires=["token.dep", "token.tag", "token.pos"],
retokenizes=True,
)
def merge_noun_chunks(doc): def merge_noun_chunks(doc):
"""Merge noun chunks into a single token. """Merge noun chunks into a single token.
@ -21,6 +28,11 @@ def merge_noun_chunks(doc):
return doc return doc
@component(
"merge_entities",
requires=["doc.ents", "token.ent_iob", "token.ent_type"],
retokenizes=True,
)
def merge_entities(doc): def merge_entities(doc):
"""Merge entities into a single token. """Merge entities into a single token.
@ -36,6 +48,7 @@ def merge_entities(doc):
return doc return doc
@component("merge_subtokens", requires=["token.dep"], retokenizes=True)
def merge_subtokens(doc, label="subtok"): def merge_subtokens(doc, label="subtok"):
"""Merge subtokens into a single token. """Merge subtokens into a single token.
@ -48,7 +61,7 @@ def merge_subtokens(doc, label="subtok"):
merger = Matcher(doc.vocab) merger = Matcher(doc.vocab)
merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}]) merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}])
matches = merger(doc) matches = merger(doc)
spans = [doc[start : end + 1] for _, start, end in matches] spans = filter_spans([doc[start : end + 1] for _, start, end in matches])
with doc.retokenize() as retokenizer: with doc.retokenize() as retokenizer:
for span in spans: for span in spans:
retokenizer.merge(span) retokenizer.merge(span)
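
The hunk above wraps the matched spans in `filter_spans` before merging, presumably because the `+` operator can yield overlapping matches that cannot all be merged. The idea can be sketched in pure Python on `(start, end)` offsets with an exclusive end. `filter_spans_sketch` is a hypothetical name used for illustration and is not the library function; the "longest span wins, earlier start breaks ties" rule is an assumption about the intended behaviour.

```python
def filter_spans_sketch(spans):
    """Keep the longest spans and drop any span that overlaps a kept one.

    `spans` is a list of (start, end) token offsets with an exclusive end.
    This is an illustrative sketch, not spaCy's implementation.
    """
    # Longest first; ties go to the span that starts earlier.
    by_priority = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen_tokens = set()
    for start, end in by_priority:
        # Skip the span if any of its tokens is already covered.
        if any(i in seen_tokens for i in range(start, end)):
            continue
        kept.append((start, end))
        seen_tokens.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)


overlapping = [(0, 2), (1, 4), (4, 5), (1, 2)]
print(filter_spans_sketch(overlapping))
```

Filtering before the `with doc.retokenize()` block means every span handed to `retokenizer.merge` covers a disjoint set of tokens.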


@ -5,9 +5,11 @@ from thinc.t2v import Pooling, max_pool, mean_pool
from thinc.neural._classes.difference import Siamese, CauchySimilarity from thinc.neural._classes.difference import Siamese, CauchySimilarity
from .pipes import Pipe from .pipes import Pipe
from ..language import component
from .._ml import link_vectors_to_models from .._ml import link_vectors_to_models
@component("sentencizer_hook", assigns=["doc.user_hooks"])
class SentenceSegmenter(object): class SentenceSegmenter(object):
"""A simple spaCy hook, to allow custom sentence boundary detection logic """A simple spaCy hook, to allow custom sentence boundary detection logic
(that doesn't require the dependency parse). To change the sentence (that doesn't require the dependency parse). To change the sentence
@ -17,8 +19,6 @@ class SentenceSegmenter(object):
and yield `Span` objects for each sentence. and yield `Span` objects for each sentence.
""" """
name = "sentencizer"
def __init__(self, vocab, strategy=None): def __init__(self, vocab, strategy=None):
self.vocab = vocab self.vocab = vocab
if strategy is None or strategy == "on_punct": if strategy is None or strategy == "on_punct":
@ -44,6 +44,7 @@ class SentenceSegmenter(object):
yield doc[start : len(doc)] yield doc[start : len(doc)]
@component("similarity", assigns=["doc.user_hooks"])
class SimilarityHook(Pipe): class SimilarityHook(Pipe):
""" """
Experimental: A pipeline component to install a hook for supervised Experimental: A pipeline component to install a hook for supervised
@ -58,8 +59,6 @@ class SimilarityHook(Pipe):
Where W is a vector of dimension weights, initialized to 1. Where W is a vector of dimension weights, initialized to 1.
""" """
name = "similarity"
def __init__(self, vocab, model=True, **cfg): def __init__(self, vocab, model=True, **cfg):
self.vocab = vocab self.vocab = vocab
self.model = model self.model = model


@ -8,6 +8,7 @@ from thinc.api import chain
from thinc.neural.util import to_categorical, copy_array, get_array_module from thinc.neural.util import to_categorical, copy_array, get_array_module
from .. import util from .. import util
from .pipes import Pipe from .pipes import Pipe
from ..language import component
from .._ml import Tok2Vec, build_morphologizer_model from .._ml import Tok2Vec, build_morphologizer_model
from .._ml import link_vectors_to_models, zero_init, flatten from .._ml import link_vectors_to_models, zero_init, flatten
from .._ml import create_default_optimizer from .._ml import create_default_optimizer
@ -18,8 +19,8 @@ from ..vocab cimport Vocab
from ..morphology cimport Morphology from ..morphology cimport Morphology
@component("morphologizer", assigns=["token.morph", "token.pos"])
class Morphologizer(Pipe): class Morphologizer(Pipe):
name = 'morphologizer'
@classmethod @classmethod
def Model(cls, **cfg): def Model(cls, **cfg):


@ -13,7 +13,6 @@ from thinc.misc import LayerNorm
from thinc.neural.util import to_categorical from thinc.neural.util import to_categorical
from thinc.neural.util import get_array_module from thinc.neural.util import get_array_module
from .functions import merge_subtokens
from ..tokens.doc cimport Doc from ..tokens.doc cimport Doc
from ..syntax.nn_parser cimport Parser from ..syntax.nn_parser cimport Parser
from ..syntax.ner cimport BiluoPushDown from ..syntax.ner cimport BiluoPushDown
@ -21,6 +20,8 @@ from ..syntax.arc_eager cimport ArcEager
from ..morphology cimport Morphology from ..morphology cimport Morphology
from ..vocab cimport Vocab from ..vocab cimport Vocab
from .functions import merge_subtokens
from ..language import Language, component
from ..syntax import nonproj from ..syntax import nonproj
from ..attrs import POS, ID from ..attrs import POS, ID
from ..parts_of_speech import X from ..parts_of_speech import X
@ -54,6 +55,10 @@ class Pipe(object):
"""Initialize a model for the pipe.""" """Initialize a model for the pipe."""
raise NotImplementedError raise NotImplementedError
@classmethod
def from_nlp(cls, nlp, **cfg):
return cls(nlp.vocab, **cfg)
def __init__(self, vocab, model=True, **cfg): def __init__(self, vocab, model=True, **cfg):
"""Create a new pipe instance.""" """Create a new pipe instance."""
raise NotImplementedError raise NotImplementedError
@ -223,11 +228,10 @@ class Pipe(object):
return self return self
@component("tensorizer", assigns=["doc.tensor"])
class Tensorizer(Pipe): class Tensorizer(Pipe):
"""Pre-train position-sensitive vectors for tokens.""" """Pre-train position-sensitive vectors for tokens."""
name = "tensorizer"
@classmethod @classmethod
def Model(cls, output_size=300, **cfg): def Model(cls, output_size=300, **cfg):
"""Create a new statistical model for the class. """Create a new statistical model for the class.
@ -362,14 +366,13 @@ class Tensorizer(Pipe):
return sgd return sgd
@component("tagger", assigns=["token.tag", "token.pos"])
class Tagger(Pipe): class Tagger(Pipe):
"""Pipeline component for part-of-speech tagging. """Pipeline component for part-of-speech tagging.
DOCS: https://spacy.io/api/tagger DOCS: https://spacy.io/api/tagger
""" """
name = "tagger"
def __init__(self, vocab, model=True, **cfg): def __init__(self, vocab, model=True, **cfg):
self.vocab = vocab self.vocab = vocab
self.model = model self.model = model
@ -514,7 +517,6 @@ class Tagger(Pipe):
orig_tag_map = dict(self.vocab.morphology.tag_map) orig_tag_map = dict(self.vocab.morphology.tag_map)
new_tag_map = OrderedDict() new_tag_map = OrderedDict()
for raw_text, annots_brackets in get_gold_tuples(): for raw_text, annots_brackets in get_gold_tuples():
_ = annots_brackets.pop()
for annots, brackets in annots_brackets: for annots, brackets in annots_brackets:
ids, words, tags, heads, deps, ents = annots ids, words, tags, heads, deps, ents = annots
for tag in tags: for tag in tags:
@ -657,13 +659,12 @@ class Tagger(Pipe):
return self return self
@component("nn_labeller")
class MultitaskObjective(Tagger): class MultitaskObjective(Tagger):
"""Experimental: Assist training of a parser or tagger, by training a """Experimental: Assist training of a parser or tagger, by training a
side-objective. side-objective.
""" """
name = "nn_labeller"
def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg): def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg):
self.vocab = vocab self.vocab = vocab
self.model = model self.model = model
@ -898,12 +899,12 @@ class ClozeMultitask(Pipe):
losses[self.name] += loss losses[self.name] += loss
@component("textcat", assigns=["doc.cats"])
class TextCategorizer(Pipe): class TextCategorizer(Pipe):
"""Pipeline component for text classification. """Pipeline component for text classification.
DOCS: https://spacy.io/api/textcategorizer DOCS: https://spacy.io/api/textcategorizer
""" """
name = 'textcat'
@classmethod @classmethod
def Model(cls, nr_class=1, **cfg): def Model(cls, nr_class=1, **cfg):
@ -1032,10 +1033,10 @@ class TextCategorizer(Pipe):
return 1 return 1
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
for raw_text, annots_brackets in get_gold_tuples(): for raw_text, annot_brackets in get_gold_tuples():
cats = annots_brackets.pop() for _, (cats, _2) in annot_brackets:
for cat in cats: for cat in cats:
self.add_label(cat) self.add_label(cat)
if self.model is True: if self.model is True:
self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors") self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
self.require_labels() self.require_labels()
@ -1051,8 +1052,11 @@ cdef class DependencyParser(Parser):
DOCS: https://spacy.io/api/dependencyparser DOCS: https://spacy.io/api/dependencyparser
""" """
# cdef classes can't have decorators, so we're defining this here
name = "parser" name = "parser"
factory = "parser"
assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
requires = []
TransitionSystem = ArcEager TransitionSystem = ArcEager
@property @property
@ -1097,8 +1101,10 @@ cdef class EntityRecognizer(Parser):
DOCS: https://spacy.io/api/entityrecognizer DOCS: https://spacy.io/api/entityrecognizer
""" """
name = "ner" name = "ner"
factory = "ner"
assigns = ["doc.ents", "token.ent_iob", "token.ent_type"]
requires = []
TransitionSystem = BiluoPushDown TransitionSystem = BiluoPushDown
nr_feature = 6 nr_feature = 6
@ -1129,12 +1135,16 @@ cdef class EntityRecognizer(Parser):
return tuple(sorted(labels)) return tuple(sorted(labels))
@component(
"entity_linker",
requires=["doc.ents", "token.ent_iob", "token.ent_type"],
assigns=["token.ent_kb_id"]
)
class EntityLinker(Pipe): class EntityLinker(Pipe):
"""Pipeline component for named entity linking. """Pipeline component for named entity linking.
DOCS: https://spacy.io/api/entitylinker DOCS: https://spacy.io/api/entitylinker
""" """
name = 'entity_linker'
NIL = "NIL" # string used to refer to a non-existing link NIL = "NIL" # string used to refer to a non-existing link
@classmethod @classmethod
@ -1298,7 +1308,8 @@ class EntityLinker(Pipe):
for ent in sent_doc.ents: for ent in sent_doc.ents:
entity_count += 1 entity_count += 1
if ent.label_ in self.cfg.get("labels_discard", []): to_discard = self.cfg.get("labels_discard", [])
if to_discard and ent.label_ in to_discard:
# ignoring this entity - setting to NIL # ignoring this entity - setting to NIL
final_kb_ids.append(self.NIL) final_kb_ids.append(self.NIL)
final_tensors.append(sentence_encoding) final_tensors.append(sentence_encoding)
@ -1404,13 +1415,13 @@ class EntityLinker(Pipe):
raise NotImplementedError raise NotImplementedError
@component("sentencizer", assigns=["token.is_sent_start", "doc.sents"])
class Sentencizer(object): class Sentencizer(object):
"""Segment the Doc into sentences using a rule-based strategy. """Segment the Doc into sentences using a rule-based strategy.
DOCS: https://spacy.io/api/sentencizer DOCS: https://spacy.io/api/sentencizer
""" """
name = "sentencizer"
default_punct_chars = ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', default_punct_chars = ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '᱿', '', '', '', '', '', '', '', '', '', '', '', '', '᱿',
@ -1436,6 +1447,10 @@ class Sentencizer(object):
else: else:
self.punct_chars = set(self.default_punct_chars) self.punct_chars = set(self.default_punct_chars)
@classmethod
def from_nlp(cls, nlp, **cfg):
return cls(**cfg)
def __call__(self, doc): def __call__(self, doc):
"""Apply the sentencizer to a Doc and set Token.is_sent_start. """Apply the sentencizer to a Doc and set Token.is_sent_start.
@ -1502,4 +1517,9 @@ class Sentencizer(object):
return self return self
# Cython classes can't be decorated, so we need to add the factories here
Language.factories["parser"] = lambda nlp, **cfg: DependencyParser.from_nlp(nlp, **cfg)
Language.factories["ner"] = lambda nlp, **cfg: EntityRecognizer.from_nlp(nlp, **cfg)
__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"] __all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"]
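
The changes in this file replace per-class `name = "..."` attributes with a `@component(...)` decorator and add `from_nlp` classmethods, while Cython classes are registered by hand in `Language.factories`. A minimal sketch of how such a decorator could work is given below. Everything in it (`FACTORIES`, `ToyNLP`, `ToyTagger`, `toy_merge`, and the decorator body itself) is hypothetical and written for illustration; it is not the implementation in `spacy.language`.

```python
FACTORIES = {}  # hypothetical stand-in for Language.factories


def component(name=None, assigns=(), requires=(), retokenizes=False):
    """Attach pipeline metadata to a class or function and register a factory."""

    def decorator(obj):
        factory_name = name if name is not None else obj.__name__
        obj.name = factory_name
        obj.factory = factory_name
        obj.assigns = list(assigns)
        obj.requires = list(requires)
        obj.retokenizes = retokenizes

        def factory(nlp, **cfg):
            # Classes that define from_nlp are constructed from the nlp object;
            # plain functions are used as the component directly.
            if hasattr(obj, "from_nlp"):
                return obj.from_nlp(nlp, **cfg)
            return obj

        FACTORIES[factory_name] = factory
        return obj

    return decorator


class ToyNLP(object):
    vocab = "toy-vocab"


@component("toy_tagger", assigns=["token.tag", "token.pos"])
class ToyTagger(object):
    def __init__(self, vocab, **cfg):
        self.vocab = vocab
        self.cfg = cfg

    @classmethod
    def from_nlp(cls, nlp, **cfg):
        return cls(nlp.vocab, **cfg)


@component("toy_merge", requires=["token.dep"], retokenizes=True)
def toy_merge(doc):
    return doc


tagger = FACTORIES["toy_tagger"](ToyNLP(), width=64)
print(tagger.name, tagger.assigns, tagger.cfg)
```

Keeping the metadata next to the definition is what lets a pipeline check whether the attributes a component `requires` are assigned by an earlier component. A class that cannot be decorated needs the same attributes and the factory entry written out explicitly, as the diff does for the parser and the entity recognizer.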


@ -19,7 +19,7 @@ from thinc.extra.search cimport Beam
from thinc.api import chain, clone from thinc.api import chain, clone
from thinc.v2v import Model, Maxout, Affine from thinc.v2v import Model, Maxout, Affine
from thinc.misc import LayerNorm from thinc.misc import LayerNorm
from thinc.neural.ops import CupyOps from thinc.neural.ops import CupyOps, NumpyOps
from thinc.neural.util import get_array_module from thinc.neural.util import get_array_module
from thinc.linalg cimport Vec, VecVec from thinc.linalg cimport Vec, VecVec
cimport blis.cy cimport blis.cy
@ -440,28 +440,38 @@ cdef class precompute_hiddens:
def backward(d_state_vector_ids, sgd=None): def backward(d_state_vector_ids, sgd=None):
d_state_vector, token_ids = d_state_vector_ids d_state_vector, token_ids = d_state_vector_ids
d_state_vector = bp_nonlinearity(d_state_vector, sgd) d_state_vector = bp_nonlinearity(d_state_vector, sgd)
# This will usually be on GPU
if not isinstance(d_state_vector, self.ops.xp.ndarray):
d_state_vector = self.ops.xp.array(d_state_vector)
d_tokens = bp_hiddens((d_state_vector, token_ids), sgd) d_tokens = bp_hiddens((d_state_vector, token_ids), sgd)
return d_tokens return d_tokens
return state_vector, backward return state_vector, backward
def _nonlinearity(self, state_vector): def _nonlinearity(self, state_vector):
if isinstance(state_vector, numpy.ndarray):
ops = NumpyOps()
else:
ops = CupyOps()
if self.nP == 1: if self.nP == 1:
state_vector = state_vector.reshape(state_vector.shape[:-1]) state_vector = state_vector.reshape(state_vector.shape[:-1])
mask = state_vector >= 0. mask = state_vector >= 0.
state_vector *= mask state_vector *= mask
else: else:
state_vector, mask = self.ops.maxout(state_vector) state_vector, mask = ops.maxout(state_vector)
def backprop_nonlinearity(d_best, sgd=None): def backprop_nonlinearity(d_best, sgd=None):
if isinstance(d_best, numpy.ndarray):
ops = NumpyOps()
else:
ops = CupyOps()
mask_ = ops.asarray(mask)
# This will usually be on GPU
d_best = ops.asarray(d_best)
# Fix nans (which can occur from unseen classes.) # Fix nans (which can occur from unseen classes.)
d_best[self.ops.xp.isnan(d_best)] = 0. d_best[ops.xp.isnan(d_best)] = 0.
if self.nP == 1: if self.nP == 1:
d_best *= mask d_best *= mask_
d_best = d_best.reshape((d_best.shape + (1,))) d_best = d_best.reshape((d_best.shape + (1,)))
return d_best return d_best
else: else:
return self.ops.backprop_maxout(d_best, mask, self.nP) return ops.backprop_maxout(d_best, mask_, self.nP)
return state_vector, backprop_nonlinearity return state_vector, backprop_nonlinearity
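
For the `nP == 1` branch above, the forward pass zeroes negative activations and the backward pass reuses the same mask after replacing NaN gradients with zero. The sketch below reproduces only that mask logic on plain Python lists. The choice between `NumpyOps` and `CupyOps` depends on thinc and CuPy and is not reproduced here; `relu_with_backprop` is a hypothetical name.

```python
import math


def relu_with_backprop(state_vector):
    """Forward pass of the nP == 1 branch, on a plain list of floats."""
    mask = [value >= 0.0 for value in state_vector]
    output = [value if keep else 0.0 for value, keep in zip(state_vector, mask)]

    def backprop(d_best):
        # Replace NaN gradients with zero before applying the mask.
        cleaned = [0.0 if math.isnan(g) else g for g in d_best]
        return [g if keep else 0.0 for g, keep in zip(cleaned, mask)]

    return output, backprop


output, backprop = relu_with_backprop([1.5, -2.0, 0.0])
print(output)
print(backprop([0.1, 0.2, float("nan")]))
```

Because the backward closure captures `mask` from the forward call, the mask and the incoming gradient have to live on the same device, which appears to be why the diff converts both with `ops.asarray` before multiplying.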


@ -342,7 +342,6 @@ cdef class ArcEager(TransitionSystem):
actions[RIGHT][label] = 1 actions[RIGHT][label] = 1
actions[REDUCE][label] = 1 actions[REDUCE][label] = 1
for raw_text, sents in kwargs.get('gold_parses', []): for raw_text, sents in kwargs.get('gold_parses', []):
_ = sents.pop()
for (ids, words, tags, heads, labels, iob), ctnts in sents: for (ids, words, tags, heads, labels, iob), ctnts in sents:
heads, labels = nonproj.projectivize(heads, labels) heads, labels = nonproj.projectivize(heads, labels)
for child, head, label in zip(ids, heads, labels): for child, head, label in zip(ids, heads, labels):


@ -73,7 +73,6 @@ cdef class BiluoPushDown(TransitionSystem):
actions[action][entity_type] = 1 actions[action][entity_type] = 1
moves = ('M', 'B', 'I', 'L', 'U') moves = ('M', 'B', 'I', 'L', 'U')
for raw_text, sents in kwargs.get('gold_parses', []): for raw_text, sents in kwargs.get('gold_parses', []):
_ = sents.pop()
for (ids, words, tags, heads, labels, biluo), _ in sents: for (ids, words, tags, heads, labels, biluo), _ in sents:
for i, ner_tag in enumerate(biluo): for i, ner_tag in enumerate(biluo):
if ner_tag != 'O' and ner_tag != '-': if ner_tag != 'O' and ner_tag != '-':


@ -57,7 +57,10 @@ cdef class Parser:
subword_features = util.env_opt('subword_features', subword_features = util.env_opt('subword_features',
cfg.get('subword_features', True)) cfg.get('subword_features', True))
conv_depth = util.env_opt('conv_depth', cfg.get('conv_depth', 4)) conv_depth = util.env_opt('conv_depth', cfg.get('conv_depth', 4))
conv_window = util.env_opt('conv_window', cfg.get('conv_window', 1))
t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3))
bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0)) bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0))
self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0))
if depth != 1: if depth != 1:
raise ValueError(TempErrors.T004.format(value=depth)) raise ValueError(TempErrors.T004.format(value=depth))
parser_maxout_pieces = util.env_opt('parser_maxout_pieces', parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
@ -69,6 +72,8 @@ cdef class Parser:
pretrained_vectors = cfg.get('pretrained_vectors', None) pretrained_vectors = cfg.get('pretrained_vectors', None)
tok2vec = Tok2Vec(token_vector_width, embed_size, tok2vec = Tok2Vec(token_vector_width, embed_size,
conv_depth=conv_depth, conv_depth=conv_depth,
conv_window=conv_window,
cnn_maxout_pieces=t2v_pieces,
subword_features=subword_features, subword_features=subword_features,
pretrained_vectors=pretrained_vectors, pretrained_vectors=pretrained_vectors,
bilstm_depth=bilstm_depth) bilstm_depth=bilstm_depth)
@ -90,7 +95,12 @@ cdef class Parser:
'hidden_width': hidden_width, 'hidden_width': hidden_width,
'maxout_pieces': parser_maxout_pieces, 'maxout_pieces': parser_maxout_pieces,
'pretrained_vectors': pretrained_vectors, 'pretrained_vectors': pretrained_vectors,
'bilstm_depth': bilstm_depth 'bilstm_depth': bilstm_depth,
'self_attn_depth': self_attn_depth,
'conv_depth': conv_depth,
'conv_window': conv_window,
'embed_size': embed_size,
'cnn_maxout_pieces': t2v_pieces
} }
return ParserModel(tok2vec, lower, upper), cfg return ParserModel(tok2vec, lower, upper), cfg
@ -128,6 +138,10 @@ cdef class Parser:
self._multitasks = [] self._multitasks = []
self._rehearsal_model = None self._rehearsal_model = None
@classmethod
def from_nlp(cls, nlp, **cfg):
return cls(nlp.vocab, **cfg)
def __reduce__(self): def __reduce__(self):
return (Parser, (self.vocab, self.moves, self.model), None, None) return (Parser, (self.vocab, self.moves, self.model), None, None)
@ -602,12 +616,11 @@ cdef class Parser:
doc_sample = [] doc_sample = []
gold_sample = [] gold_sample = []
for raw_text, annots_brackets in islice(get_gold_tuples(), 1000): for raw_text, annots_brackets in islice(get_gold_tuples(), 1000):
_ = annots_brackets.pop()
for annots, brackets in annots_brackets: for annots, brackets in annots_brackets:
ids, words, tags, heads, deps, ents = annots ids, words, tags, heads, deps, ents = annots
doc_sample.append(Doc(self.vocab, words=words)) doc_sample.append(Doc(self.vocab, words=words))
gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags, gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags,
heads=heads, deps=deps, ents=ents)) heads=heads, deps=deps, entities=ents))
self.model.begin_training(doc_sample, gold_sample) self.model.begin_training(doc_sample, gold_sample)
if pipeline is not None: if pipeline is not None:
self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg) self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg)


@ -2,9 +2,9 @@
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
from spacy.lang.sv.syntax_iterators import SYNTAX_ITERATORS
from ...util import get_doc from ...util import get_doc
SV_NP_TEST_EXAMPLES = [ SV_NP_TEST_EXAMPLES = [
( (
"En student läste en bok", # A student read a book "En student läste en bok", # A student read a book
@ -45,4 +45,3 @@ def test_sv_noun_chunks(sv_tokenizer, text, pos, deps, heads, expected_noun_chun
assert len(noun_chunks) == len(expected_noun_chunks) assert len(noun_chunks) == len(expected_noun_chunks)
for i, np in enumerate(noun_chunks): for i, np in enumerate(noun_chunks):
assert np.text == expected_noun_chunks[i] assert np.text == expected_noun_chunks[i]


@ -17,7 +17,7 @@ def matcher(en_vocab):
} }
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
for key, patterns in rules.items(): for key, patterns in rules.items():
matcher.add(key, None, *patterns) matcher.add(key, patterns)
return matcher return matcher
@ -25,11 +25,11 @@ def test_matcher_from_api_docs(en_vocab):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
pattern = [{"ORTH": "test"}] pattern = [{"ORTH": "test"}]
assert len(matcher) == 0 assert len(matcher) == 0
matcher.add("Rule", None, pattern) matcher.add("Rule", [pattern])
assert len(matcher) == 1 assert len(matcher) == 1
matcher.remove("Rule") matcher.remove("Rule")
assert "Rule" not in matcher assert "Rule" not in matcher
matcher.add("Rule", None, pattern) matcher.add("Rule", [pattern])
assert "Rule" in matcher assert "Rule" in matcher
on_match, patterns = matcher.get("Rule") on_match, patterns = matcher.get("Rule")
assert len(patterns[0]) assert len(patterns[0])
@ -52,7 +52,7 @@ def test_matcher_from_usage_docs(en_vocab):
token.vocab[token.text].norm_ = "happy emoji" token.vocab[token.text].norm_ = "happy emoji"
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("HAPPY", label_sentiment, *pos_patterns) matcher.add("HAPPY", pos_patterns, on_match=label_sentiment)
matcher(doc) matcher(doc)
assert doc.sentiment != 0 assert doc.sentiment != 0
assert doc[1].norm_ == "happy emoji" assert doc[1].norm_ == "happy emoji"
@ -60,11 +60,33 @@ def test_matcher_from_usage_docs(en_vocab):
def test_matcher_len_contains(matcher): def test_matcher_len_contains(matcher):
assert len(matcher) == 3 assert len(matcher) == 3
matcher.add("TEST", None, [{"ORTH": "test"}]) matcher.add("TEST", [[{"ORTH": "test"}]])
assert "TEST" in matcher assert "TEST" in matcher
assert "TEST2" not in matcher assert "TEST2" not in matcher
def test_matcher_add_new_old_api(en_vocab):
doc = Doc(en_vocab, words=["a", "b"])
patterns = [[{"TEXT": "a"}], [{"TEXT": "a"}, {"TEXT": "b"}]]
matcher = Matcher(en_vocab)
matcher.add("OLD_API", None, *patterns)
assert len(matcher(doc)) == 2
matcher = Matcher(en_vocab)
on_match = Mock()
matcher.add("OLD_API_CALLBACK", on_match, *patterns)
assert len(matcher(doc)) == 2
assert on_match.call_count == 2
# New API: add(key: str, patterns: List[List[dict]], on_match: Callable)
matcher = Matcher(en_vocab)
matcher.add("NEW_API", patterns)
assert len(matcher(doc)) == 2
matcher = Matcher(en_vocab)
on_match = Mock()
matcher.add("NEW_API_CALLBACK", patterns, on_match=on_match)
assert len(matcher(doc)) == 2
assert on_match.call_count == 2
def test_matcher_no_match(matcher): def test_matcher_no_match(matcher):
doc = Doc(matcher.vocab, words=["I", "like", "cheese", "."]) doc = Doc(matcher.vocab, words=["I", "like", "cheese", "."])
assert matcher(doc) == [] assert matcher(doc) == []
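
The tests in this hunk exercise both call styles of `Matcher.add`: the old `add(key, on_match, *patterns)` and the new `add(key, patterns, on_match=...)`. One way a single method could accept both is sketched below. `ToyMatcher` is a hypothetical class that only stores patterns; the dispatch on `None` or a callable in second position is an assumption about how the compatibility could be implemented, not a copy of the library code.

```python
class ToyMatcher(object):
    """Hypothetical stand-in that only stores patterns and callbacks."""

    def __init__(self):
        self.patterns = {}
        self.callbacks = {}

    def add(self, key, patterns, *old_style_patterns, **kwargs):
        on_match = kwargs.get("on_match")
        # Old call style: add(key, on_match_or_None, *patterns)
        if patterns is None or callable(patterns):
            on_match = patterns
            patterns = list(old_style_patterns)
        for pattern in patterns:
            # Each pattern must itself be a list of token dicts.
            if not isinstance(pattern, list):
                raise ValueError("expected a list of patterns, got a single pattern")
        self.patterns.setdefault(key, []).extend(patterns)
        self.callbacks[key] = on_match


patterns = [[{"TEXT": "a"}], [{"TEXT": "a"}, {"TEXT": "b"}]]
matcher = ToyMatcher()
matcher.add("OLD_API", None, *patterns)
matcher.add("NEW_API", patterns)
print(matcher.patterns["OLD_API"] == matcher.patterns["NEW_API"])
```

The type check inside the loop catches the mistake of passing one pattern where a list of patterns is expected: iterating over a single pattern yields token dicts rather than lists, so the call fails with a `ValueError` instead of silently registering nothing useful.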
@ -100,12 +122,12 @@ def test_matcher_empty_dict(en_vocab):
"""Test matcher allows empty token specs, meaning match on any token.""" """Test matcher allows empty token specs, meaning match on any token."""
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
doc = Doc(matcher.vocab, words=["a", "b", "c"]) doc = Doc(matcher.vocab, words=["a", "b", "c"])
matcher.add("A.C", None, [{"ORTH": "a"}, {}, {"ORTH": "c"}]) matcher.add("A.C", [[{"ORTH": "a"}, {}, {"ORTH": "c"}]])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 1 assert len(matches) == 1
assert matches[0][1:] == (0, 3) assert matches[0][1:] == (0, 3)
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("A.", None, [{"ORTH": "a"}, {}]) matcher.add("A.", [[{"ORTH": "a"}, {}]])
matches = matcher(doc) matches = matcher(doc)
assert matches[0][1:] == (0, 2) assert matches[0][1:] == (0, 2)
@ -114,7 +136,7 @@ def test_matcher_operator_shadow(en_vocab):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
doc = Doc(matcher.vocab, words=["a", "b", "c"]) doc = Doc(matcher.vocab, words=["a", "b", "c"])
pattern = [{"ORTH": "a"}, {"IS_ALPHA": True, "OP": "+"}, {"ORTH": "c"}] pattern = [{"ORTH": "a"}, {"IS_ALPHA": True, "OP": "+"}, {"ORTH": "c"}]
matcher.add("A.C", None, pattern) matcher.add("A.C", [pattern])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 1 assert len(matches) == 1
assert matches[0][1:] == (0, 3) assert matches[0][1:] == (0, 3)
@ -136,12 +158,12 @@ def test_matcher_match_zero(matcher):
{"IS_PUNCT": True}, {"IS_PUNCT": True},
{"ORTH": '"'}, {"ORTH": '"'},
] ]
matcher.add("Quote", None, pattern1) matcher.add("Quote", [pattern1])
doc = Doc(matcher.vocab, words=words1) doc = Doc(matcher.vocab, words=words1)
assert len(matcher(doc)) == 1 assert len(matcher(doc)) == 1
doc = Doc(matcher.vocab, words=words2) doc = Doc(matcher.vocab, words=words2)
assert len(matcher(doc)) == 0 assert len(matcher(doc)) == 0
matcher.add("Quote", None, pattern2) matcher.add("Quote", [pattern2])
assert len(matcher(doc)) == 0 assert len(matcher(doc)) == 0
@ -149,7 +171,7 @@ def test_matcher_match_zero_plus(matcher):
words = 'He said , " some words " ...'.split() words = 'He said , " some words " ...'.split()
pattern = [{"ORTH": '"'}, {"OP": "*", "IS_PUNCT": False}, {"ORTH": '"'}] pattern = [{"ORTH": '"'}, {"OP": "*", "IS_PUNCT": False}, {"ORTH": '"'}]
matcher = Matcher(matcher.vocab) matcher = Matcher(matcher.vocab)
matcher.add("Quote", None, pattern) matcher.add("Quote", [pattern])
doc = Doc(matcher.vocab, words=words) doc = Doc(matcher.vocab, words=words)
assert len(matcher(doc)) == 1 assert len(matcher(doc)) == 1
@ -160,11 +182,8 @@ def test_matcher_match_one_plus(matcher):
doc = Doc(control.vocab, words=["Philippe", "Philippe"]) doc = Doc(control.vocab, words=["Philippe", "Philippe"])
m = control(doc) m = control(doc)
assert len(m) == 2 assert len(m) == 2
matcher.add( pattern = [{"ORTH": "Philippe", "OP": "1"}, {"ORTH": "Philippe", "OP": "+"}]
"KleenePhilippe", matcher.add("KleenePhilippe", [pattern])
None,
[{"ORTH": "Philippe", "OP": "1"}, {"ORTH": "Philippe", "OP": "+"}],
)
m = matcher(doc) m = matcher(doc)
assert len(m) == 1 assert len(m) == 1
@ -172,7 +191,7 @@ def test_matcher_match_one_plus(matcher):
def test_matcher_any_token_operator(en_vocab): def test_matcher_any_token_operator(en_vocab):
"""Test that patterns with "any token" {} work with operators.""" """Test that patterns with "any token" {} work with operators."""
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"ORTH": "test"}, {"OP": "*"}]) matcher.add("TEST", [[{"ORTH": "test"}, {"OP": "*"}]])
doc = Doc(en_vocab, words=["test", "hello", "world"]) doc = Doc(en_vocab, words=["test", "hello", "world"])
matches = [doc[start:end].text for _, start, end in matcher(doc)] matches = [doc[start:end].text for _, start, end in matcher(doc)]
assert len(matches) == 3 assert len(matches) == 3
@ -186,7 +205,7 @@ def test_matcher_extension_attribute(en_vocab):
get_is_fruit = lambda token: token.text in ("apple", "banana") get_is_fruit = lambda token: token.text in ("apple", "banana")
Token.set_extension("is_fruit", getter=get_is_fruit, force=True) Token.set_extension("is_fruit", getter=get_is_fruit, force=True)
pattern = [{"ORTH": "an"}, {"_": {"is_fruit": True}}] pattern = [{"ORTH": "an"}, {"_": {"is_fruit": True}}]
matcher.add("HAVING_FRUIT", None, pattern) matcher.add("HAVING_FRUIT", [pattern])
doc = Doc(en_vocab, words=["an", "apple"]) doc = Doc(en_vocab, words=["an", "apple"])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 1 assert len(matches) == 1
@ -198,7 +217,7 @@ def test_matcher_extension_attribute(en_vocab):
def test_matcher_set_value(en_vocab): def test_matcher_set_value(en_vocab):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
pattern = [{"ORTH": {"IN": ["an", "a"]}}] pattern = [{"ORTH": {"IN": ["an", "a"]}}]
matcher.add("A_OR_AN", None, pattern) matcher.add("A_OR_AN", [pattern])
doc = Doc(en_vocab, words=["an", "a", "apple"]) doc = Doc(en_vocab, words=["an", "a", "apple"])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 2 assert len(matches) == 2
@ -210,7 +229,7 @@ def test_matcher_set_value(en_vocab):
def test_matcher_set_value_operator(en_vocab): def test_matcher_set_value_operator(en_vocab):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
pattern = [{"ORTH": {"IN": ["a", "the"]}, "OP": "?"}, {"ORTH": "house"}] pattern = [{"ORTH": {"IN": ["a", "the"]}, "OP": "?"}, {"ORTH": "house"}]
matcher.add("DET_HOUSE", None, pattern) matcher.add("DET_HOUSE", [pattern])
doc = Doc(en_vocab, words=["In", "a", "house"]) doc = Doc(en_vocab, words=["In", "a", "house"])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 2 assert len(matches) == 2
@ -222,7 +241,7 @@ def test_matcher_set_value_operator(en_vocab):
def test_matcher_regex(en_vocab): def test_matcher_regex(en_vocab):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
pattern = [{"ORTH": {"REGEX": r"(?:a|an)"}}] pattern = [{"ORTH": {"REGEX": r"(?:a|an)"}}]
matcher.add("A_OR_AN", None, pattern) matcher.add("A_OR_AN", [pattern])
doc = Doc(en_vocab, words=["an", "a", "hi"]) doc = Doc(en_vocab, words=["an", "a", "hi"])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 2 assert len(matches) == 2
@ -234,7 +253,7 @@ def test_matcher_regex(en_vocab):
def test_matcher_regex_shape(en_vocab): def test_matcher_regex_shape(en_vocab):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
pattern = [{"SHAPE": {"REGEX": r"^[^x]+$"}}] pattern = [{"SHAPE": {"REGEX": r"^[^x]+$"}}]
matcher.add("NON_ALPHA", None, pattern) matcher.add("NON_ALPHA", [pattern])
doc = Doc(en_vocab, words=["99", "problems", "!"]) doc = Doc(en_vocab, words=["99", "problems", "!"])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 2 assert len(matches) == 2
@ -246,7 +265,7 @@ def test_matcher_regex_shape(en_vocab):
def test_matcher_compare_length(en_vocab): def test_matcher_compare_length(en_vocab):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
pattern = [{"LENGTH": {">=": 2}}] pattern = [{"LENGTH": {">=": 2}}]
matcher.add("LENGTH_COMPARE", None, pattern) matcher.add("LENGTH_COMPARE", [pattern])
doc = Doc(en_vocab, words=["a", "aa", "aaa"]) doc = Doc(en_vocab, words=["a", "aa", "aaa"])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 2 assert len(matches) == 2
@ -260,7 +279,7 @@ def test_matcher_extension_set_membership(en_vocab):
get_reversed = lambda token: "".join(reversed(token.text)) get_reversed = lambda token: "".join(reversed(token.text))
Token.set_extension("reversed", getter=get_reversed, force=True) Token.set_extension("reversed", getter=get_reversed, force=True)
pattern = [{"_": {"reversed": {"IN": ["eyb", "ih"]}}}] pattern = [{"_": {"reversed": {"IN": ["eyb", "ih"]}}}]
matcher.add("REVERSED", None, pattern) matcher.add("REVERSED", [pattern])
doc = Doc(en_vocab, words=["hi", "bye", "hello"]) doc = Doc(en_vocab, words=["hi", "bye", "hello"])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 2 assert len(matches) == 2
@ -328,9 +347,9 @@ def dependency_matcher(en_vocab):
] ]
matcher = DependencyMatcher(en_vocab) matcher = DependencyMatcher(en_vocab)
matcher.add("pattern1", None, pattern1) matcher.add("pattern1", [pattern1])
matcher.add("pattern2", None, pattern2) matcher.add("pattern2", [pattern2])
matcher.add("pattern3", None, pattern3) matcher.add("pattern3", [pattern3])
return matcher return matcher
@ -347,6 +366,14 @@ def test_dependency_matcher_compile(dependency_matcher):
# assert matches[2][1] == [[4, 3, 2]] # assert matches[2][1] == [[4, 3, 2]]
def test_matcher_basic_check(en_vocab):
matcher = Matcher(en_vocab)
# Potential mistake: pass in pattern instead of list of patterns
pattern = [{"TEXT": "hello"}, {"TEXT": "world"}]
with pytest.raises(ValueError):
matcher.add("TEST", pattern)
def test_attr_pipeline_checks(en_vocab): def test_attr_pipeline_checks(en_vocab):
doc1 = Doc(en_vocab, words=["Test"]) doc1 = Doc(en_vocab, words=["Test"])
doc1.is_parsed = True doc1.is_parsed = True
@ -355,7 +382,7 @@ def test_attr_pipeline_checks(en_vocab):
doc3 = Doc(en_vocab, words=["Test"]) doc3 = Doc(en_vocab, words=["Test"])
# DEP requires is_parsed # DEP requires is_parsed
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"DEP": "a"}]) matcher.add("TEST", [[{"DEP": "a"}]])
matcher(doc1) matcher(doc1)
with pytest.raises(ValueError): with pytest.raises(ValueError):
matcher(doc2) matcher(doc2)
@ -364,7 +391,7 @@ def test_attr_pipeline_checks(en_vocab):
# TAG, POS, LEMMA require is_tagged # TAG, POS, LEMMA require is_tagged
for attr in ("TAG", "POS", "LEMMA"): for attr in ("TAG", "POS", "LEMMA"):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{attr: "a"}]) matcher.add("TEST", [[{attr: "a"}]])
matcher(doc2) matcher(doc2)
with pytest.raises(ValueError): with pytest.raises(ValueError):
matcher(doc1) matcher(doc1)
@ -372,12 +399,12 @@ def test_attr_pipeline_checks(en_vocab):
matcher(doc3) matcher(doc3)
# TEXT/ORTH only require tokens # TEXT/ORTH only require tokens
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"ORTH": "a"}]) matcher.add("TEST", [[{"ORTH": "a"}]])
matcher(doc1) matcher(doc1)
matcher(doc2) matcher(doc2)
matcher(doc3) matcher(doc3)
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"TEXT": "a"}]) matcher.add("TEST", [[{"TEXT": "a"}]])
matcher(doc1) matcher(doc1)
matcher(doc2) matcher(doc2)
matcher(doc3) matcher(doc3)
@ -407,7 +434,7 @@ def test_attr_pipeline_checks(en_vocab):
def test_matcher_schema_token_attributes(en_vocab, pattern, text): def test_matcher_schema_token_attributes(en_vocab, pattern, text):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
doc = Doc(en_vocab, words=text.split(" ")) doc = Doc(en_vocab, words=text.split(" "))
matcher.add("Rule", None, pattern) matcher.add("Rule", [pattern])
assert len(matcher) == 1 assert len(matcher) == 1
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 1 assert len(matches) == 1
@ -417,7 +444,7 @@ def test_matcher_valid_callback(en_vocab):
"""Test that on_match can only be None or callable.""" """Test that on_match can only be None or callable."""
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
with pytest.raises(ValueError): with pytest.raises(ValueError):
matcher.add("TEST", [], [{"TEXT": "test"}]) matcher.add("TEST", [[{"TEXT": "test"}]], on_match=[])
matcher(Doc(en_vocab, words=["test"])) matcher(Doc(en_vocab, words=["test"]))
@ -425,7 +452,7 @@ def test_matcher_callback(en_vocab):
mock = Mock() mock = Mock()
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
pattern = [{"ORTH": "test"}] pattern = [{"ORTH": "test"}]
matcher.add("Rule", mock, pattern) matcher.add("Rule", [pattern], on_match=mock)
doc = Doc(en_vocab, words=["This", "is", "a", "test", "."]) doc = Doc(en_vocab, words=["This", "is", "a", "test", "."])
matches = matcher(doc) matches = matcher(doc)
mock.assert_called_once_with(matcher, doc, 0, matches) mock.assert_called_once_with(matcher, doc, 0, matches)
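The tests in this file pin down the new call shape `add(key, patterns, on_match=None)`: `patterns` is a list of patterns, a bare pattern raises `ValueError`, and `on_match` must be `None` or callable. The sketch below shows one way the arguments could be normalized to meet those expectations. It uses only the standard library; `normalize_add_args` is a hypothetical helper and not spaCy's implementation. The old-style branch `(key, on_match, *patterns)` is an assumption based on the backwards-compatible calls exercised in the PhraseMatcher tests.

```python
def normalize_add_args(key, patterns, *old_patterns, on_match=None):
    """Hypothetical helper: return (patterns, on_match) in the new-style shape.

    Not spaCy's code; it only mirrors the behaviour the tests expect.
    """
    # Old call shape: add(key, on_match, *patterns), where on_match is
    # None or a callable and the patterns follow as positional arguments.
    if patterns is None or callable(patterns):
        on_match = patterns
        patterns = list(old_patterns)
    if on_match is not None and not callable(on_match):
        raise ValueError("on_match must be None or callable")
    if not isinstance(patterns, list):
        raise ValueError("patterns must be a list of patterns")
    for pattern in patterns:
        # A token pattern is a list of dicts. Finding a bare dict here means
        # the caller passed one pattern instead of a list of patterns.
        if not isinstance(pattern, list):
            raise ValueError("pattern for %r is not a list: %r" % (key, pattern))
        if len(pattern) == 0:
            raise ValueError("empty pattern for %r" % (key,))
    return patterns, on_match
```

With this shape, `normalize_add_args("TEST", [pattern])` and the legacy `normalize_add_args("TEST", None, pattern)` yield the same result, while `normalize_add_args("TEST", pattern)` fails early instead of silently treating each token dict as a pattern.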

@ -55,7 +55,7 @@ def test_greedy_matching(doc, text, pattern, re_pattern):
"""Test that the greedy matching behavior of the * op is consistent with """Test that the greedy matching behavior of the * op is consistent with
other re implementations.""" other re implementations."""
matcher = Matcher(doc.vocab) matcher = Matcher(doc.vocab)
matcher.add(re_pattern, None, pattern) matcher.add(re_pattern, [pattern])
matches = matcher(doc) matches = matcher(doc)
re_matches = [m.span() for m in re.finditer(re_pattern, text)] re_matches = [m.span() for m in re.finditer(re_pattern, text)]
for match, re_match in zip(matches, re_matches): for match, re_match in zip(matches, re_matches):
@ -77,7 +77,7 @@ def test_match_consuming(doc, text, pattern, re_pattern):
"""Test that matcher.__call__ consumes tokens on a match similar to """Test that matcher.__call__ consumes tokens on a match similar to
re.findall.""" re.findall."""
matcher = Matcher(doc.vocab) matcher = Matcher(doc.vocab)
matcher.add(re_pattern, None, pattern) matcher.add(re_pattern, [pattern])
matches = matcher(doc) matches = matcher(doc)
re_matches = [m.span() for m in re.finditer(re_pattern, text)] re_matches = [m.span() for m in re.finditer(re_pattern, text)]
assert len(matches) == len(re_matches) assert len(matches) == len(re_matches)
@ -111,7 +111,7 @@ def test_operator_combos(en_vocab):
pattern.append({"ORTH": part[0], "OP": "+"}) pattern.append({"ORTH": part[0], "OP": "+"})
else: else:
pattern.append({"ORTH": part}) pattern.append({"ORTH": part})
matcher.add("PATTERN", None, pattern) matcher.add("PATTERN", [pattern])
matches = matcher(doc) matches = matcher(doc)
if result: if result:
assert matches, (string, pattern_str) assert matches, (string, pattern_str)
@ -123,7 +123,7 @@ def test_matcher_end_zero_plus(en_vocab):
"""Test matcher works when patterns end with * operator. (issue 1450)""" """Test matcher works when patterns end with * operator. (issue 1450)"""
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}] pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}]
matcher.add("TSTEND", None, pattern) matcher.add("TSTEND", [pattern])
nlp = lambda string: Doc(matcher.vocab, words=string.split()) nlp = lambda string: Doc(matcher.vocab, words=string.split())
assert len(matcher(nlp("a"))) == 1 assert len(matcher(nlp("a"))) == 1
assert len(matcher(nlp("a b"))) == 2 assert len(matcher(nlp("a b"))) == 2
@ -140,7 +140,7 @@ def test_matcher_sets_return_correct_tokens(en_vocab):
[{"LOWER": {"IN": ["one"]}}], [{"LOWER": {"IN": ["one"]}}],
[{"LOWER": {"IN": ["two"]}}], [{"LOWER": {"IN": ["two"]}}],
] ]
matcher.add("TEST", None, *patterns) matcher.add("TEST", patterns)
doc = Doc(en_vocab, words="zero one two three".split()) doc = Doc(en_vocab, words="zero one two three".split())
matches = matcher(doc) matches = matcher(doc)
texts = [Span(doc, s, e, label=L).text for L, s, e in matches] texts = [Span(doc, s, e, label=L).text for L, s, e in matches]
@ -154,7 +154,7 @@ def test_matcher_remove():
pattern = [{"ORTH": "test"}, {"OP": "?"}] pattern = [{"ORTH": "test"}, {"OP": "?"}]
assert len(matcher) == 0 assert len(matcher) == 0
matcher.add("Rule", None, pattern) matcher.add("Rule", [pattern])
assert "Rule" in matcher assert "Rule" in matcher
# should give two matches # should give two matches

@ -50,7 +50,7 @@ def validator():
def test_matcher_pattern_validation(en_vocab, pattern): def test_matcher_pattern_validation(en_vocab, pattern):
matcher = Matcher(en_vocab, validate=True) matcher = Matcher(en_vocab, validate=True)
with pytest.raises(MatchPatternError): with pytest.raises(MatchPatternError):
matcher.add("TEST", None, pattern) matcher.add("TEST", [pattern])
@pytest.mark.parametrize("pattern,n_errors,_", TEST_PATTERNS) @pytest.mark.parametrize("pattern,n_errors,_", TEST_PATTERNS)
@ -71,6 +71,6 @@ def test_minimal_pattern_validation(en_vocab, pattern, n_errors, n_min_errors):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
if n_min_errors > 0: if n_min_errors > 0:
with pytest.raises(ValueError): with pytest.raises(ValueError):
matcher.add("TEST", None, pattern) matcher.add("TEST", [pattern])
elif n_errors == 0: elif n_errors == 0:
matcher.add("TEST", None, pattern) matcher.add("TEST", [pattern])

@ -13,53 +13,75 @@ def test_matcher_phrase_matcher(en_vocab):
# intermediate phrase # intermediate phrase
pattern = Doc(en_vocab, words=["Google", "Now"]) pattern = Doc(en_vocab, words=["Google", "Now"])
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
matcher.add("COMPANY", None, pattern) matcher.add("COMPANY", [pattern])
assert len(matcher(doc)) == 1 assert len(matcher(doc)) == 1
# initial token # initial token
pattern = Doc(en_vocab, words=["I"]) pattern = Doc(en_vocab, words=["I"])
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
matcher.add("I", None, pattern) matcher.add("I", [pattern])
assert len(matcher(doc)) == 1 assert len(matcher(doc)) == 1
# initial phrase # initial phrase
pattern = Doc(en_vocab, words=["I", "like"]) pattern = Doc(en_vocab, words=["I", "like"])
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
matcher.add("ILIKE", None, pattern) matcher.add("ILIKE", [pattern])
assert len(matcher(doc)) == 1 assert len(matcher(doc)) == 1
# final token # final token
pattern = Doc(en_vocab, words=["best"]) pattern = Doc(en_vocab, words=["best"])
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
matcher.add("BEST", None, pattern) matcher.add("BEST", [pattern])
assert len(matcher(doc)) == 1 assert len(matcher(doc)) == 1
# final phrase # final phrase
pattern = Doc(en_vocab, words=["Now", "best"]) pattern = Doc(en_vocab, words=["Now", "best"])
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
matcher.add("NOWBEST", None, pattern) matcher.add("NOWBEST", [pattern])
assert len(matcher(doc)) == 1 assert len(matcher(doc)) == 1
def test_phrase_matcher_length(en_vocab): def test_phrase_matcher_length(en_vocab):
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
assert len(matcher) == 0 assert len(matcher) == 0
matcher.add("TEST", None, Doc(en_vocab, words=["test"])) matcher.add("TEST", [Doc(en_vocab, words=["test"])])
assert len(matcher) == 1 assert len(matcher) == 1
matcher.add("TEST2", None, Doc(en_vocab, words=["test2"])) matcher.add("TEST2", [Doc(en_vocab, words=["test2"])])
assert len(matcher) == 2 assert len(matcher) == 2
def test_phrase_matcher_contains(en_vocab): def test_phrase_matcher_contains(en_vocab):
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
matcher.add("TEST", None, Doc(en_vocab, words=["test"])) matcher.add("TEST", [Doc(en_vocab, words=["test"])])
assert "TEST" in matcher assert "TEST" in matcher
assert "TEST2" not in matcher assert "TEST2" not in matcher
def test_phrase_matcher_add_new_api(en_vocab):
doc = Doc(en_vocab, words=["a", "b"])
patterns = [Doc(en_vocab, words=["a"]), Doc(en_vocab, words=["a", "b"])]
matcher = PhraseMatcher(en_vocab)
matcher.add("OLD_API", None, *patterns)
assert len(matcher(doc)) == 2
matcher = PhraseMatcher(en_vocab)
on_match = Mock()
matcher.add("OLD_API_CALLBACK", on_match, *patterns)
assert len(matcher(doc)) == 2
assert on_match.call_count == 2
# New API: add(key: str, patterns: List[Doc], on_match: Callable)
matcher = PhraseMatcher(en_vocab)
matcher.add("NEW_API", patterns)
assert len(matcher(doc)) == 2
matcher = PhraseMatcher(en_vocab)
on_match = Mock()
matcher.add("NEW_API_CALLBACK", patterns, on_match=on_match)
assert len(matcher(doc)) == 2
assert on_match.call_count == 2
def test_phrase_matcher_repeated_add(en_vocab): def test_phrase_matcher_repeated_add(en_vocab):
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
# match ID only gets added once # match ID only gets added once
matcher.add("TEST", None, Doc(en_vocab, words=["like"])) matcher.add("TEST", [Doc(en_vocab, words=["like"])])
matcher.add("TEST", None, Doc(en_vocab, words=["like"])) matcher.add("TEST", [Doc(en_vocab, words=["like"])])
matcher.add("TEST", None, Doc(en_vocab, words=["like"])) matcher.add("TEST", [Doc(en_vocab, words=["like"])])
matcher.add("TEST", None, Doc(en_vocab, words=["like"])) matcher.add("TEST", [Doc(en_vocab, words=["like"])])
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
assert "TEST" in matcher assert "TEST" in matcher
assert "TEST2" not in matcher assert "TEST2" not in matcher
@ -68,8 +90,8 @@ def test_phrase_matcher_repeated_add(en_vocab):
def test_phrase_matcher_remove(en_vocab): def test_phrase_matcher_remove(en_vocab):
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
matcher.add("TEST1", None, Doc(en_vocab, words=["like"])) matcher.add("TEST1", [Doc(en_vocab, words=["like"])])
matcher.add("TEST2", None, Doc(en_vocab, words=["best"])) matcher.add("TEST2", [Doc(en_vocab, words=["best"])])
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
assert "TEST1" in matcher assert "TEST1" in matcher
assert "TEST2" in matcher assert "TEST2" in matcher
@ -95,9 +117,9 @@ def test_phrase_matcher_remove(en_vocab):
def test_phrase_matcher_overlapping_with_remove(en_vocab): def test_phrase_matcher_overlapping_with_remove(en_vocab):
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
matcher.add("TEST", None, Doc(en_vocab, words=["like"])) matcher.add("TEST", [Doc(en_vocab, words=["like"])])
# TEST2 is added alongside TEST # TEST2 is added alongside TEST
matcher.add("TEST2", None, Doc(en_vocab, words=["like"])) matcher.add("TEST2", [Doc(en_vocab, words=["like"])])
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
assert "TEST" in matcher assert "TEST" in matcher
assert len(matcher) == 2 assert len(matcher) == 2
@ -122,7 +144,7 @@ def test_phrase_matcher_string_attrs(en_vocab):
pos2 = ["INTJ", "PUNCT", "PRON", "VERB", "NOUN", "ADV", "ADV"] pos2 = ["INTJ", "PUNCT", "PRON", "VERB", "NOUN", "ADV", "ADV"]
pattern = get_doc(en_vocab, words=words1, pos=pos1) pattern = get_doc(en_vocab, words=words1, pos=pos1)
matcher = PhraseMatcher(en_vocab, attr="POS") matcher = PhraseMatcher(en_vocab, attr="POS")
matcher.add("TEST", None, pattern) matcher.add("TEST", [pattern])
doc = get_doc(en_vocab, words=words2, pos=pos2) doc = get_doc(en_vocab, words=words2, pos=pos2)
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 1 assert len(matches) == 1
@ -140,7 +162,7 @@ def test_phrase_matcher_string_attrs_negative(en_vocab):
pos2 = ["X", "X", "X"] pos2 = ["X", "X", "X"]
pattern = get_doc(en_vocab, words=words1, pos=pos1) pattern = get_doc(en_vocab, words=words1, pos=pos1)
matcher = PhraseMatcher(en_vocab, attr="POS") matcher = PhraseMatcher(en_vocab, attr="POS")
matcher.add("TEST", None, pattern) matcher.add("TEST", [pattern])
doc = get_doc(en_vocab, words=words2, pos=pos2) doc = get_doc(en_vocab, words=words2, pos=pos2)
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 0 assert len(matches) == 0
@ -151,7 +173,7 @@ def test_phrase_matcher_bool_attrs(en_vocab):
words2 = ["No", "problem", ",", "he", "said", "."] words2 = ["No", "problem", ",", "he", "said", "."]
pattern = Doc(en_vocab, words=words1) pattern = Doc(en_vocab, words=words1)
matcher = PhraseMatcher(en_vocab, attr="IS_PUNCT") matcher = PhraseMatcher(en_vocab, attr="IS_PUNCT")
matcher.add("TEST", None, pattern) matcher.add("TEST", [pattern])
doc = Doc(en_vocab, words=words2) doc = Doc(en_vocab, words=words2)
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 2 assert len(matches) == 2
@ -173,15 +195,15 @@ def test_phrase_matcher_validation(en_vocab):
doc3 = Doc(en_vocab, words=["Test"]) doc3 = Doc(en_vocab, words=["Test"])
matcher = PhraseMatcher(en_vocab, validate=True) matcher = PhraseMatcher(en_vocab, validate=True)
with pytest.warns(UserWarning): with pytest.warns(UserWarning):
matcher.add("TEST1", None, doc1) matcher.add("TEST1", [doc1])
with pytest.warns(UserWarning): with pytest.warns(UserWarning):
matcher.add("TEST2", None, doc2) matcher.add("TEST2", [doc2])
with pytest.warns(None) as record: with pytest.warns(None) as record:
matcher.add("TEST3", None, doc3) matcher.add("TEST3", [doc3])
assert not record.list assert not record.list
matcher = PhraseMatcher(en_vocab, attr="POS", validate=True) matcher = PhraseMatcher(en_vocab, attr="POS", validate=True)
with pytest.warns(None) as record: with pytest.warns(None) as record:
matcher.add("TEST4", None, doc2) matcher.add("TEST4", [doc2])
assert not record.list assert not record.list
@ -198,24 +220,24 @@ def test_attr_pipeline_checks(en_vocab):
doc3 = Doc(en_vocab, words=["Test"]) doc3 = Doc(en_vocab, words=["Test"])
# DEP requires is_parsed # DEP requires is_parsed
matcher = PhraseMatcher(en_vocab, attr="DEP") matcher = PhraseMatcher(en_vocab, attr="DEP")
matcher.add("TEST1", None, doc1) matcher.add("TEST1", [doc1])
with pytest.raises(ValueError): with pytest.raises(ValueError):
matcher.add("TEST2", None, doc2) matcher.add("TEST2", [doc2])
with pytest.raises(ValueError): with pytest.raises(ValueError):
matcher.add("TEST3", None, doc3) matcher.add("TEST3", [doc3])
# TAG, POS, LEMMA require is_tagged # TAG, POS, LEMMA require is_tagged
for attr in ("TAG", "POS", "LEMMA"): for attr in ("TAG", "POS", "LEMMA"):
matcher = PhraseMatcher(en_vocab, attr=attr) matcher = PhraseMatcher(en_vocab, attr=attr)
matcher.add("TEST2", None, doc2) matcher.add("TEST2", [doc2])
with pytest.raises(ValueError): with pytest.raises(ValueError):
matcher.add("TEST1", None, doc1) matcher.add("TEST1", [doc1])
with pytest.raises(ValueError): with pytest.raises(ValueError):
matcher.add("TEST3", None, doc3) matcher.add("TEST3", [doc3])
# TEXT/ORTH only require tokens # TEXT/ORTH only require tokens
matcher = PhraseMatcher(en_vocab, attr="ORTH") matcher = PhraseMatcher(en_vocab, attr="ORTH")
matcher.add("TEST3", None, doc3) matcher.add("TEST3", [doc3])
matcher = PhraseMatcher(en_vocab, attr="TEXT") matcher = PhraseMatcher(en_vocab, attr="TEXT")
matcher.add("TEST3", None, doc3) matcher.add("TEST3", [doc3])
def test_phrase_matcher_callback(en_vocab): def test_phrase_matcher_callback(en_vocab):
@ -223,7 +245,7 @@ def test_phrase_matcher_callback(en_vocab):
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
pattern = Doc(en_vocab, words=["Google", "Now"]) pattern = Doc(en_vocab, words=["Google", "Now"])
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
matcher.add("COMPANY", mock, pattern) matcher.add("COMPANY", [pattern], on_match=mock)
matches = matcher(doc) matches = matcher(doc)
mock.assert_called_once_with(matcher, doc, 0, matches) mock.assert_called_once_with(matcher, doc, 0, matches)
@ -234,5 +256,13 @@ def test_phrase_matcher_remove_overlapping_patterns(en_vocab):
pattern2 = Doc(en_vocab, words=["this", "is"]) pattern2 = Doc(en_vocab, words=["this", "is"])
pattern3 = Doc(en_vocab, words=["this", "is", "a"]) pattern3 = Doc(en_vocab, words=["this", "is", "a"])
pattern4 = Doc(en_vocab, words=["this", "is", "a", "word"]) pattern4 = Doc(en_vocab, words=["this", "is", "a", "word"])
matcher.add("THIS", None, pattern1, pattern2, pattern3, pattern4) matcher.add("THIS", [pattern1, pattern2, pattern3, pattern4])
matcher.remove("THIS") matcher.remove("THIS")
def test_phrase_matcher_basic_check(en_vocab):
matcher = PhraseMatcher(en_vocab)
# Potential mistake: pass in pattern instead of list of patterns
pattern = Doc(en_vocab, words=["hello", "world"])
with pytest.raises(ValueError):
matcher.add("TEST", pattern)

@ -0,0 +1,168 @@
# coding: utf8
from __future__ import unicode_literals
import spacy.language
from spacy.language import Language, component
from spacy.analysis import print_summary, validate_attrs
from spacy.analysis import get_assigns_for_attr, get_requires_for_attr
from spacy.compat import is_python2
from mock import Mock, ANY
import pytest
def test_component_decorator_function():
@component(name="test")
def test_component(doc):
"""docstring"""
return doc
assert test_component.name == "test"
if not is_python2:
assert test_component.__doc__ == "docstring"
assert test_component("foo") == "foo"
def test_component_decorator_class():
@component(name="test")
class TestComponent(object):
"""docstring1"""
foo = "bar"
def __call__(self, doc):
"""docstring2"""
return doc
def custom(self, x):
"""docstring3"""
return x
assert TestComponent.name == "test"
assert TestComponent.foo == "bar"
assert hasattr(TestComponent, "custom")
test_component = TestComponent()
assert test_component.foo == "bar"
assert test_component("foo") == "foo"
assert hasattr(test_component, "custom")
assert test_component.custom("bar") == "bar"
if not is_python2:
assert TestComponent.__doc__ == "docstring1"
assert TestComponent.__call__.__doc__ == "docstring2"
assert TestComponent.custom.__doc__ == "docstring3"
assert test_component.__doc__ == "docstring1"
assert test_component.__call__.__doc__ == "docstring2"
assert test_component.custom.__doc__ == "docstring3"
def test_component_decorator_assigns():
spacy.language.ENABLE_PIPELINE_ANALYSIS = True
@component("c1", assigns=["token.tag", "doc.tensor"])
def test_component1(doc):
return doc
@component(
"c2", requires=["token.tag", "token.pos"], assigns=["token.lemma", "doc.tensor"]
)
def test_component2(doc):
return doc
@component("c3", requires=["token.lemma"], assigns=["token._.custom_lemma"])
def test_component3(doc):
return doc
assert "c1" in Language.factories
assert "c2" in Language.factories
assert "c3" in Language.factories
nlp = Language()
nlp.add_pipe(test_component1)
with pytest.warns(UserWarning):
nlp.add_pipe(test_component2)
nlp.add_pipe(test_component3)
assigns_tensor = get_assigns_for_attr(nlp.pipeline, "doc.tensor")
assert [name for name, _ in assigns_tensor] == ["c1", "c2"]
test_component4 = nlp.create_pipe("c1")
assert test_component4.name == "c1"
assert test_component4.factory == "c1"
nlp.add_pipe(test_component4, name="c4")
assert nlp.pipe_names == ["c1", "c2", "c3", "c4"]
assert "c4" not in Language.factories
assert nlp.pipe_factories["c1"] == "c1"
assert nlp.pipe_factories["c4"] == "c1"
assigns_tensor = get_assigns_for_attr(nlp.pipeline, "doc.tensor")
assert [name for name, _ in assigns_tensor] == ["c1", "c2", "c4"]
requires_pos = get_requires_for_attr(nlp.pipeline, "token.pos")
assert [name for name, _ in requires_pos] == ["c2"]
assert print_summary(nlp, no_print=True)
assert nlp("hello world")
def test_component_factories_from_nlp():
"""Test that class components can implement a from_nlp classmethod that
gives them access to the nlp object and config via the factory."""
class TestComponent5(object):
def __call__(self, doc):
return doc
mock = Mock()
mock.return_value = TestComponent5()
TestComponent5.from_nlp = classmethod(mock)
TestComponent5 = component("c5")(TestComponent5)
assert "c5" in Language.factories
nlp = Language()
pipe = nlp.create_pipe("c5", config={"foo": "bar"})
nlp.add_pipe(pipe)
assert nlp("hello world")
# The first argument is the class itself, so we accept any value for it
mock.assert_called_once_with(ANY, nlp, foo="bar")
def test_analysis_validate_attrs_valid():
attrs = ["doc.sents", "doc.ents", "token.tag", "token._.xyz", "span._.xyz"]
assert validate_attrs(attrs)
for attr in attrs:
assert validate_attrs([attr])
with pytest.raises(ValueError):
validate_attrs(["doc.sents", "doc.xyz"])
@pytest.mark.parametrize(
"attr",
[
"doc",
"doc_ents",
"doc.xyz",
"token.xyz",
"token.tag_",
"token.tag.xyz",
"token._.xyz.abc",
"span.label",
],
)
def test_analysis_validate_attrs_invalid(attr):
with pytest.raises(ValueError):
validate_attrs([attr])
def test_analysis_validate_attrs_remove_pipe():
"""Test that attributes are validated correctly on remove."""
spacy.language.ENABLE_PIPELINE_ANALYSIS = True
@component("c1", assigns=["token.tag"])
def c1(doc):
return doc
@component("c2", requires=["token.pos"])
def c2(doc):
return doc
nlp = Language()
nlp.add_pipe(c1)
with pytest.warns(UserWarning):
nlp.add_pipe(c2)
with pytest.warns(None) as record:
nlp.remove_pipe("c2")
assert not record.list
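The parametrized cases above suggest the shape of the attribute check. A value is `doc.<name>` or `token.<name>` with a known name, or `<obj>._.<name>` for a custom extension, and `span` only accepts the extension form. The sketch below is consistent with those cases and uses only the standard library. `check_attrs` and the small `KNOWN` allow-list are hypothetical stand-ins, not `spacy.analysis.validate_attrs`, which presumably checks against the real `Doc` and `Token` attributes.

```python
# Hypothetical allow-list, just large enough to cover the test cases.
KNOWN = {
    "doc": {"sents", "ents", "tensor"},
    "token": {"tag", "pos", "lemma"},
    "span": set(),  # assumption: spans only allow custom extension attributes
}


def check_attrs(values):
    """Return the values unchanged if every entry is well-formed, else raise ValueError."""
    for value in values:
        parts = value.split(".")
        obj = parts[0]
        if obj not in KNOWN or len(parts) < 2:
            raise ValueError("invalid attribute: %r" % value)
        if parts[1] == "_":
            # Custom extension attribute: exactly obj._.name, nothing deeper.
            if len(parts) != 3 or not parts[2]:
                raise ValueError("invalid extension attribute: %r" % value)
            continue
        # Built-in attribute: exactly obj.name, and the name must be known.
        if len(parts) != 2 or parts[1] not in KNOWN[obj]:
            raise ValueError("invalid attribute: %r" % value)
    return values
```

Returning the input list keeps `assert check_attrs(attrs)` truthy for non-empty input, which matches how the valid-case test uses the return value.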

@ -154,7 +154,8 @@ def test_append_alias(nlp):
assert len(mykb.get_candidates("douglas")) == 3 assert len(mykb.get_candidates("douglas")) == 3
# appending the same alias-entity pair again should not work (will throw a warning) # appending the same alias-entity pair again should not work (will throw a warning)
mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.3) with pytest.warns(UserWarning):
mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.3)
# test the size of the relevant candidates remained unchanged # test the size of the relevant candidates remained unchanged
assert len(mykb.get_candidates("douglas")) == 3 assert len(mykb.get_candidates("douglas")) == 3

@ -0,0 +1,34 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.pipeline.functions import merge_subtokens
from ..util import get_doc
@pytest.fixture
def doc(en_tokenizer):
# fmt: off
text = "This is a sentence. This is another sentence. And a third."
heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 1, 1, 1, 0]
deps = ["nsubj", "ROOT", "subtok", "attr", "punct", "nsubj", "ROOT",
"subtok", "attr", "punct", "subtok", "subtok", "subtok", "ROOT"]
# fmt: on
tokens = en_tokenizer(text)
return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
def test_merge_subtokens(doc):
doc = merge_subtokens(doc)
# get_doc() doesn't set spaces, so the result is "And a third ."
assert [t.text for t in doc] == [
"This",
"is",
"a sentence",
".",
"This",
"is",
"another sentence",
".",
"And a third .",
]

@ -105,6 +105,16 @@ def test_disable_pipes_context(nlp, name):
assert nlp.has_pipe(name) assert nlp.has_pipe(name)
def test_disable_pipes_list_arg(nlp):
for name in ["c1", "c2", "c3"]:
nlp.add_pipe(new_pipe, name=name)
assert nlp.has_pipe(name)
with nlp.disable_pipes(["c1", "c2"]):
assert not nlp.has_pipe("c1")
assert not nlp.has_pipe("c2")
assert nlp.has_pipe("c3")
@pytest.mark.parametrize("n_pipes", [100]) @pytest.mark.parametrize("n_pipes", [100])
def test_add_lots_of_pipes(nlp, n_pipes): def test_add_lots_of_pipes(nlp, n_pipes):
for i in range(n_pipes): for i in range(n_pipes):

@ -30,7 +30,7 @@ def test_issue118(en_tokenizer, patterns):
doc = en_tokenizer(text) doc = en_tokenizer(text)
ORG = doc.vocab.strings["ORG"] ORG = doc.vocab.strings["ORG"]
matcher = Matcher(doc.vocab) matcher = Matcher(doc.vocab)
matcher.add("BostonCeltics", None, *patterns) matcher.add("BostonCeltics", patterns)
assert len(list(doc.ents)) == 0 assert len(list(doc.ents)) == 0
matches = [(ORG, start, end) for _, start, end in matcher(doc)] matches = [(ORG, start, end) for _, start, end in matcher(doc)]
assert matches == [(ORG, 9, 11), (ORG, 10, 11)] assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
@ -57,7 +57,7 @@ def test_issue118_prefix_reorder(en_tokenizer, patterns):
doc = en_tokenizer(text) doc = en_tokenizer(text)
ORG = doc.vocab.strings["ORG"] ORG = doc.vocab.strings["ORG"]
matcher = Matcher(doc.vocab) matcher = Matcher(doc.vocab)
matcher.add("BostonCeltics", None, *patterns) matcher.add("BostonCeltics", patterns)
assert len(list(doc.ents)) == 0 assert len(list(doc.ents)) == 0
matches = [(ORG, start, end) for _, start, end in matcher(doc)] matches = [(ORG, start, end) for _, start, end in matcher(doc)]
doc.ents += tuple(matches)[1:] doc.ents += tuple(matches)[1:]
@ -78,7 +78,7 @@ def test_issue242(en_tokenizer):
] ]
doc = en_tokenizer(text) doc = en_tokenizer(text)
matcher = Matcher(doc.vocab) matcher = Matcher(doc.vocab)
matcher.add("FOOD", None, *patterns) matcher.add("FOOD", patterns)
matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)] matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)]
match1, match2 = matches match1, match2 = matches
assert match1[1] == 3 assert match1[1] == 3
@ -127,17 +127,13 @@ def test_issue587(en_tokenizer):
"""Test that Matcher doesn't segfault on particular input""" """Test that Matcher doesn't segfault on particular input"""
doc = en_tokenizer("a b; c") doc = en_tokenizer("a b; c")
matcher = Matcher(doc.vocab) matcher = Matcher(doc.vocab)
matcher.add("TEST1", None, [{ORTH: "a"}, {ORTH: "b"}]) matcher.add("TEST1", [[{ORTH: "a"}, {ORTH: "b"}]])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 1 assert len(matches) == 1
matcher.add("TEST2", None, [{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "c"}]) matcher.add("TEST2", [[{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "c"}]])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 2 assert len(matches) == 2
matcher.add("TEST3", None, [{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "d"}]) matcher.add("TEST3", [[{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "d"}]])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 2 assert len(matches) == 2
@ -145,7 +141,7 @@ def test_issue587(en_tokenizer):
def test_issue588(en_vocab): def test_issue588(en_vocab):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
with pytest.raises(ValueError): with pytest.raises(ValueError):
matcher.add("TEST", None, []) matcher.add("TEST", [[]])
@pytest.mark.xfail @pytest.mark.xfail
@ -161,11 +157,9 @@ def test_issue590(en_vocab):
doc = Doc(en_vocab, words=["n", "=", "1", ";", "a", ":", "5", "%"]) doc = Doc(en_vocab, words=["n", "=", "1", ";", "a", ":", "5", "%"])
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("ab", None, [{"IS_ALPHA": True}, {"ORTH": ":"}, {"LIKE_NUM": True}, {"ORTH": "%"}]) matcher.add("ab", [[{"IS_ALPHA": True}, {"ORTH": ":"}, {"LIKE_NUM": True}, {"ORTH": "%"}]])
matcher.add("ab", None, [{"IS_ALPHA": True}, {"ORTH": "="}, {"LIKE_NUM": True}]) matcher.add("ab", [[{"IS_ALPHA": True}, {"ORTH": "="}, {"LIKE_NUM": True}]])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 2 assert len(matches) == 2
@ -221,7 +215,7 @@ def test_issue615(en_tokenizer):
label = "Sport_Equipment" label = "Sport_Equipment"
doc = en_tokenizer(text) doc = en_tokenizer(text)
matcher = Matcher(doc.vocab) matcher = Matcher(doc.vocab)
matcher.add(label, merge_phrases, pattern) matcher.add(label, [pattern], on_match=merge_phrases)
matcher(doc) matcher(doc)
entities = list(doc.ents) entities = list(doc.ents)
assert entities != [] assert entities != []
@ -339,7 +333,7 @@ def test_issue850():
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()}) vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
matcher = Matcher(vocab) matcher = Matcher(vocab)
pattern = [{"LOWER": "bob"}, {"OP": "*"}, {"LOWER": "frank"}] pattern = [{"LOWER": "bob"}, {"OP": "*"}, {"LOWER": "frank"}]
matcher.add("FarAway", None, pattern) matcher.add("FarAway", [pattern])
doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"]) doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"])
match = matcher(doc) match = matcher(doc)
assert len(match) == 1 assert len(match) == 1
@ -353,7 +347,7 @@ def test_issue850_basic():
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()}) vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
matcher = Matcher(vocab) matcher = Matcher(vocab)
pattern = [{"LOWER": "bob"}, {"OP": "*", "LOWER": "and"}, {"LOWER": "frank"}] pattern = [{"LOWER": "bob"}, {"OP": "*", "LOWER": "and"}, {"LOWER": "frank"}]
matcher.add("FarAway", None, pattern) matcher.add("FarAway", [pattern])
doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"]) doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"])
match = matcher(doc) match = matcher(doc)
assert len(match) == 1 assert len(match) == 1

@ -111,7 +111,7 @@ def test_issue1434():
hello_world = Doc(vocab, words=["Hello", "World"]) hello_world = Doc(vocab, words=["Hello", "World"])
hello = Doc(vocab, words=["Hello"]) hello = Doc(vocab, words=["Hello"])
matcher = Matcher(vocab) matcher = Matcher(vocab)
matcher.add("MyMatcher", None, pattern) matcher.add("MyMatcher", [pattern])
matches = matcher(hello_world) matches = matcher(hello_world)
assert matches assert matches
matches = matcher(hello) matches = matcher(hello)
@ -133,7 +133,7 @@ def test_issue1450(string, start, end):
"""Test matcher works when patterns end with * operator.""" """Test matcher works when patterns end with * operator."""
pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}] pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}]
matcher = Matcher(Vocab()) matcher = Matcher(Vocab())
matcher.add("TSTEND", None, pattern) matcher.add("TSTEND", [pattern])
doc = Doc(Vocab(), words=string.split()) doc = Doc(Vocab(), words=string.split())
matches = matcher(doc) matches = matcher(doc)
if start is None or end is None: if start is None or end is None:

@ -224,7 +224,7 @@ def test_issue1868():
def test_issue1883(): def test_issue1883():
matcher = Matcher(Vocab()) matcher = Matcher(Vocab())
matcher.add("pat1", None, [{"orth": "hello"}]) matcher.add("pat1", [[{"orth": "hello"}]])
doc = Doc(matcher.vocab, words=["hello"]) doc = Doc(matcher.vocab, words=["hello"])
assert len(matcher(doc)) == 1 assert len(matcher(doc)) == 1
new_matcher = copy.deepcopy(matcher) new_matcher = copy.deepcopy(matcher)
@ -249,7 +249,7 @@ def test_issue1915():
def test_issue1945(): def test_issue1945():
"""Test regression in Matcher introduced in v2.0.6.""" """Test regression in Matcher introduced in v2.0.6."""
matcher = Matcher(Vocab()) matcher = Matcher(Vocab())
matcher.add("MWE", None, [{"orth": "a"}, {"orth": "a"}]) matcher.add("MWE", [[{"orth": "a"}, {"orth": "a"}]])
doc = Doc(matcher.vocab, words=["a", "a", "a"]) doc = Doc(matcher.vocab, words=["a", "a", "a"])
matches = matcher(doc) # we should see two overlapping matches here matches = matcher(doc) # we should see two overlapping matches here
assert len(matches) == 2 assert len(matches) == 2
@ -285,7 +285,7 @@ def test_issue1971(en_vocab):
{"ORTH": "!", "OP": "?"}, {"ORTH": "!", "OP": "?"},
] ]
Token.set_extension("optional", default=False) Token.set_extension("optional", default=False)
matcher.add("TEST", None, pattern) matcher.add("TEST", [pattern])
doc = Doc(en_vocab, words=["Hello", "John", "Doe", "!"]) doc = Doc(en_vocab, words=["Hello", "John", "Doe", "!"])
# We could also assert length 1 here, but this is more conclusive, because # We could also assert length 1 here, but this is more conclusive, because
# the real problem here is that it returns a duplicate match for a match_id # the real problem here is that it returns a duplicate match for a match_id
@ -299,7 +299,7 @@ def test_issue_1971_2(en_vocab):
pattern1 = [{"ORTH": "EUR", "LOWER": {"IN": ["eur"]}}, {"LIKE_NUM": True}] pattern1 = [{"ORTH": "EUR", "LOWER": {"IN": ["eur"]}}, {"LIKE_NUM": True}]
pattern2 = [{"LIKE_NUM": True}, {"ORTH": "EUR"}] # {"IN": ["EUR"]}}] pattern2 = [{"LIKE_NUM": True}, {"ORTH": "EUR"}] # {"IN": ["EUR"]}}]
doc = Doc(en_vocab, words=["EUR", "10", "is", "10", "EUR"]) doc = Doc(en_vocab, words=["EUR", "10", "is", "10", "EUR"])
matcher.add("TEST1", None, pattern1, pattern2) matcher.add("TEST1", [pattern1, pattern2])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 2 assert len(matches) == 2
@ -310,8 +310,8 @@ def test_issue_1971_3(en_vocab):
Token.set_extension("b", default=2, force=True) Token.set_extension("b", default=2, force=True)
doc = Doc(en_vocab, words=["hello", "world"]) doc = Doc(en_vocab, words=["hello", "world"])
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("A", None, [{"_": {"a": 1}}]) matcher.add("A", [[{"_": {"a": 1}}]])
matcher.add("B", None, [{"_": {"b": 2}}]) matcher.add("B", [[{"_": {"b": 2}}]])
matches = sorted((en_vocab.strings[m_id], s, e) for m_id, s, e in matcher(doc)) matches = sorted((en_vocab.strings[m_id], s, e) for m_id, s, e in matcher(doc))
assert len(matches) == 4 assert len(matches) == 4
assert matches == sorted([("A", 0, 1), ("A", 1, 2), ("B", 0, 1), ("B", 1, 2)]) assert matches == sorted([("A", 0, 1), ("A", 1, 2), ("B", 0, 1), ("B", 1, 2)])
@ -326,7 +326,7 @@ def test_issue_1971_4(en_vocab):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
doc = Doc(en_vocab, words=["this", "is", "text"]) doc = Doc(en_vocab, words=["this", "is", "text"])
pattern = [{"_": {"ext_a": "str_a", "ext_b": "str_b"}}] * 3 pattern = [{"_": {"ext_a": "str_a", "ext_b": "str_b"}}] * 3
matcher.add("TEST", None, pattern) matcher.add("TEST", [pattern])
matches = matcher(doc) matches = matcher(doc)
# Uncommenting this caused a segmentation fault # Uncommenting this caused a segmentation fault
assert len(matches) == 1 assert len(matches) == 1


@ -128,7 +128,7 @@ def test_issue2464(en_vocab):
"""Test problem with successive ?. This is the same bug, so putting it here.""" """Test problem with successive ?. This is the same bug, so putting it here."""
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
doc = Doc(en_vocab, words=["a", "b"]) doc = Doc(en_vocab, words=["a", "b"])
matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}]) matcher.add("4", [[{"OP": "?"}, {"OP": "?"}]])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 3 assert len(matches) == 3


@ -37,7 +37,7 @@ def test_issue2569(en_tokenizer):
doc = en_tokenizer("It is May 15, 1993.") doc = en_tokenizer("It is May 15, 1993.")
doc.ents = [Span(doc, 2, 6, label=doc.vocab.strings["DATE"])] doc.ents = [Span(doc, 2, 6, label=doc.vocab.strings["DATE"])]
matcher = Matcher(doc.vocab) matcher = Matcher(doc.vocab)
matcher.add("RULE", None, [{"ENT_TYPE": "DATE", "OP": "+"}]) matcher.add("RULE", [[{"ENT_TYPE": "DATE", "OP": "+"}]])
matched = [doc[start:end] for _, start, end in matcher(doc)] matched = [doc[start:end] for _, start, end in matcher(doc)]
matched = sorted(matched, key=len, reverse=True) matched = sorted(matched, key=len, reverse=True)
assert len(matched) == 10 assert len(matched) == 10
@ -89,7 +89,7 @@ def test_issue2671():
{"IS_PUNCT": True, "OP": "?"}, {"IS_PUNCT": True, "OP": "?"},
{"LOWER": "adrenaline"}, {"LOWER": "adrenaline"},
] ]
matcher.add(pattern_id, None, pattern) matcher.add(pattern_id, [pattern])
doc1 = nlp("This is a high-adrenaline situation.") doc1 = nlp("This is a high-adrenaline situation.")
doc2 = nlp("This is a high adrenaline situation.") doc2 = nlp("This is a high adrenaline situation.")
matches1 = matcher(doc1) matches1 = matcher(doc1)


@ -52,7 +52,7 @@ def test_issue3009(en_vocab):
doc = get_doc(en_vocab, words=words, tags=tags) doc = get_doc(en_vocab, words=words, tags=tags)
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
for i, pattern in enumerate(patterns): for i, pattern in enumerate(patterns):
matcher.add(str(i), None, pattern) matcher.add(str(i), [pattern])
matches = matcher(doc) matches = matcher(doc)
assert matches assert matches
@ -116,8 +116,8 @@ def test_issue3248_1():
total number of patterns.""" total number of patterns."""
nlp = English() nlp = English()
matcher = PhraseMatcher(nlp.vocab) matcher = PhraseMatcher(nlp.vocab)
matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c")) matcher.add("TEST1", [nlp("a"), nlp("b"), nlp("c")])
matcher.add("TEST2", None, nlp("d")) matcher.add("TEST2", [nlp("d")])
assert len(matcher) == 2 assert len(matcher) == 2
@ -125,8 +125,8 @@ def test_issue3248_2():
"""Test that the PhraseMatcher can be pickled correctly.""" """Test that the PhraseMatcher can be pickled correctly."""
nlp = English() nlp = English()
matcher = PhraseMatcher(nlp.vocab) matcher = PhraseMatcher(nlp.vocab)
matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c")) matcher.add("TEST1", [nlp("a"), nlp("b"), nlp("c")])
matcher.add("TEST2", None, nlp("d")) matcher.add("TEST2", [nlp("d")])
data = pickle.dumps(matcher) data = pickle.dumps(matcher)
new_matcher = pickle.loads(data) new_matcher = pickle.loads(data)
assert len(new_matcher) == len(matcher) assert len(new_matcher) == len(matcher)
@ -170,7 +170,7 @@ def test_issue3328(en_vocab):
[{"LOWER": {"IN": ["hello", "how"]}}], [{"LOWER": {"IN": ["hello", "how"]}}],
[{"LOWER": {"IN": ["you", "doing"]}}], [{"LOWER": {"IN": ["you", "doing"]}}],
] ]
matcher.add("TEST", None, *patterns) matcher.add("TEST", patterns)
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 4 assert len(matches) == 4
matched_texts = [doc[start:end].text for _, start, end in matches] matched_texts = [doc[start:end].text for _, start, end in matches]
@ -183,8 +183,8 @@ def test_issue3331(en_vocab):
matches, one per rule. matches, one per rule.
""" """
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"])) matcher.add("A", [Doc(en_vocab, words=["Barack", "Obama"])])
matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"])) matcher.add("B", [Doc(en_vocab, words=["Barack", "Obama"])])
doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"]) doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 2 assert len(matches) == 2
@ -297,8 +297,10 @@ def test_issue3410():
def test_issue3412(): def test_issue3412():
data = numpy.asarray([[0, 0, 0], [1, 2, 3], [9, 8, 7]], dtype="f") data = numpy.asarray([[0, 0, 0], [1, 2, 3], [9, 8, 7]], dtype="f")
vectors = Vectors(data=data) vectors = Vectors(data=data)
keys, best_rows, scores = vectors.most_similar(numpy.asarray([[9, 8, 7], [0, 0, 0]], dtype="f")) keys, best_rows, scores = vectors.most_similar(
assert(best_rows[0] == 2) numpy.asarray([[9, 8, 7], [0, 0, 0]], dtype="f")
)
assert best_rows[0] == 2
def test_issue3447(): def test_issue3447():


@ -10,6 +10,6 @@ def test_issue3549(en_vocab):
"""Test that match pattern validation doesn't raise on empty errors.""" """Test that match pattern validation doesn't raise on empty errors."""
matcher = Matcher(en_vocab, validate=True) matcher = Matcher(en_vocab, validate=True)
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}] pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
matcher.add("GOOD", None, pattern) matcher.add("GOOD", [pattern])
with pytest.raises(MatchPatternError): with pytest.raises(MatchPatternError):
matcher.add("BAD", None, [{"X": "Y"}]) matcher.add("BAD", [[{"X": "Y"}]])


@ -12,6 +12,6 @@ def test_issue3555(en_vocab):
Token.set_extension("issue3555", default=None) Token.set_extension("issue3555", default=None)
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
pattern = [{"LEMMA": "have"}, {"_": {"issue3555": True}}] pattern = [{"LEMMA": "have"}, {"_": {"issue3555": True}}]
matcher.add("TEST", None, pattern) matcher.add("TEST", [pattern])
doc = Doc(en_vocab, words=["have", "apple"]) doc = Doc(en_vocab, words=["have", "apple"])
matcher(doc) matcher(doc)


@ -34,8 +34,7 @@ def test_issue3611():
nlp.add_pipe(textcat, last=True) nlp.add_pipe(textcat, last=True)
# training the network # training the network
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"] with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]):
with nlp.disable_pipes(*other_pipes):
optimizer = nlp.begin_training() optimizer = nlp.begin_training()
for i in range(3): for i in range(3):
losses = {} losses = {}


@ -12,10 +12,10 @@ def test_issue3839(en_vocab):
match_id = "PATTERN" match_id = "PATTERN"
pattern1 = [{"LOWER": "terrific"}, {"OP": "?"}, {"LOWER": "group"}] pattern1 = [{"LOWER": "terrific"}, {"OP": "?"}, {"LOWER": "group"}]
pattern2 = [{"LOWER": "terrific"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "group"}] pattern2 = [{"LOWER": "terrific"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "group"}]
matcher.add(match_id, None, pattern1) matcher.add(match_id, [pattern1])
matches = matcher(doc) matches = matcher(doc)
assert matches[0][0] == en_vocab.strings[match_id] assert matches[0][0] == en_vocab.strings[match_id]
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add(match_id, None, pattern2) matcher.add(match_id, [pattern2])
matches = matcher(doc) matches = matcher(doc)
assert matches[0][0] == en_vocab.strings[match_id] assert matches[0][0] == en_vocab.strings[match_id]


@ -10,5 +10,5 @@ def test_issue3879(en_vocab):
assert len(doc) == 5 assert len(doc) == 5
pattern = [{"ORTH": "This", "OP": "?"}, {"OP": "?"}, {"ORTH": "test"}] pattern = [{"ORTH": "This", "OP": "?"}, {"OP": "?"}, {"ORTH": "test"}]
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("TEST", None, pattern) matcher.add("TEST", [pattern])
assert len(matcher(doc)) == 2 # fails because of a FP match 'is a test' assert len(matcher(doc)) == 2 # fails because of a FP match 'is a test'


@ -14,7 +14,7 @@ def test_issue3951(en_vocab):
{"OP": "?"}, {"OP": "?"},
{"LOWER": "world"}, {"LOWER": "world"},
] ]
matcher.add("TEST", None, pattern) matcher.add("TEST", [pattern])
doc = Doc(en_vocab, words=["Hello", "my", "new", "world"]) doc = Doc(en_vocab, words=["Hello", "my", "new", "world"])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 0 assert len(matches) == 0


@ -9,8 +9,8 @@ def test_issue3972(en_vocab):
"""Test that the PhraseMatcher returns duplicates for duplicate match IDs. """Test that the PhraseMatcher returns duplicates for duplicate match IDs.
""" """
matcher = PhraseMatcher(en_vocab) matcher = PhraseMatcher(en_vocab)
matcher.add("A", None, Doc(en_vocab, words=["New", "York"])) matcher.add("A", [Doc(en_vocab, words=["New", "York"])])
matcher.add("B", None, Doc(en_vocab, words=["New", "York"])) matcher.add("B", [Doc(en_vocab, words=["New", "York"])])
doc = Doc(en_vocab, words=["I", "live", "in", "New", "York"]) doc = Doc(en_vocab, words=["I", "live", "in", "New", "York"])
matches = matcher(doc) matches = matcher(doc)


@ -11,7 +11,7 @@ def test_issue4002(en_vocab):
matcher = PhraseMatcher(en_vocab, attr="NORM") matcher = PhraseMatcher(en_vocab, attr="NORM")
pattern1 = Doc(en_vocab, words=["c", "d"]) pattern1 = Doc(en_vocab, words=["c", "d"])
assert [t.norm_ for t in pattern1] == ["c", "d"] assert [t.norm_ for t in pattern1] == ["c", "d"]
matcher.add("TEST", None, pattern1) matcher.add("TEST", [pattern1])
doc = Doc(en_vocab, words=["a", "b", "c", "d"]) doc = Doc(en_vocab, words=["a", "b", "c", "d"])
assert [t.norm_ for t in doc] == ["a", "b", "c", "d"] assert [t.norm_ for t in doc] == ["a", "b", "c", "d"]
matches = matcher(doc) matches = matcher(doc)
@ -21,6 +21,6 @@ def test_issue4002(en_vocab):
pattern2[0].norm_ = "c" pattern2[0].norm_ = "c"
pattern2[1].norm_ = "d" pattern2[1].norm_ = "d"
assert [t.norm_ for t in pattern2] == ["c", "d"] assert [t.norm_ for t in pattern2] == ["c", "d"]
matcher.add("TEST", None, pattern2) matcher.add("TEST", [pattern2])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 1 assert len(matches) == 1


@ -34,8 +34,7 @@ def test_issue4030():
nlp.add_pipe(textcat, last=True) nlp.add_pipe(textcat, last=True)
# training the network # training the network
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"] with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]):
with nlp.disable_pipes(*other_pipes):
optimizer = nlp.begin_training() optimizer = nlp.begin_training()
for i in range(3): for i in range(3):
losses = {} losses = {}


@ -8,7 +8,7 @@ from spacy.tokens import Doc
def test_issue4120(en_vocab): def test_issue4120(en_vocab):
"""Test that matches without a final {OP: ?} token are returned.""" """Test that matches without a final {OP: ?} token are returned."""
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}]) matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}]])
doc1 = Doc(en_vocab, words=["a"]) doc1 = Doc(en_vocab, words=["a"])
assert len(matcher(doc1)) == 1 # works assert len(matcher(doc1)) == 1 # works
@ -16,11 +16,11 @@ def test_issue4120(en_vocab):
assert len(matcher(doc2)) == 2 # fixed assert len(matcher(doc2)) == 2 # fixed
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b"}]) matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b"}]])
doc3 = Doc(en_vocab, words=["a", "b", "b", "c"]) doc3 = Doc(en_vocab, words=["a", "b", "b", "c"])
assert len(matcher(doc3)) == 2 # works assert len(matcher(doc3)) == 2 # works
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b", "OP": "?"}]) matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b", "OP": "?"}]])
doc4 = Doc(en_vocab, words=["a", "b", "b", "c"]) doc4 = Doc(en_vocab, words=["a", "b", "b", "c"])
assert len(matcher(doc4)) == 3 # fixed assert len(matcher(doc4)) == 3 # fixed


@ -0,0 +1,96 @@
# coding: utf8
from __future__ import unicode_literals
import srsly
from spacy.gold import GoldCorpus
from spacy.lang.en import English
from spacy.tests.util import make_tempdir
def test_issue4402():
nlp = English()
with make_tempdir() as tmpdir:
print("temp", tmpdir)
json_path = tmpdir / "test4402.json"
srsly.write_json(json_path, json_data)
corpus = GoldCorpus(str(json_path), str(json_path))
train_docs = list(corpus.train_docs(nlp, gold_preproc=True, max_length=0))
# assert that the data got split into 4 sentences
assert len(train_docs) == 4
json_data = [
{
"id": 0,
"paragraphs": [
{
"raw": "How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven.",
"sentences": [
{
"tokens": [
{"id": 0, "orth": "How", "ner": "O"},
{"id": 1, "orth": "should", "ner": "O"},
{"id": 2, "orth": "I", "ner": "O"},
{"id": 3, "orth": "cook", "ner": "O"},
{"id": 4, "orth": "bacon", "ner": "O"},
{"id": 5, "orth": "in", "ner": "O"},
{"id": 6, "orth": "an", "ner": "O"},
{"id": 7, "orth": "oven", "ner": "O"},
{"id": 8, "orth": "?", "ner": "O"},
],
"brackets": [],
},
{
"tokens": [
{"id": 9, "orth": "\n", "ner": "O"},
{"id": 10, "orth": "I", "ner": "O"},
{"id": 11, "orth": "'ve", "ner": "O"},
{"id": 12, "orth": "heard", "ner": "O"},
{"id": 13, "orth": "of", "ner": "O"},
{"id": 14, "orth": "people", "ner": "O"},
{"id": 15, "orth": "cooking", "ner": "O"},
{"id": 16, "orth": "bacon", "ner": "O"},
{"id": 17, "orth": "in", "ner": "O"},
{"id": 18, "orth": "an", "ner": "O"},
{"id": 19, "orth": "oven", "ner": "O"},
{"id": 20, "orth": ".", "ner": "O"},
],
"brackets": [],
},
],
"cats": [
{"label": "baking", "value": 1.0},
{"label": "not_baking", "value": 0.0},
],
},
{
"raw": "What is the difference between white and brown eggs?\n",
"sentences": [
{
"tokens": [
{"id": 0, "orth": "What", "ner": "O"},
{"id": 1, "orth": "is", "ner": "O"},
{"id": 2, "orth": "the", "ner": "O"},
{"id": 3, "orth": "difference", "ner": "O"},
{"id": 4, "orth": "between", "ner": "O"},
{"id": 5, "orth": "white", "ner": "O"},
{"id": 6, "orth": "and", "ner": "O"},
{"id": 7, "orth": "brown", "ner": "O"},
{"id": 8, "orth": "eggs", "ner": "O"},
{"id": 9, "orth": "?", "ner": "O"},
],
"brackets": [],
},
{"tokens": [{"id": 10, "orth": "\n", "ner": "O"}], "brackets": []},
],
"cats": [
{"label": "baking", "value": 0.0},
{"label": "not_baking", "value": 1.0},
],
},
],
}
]


@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.tokens import Doc, DocBin
def test_issue4528(en_vocab):
"""Test that user_data is correctly serialized in DocBin."""
doc = Doc(en_vocab, words=["hello", "world"])
doc.user_data["foo"] = "bar"
# This is how extension attribute values are stored in the user data
doc.user_data[("._.", "foo", None, None)] = "bar"
doc_bin = DocBin(store_user_data=True)
doc_bin.add(doc)
doc_bin_bytes = doc_bin.to_bytes()
new_doc_bin = DocBin(store_user_data=True).from_bytes(doc_bin_bytes)
new_doc = list(new_doc_bin.get_docs(en_vocab))[0]
assert new_doc.user_data["foo"] == "bar"
assert new_doc.user_data[("._.", "foo", None, None)] == "bar"


@ -0,0 +1,13 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.gold import GoldParse
@pytest.mark.parametrize(
"text,words", [("A'B C", ["A", "'", "B", "C"]), ("A-B", ["A-B"])]
)
def test_gold_misaligned(en_tokenizer, text, words):
doc = en_tokenizer(text)
GoldParse(doc, words=words)


@ -3,7 +3,7 @@ from __future__ import unicode_literals
from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags
from spacy.gold import spans_from_biluo_tags, GoldParse, iob_to_biluo from spacy.gold import spans_from_biluo_tags, GoldParse, iob_to_biluo
from spacy.gold import GoldCorpus, docs_to_json from spacy.gold import GoldCorpus, docs_to_json, align
from spacy.lang.en import English from spacy.lang.en import English
from spacy.tokens import Doc from spacy.tokens import Doc
from .util import make_tempdir from .util import make_tempdir
@ -90,7 +90,7 @@ def test_gold_ner_missing_tags(en_tokenizer):
def test_iob_to_biluo(): def test_iob_to_biluo():
good_iob = ["O", "O", "B-LOC", "I-LOC", "O", "B-PERSON"] good_iob = ["O", "O", "B-LOC", "I-LOC", "O", "B-PERSON"]
good_biluo = ["O", "O", "B-LOC", "L-LOC", "O", "U-PERSON"] good_biluo = ["O", "O", "B-LOC", "L-LOC", "O", "U-PERSON"]
bad_iob = ["O", "O", "\"", "B-LOC", "I-LOC"] bad_iob = ["O", "O", '"', "B-LOC", "I-LOC"]
converted_biluo = iob_to_biluo(good_iob) converted_biluo = iob_to_biluo(good_iob)
assert good_biluo == converted_biluo assert good_biluo == converted_biluo
with pytest.raises(ValueError): with pytest.raises(ValueError):
@ -99,14 +99,23 @@ def test_iob_to_biluo():
def test_roundtrip_docs_to_json(): def test_roundtrip_docs_to_json():
text = "I flew to Silicon Valley via London." text = "I flew to Silicon Valley via London."
tags = ["PRP", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."]
heads = [1, 1, 1, 4, 2, 1, 5, 1]
deps = ["nsubj", "ROOT", "prep", "compound", "pobj", "prep", "pobj", "punct"]
biluo_tags = ["O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"]
cats = {"TRAVEL": 1.0, "BAKING": 0.0} cats = {"TRAVEL": 1.0, "BAKING": 0.0}
nlp = English() nlp = English()
doc = nlp(text) doc = nlp(text)
for i in range(len(tags)):
doc[i].tag_ = tags[i]
doc[i].dep_ = deps[i]
doc[i].head = doc[heads[i]]
doc.ents = spans_from_biluo_tags(doc, biluo_tags)
doc.cats = cats doc.cats = cats
doc[0].is_sent_start = True doc.is_tagged = True
for i in range(1, len(doc)): doc.is_parsed = True
doc[i].is_sent_start = False
# roundtrip to JSON
with make_tempdir() as tmpdir: with make_tempdir() as tmpdir:
json_file = tmpdir / "roundtrip.json" json_file = tmpdir / "roundtrip.json"
srsly.write_json(json_file, [docs_to_json(doc)]) srsly.write_json(json_file, [docs_to_json(doc)])
@ -116,7 +125,95 @@ def test_roundtrip_docs_to_json():
assert len(doc) == goldcorpus.count_train() assert len(doc) == goldcorpus.count_train()
assert text == reloaded_doc.text assert text == reloaded_doc.text
assert tags == goldparse.tags
assert deps == goldparse.labels
assert heads == goldparse.heads
assert biluo_tags == goldparse.ner
assert "TRAVEL" in goldparse.cats assert "TRAVEL" in goldparse.cats
assert "BAKING" in goldparse.cats assert "BAKING" in goldparse.cats
assert cats["TRAVEL"] == goldparse.cats["TRAVEL"] assert cats["TRAVEL"] == goldparse.cats["TRAVEL"]
assert cats["BAKING"] == goldparse.cats["BAKING"] assert cats["BAKING"] == goldparse.cats["BAKING"]
# roundtrip to JSONL train dicts
with make_tempdir() as tmpdir:
jsonl_file = tmpdir / "roundtrip.jsonl"
srsly.write_jsonl(jsonl_file, [docs_to_json(doc)])
goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file))
reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp))
assert len(doc) == goldcorpus.count_train()
assert text == reloaded_doc.text
assert tags == goldparse.tags
assert deps == goldparse.labels
assert heads == goldparse.heads
assert biluo_tags == goldparse.ner
assert "TRAVEL" in goldparse.cats
assert "BAKING" in goldparse.cats
assert cats["TRAVEL"] == goldparse.cats["TRAVEL"]
assert cats["BAKING"] == goldparse.cats["BAKING"]
# roundtrip to JSONL tuples
with make_tempdir() as tmpdir:
jsonl_file = tmpdir / "roundtrip.jsonl"
# write to JSONL train dicts
srsly.write_jsonl(jsonl_file, [docs_to_json(doc)])
goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file))
# load and rewrite as JSONL tuples
srsly.write_jsonl(jsonl_file, goldcorpus.train_tuples)
goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file))
reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp))
assert len(doc) == goldcorpus.count_train()
assert text == reloaded_doc.text
assert tags == goldparse.tags
assert deps == goldparse.labels
assert heads == goldparse.heads
assert biluo_tags == goldparse.ner
assert "TRAVEL" in goldparse.cats
assert "BAKING" in goldparse.cats
assert cats["TRAVEL"] == goldparse.cats["TRAVEL"]
assert cats["BAKING"] == goldparse.cats["BAKING"]
# xfail while we have backwards-compatible alignment
@pytest.mark.xfail
@pytest.mark.parametrize(
"tokens_a,tokens_b,expected",
[
(["a", "b", "c"], ["ab", "c"], (3, [-1, -1, 1], [-1, 2], {0: 0, 1: 0}, {})),
(
["a", "b", "``", "c"],
['ab"', "c"],
(4, [-1, -1, -1, 1], [-1, 3], {0: 0, 1: 0, 2: 0}, {}),
),
(["a", "bc"], ["ab", "c"], (4, [-1, -1], [-1, -1], {0: 0}, {1: 1})),
(
["ab", "c", "d"],
["a", "b", "cd"],
(6, [-1, -1, -1], [-1, -1, -1], {1: 2, 2: 2}, {0: 0, 1: 0}),
),
(
["a", "b", "cd"],
["a", "b", "c", "d"],
(3, [0, 1, -1], [0, 1, -1, -1], {}, {2: 2, 3: 2}),
),
([" ", "a"], ["a"], (1, [-1, 0], [1], {}, {})),
],
)
def test_align(tokens_a, tokens_b, expected):
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_a, tokens_b)
assert (cost, list(a2b), list(b2a), a2b_multi, b2a_multi) == expected
# check symmetry
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_b, tokens_a)
assert (cost, list(b2a), list(a2b), b2a_multi, a2b_multi) == expected
def test_goldparse_startswith_space(en_tokenizer):
text = " a"
doc = en_tokenizer(text)
g = GoldParse(doc, words=["a"], entities=["U-DATE"], deps=["ROOT"], heads=[0])
assert g.words == [" ", "a"]
assert g.ner == [None, "U-DATE"]
assert g.labels == [None, "ROOT"]
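
`test_iob_to_biluo` in this file pins down the expected tag conversion: a `B-` tag that is not continued becomes `U-`, the last `I-` tag of an entity becomes `L-`, and a malformed tag raises `ValueError`. The sketch below reproduces just that behaviour in plain Python. `iob_to_biluo_sketch` is a simplified, hypothetical re-implementation for illustration; it is not the code in `spacy.gold` and makes no claim to match it on other edge cases.

```python
def iob_to_biluo_sketch(tags):
    """Convert IOB tags to BILUO tags (simplified illustration)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, sep, label = tag.partition("-")
        if prefix not in ("B", "I") or not sep or not label:
            raise ValueError("Invalid IOB tag: %r" % tag)
        # The entity continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label
        if prefix == "B":
            out.append(("B-" if continues else "U-") + label)
        else:
            out.append(("I-" if continues else "L-") + label)
    return out


good_iob = ["O", "O", "B-LOC", "I-LOC", "O", "B-PERSON"]
converted = iob_to_biluo_sketch(good_iob)
# converted == ["O", "O", "B-LOC", "L-LOC", "O", "U-PERSON"]
```

The input and expected output are the `good_iob` and `good_biluo` lists from the test; the `bad_iob` list from the same test fails on its stray `"` tag.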


@ -95,12 +95,18 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
def test_prefer_gpu(): def test_prefer_gpu():
assert not prefer_gpu() try:
import cupy # noqa: F401
except ImportError:
assert not prefer_gpu()
def test_require_gpu(): def test_require_gpu():
with pytest.raises(ValueError): try:
require_gpu() import cupy # noqa: F401
except ImportError:
with pytest.raises(ValueError):
require_gpu()
def test_create_symlink_windows( def test_create_symlink_windows(


@ -0,0 +1,66 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy._ml import Tok2Vec
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.compat import unicode_
def get_batch(batch_size):
vocab = Vocab()
docs = []
start = 0
for size in range(1, batch_size + 1):
# Make the words numbers, so that they're distinct
# across the batch, and easy to track.
numbers = [unicode_(i) for i in range(start, start + size)]
docs.append(Doc(vocab, words=numbers))
start += size
return docs
# This fails in Thinc v7.3.1. Need to push patch
@pytest.mark.xfail
def test_empty_doc():
width = 128
embed_size = 2000
vocab = Vocab()
doc = Doc(vocab, words=[])
tok2vec = Tok2Vec(width, embed_size)
vectors, backprop = tok2vec.begin_update([doc])
assert len(vectors) == 1
assert vectors[0].shape == (0, width)
@pytest.mark.parametrize(
"batch_size,width,embed_size", [[1, 128, 2000], [2, 128, 2000], [3, 8, 63]]
)
def test_tok2vec_batch_sizes(batch_size, width, embed_size):
batch = get_batch(batch_size)
tok2vec = Tok2Vec(width, embed_size)
vectors, backprop = tok2vec.begin_update(batch)
assert len(vectors) == len(batch)
for doc_vec, doc in zip(vectors, batch):
assert doc_vec.shape == (len(doc), width)
@pytest.mark.parametrize(
"tok2vec_config",
[
{"width": 8, "embed_size": 100, "char_embed": False},
{"width": 8, "embed_size": 100, "char_embed": True},
{"width": 8, "embed_size": 100, "conv_depth": 6},
{"width": 8, "embed_size": 100, "conv_depth": 6},
{"width": 8, "embed_size": 100, "subword_features": False},
],
)
def test_tok2vec_configs(tok2vec_config):
docs = get_batch(3)
tok2vec = Tok2Vec(**tok2vec_config)
vectors, backprop = tok2vec.begin_update(docs)
assert len(vectors) == len(docs)
assert vectors[0].shape == (len(docs[0]), tok2vec_config["width"])
backprop(vectors)


@ -103,7 +103,8 @@ class DocBin(object):
doc = Doc(vocab, words=words, spaces=spaces) doc = Doc(vocab, words=words, spaces=spaces)
doc = doc.from_array(self.attrs, tokens) doc = doc.from_array(self.attrs, tokens)
if self.store_user_data: if self.store_user_data:
doc.user_data.update(srsly.msgpack_loads(self.user_data[i])) user_data = srsly.msgpack_loads(self.user_data[i], use_list=False)
doc.user_data.update(user_data)
yield doc yield doc
def merge(self, other): def merge(self, other):
@ -155,9 +156,9 @@ class DocBin(object):
msg = srsly.msgpack_loads(zlib.decompress(bytes_data)) msg = srsly.msgpack_loads(zlib.decompress(bytes_data))
self.attrs = msg["attrs"] self.attrs = msg["attrs"]
self.strings = set(msg["strings"]) self.strings = set(msg["strings"])
lengths = numpy.fromstring(msg["lengths"], dtype="int32") lengths = numpy.frombuffer(msg["lengths"], dtype="int32")
flat_spaces = numpy.fromstring(msg["spaces"], dtype=bool) flat_spaces = numpy.frombuffer(msg["spaces"], dtype=bool)
flat_tokens = numpy.fromstring(msg["tokens"], dtype="uint64") flat_tokens = numpy.frombuffer(msg["tokens"], dtype="uint64")
shape = (flat_tokens.size // len(self.attrs), len(self.attrs)) shape = (flat_tokens.size // len(self.attrs), len(self.attrs))
flat_tokens = flat_tokens.reshape(shape) flat_tokens = flat_tokens.reshape(shape)
flat_spaces = flat_spaces.reshape((flat_spaces.size, 1)) flat_spaces = flat_spaces.reshape((flat_spaces.size, 1))


@ -142,6 +142,11 @@ def register_architecture(name, arch=None):
return do_registration return do_registration
def make_layer(arch_config):
arch_func = get_architecture(arch_config["arch"])
return arch_func(arch_config["config"])
def get_architecture(name): def get_architecture(name):
"""Get a model architecture function by name. Raises a KeyError if the """Get a model architecture function by name. Raises a KeyError if the
architecture is not found. architecture is not found.
@ -242,6 +247,7 @@ def load_model_from_path(model_path, meta=False, **overrides):
cls = get_lang_class(lang) cls = get_lang_class(lang)
nlp = cls(meta=meta, **overrides) nlp = cls(meta=meta, **overrides)
pipeline = meta.get("pipeline", []) pipeline = meta.get("pipeline", [])
factories = meta.get("factories", {})
disable = overrides.get("disable", []) disable = overrides.get("disable", [])
if pipeline is True: if pipeline is True:
pipeline = nlp.Defaults.pipe_names pipeline = nlp.Defaults.pipe_names
@ -250,7 +256,8 @@ def load_model_from_path(model_path, meta=False, **overrides):
for name in pipeline: for name in pipeline:
if name not in disable: if name not in disable:
config = meta.get("pipeline_args", {}).get(name, {}) config = meta.get("pipeline_args", {}).get(name, {})
component = nlp.create_pipe(name, config=config) factory = factories.get(name, name)
component = nlp.create_pipe(factory, config=config)
nlp.add_pipe(component, name=name) nlp.add_pipe(component, name=name)
return nlp.from_disk(model_path) return nlp.from_disk(model_path)
@ -363,6 +370,16 @@ def is_in_jupyter():
return False return False
def get_component_name(component):
if hasattr(component, "name"):
return component.name
if hasattr(component, "__name__"):
return component.__name__
if hasattr(component, "__class__") and hasattr(component.__class__, "__name__"):
return component.__class__.__name__
return repr(component)
def get_cuda_stream(require=False): def get_cuda_stream(require=False):
if CudaStream is None: if CudaStream is None:
return None return None
@ -404,7 +421,7 @@ def env_opt(name, default=None):
def read_regex(path): def read_regex(path):
path = ensure_path(path) path = ensure_path(path)
with path.open() as file_: with path.open(encoding="utf8") as file_:
entries = file_.read().split("\n") entries = file_.read().split("\n")
expression = "|".join( expression = "|".join(
["^" + re.escape(piece) for piece in entries if piece.strip()] ["^" + re.escape(piece) for piece in entries if piece.strip()]
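
The new `get_component_name` helper added in this file tries three sources in order: an explicit `name` attribute, the callable's `__name__`, and finally the class name. A short sketch of how that lookup order plays out is below. The function body is copied from the hunk; the three example components (`NamedPipe`, `AnonymousPipe`, `lowercase_pipe`) are hypothetical and exist only for this illustration.

```python
def get_component_name(component):
    # Body copied from the diff above.
    if hasattr(component, "name"):
        return component.name
    if hasattr(component, "__name__"):
        return component.__name__
    if hasattr(component, "__class__") and hasattr(component.__class__, "__name__"):
        return component.__class__.__name__
    return repr(component)


class NamedPipe(object):
    """Hypothetical component that declares its own name."""

    name = "my_textcat"

    def __call__(self, doc):
        return doc


class AnonymousPipe(object):
    """Hypothetical component without a name attribute."""

    def __call__(self, doc):
        return doc


def lowercase_pipe(doc):
    """Hypothetical function component."""
    return doc


names = [
    get_component_name(component)
    for component in (NamedPipe(), lowercase_pipe, AnonymousPipe())
]
# names == ["my_textcat", "lowercase_pipe", "AnonymousPipe"]
```

An explicit `name` wins over everything else, plain functions fall back to their own name, and instances of unnamed classes report the class name.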


@ -48,14 +48,14 @@ be installed if needed via `pip install spacy[lookups]`. Some languages provide
full lemmatization rules and exceptions, while other languages currently only full lemmatization rules and exceptions, while other languages currently only
rely on simple lookup tables. rely on simple lookup tables.
<Infobox title="About spaCy's custom pronoun lemma" variant="warning"> <Infobox title="About spaCy's custom pronoun lemma for English" variant="warning">
spaCy adds a **special case for pronouns**: all pronouns are lemmatized to the spaCy adds a **special case for English pronouns**: all English pronouns are
special token `-PRON-`. Unlike verbs and common nouns, there's no clear base lemmatized to the special token `-PRON-`. Unlike verbs and common nouns,
form of a personal pronoun. Should the lemma of "me" be "I", or should we there's no clear base form of a personal pronoun. Should the lemma of "me" be
normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to "I", or should we normalize person as well, giving "it" — or maybe "he"?
introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal spaCy's solution is to introduce a novel symbol, `-PRON-`, which is used as the
pronouns. lemma for all personal pronouns.
</Infobox> </Infobox>
@ -117,76 +117,72 @@ type. They're available as the [`Token.pos`](/api/token#attributes) and
The English part-of-speech tagger uses the The English part-of-speech tagger uses the
[OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) version of the Penn [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) version of the Penn
Treebank tag set. We also map the tags to the simpler Google Universal POS tag Treebank tag set. We also map the tags to the simpler Universal Dependencies v2
set. POS tag set.
| Tag |  POS | Morphology | Description |
| ----------------------------------- | ------- | ---------------------------------------------- | ----------------------------------------- |
| `-LRB-` | `PUNCT` | `PunctType=brck PunctSide=ini` | left round bracket |
| `-RRB-` | `PUNCT` | `PunctType=brck PunctSide=fin` | right round bracket |
| `,` | `PUNCT` | `PunctType=comm` | punctuation mark, comma |
| `:` | `PUNCT` | | punctuation mark, colon or ellipsis |
| `.` | `PUNCT` | `PunctType=peri` | punctuation mark, sentence closer |
| `''` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark |
| `""` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark |
| <InlineCode>&#96;&#96;</InlineCode> | `PUNCT` | `PunctType=quot PunctSide=ini` | opening quotation mark |
| `#` | `SYM` | `SymType=numbersign` | symbol, number sign |
| `$` | `SYM` | `SymType=currency` | symbol, currency |
| `ADD` | `X` | | email |
| `AFX` | `ADJ` | `Hyph=yes` | affix |
| `BES` | `VERB` | | auxiliary "be" |
| `CC` | `CONJ` | `ConjType=coor` | conjunction, coordinating |
| `CD` | `NUM` | `NumType=card` | cardinal number |
| `DT` | `DET` | | determiner |
| `EX` | `ADV` | `AdvType=ex` | existential there |
| `FW` | `X` | `Foreign=yes` | foreign word |
| `GW` | `X` | | additional word in multi-word expression |
| `HVS` | `VERB` | | forms of "have" |
| `HYPH` | `PUNCT` | `PunctType=dash` | punctuation mark, hyphen |
| `IN` | `ADP` | | conjunction, subordinating or preposition |
| `JJ` | `ADJ` | `Degree=pos` | adjective |
| `JJR` | `ADJ` | `Degree=comp` | adjective, comparative |
| `JJS` | `ADJ` | `Degree=sup` | adjective, superlative |
| `LS` | `PUNCT` | `NumType=ord` | list item marker |
| `MD` | `VERB` | `VerbType=mod` | verb, modal auxiliary |
| `NFP` | `PUNCT` | | superfluous punctuation |
| `NIL` | | | missing tag |
| `NN` | `NOUN` | `Number=sing` | noun, singular or mass |
| `NNP` | `PROPN` | `NounType=prop Number=sign` | noun, proper singular |
| `NNPS` | `PROPN` | `NounType=prop Number=plur` | noun, proper plural |
| `NNS` | `NOUN` | `Number=plur` | noun, plural |
| `PDT` | `ADJ` | `AdjType=pdt PronType=prn` | predeterminer |
| `POS` | `PART` | `Poss=yes` | possessive ending |
| `PRP` | `PRON` | `PronType=prs` | pronoun, personal |
| `PRP$` | `ADJ` | `PronType=prs Poss=yes` | pronoun, possessive |
| `RB` | `ADV` | `Degree=pos` | adverb |
| `RBR` | `ADV` | `Degree=comp` | adverb, comparative |
| `RBS` | `ADV` | `Degree=sup` | adverb, superlative |
| `RP` | `PART` | | adverb, particle |
| `_SP` | `SPACE` | | space |
| `SYM` | `SYM` | | symbol |
| `TO` | `PART` | `PartType=inf VerbForm=inf` | infinitival "to" |
| `UH` | `INTJ` | | interjection |
| `VB` | `VERB` | `VerbForm=inf` | verb, base form |
| `VBD` | `VERB` | `VerbForm=fin Tense=past` | verb, past tense |
| `VBG` | `VERB` | `VerbForm=part Tense=pres Aspect=prog` | verb, gerund or present participle |
| `VBN` | `VERB` | `VerbForm=part Tense=past Aspect=perf` | verb, past participle |
| `VBP` | `VERB` | `VerbForm=fin Tense=pres` | verb, non-3rd person singular present |
| `VBZ` | `VERB` | `VerbForm=fin Tense=pres Number=sing Person=3` | verb, 3rd person singular present |
| `WDT` | `ADJ` | `PronType=int|rel` | wh-determiner |
| `WP` | `NOUN` | `PronType=int|rel` | wh-pronoun, personal |
| `WP$` | `ADJ` | `Poss=yes PronType=int|rel` | wh-pronoun, possessive |
| `WRB` | `ADV` | `PronType=int|rel` | wh-adverb |
| `XX` | `X` | | unknown |
| Tag |  POS | Morphology | Description |
| ------------------------------------- | ------- | --------------------------------------- | ----------------------------------------- |
| `$` | `SYM` | | symbol, currency |
| <InlineCode>&#96;&#96;</InlineCode> | `PUNCT` | `PunctType=quot PunctSide=ini` | opening quotation mark |
| `''` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark |
| `,` | `PUNCT` | `PunctType=comm` | punctuation mark, comma |
| `-LRB-` | `PUNCT` | `PunctType=brck PunctSide=ini` | left round bracket |
| `-RRB-` | `PUNCT` | `PunctType=brck PunctSide=fin` | right round bracket |
| `.` | `PUNCT` | `PunctType=peri` | punctuation mark, sentence closer |
| `:` | `PUNCT` | | punctuation mark, colon or ellipsis |
| `ADD` | `X` | | email |
| `AFX` | `ADJ` | `Hyph=yes` | affix |
| `CC` | `CCONJ` | `ConjType=comp` | conjunction, coordinating |
| `CD` | `NUM` | `NumType=card` | cardinal number |
| `DT` | `DET` | | determiner |
| `EX` | `PRON` | `AdvType=ex` | existential there |
| `FW` | `X` | `Foreign=yes` | foreign word |
| `GW` | `X` | | additional word in multi-word expression |
| `HYPH` | `PUNCT` | `PunctType=dash` | punctuation mark, hyphen |
| `IN` | `ADP` | | conjunction, subordinating or preposition |
| `JJ` | `ADJ` | `Degree=pos` | adjective |
| `JJR` | `ADJ` | `Degree=comp` | adjective, comparative |
| `JJS` | `ADJ` | `Degree=sup` | adjective, superlative |
| `LS` | `X` | `NumType=ord` | list item marker |
| `MD` | `VERB` | `VerbType=mod` | verb, modal auxiliary |
| `NFP` | `PUNCT` | | superfluous punctuation |
| `NIL` | `X` | | missing tag |
| `NN` | `NOUN` | `Number=sing` | noun, singular or mass |
| `NNP` | `PROPN` | `NounType=prop Number=sing` | noun, proper singular |
| `NNPS` | `PROPN` | `NounType=prop Number=plur` | noun, proper plural |
| `NNS` | `NOUN` | `Number=plur` | noun, plural |
| `PDT` | `DET` | | predeterminer |
| `POS` | `PART` | `Poss=yes` | possessive ending |
| `PRP` | `PRON` | `PronType=prs` | pronoun, personal |
| `PRP$` | `DET` | `PronType=prs Poss=yes` | pronoun, possessive |
| `RB` | `ADV` | `Degree=pos` | adverb |
| `RBR` | `ADV` | `Degree=comp` | adverb, comparative |
| `RBS` | `ADV` | `Degree=sup` | adverb, superlative |
| `RP` | `ADP` | | adverb, particle |
| `SP` | `SPACE` | | space |
| `SYM` | `SYM` | | symbol |
| `TO` | `PART` | `PartType=inf VerbForm=inf` | infinitival "to" |
| `UH` | `INTJ` | | interjection |
| `VB` | `VERB` | `VerbForm=inf` | verb, base form |
| `VBD` | `VERB` | `VerbForm=fin Tense=past` | verb, past tense |
| `VBG` | `VERB` | `VerbForm=part Tense=pres Aspect=prog` | verb, gerund or present participle |
| `VBN` | `VERB` | `VerbForm=part Tense=past Aspect=perf` | verb, past participle |
| `VBP` | `VERB` | `VerbForm=fin Tense=pres` | verb, non-3rd person singular present |
| `VBZ` | `VERB` | `VerbForm=fin Tense=pres Number=sing Person=three` | verb, 3rd person singular present |
| `WDT` | `DET` | | wh-determiner |
| `WP` | `PRON` | | wh-pronoun, personal |
| `WP$` | `DET` | `Poss=yes` | wh-pronoun, possessive |
| `WRB` | `ADV` | | wh-adverb |
| `XX` | `X` | | unknown |
| `_SP` | `SPACE` | | |
</Accordion> </Accordion>
<Accordion title="German" id="pos-de"> <Accordion title="German" id="pos-de">
The German part-of-speech tagger uses the The German part-of-speech tagger uses the
[TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html) [TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html)
annotation scheme. We also map the tags to the simpler Google Universal POS tag annotation scheme. We also map the tags to the simpler Universal Dependencies
set. v2 POS tag set.
| Tag |  POS | Morphology | Description | | Tag |  POS | Morphology | Description |
| --------- | ------- | ---------------------------------------- | ------------------------------------------------- | | --------- | ------- | ---------------------------------------- | ------------------------------------------------- |
@ -194,7 +190,7 @@ set.
| `$,` | `PUNCT` | `PunctType=comm` | comma | | `$,` | `PUNCT` | `PunctType=comm` | comma |
| `$.` | `PUNCT` | `PunctType=peri` | sentence-final punctuation mark | | `$.` | `PUNCT` | `PunctType=peri` | sentence-final punctuation mark |
| `ADJA` | `ADJ` | | adjective, attributive | | `ADJA` | `ADJ` | | adjective, attributive |
| `ADJD` | `ADJ` | `Variant=short` | adjective, adverbial or predicative | | `ADJD` | `ADJ` | | adjective, adverbial or predicative |
| `ADV` | `ADV` | | adverb | | `ADV` | `ADV` | | adverb |
| `APPO` | `ADP` | `AdpType=post` | postposition | | `APPO` | `ADP` | `AdpType=post` | postposition |
| `APPR` | `ADP` | `AdpType=prep` | preposition; circumposition left | | `APPR` | `ADP` | `AdpType=prep` | preposition; circumposition left |
@ -204,28 +200,28 @@ set.
| `CARD` | `NUM` | `NumType=card` | cardinal number | | `CARD` | `NUM` | `NumType=card` | cardinal number |
| `FM` | `X` | `Foreign=yes` | foreign language material | | `FM` | `X` | `Foreign=yes` | foreign language material |
| `ITJ` | `INTJ` | | interjection | | `ITJ` | `INTJ` | | interjection |
| `KOKOM` | `CONJ` | `ConjType=comp` | comparative conjunction | | `KOKOM` | `CCONJ` | `ConjType=comp` | comparative conjunction |
| `KON` | `CONJ` | | coordinate conjunction | | `KON` | `CCONJ` | | coordinate conjunction |
| `KOUI` | `SCONJ` | | subordinate conjunction with "zu" and infinitive | | `KOUI` | `SCONJ` | | subordinate conjunction with "zu" and infinitive |
| `KOUS` | `SCONJ` | | subordinate conjunction with sentence | | `KOUS` | `SCONJ` | | subordinate conjunction with sentence |
| `NE` | `PROPN` | | proper noun | | `NE` | `PROPN` | | proper noun |
| `NNE` | `PROPN` | | proper noun |
| `NN` | `NOUN` | | noun, singular or mass | | `NN` | `NOUN` | | noun, singular or mass |
| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb | | `NNE` | `PROPN` | | proper noun |
| `PDAT` | `DET` | `PronType=dem` | attributive demonstrative pronoun | | `PDAT` | `DET` | `PronType=dem` | attributive demonstrative pronoun |
| `PDS` | `PRON` | `PronType=dem` | substituting demonstrative pronoun | | `PDS` | `PRON` | `PronType=dem` | substituting demonstrative pronoun |
| `PIAT` | `DET` | `PronType=ind\|neg\|tot` | attributive indefinite pronoun without determiner | | `PIAT` | `DET` | `PronType=ind|neg|tot` | attributive indefinite pronoun without determiner |
| `PIS` | `PRON` | `PronType=ind\|neg\|tot` | substituting indefinite pronoun | | `PIS` | `PRON` | `PronType=ind|neg|tot` | substituting indefinite pronoun |
| `PPER` | `PRON` | `PronType=prs` | non-reflexive personal pronoun | | `PPER` | `PRON` | `PronType=prs` | non-reflexive personal pronoun |
| `PPOSAT` | `DET` | `Poss=yes PronType=prs` | attributive possessive pronoun | | `PPOSAT` | `DET` | `Poss=yes PronType=prs` | attributive possessive pronoun |
| `PPOSS` | `PRON` | `PronType=rel` | substituting possessive pronoun | | `PPOSS` | `PRON` | `Poss=yes PronType=prs` | substituting possessive pronoun |
| `PRELAT` | `DET` | `PronType=rel` | attributive relative pronoun | | `PRELAT` | `DET` | `PronType=rel` | attributive relative pronoun |
| `PRELS` | `PRON` | `PronType=rel` | substituting relative pronoun | | `PRELS` | `PRON` | `PronType=rel` | substituting relative pronoun |
| `PRF` | `PRON` | `PronType=prs Reflex=yes` | reflexive personal pronoun | | `PRF` | `PRON` | `PronType=prs Reflex=yes` | reflexive personal pronoun |
| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb |
| `PTKA` | `PART` | | particle with adjective or adverb | | `PTKA` | `PART` | | particle with adjective or adverb |
| `PTKANT` | `PART` | `PartType=res` | answer particle | | `PTKANT` | `PART` | `PartType=res` | answer particle |
| `PTKNEG` | `PART` | `Negative=yes` | negative particle | | `PTKNEG` | `PART` | `Polarity=neg` | negative particle |
| `PTKVZ` | `PART` | `PartType=vbp` | separable verbal particle | | `PTKVZ` | `ADP` | `PartType=vbp` | separable verbal particle |
| `PTKZU` | `PART` | `PartType=inf` | "zu" before infinitive | | `PTKZU` | `PART` | `PartType=inf` | "zu" before infinitive |
| `PWAT` | `DET` | `PronType=int` | attributive interrogative pronoun | | `PWAT` | `DET` | `PronType=int` | attributive interrogative pronoun |
| `PWAV` | `ADV` | `PronType=int` | adverbial interrogative or relative pronoun | | `PWAV` | `ADV` | `PronType=int` | adverbial interrogative or relative pronoun |
@ -234,9 +230,9 @@ set.
| `VAFIN` | `AUX` | `Mood=ind VerbForm=fin` | finite verb, auxiliary | | `VAFIN` | `AUX` | `Mood=ind VerbForm=fin` | finite verb, auxiliary |
| `VAIMP` | `AUX` | `Mood=imp VerbForm=fin` | imperative, auxiliary | | `VAIMP` | `AUX` | `Mood=imp VerbForm=fin` | imperative, auxiliary |
| `VAINF` | `AUX` | `VerbForm=inf` | infinitive, auxiliary | | `VAINF` | `AUX` | `VerbForm=inf` | infinitive, auxiliary |
| `VAPP` | `AUX` | `Aspect=perf VerbForm=fin` | perfect participle, auxiliary | | `VAPP` | `AUX` | `Aspect=perf VerbForm=part` | perfect participle, auxiliary |
| `VMFIN` | `VERB` | `Mood=ind VerbForm=fin VerbType=mod` | finite verb, modal | | `VMFIN` | `VERB` | `Mood=ind VerbForm=fin VerbType=mod` | finite verb, modal |
| `VMINF` | `VERB` | `VerbForm=fin VerbType=mod` | infinitive, modal | | `VMINF` | `VERB` | `VerbForm=inf VerbType=mod` | infinitive, modal |
| `VMPP` | `VERB` | `Aspect=perf VerbForm=part VerbType=mod` | perfect participle, modal | | `VMPP` | `VERB` | `Aspect=perf VerbForm=part VerbType=mod` | perfect participle, modal |
| `VVFIN` | `VERB` | `Mood=ind VerbForm=fin` | finite verb, full | | `VVFIN` | `VERB` | `Mood=ind VerbForm=fin` | finite verb, full |
| `VVIMP` | `VERB` | `Mood=imp VerbForm=fin` | imperative, full | | `VVIMP` | `VERB` | `Mood=imp VerbForm=fin` | imperative, full |
@ -244,8 +240,7 @@ set.
| `VVIZU` | `VERB` | `VerbForm=inf` | infinitive with "zu", full | | `VVIZU` | `VERB` | `VerbForm=inf` | infinitive with "zu", full |
| `VVPP` | `VERB` | `Aspect=perf VerbForm=part` | perfect participle, full | | `VVPP` | `VERB` | `Aspect=perf VerbForm=part` | perfect participle, full |
| `XY` | `X` | | non-word containing non-letter | | `XY` | `X` | | non-word containing non-letter |
| `SP` | `SPACE` | | space | | `_SP` | `SPACE` | | |
</Accordion> </Accordion>
--- ---


@ -155,21 +155,14 @@ $ python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
### Output file types {new="2.1"} ### Output file types {new="2.1"}
> #### Which format should I choose?
>
> If you're not sure, go with the default `jsonl`. Newline-delimited JSON means
> that there's one JSON object per line. Unlike a regular JSON file, it can also
> be read in line-by-line and you won't have to parse the _entire file_ first.
> This makes it a very convenient format for larger corpora.
All output files generated by this command are compatible with All output files generated by this command are compatible with
[`spacy train`](/api/cli#train). [`spacy train`](/api/cli#train).
| ID | Description | | ID | Description |
| ------- | --------------------------------- | | ------- | -------------------------- |
| `jsonl` | Newline-delimited JSON (default). | | `json` | Regular JSON (default). |
| `json` | Regular JSON. | | `jsonl` | Newline-delimited JSON. |
| `msg` | Binary MessagePack format. | | `msg` | Binary MessagePack format. |
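Newline-delimited JSON (`jsonl`) stores one JSON object per line, so a file can be consumed line by line without parsing it whole. A minimal sketch using only the standard library; the helper `read_jsonl` and the record contents are made up for illustration and do not reflect the schema that `spacy convert` writes.

```python
import io
import json

# Hypothetical two-record corpus held in memory instead of a file on disk.
jsonl_data = '{"text": "First document."}\n{"text": "Second document."}\n'


def read_jsonl(stream):
    """Yield one parsed object per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_jsonl(io.StringIO(jsonl_data)))
print(len(records))        # 2
print(records[0]["text"])  # First document.
```

Because `read_jsonl` is a generator, a large corpus could be streamed record by record rather than collected into a list as done here.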
### Converter options ### Converter options
@ -453,8 +446,10 @@ improvement.
```bash ```bash
$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
[--width] [--depth] [--embed-rows] [--loss-func] [--dropout] [--batch-size] [--max-length] [--min-length] [--width] [--depth] [--cnn-window] [--cnn-pieces] [--use-chars] [--sa-depth]
[--seed] [--n-iter] [--use-vectors] [--n-save-every] [--init-tok2vec] [--epoch-start] [--embed-rows] [--loss-func] [--dropout] [--batch-size] [--max-length]
[--min-length] [--seed] [--n-iter] [--use-vectors] [--n-save-every]
[--init-tok2vec] [--epoch-start]
``` ```
| Argument | Type | Description | | Argument | Type | Description |
@ -464,6 +459,10 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
| `output_dir` | positional | Directory to write models to on each epoch. | | `output_dir` | positional | Directory to write models to on each epoch. |
| `--width`, `-cw` | option | Width of CNN layers. | | `--width`, `-cw` | option | Width of CNN layers. |
| `--depth`, `-cd` | option | Depth of CNN layers. | | `--depth`, `-cd` | option | Depth of CNN layers. |
| `--cnn-window`, `-cW` <Tag variant="new">2.2.2</Tag> | option | Window size for CNN layers. |
| `--cnn-pieces`, `-cP` <Tag variant="new">2.2.2</Tag> | option | Maxout size for CNN layers. `1` for [Mish](https://github.com/digantamisra98/Mish). |
| `--use-chars`, `-chr` <Tag variant="new">2.2.2</Tag> | flag | Whether to use character-based embedding. |
| `--sa-depth`, `-sa` <Tag variant="new">2.2.2</Tag> | option | Depth of self-attention layers. |
| `--embed-rows`, `-er` | option | Number of embedding rows. | | `--embed-rows`, `-er` | option | Number of embedding rows. |
| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"L2"` or `"cosine"`. | | `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"L2"` or `"cosine"`. |
| `--dropout`, `-d` | option | Dropout rate. | | `--dropout`, `-d` | option | Dropout rate. |
@ -476,7 +475,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
| `--n-save-every`, `-se` | option | Save model every X batches. | | `--n-save-every`, `-se` | option | Save model every X batches. |
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. | | `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
| `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files. | | `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files. |
| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. | | **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. |
### JSONL format for raw text {#pretrain-jsonl} ### JSONL format for raw text {#pretrain-jsonl}


@ -202,6 +202,14 @@ All labels present in the match patterns.
| ----------- | ----- | ------------------ | | ----------- | ----- | ------------------ |
| **RETURNS** | tuple | The string labels. | | **RETURNS** | tuple | The string labels. |
## EntityRuler.ent_ids {#labels tag="property" new="2.2.2"}
All entity IDs found in the `id` properties of the match patterns.
| Name | Type | Description |
| ----------- | ----- | ------------------- |
| **RETURNS** | tuple | The string ent_ids. |
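As a rough illustration of what this property reports, the sketch below collects the distinct `id` values from a list of pattern dicts. The patterns and the helper `collect_ent_ids` are hypothetical, and this is not the `EntityRuler` implementation.

```python
# Hypothetical patterns in the dict format used by the entity ruler.
patterns = [
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"},
    {"label": "ORG", "pattern": "Google"},  # no "id" key, so it contributes nothing
]


def collect_ent_ids(patterns):
    """Return the distinct pattern IDs as a tuple, in first-seen order."""
    seen = dict.fromkeys(p["id"] for p in patterns if "id" in p)
    return tuple(seen)


ent_ids = collect_ent_ids(patterns)
print(ent_ids)  # ('apple', 'san-francisco')
```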
## EntityRuler.patterns {#patterns tag="property"} ## EntityRuler.patterns {#patterns tag="property"}
Get all patterns that were added to the entity ruler. Get all patterns that were added to the entity ruler.


@ -323,18 +323,38 @@ you can use to undo your changes.
> #### Example > #### Example
> >
> ```python > ```python
> with nlp.disable_pipes('tagger', 'parser'): > # New API as of v2.2.2
> with nlp.disable_pipes(["tagger", "parser"]):
> nlp.begin_training()
>
> with nlp.disable_pipes("tagger", "parser"):
> nlp.begin_training() > nlp.begin_training()
> >
> disabled = nlp.disable_pipes('tagger', 'parser') > disabled = nlp.disable_pipes("tagger", "parser")
> nlp.begin_training() > nlp.begin_training()
> disabled.restore() > disabled.restore()
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | --------------- | ------------------------------------------------------------------------------------ | | ----------------------------------------- | --------------- | ------------------------------------------------------------------------------------ |
| `*disabled` | unicode | Names of pipeline components to disable. | | `disabled` <Tag variant="new">2.2.2</Tag> | list | Names of pipeline components to disable. |
| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. | | `*disabled` | unicode | Names of pipeline components to disable. |
| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. |
<Infobox title="Changed in v2.2.2" variant="warning">
As of spaCy v2.2.2, the `Language.disable_pipes` method can also take a list of
component names as its first argument (instead of a variable number of
arguments). This is especially useful if you're generating the component names
to disable programmatically. The new syntax will become the default in the
future.
```diff
- disabled = nlp.disable_pipes("tagger", "parser")
+ disabled = nlp.disable_pipes(["tagger", "parser"])
```
</Infobox>
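The list form is convenient when the names are computed at runtime. The sketch below shows one way a function could accept both call styles; `normalize_names` is a hypothetical helper rather than spaCy's code, and the pipeline names are illustrative.

```python
def normalize_names(*names):
    """Accept either f("a", "b") or f(["a", "b"]) and return a list of names."""
    if len(names) == 1 and isinstance(names[0], (list, tuple)):
        names = names[0]
    return list(names)


pipe_names = ["tagger", "parser", "ner"]
# Build the list programmatically: disable everything except "ner".
to_disable = [name for name in pipe_names if name != "ner"]

# Both call styles resolve to the same list of names.
assert normalize_names(*to_disable) == normalize_names(to_disable)
print(normalize_names(to_disable))  # ['tagger', 'parser']
```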
## Language.to_disk {#to_disk tag="method" new="2"} ## Language.to_disk {#to_disk tag="method" new="2"}


@ -157,16 +157,19 @@ overwritten.
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | | `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
| `*patterns` | list | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. | | `*patterns` | list | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. |
<Infobox title="Changed in v2.0" variant="warning"> <Infobox title="Changed in v2.2.2" variant="warning">
As of spaCy 2.0, `Matcher.add_pattern` and `Matcher.add_entity` are deprecated As of spaCy 2.2.2, `Matcher.add` also supports the new API, which will become
and have been replaced with a simpler [`Matcher.add`](/api/matcher#add) that the default in the future. The patterns are now the second argument and a list
lets you add a list of patterns and a callback for a given match ID. (instead of a variable number of arguments). The `on_match` callback becomes an
optional keyword argument.
```diff ```diff
- matcher.add_entity("GoogleNow", on_match=merge_phrases) patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
- matcher.add_pattern("GoogleNow", [{ORTH: "Google"}, {ORTH: "Now"}]) - matcher.add("GoogleNow", None, *patterns)
+ matcher.add('GoogleNow', merge_phrases, [{"ORTH": "Google"}, {"ORTH": "Now"}]) + matcher.add("GoogleNow", patterns)
- matcher.add("GoogleNow", on_match, *patterns)
+ matcher.add("GoogleNow", patterns, on_match=on_match)
``` ```
</Infobox> </Infobox>
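To make the relationship between the two call styles concrete, here is a standalone sketch of a signature that tolerates both. The function `add` below is a hypothetical stand-in that only normalizes its arguments; it is not the `Matcher.add` implementation and does not require spaCy.

```python
def add(key, patterns, *old_style_patterns, on_match=None):
    """Hypothetical signature accepting both the old and the new call style."""
    if patterns is None or callable(patterns):
        # Old style: add(key, on_match, *patterns)
        on_match = patterns
        patterns = list(old_style_patterns)
    return key, patterns, on_match


def on_match(matcher, doc, i, matches):
    """Placeholder callback with the documented argument names."""


patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]

# Without a callback
assert add("GoogleNow", None, *patterns) == add("GoogleNow", patterns)
# With a callback
assert add("GoogleNow", on_match, *patterns) == add("GoogleNow", patterns, on_match=on_match)
```

The check relies on a list of patterns never being callable, so the second positional argument is enough to tell the two styles apart.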


@ -153,6 +153,23 @@ overwritten.
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | | `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
| `*docs` | `Doc` | `Doc` objects of the phrases to match. | | `*docs` | `Doc` | `Doc` objects of the phrases to match. |
<Infobox title="Changed in v2.2.2" variant="warning">
As of spaCy 2.2.2, `PhraseMatcher.add` also supports the new API, which will
become the default in the future. The `Doc` patterns are now the second argument
and a list (instead of a variable number of arguments). The `on_match` callback
becomes an optional keyword argument.
```diff
patterns = [nlp("health care reform"), nlp("healthcare reform")]
- matcher.add("HEALTH", None, *patterns)
+ matcher.add("HEALTH", patterns)
- matcher.add("HEALTH", on_match, *patterns)
+ matcher.add("HEALTH", patterns, on_match=on_match)
```
</Infobox>
## PhraseMatcher.remove {#remove tag="method" new="2.2"} ## PhraseMatcher.remove {#remove tag="method" new="2.2"}
Remove a rule from the matcher by match ID. A `KeyError` is raised if the key Remove a rule from the matcher by match ID. A `KeyError` is raised if the key


@ -1,9 +1,33 @@
<div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px">But <div
<mark class="entity" style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Google class="entities"
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>is starting from behind. The company made a late push into hardware, style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px"
and >But
<mark class="entity" style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Apple <mark
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>s Siri, available on iPhones, and class="entity"
<mark class="entity" style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Amazon style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>s Alexa software, which runs on its Echo and Dot devices, have clear >Google
leads in consumer adoption.</div> <span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>ORG</span
></mark
>is starting from behind. The company made a late push into hardware, and
<mark
class="entity"
style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>Apple
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>ORG</span
></mark
>s Siri, available on iPhones, and
<mark
class="entity"
style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>Amazon
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>ORG</span
></mark
>s Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer
adoption.</div
>


@ -2,17 +2,25 @@
class="entities" class="entities"
style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px"
> >
🌱🌿 <mark 🌱🌿
class="entity" <mark
style="background: #3dff74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone" class="entity"
>🐍 <span style="background: #3dff74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" >🐍
>SNEK</span <span
></mark> ____ 🌳🌲 ____ <mark style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
class="entity" >SNEK</span
style="background: #cfc5ff; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone" ></mark
>👨‍🌾 <span >
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem" ____ 🌳🌲 ____
>HUMAN</span <mark
></mark> 🏘️ class="entity"
style="background: #cfc5ff; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>👨‍🌾
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>HUMAN</span
></mark
>
🏘️
</div> </div>


@ -1,16 +1,37 @@
-<div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px">
-<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
+<div
+class="entities"
+style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px"
+>
+<mark
+class="entity"
+style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
+>
Apple
-<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span>
+<span
+style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
+>ORG</span
+>
</mark>
is looking at buying
-<mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
+<mark
+class="entity"
+style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
+>
U.K.
-<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">GPE</span>
+<span
+style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
+>GPE</span
+>
</mark>
startup for
-<mark class="entity" style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
+<mark
+class="entity"
+style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
+>
$1 billion
-<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">MONEY</span>
+<span
+style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
+>MONEY</span
+>
</mark>
</div>

@@ -1,18 +1,39 @@
-<div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px">
+<div
+class="entities"
+style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px"
+>
When
-<mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
+<mark
+class="entity"
+style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
+>
Sebastian Thrun
-<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PERSON</span>
+<span
+style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
+>PERSON</span
+>
</mark>
started working on self-driving cars at
-<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
+<mark
+class="entity"
+style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
+>
Google
-<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span>
+<span
+style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
+>ORG</span
+>
</mark>
in
-<mark class="entity" style="background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
+<mark
+class="entity"
+style="background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
+>
2007
-<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">DATE</span>
+<span
+style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
+>DATE</span
+>
</mark>
, few people outside of the company took him seriously.
</div>

@@ -986,6 +986,37 @@ doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

+### Adding IDs to patterns {#entityruler-ent-ids new="2.2.2"}
+
+The [`EntityRuler`](/api/entityruler) can also accept an `id` attribute for each
+pattern. Using the `id` attribute allows multiple patterns to be associated with
+the same entity.
+
+```python
+### {executable="true"}
+from spacy.lang.en import English
+from spacy.pipeline import EntityRuler
+
+nlp = English()
+ruler = EntityRuler(nlp)
+# Both GPE patterns share the id "san-francisco", so either spelling maps to it.
+patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple"},
+            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"},
+            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"}]
+ruler.add_patterns(patterns)
+nlp.add_pipe(ruler)
+
+# Print the text, label and pattern id of every matched entity.
+doc1 = nlp("Apple is opening its first big office in San Francisco.")
+print([(ent.text, ent.label_, ent.ent_id_) for ent in doc1.ents])
+
+doc2 = nlp("Apple is opening its first big office in San Fran.")
+print([(ent.text, ent.label_, ent.ent_id_) for ent in doc2.ents])
+```
+
+If the `id` attribute is included in the [`EntityRuler`](/api/entityruler)
+patterns, the `ent_id_` property of the matched entity is set to the `id` given
+in the matching pattern. In the example above, this makes it easy to see that
+"San Francisco" and "San Fran" refer to the same entity.
+
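Because several patterns can share one `id`, the `ent_id_` value can serve as a key for grouping different surface forms of the same entity. The following is a minimal sketch in plain Python that does not require spaCy: the tuples are hand-written stand-ins in the `(ent.text, ent.label_, ent.ent_id_)` shape that the example above prints, and `group_by_ent_id` is a hypothetical helper, not part of spaCy's API.

```python
from collections import defaultdict

# Hand-written sample data (an assumption for illustration): tuples in the
# (ent.text, ent.label_, ent.ent_id_) shape that the example above prints.
matches = [
    ("Apple", "ORG", "apple"),
    ("San Francisco", "GPE", "san-francisco"),
    ("San Fran", "GPE", "san-francisco"),
]


def group_by_ent_id(entities):
    """Collect the surface forms seen for each entity ID (hypothetical helper)."""
    grouped = defaultdict(list)
    for text, _label, ent_id in entities:
        # Keep each surface form only once per ID, in first-seen order.
        if text not in grouped[ent_id]:
            grouped[ent_id].append(text)
    return dict(grouped)


surface_forms = group_by_ent_id(matches)
print(surface_forms["san-francisco"])
```

Keying on the ID rather than on the matched text means downstream code does not need its own table of spelling variants; the patterns remain the single place where variants are declared.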
The entity ruler is designed to integrate with spaCy's existing statistical
models and enhance the named entity recognizer. If it's added **before the
`"ner"` component**, the entity recognizer will respect the existing entity

@@ -127,6 +127,7 @@
{ "code": "sr", "name": "Serbian" },
{ "code": "sk", "name": "Slovak" },
{ "code": "sl", "name": "Slovenian" },
+{ "code": "lb", "name": "Luxembourgish" },
{
"code": "sq",
"name": "Albanian",