Merge branch 'master' into spacy.io

Ines Montani 2019-10-31 17:30:42 +01:00
commit 07ba9b4aa2
88 changed files with 2370 additions and 565 deletions

.github/contributors/GiorgioPorgio.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | George Ketsopoulos |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 23 October 2019 |
| GitHub username | GiorgioPorgio |
| Website (optional) | |

.github/contributors/zhuorulin.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Zhuoru Lin |
| Company name (if applicable) | Bombora Inc. |
| Title or role (if applicable) | Data Scientist |
| Date | 2017-11-13 |
| GitHub username | ZhuoruLin |
| Website (optional) | |

Makefile

@@ -9,7 +9,7 @@ dist/spacy.pex : dist/spacy-$(sha).pex
dist/spacy-$(sha).pex : dist/$(wheel)
env3.6/bin/python -m pip install pex==1.5.3
env3.6/bin/pex pytest dist/$(wheel) -e spacy -o dist/spacy-$(sha).pex
env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex
dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
python3.6 -m venv env3.6

README.md

@@ -135,8 +135,7 @@ Thanks to our great community, we've finally re-added conda support. You can now
install spaCy via `conda-forge`:
```bash
conda config --add channels conda-forge
conda install spacy
conda install -c conda-forge spacy
```
For the feedstock including the build recipe and configuration, check out
@@ -214,16 +213,6 @@ doc = nlp("This is a sentence.")
📖 **For more info and examples, check out the
[models documentation](https://spacy.io/docs/usage/models).**
### Support for older versions
If you're using an older version (`v1.6.0` or below), you can still download and
install the old models from within spaCy using `python -m spacy.en.download all`
or `python -m spacy.de.download all`. The `.tar.gz` archives are also
[attached to the v1.6.0 release](https://github.com/explosion/spaCy/tree/v1.6.0).
To download and install the models manually, unpack the archive, drop the
contained directory into `spacy/data` and load the model via `spacy.load('en')`
or `spacy.load('de')`.
## Compile from source
The other way to install spaCy is to clone its


@@ -84,7 +84,7 @@ def read_conllu(file_):
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
if text_loc.parts[-1].endswith(".conllu"):
docs = []
with text_loc.open() as file_:
with text_loc.open(encoding="utf8") as file_:
for conllu_doc in read_conllu(file_):
for conllu_sent in conllu_doc:
words = [line[1] for line in conllu_sent]

bin/ud/ud_train.py

@@ -203,7 +203,7 @@ def golds_to_gold_tuples(docs, golds):
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
if text_loc.parts[-1].endswith(".conllu"):
docs = []
with text_loc.open() as file_:
with text_loc.open(encoding="utf8") as file_:
for conllu_doc in read_conllu(file_):
for conllu_sent in conllu_doc:
words = [line[1] for line in conllu_sent]
@@ -378,7 +378,7 @@ def _load_pretrained_tok2vec(nlp, loc):
"""Load pretrained weights for the 'token-to-vector' part of the component
models, which is typically a CNN. See 'spacy pretrain'. Experimental.
"""
with Path(loc).open("rb") as file_:  # binary mode takes no encoding argument
weights_data = file_.read()
loaded = []
for name, component in nlp.pipeline:
@@ -519,8 +519,8 @@ def main(
for i in range(config.nr_epoch):
docs, golds = read_data(
nlp,
paths.train.conllu.open(),
paths.train.text.open(),
paths.train.conllu.open(encoding="utf8"),
paths.train.text.open(encoding="utf8"),
max_doc_length=config.max_doc_length,
limit=limit,
oracle_segments=use_oracle_segments,
@@ -560,7 +560,7 @@ def main(
def _render_parses(i, to_render):
to_render[0].user_data["title"] = "Batch %d" % i
with Path("/tmp/parses.html").open("w") as file_:
with Path("/tmp/parses.html").open("w", encoding="utf8") as file_:
html = displacy.render(to_render[:5], style="dep", page=True)
file_.write(html)


@@ -77,6 +77,8 @@ def main(
if labels_discard:
labels_discard = [x.strip() for x in labels_discard.split(",")]
logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard))
else:
labels_discard = []
train_data = wikipedia_processor.read_training(
nlp=nlp,


@@ -18,19 +18,21 @@ during training. We discard the auxiliary model before run-time.
The specific example here is not necessarily a good idea --- but it shows
how an arbitrary objective function for some word can be used.
Developed and tested for spaCy 2.0.6
Developed and tested for spaCy 2.0.6. Updated for v2.2.2
"""
import random
import plac
import spacy
import os.path
from spacy.tokens import Doc
from spacy.gold import read_json_file, GoldParse
random.seed(0)
PWD = os.path.dirname(__file__)
TRAIN_DATA = list(read_json_file(os.path.join(PWD, "training-data.json")))
TRAIN_DATA = list(read_json_file(
os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json")))
def get_position_label(i, words, tags, heads, labels, ents):
@@ -55,6 +57,7 @@ def main(n_iter=10):
ner = nlp.create_pipe("ner")
ner.add_multitask_objective(get_position_label)
nlp.add_pipe(ner)
print(nlp.pipeline)
print("Create data", len(TRAIN_DATA))
optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA)
@@ -62,23 +65,24 @@ def main(n_iter=10):
random.shuffle(TRAIN_DATA)
losses = {}
for text, annot_brackets in TRAIN_DATA:
annotations, _ = annot_brackets
doc = nlp.make_doc(text)
gold = GoldParse.from_annot_tuples(doc, annotations[0])
nlp.update(
[doc], # batch of texts
[gold], # batch of annotations
drop=0.2, # dropout - make it harder to memorise data
sgd=optimizer, # callable to update weights
losses=losses,
)
for annotations, _ in annot_brackets:
doc = Doc(nlp.vocab, words=annotations[1])
gold = GoldParse.from_annot_tuples(doc, annotations)
nlp.update(
[doc], # batch of texts
[gold], # batch of annotations
drop=0.2, # dropout - make it harder to memorise data
sgd=optimizer, # callable to update weights
losses=losses,
)
print(losses.get("nn_labeller", 0.0), losses["ner"])
# test the trained model
for text, _ in TRAIN_DATA:
doc = nlp(text)
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
if text is not None:
doc = nlp(text)
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
if __name__ == "__main__":

requirements.txt

@@ -1,10 +1,10 @@
# Our libraries
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=7.2.0,<7.3.0
thinc>=7.3.0,<7.4.0
blis>=0.4.0,<0.5.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.2.0,<1.1.0
wasabi>=0.3.0,<1.1.0
srsly>=0.1.0,<1.1.0
# Third party dependencies
numpy>=1.15.0

setup.cfg

@@ -38,18 +38,18 @@ setup_requires =
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0
thinc>=7.2.0,<7.3.0
thinc>=7.3.0,<7.4.0
install_requires =
setuptools
numpy>=1.15.0
murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=7.2.0,<7.3.0
thinc>=7.3.0,<7.4.0
blis>=0.4.0,<0.5.0
plac>=0.9.6,<1.2.0
requests>=2.13.0,<3.0.0
wasabi>=0.2.0,<1.1.0
wasabi>=0.3.0,<1.1.0
srsly>=0.1.0,<1.1.0
pathlib==1.0.1; python_version < "3.4"
importlib_metadata>=0.20; python_version < "3.8"

spacy/__init__.py

@@ -9,12 +9,14 @@ warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
# These are imported as part of the API
from thinc.neural.util import prefer_gpu, require_gpu
from . import pipeline
from .cli.info import info as cli_info
from .glossary import explain
from .about import __version__
from .errors import Errors, Warnings, deprecation_warning
from . import util
from .util import register_architecture, get_architecture
from .language import component
if sys.maxunicode == 65535:

spacy/_ml.py

@@ -3,16 +3,14 @@ from __future__ import unicode_literals
import numpy
from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow, ParametricAttention
from thinc.t2v import Pooling, sum_pool, mean_pool
from thinc.misc import Residual
from thinc.i2v import HashEmbed
from thinc.misc import Residual, FeatureExtracter
from thinc.misc import LayerNorm as LN
from thinc.misc import FeatureExtracter
from thinc.api import add, layerize, chain, clone, concatenate, with_flatten
from thinc.api import with_getitem, flatten_add_lengths
from thinc.api import uniqued, wrap, noop
from thinc.api import with_square_sequences
from thinc.linear.linear import LinearModel
from thinc.neural.ops import NumpyOps, CupyOps
from thinc.neural.util import get_array_module, copy_array
@@ -26,14 +24,13 @@ import thinc.extra.load_nlp
from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
from .errors import Errors, user_warning, Warnings
from . import util
from . import ml as new_ml
from .ml import _legacy_tok2vec
try:
import torch.nn
from thinc.extra.wrappers import PyTorchWrapperRNN
except ImportError:
torch = None
VECTORS_KEY = "spacy_pretrained_vectors"
# Backwards compatibility with <2.2.2
USE_MODEL_REGISTRY_TOK2VEC = False
def cosine(vec1, vec2):
@@ -310,6 +307,10 @@ def link_vectors_to_models(vocab):
def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
import torch.nn
from thinc.api import with_square_sequences
from thinc.extra.wrappers import PyTorchWrapperRNN
if depth == 0:
return layerize(noop())
model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout)
@@ -317,81 +318,89 @@ def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
def Tok2Vec(width, embed_size, **kwargs):
if not USE_MODEL_REGISTRY_TOK2VEC:
# Preserve prior tok2vec for backwards compat, in v2.2.2
return _legacy_tok2vec.Tok2Vec(width, embed_size, **kwargs)
pretrained_vectors = kwargs.get("pretrained_vectors", None)
cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
subword_features = kwargs.get("subword_features", True)
char_embed = kwargs.get("char_embed", False)
if char_embed:
subword_features = False
conv_depth = kwargs.get("conv_depth", 4)
bilstm_depth = kwargs.get("bilstm_depth", 0)
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
with Model.define_operators(
{">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply}
):
norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm")
if subword_features:
prefix = HashEmbed(
width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix"
)
suffix = HashEmbed(
width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix"
)
shape = HashEmbed(
width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape"
)
else:
prefix, suffix, shape = (None, None, None)
if pretrained_vectors is not None:
glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
conv_window = kwargs.get("conv_window", 1)
if subword_features:
embed = uniqued(
(glove | norm | prefix | suffix | shape)
>> LN(Maxout(width, width * 5, pieces=3)),
column=cols.index(ORTH),
)
else:
embed = uniqued(
(glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
column=cols.index(ORTH),
)
elif subword_features:
embed = uniqued(
(norm | prefix | suffix | shape)
>> LN(Maxout(width, width * 4, pieces=3)),
column=cols.index(ORTH),
)
elif char_embed:
embed = concatenate_lists(
CharacterEmbed(nM=64, nC=8),
FeatureExtracter(cols) >> with_flatten(norm),
)
reduce_dimensions = LN(
Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces)
)
else:
embed = norm
cols = ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]
convolution = Residual(
ExtractWindow(nW=1)
>> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces))
)
if char_embed:
tok2vec = embed >> with_flatten(
reduce_dimensions >> convolution ** conv_depth, pad=conv_depth
)
else:
tok2vec = FeatureExtracter(cols) >> with_flatten(
embed >> convolution ** conv_depth, pad=conv_depth
)
if bilstm_depth >= 1:
tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth)
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
tok2vec.nO = width
tok2vec.embed = embed
return tok2vec
doc2feats_cfg = {"arch": "spacy.Doc2Feats.v1", "config": {"columns": cols}}
if char_embed:
embed_cfg = {
"arch": "spacy.CharacterEmbed.v1",
"config": {
"width": 64,
"chars": 6,
"@mix": {
"arch": "spacy.LayerNormalizedMaxout.v1",
"config": {"width": width, "pieces": 3},
},
"@embed_features": None,
},
}
else:
embed_cfg = {
"arch": "spacy.MultiHashEmbed.v1",
"config": {
"width": width,
"rows": embed_size,
"columns": cols,
"use_subwords": subword_features,
"@pretrained_vectors": None,
"@mix": {
"arch": "spacy.LayerNormalizedMaxout.v1",
"config": {"width": width, "pieces": 3},
},
},
}
if pretrained_vectors:
embed_cfg["config"]["@pretrained_vectors"] = {
"arch": "spacy.PretrainedVectors.v1",
"config": {
"vectors_name": pretrained_vectors,
"width": width,
"column": cols.index("ID"),
},
}
if cnn_maxout_pieces >= 2:
cnn_cfg = {
"arch": "spacy.MaxoutWindowEncoder.v1",
"config": {
"width": width,
"window_size": conv_window,
"pieces": cnn_maxout_pieces,
"depth": conv_depth,
},
}
else:
cnn_cfg = {
"arch": "spacy.MishWindowEncoder.v1",
"config": {"width": width, "window_size": conv_window, "depth": conv_depth},
}
bilstm_cfg = {
"arch": "spacy.TorchBiLSTMEncoder.v1",
"config": {"width": width, "depth": bilstm_depth},
}
if conv_depth == 0 and bilstm_depth == 0:
encode_cfg = {}
elif conv_depth >= 1 and bilstm_depth >= 1:
encode_cfg = {
"arch": "thinc.FeedForward.v1",
"config": {"children": [cnn_cfg, bilstm_cfg]},
}
elif conv_depth >= 1:
encode_cfg = cnn_cfg
else:
encode_cfg = bilstm_cfg
config = {"@doc2feats": doc2feats_cfg, "@embed": embed_cfg, "@encode": encode_cfg}
return new_ml.Tok2Vec(config)
def reapply(layer, n_times):

spacy/about.py

@@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy"
__version__ = "2.2.2.dev1"
__version__ = "2.2.2"
__release__ = True
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

spacy/analysis.py (new file)

@@ -0,0 +1,179 @@
# coding: utf8
from __future__ import unicode_literals
from collections import OrderedDict
from wasabi import Printer
from .tokens import Doc, Token, Span
from .errors import Errors, Warnings, user_warning
def analyze_pipes(pipeline, name, pipe, index, warn=True):
"""Analyze a pipeline component with respect to its position in the current
pipeline and the other components. Will check whether requirements are
fulfilled (e.g. if previous components assign the attributes).
pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
name (unicode): The name of the pipeline component to analyze.
pipe (callable): The pipeline component function to analyze.
index (int): The index of the component in the pipeline.
warn (bool): Show user warning if problem is found.
RETURNS (list): The problems found for the given pipeline component.
"""
assert pipeline[index][0] == name
prev_pipes = pipeline[:index]
pipe_requires = getattr(pipe, "requires", [])
requires = OrderedDict([(annot, False) for annot in pipe_requires])
if requires:
for prev_name, prev_pipe in prev_pipes:
prev_assigns = getattr(prev_pipe, "assigns", [])
for annot in prev_assigns:
requires[annot] = True
problems = []
for annot, fulfilled in requires.items():
if not fulfilled:
problems.append(annot)
if warn:
user_warning(Warnings.W025.format(name=name, attr=annot))
return problems
def analyze_all_pipes(pipeline, warn=True):
"""Analyze all pipes in the pipeline in order.
pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
warn (bool): Show user warning if problem is found.
RETURNS (dict): The problems found, keyed by component name.
"""
problems = {}
for i, (name, pipe) in enumerate(pipeline):
problems[name] = analyze_pipes(pipeline, name, pipe, i, warn=warn)
return problems
def dot_to_dict(values):
"""Convert dot notation to a dict. For example: ["token.pos", "token._.xyz"]
become {"token": {"pos": True, "_": {"xyz": True }}}.
values (iterable): The values to convert.
RETURNS (dict): The converted values.
"""
result = {}
for value in values:
path = result
parts = value.lower().split(".")
for i, item in enumerate(parts):
is_last = i == len(parts) - 1
path = path.setdefault(item, True if is_last else {})
return result
def validate_attrs(values):
"""Validate component attributes provided to "assigns", "requires" etc.
Raises error for invalid attributes and formatting. Doesn't check if
custom extension attributes are registered, since this is something the
user might want to do themselves later in the component.
values (iterable): The string attributes to check, e.g. `["token.pos"]`.
RETURNS (iterable): The checked attributes.
"""
data = dot_to_dict(values)
objs = {"doc": Doc, "token": Token, "span": Span}
for obj_key, attrs in data.items():
if obj_key == "span":
# Support Span only for custom extension attributes
span_attrs = [attr for attr in values if attr.startswith("span.")]
span_attrs = [attr for attr in span_attrs if not attr.startswith("span._.")]
if span_attrs:
raise ValueError(Errors.E180.format(attrs=", ".join(span_attrs)))
if obj_key not in objs: # first element is not doc/token/span
invalid_attrs = ", ".join(a for a in values if a.startswith(obj_key))
raise ValueError(Errors.E181.format(obj=obj_key, attrs=invalid_attrs))
if not isinstance(attrs, dict): # attr is something like "doc"
raise ValueError(Errors.E182.format(attr=obj_key))
for attr, value in attrs.items():
if attr == "_":
if value is True: # attr is something like "doc._"
raise ValueError(Errors.E182.format(attr="{}._".format(obj_key)))
for ext_attr, ext_value in value.items():
# We don't check whether the attribute actually exists
if ext_value is not True: # attr is something like doc._.x.y
good = "{}._.{}".format(obj_key, ext_attr)
bad = "{}.{}".format(good, ".".join(ext_value))
raise ValueError(Errors.E183.format(attr=bad, solution=good))
continue # we can't validate those further
if attr.endswith("_"): # attr is something like "token.pos_"
raise ValueError(Errors.E184.format(attr=attr, solution=attr[:-1]))
if value is not True: # attr is something like doc.x.y
good = "{}.{}".format(obj_key, attr)
bad = "{}.{}".format(good, ".".join(value))
raise ValueError(Errors.E183.format(attr=bad, solution=good))
obj = objs[obj_key]
if not hasattr(obj, attr):
raise ValueError(Errors.E185.format(obj=obj_key, attr=attr))
return values
def _get_feature_for_attr(pipeline, attr, feature):
assert feature in ["assigns", "requires"]
result = []
for pipe_name, pipe in pipeline:
pipe_assigns = getattr(pipe, feature, [])
if attr in pipe_assigns:
result.append((pipe_name, pipe))
return result
def get_assigns_for_attr(pipeline, attr):
"""Get all pipeline components that assign an attr, e.g. "doc.tensor".
pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
attr (unicode): The attribute to check.
RETURNS (list): (name, pipeline) tuples of components that assign the attr.
"""
return _get_feature_for_attr(pipeline, attr, "assigns")
def get_requires_for_attr(pipeline, attr):
"""Get all pipeline components that require an attr, e.g. "doc.tensor".
pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
attr (unicode): The attribute to check.
RETURNS (list): (name, pipeline) tuples of components that require the attr.
"""
return _get_feature_for_attr(pipeline, attr, "requires")
def print_summary(nlp, pretty=True, no_print=False):
"""Print a formatted summary for the current nlp object's pipeline. Shows
a table with the pipeline components and what they assign and require, as
well as any problems if available.
nlp (Language): The nlp object.
pretty (bool): Pretty-print the results (color etc).
no_print (bool): Don't print anything, just return the data.
RETURNS (dict): A dict with "overview" and "problems".
"""
msg = Printer(pretty=pretty, no_print=no_print)
overview = []
problems = {}
for i, (name, pipe) in enumerate(nlp.pipeline):
requires = getattr(pipe, "requires", [])
assigns = getattr(pipe, "assigns", [])
retok = getattr(pipe, "retokenizes", False)
overview.append((i, name, requires, assigns, retok))
problems[name] = analyze_pipes(nlp.pipeline, name, pipe, i, warn=False)
msg.divider("Pipeline Overview")
header = ("#", "Component", "Requires", "Assigns", "Retokenizes")
msg.table(overview, header=header, divider=True, multiline=True)
n_problems = sum(len(p) for p in problems.values())
if any(p for p in problems.values()):
msg.divider("Problems ({})".format(n_problems))
for name, problem in problems.items():
if problem:
problem = ", ".join(problem)
msg.warn("'{}' requirements not met: {}".format(name, problem))
else:
msg.good("No problems found.")
if no_print:
return {"overview": overview, "problems": problems}

spacy/cli/convert.py

@@ -57,7 +57,7 @@ def convert(
is written to stdout, so you can pipe them forward to a JSON file:
$ spacy convert some_file.conllu > some_file.json
"""
no_print = (output_dir == "-")
no_print = output_dir == "-"
msg = Printer(no_print=no_print)
input_path = Path(input_file)
if file_type not in FILE_TYPES:

spacy/cli/converters/conll_ner2json.py

@@ -9,7 +9,9 @@ from ...tokens.doc import Doc
from ...util import load_model
def conll_ner2json(input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs):
def conll_ner2json(
input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs
):
"""
Convert files in the CoNLL-2003 NER format and similar
whitespace-separated columns into JSON format for use with train cli.

spacy/cli/pretrain.py

@@ -35,6 +35,10 @@ from .train import _load_pretrained_tok2vec
output_dir=("Directory to write models to on each epoch", "positional", None, str),
width=("Width of CNN layers", "option", "cw", int),
depth=("Depth of CNN layers", "option", "cd", int),
cnn_window=("Window size for CNN layers", "option", "cW", int),
cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int),
use_chars=("Whether to use character-based embedding", "flag", "chr", bool),
sa_depth=("Depth of self-attention layers", "option", "sa", int),
bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int),
embed_rows=("Number of embedding rows", "option", "er", int),
loss_func=(
@@ -81,7 +85,11 @@ def pretrain(
output_dir,
width=96,
depth=4,
bilstm_depth=2,
bilstm_depth=0,
cnn_pieces=3,
sa_depth=0,
use_chars=False,
cnn_window=1,
embed_rows=2000,
loss_func="cosine",
use_vectors=False,
@@ -158,8 +166,8 @@ def pretrain(
conv_depth=depth,
pretrained_vectors=pretrained_vectors,
bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental.
cnn_maxout_pieces=3, # You can try setting this higher
subword_features=True, # Set to False for Chinese etc
subword_features=not use_chars, # Set to False for Chinese etc
cnn_maxout_pieces=cnn_pieces, # If set to 1, use Mish activation.
),
)
# Load in pretrained weights

spacy/cli/train.py

@@ -156,8 +156,7 @@ def train(
"`lang` argument ('{}') ".format(nlp.lang, lang),
exits=1,
)
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipeline]
nlp.disable_pipes(*other_pipes)
nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline])
for pipe in pipeline:
if pipe not in nlp.pipe_names:
if pipe == "parser":
@@ -263,7 +262,11 @@ def train(
exits=1,
)
train_docs = corpus.train_docs(
nlp, noise_level=noise_level, gold_preproc=gold_preproc, max_length=0
nlp,
noise_level=noise_level,
gold_preproc=gold_preproc,
max_length=0,
ignore_misaligned=True,
)
train_labels = set()
if textcat_multilabel:
@@ -344,6 +347,7 @@ def train(
orth_variant_level=orth_variant_level,
gold_preproc=gold_preproc,
max_length=0,
ignore_misaligned=True,
)
if raw_text:
random.shuffle(raw_text)
@@ -382,7 +386,11 @@ def train(
if hasattr(component, "cfg"):
component.cfg["beam_width"] = beam_width
dev_docs = list(
corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
corpus.dev_docs(
nlp_loaded,
gold_preproc=gold_preproc,
ignore_misaligned=True,
)
)
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
start_time = timer()
@@ -399,7 +407,11 @@ def train(
if hasattr(component, "cfg"):
component.cfg["beam_width"] = beam_width
dev_docs = list(
corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
corpus.dev_docs(
nlp_loaded,
gold_preproc=gold_preproc,
ignore_misaligned=True,
)
)
start_time = timer()
scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose)

spacy/compat.py

@@ -12,6 +12,7 @@ import os
import sys
import itertools
import ast
import types
from thinc.neural.util import copy_array
@@ -67,6 +68,7 @@ if is_python2:
basestring_ = basestring # noqa: F821
input_ = raw_input # noqa: F821
path2str = lambda path: str(path).decode("utf8")
class_types = (type, types.ClassType)
elif is_python3:
bytes_ = bytes
@@ -74,6 +76,7 @@ elif is_python3:
basestring_ = str
input_ = input
path2str = lambda path: str(path)
class_types = (type, types.ClassType) if is_python_pre_3_5 else type
def b_to_str(b_str):

spacy/displacy/templates.py

@@ -44,14 +44,14 @@ TPL_ENTS = """
TPL_ENT = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
{text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">{label}</span>
</mark>
"""
TPL_ENT_RTL = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em">
{text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
</mark>

spacy/errors.py

@@ -99,6 +99,8 @@ class Warnings(object):
"'n_process' will be set to 1.")
W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in "
"the Knowledge Base.")
W025 = ("'{name}' requires '{attr}' to be assigned, but none of the "
"previous components in the pipeline declare that they assign it.")
@add_codes
@@ -504,6 +506,29 @@ class Errors(object):
E175 = ("Can't remove rule for unknown match pattern ID: {key}")
E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
E177 = ("Ill-formed IOB input detected: {tag}")
E178 = ("Invalid pattern. Expected list of dicts but got: {pat}. Maybe you "
"accidentally passed a single pattern to Matcher.add instead of a "
"list of patterns? If you only want to add one pattern, make sure "
"to wrap it in a list. For example: matcher.add('{key}', [pattern])")
E179 = ("Invalid pattern. Expected a list of Doc objects but got a single "
"Doc. If you only want to add one pattern, make sure to wrap it "
"in a list. For example: matcher.add('{key}', [doc])")
E180 = ("Span attributes can't be declared as required or assigned by "
"components, since spans are only views of the Doc. Use Doc and "
"Token attributes (or custom extension attributes) only and remove "
"the following: {attrs}")
E181 = ("Received invalid attributes for unknown object {obj}: {attrs}. "
"Only Doc and Token attributes are supported.")
E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
"to define the attribute? For example: {attr}.???")
E183 = ("Received invalid attribute declaration: {attr}\nOnly top-level "
"attributes are supported, for example: {solution}")
E184 = ("Only attributes without underscores are supported in component "
"attribute declarations (because underscore and non-underscore "
"attributes are connected anyways): {attr} -> {solution}")
E185 = ("Received invalid attribute in component attribute declaration: "
"{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
@add_codes
@@ -536,6 +561,10 @@ class MatchPatternError(ValueError):
ValueError.__init__(self, msg)
class AlignmentError(ValueError):
pass
class ModelsWarning(UserWarning):
pass

spacy/glossary.py

@@ -80,7 +80,7 @@ GLOSSARY = {
"RBR": "adverb, comparative",
"RBS": "adverb, superlative",
"RP": "adverb, particle",
"TO": "infinitival to",
"TO": 'infinitival "to"',
"UH": "interjection",
"VB": "verb, base form",
"VBD": "verb, past tense",
@@ -279,6 +279,12 @@ GLOSSARY = {
"re": "repeated element",
"rs": "reported speech",
"sb": "subject",
"sb": "subject",
"sbp": "passivized subject (PP)",
"sp": "subject or predicate",
"svp": "separable verb prefix",
"uc": "unit component",
"vo": "vocative",
# Named Entity Recognition
# OntoNotes 5
# https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf

spacy/gold.pyx

@@ -11,10 +11,9 @@ import itertools
from pathlib import Path
import srsly
from . import _align
from .syntax import nonproj
from .tokens import Doc, Span
from .errors import Errors
from .errors import Errors, AlignmentError
from .compat import path2str
from . import util
from .util import minibatch, itershuffle
@@ -22,6 +21,7 @@ from .util import minibatch, itershuffle
from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek
USE_NEW_ALIGN = False
punct_re = re.compile(r"\W")
@@ -56,10 +56,10 @@ def tags_to_entities(tags):
def merge_sents(sents):
m_deps = [[], [], [], [], [], []]
m_cats = {}
m_brackets = []
m_cats = sents.pop()
i = 0
for (ids, words, tags, heads, labels, ner), brackets in sents:
for (ids, words, tags, heads, labels, ner), (cats, brackets) in sents:
m_deps[0].extend(id_ + i for id_ in ids)
m_deps[1].extend(words)
m_deps[2].extend(tags)
@@ -68,12 +68,26 @@ def merge_sents(sents):
m_deps[5].extend(ner)
m_brackets.extend((b["first"] + i, b["last"] + i, b["label"])
for b in brackets)
m_cats.update(cats)
i += len(ids)
m_deps.append(m_cats)
return [(m_deps, m_brackets)]
return [(m_deps, (m_cats, m_brackets))]
def align(tokens_a, tokens_b):
_ALIGNMENT_NORM_MAP = [("``", "'"), ("''", "'"), ('"', "'"), ("`", "'")]
def _normalize_for_alignment(tokens):
tokens = [w.replace(" ", "").lower() for w in tokens]
output = []
for token in tokens:
token = token.replace(" ", "").lower()
for before, after in _ALIGNMENT_NORM_MAP:
token = token.replace(before, after)
output.append(token)
return output
def _align_before_v2_2_2(tokens_a, tokens_b):
"""Calculate alignment tables between two tokenizations, using the Levenshtein
algorithm. The alignment is case-insensitive.
@@ -92,6 +106,7 @@ def align(tokens_a, tokens_b):
* b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
direction.
"""
from . import _align
if tokens_a == tokens_b:
alignment = numpy.arange(len(tokens_a))
return 0, alignment, alignment, {}, {}
@@ -111,6 +126,82 @@
return cost, i2j, j2i, i2j_multi, j2i_multi
def align(tokens_a, tokens_b):
"""Calculate alignment tables between two tokenizations.
tokens_a (List[str]): The candidate tokenization.
tokens_b (List[str]): The reference tokenization.
RETURNS: (tuple): A 5-tuple consisting of the following information:
* cost (int): The number of misaligned tokens.
* a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`.
For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns
to `tokens_b[6]`. If there's no one-to-one alignment for a token,
it has the value -1.
* b2a (List[int]): The same as `a2b`, but mapping the other direction.
* a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a`
to indices in `tokens_b`, where multiple tokens of `tokens_a` align to
the same token of `tokens_b`.
* b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
direction.
"""
if not USE_NEW_ALIGN:
return _align_before_v2_2_2(tokens_a, tokens_b)
tokens_a = _normalize_for_alignment(tokens_a)
tokens_b = _normalize_for_alignment(tokens_b)
cost = 0
a2b = numpy.empty(len(tokens_a), dtype="i")
b2a = numpy.empty(len(tokens_b), dtype="i")
a2b_multi = {}
b2a_multi = {}
i = 0
j = 0
offset_a = 0
offset_b = 0
while i < len(tokens_a) and j < len(tokens_b):
a = tokens_a[i][offset_a:]
b = tokens_b[j][offset_b:]
a2b[i] = b2a[j] = -1
if a == b:
if offset_a == offset_b == 0:
a2b[i] = j
b2a[j] = i
elif offset_a == 0:
cost += 2
a2b_multi[i] = j
elif offset_b == 0:
cost += 2
b2a_multi[j] = i
offset_a = offset_b = 0
i += 1
j += 1
elif a == "":
assert offset_a == 0
cost += 1
i += 1
elif b == "":
assert offset_b == 0
cost += 1
j += 1
elif b.startswith(a):
cost += 1
if offset_a == 0:
a2b_multi[i] = j
i += 1
offset_a = 0
offset_b += len(a)
elif a.startswith(b):
cost += 1
if offset_b == 0:
b2a_multi[j] = i
j += 1
offset_b = 0
offset_a += len(b)
else:
assert "".join(tokens_a) != "".join(tokens_b)
raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b))
return cost, a2b, b2a, a2b_multi, b2a_multi
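
A usage sketch of the alignment API (the token lists are illustrative; no exact outputs are asserted, since `USE_NEW_ALIGN` above selects between the old and new code paths):

```python
# Sketch: align a candidate tokenization against a gold one and inspect
# the 5-tuple documented in the docstring above.
from spacy.gold import align

cand = ["I", "listened", "to", "Obama", "'s", "podcasts", "."]
gold = ["I", "listened", "to", "Obama", "'", "s", "podcasts", "."]
cost, a2b, b2a, a2b_multi, b2a_multi = align(cand, gold)
print(cost)        # number of misaligned tokens
print(list(a2b))   # -1 marks candidate tokens with no one-to-one match
print(b2a_multi)   # gold tokens mapping many-to-one onto a candidate token
```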
class GoldCorpus(object):
"""An annotated corpus, using the JSON file format. Manages
annotations for tagging, dependency parsing and NER.
@@ -176,6 +267,11 @@ class GoldCorpus(object):
gold_tuples = read_json_file(loc)
elif loc.parts[-1].endswith("jsonl"):
gold_tuples = srsly.read_jsonl(loc)
first_gold_tuple = next(gold_tuples)
gold_tuples = itertools.chain([first_gold_tuple], gold_tuples)
# TODO: proper format checks with schemas
if isinstance(first_gold_tuple, dict):
gold_tuples = read_json_object(gold_tuples)
elif loc.parts[-1].endswith("msg"):
gold_tuples = srsly.read_msgpack(loc)
else:
@@ -201,7 +297,6 @@ class GoldCorpus(object):
n = 0
i = 0
for raw_text, paragraph_tuples in self.train_tuples:
cats = paragraph_tuples.pop()
for sent_tuples, brackets in paragraph_tuples:
n += len(sent_tuples[1])
if self.limit and i >= self.limit:
@@ -210,7 +305,8 @@ class GoldCorpus(object):
return n
def train_docs(self, nlp, gold_preproc=False, max_length=None,
noise_level=0.0, orth_variant_level=0.0):
noise_level=0.0, orth_variant_level=0.0,
ignore_misaligned=False):
locs = list((self.tmp_dir / 'train').iterdir())
random.shuffle(locs)
train_tuples = self.read_tuples(locs, limit=self.limit)
@@ -218,20 +314,23 @@ class GoldCorpus(object):
max_length=max_length,
noise_level=noise_level,
orth_variant_level=orth_variant_level,
make_projective=True)
make_projective=True,
ignore_misaligned=ignore_misaligned)
yield from gold_docs
def train_docs_without_preprocessing(self, nlp, gold_preproc=False):
gold_docs = self.iter_gold_docs(nlp, self.train_tuples, gold_preproc=gold_preproc)
yield from gold_docs
def dev_docs(self, nlp, gold_preproc=False):
gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc)
def dev_docs(self, nlp, gold_preproc=False, ignore_misaligned=False):
gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc,
ignore_misaligned=ignore_misaligned)
yield from gold_docs
@classmethod
def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None,
noise_level=0.0, orth_variant_level=0.0, make_projective=False):
noise_level=0.0, orth_variant_level=0.0, make_projective=False,
ignore_misaligned=False):
for raw_text, paragraph_tuples in tuples:
if gold_preproc:
raw_text = None
@@ -240,10 +339,12 @@ class GoldCorpus(object):
docs, paragraph_tuples = cls._make_docs(nlp, raw_text,
paragraph_tuples, gold_preproc, noise_level=noise_level,
orth_variant_level=orth_variant_level)
golds = cls._make_golds(docs, paragraph_tuples, make_projective)
golds = cls._make_golds(docs, paragraph_tuples, make_projective,
ignore_misaligned=ignore_misaligned)
for doc, gold in zip(docs, golds):
if (not max_length) or len(doc) < max_length:
yield doc, gold
if gold is not None:
if (not max_length) or len(doc) < max_length:
yield doc, gold
@classmethod
def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0, orth_variant_level=0.0):
@@ -259,14 +360,22 @@ class GoldCorpus(object):
@classmethod
def _make_golds(cls, docs, paragraph_tuples, make_projective):
def _make_golds(cls, docs, paragraph_tuples, make_projective, ignore_misaligned=False):
if len(docs) != len(paragraph_tuples):
n_annots = len(paragraph_tuples)
raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots))
return [GoldParse.from_annot_tuples(doc, sent_tuples,
make_projective=make_projective)
for doc, (sent_tuples, brackets)
in zip(docs, paragraph_tuples)]
golds = []
for doc, (sent_tuples, (cats, brackets)) in zip(docs, paragraph_tuples):
try:
gold = GoldParse.from_annot_tuples(doc, sent_tuples, cats=cats,
make_projective=make_projective)
except AlignmentError:
if ignore_misaligned:
gold = None
else:
raise
golds.append(gold)
return golds
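
From the caller's side, the new flag looks like this (a minimal sketch; the corpus paths are hypothetical):

```python
# Sketch: with ignore_misaligned=True, examples whose gold annotations can't
# be aligned are skipped instead of raising AlignmentError.
import spacy
from spacy.gold import GoldCorpus

nlp = spacy.blank("en")
corpus = GoldCorpus("train.json", "dev.json")  # hypothetical paths
dev_docs = list(corpus.dev_docs(nlp, gold_preproc=False, ignore_misaligned=True))
```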
def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
@@ -281,7 +390,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
# modify words in paragraph_tuples
variant_paragraph_tuples = []
for sent_tuples, brackets in paragraph_tuples:
ids, words, tags, heads, labels, ner, cats = sent_tuples
ids, words, tags, heads, labels, ner = sent_tuples
if lower:
words = [w.lower() for w in words]
# single variants
@@ -310,7 +419,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
pair_idx = pair.index(words[word_idx])
words[word_idx] = punct_choices[punct_idx][pair_idx]
variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner, cats), brackets))
variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner), brackets))
# modify raw to match variant_paragraph_tuples
if raw is not None:
variants = []
@@ -329,7 +438,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
variant_raw += raw[raw_idx]
raw_idx += 1
for sent_tuples, brackets in variant_paragraph_tuples:
ids, words, tags, heads, labels, ner, cats = sent_tuples
ids, words, tags, heads, labels, ner = sent_tuples
for word in words:
match_found = False
# add identical word
@@ -400,6 +509,9 @@ def json_to_tuple(doc):
paragraphs = []
for paragraph in doc["paragraphs"]:
sents = []
cats = {}
for cat in paragraph.get("cats", {}):
cats[cat["label"]] = cat["value"]
for sent in paragraph["sentences"]:
words = []
ids = []
@@ -419,11 +531,7 @@ def json_to_tuple(doc):
ner.append(token.get("ner", "-"))
sents.append([
[ids, words, tags, heads, labels, ner],
sent.get("brackets", [])])
cats = {}
for cat in paragraph.get("cats", {}):
cats[cat["label"]] = cat["value"]
sents.append(cats)
[cats, sent.get("brackets", [])]])
if sents:
yield [paragraph.get("raw", None), sents]
@@ -537,8 +645,8 @@ cdef class GoldParse:
DOCS: https://spacy.io/api/goldparse
"""
@classmethod
def from_annot_tuples(cls, doc, annot_tuples, make_projective=False):
_, words, tags, heads, deps, entities, cats = annot_tuples
def from_annot_tuples(cls, doc, annot_tuples, cats=None, make_projective=False):
_, words, tags, heads, deps, entities = annot_tuples
return cls(doc, words=words, tags=tags, heads=heads, deps=deps,
entities=entities, cats=cats,
make_projective=make_projective)
@@ -595,9 +703,9 @@ cdef class GoldParse:
if morphology is None:
morphology = [None for _ in words]
if entities is None:
entities = ["-" for _ in doc]
entities = ["-" for _ in words]
elif len(entities) == 0:
entities = ["O" for _ in doc]
entities = ["O" for _ in words]
else:
# Translate the None values to '-', to make processing easier.
# See Issue #2603
@@ -660,7 +768,9 @@ cdef class GoldParse:
self.heads[i] = i+1
self.labels[i] = "subtok"
else:
self.heads[i] = self.gold_to_cand[heads[i2j_multi[i]]]
head_i = heads[i2j_multi[i]]
if head_i:
self.heads[i] = self.gold_to_cand[head_i]
self.labels[i] = deps[i2j_multi[i]]
# Now set NER...This is annoying because if we've split
# got an entity word split into two, we need to adjust the
@@ -748,7 +858,7 @@ def docs_to_json(docs, id=0):
docs (iterable / Doc): The Doc object(s) to convert.
id (int): Id for the JSON.
RETURNS (dict): The data in spaCy's JSON format
- each input doc will be treated as a paragraph in the output doc
"""
if isinstance(docs, Doc):
@@ -804,7 +914,7 @@ def biluo_tags_from_offsets(doc, entities, missing="O"):
"""
# Ensure no overlapping entity labels exist
tokens_in_ents = {}
starts = {token.idx: token.i for token in doc}
ends = {token.idx + len(token): token.i for token in doc}
biluo = ["-" for _ in doc]

spacy/lang/de/tag_map.py

@@ -1,8 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import POS, PUNCT, ADJ, CONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
from ...symbols import POS, PUNCT, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, VERB
TAG_MAP = {
@@ -20,8 +20,8 @@ TAG_MAP = {
"CARD": {POS: NUM, "NumType": "card"},
"FM": {POS: X, "Foreign": "yes"},
"ITJ": {POS: INTJ},
"KOKOM": {POS: CONJ, "ConjType": "comp"},
"KON": {POS: CONJ},
"KOKOM": {POS: CCONJ, "ConjType": "comp"},
"KON": {POS: CCONJ},
"KOUI": {POS: SCONJ},
"KOUS": {POS: SCONJ},
"NE": {POS: PROPN},
@@ -43,7 +43,7 @@ TAG_MAP = {
"PTKA": {POS: PART},
"PTKANT": {POS: PART, "PartType": "res"},
"PTKNEG": {POS: PART, "Polarity": "neg"},
"PTKVZ": {POS: PART, "PartType": "vbp"},
"PTKVZ": {POS: ADP, "PartType": "vbp"},
"PTKZU": {POS: PART, "PartType": "inf"},
"PWAT": {POS: DET, "PronType": "int"},
"PWAV": {POS: ADV, "PronType": "int"},

spacy/lang/en/tag_map.py

@@ -2,7 +2,7 @@
from __future__ import unicode_literals
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
TAG_MAP = {
@@ -28,8 +28,8 @@ TAG_MAP = {
"JJR": {POS: ADJ, "Degree": "comp"},
"JJS": {POS: ADJ, "Degree": "sup"},
"LS": {POS: X, "NumType": "ord"},
"MD": {POS: AUX, "VerbType": "mod"},
"NIL": {POS: ""},
"MD": {POS: VERB, "VerbType": "mod"},
"NIL": {POS: X},
"NN": {POS: NOUN, "Number": "sing"},
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
@@ -37,7 +37,7 @@ TAG_MAP = {
"PDT": {POS: DET},
"POS": {POS: PART, "Poss": "yes"},
"PRP": {POS: PRON, "PronType": "prs"},
"PRP$": {POS: PRON, "PronType": "prs", "Poss": "yes"},
"PRP$": {POS: DET, "PronType": "prs", "Poss": "yes"},
"RB": {POS: ADV, "Degree": "pos"},
"RBR": {POS: ADV, "Degree": "comp"},
"RBS": {POS: ADV, "Degree": "sup"},
@@ -58,9 +58,9 @@ TAG_MAP = {
"Number": "sing",
"Person": "three",
},
"WDT": {POS: PRON},
"WDT": {POS: DET},
"WP": {POS: PRON},
"WP$": {POS: PRON, "Poss": "yes"},
"WP$": {POS: DET, "Poss": "yes"},
"WRB": {POS: ADV},
"ADD": {POS: X},
"NFP": {POS: PUNCT},

spacy/language.py

@@ -18,13 +18,8 @@ from .tokenizer import Tokenizer
from .vocab import Vocab
from .lemmatizer import Lemmatizer
from .lookups import Lookups
from .pipeline import DependencyParser, Tagger
from .pipeline import Tensorizer, EntityRecognizer, EntityLinker
from .pipeline import SimilarityHook, TextCategorizer, Sentencizer
from .pipeline import merge_noun_chunks, merge_entities, merge_subtokens
from .pipeline import EntityRuler
from .pipeline import Morphologizer
from .compat import izip, basestring_, is_python2
from .analysis import analyze_pipes, analyze_all_pipes, validate_attrs
from .compat import izip, basestring_, is_python2, class_types
from .gold import GoldParse
from .scorer import Scorer
from ._ml import link_vectors_to_models, create_default_optimizer
@@ -40,6 +35,9 @@ from . import util
from . import about
ENABLE_PIPELINE_ANALYSIS = False
class BaseDefaults(object):
@classmethod
def create_lemmatizer(cls, nlp=None, lookups=None):
@@ -133,22 +131,7 @@ class Language(object):
Defaults = BaseDefaults
lang = None
factories = {
"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp),
"tensorizer": lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg),
"tagger": lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
"morphologizer": lambda nlp, **cfg: Morphologizer(nlp.vocab, **cfg),
"parser": lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
"ner": lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
"entity_linker": lambda nlp, **cfg: EntityLinker(nlp.vocab, **cfg),
"similarity": lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
"textcat": lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg),
"sentencizer": lambda nlp, **cfg: Sentencizer(**cfg),
"merge_noun_chunks": lambda nlp, **cfg: merge_noun_chunks,
"merge_entities": lambda nlp, **cfg: merge_entities,
"merge_subtokens": lambda nlp, **cfg: merge_subtokens,
"entity_ruler": lambda nlp, **cfg: EntityRuler(nlp, **cfg),
}
factories = {"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp)}
def __init__(
self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs
@@ -218,6 +201,7 @@ class Language(object):
"name": self.vocab.vectors.name,
}
self._meta["pipeline"] = self.pipe_names
self._meta["factories"] = self.pipe_factories
self._meta["labels"] = self.pipe_labels
return self._meta
@@ -259,6 +243,17 @@
"""
return [pipe_name for pipe_name, _ in self.pipeline]
@property
def pipe_factories(self):
"""Get the component factories for the available pipeline components.
RETURNS (dict): Factory names, keyed by component names.
"""
factories = {}
for pipe_name, pipe in self.pipeline:
factories[pipe_name] = getattr(pipe, "factory", pipe_name)
return factories
@property
def pipe_labels(self):
"""Get the labels set by the pipeline components, if available (if
@@ -327,33 +322,30 @@ class Language(object):
msg += Errors.E004.format(component=component)
raise ValueError(msg)
if name is None:
if hasattr(component, "name"):
name = component.name
elif hasattr(component, "__name__"):
name = component.__name__
elif hasattr(component, "__class__") and hasattr(
component.__class__, "__name__"
):
name = component.__class__.__name__
else:
name = repr(component)
name = util.get_component_name(component)
if name in self.pipe_names:
raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names))
if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
raise ValueError(Errors.E006)
pipe_index = 0
pipe = (name, component)
if last or not any([first, before, after]):
pipe_index = len(self.pipeline)
self.pipeline.append(pipe)
elif first:
self.pipeline.insert(0, pipe)
elif before and before in self.pipe_names:
pipe_index = self.pipe_names.index(before)
self.pipeline.insert(self.pipe_names.index(before), pipe)
elif after and after in self.pipe_names:
pipe_index = self.pipe_names.index(after) + 1
self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
else:
raise ValueError(
Errors.E001.format(name=before or after, opts=self.pipe_names)
)
if ENABLE_PIPELINE_ANALYSIS:
analyze_pipes(self.pipeline, name, component, pipe_index)
def has_pipe(self, name):
"""Check if a component name is present in the pipeline. Equivalent to
@@ -382,6 +374,8 @@ class Language(object):
msg += Errors.E135.format(name=name)
raise ValueError(msg)
self.pipeline[self.pipe_names.index(name)] = (name, component)
if ENABLE_PIPELINE_ANALYSIS:
analyze_all_pipes(self.pipeline)
def rename_pipe(self, old_name, new_name):
"""Rename a pipeline component.
@@ -408,7 +402,10 @@
"""
if name not in self.pipe_names:
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
return self.pipeline.pop(self.pipe_names.index(name))
removed = self.pipeline.pop(self.pipe_names.index(name))
if ENABLE_PIPELINE_ANALYSIS:
analyze_all_pipes(self.pipeline)
return removed
def __call__(self, text, disable=[], component_cfg=None):
"""Apply the pipeline to some text. The text can span multiple sentences,
@ -448,6 +445,8 @@ class Language(object):
DOCS: https://spacy.io/api/language#disable_pipes
"""
if len(names) == 1 and isinstance(names[0], (list, tuple)):
names = names[0] # support list of names instead of spread
return DisabledPipes(self, *names)
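Both call styles are now accepted; a brief sketch (the pipe names assume a model that has a tagger and parser):

# Spread arguments (existing API) and a single list (new) are equivalent:
with nlp.disable_pipes("tagger", "parser"):
    doc = nlp("Some text")
with nlp.disable_pipes(["tagger", "parser"]):
    doc = nlp("Some text")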
def make_doc(self, text):
@ -999,6 +998,52 @@ class Language(object):
return self
class component(object):
"""Decorator for pipeline components. Can decorate both function components
and class components and will automatically register components in the
Language.factories. If the component is a class and needs access to the
nlp object or config parameters, it can expose a from_nlp classmethod
that takes the nlp object and **cfg arguments and returns the initialized
component.
"""
# NB: This decorator needs to live here, because it needs to write to
# Language.factories. All other solutions would cause a circular import.
def __init__(self, name=None, assigns=tuple(), requires=tuple(), retokenizes=False):
"""Decorate a pipeline component.
name (unicode): Default component and factory name.
assigns (list): Attributes assigned by component, e.g. `["token.pos"]`.
requires (list): Attributes required by component, e.g. `["token.dep"]`.
retokenizes (bool): Whether the component changes the tokenization.
"""
self.name = name
self.assigns = validate_attrs(assigns)
self.requires = validate_attrs(requires)
self.retokenizes = retokenizes
def __call__(self, *args, **kwargs):
obj = args[0]
args = args[1:]
factory_name = self.name or util.get_component_name(obj)
obj.name = factory_name
obj.factory = factory_name
obj.assigns = self.assigns
obj.requires = self.requires
obj.retokenizes = self.retokenizes
def factory(nlp, **cfg):
if hasattr(obj, "from_nlp"):
return obj.from_nlp(nlp, **cfg)
elif isinstance(obj, class_types):
return obj()
return obj
Language.factories[obj.factory] = factory
return obj
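A minimal usage sketch for the decorator (the component name and extension attribute are hypothetical):

from spacy.language import component

@component("my_component", assigns=["token._.my_attr"])
def my_component(doc):
    # A no-op function component; the decorator registers a factory for it
    # in Language.factories under "my_component".
    return doc

# Later, the component can be recreated by name:
# nlp.add_pipe(nlp.create_pipe("my_component"))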
def _fix_pretrained_vectors_name(nlp):
# TODO: Replace this once we handle vectors consistently as static
# data


@ -102,7 +102,10 @@ cdef class DependencyMatcher:
visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True
idx = idx + 1
def add(self, key, on_match, *patterns):
def add(self, key, patterns, *_patterns, on_match=None):
if patterns is None or hasattr(patterns, "__call__"): # old API
on_match = patterns
patterns = _patterns
for pattern in patterns:
if len(pattern) == 0:
raise ValueError(Errors.E012.format(key=key))


@ -74,7 +74,7 @@ cdef class Matcher:
"""
return self._normalize_key(key) in self._patterns
def add(self, key, on_match, *patterns):
def add(self, key, patterns, *_patterns, on_match=None):
"""Add a match-rule to the matcher. A match-rule consists of: an ID
key, an on_match callback, and one or more patterns.
@ -98,16 +98,29 @@ cdef class Matcher:
operator will behave non-greedily. This quirk in the semantics makes
the matcher more efficient, by avoiding the need for back-tracking.
As of spaCy v2.2.2, Matcher.add supports the future API, which takes
the patterns as the second argument, as a list (instead of a variable
number of arguments). The on_match callback becomes an optional keyword
argument.
key (unicode): The match ID.
on_match (callable): Callback executed on match.
*patterns (list): List of token descriptions.
patterns (list): The patterns to add for the given key.
on_match (callable): Optional callback executed on match.
*_patterns (list): For backwards compatibility: list of patterns to add
as variable arguments. Will be ignored if a list of patterns is
provided as the second argument.
"""
errors = {}
if on_match is not None and not hasattr(on_match, "__call__"):
raise ValueError(Errors.E171.format(arg_type=type(on_match)))
if patterns is None or hasattr(patterns, "__call__"): # old API
on_match = patterns
patterns = _patterns
for i, pattern in enumerate(patterns):
if len(pattern) == 0:
raise ValueError(Errors.E012.format(key=key))
if not isinstance(pattern, list):
raise ValueError(Errors.E178.format(pat=pattern, key=key))
if self.validator:
errors[i] = validate_json(pattern, self.validator)
if any(err for err in errors.values()):
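For reference, a sketch of the two call styles side by side (the pattern content is illustrative):

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
patterns = [[{"LOWER": "hello"}, {"LOWER": "world"}]]
matcher.add("HELLO_WORLD", patterns)           # new API (v2.2.2+): list of patterns
# matcher.add("HELLO_WORLD", None, *patterns)  # old API, still accepted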


@ -152,16 +152,27 @@ cdef class PhraseMatcher:
del self._callbacks[key]
del self._docs[key]
def add(self, key, on_match, *docs):
def add(self, key, docs, *_docs, on_match=None):
"""Add a match-rule to the phrase-matcher. A match-rule consists of: an ID
key, an on_match callback, and one or more patterns.
As of spaCy v2.2.2, PhraseMatcher.add supports the future API, which
makes the patterns the second argument and a list (instead of a variable
number of arguments). The on_match callback becomes an optional keyword
argument.
key (unicode): The match ID.
docs (list): List of `Doc` objects representing match patterns.
on_match (callable): Callback executed on match.
*docs (Doc): `Doc` objects representing match patterns.
*_docs (Doc): For backwards compatibility: list of patterns to add
as variable arguments. Will be ignored if a list of patterns is
provided as the second argument.
DOCS: https://spacy.io/api/phrasematcher#add
"""
if docs is None or hasattr(docs, "__call__"): # old API
on_match = docs
docs = _docs
_ = self.vocab[key]
self._callbacks[key] = on_match
@ -171,6 +182,8 @@ cdef class PhraseMatcher:
cdef MapStruct* internal_node
cdef void* result
if isinstance(docs, Doc):
raise ValueError(Errors.E179.format(key=key))
for doc in docs:
if len(doc) == 0:
continue
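The same migration applies here; a sketch with illustrative phrases:

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in ["Barack Obama", "Angela Merkel"]]
matcher.add("NAMES", patterns)                 # new API: list of Doc patterns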

spacy/ml/__init__.py

@ -0,0 +1,5 @@
# coding: utf8
from __future__ import unicode_literals
from .tok2vec import Tok2Vec # noqa: F401
from .common import FeedForward, LayerNormalizedMaxout # noqa: F401

spacy/ml/_legacy_tok2vec.py

@ -0,0 +1,131 @@
# coding: utf8
from __future__ import unicode_literals
from thinc.v2v import Model, Maxout
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow
from thinc.misc import Residual
from thinc.misc import LayerNorm as LN
from thinc.misc import FeatureExtracter
from thinc.api import layerize, chain, clone, concatenate, with_flatten
from thinc.api import uniqued, wrap, noop
from ..attrs import ID, ORTH, NORM, PREFIX, SUFFIX, SHAPE
def Tok2Vec(width, embed_size, **kwargs):
# Circular imports :(
from .._ml import CharacterEmbed
from .._ml import PyTorchBiLSTM
pretrained_vectors = kwargs.get("pretrained_vectors", None)
cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
subword_features = kwargs.get("subword_features", True)
char_embed = kwargs.get("char_embed", False)
if char_embed:
subword_features = False
conv_depth = kwargs.get("conv_depth", 4)
bilstm_depth = kwargs.get("bilstm_depth", 0)
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
with Model.define_operators({">>": chain, "|": concatenate, "**": clone}):
norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm")
if subword_features:
prefix = HashEmbed(
width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix"
)
suffix = HashEmbed(
width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix"
)
shape = HashEmbed(
width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape"
)
else:
prefix, suffix, shape = (None, None, None)
if pretrained_vectors is not None:
glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
if subword_features:
embed = uniqued(
(glove | norm | prefix | suffix | shape)
>> LN(Maxout(width, width * 5, pieces=3)),
column=cols.index(ORTH),
)
else:
embed = uniqued(
(glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
column=cols.index(ORTH),
)
elif subword_features:
embed = uniqued(
(norm | prefix | suffix | shape)
>> LN(Maxout(width, width * 4, pieces=3)),
column=cols.index(ORTH),
)
elif char_embed:
embed = concatenate_lists(
CharacterEmbed(nM=64, nC=8),
FeatureExtracter(cols) >> with_flatten(norm),
)
reduce_dimensions = LN(
Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces)
)
else:
embed = norm
convolution = Residual(
ExtractWindow(nW=1)
>> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces))
)
if char_embed:
tok2vec = embed >> with_flatten(
reduce_dimensions >> convolution ** conv_depth, pad=conv_depth
)
else:
tok2vec = FeatureExtracter(cols) >> with_flatten(
embed >> convolution ** conv_depth, pad=conv_depth
)
if bilstm_depth >= 1:
tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth)
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
tok2vec.nO = width
tok2vec.embed = embed
return tok2vec
@layerize
def flatten(seqs, drop=0.0):
ops = Model.ops
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
def finish_update(d_X, sgd=None):
return ops.unflatten(d_X, lengths, pad=0)
X = ops.flatten(seqs, pad=0)
return X, finish_update
def concatenate_lists(*layers, **kwargs): # pragma: no cover
"""Compose two or more models `f`, `g`, etc, such that their outputs are
concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
"""
if not layers:
return noop()
drop_factor = kwargs.get("drop_factor", 1.0)
ops = layers[0].ops
layers = [chain(layer, flatten) for layer in layers]
concat = concatenate(*layers)
def concatenate_lists_fwd(Xs, drop=0.0):
if drop is not None:
drop *= drop_factor
lengths = ops.asarray([len(X) for X in Xs], dtype="i")
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
ys = ops.unflatten(flat_y, lengths)
def concatenate_lists_bwd(d_ys, sgd=None):
return bp_flat_y(ops.flatten(d_ys), sgd=sgd)
return ys, concatenate_lists_bwd
model = wrap(concatenate_lists_fwd, concat)
return model

spacy/ml/_wire.py

@ -0,0 +1,42 @@
from __future__ import unicode_literals
from thinc.api import layerize, wrap, noop, chain, concatenate
from thinc.v2v import Model
def concatenate_lists(*layers, **kwargs): # pragma: no cover
"""Compose two or more models `f`, `g`, etc, such that their outputs are
concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
"""
if not layers:
return layerize(noop())
drop_factor = kwargs.get("drop_factor", 1.0)
ops = layers[0].ops
layers = [chain(layer, flatten) for layer in layers]
concat = concatenate(*layers)
def concatenate_lists_fwd(Xs, drop=0.0):
if drop is not None:
drop *= drop_factor
lengths = ops.asarray([len(X) for X in Xs], dtype="i")
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
ys = ops.unflatten(flat_y, lengths)
def concatenate_lists_bwd(d_ys, sgd=None):
return bp_flat_y(ops.flatten(d_ys), sgd=sgd)
return ys, concatenate_lists_bwd
model = wrap(concatenate_lists_fwd, concat)
return model
@layerize
def flatten(seqs, drop=0.0):
ops = Model.ops
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
def finish_update(d_X, sgd=None):
return ops.unflatten(d_X, lengths, pad=0)
X = ops.flatten(seqs, pad=0)
return X, finish_update

spacy/ml/common.py

@ -0,0 +1,23 @@
from __future__ import unicode_literals
from thinc.api import chain
from thinc.v2v import Maxout
from thinc.misc import LayerNorm
from ..util import register_architecture, make_layer
@register_architecture("thinc.FeedForward.v1")
def FeedForward(config):
layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]]
model = chain(*layers)
model.cfg = config
return model
@register_architecture("spacy.LayerNormalizedMaxout.v1")
def LayerNormalizedMaxout(config):
width = config["width"]
pieces = config["pieces"]
layer = LayerNorm(Maxout(width, pieces=pieces))
layer.nO = width
return layer
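A sketch of how these two registry entries compose (widths and pieces are illustrative; the missing input shapes are assumed to be inferred by thinc on first use):

ff = FeedForward({
    "layers": [
        {"arch": "spacy.LayerNormalizedMaxout.v1", "config": {"width": 96, "pieces": 3}},
        {"arch": "spacy.LayerNormalizedMaxout.v1", "config": {"width": 96, "pieces": 3}},
    ]
})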

spacy/ml/tok2vec.py

@ -0,0 +1,176 @@
from __future__ import unicode_literals
from thinc.api import chain, layerize, clone, concatenate, with_flatten, uniqued
from thinc.api import noop, with_square_sequences
from thinc.v2v import Maxout, Model
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow
from thinc.misc import Residual, LayerNorm, FeatureExtracter
from ..util import make_layer, register_architecture
from ._wire import concatenate_lists
@register_architecture("spacy.Tok2Vec.v1")
def Tok2Vec(config):
doc2feats = make_layer(config["@doc2feats"])
embed = make_layer(config["@embed"])
encode = make_layer(config["@encode"])
field_size = getattr(encode, "receptive_field", 0)
tok2vec = chain(doc2feats, with_flatten(chain(embed, encode), pad=field_size))
tok2vec.cfg = config
tok2vec.nO = encode.nO
tok2vec.embed = embed
tok2vec.encode = encode
return tok2vec
@register_architecture("spacy.Doc2Feats.v1")
def Doc2Feats(config):
columns = config["columns"]
return FeatureExtracter(columns)
@register_architecture("spacy.MultiHashEmbed.v1")
def MultiHashEmbed(config):
# For backwards compatibility with models built before the architecture
# registry, we have to be careful to get exactly the same model structure.
# One subtle point is that the `|` operator is binary and left-associative,
# so writing (a | b | c) actually produces concatenate(concatenate(a, b), c),
# not concatenate(a, b, c). That's why the implementation is a bit ugly here.
cols = config["columns"]
width = config["width"]
rows = config["rows"]
norm = HashEmbed(width, rows, column=cols.index("NORM"), name="embed_norm")
if config["use_subwords"]:
prefix = HashEmbed(
width, rows // 2, column=cols.index("PREFIX"), name="embed_prefix"
)
suffix = HashEmbed(
width, rows // 2, column=cols.index("SUFFIX"), name="embed_suffix"
)
shape = HashEmbed(
width, rows // 2, column=cols.index("SHAPE"), name="embed_shape"
)
if config.get("@pretrained_vectors"):
glove = make_layer(config["@pretrained_vectors"])
mix = make_layer(config["@mix"])
with Model.define_operators({">>": chain, "|": concatenate}):
if config["use_subwords"] and config["@pretrained_vectors"]:
mix._layers[0].nI = width * 5
layer = uniqued(
(glove | norm | prefix | suffix | shape) >> mix,
column=cols.index("ORTH"),
)
elif config["use_subwords"]:
mix._layers[0].nI = width * 4
layer = uniqued(
(norm | prefix | suffix | shape) >> mix, column=cols.index("ORTH")
)
elif config["@pretrained_vectors"]:
mix._layers[0].nI = width * 2
layer = uniqued((glove | norm) >> mix, column=cols.index("ORTH"))
else:
layer = norm
layer.cfg = config
return layer
@register_architecture("spacy.CharacterEmbed.v1")
def CharacterEmbed(config):
from .. import _ml
width = config["width"]
chars = config["chars"]
chr_embed = _ml.CharacterEmbedModel(nM=width, nC=chars)
other_tables = make_layer(config["@embed_features"])
mix = make_layer(config["@mix"])
model = chain(concatenate_lists(chr_embed, other_tables), mix)
model.cfg = config
return model
@register_architecture("spacy.MaxoutWindowEncoder.v1")
def MaxoutWindowEncoder(config):
nO = config["width"]
nW = config["window_size"]
nP = config["pieces"]
depth = config["depth"]
cnn = chain(
ExtractWindow(nW=nW), LayerNorm(Maxout(nO, nO * ((nW * 2) + 1), pieces=nP))
)
model = clone(Residual(cnn), depth)
model.nO = nO
model.receptive_field = nW * depth
return model
@register_architecture("spacy.MishWindowEncoder.v1")
def MishWindowEncoder(config):
from thinc.v2v import Mish
nO = config["width"]
nW = config["window_size"]
depth = config["depth"]
cnn = chain(ExtractWindow(nW=nW), LayerNorm(Mish(nO, nO * ((nW * 2) + 1))))
model = clone(Residual(cnn), depth)
model.nO = nO
return model
@register_architecture("spacy.PretrainedVectors.v1")
def PretrainedVectors(config):
return StaticVectors(config["vectors_name"], config["width"], config["column"])
@register_architecture("spacy.TorchBiLSTMEncoder.v1")
def TorchBiLSTMEncoder(config):
import torch.nn
from thinc.extra.wrappers import PyTorchWrapperRNN
width = config["width"]
depth = config["depth"]
if depth == 0:
return layerize(noop())
return with_square_sequences(
PyTorchWrapperRNN(torch.nn.LSTM(width, width // 2, depth, bidirectional=True))
)
_EXAMPLE_CONFIG = {
"@doc2feats": {
"arch": "Doc2Feats",
"config": {"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]},
},
"@embed": {
"arch": "spacy.MultiHashEmbed.v1",
"config": {
"width": 96,
"rows": 2000,
"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"],
"use_subwords": True,
"@pretrained_vectors": {
"arch": "TransformedStaticVectors",
"config": {
"vectors_name": "en_vectors_web_lg.vectors",
"width": 96,
"column": 0,
},
},
"@mix": {
"arch": "LayerNormalizedMaxout",
"config": {"width": 96, "pieces": 3},
},
},
},
"@encode": {
"arch": "MaxoutWindowEncode",
"config": {"width": 96, "window_size": 1, "depth": 4, "pieces": 3},
},
}
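A minimal sketch of the registry mechanism these functions rely on (the architecture name and layer are hypothetical; make_layer is assumed to resolve the "arch" name in the registry and call it with the inner "config" dict):

from thinc.v2v import Affine
from spacy.util import register_architecture, make_layer

@register_architecture("demo.Affine.v1")
def DemoAffine(config):
    # Architectures receive the plain config dict and return a thinc model.
    return Affine(config["nO"], config["nI"])

layer = make_layer({"arch": "demo.Affine.v1", "config": {"nO": 8, "nI": 4}})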


@ -4,6 +4,7 @@ from __future__ import unicode_literals
from collections import defaultdict, OrderedDict
import srsly
from ..language import component
from ..errors import Errors
from ..compat import basestring_
from ..util import ensure_path, to_disk, from_disk
@ -13,6 +14,7 @@ from ..matcher import Matcher, PhraseMatcher
DEFAULT_ENT_ID_SEP = "||"
@component("entity_ruler", assigns=["doc.ents", "token.ent_type", "token.ent_iob"])
class EntityRuler(object):
"""The EntityRuler lets you add spans to the `Doc.ents` using token-based
rules or exact phrase matches. It can be combined with the statistical
@ -24,8 +26,6 @@ class EntityRuler(object):
USAGE: https://spacy.io/usage/rule-based-matching#entityruler
"""
name = "entity_ruler"
def __init__(self, nlp, phrase_matcher_attr=None, validate=False, **cfg):
"""Initialize the entitiy ruler. If patterns are supplied here, they
need to be a list of dictionaries with a `"label"` and `"pattern"`
@ -64,10 +64,15 @@ class EntityRuler(object):
self.phrase_matcher_attr = None
self.phrase_matcher = PhraseMatcher(nlp.vocab, validate=validate)
self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
self._ent_ids = defaultdict(dict)
patterns = cfg.get("patterns")
if patterns is not None:
self.add_patterns(patterns)
@classmethod
def from_nlp(cls, nlp, **cfg):
return cls(nlp, **cfg)
def __len__(self):
"""The number of all patterns added to the entity ruler."""
n_token_patterns = sum(len(p) for p in self.token_patterns.values())
@ -100,10 +105,9 @@ class EntityRuler(object):
continue
# check for end - 1 here because boundaries are inclusive
if start not in seen_tokens and end - 1 not in seen_tokens:
if self.ent_ids:
label_ = self.nlp.vocab.strings[match_id]
ent_label, ent_id = self._split_label(label_)
span = Span(doc, start, end, label=ent_label)
if match_id in self._ent_ids:
label, ent_id = self._ent_ids[match_id]
span = Span(doc, start, end, label=label)
if ent_id:
for token in span:
token.ent_id_ = ent_id
@ -131,11 +135,11 @@ class EntityRuler(object):
@property
def ent_ids(self):
"""All entity ids present in the match patterns meta dicts.
"""All entity ids present in the match patterns `id` properties.
RETURNS (set): The string entity ids.
DOCS: https://spacy.io/api/entityruler#labels
DOCS: https://spacy.io/api/entityruler#ent_ids
"""
all_ent_ids = set()
for l in self.labels:
@ -147,7 +151,6 @@ class EntityRuler(object):
@property
def patterns(self):
"""Get all patterns that were added to the entity ruler.
RETURNS (list): The original patterns, one dictionary per pattern.
DOCS: https://spacy.io/api/entityruler#patterns
@ -188,11 +191,15 @@ class EntityRuler(object):
]
except ValueError:
subsequent_pipes = []
with self.nlp.disable_pipes(*subsequent_pipes):
with self.nlp.disable_pipes(subsequent_pipes):
for entry in patterns:
label = entry["label"]
if "id" in entry:
ent_label = label
label = self._create_label(label, entry["id"])
key = self.matcher._normalize_key(label)
self._ent_ids[key] = (ent_label, entry["id"])
pattern = entry["pattern"]
if isinstance(pattern, basestring_):
self.phrase_patterns[label].append(self.nlp(pattern))
@ -201,9 +208,9 @@ class EntityRuler(object):
else:
raise ValueError(Errors.E097.format(pattern=pattern))
for label, patterns in self.token_patterns.items():
self.matcher.add(label, None, *patterns)
self.matcher.add(label, patterns)
for label, patterns in self.phrase_patterns.items():
self.phrase_matcher.add(label, None, *patterns)
self.phrase_matcher.add(label, patterns)
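For reference, a sketch of patterns carrying an "id" (labels, patterns and ids are illustrative); the code above indexes them in self._ent_ids so matches can set token.ent_id_:

ruler = EntityRuler(nlp)
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}],
     "id": "san-francisco"},
])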
def _split_label(self, label):
"""Split Entity label into ent_label and ent_id if it contains self.ent_id_sep


@ -1,9 +1,16 @@
# coding: utf8
from __future__ import unicode_literals
from ..language import component
from ..matcher import Matcher
from ..util import filter_spans
@component(
"merge_noun_chunks",
requires=["token.dep", "token.tag", "token.pos"],
retokenizes=True,
)
def merge_noun_chunks(doc):
"""Merge noun chunks into a single token.
@ -21,6 +28,11 @@ def merge_noun_chunks(doc):
return doc
@component(
"merge_entities",
requires=["doc.ents", "token.ent_iob", "token.ent_type"],
retokenizes=True,
)
def merge_entities(doc):
"""Merge entities into a single token.
@ -36,6 +48,7 @@ def merge_entities(doc):
return doc
@component("merge_subtokens", requires=["token.dep"], retokenizes=True)
def merge_subtokens(doc, label="subtok"):
"""Merge subtokens into a single token.
@ -48,7 +61,7 @@ def merge_subtokens(doc, label="subtok"):
merger = Matcher(doc.vocab)
merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}])
matches = merger(doc)
spans = [doc[start : end + 1] for _, start, end in matches]
spans = filter_spans([doc[start : end + 1] for _, start, end in matches])
with doc.retokenize() as retokenizer:
for span in spans:
retokenizer.merge(span)
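Because these functions are registered as factories, they can be added to a pipeline by name; a brief sketch (assumes a model that sets entities):

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(nlp.create_pipe("merge_entities"), after="ner")
doc = nlp("Angela Merkel visited San Francisco")
# "San Francisco" is now a single token carrying the entity annotation.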


@ -5,9 +5,11 @@ from thinc.t2v import Pooling, max_pool, mean_pool
from thinc.neural._classes.difference import Siamese, CauchySimilarity
from .pipes import Pipe
from ..language import component
from .._ml import link_vectors_to_models
@component("sentencizer_hook", assigns=["doc.user_hooks"])
class SentenceSegmenter(object):
"""A simple spaCy hook, to allow custom sentence boundary detection logic
(that doesn't require the dependency parse). To change the sentence
@ -17,8 +19,6 @@ class SentenceSegmenter(object):
and yield `Span` objects for each sentence.
"""
name = "sentencizer"
def __init__(self, vocab, strategy=None):
self.vocab = vocab
if strategy is None or strategy == "on_punct":
@ -44,6 +44,7 @@ class SentenceSegmenter(object):
yield doc[start : len(doc)]
@component("similarity", assigns=["doc.user_hooks"])
class SimilarityHook(Pipe):
"""
Experimental: A pipeline component to install a hook for supervised
@ -58,8 +59,6 @@ class SimilarityHook(Pipe):
Where W is a vector of dimension weights, initialized to 1.
"""
name = "similarity"
def __init__(self, vocab, model=True, **cfg):
self.vocab = vocab
self.model = model


@ -8,6 +8,7 @@ from thinc.api import chain
from thinc.neural.util import to_categorical, copy_array, get_array_module
from .. import util
from .pipes import Pipe
from ..language import component
from .._ml import Tok2Vec, build_morphologizer_model
from .._ml import link_vectors_to_models, zero_init, flatten
from .._ml import create_default_optimizer
@ -18,9 +19,9 @@ from ..vocab cimport Vocab
from ..morphology cimport Morphology
@component("morphologizer", assigns=["token.morph", "token.pos"])
class Morphologizer(Pipe):
name = 'morphologizer'
@classmethod
def Model(cls, **cfg):
if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'):


@ -13,7 +13,6 @@ from thinc.misc import LayerNorm
from thinc.neural.util import to_categorical
from thinc.neural.util import get_array_module
from .functions import merge_subtokens
from ..tokens.doc cimport Doc
from ..syntax.nn_parser cimport Parser
from ..syntax.ner cimport BiluoPushDown
@ -21,6 +20,8 @@ from ..syntax.arc_eager cimport ArcEager
from ..morphology cimport Morphology
from ..vocab cimport Vocab
from .functions import merge_subtokens
from ..language import Language, component
from ..syntax import nonproj
from ..attrs import POS, ID
from ..parts_of_speech import X
@ -54,6 +55,10 @@ class Pipe(object):
"""Initialize a model for the pipe."""
raise NotImplementedError
@classmethod
def from_nlp(cls, nlp, **cfg):
return cls(nlp.vocab, **cfg)
def __init__(self, vocab, model=True, **cfg):
"""Create a new pipe instance."""
raise NotImplementedError
@ -223,11 +228,10 @@ class Pipe(object):
return self
@component("tensorizer", assigns=["doc.tensor"])
class Tensorizer(Pipe):
"""Pre-train position-sensitive vectors for tokens."""
name = "tensorizer"
@classmethod
def Model(cls, output_size=300, **cfg):
"""Create a new statistical model for the class.
@ -362,14 +366,13 @@ class Tensorizer(Pipe):
return sgd
@component("tagger", assigns=["token.tag", "token.pos"])
class Tagger(Pipe):
"""Pipeline component for part-of-speech tagging.
DOCS: https://spacy.io/api/tagger
"""
name = "tagger"
def __init__(self, vocab, model=True, **cfg):
self.vocab = vocab
self.model = model
@ -514,7 +517,6 @@ class Tagger(Pipe):
orig_tag_map = dict(self.vocab.morphology.tag_map)
new_tag_map = OrderedDict()
for raw_text, annots_brackets in get_gold_tuples():
_ = annots_brackets.pop()
for annots, brackets in annots_brackets:
ids, words, tags, heads, deps, ents = annots
for tag in tags:
@ -657,13 +659,12 @@ class Tagger(Pipe):
return self
@component("nn_labeller")
class MultitaskObjective(Tagger):
"""Experimental: Assist training of a parser or tagger, by training a
side-objective.
"""
name = "nn_labeller"
def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg):
self.vocab = vocab
self.model = model
@ -898,12 +899,12 @@ class ClozeMultitask(Pipe):
losses[self.name] += loss
@component("textcat", assigns=["doc.cats"])
class TextCategorizer(Pipe):
"""Pipeline component for text classification.
DOCS: https://spacy.io/api/textcategorizer
"""
name = 'textcat'
@classmethod
def Model(cls, nr_class=1, **cfg):
@ -1032,10 +1033,10 @@ class TextCategorizer(Pipe):
return 1
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
for raw_text, annots_brackets in get_gold_tuples():
cats = annots_brackets.pop()
for cat in cats:
self.add_label(cat)
for raw_text, annot_brackets in get_gold_tuples():
for _, (cats, _2) in annot_brackets:
for cat in cats:
self.add_label(cat)
if self.model is True:
self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
self.require_labels()
@ -1051,8 +1052,11 @@ cdef class DependencyParser(Parser):
DOCS: https://spacy.io/api/dependencyparser
"""
# cdef classes can't have decorators, so we're defining this here
name = "parser"
factory = "parser"
assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
requires = []
TransitionSystem = ArcEager
@property
@ -1097,8 +1101,10 @@ cdef class EntityRecognizer(Parser):
DOCS: https://spacy.io/api/entityrecognizer
"""
name = "ner"
factory = "ner"
assigns = ["doc.ents", "token.ent_iob", "token.ent_type"]
requires = []
TransitionSystem = BiluoPushDown
nr_feature = 6
@ -1129,12 +1135,16 @@ cdef class EntityRecognizer(Parser):
return tuple(sorted(labels))
@component(
"entity_linker",
requires=["doc.ents", "token.ent_iob", "token.ent_type"],
assigns=["token.ent_kb_id"]
)
class EntityLinker(Pipe):
"""Pipeline component for named entity linking.
DOCS: https://spacy.io/api/entitylinker
"""
name = 'entity_linker'
NIL = "NIL" # string used to refer to a non-existing link
@classmethod
@ -1298,7 +1308,8 @@ class EntityLinker(Pipe):
for ent in sent_doc.ents:
entity_count += 1
if ent.label_ in self.cfg.get("labels_discard", []):
to_discard = self.cfg.get("labels_discard", [])
if to_discard and ent.label_ in to_discard:
# ignoring this entity - setting to NIL
final_kb_ids.append(self.NIL)
final_tensors.append(sentence_encoding)
@ -1404,13 +1415,13 @@ class EntityLinker(Pipe):
raise NotImplementedError
@component("sentencizer", assigns=["token.is_sent_start", "doc.sents"])
class Sentencizer(object):
"""Segment the Doc into sentences using a rule-based strategy.
DOCS: https://spacy.io/api/sentencizer
"""
name = "sentencizer"
default_punct_chars = ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹',
'।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄',
'᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿',
@ -1436,6 +1447,10 @@ class Sentencizer(object):
else:
self.punct_chars = set(self.default_punct_chars)
@classmethod
def from_nlp(cls, nlp, **cfg):
return cls(**cfg)
def __call__(self, doc):
"""Apply the sentencizer to a Doc and set Token.is_sent_start.
@ -1502,4 +1517,9 @@ class Sentencizer(object):
return self
# Cython classes can't be decorated, so we need to add the factories here
Language.factories["parser"] = lambda nlp, **cfg: DependencyParser.from_nlp(nlp, **cfg)
Language.factories["ner"] = lambda nlp, **cfg: EntityRecognizer.from_nlp(nlp, **cfg)
__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"]
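With the factories registered, the Cython components can be created by name like any decorated component; a brief sketch:

from spacy.lang.en import English

nlp = English()
ner = nlp.create_pipe("ner")  # resolved via Language.factories["ner"]
nlp.add_pipe(ner)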


@ -19,7 +19,7 @@ from thinc.extra.search cimport Beam
from thinc.api import chain, clone
from thinc.v2v import Model, Maxout, Affine
from thinc.misc import LayerNorm
from thinc.neural.ops import CupyOps
from thinc.neural.ops import CupyOps, NumpyOps
from thinc.neural.util import get_array_module
from thinc.linalg cimport Vec, VecVec
cimport blis.cy
@ -440,28 +440,38 @@ cdef class precompute_hiddens:
def backward(d_state_vector_ids, sgd=None):
d_state_vector, token_ids = d_state_vector_ids
d_state_vector = bp_nonlinearity(d_state_vector, sgd)
# This will usually be on GPU
if not isinstance(d_state_vector, self.ops.xp.ndarray):
d_state_vector = self.ops.xp.array(d_state_vector)
d_tokens = bp_hiddens((d_state_vector, token_ids), sgd)
return d_tokens
return state_vector, backward
def _nonlinearity(self, state_vector):
if isinstance(state_vector, numpy.ndarray):
ops = NumpyOps()
else:
ops = CupyOps()
if self.nP == 1:
state_vector = state_vector.reshape(state_vector.shape[:-1])
mask = state_vector >= 0.
state_vector *= mask
else:
state_vector, mask = self.ops.maxout(state_vector)
state_vector, mask = ops.maxout(state_vector)
def backprop_nonlinearity(d_best, sgd=None):
if isinstance(d_best, numpy.ndarray):
ops = NumpyOps()
else:
ops = CupyOps()
mask_ = ops.asarray(mask)
# This will usually be on GPU
d_best = ops.asarray(d_best)
# Fix nans (which can occur from unseen classes).
d_best[self.ops.xp.isnan(d_best)] = 0.
d_best[ops.xp.isnan(d_best)] = 0.
if self.nP == 1:
d_best *= mask
d_best *= mask_
d_best = d_best.reshape((d_best.shape + (1,)))
return d_best
else:
return self.ops.backprop_maxout(d_best, mask, self.nP)
return ops.backprop_maxout(d_best, mask_, self.nP)
return state_vector, backprop_nonlinearity
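The dispatch pattern above appears twice; condensed into one helper, it would look roughly like this (assuming the NumpyOps/CupyOps imports already in this module):

import numpy

def _get_ops(arr):
    # CuPy arrays are not numpy.ndarray, so the isinstance check
    # distinguishes CPU-resident from GPU-resident data.
    return NumpyOps() if isinstance(arr, numpy.ndarray) else CupyOps()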


@ -342,7 +342,6 @@ cdef class ArcEager(TransitionSystem):
actions[RIGHT][label] = 1
actions[REDUCE][label] = 1
for raw_text, sents in kwargs.get('gold_parses', []):
_ = sents.pop()
for (ids, words, tags, heads, labels, iob), ctnts in sents:
heads, labels = nonproj.projectivize(heads, labels)
for child, head, label in zip(ids, heads, labels):


@ -73,7 +73,6 @@ cdef class BiluoPushDown(TransitionSystem):
actions[action][entity_type] = 1
moves = ('M', 'B', 'I', 'L', 'U')
for raw_text, sents in kwargs.get('gold_parses', []):
_ = sents.pop()
for (ids, words, tags, heads, labels, biluo), _ in sents:
for i, ner_tag in enumerate(biluo):
if ner_tag != 'O' and ner_tag != '-':


@ -57,7 +57,10 @@ cdef class Parser:
subword_features = util.env_opt('subword_features',
cfg.get('subword_features', True))
conv_depth = util.env_opt('conv_depth', cfg.get('conv_depth', 4))
conv_window = util.env_opt('conv_window', cfg.get('conv_window', 1))
t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3))
bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0))
self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0))
if depth != 1:
raise ValueError(TempErrors.T004.format(value=depth))
parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
@ -69,6 +72,8 @@ cdef class Parser:
pretrained_vectors = cfg.get('pretrained_vectors', None)
tok2vec = Tok2Vec(token_vector_width, embed_size,
conv_depth=conv_depth,
conv_window=conv_window,
cnn_maxout_pieces=t2v_pieces,
subword_features=subword_features,
pretrained_vectors=pretrained_vectors,
bilstm_depth=bilstm_depth)
@ -90,7 +95,12 @@ cdef class Parser:
'hidden_width': hidden_width,
'maxout_pieces': parser_maxout_pieces,
'pretrained_vectors': pretrained_vectors,
'bilstm_depth': bilstm_depth
'bilstm_depth': bilstm_depth,
'self_attn_depth': self_attn_depth,
'conv_depth': conv_depth,
'conv_window': conv_window,
'embed_size': embed_size,
'cnn_maxout_pieces': t2v_pieces
}
return ParserModel(tok2vec, lower, upper), cfg
@ -128,6 +138,10 @@ cdef class Parser:
self._multitasks = []
self._rehearsal_model = None
@classmethod
def from_nlp(cls, nlp, **cfg):
return cls(nlp.vocab, **cfg)
def __reduce__(self):
return (Parser, (self.vocab, self.moves, self.model), None, None)
@ -602,12 +616,11 @@ cdef class Parser:
doc_sample = []
gold_sample = []
for raw_text, annots_brackets in islice(get_gold_tuples(), 1000):
_ = annots_brackets.pop()
for annots, brackets in annots_brackets:
ids, words, tags, heads, deps, ents = annots
doc_sample.append(Doc(self.vocab, words=words))
gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags,
heads=heads, deps=deps, ents=ents))
heads=heads, deps=deps, entities=ents))
self.model.begin_training(doc_sample, gold_sample)
if pipeline is not None:
self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg)


@ -2,9 +2,9 @@
from __future__ import unicode_literals
import pytest
from spacy.lang.sv.syntax_iterators import SYNTAX_ITERATORS
from ...util import get_doc
SV_NP_TEST_EXAMPLES = [
(
"En student läste en bok", # A student read a book
@ -45,4 +45,3 @@ def test_sv_noun_chunks(sv_tokenizer, text, pos, deps, heads, expected_noun_chun
assert len(noun_chunks) == len(expected_noun_chunks)
for i, np in enumerate(noun_chunks):
assert np.text == expected_noun_chunks[i]


@ -17,7 +17,7 @@ def matcher(en_vocab):
}
matcher = Matcher(en_vocab)
for key, patterns in rules.items():
matcher.add(key, None, *patterns)
matcher.add(key, patterns)
return matcher
@ -25,11 +25,11 @@ def test_matcher_from_api_docs(en_vocab):
matcher = Matcher(en_vocab)
pattern = [{"ORTH": "test"}]
assert len(matcher) == 0
matcher.add("Rule", None, pattern)
matcher.add("Rule", [pattern])
assert len(matcher) == 1
matcher.remove("Rule")
assert "Rule" not in matcher
matcher.add("Rule", None, pattern)
matcher.add("Rule", [pattern])
assert "Rule" in matcher
on_match, patterns = matcher.get("Rule")
assert len(patterns[0])
@ -52,7 +52,7 @@ def test_matcher_from_usage_docs(en_vocab):
token.vocab[token.text].norm_ = "happy emoji"
matcher = Matcher(en_vocab)
matcher.add("HAPPY", label_sentiment, *pos_patterns)
matcher.add("HAPPY", pos_patterns, on_match=label_sentiment)
matcher(doc)
assert doc.sentiment != 0
assert doc[1].norm_ == "happy emoji"
@ -60,11 +60,33 @@ def test_matcher_from_usage_docs(en_vocab):
def test_matcher_len_contains(matcher):
assert len(matcher) == 3
matcher.add("TEST", None, [{"ORTH": "test"}])
matcher.add("TEST", [[{"ORTH": "test"}]])
assert "TEST" in matcher
assert "TEST2" not in matcher
def test_matcher_add_new_old_api(en_vocab):
doc = Doc(en_vocab, words=["a", "b"])
patterns = [[{"TEXT": "a"}], [{"TEXT": "a"}, {"TEXT": "b"}]]
matcher = Matcher(en_vocab)
matcher.add("OLD_API", None, *patterns)
assert len(matcher(doc)) == 2
matcher = Matcher(en_vocab)
on_match = Mock()
matcher.add("OLD_API_CALLBACK", on_match, *patterns)
assert len(matcher(doc)) == 2
assert on_match.call_count == 2
# New API: add(key: str, patterns: List[List[dict]], on_match: Callable)
matcher = Matcher(en_vocab)
matcher.add("NEW_API", patterns)
assert len(matcher(doc)) == 2
matcher = Matcher(en_vocab)
on_match = Mock()
matcher.add("NEW_API_CALLBACK", patterns, on_match=on_match)
assert len(matcher(doc)) == 2
assert on_match.call_count == 2
def test_matcher_no_match(matcher):
doc = Doc(matcher.vocab, words=["I", "like", "cheese", "."])
assert matcher(doc) == []
@ -100,12 +122,12 @@ def test_matcher_empty_dict(en_vocab):
"""Test matcher allows empty token specs, meaning match on any token."""
matcher = Matcher(en_vocab)
doc = Doc(matcher.vocab, words=["a", "b", "c"])
matcher.add("A.C", None, [{"ORTH": "a"}, {}, {"ORTH": "c"}])
matcher.add("A.C", [[{"ORTH": "a"}, {}, {"ORTH": "c"}]])
matches = matcher(doc)
assert len(matches) == 1
assert matches[0][1:] == (0, 3)
matcher = Matcher(en_vocab)
matcher.add("A.", None, [{"ORTH": "a"}, {}])
matcher.add("A.", [[{"ORTH": "a"}, {}]])
matches = matcher(doc)
assert matches[0][1:] == (0, 2)
@ -114,7 +136,7 @@ def test_matcher_operator_shadow(en_vocab):
matcher = Matcher(en_vocab)
doc = Doc(matcher.vocab, words=["a", "b", "c"])
pattern = [{"ORTH": "a"}, {"IS_ALPHA": True, "OP": "+"}, {"ORTH": "c"}]
matcher.add("A.C", None, pattern)
matcher.add("A.C", [pattern])
matches = matcher(doc)
assert len(matches) == 1
assert matches[0][1:] == (0, 3)
@ -136,12 +158,12 @@ def test_matcher_match_zero(matcher):
{"IS_PUNCT": True},
{"ORTH": '"'},
]
matcher.add("Quote", None, pattern1)
matcher.add("Quote", [pattern1])
doc = Doc(matcher.vocab, words=words1)
assert len(matcher(doc)) == 1
doc = Doc(matcher.vocab, words=words2)
assert len(matcher(doc)) == 0
matcher.add("Quote", None, pattern2)
matcher.add("Quote", [pattern2])
assert len(matcher(doc)) == 0
@ -149,7 +171,7 @@ def test_matcher_match_zero_plus(matcher):
words = 'He said , " some words " ...'.split()
pattern = [{"ORTH": '"'}, {"OP": "*", "IS_PUNCT": False}, {"ORTH": '"'}]
matcher = Matcher(matcher.vocab)
matcher.add("Quote", None, pattern)
matcher.add("Quote", [pattern])
doc = Doc(matcher.vocab, words=words)
assert len(matcher(doc)) == 1
@ -160,11 +182,8 @@ def test_matcher_match_one_plus(matcher):
doc = Doc(control.vocab, words=["Philippe", "Philippe"])
m = control(doc)
assert len(m) == 2
matcher.add(
"KleenePhilippe",
None,
[{"ORTH": "Philippe", "OP": "1"}, {"ORTH": "Philippe", "OP": "+"}],
)
pattern = [{"ORTH": "Philippe", "OP": "1"}, {"ORTH": "Philippe", "OP": "+"}]
matcher.add("KleenePhilippe", [pattern])
m = matcher(doc)
assert len(m) == 1
@ -172,7 +191,7 @@ def test_matcher_match_one_plus(matcher):
def test_matcher_any_token_operator(en_vocab):
"""Test that patterns with "any token" {} work with operators."""
matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"ORTH": "test"}, {"OP": "*"}])
matcher.add("TEST", [[{"ORTH": "test"}, {"OP": "*"}]])
doc = Doc(en_vocab, words=["test", "hello", "world"])
matches = [doc[start:end].text for _, start, end in matcher(doc)]
assert len(matches) == 3
@ -186,7 +205,7 @@ def test_matcher_extension_attribute(en_vocab):
get_is_fruit = lambda token: token.text in ("apple", "banana")
Token.set_extension("is_fruit", getter=get_is_fruit, force=True)
pattern = [{"ORTH": "an"}, {"_": {"is_fruit": True}}]
matcher.add("HAVING_FRUIT", None, pattern)
matcher.add("HAVING_FRUIT", [pattern])
doc = Doc(en_vocab, words=["an", "apple"])
matches = matcher(doc)
assert len(matches) == 1
@ -198,7 +217,7 @@ def test_matcher_extension_attribute(en_vocab):
def test_matcher_set_value(en_vocab):
matcher = Matcher(en_vocab)
pattern = [{"ORTH": {"IN": ["an", "a"]}}]
matcher.add("A_OR_AN", None, pattern)
matcher.add("A_OR_AN", [pattern])
doc = Doc(en_vocab, words=["an", "a", "apple"])
matches = matcher(doc)
assert len(matches) == 2
@ -210,7 +229,7 @@ def test_matcher_set_value(en_vocab):
def test_matcher_set_value_operator(en_vocab):
matcher = Matcher(en_vocab)
pattern = [{"ORTH": {"IN": ["a", "the"]}, "OP": "?"}, {"ORTH": "house"}]
matcher.add("DET_HOUSE", None, pattern)
matcher.add("DET_HOUSE", [pattern])
doc = Doc(en_vocab, words=["In", "a", "house"])
matches = matcher(doc)
assert len(matches) == 2
@ -222,7 +241,7 @@ def test_matcher_set_value_operator(en_vocab):
def test_matcher_regex(en_vocab):
matcher = Matcher(en_vocab)
pattern = [{"ORTH": {"REGEX": r"(?:a|an)"}}]
matcher.add("A_OR_AN", None, pattern)
matcher.add("A_OR_AN", [pattern])
doc = Doc(en_vocab, words=["an", "a", "hi"])
matches = matcher(doc)
assert len(matches) == 2
@ -234,7 +253,7 @@ def test_matcher_regex(en_vocab):
def test_matcher_regex_shape(en_vocab):
matcher = Matcher(en_vocab)
pattern = [{"SHAPE": {"REGEX": r"^[^x]+$"}}]
matcher.add("NON_ALPHA", None, pattern)
matcher.add("NON_ALPHA", [pattern])
doc = Doc(en_vocab, words=["99", "problems", "!"])
matches = matcher(doc)
assert len(matches) == 2
@ -246,7 +265,7 @@ def test_matcher_regex_shape(en_vocab):
def test_matcher_compare_length(en_vocab):
matcher = Matcher(en_vocab)
pattern = [{"LENGTH": {">=": 2}}]
matcher.add("LENGTH_COMPARE", None, pattern)
matcher.add("LENGTH_COMPARE", [pattern])
doc = Doc(en_vocab, words=["a", "aa", "aaa"])
matches = matcher(doc)
assert len(matches) == 2
@ -260,7 +279,7 @@ def test_matcher_extension_set_membership(en_vocab):
get_reversed = lambda token: "".join(reversed(token.text))
Token.set_extension("reversed", getter=get_reversed, force=True)
pattern = [{"_": {"reversed": {"IN": ["eyb", "ih"]}}}]
matcher.add("REVERSED", None, pattern)
matcher.add("REVERSED", [pattern])
doc = Doc(en_vocab, words=["hi", "bye", "hello"])
matches = matcher(doc)
assert len(matches) == 2
@ -328,9 +347,9 @@ def dependency_matcher(en_vocab):
]
matcher = DependencyMatcher(en_vocab)
matcher.add("pattern1", None, pattern1)
matcher.add("pattern2", None, pattern2)
matcher.add("pattern3", None, pattern3)
matcher.add("pattern1", [pattern1])
matcher.add("pattern2", [pattern2])
matcher.add("pattern3", [pattern3])
return matcher
@ -347,6 +366,14 @@ def test_dependency_matcher_compile(dependency_matcher):
# assert matches[2][1] == [[4, 3, 2]]
def test_matcher_basic_check(en_vocab):
matcher = Matcher(en_vocab)
# Potential mistake: pass in pattern instead of list of patterns
pattern = [{"TEXT": "hello"}, {"TEXT": "world"}]
with pytest.raises(ValueError):
matcher.add("TEST", pattern)
def test_attr_pipeline_checks(en_vocab):
doc1 = Doc(en_vocab, words=["Test"])
doc1.is_parsed = True
@ -355,7 +382,7 @@ def test_attr_pipeline_checks(en_vocab):
doc3 = Doc(en_vocab, words=["Test"])
# DEP requires is_parsed
matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"DEP": "a"}])
matcher.add("TEST", [[{"DEP": "a"}]])
matcher(doc1)
with pytest.raises(ValueError):
matcher(doc2)
@ -364,7 +391,7 @@ def test_attr_pipeline_checks(en_vocab):
# TAG, POS, LEMMA require is_tagged
for attr in ("TAG", "POS", "LEMMA"):
matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{attr: "a"}])
matcher.add("TEST", [[{attr: "a"}]])
matcher(doc2)
with pytest.raises(ValueError):
matcher(doc1)
@ -372,12 +399,12 @@ def test_attr_pipeline_checks(en_vocab):
matcher(doc3)
# TEXT/ORTH only require tokens
matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"ORTH": "a"}])
matcher.add("TEST", [[{"ORTH": "a"}]])
matcher(doc1)
matcher(doc2)
matcher(doc3)
matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"TEXT": "a"}])
matcher.add("TEST", [[{"TEXT": "a"}]])
matcher(doc1)
matcher(doc2)
matcher(doc3)
@ -407,7 +434,7 @@ def test_attr_pipeline_checks(en_vocab):
def test_matcher_schema_token_attributes(en_vocab, pattern, text):
matcher = Matcher(en_vocab)
doc = Doc(en_vocab, words=text.split(" "))
matcher.add("Rule", None, pattern)
matcher.add("Rule", [pattern])
assert len(matcher) == 1
matches = matcher(doc)
assert len(matches) == 1
@ -417,7 +444,7 @@ def test_matcher_valid_callback(en_vocab):
"""Test that on_match can only be None or callable."""
matcher = Matcher(en_vocab)
with pytest.raises(ValueError):
matcher.add("TEST", [], [{"TEXT": "test"}])
matcher.add("TEST", [[{"TEXT": "test"}]], on_match=[])
matcher(Doc(en_vocab, words=["test"]))
@ -425,7 +452,7 @@ def test_matcher_callback(en_vocab):
mock = Mock()
matcher = Matcher(en_vocab)
pattern = [{"ORTH": "test"}]
matcher.add("Rule", mock, pattern)
matcher.add("Rule", [pattern], on_match=mock)
doc = Doc(en_vocab, words=["This", "is", "a", "test", "."])
matches = matcher(doc)
mock.assert_called_once_with(matcher, doc, 0, matches)


@ -55,7 +55,7 @@ def test_greedy_matching(doc, text, pattern, re_pattern):
"""Test that the greedy matching behavior of the * op is consistant with
other re implementations."""
matcher = Matcher(doc.vocab)
matcher.add(re_pattern, None, pattern)
matcher.add(re_pattern, [pattern])
matches = matcher(doc)
re_matches = [m.span() for m in re.finditer(re_pattern, text)]
for match, re_match in zip(matches, re_matches):
@ -77,7 +77,7 @@ def test_match_consuming(doc, text, pattern, re_pattern):
"""Test that matcher.__call__ consumes tokens on a match similar to
re.findall."""
matcher = Matcher(doc.vocab)
matcher.add(re_pattern, None, pattern)
matcher.add(re_pattern, [pattern])
matches = matcher(doc)
re_matches = [m.span() for m in re.finditer(re_pattern, text)]
assert len(matches) == len(re_matches)
@ -111,7 +111,7 @@ def test_operator_combos(en_vocab):
pattern.append({"ORTH": part[0], "OP": "+"})
else:
pattern.append({"ORTH": part})
matcher.add("PATTERN", None, pattern)
matcher.add("PATTERN", [pattern])
matches = matcher(doc)
if result:
assert matches, (string, pattern_str)
@ -123,7 +123,7 @@ def test_matcher_end_zero_plus(en_vocab):
"""Test matcher works when patterns end with * operator. (issue 1450)"""
matcher = Matcher(en_vocab)
pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}]
matcher.add("TSTEND", None, pattern)
matcher.add("TSTEND", [pattern])
nlp = lambda string: Doc(matcher.vocab, words=string.split())
assert len(matcher(nlp("a"))) == 1
assert len(matcher(nlp("a b"))) == 2
@ -140,7 +140,7 @@ def test_matcher_sets_return_correct_tokens(en_vocab):
[{"LOWER": {"IN": ["one"]}}],
[{"LOWER": {"IN": ["two"]}}],
]
matcher.add("TEST", None, *patterns)
matcher.add("TEST", patterns)
doc = Doc(en_vocab, words="zero one two three".split())
matches = matcher(doc)
texts = [Span(doc, s, e, label=L).text for L, s, e in matches]
@ -154,7 +154,7 @@ def test_matcher_remove():
pattern = [{"ORTH": "test"}, {"OP": "?"}]
assert len(matcher) == 0
matcher.add("Rule", None, pattern)
matcher.add("Rule", [pattern])
assert "Rule" in matcher
# should give two matches


@ -50,7 +50,7 @@ def validator():
def test_matcher_pattern_validation(en_vocab, pattern):
matcher = Matcher(en_vocab, validate=True)
with pytest.raises(MatchPatternError):
matcher.add("TEST", None, pattern)
matcher.add("TEST", [pattern])
@pytest.mark.parametrize("pattern,n_errors,_", TEST_PATTERNS)
@ -71,6 +71,6 @@ def test_minimal_pattern_validation(en_vocab, pattern, n_errors, n_min_errors):
matcher = Matcher(en_vocab)
if n_min_errors > 0:
with pytest.raises(ValueError):
matcher.add("TEST", None, pattern)
matcher.add("TEST", [pattern])
elif n_errors == 0:
matcher.add("TEST", None, pattern)
matcher.add("TEST", [pattern])


@ -13,53 +13,75 @@ def test_matcher_phrase_matcher(en_vocab):
# intermediate phrase
pattern = Doc(en_vocab, words=["Google", "Now"])
matcher = PhraseMatcher(en_vocab)
matcher.add("COMPANY", None, pattern)
matcher.add("COMPANY", [pattern])
assert len(matcher(doc)) == 1
# initial token
pattern = Doc(en_vocab, words=["I"])
matcher = PhraseMatcher(en_vocab)
matcher.add("I", None, pattern)
matcher.add("I", [pattern])
assert len(matcher(doc)) == 1
# initial phrase
pattern = Doc(en_vocab, words=["I", "like"])
matcher = PhraseMatcher(en_vocab)
matcher.add("ILIKE", None, pattern)
matcher.add("ILIKE", [pattern])
assert len(matcher(doc)) == 1
# final token
pattern = Doc(en_vocab, words=["best"])
matcher = PhraseMatcher(en_vocab)
matcher.add("BEST", None, pattern)
matcher.add("BEST", [pattern])
assert len(matcher(doc)) == 1
# final phrase
pattern = Doc(en_vocab, words=["Now", "best"])
matcher = PhraseMatcher(en_vocab)
matcher.add("NOWBEST", None, pattern)
matcher.add("NOWBEST", [pattern])
assert len(matcher(doc)) == 1
def test_phrase_matcher_length(en_vocab):
matcher = PhraseMatcher(en_vocab)
assert len(matcher) == 0
matcher.add("TEST", None, Doc(en_vocab, words=["test"]))
matcher.add("TEST", [Doc(en_vocab, words=["test"])])
assert len(matcher) == 1
matcher.add("TEST2", None, Doc(en_vocab, words=["test2"]))
matcher.add("TEST2", [Doc(en_vocab, words=["test2"])])
assert len(matcher) == 2
def test_phrase_matcher_contains(en_vocab):
matcher = PhraseMatcher(en_vocab)
matcher.add("TEST", None, Doc(en_vocab, words=["test"]))
matcher.add("TEST", [Doc(en_vocab, words=["test"])])
assert "TEST" in matcher
assert "TEST2" not in matcher
def test_phrase_matcher_add_new_api(en_vocab):
doc = Doc(en_vocab, words=["a", "b"])
patterns = [Doc(en_vocab, words=["a"]), Doc(en_vocab, words=["a", "b"])]
matcher = PhraseMatcher(en_vocab)
matcher.add("OLD_API", None, *patterns)
assert len(matcher(doc)) == 2
matcher = PhraseMatcher(en_vocab)
on_match = Mock()
matcher.add("OLD_API_CALLBACK", on_match, *patterns)
assert len(matcher(doc)) == 2
assert on_match.call_count == 2
# New API: add(key: str, patterns: List[Doc], on_match: Callable)
matcher = PhraseMatcher(en_vocab)
matcher.add("NEW_API", patterns)
assert len(matcher(doc)) == 2
matcher = PhraseMatcher(en_vocab)
on_match = Mock()
matcher.add("NEW_API_CALLBACK", patterns, on_match=on_match)
assert len(matcher(doc)) == 2
assert on_match.call_count == 2
def test_phrase_matcher_repeated_add(en_vocab):
matcher = PhraseMatcher(en_vocab)
# match ID only gets added once
matcher.add("TEST", None, Doc(en_vocab, words=["like"]))
matcher.add("TEST", None, Doc(en_vocab, words=["like"]))
matcher.add("TEST", None, Doc(en_vocab, words=["like"]))
matcher.add("TEST", None, Doc(en_vocab, words=["like"]))
matcher.add("TEST", [Doc(en_vocab, words=["like"])])
matcher.add("TEST", [Doc(en_vocab, words=["like"])])
matcher.add("TEST", [Doc(en_vocab, words=["like"])])
matcher.add("TEST", [Doc(en_vocab, words=["like"])])
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
assert "TEST" in matcher
assert "TEST2" not in matcher
@ -68,8 +90,8 @@ def test_phrase_matcher_repeated_add(en_vocab):
def test_phrase_matcher_remove(en_vocab):
matcher = PhraseMatcher(en_vocab)
matcher.add("TEST1", None, Doc(en_vocab, words=["like"]))
matcher.add("TEST2", None, Doc(en_vocab, words=["best"]))
matcher.add("TEST1", [Doc(en_vocab, words=["like"])])
matcher.add("TEST2", [Doc(en_vocab, words=["best"])])
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
assert "TEST1" in matcher
assert "TEST2" in matcher
@ -95,9 +117,9 @@ def test_phrase_matcher_remove(en_vocab):
def test_phrase_matcher_overlapping_with_remove(en_vocab):
matcher = PhraseMatcher(en_vocab)
matcher.add("TEST", None, Doc(en_vocab, words=["like"]))
matcher.add("TEST", [Doc(en_vocab, words=["like"])])
# TEST2 is added alongside TEST
matcher.add("TEST2", None, Doc(en_vocab, words=["like"]))
matcher.add("TEST2", [Doc(en_vocab, words=["like"])])
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
assert "TEST" in matcher
assert len(matcher) == 2
@ -122,7 +144,7 @@ def test_phrase_matcher_string_attrs(en_vocab):
pos2 = ["INTJ", "PUNCT", "PRON", "VERB", "NOUN", "ADV", "ADV"]
pattern = get_doc(en_vocab, words=words1, pos=pos1)
matcher = PhraseMatcher(en_vocab, attr="POS")
matcher.add("TEST", None, pattern)
matcher.add("TEST", [pattern])
doc = get_doc(en_vocab, words=words2, pos=pos2)
matches = matcher(doc)
assert len(matches) == 1
@ -140,7 +162,7 @@ def test_phrase_matcher_string_attrs_negative(en_vocab):
pos2 = ["X", "X", "X"]
pattern = get_doc(en_vocab, words=words1, pos=pos1)
matcher = PhraseMatcher(en_vocab, attr="POS")
matcher.add("TEST", None, pattern)
matcher.add("TEST", [pattern])
doc = get_doc(en_vocab, words=words2, pos=pos2)
matches = matcher(doc)
assert len(matches) == 0
@ -151,7 +173,7 @@ def test_phrase_matcher_bool_attrs(en_vocab):
words2 = ["No", "problem", ",", "he", "said", "."]
pattern = Doc(en_vocab, words=words1)
matcher = PhraseMatcher(en_vocab, attr="IS_PUNCT")
matcher.add("TEST", None, pattern)
matcher.add("TEST", [pattern])
doc = Doc(en_vocab, words=words2)
matches = matcher(doc)
assert len(matches) == 2
@ -173,15 +195,15 @@ def test_phrase_matcher_validation(en_vocab):
doc3 = Doc(en_vocab, words=["Test"])
matcher = PhraseMatcher(en_vocab, validate=True)
with pytest.warns(UserWarning):
matcher.add("TEST1", None, doc1)
matcher.add("TEST1", [doc1])
with pytest.warns(UserWarning):
matcher.add("TEST2", None, doc2)
matcher.add("TEST2", [doc2])
with pytest.warns(None) as record:
matcher.add("TEST3", None, doc3)
matcher.add("TEST3", [doc3])
assert not record.list
matcher = PhraseMatcher(en_vocab, attr="POS", validate=True)
with pytest.warns(None) as record:
matcher.add("TEST4", None, doc2)
matcher.add("TEST4", [doc2])
assert not record.list
@ -198,24 +220,24 @@ def test_attr_pipeline_checks(en_vocab):
doc3 = Doc(en_vocab, words=["Test"])
# DEP requires is_parsed
matcher = PhraseMatcher(en_vocab, attr="DEP")
matcher.add("TEST1", None, doc1)
matcher.add("TEST1", [doc1])
with pytest.raises(ValueError):
matcher.add("TEST2", None, doc2)
matcher.add("TEST2", [doc2])
with pytest.raises(ValueError):
matcher.add("TEST3", None, doc3)
matcher.add("TEST3", [doc3])
# TAG, POS, LEMMA require is_tagged
for attr in ("TAG", "POS", "LEMMA"):
matcher = PhraseMatcher(en_vocab, attr=attr)
matcher.add("TEST2", None, doc2)
matcher.add("TEST2", [doc2])
with pytest.raises(ValueError):
matcher.add("TEST1", None, doc1)
matcher.add("TEST1", [doc1])
with pytest.raises(ValueError):
matcher.add("TEST3", None, doc3)
matcher.add("TEST3", [doc3])
# TEXT/ORTH only require tokens
matcher = PhraseMatcher(en_vocab, attr="ORTH")
matcher.add("TEST3", None, doc3)
matcher.add("TEST3", [doc3])
matcher = PhraseMatcher(en_vocab, attr="TEXT")
matcher.add("TEST3", None, doc3)
matcher.add("TEST3", [doc3])
def test_phrase_matcher_callback(en_vocab):
@ -223,7 +245,7 @@ def test_phrase_matcher_callback(en_vocab):
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
pattern = Doc(en_vocab, words=["Google", "Now"])
matcher = PhraseMatcher(en_vocab)
matcher.add("COMPANY", mock, pattern)
matcher.add("COMPANY", [pattern], on_match=mock)
matches = matcher(doc)
mock.assert_called_once_with(matcher, doc, 0, matches)
@ -234,5 +256,13 @@ def test_phrase_matcher_remove_overlapping_patterns(en_vocab):
pattern2 = Doc(en_vocab, words=["this", "is"])
pattern3 = Doc(en_vocab, words=["this", "is", "a"])
pattern4 = Doc(en_vocab, words=["this", "is", "a", "word"])
matcher.add("THIS", None, pattern1, pattern2, pattern3, pattern4)
matcher.add("THIS", [pattern1, pattern2, pattern3, pattern4])
matcher.remove("THIS")
def test_phrase_matcher_basic_check(en_vocab):
matcher = PhraseMatcher(en_vocab)
# Potential mistake: pass in pattern instead of list of patterns
pattern = Doc(en_vocab, words=["hello", "world"])
with pytest.raises(ValueError):
matcher.add("TEST", pattern)


@ -0,0 +1,168 @@
# coding: utf8
from __future__ import unicode_literals
import spacy.language
from spacy.language import Language, component
from spacy.analysis import print_summary, validate_attrs
from spacy.analysis import get_assigns_for_attr, get_requires_for_attr
from spacy.compat import is_python2
from mock import Mock, ANY
import pytest
def test_component_decorator_function():
@component(name="test")
def test_component(doc):
"""docstring"""
return doc
assert test_component.name == "test"
if not is_python2:
assert test_component.__doc__ == "docstring"
assert test_component("foo") == "foo"
def test_component_decorator_class():
@component(name="test")
class TestComponent(object):
"""docstring1"""
foo = "bar"
def __call__(self, doc):
"""docstring2"""
return doc
def custom(self, x):
"""docstring3"""
return x
assert TestComponent.name == "test"
assert TestComponent.foo == "bar"
assert hasattr(TestComponent, "custom")
test_component = TestComponent()
assert test_component.foo == "bar"
assert test_component("foo") == "foo"
assert hasattr(test_component, "custom")
assert test_component.custom("bar") == "bar"
if not is_python2:
assert TestComponent.__doc__ == "docstring1"
assert TestComponent.__call__.__doc__ == "docstring2"
assert TestComponent.custom.__doc__ == "docstring3"
assert test_component.__doc__ == "docstring1"
assert test_component.__call__.__doc__ == "docstring2"
assert test_component.custom.__doc__ == "docstring3"
def test_component_decorator_assigns():
spacy.language.ENABLE_PIPELINE_ANALYSIS = True
@component("c1", assigns=["token.tag", "doc.tensor"])
def test_component1(doc):
return doc
@component(
"c2", requires=["token.tag", "token.pos"], assigns=["token.lemma", "doc.tensor"]
)
def test_component2(doc):
return doc
@component("c3", requires=["token.lemma"], assigns=["token._.custom_lemma"])
def test_component3(doc):
return doc
assert "c1" in Language.factories
assert "c2" in Language.factories
assert "c3" in Language.factories
nlp = Language()
nlp.add_pipe(test_component1)
with pytest.warns(UserWarning):
nlp.add_pipe(test_component2)
nlp.add_pipe(test_component3)
assigns_tensor = get_assigns_for_attr(nlp.pipeline, "doc.tensor")
assert [name for name, _ in assigns_tensor] == ["c1", "c2"]
test_component4 = nlp.create_pipe("c1")
assert test_component4.name == "c1"
assert test_component4.factory == "c1"
nlp.add_pipe(test_component4, name="c4")
assert nlp.pipe_names == ["c1", "c2", "c3", "c4"]
assert "c4" not in Language.factories
assert nlp.pipe_factories["c1"] == "c1"
assert nlp.pipe_factories["c4"] == "c1"
assigns_tensor = get_assigns_for_attr(nlp.pipeline, "doc.tensor")
assert [name for name, _ in assigns_tensor] == ["c1", "c2", "c4"]
requires_pos = get_requires_for_attr(nlp.pipeline, "token.pos")
assert [name for name, _ in requires_pos] == ["c2"]
assert print_summary(nlp, no_print=True)
assert nlp("hello world")
def test_component_factories_from_nlp():
"""Test that class components can implement a from_nlp classmethod that
gives them access to the nlp object and config via the factory."""
class TestComponent5(object):
def __call__(self, doc):
return doc
mock = Mock()
mock.return_value = TestComponent5()
TestComponent5.from_nlp = classmethod(mock)
TestComponent5 = component("c5")(TestComponent5)
assert "c5" in Language.factories
nlp = Language()
pipe = nlp.create_pipe("c5", config={"foo": "bar"})
nlp.add_pipe(pipe)
assert nlp("hello world")
# The first argument is the class itself, so we're accepting ANY here
mock.assert_called_once_with(ANY, nlp, foo="bar")
def test_analysis_validate_attrs_valid():
attrs = ["doc.sents", "doc.ents", "token.tag", "token._.xyz", "span._.xyz"]
assert validate_attrs(attrs)
for attr in attrs:
assert validate_attrs([attr])
with pytest.raises(ValueError):
validate_attrs(["doc.sents", "doc.xyz"])
@pytest.mark.parametrize(
"attr",
[
"doc",
"doc_ents",
"doc.xyz",
"token.xyz",
"token.tag_",
"token.tag.xyz",
"token._.xyz.abc",
"span.label",
],
)
def test_analysis_validate_attrs_invalid(attr):
with pytest.raises(ValueError):
validate_attrs([attr])
def test_analysis_validate_attrs_remove_pipe():
"""Test that attributes are validated correctly on remove."""
spacy.language.ENABLE_PIPELINE_ANALYSIS = True
@component("c1", assigns=["token.tag"])
def c1(doc):
return doc
@component("c2", requires=["token.pos"])
def c2(doc):
return doc
nlp = Language()
nlp.add_pipe(c1)
with pytest.warns(UserWarning):
nlp.add_pipe(c2)
with pytest.warns(None) as record:
nlp.remove_pipe("c2")
assert not record.list

View File

@ -154,7 +154,8 @@ def test_append_alias(nlp):
assert len(mykb.get_candidates("douglas")) == 3
# appending the same alias-entity pair again should not work (will throw a warning)
mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.3)
with pytest.warns(UserWarning):
mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.3)
# test that the number of relevant candidates remained unchanged
assert len(mykb.get_candidates("douglas")) == 3

View File

@ -0,0 +1,34 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.pipeline.functions import merge_subtokens
from ..util import get_doc
@pytest.fixture
def doc(en_tokenizer):
# fmt: off
text = "This is a sentence. This is another sentence. And a third."
heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 1, 1, 1, 0]
deps = ["nsubj", "ROOT", "subtok", "attr", "punct", "nsubj", "ROOT",
"subtok", "attr", "punct", "subtok", "subtok", "subtok", "ROOT"]
# fmt: on
tokens = en_tokenizer(text)
return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
def test_merge_subtokens(doc):
doc = merge_subtokens(doc)
# get_doc() doesn't set spaces, so the result is "And a third ."
assert [t.text for t in doc] == [
"This",
"is",
"a sentence",
".",
"This",
"is",
"another sentence",
".",
"And a third .",
]

View File

@ -105,6 +105,16 @@ def test_disable_pipes_context(nlp, name):
assert nlp.has_pipe(name)
def test_disable_pipes_list_arg(nlp):
for name in ["c1", "c2", "c3"]:
nlp.add_pipe(new_pipe, name=name)
assert nlp.has_pipe(name)
with nlp.disable_pipes(["c1", "c2"]):
assert not nlp.has_pipe("c1")
assert not nlp.has_pipe("c2")
assert nlp.has_pipe("c3")
@pytest.mark.parametrize("n_pipes", [100])
def test_add_lots_of_pipes(nlp, n_pipes):
for i in range(n_pipes):

View File

@ -30,7 +30,7 @@ def test_issue118(en_tokenizer, patterns):
doc = en_tokenizer(text)
ORG = doc.vocab.strings["ORG"]
matcher = Matcher(doc.vocab)
matcher.add("BostonCeltics", None, *patterns)
matcher.add("BostonCeltics", patterns)
assert len(list(doc.ents)) == 0
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
@ -57,7 +57,7 @@ def test_issue118_prefix_reorder(en_tokenizer, patterns):
doc = en_tokenizer(text)
ORG = doc.vocab.strings["ORG"]
matcher = Matcher(doc.vocab)
matcher.add("BostonCeltics", None, *patterns)
matcher.add("BostonCeltics", patterns)
assert len(list(doc.ents)) == 0
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
doc.ents += tuple(matches)[1:]
@ -78,7 +78,7 @@ def test_issue242(en_tokenizer):
]
doc = en_tokenizer(text)
matcher = Matcher(doc.vocab)
matcher.add("FOOD", None, *patterns)
matcher.add("FOOD", patterns)
matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)]
match1, match2 = matches
assert match1[1] == 3
@ -127,17 +127,13 @@ def test_issue587(en_tokenizer):
"""Test that Matcher doesn't segfault on particular input"""
doc = en_tokenizer("a b; c")
matcher = Matcher(doc.vocab)
matcher.add("TEST1", None, [{ORTH: "a"}, {ORTH: "b"}])
matcher.add("TEST1", [[{ORTH: "a"}, {ORTH: "b"}]])
matches = matcher(doc)
assert len(matches) == 1
matcher.add(
"TEST2", None, [{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "c"}]
)
matcher.add("TEST2", [[{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "c"}]])
matches = matcher(doc)
assert len(matches) == 2
matcher.add(
"TEST3", None, [{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "d"}]
)
matcher.add("TEST3", [[{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "d"}]])
matches = matcher(doc)
assert len(matches) == 2
@ -145,7 +141,7 @@ def test_issue587(en_tokenizer):
def test_issue588(en_vocab):
matcher = Matcher(en_vocab)
with pytest.raises(ValueError):
matcher.add("TEST", None, [])
matcher.add("TEST", [[]])
@pytest.mark.xfail
@ -161,11 +157,9 @@ def test_issue590(en_vocab):
doc = Doc(en_vocab, words=["n", "=", "1", ";", "a", ":", "5", "%"])
matcher = Matcher(en_vocab)
matcher.add(
"ab",
None,
[{"IS_ALPHA": True}, {"ORTH": ":"}, {"LIKE_NUM": True}, {"ORTH": "%"}],
"ab", [[{"IS_ALPHA": True}, {"ORTH": ":"}, {"LIKE_NUM": True}, {"ORTH": "%"}]]
)
matcher.add("ab", None, [{"IS_ALPHA": True}, {"ORTH": "="}, {"LIKE_NUM": True}])
matcher.add("ab", [[{"IS_ALPHA": True}, {"ORTH": "="}, {"LIKE_NUM": True}]])
matches = matcher(doc)
assert len(matches) == 2
@ -221,7 +215,7 @@ def test_issue615(en_tokenizer):
label = "Sport_Equipment"
doc = en_tokenizer(text)
matcher = Matcher(doc.vocab)
matcher.add(label, merge_phrases, pattern)
matcher.add(label, [pattern], on_match=merge_phrases)
matcher(doc)
entities = list(doc.ents)
assert entities != []
@ -339,7 +333,7 @@ def test_issue850():
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
matcher = Matcher(vocab)
pattern = [{"LOWER": "bob"}, {"OP": "*"}, {"LOWER": "frank"}]
matcher.add("FarAway", None, pattern)
matcher.add("FarAway", [pattern])
doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"])
match = matcher(doc)
assert len(match) == 1
@ -353,7 +347,7 @@ def test_issue850_basic():
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
matcher = Matcher(vocab)
pattern = [{"LOWER": "bob"}, {"OP": "*", "LOWER": "and"}, {"LOWER": "frank"}]
matcher.add("FarAway", None, pattern)
matcher.add("FarAway", [pattern])
doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"])
match = matcher(doc)
assert len(match) == 1

View File

@ -111,7 +111,7 @@ def test_issue1434():
hello_world = Doc(vocab, words=["Hello", "World"])
hello = Doc(vocab, words=["Hello"])
matcher = Matcher(vocab)
matcher.add("MyMatcher", None, pattern)
matcher.add("MyMatcher", [pattern])
matches = matcher(hello_world)
assert matches
matches = matcher(hello)
@ -133,7 +133,7 @@ def test_issue1450(string, start, end):
"""Test matcher works when patterns end with * operator."""
pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}]
matcher = Matcher(Vocab())
matcher.add("TSTEND", None, pattern)
matcher.add("TSTEND", [pattern])
doc = Doc(Vocab(), words=string.split())
matches = matcher(doc)
if start is None or end is None:

View File

@ -224,7 +224,7 @@ def test_issue1868():
def test_issue1883():
matcher = Matcher(Vocab())
matcher.add("pat1", None, [{"orth": "hello"}])
matcher.add("pat1", [[{"orth": "hello"}]])
doc = Doc(matcher.vocab, words=["hello"])
assert len(matcher(doc)) == 1
new_matcher = copy.deepcopy(matcher)
@ -249,7 +249,7 @@ def test_issue1915():
def test_issue1945():
"""Test regression in Matcher introduced in v2.0.6."""
matcher = Matcher(Vocab())
matcher.add("MWE", None, [{"orth": "a"}, {"orth": "a"}])
matcher.add("MWE", [[{"orth": "a"}, {"orth": "a"}]])
doc = Doc(matcher.vocab, words=["a", "a", "a"])
matches = matcher(doc) # we should see two overlapping matches here
assert len(matches) == 2
@ -285,7 +285,7 @@ def test_issue1971(en_vocab):
{"ORTH": "!", "OP": "?"},
]
Token.set_extension("optional", default=False)
matcher.add("TEST", None, pattern)
matcher.add("TEST", [pattern])
doc = Doc(en_vocab, words=["Hello", "John", "Doe", "!"])
# We could also assert length 1 here, but this is more conclusive, because
# the real problem here is that it returns a duplicate match for a match_id
@ -299,7 +299,7 @@ def test_issue_1971_2(en_vocab):
pattern1 = [{"ORTH": "EUR", "LOWER": {"IN": ["eur"]}}, {"LIKE_NUM": True}]
pattern2 = [{"LIKE_NUM": True}, {"ORTH": "EUR"}] # {"IN": ["EUR"]}}]
doc = Doc(en_vocab, words=["EUR", "10", "is", "10", "EUR"])
matcher.add("TEST1", None, pattern1, pattern2)
matcher.add("TEST1", [pattern1, pattern2])
matches = matcher(doc)
assert len(matches) == 2
@ -310,8 +310,8 @@ def test_issue_1971_3(en_vocab):
Token.set_extension("b", default=2, force=True)
doc = Doc(en_vocab, words=["hello", "world"])
matcher = Matcher(en_vocab)
matcher.add("A", None, [{"_": {"a": 1}}])
matcher.add("B", None, [{"_": {"b": 2}}])
matcher.add("A", [[{"_": {"a": 1}}]])
matcher.add("B", [[{"_": {"b": 2}}]])
matches = sorted((en_vocab.strings[m_id], s, e) for m_id, s, e in matcher(doc))
assert len(matches) == 4
assert matches == sorted([("A", 0, 1), ("A", 1, 2), ("B", 0, 1), ("B", 1, 2)])
@ -326,7 +326,7 @@ def test_issue_1971_4(en_vocab):
matcher = Matcher(en_vocab)
doc = Doc(en_vocab, words=["this", "is", "text"])
pattern = [{"_": {"ext_a": "str_a", "ext_b": "str_b"}}] * 3
matcher.add("TEST", None, pattern)
matcher.add("TEST", [pattern])
matches = matcher(doc)
# Uncommenting this caused a segmentation fault
assert len(matches) == 1

View File

@ -128,7 +128,7 @@ def test_issue2464(en_vocab):
"""Test problem with successive ?. This is the same bug, so putting it here."""
matcher = Matcher(en_vocab)
doc = Doc(en_vocab, words=["a", "b"])
matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}])
matcher.add("4", [[{"OP": "?"}, {"OP": "?"}]])
matches = matcher(doc)
assert len(matches) == 3

View File

@ -37,7 +37,7 @@ def test_issue2569(en_tokenizer):
doc = en_tokenizer("It is May 15, 1993.")
doc.ents = [Span(doc, 2, 6, label=doc.vocab.strings["DATE"])]
matcher = Matcher(doc.vocab)
matcher.add("RULE", None, [{"ENT_TYPE": "DATE", "OP": "+"}])
matcher.add("RULE", [[{"ENT_TYPE": "DATE", "OP": "+"}]])
matched = [doc[start:end] for _, start, end in matcher(doc)]
matched = sorted(matched, key=len, reverse=True)
assert len(matched) == 10
@ -89,7 +89,7 @@ def test_issue2671():
{"IS_PUNCT": True, "OP": "?"},
{"LOWER": "adrenaline"},
]
matcher.add(pattern_id, None, pattern)
matcher.add(pattern_id, [pattern])
doc1 = nlp("This is a high-adrenaline situation.")
doc2 = nlp("This is a high adrenaline situation.")
matches1 = matcher(doc1)

View File

@ -52,7 +52,7 @@ def test_issue3009(en_vocab):
doc = get_doc(en_vocab, words=words, tags=tags)
matcher = Matcher(en_vocab)
for i, pattern in enumerate(patterns):
matcher.add(str(i), None, pattern)
matcher.add(str(i), [pattern])
matches = matcher(doc)
assert matches
@ -116,8 +116,8 @@ def test_issue3248_1():
total number of patterns."""
nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
matcher.add("TEST2", None, nlp("d"))
matcher.add("TEST1", [nlp("a"), nlp("b"), nlp("c")])
matcher.add("TEST2", [nlp("d")])
assert len(matcher) == 2
@ -125,8 +125,8 @@ def test_issue3248_2():
"""Test that the PhraseMatcher can be pickled correctly."""
nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
matcher.add("TEST2", None, nlp("d"))
matcher.add("TEST1", [nlp("a"), nlp("b"), nlp("c")])
matcher.add("TEST2", [nlp("d")])
data = pickle.dumps(matcher)
new_matcher = pickle.loads(data)
assert len(new_matcher) == len(matcher)
@ -170,7 +170,7 @@ def test_issue3328(en_vocab):
[{"LOWER": {"IN": ["hello", "how"]}}],
[{"LOWER": {"IN": ["you", "doing"]}}],
]
matcher.add("TEST", None, *patterns)
matcher.add("TEST", patterns)
matches = matcher(doc)
assert len(matches) == 4
matched_texts = [doc[start:end].text for _, start, end in matches]
@ -183,8 +183,8 @@ def test_issue3331(en_vocab):
matches, one per rule.
"""
matcher = PhraseMatcher(en_vocab)
matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"]))
matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"]))
matcher.add("A", [Doc(en_vocab, words=["Barack", "Obama"])])
matcher.add("B", [Doc(en_vocab, words=["Barack", "Obama"])])
doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"])
matches = matcher(doc)
assert len(matches) == 2
@ -297,8 +297,10 @@ def test_issue3410():
def test_issue3412():
data = numpy.asarray([[0, 0, 0], [1, 2, 3], [9, 8, 7]], dtype="f")
vectors = Vectors(data=data)
keys, best_rows, scores = vectors.most_similar(numpy.asarray([[9, 8, 7], [0, 0, 0]], dtype="f"))
assert(best_rows[0] == 2)
keys, best_rows, scores = vectors.most_similar(
numpy.asarray([[9, 8, 7], [0, 0, 0]], dtype="f")
)
assert best_rows[0] == 2
def test_issue3447():

View File

@ -10,6 +10,6 @@ def test_issue3549(en_vocab):
"""Test that match pattern validation doesn't raise on empty errors."""
matcher = Matcher(en_vocab, validate=True)
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
matcher.add("GOOD", None, pattern)
matcher.add("GOOD", [pattern])
with pytest.raises(MatchPatternError):
matcher.add("BAD", None, [{"X": "Y"}])
matcher.add("BAD", [[{"X": "Y"}]])

View File

@ -12,6 +12,6 @@ def test_issue3555(en_vocab):
Token.set_extension("issue3555", default=None)
matcher = Matcher(en_vocab)
pattern = [{"LEMMA": "have"}, {"_": {"issue3555": True}}]
matcher.add("TEST", None, pattern)
matcher.add("TEST", [pattern])
doc = Doc(en_vocab, words=["have", "apple"])
matcher(doc)

View File

@ -34,8 +34,7 @@ def test_issue3611():
nlp.add_pipe(textcat, last=True)
# training the network
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):
with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]):
optimizer = nlp.begin_training()
for i in range(3):
losses = {}

View File

@ -12,10 +12,10 @@ def test_issue3839(en_vocab):
match_id = "PATTERN"
pattern1 = [{"LOWER": "terrific"}, {"OP": "?"}, {"LOWER": "group"}]
pattern2 = [{"LOWER": "terrific"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "group"}]
matcher.add(match_id, None, pattern1)
matcher.add(match_id, [pattern1])
matches = matcher(doc)
assert matches[0][0] == en_vocab.strings[match_id]
matcher = Matcher(en_vocab)
matcher.add(match_id, None, pattern2)
matcher.add(match_id, [pattern2])
matches = matcher(doc)
assert matches[0][0] == en_vocab.strings[match_id]

View File

@ -10,5 +10,5 @@ def test_issue3879(en_vocab):
assert len(doc) == 5
pattern = [{"ORTH": "This", "OP": "?"}, {"OP": "?"}, {"ORTH": "test"}]
matcher = Matcher(en_vocab)
matcher.add("TEST", None, pattern)
matcher.add("TEST", [pattern])
assert len(matcher(doc)) == 2 # fails because of a FP match 'is a test'

View File

@ -14,7 +14,7 @@ def test_issue3951(en_vocab):
{"OP": "?"},
{"LOWER": "world"},
]
matcher.add("TEST", None, pattern)
matcher.add("TEST", [pattern])
doc = Doc(en_vocab, words=["Hello", "my", "new", "world"])
matches = matcher(doc)
assert len(matches) == 0

View File

@ -9,8 +9,8 @@ def test_issue3972(en_vocab):
"""Test that the PhraseMatcher returns duplicates for duplicate match IDs.
"""
matcher = PhraseMatcher(en_vocab)
matcher.add("A", None, Doc(en_vocab, words=["New", "York"]))
matcher.add("B", None, Doc(en_vocab, words=["New", "York"]))
matcher.add("A", [Doc(en_vocab, words=["New", "York"])])
matcher.add("B", [Doc(en_vocab, words=["New", "York"])])
doc = Doc(en_vocab, words=["I", "live", "in", "New", "York"])
matches = matcher(doc)

View File

@ -11,7 +11,7 @@ def test_issue4002(en_vocab):
matcher = PhraseMatcher(en_vocab, attr="NORM")
pattern1 = Doc(en_vocab, words=["c", "d"])
assert [t.norm_ for t in pattern1] == ["c", "d"]
matcher.add("TEST", None, pattern1)
matcher.add("TEST", [pattern1])
doc = Doc(en_vocab, words=["a", "b", "c", "d"])
assert [t.norm_ for t in doc] == ["a", "b", "c", "d"]
matches = matcher(doc)
@ -21,6 +21,6 @@ def test_issue4002(en_vocab):
pattern2[0].norm_ = "c"
pattern2[1].norm_ = "d"
assert [t.norm_ for t in pattern2] == ["c", "d"]
matcher.add("TEST", None, pattern2)
matcher.add("TEST", [pattern2])
matches = matcher(doc)
assert len(matches) == 1

View File

@ -34,8 +34,7 @@ def test_issue4030():
nlp.add_pipe(textcat, last=True)
# training the network
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):
with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]):
optimizer = nlp.begin_training()
for i in range(3):
losses = {}

View File

@ -8,7 +8,7 @@ from spacy.tokens import Doc
def test_issue4120(en_vocab):
"""Test that matches without a final {OP: ?} token are returned."""
matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}])
matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}]])
doc1 = Doc(en_vocab, words=["a"])
assert len(matcher(doc1)) == 1 # works
@ -16,11 +16,11 @@ def test_issue4120(en_vocab):
assert len(matcher(doc2)) == 2 # fixed
matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b"}])
matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b"}]])
doc3 = Doc(en_vocab, words=["a", "b", "b", "c"])
assert len(matcher(doc3)) == 2 # works
matcher = Matcher(en_vocab)
matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b", "OP": "?"}])
matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b", "OP": "?"}]])
doc4 = Doc(en_vocab, words=["a", "b", "b", "c"])
assert len(matcher(doc4)) == 3 # fixed

View File

@ -0,0 +1,96 @@
# coding: utf8
from __future__ import unicode_literals
import srsly
from spacy.gold import GoldCorpus
from spacy.lang.en import English
from spacy.tests.util import make_tempdir
def test_issue4402():
nlp = English()
with make_tempdir() as tmpdir:
print("temp", tmpdir)
json_path = tmpdir / "test4402.json"
srsly.write_json(json_path, json_data)
corpus = GoldCorpus(str(json_path), str(json_path))
train_docs = list(corpus.train_docs(nlp, gold_preproc=True, max_length=0))
# assert that the data got split into 4 sentences
assert len(train_docs) == 4
json_data = [
{
"id": 0,
"paragraphs": [
{
"raw": "How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven.",
"sentences": [
{
"tokens": [
{"id": 0, "orth": "How", "ner": "O"},
{"id": 1, "orth": "should", "ner": "O"},
{"id": 2, "orth": "I", "ner": "O"},
{"id": 3, "orth": "cook", "ner": "O"},
{"id": 4, "orth": "bacon", "ner": "O"},
{"id": 5, "orth": "in", "ner": "O"},
{"id": 6, "orth": "an", "ner": "O"},
{"id": 7, "orth": "oven", "ner": "O"},
{"id": 8, "orth": "?", "ner": "O"},
],
"brackets": [],
},
{
"tokens": [
{"id": 9, "orth": "\n", "ner": "O"},
{"id": 10, "orth": "I", "ner": "O"},
{"id": 11, "orth": "'ve", "ner": "O"},
{"id": 12, "orth": "heard", "ner": "O"},
{"id": 13, "orth": "of", "ner": "O"},
{"id": 14, "orth": "people", "ner": "O"},
{"id": 15, "orth": "cooking", "ner": "O"},
{"id": 16, "orth": "bacon", "ner": "O"},
{"id": 17, "orth": "in", "ner": "O"},
{"id": 18, "orth": "an", "ner": "O"},
{"id": 19, "orth": "oven", "ner": "O"},
{"id": 20, "orth": ".", "ner": "O"},
],
"brackets": [],
},
],
"cats": [
{"label": "baking", "value": 1.0},
{"label": "not_baking", "value": 0.0},
],
},
{
"raw": "What is the difference between white and brown eggs?\n",
"sentences": [
{
"tokens": [
{"id": 0, "orth": "What", "ner": "O"},
{"id": 1, "orth": "is", "ner": "O"},
{"id": 2, "orth": "the", "ner": "O"},
{"id": 3, "orth": "difference", "ner": "O"},
{"id": 4, "orth": "between", "ner": "O"},
{"id": 5, "orth": "white", "ner": "O"},
{"id": 6, "orth": "and", "ner": "O"},
{"id": 7, "orth": "brown", "ner": "O"},
{"id": 8, "orth": "eggs", "ner": "O"},
{"id": 9, "orth": "?", "ner": "O"},
],
"brackets": [],
},
{"tokens": [{"id": 10, "orth": "\n", "ner": "O"}], "brackets": []},
],
"cats": [
{"label": "baking", "value": 0.0},
{"label": "not_baking", "value": 1.0},
],
},
],
}
]

View File

@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.tokens import Doc, DocBin
def test_issue4528(en_vocab):
"""Test that user_data is correctly serialized in DocBin."""
doc = Doc(en_vocab, words=["hello", "world"])
doc.user_data["foo"] = "bar"
# This is how extension attribute values are stored in the user data
doc.user_data[("._.", "foo", None, None)] = "bar"
doc_bin = DocBin(store_user_data=True)
doc_bin.add(doc)
doc_bin_bytes = doc_bin.to_bytes()
new_doc_bin = DocBin(store_user_data=True).from_bytes(doc_bin_bytes)
new_doc = list(new_doc_bin.get_docs(en_vocab))[0]
assert new_doc.user_data["foo"] == "bar"
assert new_doc.user_data[("._.", "foo", None, None)] == "bar"

View File

@ -0,0 +1,13 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.gold import GoldParse
@pytest.mark.parametrize(
"text,words", [("A'B C", ["A", "'", "B", "C"]), ("A-B", ["A-B"])]
)
def test_gold_misaligned(en_tokenizer, text, words):
doc = en_tokenizer(text)
GoldParse(doc, words=words)

View File

@ -3,7 +3,7 @@ from __future__ import unicode_literals
from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags
from spacy.gold import spans_from_biluo_tags, GoldParse, iob_to_biluo
from spacy.gold import GoldCorpus, docs_to_json
from spacy.gold import GoldCorpus, docs_to_json, align
from spacy.lang.en import English
from spacy.tokens import Doc
from .util import make_tempdir
@ -90,7 +90,7 @@ def test_gold_ner_missing_tags(en_tokenizer):
def test_iob_to_biluo():
good_iob = ["O", "O", "B-LOC", "I-LOC", "O", "B-PERSON"]
good_biluo = ["O", "O", "B-LOC", "L-LOC", "O", "U-PERSON"]
bad_iob = ["O", "O", "\"", "B-LOC", "I-LOC"]
bad_iob = ["O", "O", '"', "B-LOC", "I-LOC"]
converted_biluo = iob_to_biluo(good_iob)
assert good_biluo == converted_biluo
with pytest.raises(ValueError):
@ -99,14 +99,23 @@ def test_iob_to_biluo():
def test_roundtrip_docs_to_json():
text = "I flew to Silicon Valley via London."
tags = ["PRP", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."]
heads = [1, 1, 1, 4, 2, 1, 5, 1]
deps = ["nsubj", "ROOT", "prep", "compound", "pobj", "prep", "pobj", "punct"]
biluo_tags = ["O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"]
cats = {"TRAVEL": 1.0, "BAKING": 0.0}
nlp = English()
doc = nlp(text)
for i in range(len(tags)):
doc[i].tag_ = tags[i]
doc[i].dep_ = deps[i]
doc[i].head = doc[heads[i]]
doc.ents = spans_from_biluo_tags(doc, biluo_tags)
doc.cats = cats
doc[0].is_sent_start = True
for i in range(1, len(doc)):
doc[i].is_sent_start = False
doc.is_tagged = True
doc.is_parsed = True
# roundtrip to JSON
with make_tempdir() as tmpdir:
json_file = tmpdir / "roundtrip.json"
srsly.write_json(json_file, [docs_to_json(doc)])
@ -116,7 +125,95 @@ def test_roundtrip_docs_to_json():
assert len(doc) == goldcorpus.count_train()
assert text == reloaded_doc.text
assert tags == goldparse.tags
assert deps == goldparse.labels
assert heads == goldparse.heads
assert biluo_tags == goldparse.ner
assert "TRAVEL" in goldparse.cats
assert "BAKING" in goldparse.cats
assert cats["TRAVEL"] == goldparse.cats["TRAVEL"]
assert cats["BAKING"] == goldparse.cats["BAKING"]
# roundtrip to JSONL train dicts
with make_tempdir() as tmpdir:
jsonl_file = tmpdir / "roundtrip.jsonl"
srsly.write_jsonl(jsonl_file, [docs_to_json(doc)])
goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file))
reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp))
assert len(doc) == goldcorpus.count_train()
assert text == reloaded_doc.text
assert tags == goldparse.tags
assert deps == goldparse.labels
assert heads == goldparse.heads
assert biluo_tags == goldparse.ner
assert "TRAVEL" in goldparse.cats
assert "BAKING" in goldparse.cats
assert cats["TRAVEL"] == goldparse.cats["TRAVEL"]
assert cats["BAKING"] == goldparse.cats["BAKING"]
# roundtrip to JSONL tuples
with make_tempdir() as tmpdir:
jsonl_file = tmpdir / "roundtrip.jsonl"
# write to JSONL train dicts
srsly.write_jsonl(jsonl_file, [docs_to_json(doc)])
goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file))
# load and rewrite as JSONL tuples
srsly.write_jsonl(jsonl_file, goldcorpus.train_tuples)
goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file))
reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp))
assert len(doc) == goldcorpus.count_train()
assert text == reloaded_doc.text
assert tags == goldparse.tags
assert deps == goldparse.labels
assert heads == goldparse.heads
assert biluo_tags == goldparse.ner
assert "TRAVEL" in goldparse.cats
assert "BAKING" in goldparse.cats
assert cats["TRAVEL"] == goldparse.cats["TRAVEL"]
assert cats["BAKING"] == goldparse.cats["BAKING"]
# xfail while we have backwards-compatible alignment
@pytest.mark.xfail
@pytest.mark.parametrize(
"tokens_a,tokens_b,expected",
[
(["a", "b", "c"], ["ab", "c"], (3, [-1, -1, 1], [-1, 2], {0: 0, 1: 0}, {})),
(
["a", "b", "``", "c"],
['ab"', "c"],
(4, [-1, -1, -1, 1], [-1, 3], {0: 0, 1: 0, 2: 0}, {}),
),
(["a", "bc"], ["ab", "c"], (4, [-1, -1], [-1, -1], {0: 0}, {1: 1})),
(
["ab", "c", "d"],
["a", "b", "cd"],
(6, [-1, -1, -1], [-1, -1, -1], {1: 2, 2: 2}, {0: 0, 1: 0}),
),
(
["a", "b", "cd"],
["a", "b", "c", "d"],
(3, [0, 1, -1], [0, 1, -1, -1], {}, {2: 2, 3: 2}),
),
([" ", "a"], ["a"], (1, [-1, 0], [1], {}, {})),
],
)
def test_align(tokens_a, tokens_b, expected):
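# align() returns a cost (the number of tokens on either side without a
# one-to-one counterpart), the index mappings a2b/b2a (-1 where a token has
# no one-to-one alignment), and dicts for the many-to-one alignments.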
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_a, tokens_b)
assert (cost, list(a2b), list(b2a), a2b_multi, b2a_multi) == expected
# check symmetry
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_b, tokens_a)
assert (cost, list(b2a), list(a2b), b2a_multi, a2b_multi) == expected
def test_goldparse_startswith_space(en_tokenizer):
text = " a"
doc = en_tokenizer(text)
g = GoldParse(doc, words=["a"], entities=["U-DATE"], deps=["ROOT"], heads=[0])
assert g.words == [" ", "a"]
assert g.ner == [None, "U-DATE"]
assert g.labels == [None, "ROOT"]

View File

@ -95,12 +95,18 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
def test_prefer_gpu():
assert not prefer_gpu()
try:
import cupy # noqa: F401
except ImportError:
assert not prefer_gpu()
def test_require_gpu():
with pytest.raises(ValueError):
require_gpu()
try:
import cupy # noqa: F401
except ImportError:
with pytest.raises(ValueError):
require_gpu()
def test_create_symlink_windows(

View File

@ -0,0 +1,66 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy._ml import Tok2Vec
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.compat import unicode_
def get_batch(batch_size):
vocab = Vocab()
docs = []
start = 0
for size in range(1, batch_size + 1):
# Make the words numbers, so that they're distinct
# across the batch, and easy to track.
numbers = [unicode_(i) for i in range(start, start + size)]
docs.append(Doc(vocab, words=numbers))
start += size
return docs
# This fails in Thinc v7.3.1. Need to push patch
@pytest.mark.xfail
def test_empty_doc():
width = 128
embed_size = 2000
vocab = Vocab()
doc = Doc(vocab, words=[])
tok2vec = Tok2Vec(width, embed_size)
vectors, backprop = tok2vec.begin_update([doc])
assert len(vectors) == 1
assert vectors[0].shape == (0, width)
@pytest.mark.parametrize(
"batch_size,width,embed_size", [[1, 128, 2000], [2, 128, 2000], [3, 8, 63]]
)
def test_tok2vec_batch_sizes(batch_size, width, embed_size):
batch = get_batch(batch_size)
tok2vec = Tok2Vec(width, embed_size)
vectors, backprop = tok2vec.begin_update(batch)
assert len(vectors) == len(batch)
for doc_vec, doc in zip(vectors, batch):
assert doc_vec.shape == (len(doc), width)
@pytest.mark.parametrize(
"tok2vec_config",
[
{"width": 8, "embed_size": 100, "char_embed": False},
{"width": 8, "embed_size": 100, "char_embed": True},
{"width": 8, "embed_size": 100, "conv_depth": 6},
{"width": 8, "embed_size": 100, "conv_depth": 6},
{"width": 8, "embed_size": 100, "subword_features": False},
],
)
def test_tok2vec_configs(tok2vec_config):
docs = get_batch(3)
tok2vec = Tok2Vec(**tok2vec_config)
vectors, backprop = tok2vec.begin_update(docs)
assert len(vectors) == len(docs)
assert vectors[0].shape == (len(docs[0]), tok2vec_config["width"])
backprop(vectors)

View File

@ -103,7 +103,8 @@ class DocBin(object):
doc = Doc(vocab, words=words, spaces=spaces)
doc = doc.from_array(self.attrs, tokens)
if self.store_user_data:
doc.user_data.update(srsly.msgpack_loads(self.user_data[i]))
user_data = srsly.msgpack_loads(self.user_data[i], use_list=False)
doc.user_data.update(user_data)
yield doc
def merge(self, other):
@ -155,9 +156,9 @@ class DocBin(object):
msg = srsly.msgpack_loads(zlib.decompress(bytes_data))
self.attrs = msg["attrs"]
self.strings = set(msg["strings"])
lengths = numpy.fromstring(msg["lengths"], dtype="int32")
flat_spaces = numpy.fromstring(msg["spaces"], dtype=bool)
flat_tokens = numpy.fromstring(msg["tokens"], dtype="uint64")
lengths = numpy.frombuffer(msg["lengths"], dtype="int32")
flat_spaces = numpy.frombuffer(msg["spaces"], dtype=bool)
flat_tokens = numpy.frombuffer(msg["tokens"], dtype="uint64")
shape = (flat_tokens.size // len(self.attrs), len(self.attrs))
flat_tokens = flat_tokens.reshape(shape)
flat_spaces = flat_spaces.reshape((flat_spaces.size, 1))

View File

@ -142,6 +142,11 @@ def register_architecture(name, arch=None):
return do_registration
def make_layer(arch_config):
arch_func = get_architecture(arch_config["arch"])
return arch_func(arch_config["config"])
def get_architecture(name):
"""Get a model architecture function by name. Raises a KeyError if the
architecture is not found.
@ -242,6 +247,7 @@ def load_model_from_path(model_path, meta=False, **overrides):
cls = get_lang_class(lang)
nlp = cls(meta=meta, **overrides)
pipeline = meta.get("pipeline", [])
factories = meta.get("factories", {})
disable = overrides.get("disable", [])
if pipeline is True:
pipeline = nlp.Defaults.pipe_names
@ -250,7 +256,8 @@ def load_model_from_path(model_path, meta=False, **overrides):
for name in pipeline:
if name not in disable:
config = meta.get("pipeline_args", {}).get(name, {})
component = nlp.create_pipe(name, config=config)
factory = factories.get(name, name)
component = nlp.create_pipe(factory, config=config)
nlp.add_pipe(component, name=name)
return nlp.from_disk(model_path)
@ -363,6 +370,16 @@ def is_in_jupyter():
return False
def get_component_name(component):
if hasattr(component, "name"):
return component.name
if hasattr(component, "__name__"):
return component.__name__
if hasattr(component, "__class__") and hasattr(component.__class__, "__name__"):
return component.__class__.__name__
return repr(component)
def get_cuda_stream(require=False):
if CudaStream is None:
return None
@ -404,7 +421,7 @@ def env_opt(name, default=None):
def read_regex(path):
path = ensure_path(path)
with path.open() as file_:
with path.open(encoding="utf8") as file_:
entries = file_.read().split("\n")
expression = "|".join(
["^" + re.escape(piece) for piece in entries if piece.strip()]

View File

@ -48,14 +48,14 @@ be installed if needed via `pip install spacy[lookups]`. Some languages provide
full lemmatization rules and exceptions, while other languages currently only
rely on simple lookup tables.
<Infobox title="About spaCy's custom pronoun lemma" variant="warning">
<Infobox title="About spaCy's custom pronoun lemma for English" variant="warning">
spaCy adds a **special case for pronouns**: all pronouns are lemmatized to the
special token `-PRON-`. Unlike verbs and common nouns, there's no clear base
form of a personal pronoun. Should the lemma of "me" be "I", or should we
normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to
introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal
pronouns.
spaCy adds a **special case for English pronouns**: all English pronouns are
lemmatized to the special token `-PRON-`. Unlike verbs and common nouns,
there's no clear base form of a personal pronoun. Should the lemma of "me" be
"I", or should we normalize person as well, giving "it" — or maybe "he"?
spaCy's solution is to introduce a novel symbol, `-PRON-`, which is used as the
lemma for all personal pronouns.
</Infobox>
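A quick way to see this in practice — a minimal sketch, assuming the small
English model `en_core_web_sm` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I told her that she was wrong.")
# Every personal pronoun gets the special -PRON- lemma
print([token.lemma_ for token in doc])
# ['-PRON-', 'tell', '-PRON-', 'that', '-PRON-', 'be', 'wrong', '.']
```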
@ -117,76 +117,72 @@ type. They're available as the [`Token.pos`](/api/token#attributes) and
The English part-of-speech tagger uses the
[OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) version of the Penn
Treebank tag set. We also map the tags to the simpler Google Universal POS tag
set.
| Tag | POS | Morphology | Description |
| ----------------------------------- | ------- | ---------------------------------------------- | ----------------------------------------- |
| `-LRB-` | `PUNCT` | `PunctType=brck PunctSide=ini` | left round bracket |
| `-RRB-` | `PUNCT` | `PunctType=brck PunctSide=fin` | right round bracket |
| `,` | `PUNCT` | `PunctType=comm` | punctuation mark, comma |
| `:` | `PUNCT` | | punctuation mark, colon or ellipsis |
| `.` | `PUNCT` | `PunctType=peri` | punctuation mark, sentence closer |
| `''` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark |
| `""` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark |
| <InlineCode>&#96;&#96;</InlineCode> | `PUNCT` | `PunctType=quot PunctSide=ini` | opening quotation mark |
| `#` | `SYM` | `SymType=numbersign` | symbol, number sign |
| `$` | `SYM` | `SymType=currency` | symbol, currency |
| `ADD` | `X` | | email |
| `AFX` | `ADJ` | `Hyph=yes` | affix |
| `BES` | `VERB` | | auxiliary "be" |
| `CC` | `CONJ` | `ConjType=coor` | conjunction, coordinating |
| `CD` | `NUM` | `NumType=card` | cardinal number |
| `DT` | `DET` | | determiner |
| `EX` | `ADV` | `AdvType=ex` | existential there |
| `FW` | `X` | `Foreign=yes` | foreign word |
| `GW` | `X` | | additional word in multi-word expression |
| `HVS` | `VERB` | | forms of "have" |
| `HYPH` | `PUNCT` | `PunctType=dash` | punctuation mark, hyphen |
| `IN` | `ADP` | | conjunction, subordinating or preposition |
| `JJ` | `ADJ` | `Degree=pos` | adjective |
| `JJR` | `ADJ` | `Degree=comp` | adjective, comparative |
| `JJS` | `ADJ` | `Degree=sup` | adjective, superlative |
| `LS` | `PUNCT` | `NumType=ord` | list item marker |
| `MD` | `VERB` | `VerbType=mod` | verb, modal auxiliary |
| `NFP` | `PUNCT` | | superfluous punctuation |
| `NIL` | | | missing tag |
| `NN` | `NOUN` | `Number=sing` | noun, singular or mass |
| `NNP` | `PROPN` | `NounType=prop Number=sign` | noun, proper singular |
| `NNPS` | `PROPN` | `NounType=prop Number=plur` | noun, proper plural |
| `NNS` | `NOUN` | `Number=plur` | noun, plural |
| `PDT` | `ADJ` | `AdjType=pdt PronType=prn` | predeterminer |
| `POS` | `PART` | `Poss=yes` | possessive ending |
| `PRP` | `PRON` | `PronType=prs` | pronoun, personal |
| `PRP$` | `ADJ` | `PronType=prs Poss=yes` | pronoun, possessive |
| `RB` | `ADV` | `Degree=pos` | adverb |
| `RBR` | `ADV` | `Degree=comp` | adverb, comparative |
| `RBS` | `ADV` | `Degree=sup` | adverb, superlative |
| `RP` | `PART` | | adverb, particle |
| `_SP` | `SPACE` | | space |
| `SYM` | `SYM` | | symbol |
| `TO` | `PART` | `PartType=inf VerbForm=inf` | infinitival "to" |
| `UH` | `INTJ` | | interjection |
| `VB` | `VERB` | `VerbForm=inf` | verb, base form |
| `VBD` | `VERB` | `VerbForm=fin Tense=past` | verb, past tense |
| `VBG` | `VERB` | `VerbForm=part Tense=pres Aspect=prog` | verb, gerund or present participle |
| `VBN` | `VERB` | `VerbForm=part Tense=past Aspect=perf` | verb, past participle |
| `VBP` | `VERB` | `VerbForm=fin Tense=pres` | verb, non-3rd person singular present |
| `VBZ` | `VERB` | `VerbForm=fin Tense=pres Number=sing Person=3` | verb, 3rd person singular present |
| `WDT` | `ADJ` | `PronType=int|rel` | wh-determiner |
| `WP` | `NOUN` | `PronType=int|rel` | wh-pronoun, personal |
| `WP$` | `ADJ` | `Poss=yes PronType=int|rel` | wh-pronoun, possessive |
| `WRB` | `ADV` | `PronType=int|rel` | wh-adverb |
| `XX` | `X` | | unknown |
Treebank tag set. We also map the tags to the simpler Universal Dependencies v2
POS tag set.
| Tag | POS | Morphology | Description |
| ------------------------------------- | ------- | --------------------------------------- | ----------------------------------------- |
| `$` | `SYM` | | symbol, currency |
| <InlineCode>&#96;&#96;</InlineCode> | `PUNCT` | `PunctType=quot PunctSide=ini` | opening quotation mark |
| `''` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark |
| `,` | `PUNCT` | `PunctType=comm` | punctuation mark, comma |
| `-LRB-` | `PUNCT` | `PunctType=brck PunctSide=ini` | left round bracket |
| `-RRB-` | `PUNCT` | `PunctType=brck PunctSide=fin` | right round bracket |
| `.` | `PUNCT` | `PunctType=peri` | punctuation mark, sentence closer |
| `:` | `PUNCT` | | punctuation mark, colon or ellipsis |
| `ADD` | `X` | | email |
| `AFX` | `ADJ` | `Hyph=yes` | affix |
| `CC` | `CCONJ` | `ConjType=comp` | conjunction, coordinating |
| `CD` | `NUM` | `NumType=card` | cardinal number |
| `DT` | `DET` | | determiner |
| `EX` | `PRON` | `AdvType=ex` | existential there |
| `FW` | `X` | `Foreign=yes` | foreign word |
| `GW` | `X` | | additional word in multi-word expression |
| `HYPH` | `PUNCT` | `PunctType=dash` | punctuation mark, hyphen |
| `IN` | `ADP` | | conjunction, subordinating or preposition |
| `JJ` | `ADJ` | `Degree=pos` | adjective |
| `JJR` | `ADJ` | `Degree=comp` | adjective, comparative |
| `JJS` | `ADJ` | `Degree=sup` | adjective, superlative |
| `LS` | `X` | `NumType=ord` | list item marker |
| `MD` | `VERB` | `VerbType=mod` | verb, modal auxiliary |
| `NFP` | `PUNCT` | | superfluous punctuation |
| `NIL` | `X` | | missing tag |
| `NN` | `NOUN` | `Number=sing` | noun, singular or mass |
| `NNP` | `PROPN` | `NounType=prop Number=sing` | noun, proper singular |
| `NNPS` | `PROPN` | `NounType=prop Number=plur` | noun, proper plural |
| `NNS` | `NOUN` | `Number=plur` | noun, plural |
| `PDT` | `DET` | | predeterminer |
| `POS` | `PART` | `Poss=yes` | possessive ending |
| `PRP` | `PRON` | `PronType=prs` | pronoun, personal |
| `PRP$` | `DET` | `PronType=prs Poss=yes` | pronoun, possessive |
| `RB` | `ADV` | `Degree=pos` | adverb |
| `RBR` | `ADV` | `Degree=comp` | adverb, comparative |
| `RBS` | `ADV` | `Degree=sup` | adverb, superlative |
| `RP` | `ADP` | | adverb, particle |
| `SP` | `SPACE` | | space |
| `SYM` | `SYM` | | symbol |
| `TO` | `PART` | `PartType=inf VerbForm=inf` | infinitival "to" |
| `UH` | `INTJ` | | interjection |
| `VB` | `VERB` | `VerbForm=inf` | verb, base form |
| `VBD` | `VERB` | `VerbForm=fin Tense=past` | verb, past tense |
| `VBG` | `VERB` | `VerbForm=part Tense=pres Aspect=prog` | verb, gerund or present participle |
| `VBN` | `VERB` | `VerbForm=part Tense=past Aspect=perf` | verb, past participle |
| `VBP` | `VERB` | `VerbForm=fin Tense=pres` | verb, non-3rd person singular present |
| `VBZ` | `VERB` | `VerbForm=fin Tense=pres Number=sing Person=three` | verb, 3rd person singular present |
| `WDT` | `DET` | | wh-determiner |
| `WP` | `PRON` | | wh-pronoun, personal |
| `WP$` | `DET` | `Poss=yes` | wh-pronoun, possessive |
| `WRB` | `ADV` | | wh-adverb |
| `XX` | `X` | | unknown |
| `_SP` | `SPACE` | | |
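To inspect the mapping at runtime, you can compare the fine-grained
`Token.tag_` with the coarse-grained `Token.pos_` — a minimal sketch,
assuming `en_core_web_sm` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
for token in doc:
    # tag_ is the fine-grained Penn Treebank tag, pos_ the coarse tag
    print(token.text, token.tag_, token.pos_)
```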
</Accordion>
<Accordion title="German" id="pos-de">
The German part-of-speech tagger uses the
[TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html)
annotation scheme. We also map the tags to the simpler Google Universal POS tag
set.
annotation scheme. We also map the tags to the simpler Universal Dependencies
v2 POS tag set.
| Tag | POS | Morphology | Description |
| --------- | ------- | ---------------------------------------- | ------------------------------------------------- |
@ -194,7 +190,7 @@ set.
| `$,` | `PUNCT` | `PunctType=comm` | comma |
| `$.` | `PUNCT` | `PunctType=peri` | sentence-final punctuation mark |
| `ADJA` | `ADJ` | | adjective, attributive |
| `ADJD` | `ADJ` | `Variant=short` | adjective, adverbial or predicative |
| `ADJD` | `ADJ` | | adjective, adverbial or predicative |
| `ADV` | `ADV` | | adverb |
| `APPO` | `ADP` | `AdpType=post` | postposition |
| `APPR` | `ADP` | `AdpType=prep` | preposition; circumposition left |
@ -204,28 +200,28 @@ set.
| `CARD` | `NUM` | `NumType=card` | cardinal number |
| `FM` | `X` | `Foreign=yes` | foreign language material |
| `ITJ` | `INTJ` | | interjection |
| `KOKOM` | `CONJ` | `ConjType=comp` | comparative conjunction |
| `KON` | `CONJ` | | coordinate conjunction |
| `KOKOM` | `CCONJ` | `ConjType=comp` | comparative conjunction |
| `KON` | `CCONJ` | | coordinate conjunction |
| `KOUI` | `SCONJ` | | subordinate conjunction with "zu" and infinitive |
| `KOUS` | `SCONJ` | | subordinate conjunction with sentence |
| `NE` | `PROPN` | | proper noun |
| `NNE` | `PROPN` | | proper noun |
| `NN` | `NOUN` | | noun, singular or mass |
| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb |
| `NNE` | `PROPN` | | proper noun |
| `PDAT` | `DET` | `PronType=dem` | attributive demonstrative pronoun |
| `PDS` | `PRON` | `PronType=dem` | substituting demonstrative pronoun |
| `PIAT` | `DET` | `PronType=ind\|neg\|tot` | attributive indefinite pronoun without determiner |
| `PIS` | `PRON` | `PronType=ind\|neg\|tot` | substituting indefinite pronoun |
| `PIAT` | `DET` | `PronType=ind|neg|tot` | attributive indefinite pronoun without determiner |
| `PIS` | `PRON` | `PronType=ind|neg|tot` | substituting indefinite pronoun |
| `PPER` | `PRON` | `PronType=prs` | non-reflexive personal pronoun |
| `PPOSAT` | `DET` | `Poss=yes PronType=prs` | attributive possessive pronoun |
| `PPOSS` | `PRON` | `PronType=rel` | substituting possessive pronoun |
| `PPOSS` | `PRON` | `Poss=yes PronType=prs` | substituting possessive pronoun |
| `PRELAT` | `DET` | `PronType=rel` | attributive relative pronoun |
| `PRELS` | `PRON` | `PronType=rel` | substituting relative pronoun |
| `PRF` | `PRON` | `PronType=prs Reflex=yes` | reflexive personal pronoun |
| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb |
| `PTKA` | `PART` | | particle with adjective or adverb |
| `PTKANT` | `PART` | `PartType=res` | answer particle |
| `PTKNEG` | `PART` | `Negative=yes` | negative particle |
| `PTKVZ` | `PART` | `PartType=vbp` | separable verbal particle |
| `PTKNEG` | `PART` | `Polarity=neg` | negative particle |
| `PTKVZ` | `ADP` | `PartType=vbp` | separable verbal particle |
| `PTKZU` | `PART` | `PartType=inf` | "zu" before infinitive |
| `PWAT` | `DET` | `PronType=int` | attributive interrogative pronoun |
| `PWAV` | `ADV` | `PronType=int` | adverbial interrogative or relative pronoun |
@ -234,9 +230,9 @@ set.
| `VAFIN` | `AUX` | `Mood=ind VerbForm=fin` | finite verb, auxiliary |
| `VAIMP` | `AUX` | `Mood=imp VerbForm=fin` | imperative, auxiliary |
| `VAINF` | `AUX` | `VerbForm=inf` | infinitive, auxiliary |
| `VAPP` | `AUX` | `Aspect=perf VerbForm=fin` | perfect participle, auxiliary |
| `VAPP` | `AUX` | `Aspect=perf VerbForm=part` | perfect participle, auxiliary |
| `VMFIN` | `VERB` | `Mood=ind VerbForm=fin VerbType=mod` | finite verb, modal |
| `VMINF` | `VERB` | `VerbForm=fin VerbType=mod` | infinitive, modal |
| `VMINF` | `VERB` | `VerbForm=inf VerbType=mod` | infinitive, modal |
| `VMPP` | `VERB` | `Aspect=perf VerbForm=part VerbType=mod` | perfect participle, modal |
| `VVFIN` | `VERB` | `Mood=ind VerbForm=fin` | finite verb, full |
| `VVIMP` | `VERB` | `Mood=imp VerbForm=fin` | imperative, full |
@ -244,8 +240,7 @@ set.
| `VVIZU` | `VERB` | `VerbForm=inf` | infinitive with "zu", full |
| `VVPP` | `VERB` | `Aspect=perf VerbForm=part` | perfect participle, full |
| `XY` | `X` | | non-word containing non-letter |
| `SP` | `SPACE` | | space |
| `_SP` | `SPACE` | | |
</Accordion>
---

View File

@ -155,21 +155,14 @@ $ python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
### Output file types {new="2.1"}
> #### Which format should I choose?
>
> If you're not sure, go with the default `jsonl`. Newline-delimited JSON means
> that there's one JSON object per line. Unlike a regular JSON file, it can also
> be read in line-by-line and you won't have to parse the _entire file_ first.
> This makes it a very convenient format for larger corpora.
All output files generated by this command are compatible with
[`spacy train`](/api/cli#train).
| ID | Description |
| ------- | --------------------------------- |
| `jsonl` | Newline-delimited JSON (default). |
| `json` | Regular JSON. |
| `msg` | Binary MessagePack format. |
| ID | Description |
| ------- | -------------------------- |
| `json` | Regular JSON (default). |
| `jsonl` | Newline-delimited JSON. |
| `msg` | Binary MessagePack format. |
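The choice mostly affects how the files are read back. A minimal sketch using
`srsly` (the serialization library spaCy uses internally); the file names here
are just placeholders:

```python
import srsly

# Regular JSON: the entire file is parsed at once
data = srsly.read_json("training_data.json")

# Newline-delimited JSON: one object per line, read lazily
for example in srsly.read_jsonl("training_data.jsonl"):
    print(example["id"])
```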
### Converter options
@ -453,8 +446,10 @@ improvement.
```bash
$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
[--width] [--depth] [--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length] [--min-length]
[--seed] [--n-iter] [--use-vectors] [--n-save_every] [--init-tok2vec] [--epoch-start]
[--width] [--depth] [--cnn-window] [--cnn-pieces] [--use-chars] [--sa-depth]
[--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length]
[--min-length] [--seed] [--n-iter] [--use-vectors] [--n-save_every]
[--init-tok2vec] [--epoch-start]
```
| Argument | Type | Description |
@ -464,6 +459,10 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
| `output_dir` | positional | Directory to write models to on each epoch. |
| `--width`, `-cw` | option | Width of CNN layers. |
| `--depth`, `-cd` | option | Depth of CNN layers. |
| `--cnn-window`, `-cW` <Tag variant="new">2.2.2</Tag> | option | Window size for CNN layers. |
| `--cnn-pieces`, `-cP` <Tag variant="new">2.2.2</Tag> | option | Maxout size for CNN layers. `1` for [Mish](https://github.com/digantamisra98/Mish). |
| `--use-chars`, `-chr` <Tag variant="new">2.2.2</Tag> | flag | Whether to use character-based embedding. |
| `--sa-depth`, `-sa` <Tag variant="new">2.2.2</Tag> | option | Depth of self-attention layers. |
| `--embed-rows`, `-er` | option | Number of embedding rows. |
| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"L2"` or `"cosine"`. |
| `--dropout`, `-d` | option | Dropout rate. |
@ -476,7 +475,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
| `--n-save-every`, `-se` | option | Save model every X batches. |
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
| `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files. |
| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. |
| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. |
### JSONL format for raw text {#pretrain-jsonl}

View File

@ -202,6 +202,14 @@ All labels present in the match patterns.
| ----------- | ----- | ------------------ |
| **RETURNS** | tuple | The string labels. |
## EntityRuler.ent_ids {#ent_ids tag="property" new="2.2.2"}
All entity IDs present in the match patterns' `id` properties.
| Name | Type | Description |
| ----------- | ----- | ----------------------- |
| **RETURNS** | tuple | The string entity IDs. |
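A minimal sketch, assuming patterns are added with an `id` property:

```python
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "Apple", "id": "apple"}])
print(ruler.ent_ids)  # ('apple',)
```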
## EntityRuler.patterns {#patterns tag="property"}
Get all patterns that were added to the entity ruler.

View File

@ -323,18 +323,38 @@ you can use to undo your changes.
> #### Example
>
> ```python
> with nlp.disable_pipes('tagger', 'parser'):
> # New API as of v2.2.2
> with nlp.disable_pipes(["tagger", "parser"]):
> nlp.begin_training()
>
> with nlp.disable_pipes("tagger", "parser"):
> nlp.begin_training()
>
> disabled = nlp.disable_pipes('tagger', 'parser')
> disabled = nlp.disable_pipes("tagger", "parser")
> nlp.begin_training()
> disabled.restore()
> ```
| Name | Type | Description |
| ----------- | --------------- | ------------------------------------------------------------------------------------ |
| `*disabled` | unicode | Names of pipeline components to disable. |
| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. |
| Name | Type | Description |
| ----------------------------------------- | --------------- | ------------------------------------------------------------------------------------ |
| `disabled` <Tag variant="new">2.2.2</Tag> | list | Names of pipeline components to disable. |
| `*disabled` | unicode | Names of pipeline components to disable. |
| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. |
<Infobox title="Changed in v2.2.2" variant="warning">
As of spaCy v2.2.2, the `Language.disable_pipes` method can also take a list of
component names as its first argument (instead of a variable number of
arguments). This is especially useful if you're generating the component names
to disable programmatically. The new syntax will become the default in the
future.
```diff
- disabled = nlp.disable_pipes("tagger", "parser")
+ disabled = nlp.disable_pipes(["tagger", "parser"])
```
</Infobox>
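For example, the list form makes the common "disable everything except one
component" idiom straightforward — a minimal sketch, assuming
`en_core_web_sm` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Disable every component except the parser; pipes are restored on exit
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "parser"]
with nlp.disable_pipes(other_pipes):
    doc = nlp("Only the parser runs inside this block.")
```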
## Language.to_disk {#to_disk tag="method" new="2"}

View File

@ -157,16 +157,19 @@ overwritten.
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
| `*patterns` | list | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. |
<Infobox title="Changed in v2.0" variant="warning">
<Infobox title="Changed in v2.2.2" variant="warning">
As of spaCy 2.0, `Matcher.add_pattern` and `Matcher.add_entity` are deprecated
and have been replaced with a simpler [`Matcher.add`](/api/matcher#add) that
lets you add a list of patterns and a callback for a given match ID.
As of spaCy 2.2.2, `Matcher.add` also supports the new API, which will become
the default in the future. The patterns are now the second argument and a list
(instead of a variable number of arguments). The `on_match` callback becomes an
optional keyword argument.
```diff
- matcher.add_entity("GoogleNow", on_match=merge_phrases)
- matcher.add_pattern("GoogleNow", [{ORTH: "Google"}, {ORTH: "Now"}])
+ matcher.add('GoogleNow', merge_phrases, [{"ORTH": "Google"}, {"ORTH": "Now"}])
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
- matcher.add("GoogleNow", None, *patterns)
+ matcher.add("GoogleNow", patterns)
- matcher.add("GoogleNow", on_match, *patterns)
+ matcher.add("GoogleNow", patterns, on_match=on_match)
```
</Infobox>
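Put together, an end-to-end sketch of the v2.2.2-style call:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
# Patterns are passed as a list of token-pattern lists
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
matcher.add("GoogleNow", patterns)
doc = nlp("Google Now is listening.")
matches = matcher(doc)  # list of (match_id, start, end) tuples
```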

View File

@ -153,6 +153,23 @@ overwritten.
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
| `*docs` | `Doc` | `Doc` objects of the phrases to match. |
<Infobox title="Changed in v2.2.2" variant="warning">
As of spaCy 2.2.2, `PhraseMatcher.add` also supports the new API, which will
become the default in the future. The `Doc` patterns are now the second argument
and a list (instead of a variable number of arguments). The `on_match` callback
becomes an optional keyword argument.
```diff
patterns = [nlp("health care reform"), nlp("healthcare reform")]
- matcher.add("HEALTH", None, *patterns)
+ matcher.add("HEALTH", patterns)
- matcher.add("HEALTH", on_match, *patterns)
+ matcher.add("HEALTH", patterns, on_match=on_match)
```
</Infobox>
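The equivalent end-to-end sketch for the `PhraseMatcher`:

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
# Doc patterns are passed as a list; on_match would be a keyword argument
patterns = [nlp("health care reform"), nlp("healthcare reform")]
matcher.add("HEALTH", patterns)
doc = nlp("He favors healthcare reform.")
matches = matcher(doc)  # list of (match_id, start, end) tuples
```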
## PhraseMatcher.remove {#remove tag="method" new="2.2"}
Remove a rule from the matcher by match ID. A `KeyError` is raised if the key

View File

@ -1,9 +1,33 @@
<div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px">But
<mark class="entity" style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Google
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>is starting from behind. The company made a late push into hardware,
and
<mark class="entity" style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Apple
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>s Siri, available on iPhones, and
<mark class="entity" style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Amazon
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>s Alexa software, which runs on its Echo and Dot devices, have clear
leads in consumer adoption.</div>
<div
class="entities"
style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px"
>But
<mark
class="entity"
style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>Google
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>ORG</span
></mark
>is starting from behind. The company made a late push into hardware, and
<mark
class="entity"
style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>Apple
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>ORG</span
></mark
>s Siri, available on iPhones, and
<mark
class="entity"
style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>Amazon
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>ORG</span
></mark
>s Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer
adoption.</div
>

View File

@ -2,17 +2,25 @@
class="entities"
style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px"
>
🌱🌿 <mark
class="entity"
style="background: #3dff74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"
>🐍 <span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>SNEK</span
></mark> ____ 🌳🌲 ____ <mark
class="entity"
style="background: #cfc5ff; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"
>👨‍🌾 <span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>HUMAN</span
></mark> 🏘️
🌱🌿
<mark
class="entity"
style="background: #3dff74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>🐍
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>SNEK</span
></mark
>
____ 🌳🌲 ____
<mark
class="entity"
style="background: #cfc5ff; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>👨‍🌾
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>HUMAN</span
></mark
>
🏘️
</div>

View File

@ -1,16 +1,37 @@
<div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px">
<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
<div
class="entities"
style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px"
>
<mark
class="entity"
style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>
Apple
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span>
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>ORG</span
>
</mark>
is looking at buying
<mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
<mark
class="entity"
style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>
U.K.
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">GPE</span>
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>GPE</span
>
</mark>
startup for
<mark class="entity" style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
<mark
class="entity"
style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>
$1 billion
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">MONEY</span>
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>MONEY</span
>
</mark>
</div>

View File

@ -1,18 +1,39 @@
<div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px">
<div
class="entities"
style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px"
>
When
<mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
<mark
class="entity"
style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>
Sebastian Thrun
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PERSON</span>
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>PERSON</span
>
</mark>
started working on self-driving cars at
<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
<mark
class="entity"
style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>
Google
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span>
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>ORG</span
>
</mark>
in
<mark class="entity" style="background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
<mark
class="entity"
style="background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
>
2007
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">DATE</span>
<span
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
>DATE</span
>
</mark>
, few people outside of the company took him seriously.
</div>

View File

@ -986,6 +986,37 @@ doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

### Adding IDs to patterns {#entityruler-ent-ids new="2.2.2"}

The [`EntityRuler`](/api/entityruler) can also accept an `id` attribute for each
pattern. Using the `id` attribute allows multiple patterns to be associated with
the same entity.

```python
### {executable="true"}
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
# Both spellings of the city share the same "san-francisco" ID
patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc1 = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc1.ents])

doc2 = nlp("Apple is opening its first big office in San Fran.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc2.ents])
```
If the `id` attribute is included in the [`EntityRuler`](/api/entityruler)
patterns, the `ent_id_` property of the matched entity is set to the `id` given
in the patterns. So in the example above, it's easy to identify that "San
Francisco" and "San Fran" refer to the same entity.
The entity ruler is designed to integrate with spaCy's existing statistical
models and enhance the named entity recognizer. If it's added **before the
`"ner"` component**, the entity recognizer will respect the existing entity

View File

@ -127,6 +127,7 @@
{ "code": "sr", "name": "Serbian" },
{ "code": "sk", "name": "Slovak" },
{ "code": "sl", "name": "Slovenian" },
{ "code": "lb", "name": "Luxembourgish" },
{
"code": "sq",
"name": "Albanian",