Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-03 22:06:37 +03:00)
Merge branch 'master' into spacy.io
Commit 07ba9b4aa2
106
.github/contributors/GiorgioPorgio.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | George Ketsopoulos |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 23 October 2019 |
|
||||
| GitHub username | GiorgioPorgio |
|
||||
| Website (optional) | |
|
106
.github/contributors/zhuorulin.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | ------------------------ |
|
||||
| Name | Zhuoru Lin |
|
||||
| Company name (if applicable) | Bombora Inc. |
|
||||
| Title or role (if applicable) | Data Scientist |
|
||||
| Date | 2017-11-13 |
|
||||
| GitHub username | ZhuoruLin |
|
||||
| Website (optional) | |
|
2
Makefile
@@ -9,7 +9,7 @@ dist/spacy.pex : dist/spacy-$(sha).pex

 dist/spacy-$(sha).pex : dist/$(wheel)
 	env3.6/bin/python -m pip install pex==1.5.3
-	env3.6/bin/pex pytest dist/$(wheel) -e spacy -o dist/spacy-$(sha).pex
+	env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex

 dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
 	python3.6 -m venv env3.6
13
README.md
@@ -135,8 +135,7 @@ Thanks to our great community, we've finally re-added conda support. You can now
 install spaCy via `conda-forge`:

 ```bash
-conda config --add channels conda-forge
-conda install spacy
+conda install -c conda-forge spacy
 ```

 For the feedstock including the build recipe and configuration, check out
@@ -214,16 +213,6 @@ doc = nlp("This is a sentence.")
 📖 **For more info and examples, check out the
 [models documentation](https://spacy.io/docs/usage/models).**

-### Support for older versions
-
-If you're using an older version (`v1.6.0` or below), you can still download and
-install the old models from within spaCy using `python -m spacy.en.download all`
-or `python -m spacy.de.download all`. The `.tar.gz` archives are also
-[attached to the v1.6.0 release](https://github.com/explosion/spaCy/tree/v1.6.0).
-To download and install the models manually, unpack the archive, drop the
-contained directory into `spacy/data` and load the model via `spacy.load('en')`
-or `spacy.load('de')`.
-
 ## Compile from source

 The other way to install spaCy is to clone its
|
|
@@ -84,7 +84,7 @@ def read_conllu(file_):
 def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
     if text_loc.parts[-1].endswith(".conllu"):
         docs = []
-        with text_loc.open() as file_:
+        with text_loc.open(encoding="utf8") as file_:
             for conllu_doc in read_conllu(file_):
                 for conllu_sent in conllu_doc:
                     words = [line[1] for line in conllu_sent]
|
|
@@ -203,7 +203,7 @@ def golds_to_gold_tuples(docs, golds):
 def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
     if text_loc.parts[-1].endswith(".conllu"):
         docs = []
-        with text_loc.open() as file_:
+        with text_loc.open(encoding="utf8") as file_:
             for conllu_doc in read_conllu(file_):
                 for conllu_sent in conllu_doc:
                     words = [line[1] for line in conllu_sent]
@@ -378,7 +378,7 @@ def _load_pretrained_tok2vec(nlp, loc):
     """Load pretrained weights for the 'token-to-vector' part of the component
     models, which is typically a CNN. See 'spacy pretrain'. Experimental.
     """
-    with Path(loc).open("rb") as file_:
+    with Path(loc).open("rb", encoding="utf8") as file_:
         weights_data = file_.read()
     loaded = []
     for name, component in nlp.pipeline:
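A note on the `_load_pretrained_tok2vec` hunk: the added line combines binary mode `"rb"` with `encoding="utf8"`. In Python 3, `open()` rejects that combination with a `ValueError`, so the pre-change form (binary mode, no encoding) is the one that actually reads the weights. The sketch below demonstrates both calls on a throwaway file. The helper names and the temporary file are assumptions made for illustration and are not part of the commit.

```python
import tempfile
from pathlib import Path


def read_weights(loc):
    """Read raw bytes; binary mode takes no encoding argument."""
    with Path(loc).open("rb") as file_:
        return file_.read()


def try_binary_with_encoding(loc):
    """Return the error message raised by combining "rb" with an encoding."""
    try:
        with Path(loc).open("rb", encoding="utf8") as file_:
            file_.read()
    except ValueError as err:
        return str(err)
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a pretrained tok2vec weights file.
    weights_path = Path(tmp) / "weights.bin"
    weights_path.write_bytes(b"\x00\x01\x02")
    data = read_weights(weights_path)
    error = try_binary_with_encoding(weights_path)

print(data, error)
```

The text-mode hunks in this file (`open(encoding="utf8")`, `open("w", encoding="utf8")`) are unaffected, since an explicit encoding is valid whenever the mode is not binary.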
@@ -519,8 +519,8 @@ def main(
     for i in range(config.nr_epoch):
         docs, golds = read_data(
             nlp,
-            paths.train.conllu.open(),
-            paths.train.text.open(),
+            paths.train.conllu.open(encoding="utf8"),
+            paths.train.text.open(encoding="utf8"),
             max_doc_length=config.max_doc_length,
             limit=limit,
             oracle_segments=use_oracle_segments,
@@ -560,7 +560,7 @@ def main(

 def _render_parses(i, to_render):
     to_render[0].user_data["title"] = "Batch %d" % i
-    with Path("/tmp/parses.html").open("w") as file_:
+    with Path("/tmp/parses.html").open("w", encoding="utf8") as file_:
         html = displacy.render(to_render[:5], style="dep", page=True)
         file_.write(html)

|
|
@@ -77,6 +77,8 @@ def main(
     if labels_discard:
         labels_discard = [x.strip() for x in labels_discard.split(",")]
         logger.info("Discarding {} NER types: {}".format(len(labels_discard), labels_discard))
+    else:
+        labels_discard = []

     train_data = wikipedia_processor.read_training(
         nlp=nlp,
|
|
@ -18,19 +18,21 @@ during training. We discard the auxiliary model before run-time.
|
|||
The specific example here is not necessarily a good idea --- but it shows
|
||||
how an arbitrary objective function for some word can be used.
|
||||
|
||||
Developed and tested for spaCy 2.0.6
|
||||
Developed and tested for spaCy 2.0.6. Updated for v2.2.2
|
||||
"""
|
||||
import random
|
||||
import plac
|
||||
import spacy
|
||||
import os.path
|
||||
from spacy.tokens import Doc
|
||||
from spacy.gold import read_json_file, GoldParse
|
||||
|
||||
random.seed(0)
|
||||
|
||||
PWD = os.path.dirname(__file__)
|
||||
|
||||
TRAIN_DATA = list(read_json_file(os.path.join(PWD, "training-data.json")))
|
||||
TRAIN_DATA = list(read_json_file(
|
||||
os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json")))
|
||||
|
||||
|
||||
def get_position_label(i, words, tags, heads, labels, ents):
|
||||
|
@ -55,6 +57,7 @@ def main(n_iter=10):
|
|||
ner = nlp.create_pipe("ner")
|
||||
ner.add_multitask_objective(get_position_label)
|
||||
nlp.add_pipe(ner)
|
||||
print(nlp.pipeline)
|
||||
|
||||
print("Create data", len(TRAIN_DATA))
|
||||
optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA)
|
||||
|
@ -62,23 +65,24 @@ def main(n_iter=10):
|
|||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
for text, annot_brackets in TRAIN_DATA:
|
||||
annotations, _ = annot_brackets
|
||||
doc = nlp.make_doc(text)
|
||||
gold = GoldParse.from_annot_tuples(doc, annotations[0])
|
||||
nlp.update(
|
||||
[doc], # batch of texts
|
||||
[gold], # batch of annotations
|
||||
drop=0.2, # dropout - make it harder to memorise data
|
||||
sgd=optimizer, # callable to update weights
|
||||
losses=losses,
|
||||
)
|
||||
for annotations, _ in annot_brackets:
|
||||
doc = Doc(nlp.vocab, words=annotations[1])
|
||||
gold = GoldParse.from_annot_tuples(doc, annotations)
|
||||
nlp.update(
|
||||
[doc], # batch of texts
|
||||
[gold], # batch of annotations
|
||||
drop=0.2, # dropout - make it harder to memorise data
|
||||
sgd=optimizer, # callable to update weights
|
||||
losses=losses,
|
||||
)
|
||||
print(losses.get("nn_labeller", 0.0), losses["ner"])
|
||||
|
||||
# test the trained model
|
||||
for text, _ in TRAIN_DATA:
|
||||
doc = nlp(text)
|
||||
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
|
||||
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
|
||||
if text is not None:
|
||||
doc = nlp(text)
|
||||
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
|
||||
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
|
||||
|
||||
|
||||
if __name__ == "__main__":
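The rewritten training loop treats each `TRAIN_DATA` entry as a text plus a list of `(annotations, brackets)` pairs, one per sentence, and builds each `Doc` from `annotations[1]`. Here is a minimal sketch of that unpacking, assuming the tuple order `(ids, words, tags, heads, labels, ents)` suggested by the `get_position_label` signature. The sample sentence is invented, and no spaCy objects are created.

```python
# Hypothetical entry in the shape the loop unpacks:
# (text, [((ids, words, tags, heads, labels, ents), brackets), ...])
TRAIN_DATA = [
    (
        None,  # the text may be missing, hence the `if text is not None` guard
        [
            (
                (
                    [0, 1, 2],
                    ["Berlin", "is", "big"],
                    ["NNP", "VBZ", "JJ"],
                    [1, 1, 1],
                    ["nsubj", "ROOT", "acomp"],
                    ["U-LOC", "O", "O"],
                ),
                [],
            )
        ],
    )
]

sentences = []
for text, annot_brackets in TRAIN_DATA:
    for annotations, _ in annot_brackets:
        words = annotations[1]  # what the loop passes to Doc(nlp.vocab, words=...)
        ents = annotations[5]
        sentences.append(list(zip(words, ents)))

print(sentences)
```

Iterating per sentence means one `nlp.update` call per `(annotations, brackets)` pair, instead of one per text as in the removed loop.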
|
||||
|
|
|
@@ -1,10 +1,10 @@
 # Our libraries
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=7.2.0,<7.3.0
+thinc>=7.3.0,<7.4.0
 blis>=0.4.0,<0.5.0
 murmurhash>=0.28.0,<1.1.0
-wasabi>=0.2.0,<1.1.0
+wasabi>=0.3.0,<1.1.0
 srsly>=0.1.0,<1.1.0
 # Third party dependencies
 numpy>=1.15.0
|
|
@@ -38,18 +38,18 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc>=7.2.0,<7.3.0
+    thinc>=7.3.0,<7.4.0
 install_requires =
     setuptools
     numpy>=1.15.0
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc>=7.2.0,<7.3.0
+    thinc>=7.3.0,<7.4.0
     blis>=0.4.0,<0.5.0
     plac>=0.9.6,<1.2.0
     requests>=2.13.0,<3.0.0
-    wasabi>=0.2.0,<1.1.0
+    wasabi>=0.3.0,<1.1.0
     srsly>=0.1.0,<1.1.0
     pathlib==1.0.1; python_version < "3.4"
     importlib_metadata>=0.20; python_version < "3.8"
|
|
@@ -9,12 +9,14 @@ warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
 # These are imported as part of the API
 from thinc.neural.util import prefer_gpu, require_gpu

 from . import pipeline
 from .cli.info import info as cli_info
 from .glossary import explain
 from .about import __version__
 from .errors import Errors, Warnings, deprecation_warning
 from . import util
+from .util import register_architecture, get_architecture
+from .language import component


 if sys.maxunicode == 65535:
|
161
spacy/_ml.py
|
@ -3,16 +3,14 @@ from __future__ import unicode_literals
|
|||
|
||||
import numpy
|
||||
from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu
|
||||
from thinc.i2v import HashEmbed, StaticVectors
|
||||
from thinc.t2t import ExtractWindow, ParametricAttention
|
||||
from thinc.t2v import Pooling, sum_pool, mean_pool
|
||||
from thinc.misc import Residual
|
||||
from thinc.i2v import HashEmbed
|
||||
from thinc.misc import Residual, FeatureExtracter
|
||||
from thinc.misc import LayerNorm as LN
|
||||
from thinc.misc import FeatureExtracter
|
||||
from thinc.api import add, layerize, chain, clone, concatenate, with_flatten
|
||||
from thinc.api import with_getitem, flatten_add_lengths
|
||||
from thinc.api import uniqued, wrap, noop
|
||||
from thinc.api import with_square_sequences
|
||||
from thinc.linear.linear import LinearModel
|
||||
from thinc.neural.ops import NumpyOps, CupyOps
|
||||
from thinc.neural.util import get_array_module, copy_array
|
||||
|
@ -26,14 +24,13 @@ import thinc.extra.load_nlp
|
|||
from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
|
||||
from .errors import Errors, user_warning, Warnings
|
||||
from . import util
|
||||
from . import ml as new_ml
|
||||
from .ml import _legacy_tok2vec
|
||||
|
||||
try:
|
||||
import torch.nn
|
||||
from thinc.extra.wrappers import PyTorchWrapperRNN
|
||||
except ImportError:
|
||||
torch = None
|
||||
|
||||
VECTORS_KEY = "spacy_pretrained_vectors"
|
||||
# Backwards compatibility with <2.2.2
|
||||
USE_MODEL_REGISTRY_TOK2VEC = False
|
||||
|
||||
|
||||
def cosine(vec1, vec2):
|
||||
|
@ -310,6 +307,10 @@ def link_vectors_to_models(vocab):
|
|||
|
||||
|
||||
def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
|
||||
import torch.nn
|
||||
from thinc.api import with_square_sequences
|
||||
from thinc.extra.wrappers import PyTorchWrapperRNN
|
||||
|
||||
if depth == 0:
|
||||
return layerize(noop())
|
||||
model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout)
|
||||
|
@ -317,81 +318,89 @@ def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
|
|||
|
||||
|
||||
def Tok2Vec(width, embed_size, **kwargs):
|
||||
if not USE_MODEL_REGISTRY_TOK2VEC:
|
||||
# Preserve prior tok2vec for backwards compat, in v2.2.2
|
||||
return _legacy_tok2vec.Tok2Vec(width, embed_size, **kwargs)
|
||||
pretrained_vectors = kwargs.get("pretrained_vectors", None)
|
||||
cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
|
||||
subword_features = kwargs.get("subword_features", True)
|
||||
char_embed = kwargs.get("char_embed", False)
|
||||
if char_embed:
|
||||
subword_features = False
|
||||
conv_depth = kwargs.get("conv_depth", 4)
|
||||
bilstm_depth = kwargs.get("bilstm_depth", 0)
|
||||
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
|
||||
with Model.define_operators(
|
||||
{">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply}
|
||||
):
|
||||
norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm")
|
||||
if subword_features:
|
||||
prefix = HashEmbed(
|
||||
width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix"
|
||||
)
|
||||
suffix = HashEmbed(
|
||||
width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix"
|
||||
)
|
||||
shape = HashEmbed(
|
||||
width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape"
|
||||
)
|
||||
else:
|
||||
prefix, suffix, shape = (None, None, None)
|
||||
if pretrained_vectors is not None:
|
||||
glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
|
||||
conv_window = kwargs.get("conv_window", 1)
|
||||
|
||||
if subword_features:
|
||||
embed = uniqued(
|
||||
(glove | norm | prefix | suffix | shape)
|
||||
>> LN(Maxout(width, width * 5, pieces=3)),
|
||||
column=cols.index(ORTH),
|
||||
)
|
||||
else:
|
||||
embed = uniqued(
|
||||
(glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
|
||||
column=cols.index(ORTH),
|
||||
)
|
||||
elif subword_features:
|
||||
embed = uniqued(
|
||||
(norm | prefix | suffix | shape)
|
||||
>> LN(Maxout(width, width * 4, pieces=3)),
|
||||
column=cols.index(ORTH),
|
||||
)
|
||||
elif char_embed:
|
||||
embed = concatenate_lists(
|
||||
CharacterEmbed(nM=64, nC=8),
|
||||
FeatureExtracter(cols) >> with_flatten(norm),
|
||||
)
|
||||
reduce_dimensions = LN(
|
||||
Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces)
|
||||
)
|
||||
else:
|
||||
embed = norm
|
||||
cols = ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]
|
||||
|
||||
convolution = Residual(
|
||||
ExtractWindow(nW=1)
|
||||
>> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces))
|
||||
)
|
||||
if char_embed:
|
||||
tok2vec = embed >> with_flatten(
|
||||
reduce_dimensions >> convolution ** conv_depth, pad=conv_depth
|
||||
)
|
||||
else:
|
||||
tok2vec = FeatureExtracter(cols) >> with_flatten(
|
||||
embed >> convolution ** conv_depth, pad=conv_depth
|
||||
)
|
||||
|
||||
if bilstm_depth >= 1:
|
||||
tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth)
|
||||
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
|
||||
tok2vec.nO = width
|
||||
tok2vec.embed = embed
|
||||
return tok2vec
|
||||
doc2feats_cfg = {"arch": "spacy.Doc2Feats.v1", "config": {"columns": cols}}
|
||||
if char_embed:
|
||||
embed_cfg = {
|
||||
"arch": "spacy.CharacterEmbed.v1",
|
||||
"config": {
|
||||
"width": 64,
|
||||
"chars": 6,
|
||||
"@mix": {
|
||||
"arch": "spacy.LayerNormalizedMaxout.v1",
|
||||
"config": {"width": width, "pieces": 3},
|
||||
},
|
||||
"@embed_features": None,
|
||||
},
|
||||
}
|
||||
else:
|
||||
embed_cfg = {
|
||||
"arch": "spacy.MultiHashEmbed.v1",
|
||||
"config": {
|
||||
"width": width,
|
||||
"rows": embed_size,
|
||||
"columns": cols,
|
||||
"use_subwords": subword_features,
|
||||
"@pretrained_vectors": None,
|
||||
"@mix": {
|
||||
"arch": "spacy.LayerNormalizedMaxout.v1",
|
||||
"config": {"width": width, "pieces": 3},
|
||||
},
|
||||
},
|
||||
}
|
||||
if pretrained_vectors:
|
||||
embed_cfg["config"]["@pretrained_vectors"] = {
|
||||
"arch": "spacy.PretrainedVectors.v1",
|
||||
"config": {
|
||||
"vectors_name": pretrained_vectors,
|
||||
"width": width,
|
||||
"column": cols.index("ID"),
|
||||
},
|
||||
}
|
||||
if cnn_maxout_pieces >= 2:
|
||||
cnn_cfg = {
|
||||
"arch": "spacy.MaxoutWindowEncoder.v1",
|
||||
"config": {
|
||||
"width": width,
|
||||
"window_size": conv_window,
|
||||
"pieces": cnn_maxout_pieces,
|
||||
"depth": conv_depth,
|
||||
},
|
||||
}
|
||||
else:
|
||||
cnn_cfg = {
|
||||
"arch": "spacy.MishWindowEncoder.v1",
|
||||
"config": {"width": width, "window_size": conv_window, "depth": conv_depth},
|
||||
}
|
||||
bilstm_cfg = {
|
||||
"arch": "spacy.TorchBiLSTMEncoder.v1",
|
||||
"config": {"width": width, "depth": bilstm_depth},
|
||||
}
|
||||
if conv_depth == 0 and bilstm_depth == 0:
|
||||
encode_cfg = {}
|
||||
elif conv_depth >= 1 and bilstm_depth >= 1:
|
||||
encode_cfg = {
|
||||
"arch": "thinc.FeedForward.v1",
|
||||
"config": {"children": [cnn_cfg, bilstm_cfg]},
|
||||
}
|
||||
elif conv_depth >= 1:
|
||||
encode_cfg = cnn_cfg
|
||||
else:
|
||||
encode_cfg = bilstm_cfg
|
||||
config = {"@doc2feats": doc2feats_cfg, "@embed": embed_cfg, "@encode": encode_cfg}
|
||||
return new_ml.Tok2Vec(config)
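The registry-based branch of `Tok2Vec` picks the `"@encode"` entry from `conv_depth` and `bilstm_depth`. The sketch below isolates that selection so the four cases are easy to see. `select_encode_cfg` is a hypothetical helper written for this illustration, and the placeholder configs omit the width, window size and depth values that the real dicts carry.

```python
def select_encode_cfg(conv_depth, bilstm_depth, cnn_cfg, bilstm_cfg):
    """Follow the same branch order used to build the "@encode" entry."""
    if conv_depth == 0 and bilstm_depth == 0:
        return {}  # no encoder at all
    elif conv_depth >= 1 and bilstm_depth >= 1:
        # CNN followed by BiLSTM, chained as a feed-forward stack
        return {
            "arch": "thinc.FeedForward.v1",
            "config": {"children": [cnn_cfg, bilstm_cfg]},
        }
    elif conv_depth >= 1:
        return cnn_cfg
    else:
        return bilstm_cfg


# Placeholder configs (assumption: contents trimmed for brevity).
cnn = {"arch": "spacy.MaxoutWindowEncoder.v1", "config": {}}
bilstm = {"arch": "spacy.TorchBiLSTMEncoder.v1", "config": {}}

neither = select_encode_cfg(0, 0, cnn, bilstm)
both = select_encode_cfg(4, 1, cnn, bilstm)
cnn_only = select_encode_cfg(4, 0, cnn, bilstm)
bilstm_only = select_encode_cfg(0, 2, cnn, bilstm)

print(neither, both["arch"], cnn_only["arch"], bilstm_only["arch"])
```

With the defaults read from `kwargs` (`conv_depth=4`, `bilstm_depth=0`), the CNN-only case applies.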
|
||||
|
||||
|
||||
def reapply(layer, n_times):
|
||||
|
|
|
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "2.2.2.dev1"
+__version__ = "2.2.2"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
179
spacy/analysis.py
Normal file
|
@ -0,0 +1,179 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from collections import OrderedDict
|
||||
from wasabi import Printer
|
||||
|
||||
from .tokens import Doc, Token, Span
|
||||
from .errors import Errors, Warnings, user_warning
|
||||
|
||||
|
||||
def analyze_pipes(pipeline, name, pipe, index, warn=True):
|
||||
"""Analyze a pipeline component with respect to its position in the current
|
||||
pipeline and the other components. Will check whether requirements are
|
||||
fulfilled (e.g. if previous components assign the attributes).
|
||||
|
||||
pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
|
||||
name (unicode): The name of the pipeline component to analyze.
|
||||
pipe (callable): The pipeline component function to analyze.
|
||||
index (int): The index of the component in the pipeline.
|
||||
warn (bool): Show user warning if problem is found.
|
||||
RETURNS (list): The problems found for the given pipeline component.
|
||||
"""
|
||||
assert pipeline[index][0] == name
|
||||
prev_pipes = pipeline[:index]
|
||||
pipe_requires = getattr(pipe, "requires", [])
|
||||
requires = OrderedDict([(annot, False) for annot in pipe_requires])
|
||||
if requires:
|
||||
for prev_name, prev_pipe in prev_pipes:
|
||||
prev_assigns = getattr(prev_pipe, "assigns", [])
|
||||
for annot in prev_assigns:
|
||||
requires[annot] = True
|
||||
problems = []
|
||||
for annot, fulfilled in requires.items():
|
||||
if not fulfilled:
|
||||
problems.append(annot)
|
||||
if warn:
|
||||
user_warning(Warnings.W025.format(name=name, attr=annot))
|
||||
return problems
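`analyze_pipes` reports every required annotation that no earlier component declares in `assigns`. Here is a simplified sketch of that check which runs without spaCy. `DummyPipe`, the component names and the attribute strings are all hypothetical, and the warning call is left out.

```python
from collections import OrderedDict


class DummyPipe:
    """Stand-in component that only carries "assigns" and "requires"."""

    def __init__(self, assigns=(), requires=()):
        self.assigns = list(assigns)
        self.requires = list(requires)

    def __call__(self, doc):
        return doc


def find_unmet_requirements(pipeline, index):
    """Return required annotations that no earlier component assigns."""
    _, pipe = pipeline[index]
    requires = OrderedDict(
        (annot, False) for annot in getattr(pipe, "requires", [])
    )
    for _, prev_pipe in pipeline[:index]:
        for annot in getattr(prev_pipe, "assigns", []):
            if annot in requires:
                requires[annot] = True
    return [annot for annot, fulfilled in requires.items() if not fulfilled]


pipeline = [
    ("tagger", DummyPipe(assigns=["token.tag"])),
    (
        "lemmatizer",
        DummyPipe(requires=["token.tag", "token.pos"], assigns=["token.lemma"]),
    ),
]

first_problems = find_unmet_requirements(pipeline, 0)
second_problems = find_unmet_requirements(pipeline, 1)
print(first_problems, second_problems)
```

Only components before `index` are consulted, so moving a component later in the pipeline can resolve a reported problem.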
|
||||
|
||||
|
||||
def analyze_all_pipes(pipeline, warn=True):
|
||||
"""Analyze all pipes in the pipeline in order.
|
||||
|
||||
pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
|
||||
warn (bool): Show user warning if problem is found.
|
||||
RETURNS (dict): The problems found, keyed by component name.
|
||||
"""
|
||||
problems = {}
|
||||
for i, (name, pipe) in enumerate(pipeline):
|
||||
problems[name] = analyze_pipes(pipeline, name, pipe, i, warn=warn)
|
||||
return problems
|
||||
|
||||
|
||||
def dot_to_dict(values):
|
||||
"""Convert dot notation to a dict. For example: ["token.pos", "token._.xyz"]
|
||||
become {"token": {"pos": True, "_": {"xyz": True }}}.
|
||||
|
||||
values (iterable): The values to convert.
|
||||
RETURNS (dict): The converted values.
|
||||
"""
|
||||
result = {}
|
||||
for value in values:
|
||||
path = result
|
||||
parts = value.lower().split(".")
|
||||
for i, item in enumerate(parts):
|
||||
is_last = i == len(parts) - 1
|
||||
path = path.setdefault(item, True if is_last else {})
|
||||
return result
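To make the conversion concrete, here is `dot_to_dict` as a standalone copy of the function in this hunk, applied to the docstring's example plus one extra value. The extra `"Doc.tensor"` input is an assumption added to show the lowercasing step.

```python
def dot_to_dict(values):
    """Convert dot notation to a nested dict (copied from the hunk above)."""
    result = {}
    for value in values:
        path = result
        parts = value.lower().split(".")
        for i, item in enumerate(parts):
            is_last = i == len(parts) - 1
            # Leaves become True; intermediate parts become nested dicts.
            path = path.setdefault(item, True if is_last else {})
    return result


nested = dot_to_dict(["token.pos", "token._.xyz", "Doc.tensor"])
print(nested)
```

Values that share a prefix are merged into the same nested dict, which is why `"token.pos"` and `"token._.xyz"` end up under a single `"token"` key.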
|
||||
|
||||
|
||||
def validate_attrs(values):
|
||||
"""Validate component attributes provided to "assigns", "requires" etc.
|
||||
Raises error for invalid attributes and formatting. Doesn't check if
|
||||
custom extension attributes are registered, since this is something the
|
||||
user might want to do themselves later in the component.
|
||||
|
||||
values (iterable): The string attributes to check, e.g. `["token.pos"]`.
|
||||
RETURNS (iterable): The checked attributes.
|
||||
"""
|
||||
data = dot_to_dict(values)
|
||||
objs = {"doc": Doc, "token": Token, "span": Span}
|
||||
for obj_key, attrs in data.items():
|
||||
if obj_key == "span":
|
||||
# Support Span only for custom extension attributes
|
||||
span_attrs = [attr for attr in values if attr.startswith("span.")]
|
||||
span_attrs = [attr for attr in span_attrs if not attr.startswith("span._.")]
|
||||
if span_attrs:
|
||||
raise ValueError(Errors.E180.format(attrs=", ".join(span_attrs)))
|
||||
if obj_key not in objs: # first element is not doc/token/span
|
||||
invalid_attrs = ", ".join(a for a in values if a.startswith(obj_key))
|
||||
raise ValueError(Errors.E181.format(obj=obj_key, attrs=invalid_attrs))
|
||||
if not isinstance(attrs, dict): # attr is something like "doc"
|
||||
raise ValueError(Errors.E182.format(attr=obj_key))
|
||||
for attr, value in attrs.items():
|
||||
if attr == "_":
|
||||
if value is True: # attr is something like "doc._"
|
||||
raise ValueError(Errors.E182.format(attr="{}._".format(obj_key)))
|
||||
for ext_attr, ext_value in value.items():
|
||||
# We don't check whether the attribute actually exists
|
||||
if ext_value is not True: # attr is something like doc._.x.y
|
||||
good = "{}._.{}".format(obj_key, ext_attr)
|
||||
bad = "{}.{}".format(good, ".".join(ext_value))
|
||||
raise ValueError(Errors.E183.format(attr=bad, solution=good))
|
||||
continue # we can't validate those further
|
||||
if attr.endswith("_"): # attr is something like "token.pos_"
|
||||
raise ValueError(Errors.E184.format(attr=attr, solution=attr[:-1]))
|
||||
if value is not True: # attr is something like doc.x.y
|
||||
good = "{}.{}".format(obj_key, attr)
|
||||
bad = "{}.{}".format(good, ".".join(value))
|
||||
raise ValueError(Errors.E183.format(attr=bad, solution=good))
|
||||
obj = objs[obj_key]
|
||||
if not hasattr(obj, attr):
|
||||
raise ValueError(Errors.E185.format(obj=obj_key, attr=attr))
|
||||
return values
|
||||
|
||||
|
||||
def _get_feature_for_attr(pipeline, attr, feature):
|
||||
assert feature in ["assigns", "requires"]
|
||||
result = []
|
||||
for pipe_name, pipe in pipeline:
|
||||
pipe_assigns = getattr(pipe, feature, [])
|
||||
if attr in pipe_assigns:
|
||||
result.append((pipe_name, pipe))
|
||||
return result
|
||||
|
||||
|
||||
def get_assigns_for_attr(pipeline, attr):
|
||||
"""Get all pipeline components that assign an attr, e.g. "doc.tensor".
|
||||
|
||||
pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
|
||||
attr (unicode): The attribute to check.
|
||||
RETURNS (list): (name, pipe) tuples of components that assign the attr.
|
||||
"""
|
||||
return _get_feature_for_attr(pipeline, attr, "assigns")
|
||||
|
||||
|
||||
def get_requires_for_attr(pipeline, attr):
|
||||
"""Get all pipeline components that require an attr, e.g. "doc.tensor".
|
||||
|
||||
pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline.
|
||||
attr (unicode): The attribute to check.
|
||||
RETURNS (list): (name, pipe) tuples of components that require the attr.
|
||||
"""
|
||||
return _get_feature_for_attr(pipeline, attr, "requires")
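
The two lookups above can be illustrated with a minimal sketch. It assumes components expose plain `assigns` and `requires` lists, as the helper's `getattr` calls suggest. `DummyTagger`, `DummyChunker` and `plain_component` are hypothetical stand-ins invented for this example, not spaCy classes, and `get_feature_for_attr` is a condensed rewrite of the private helper.

```python
def get_feature_for_attr(pipeline, attr, feature):
    """Return (name, pipe) tuples whose `feature` list contains `attr`."""
    assert feature in ["assigns", "requires"]
    return [
        (name, pipe)
        for name, pipe in pipeline
        # Components without the attribute are treated as declaring nothing
        if attr in getattr(pipe, feature, [])
    ]


class DummyTagger(object):
    # Hypothetical component, for illustration only
    assigns = ["token.tag", "token.pos"]
    requires = []

    def __call__(self, doc):
        return doc


class DummyChunker(object):
    # Hypothetical component, for illustration only
    assigns = ["doc.ents"]
    requires = ["token.pos"]

    def __call__(self, doc):
        return doc


def plain_component(doc):
    # A bare function with no "assigns" or "requires" declarations
    return doc


pipeline = [
    ("tagger", DummyTagger()),
    ("plain", plain_component),
    ("chunker", DummyChunker()),
]
assigners = get_feature_for_attr(pipeline, "token.pos", "assigns")
requirers = get_feature_for_attr(pipeline, "token.pos", "requires")
print([name for name, _ in assigners], [name for name, _ in requirers])
```

Components that declare nothing are simply skipped, so undecorated functions can sit in the pipeline without affecting the result.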
|
||||
|
||||
|
||||
def print_summary(nlp, pretty=True, no_print=False):
|
||||
"""Print a formatted summary for the current nlp object's pipeline. Shows
|
||||
a table with the pipeline components and what they assign and require, as
|
||||
well as any problems if available.
|
||||
|
||||
nlp (Language): The nlp object.
|
||||
pretty (bool): Pretty-print the results (color etc).
|
||||
no_print (bool): Don't print anything, just return the data.
|
||||
RETURNS (dict): A dict with "overview" and "problems".
|
||||
"""
|
||||
msg = Printer(pretty=pretty, no_print=no_print)
|
||||
overview = []
|
||||
problems = {}
|
||||
for i, (name, pipe) in enumerate(nlp.pipeline):
|
||||
requires = getattr(pipe, "requires", [])
|
||||
assigns = getattr(pipe, "assigns", [])
|
||||
retok = getattr(pipe, "retokenizes", False)
|
||||
overview.append((i, name, requires, assigns, retok))
|
||||
problems[name] = analyze_pipes(nlp.pipeline, name, pipe, i, warn=False)
|
||||
msg.divider("Pipeline Overview")
|
||||
header = ("#", "Component", "Requires", "Assigns", "Retokenizes")
|
||||
msg.table(overview, header=header, divider=True, multiline=True)
|
||||
n_problems = sum(len(p) for p in problems.values())
|
||||
if any(p for p in problems.values()):
|
||||
msg.divider("Problems ({})".format(n_problems))
|
||||
for name, problem in problems.items():
|
||||
if problem:
|
||||
problem = ", ".join(problem)
|
||||
msg.warn("'{}' requirements not met: {}".format(name, problem))
|
||||
else:
|
||||
msg.good("No problems found.")
|
||||
if no_print:
|
||||
return {"overview": overview, "problems": problems}
|
|
@ -57,7 +57,7 @@ def convert(
|
|||
is written to stdout, so you can pipe them forward to a JSON file:
|
||||
$ spacy convert some_file.conllu > some_file.json
|
||||
"""
|
||||
no_print = (output_dir == "-")
|
||||
no_print = output_dir == "-"
|
||||
msg = Printer(no_print=no_print)
|
||||
input_path = Path(input_file)
|
||||
if file_type not in FILE_TYPES:
|
||||
|
|
|
@ -9,7 +9,9 @@ from ...tokens.doc import Doc
|
|||
from ...util import load_model
|
||||
|
||||
|
||||
def conll_ner2json(input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs):
|
||||
def conll_ner2json(
|
||||
input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs
|
||||
):
|
||||
"""
|
||||
Convert files in the CoNLL-2003 NER format and similar
|
||||
whitespace-separated columns into JSON format for use with train cli.
|
||||
|
|
|
@ -35,6 +35,10 @@ from .train import _load_pretrained_tok2vec
|
|||
output_dir=("Directory to write models to on each epoch", "positional", None, str),
|
||||
width=("Width of CNN layers", "option", "cw", int),
|
||||
depth=("Depth of CNN layers", "option", "cd", int),
|
||||
cnn_window=("Window size for CNN layers", "option", "cW", int),
|
||||
cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int),
|
||||
use_chars=("Whether to use character-based embedding", "flag", "chr", bool),
|
||||
sa_depth=("Depth of self-attention layers", "option", "sa", int),
|
||||
bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int),
|
||||
embed_rows=("Number of embedding rows", "option", "er", int),
|
||||
loss_func=(
|
||||
|
@ -81,7 +85,11 @@ def pretrain(
|
|||
output_dir,
|
||||
width=96,
|
||||
depth=4,
|
||||
bilstm_depth=2,
|
||||
bilstm_depth=0,
|
||||
cnn_pieces=3,
|
||||
sa_depth=0,
|
||||
use_chars=False,
|
||||
cnn_window=1,
|
||||
embed_rows=2000,
|
||||
loss_func="cosine",
|
||||
use_vectors=False,
|
||||
|
@ -158,8 +166,8 @@ def pretrain(
|
|||
conv_depth=depth,
|
||||
pretrained_vectors=pretrained_vectors,
|
||||
bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental.
|
||||
cnn_maxout_pieces=3, # You can try setting this higher
|
||||
subword_features=True, # Set to False for Chinese etc
|
||||
subword_features=not use_chars, # Set to False for Chinese etc
|
||||
cnn_maxout_pieces=cnn_pieces, # If set to 1, use Mish activation.
|
||||
),
|
||||
)
|
||||
# Load in pretrained weights
|
||||
|
|
|
@ -156,8 +156,7 @@ def train(
|
|||
"`lang` argument ('{}') ".format(nlp.lang, lang),
|
||||
exits=1,
|
||||
)
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipeline]
|
||||
nlp.disable_pipes(*other_pipes)
|
||||
nlp.disable_pipes([p for p in nlp.pipe_names if p not in pipeline])
|
||||
for pipe in pipeline:
|
||||
if pipe not in nlp.pipe_names:
|
||||
if pipe == "parser":
|
||||
|
@ -263,7 +262,11 @@ def train(
|
|||
exits=1,
|
||||
)
|
||||
train_docs = corpus.train_docs(
|
||||
nlp, noise_level=noise_level, gold_preproc=gold_preproc, max_length=0
|
||||
nlp,
|
||||
noise_level=noise_level,
|
||||
gold_preproc=gold_preproc,
|
||||
max_length=0,
|
||||
ignore_misaligned=True,
|
||||
)
|
||||
train_labels = set()
|
||||
if textcat_multilabel:
|
||||
|
@ -344,6 +347,7 @@ def train(
|
|||
orth_variant_level=orth_variant_level,
|
||||
gold_preproc=gold_preproc,
|
||||
max_length=0,
|
||||
ignore_misaligned=True,
|
||||
)
|
||||
if raw_text:
|
||||
random.shuffle(raw_text)
|
||||
|
@ -382,7 +386,11 @@ def train(
|
|||
if hasattr(component, "cfg"):
|
||||
component.cfg["beam_width"] = beam_width
|
||||
dev_docs = list(
|
||||
corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
|
||||
corpus.dev_docs(
|
||||
nlp_loaded,
|
||||
gold_preproc=gold_preproc,
|
||||
ignore_misaligned=True,
|
||||
)
|
||||
)
|
||||
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
|
||||
start_time = timer()
|
||||
|
@ -399,7 +407,11 @@ def train(
|
|||
if hasattr(component, "cfg"):
|
||||
component.cfg["beam_width"] = beam_width
|
||||
dev_docs = list(
|
||||
corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
|
||||
corpus.dev_docs(
|
||||
nlp_loaded,
|
||||
gold_preproc=gold_preproc,
|
||||
ignore_misaligned=True,
|
||||
)
|
||||
)
|
||||
start_time = timer()
|
||||
scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose)
|
||||
|
|
|
@ -12,6 +12,7 @@ import os
|
|||
import sys
|
||||
import itertools
|
||||
import ast
|
||||
import types
|
||||
|
||||
from thinc.neural.util import copy_array
|
||||
|
||||
|
@ -67,6 +68,7 @@ if is_python2:
|
|||
basestring_ = basestring # noqa: F821
|
||||
input_ = raw_input # noqa: F821
|
||||
path2str = lambda path: str(path).decode("utf8")
|
||||
class_types = (type, types.ClassType)
|
||||
|
||||
elif is_python3:
|
||||
bytes_ = bytes
|
||||
|
@ -74,6 +76,7 @@ elif is_python3:
|
|||
basestring_ = str
|
||||
input_ = input
|
||||
path2str = lambda path: str(path)
|
||||
class_types = (type, types.ClassType) if is_python_pre_3_5 else type
|
||||
|
||||
|
||||
def b_to_str(b_str):
|
||||
|
|
|
@ -44,14 +44,14 @@ TPL_ENTS = """
|
|||
|
||||
|
||||
TPL_ENT = """
|
||||
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
|
||||
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
|
||||
{text}
|
||||
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">{label}</span>
|
||||
</mark>
|
||||
"""
|
||||
|
||||
TPL_ENT_RTL = """
|
||||
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
|
||||
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em">
|
||||
{text}
|
||||
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
|
||||
</mark>
|
||||
|
|
|
@ -99,6 +99,8 @@ class Warnings(object):
|
|||
"'n_process' will be set to 1.")
|
||||
W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in "
|
||||
"the Knowledge Base.")
|
||||
W025 = ("'{name}' requires '{attr}' to be assigned, but none of the "
|
||||
"previous components in the pipeline declare that they assign it.")
|
||||
|
||||
|
||||
@add_codes
|
||||
|
@ -504,6 +506,29 @@ class Errors(object):
|
|||
E175 = ("Can't remove rule for unknown match pattern ID: {key}")
|
||||
E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
|
||||
E177 = ("Ill-formed IOB input detected: {tag}")
|
||||
E178 = ("Invalid pattern. Expected list of dicts but got: {pat}. Maybe you "
|
||||
"accidentally passed a single pattern to Matcher.add instead of a "
|
||||
"list of patterns? If you only want to add one pattern, make sure "
|
||||
"to wrap it in a list. For example: matcher.add('{key}', [pattern])")
|
||||
E179 = ("Invalid pattern. Expected a list of Doc objects but got a single "
|
||||
"Doc. If you only want to add one pattern, make sure to wrap it "
|
||||
"in a list. For example: matcher.add('{key}', [doc])")
|
||||
E180 = ("Span attributes can't be declared as required or assigned by "
|
||||
"components, since spans are only views of the Doc. Use Doc and "
|
||||
"Token attributes (or custom extension attributes) only and remove "
|
||||
"the following: {attrs}")
|
||||
E181 = ("Received invalid attributes for unknown object {obj}: {attrs}. "
|
||||
"Only Doc and Token attributes are supported.")
|
||||
E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
|
||||
"to define the attribute? For example: {attr}.???")
|
||||
E183 = ("Received invalid attribute declaration: {attr}\nOnly top-level "
|
||||
"attributes are supported, for example: {solution}")
|
||||
E184 = ("Only attributes without underscores are supported in component "
|
||||
"attribute declarations (because underscore and non-underscore "
|
||||
"attributes are connected anyway): {attr} -> {solution}")
|
||||
E185 = ("Received invalid attribute in component attribute declaration: "
|
||||
"{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
|
||||
E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
|
||||
|
||||
|
||||
@add_codes
|
||||
|
@ -536,6 +561,10 @@ class MatchPatternError(ValueError):
|
|||
ValueError.__init__(self, msg)
|
||||
|
||||
|
||||
class AlignmentError(ValueError):
|
||||
pass
|
||||
|
||||
|
||||
class ModelsWarning(UserWarning):
|
||||
pass
|
||||
|
||||
|
|
|
@ -80,7 +80,7 @@ GLOSSARY = {
|
|||
"RBR": "adverb, comparative",
|
||||
"RBS": "adverb, superlative",
|
||||
"RP": "adverb, particle",
|
||||
"TO": "infinitival to",
|
||||
"TO": 'infinitival "to"',
|
||||
"UH": "interjection",
|
||||
"VB": "verb, base form",
|
||||
"VBD": "verb, past tense",
|
||||
|
@ -279,6 +279,12 @@ GLOSSARY = {
|
|||
"re": "repeated element",
|
||||
"rs": "reported speech",
|
||||
"sb": "subject",
|
||||
"sb": "subject",
|
||||
"sbp": "passivized subject (PP)",
|
||||
"sp": "subject or predicate",
|
||||
"svp": "separable verb prefix",
|
||||
"uc": "unit component",
|
||||
"vo": "vocative",
|
||||
# Named Entity Recognition
|
||||
# OntoNotes 5
|
||||
# https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
|
||||
|
|
182
spacy/gold.pyx
|
@ -11,10 +11,9 @@ import itertools
|
|||
from pathlib import Path
|
||||
import srsly
|
||||
|
||||
from . import _align
|
||||
from .syntax import nonproj
|
||||
from .tokens import Doc, Span
|
||||
from .errors import Errors
|
||||
from .errors import Errors, AlignmentError
|
||||
from .compat import path2str
|
||||
from . import util
|
||||
from .util import minibatch, itershuffle
|
||||
|
@ -22,6 +21,7 @@ from .util import minibatch, itershuffle
|
|||
from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek
|
||||
|
||||
|
||||
USE_NEW_ALIGN = False
|
||||
punct_re = re.compile(r"\W")
|
||||
|
||||
|
||||
|
@ -56,10 +56,10 @@ def tags_to_entities(tags):
|
|||
|
||||
def merge_sents(sents):
|
||||
m_deps = [[], [], [], [], [], []]
|
||||
m_cats = {}
|
||||
m_brackets = []
|
||||
m_cats = sents.pop()
|
||||
i = 0
|
||||
for (ids, words, tags, heads, labels, ner), brackets in sents:
|
||||
for (ids, words, tags, heads, labels, ner), (cats, brackets) in sents:
|
||||
m_deps[0].extend(id_ + i for id_ in ids)
|
||||
m_deps[1].extend(words)
|
||||
m_deps[2].extend(tags)
|
||||
|
@ -68,12 +68,26 @@ def merge_sents(sents):
|
|||
m_deps[5].extend(ner)
|
||||
m_brackets.extend((b["first"] + i, b["last"] + i, b["label"])
|
||||
for b in brackets)
|
||||
m_cats.update(cats)
|
||||
i += len(ids)
|
||||
m_deps.append(m_cats)
|
||||
return [(m_deps, m_brackets)]
|
||||
return [(m_deps, (m_cats, m_brackets))]
|
||||
|
||||
|
||||
def align(tokens_a, tokens_b):
|
||||
_ALIGNMENT_NORM_MAP = [("``", "'"), ("''", "'"), ('"', "'"), ("`", "'")]
|
||||
|
||||
|
||||
def _normalize_for_alignment(tokens):
|
||||
tokens = [w.replace(" ", "").lower() for w in tokens]
|
||||
output = []
|
||||
for token in tokens:
|
||||
token = token.replace(" ", "").lower()
|
||||
for before, after in _ALIGNMENT_NORM_MAP:
|
||||
token = token.replace(before, after)
|
||||
output.append(token)
|
||||
return output
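
As a quick illustration of the normalization, the sketch below reuses the map and the loop shown above in a standalone snippet. The sample tokens are made up for the example.

```python
_ALIGNMENT_NORM_MAP = [("``", "'"), ("''", "'"), ('"', "'"), ("`", "'")]


def normalize_for_alignment(tokens):
    """Lowercase tokens, drop spaces and map quote variants to "'"."""
    output = []
    for token in tokens:
        token = token.replace(" ", "").lower()
        for before, after in _ALIGNMENT_NORM_MAP:
            token = token.replace(before, after)
        output.append(token)
    return output


normalized = normalize_for_alignment(["``", "Hello", "New York", "''"])
print(normalized)
```

Two-character quote variants are listed before the single backtick, so a PTB-style opening quote collapses to one apostrophe rather than two.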
|
||||
|
||||
|
||||
def _align_before_v2_2_2(tokens_a, tokens_b):
|
||||
"""Calculate alignment tables between two tokenizations, using the Levenshtein
|
||||
algorithm. The alignment is case-insensitive.
|
||||
|
||||
|
@ -92,6 +106,7 @@ def align(tokens_a, tokens_b):
|
|||
* b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
|
||||
direction.
|
||||
"""
|
||||
from . import _align
|
||||
if tokens_a == tokens_b:
|
||||
alignment = numpy.arange(len(tokens_a))
|
||||
return 0, alignment, alignment, {}, {}
|
||||
|
@ -111,6 +126,82 @@ def align(tokens_a, tokens_b):
|
|||
return cost, i2j, j2i, i2j_multi, j2i_multi
|
||||
|
||||
|
||||
def align(tokens_a, tokens_b):
|
||||
"""Calculate alignment tables between two tokenizations.
|
||||
|
||||
tokens_a (List[str]): The candidate tokenization.
|
||||
tokens_b (List[str]): The reference tokenization.
|
||||
RETURNS (tuple): A 5-tuple consisting of the following information:
|
||||
* cost (int): The number of misaligned tokens.
|
||||
* a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`.
|
||||
For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns
|
||||
to `tokens_b[6]`. If there's no one-to-one alignment for a token,
|
||||
it has the value -1.
|
||||
* b2a (List[int]): The same as `a2b`, but mapping the other direction.
|
||||
* a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a`
|
||||
to indices in `tokens_b`, where multiple tokens of `tokens_a` align to
|
||||
the same token of `tokens_b`.
|
||||
* b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
|
||||
direction.
|
||||
"""
|
||||
if not USE_NEW_ALIGN:
|
||||
return _align_before_v2_2_2(tokens_a, tokens_b)
|
||||
tokens_a = _normalize_for_alignment(tokens_a)
|
||||
tokens_b = _normalize_for_alignment(tokens_b)
|
||||
cost = 0
|
||||
a2b = numpy.empty(len(tokens_a), dtype="i")
|
||||
b2a = numpy.empty(len(tokens_b), dtype="i")
|
||||
a2b_multi = {}
|
||||
b2a_multi = {}
|
||||
i = 0
|
||||
j = 0
|
||||
offset_a = 0
|
||||
offset_b = 0
|
||||
while i < len(tokens_a) and j < len(tokens_b):
|
||||
a = tokens_a[i][offset_a:]
|
||||
b = tokens_b[j][offset_b:]
|
||||
a2b[i] = b2a[j] = -1
|
||||
if a == b:
|
||||
if offset_a == offset_b == 0:
|
||||
a2b[i] = j
|
||||
b2a[j] = i
|
||||
elif offset_a == 0:
|
||||
cost += 2
|
||||
a2b_multi[i] = j
|
||||
elif offset_b == 0:
|
||||
cost += 2
|
||||
b2a_multi[j] = i
|
||||
offset_a = offset_b = 0
|
||||
i += 1
|
||||
j += 1
|
||||
elif a == "":
|
||||
assert offset_a == 0
|
||||
cost += 1
|
||||
i += 1
|
||||
elif b == "":
|
||||
assert offset_b == 0
|
||||
cost += 1
|
||||
j += 1
|
||||
elif b.startswith(a):
|
||||
cost += 1
|
||||
if offset_a == 0:
|
||||
a2b_multi[i] = j
|
||||
i += 1
|
||||
offset_a = 0
|
||||
offset_b += len(a)
|
||||
elif a.startswith(b):
|
||||
cost += 1
|
||||
if offset_b == 0:
|
||||
b2a_multi[j] = i
|
||||
j += 1
|
||||
offset_b = 0
|
||||
offset_a += len(b)
|
||||
else:
|
||||
assert "".join(tokens_a) != "".join(tokens_b)
|
||||
raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b))
|
||||
return cost, a2b, b2a, a2b_multi, b2a_multi
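
The character-offset walk above can be traced on a small input. The following is a simplified pure-Python sketch, assuming plain lists in place of the numpy arrays and tokens that are already normalized. `align_sketch` is a name chosen for this example, and the `AlignmentError` class here is a local stand-in with a shortened message.

```python
class AlignmentError(ValueError):
    pass


def align_sketch(tokens_a, tokens_b):
    """Align two tokenizations of the same text by walking their characters."""
    cost = 0
    a2b = [-1] * len(tokens_a)  # -1 means "no one-to-one alignment"
    b2a = [-1] * len(tokens_b)
    a2b_multi, b2a_multi = {}, {}
    i = j = 0
    offset_a = offset_b = 0
    while i < len(tokens_a) and j < len(tokens_b):
        a = tokens_a[i][offset_a:]  # unconsumed remainder of current tokens
        b = tokens_b[j][offset_b:]
        if a == b:
            if offset_a == offset_b == 0:
                a2b[i] = j  # clean one-to-one match
                b2a[j] = i
            elif offset_a == 0:
                cost += 2
                a2b_multi[i] = j  # a[i] is the tail of a longer b[j]
            elif offset_b == 0:
                cost += 2
                b2a_multi[j] = i  # b[j] is the tail of a longer a[i]
            offset_a = offset_b = 0
            i += 1
            j += 1
        elif a == "":
            cost += 1
            i += 1
        elif b == "":
            cost += 1
            j += 1
        elif b.startswith(a):
            cost += 1
            if offset_a == 0:
                a2b_multi[i] = j
            i += 1
            offset_a = 0
            offset_b += len(a)
        elif a.startswith(b):
            cost += 1
            if offset_b == 0:
                b2a_multi[j] = i
            j += 1
            offset_b = 0
            offset_a += len(b)
        else:
            raise AlignmentError("tokenizations cover different texts")
    return cost, a2b, b2a, a2b_multi, b2a_multi


cost, a2b, b2a, a2b_multi, b2a_multi = align_sketch(
    ["new", "york", "city"], ["newyork", "city"]
)
print(cost, a2b, b2a, a2b_multi, b2a_multi)
```

In this trace, "new" and "york" both map to "newyork" through `a2b_multi`, while "city" gets a regular one-to-one entry. If the two token lists do not spell out the same text, the sketch raises `AlignmentError`.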
|
||||
|
||||
|
||||
class GoldCorpus(object):
|
||||
"""An annotated corpus, using the JSON file format. Manages
|
||||
annotations for tagging, dependency parsing and NER.
|
||||
|
@ -176,6 +267,11 @@ class GoldCorpus(object):
|
|||
gold_tuples = read_json_file(loc)
|
||||
elif loc.parts[-1].endswith("jsonl"):
|
||||
gold_tuples = srsly.read_jsonl(loc)
|
||||
first_gold_tuple = next(gold_tuples)
|
||||
gold_tuples = itertools.chain([first_gold_tuple], gold_tuples)
|
||||
# TODO: proper format checks with schemas
|
||||
if isinstance(first_gold_tuple, dict):
|
||||
gold_tuples = read_json_object(gold_tuples)
|
||||
elif loc.parts[-1].endswith("msg"):
|
||||
gold_tuples = srsly.read_msgpack(loc)
|
||||
else:
|
||||
|
@ -201,7 +297,6 @@ class GoldCorpus(object):
|
|||
n = 0
|
||||
i = 0
|
||||
for raw_text, paragraph_tuples in self.train_tuples:
|
||||
cats = paragraph_tuples.pop()
|
||||
for sent_tuples, brackets in paragraph_tuples:
|
||||
n += len(sent_tuples[1])
|
||||
if self.limit and i >= self.limit:
|
||||
|
@ -210,7 +305,8 @@ class GoldCorpus(object):
|
|||
return n
|
||||
|
||||
def train_docs(self, nlp, gold_preproc=False, max_length=None,
|
||||
noise_level=0.0, orth_variant_level=0.0):
|
||||
noise_level=0.0, orth_variant_level=0.0,
|
||||
ignore_misaligned=False):
|
||||
locs = list((self.tmp_dir / 'train').iterdir())
|
||||
random.shuffle(locs)
|
||||
train_tuples = self.read_tuples(locs, limit=self.limit)
|
||||
|
@ -218,20 +314,23 @@ class GoldCorpus(object):
|
|||
max_length=max_length,
|
||||
noise_level=noise_level,
|
||||
orth_variant_level=orth_variant_level,
|
||||
make_projective=True)
|
||||
make_projective=True,
|
||||
ignore_misaligned=ignore_misaligned)
|
||||
yield from gold_docs
|
||||
|
||||
def train_docs_without_preprocessing(self, nlp, gold_preproc=False):
|
||||
gold_docs = self.iter_gold_docs(nlp, self.train_tuples, gold_preproc=gold_preproc)
|
||||
yield from gold_docs
|
||||
|
||||
def dev_docs(self, nlp, gold_preproc=False):
|
||||
gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc)
|
||||
def dev_docs(self, nlp, gold_preproc=False, ignore_misaligned=False):
|
||||
gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc,
|
||||
ignore_misaligned=ignore_misaligned)
|
||||
yield from gold_docs
|
||||
|
||||
@classmethod
|
||||
def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None,
|
||||
noise_level=0.0, orth_variant_level=0.0, make_projective=False):
|
||||
noise_level=0.0, orth_variant_level=0.0, make_projective=False,
|
||||
ignore_misaligned=False):
|
||||
for raw_text, paragraph_tuples in tuples:
|
||||
if gold_preproc:
|
||||
raw_text = None
|
||||
|
@ -240,10 +339,12 @@ class GoldCorpus(object):
|
|||
docs, paragraph_tuples = cls._make_docs(nlp, raw_text,
|
||||
paragraph_tuples, gold_preproc, noise_level=noise_level,
|
||||
orth_variant_level=orth_variant_level)
|
||||
golds = cls._make_golds(docs, paragraph_tuples, make_projective)
|
||||
golds = cls._make_golds(docs, paragraph_tuples, make_projective,
|
||||
ignore_misaligned=ignore_misaligned)
|
||||
for doc, gold in zip(docs, golds):
|
||||
if (not max_length) or len(doc) < max_length:
|
||||
yield doc, gold
|
||||
if gold is not None:
|
||||
if (not max_length) or len(doc) < max_length:
|
||||
yield doc, gold
|
||||
|
||||
@classmethod
|
||||
def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0, orth_variant_level=0.0):
|
||||
|
@ -259,14 +360,22 @@ class GoldCorpus(object):
|
|||
|
||||
|
||||
@classmethod
|
||||
def _make_golds(cls, docs, paragraph_tuples, make_projective):
|
||||
def _make_golds(cls, docs, paragraph_tuples, make_projective, ignore_misaligned=False):
|
||||
if len(docs) != len(paragraph_tuples):
|
||||
n_annots = len(paragraph_tuples)
|
||||
raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots))
|
||||
return [GoldParse.from_annot_tuples(doc, sent_tuples,
|
||||
make_projective=make_projective)
|
||||
for doc, (sent_tuples, brackets)
|
||||
in zip(docs, paragraph_tuples)]
|
||||
golds = []
|
||||
for doc, (sent_tuples, (cats, brackets)) in zip(docs, paragraph_tuples):
|
||||
try:
|
||||
gold = GoldParse.from_annot_tuples(doc, sent_tuples, cats=cats,
|
||||
make_projective=make_projective)
|
||||
except AlignmentError:
|
||||
if ignore_misaligned:
|
||||
gold = None
|
||||
else:
|
||||
raise
|
||||
golds.append(gold)
|
||||
return golds
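
The `ignore_misaligned` handling above follows a common "mark as None, filter later" pattern. A minimal sketch of that pattern follows, assuming a hypothetical `make_gold` function in place of `GoldParse.from_annot_tuples` and a local `AlignmentError` class; neither is the spaCy implementation.

```python
class AlignmentError(ValueError):
    pass


def make_gold(words, gold_words):
    # Hypothetical stand-in: fails when the two texts differ
    if "".join(words).lower() != "".join(gold_words).lower():
        raise AlignmentError("texts differ")
    return {"words": gold_words}


def make_golds(pairs, ignore_misaligned=False):
    golds = []
    for words, gold_words in pairs:
        try:
            gold = make_gold(words, gold_words)
        except AlignmentError:
            if ignore_misaligned:
                gold = None  # keep the slot so results still line up with inputs
            else:
                raise
        golds.append(gold)
    return golds


pairs = [
    (["Hello", "world"], ["Hello", "world"]),
    (["Helo", "world"], ["Hello", "world"]),  # misaligned on purpose
]
golds = make_golds(pairs, ignore_misaligned=True)
# The caller drops the None entries, as iter_gold_docs does with its
# "if gold is not None" check
kept = [(pair, gold) for pair, gold in zip(pairs, golds) if gold is not None]
print(len(golds), len(kept))
```

Appending `None` instead of skipping keeps the output the same length as the input, which is what allows the later `zip(docs, golds)` to stay aligned.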
|
||||
|
||||
|
||||
def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
|
||||
|
@ -281,7 +390,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
|
|||
# modify words in paragraph_tuples
|
||||
variant_paragraph_tuples = []
|
||||
for sent_tuples, brackets in paragraph_tuples:
|
||||
ids, words, tags, heads, labels, ner, cats = sent_tuples
|
||||
ids, words, tags, heads, labels, ner = sent_tuples
|
||||
if lower:
|
||||
words = [w.lower() for w in words]
|
||||
# single variants
|
||||
|
@ -310,7 +419,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
|
|||
pair_idx = pair.index(words[word_idx])
|
||||
words[word_idx] = punct_choices[punct_idx][pair_idx]
|
||||
|
||||
variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner, cats), brackets))
|
||||
variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner), brackets))
|
||||
# modify raw to match variant_paragraph_tuples
|
||||
if raw is not None:
|
||||
variants = []
|
||||
|
@ -329,7 +438,7 @@ def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0):
|
|||
variant_raw += raw[raw_idx]
|
||||
raw_idx += 1
|
||||
for sent_tuples, brackets in variant_paragraph_tuples:
|
||||
ids, words, tags, heads, labels, ner, cats = sent_tuples
|
||||
ids, words, tags, heads, labels, ner = sent_tuples
|
||||
for word in words:
|
||||
match_found = False
|
||||
# add identical word
|
||||
|
@ -400,6 +509,9 @@ def json_to_tuple(doc):
|
|||
paragraphs = []
|
||||
for paragraph in doc["paragraphs"]:
|
||||
sents = []
|
||||
cats = {}
|
||||
for cat in paragraph.get("cats", {}):
|
||||
cats[cat["label"]] = cat["value"]
|
||||
for sent in paragraph["sentences"]:
|
||||
words = []
|
||||
ids = []
|
||||
|
@ -419,11 +531,7 @@ def json_to_tuple(doc):
|
|||
ner.append(token.get("ner", "-"))
|
||||
sents.append([
|
||||
[ids, words, tags, heads, labels, ner],
|
||||
sent.get("brackets", [])])
|
||||
cats = {}
|
||||
for cat in paragraph.get("cats", {}):
|
||||
cats[cat["label"]] = cat["value"]
|
||||
sents.append(cats)
|
||||
[cats, sent.get("brackets", [])]])
|
||||
if sents:
|
||||
yield [paragraph.get("raw", None), sents]
|
||||
|
||||
|
@ -537,8 +645,8 @@ cdef class GoldParse:
|
|||
DOCS: https://spacy.io/api/goldparse
|
||||
"""
|
||||
@classmethod
|
||||
def from_annot_tuples(cls, doc, annot_tuples, make_projective=False):
|
||||
_, words, tags, heads, deps, entities, cats = annot_tuples
|
||||
def from_annot_tuples(cls, doc, annot_tuples, cats=None, make_projective=False):
|
||||
_, words, tags, heads, deps, entities = annot_tuples
|
||||
return cls(doc, words=words, tags=tags, heads=heads, deps=deps,
|
||||
entities=entities, cats=cats,
|
||||
make_projective=make_projective)
|
||||
|
@ -595,9 +703,9 @@ cdef class GoldParse:
|
|||
if morphology is None:
|
||||
morphology = [None for _ in words]
|
||||
if entities is None:
|
||||
entities = ["-" for _ in doc]
|
||||
entities = ["-" for _ in words]
|
||||
elif len(entities) == 0:
|
||||
entities = ["O" for _ in doc]
|
||||
entities = ["O" for _ in words]
|
||||
else:
|
||||
# Translate the None values to '-', to make processing easier.
|
||||
# See Issue #2603
|
||||
|
@ -660,7 +768,9 @@ cdef class GoldParse:
|
|||
self.heads[i] = i+1
|
||||
self.labels[i] = "subtok"
|
||||
else:
|
||||
self.heads[i] = self.gold_to_cand[heads[i2j_multi[i]]]
|
||||
head_i = heads[i2j_multi[i]]
|
||||
if head_i:
|
||||
self.heads[i] = self.gold_to_cand[head_i]
|
||||
self.labels[i] = deps[i2j_multi[i]]
|
||||
# Now set NER. This is annoying because if we've got an entity
|
||||
# word split into two, we need to adjust the
|
||||
|
@ -748,7 +858,7 @@ def docs_to_json(docs, id=0):
|
|||
|
||||
docs (iterable / Doc): The Doc object(s) to convert.
|
||||
id (int): Id for the JSON.
|
||||
RETURNS (dict): The data in spaCy's JSON format
|
||||
RETURNS (dict): The data in spaCy's JSON format
|
||||
- each input doc will be treated as a paragraph in the output doc
|
||||
"""
|
||||
if isinstance(docs, Doc):
|
||||
|
@ -804,7 +914,7 @@ def biluo_tags_from_offsets(doc, entities, missing="O"):
|
|||
"""
|
||||
# Ensure no overlapping entity labels exist
|
||||
tokens_in_ents = {}
|
||||
|
||||
|
||||
starts = {token.idx: token.i for token in doc}
|
||||
ends = {token.idx + len(token): token.i for token in doc}
|
||||
biluo = ["-" for _ in doc]
|
||||
|
|
|
@ -1,8 +1,8 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, PUNCT, ADJ, CONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB
|
||||
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
|
||||
from ...symbols import POS, PUNCT, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X
|
||||
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, VERB
|
||||
|
||||
|
||||
TAG_MAP = {
|
||||
|
@ -20,8 +20,8 @@ TAG_MAP = {
|
|||
"CARD": {POS: NUM, "NumType": "card"},
|
||||
"FM": {POS: X, "Foreign": "yes"},
|
||||
"ITJ": {POS: INTJ},
|
||||
"KOKOM": {POS: CONJ, "ConjType": "comp"},
|
||||
"KON": {POS: CONJ},
|
||||
"KOKOM": {POS: CCONJ, "ConjType": "comp"},
|
||||
"KON": {POS: CCONJ},
|
||||
"KOUI": {POS: SCONJ},
|
||||
"KOUS": {POS: SCONJ},
|
||||
"NE": {POS: PROPN},
|
||||
|
@ -43,7 +43,7 @@ TAG_MAP = {
|
|||
"PTKA": {POS: PART},
|
||||
"PTKANT": {POS: PART, "PartType": "res"},
|
||||
"PTKNEG": {POS: PART, "Polarity": "neg"},
|
||||
"PTKVZ": {POS: PART, "PartType": "vbp"},
|
||||
"PTKVZ": {POS: ADP, "PartType": "vbp"},
|
||||
"PTKZU": {POS: PART, "PartType": "inf"},
|
||||
"PWAT": {POS: DET, "PronType": "int"},
|
||||
"PWAV": {POS: ADV, "PronType": "int"},
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
|
||||
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
|
||||
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
|
||||
|
||||
|
||||
TAG_MAP = {
|
||||
|
@ -28,8 +28,8 @@ TAG_MAP = {
|
|||
"JJR": {POS: ADJ, "Degree": "comp"},
|
||||
"JJS": {POS: ADJ, "Degree": "sup"},
|
||||
"LS": {POS: X, "NumType": "ord"},
|
||||
"MD": {POS: AUX, "VerbType": "mod"},
|
||||
"NIL": {POS: ""},
|
||||
"MD": {POS: VERB, "VerbType": "mod"},
|
||||
"NIL": {POS: X},
|
||||
"NN": {POS: NOUN, "Number": "sing"},
|
||||
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
|
||||
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
|
||||
|
@ -37,7 +37,7 @@ TAG_MAP = {
|
|||
"PDT": {POS: DET},
|
||||
"POS": {POS: PART, "Poss": "yes"},
|
||||
"PRP": {POS: PRON, "PronType": "prs"},
|
||||
"PRP$": {POS: PRON, "PronType": "prs", "Poss": "yes"},
|
||||
"PRP$": {POS: DET, "PronType": "prs", "Poss": "yes"},
|
||||
"RB": {POS: ADV, "Degree": "pos"},
|
||||
"RBR": {POS: ADV, "Degree": "comp"},
|
||||
"RBS": {POS: ADV, "Degree": "sup"},
|
||||
|
@ -58,9 +58,9 @@ TAG_MAP = {
|
|||
"Number": "sing",
|
||||
"Person": "three",
|
||||
},
|
||||
"WDT": {POS: PRON},
|
||||
"WDT": {POS: DET},
|
||||
"WP": {POS: PRON},
|
||||
"WP$": {POS: PRON, "Poss": "yes"},
|
||||
"WP$": {POS: DET, "Poss": "yes"},
|
||||
"WRB": {POS: ADV},
|
||||
"ADD": {POS: X},
|
||||
"NFP": {POS: PUNCT},
|
||||
|
|
|
@ -18,13 +18,8 @@ from .tokenizer import Tokenizer
|
|||
from .vocab import Vocab
|
||||
from .lemmatizer import Lemmatizer
|
||||
from .lookups import Lookups
|
||||
from .pipeline import DependencyParser, Tagger
|
||||
from .pipeline import Tensorizer, EntityRecognizer, EntityLinker
|
||||
from .pipeline import SimilarityHook, TextCategorizer, Sentencizer
|
||||
from .pipeline import merge_noun_chunks, merge_entities, merge_subtokens
|
||||
from .pipeline import EntityRuler
|
||||
from .pipeline import Morphologizer
|
||||
from .compat import izip, basestring_, is_python2
|
||||
from .analysis import analyze_pipes, analyze_all_pipes, validate_attrs
|
||||
from .compat import izip, basestring_, is_python2, class_types
|
||||
from .gold import GoldParse
|
||||
from .scorer import Scorer
|
||||
from ._ml import link_vectors_to_models, create_default_optimizer
|
||||
|
@ -40,6 +35,9 @@ from . import util
|
|||
from . import about
|
||||
|
||||
|
||||
ENABLE_PIPELINE_ANALYSIS = False
|
||||
|
||||
|
||||
class BaseDefaults(object):
|
||||
@classmethod
|
||||
def create_lemmatizer(cls, nlp=None, lookups=None):
|
||||
|
@ -133,22 +131,7 @@ class Language(object):
|
|||
Defaults = BaseDefaults
|
||||
lang = None
|
||||
|
||||
factories = {
|
||||
"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp),
|
||||
"tensorizer": lambda nlp, **cfg: Tensorizer(nlp.vocab, **cfg),
|
||||
"tagger": lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
|
||||
"morphologizer": lambda nlp, **cfg: Morphologizer(nlp.vocab, **cfg),
|
||||
"parser": lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
|
||||
"ner": lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
|
||||
"entity_linker": lambda nlp, **cfg: EntityLinker(nlp.vocab, **cfg),
|
||||
"similarity": lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
|
||||
"textcat": lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg),
|
||||
"sentencizer": lambda nlp, **cfg: Sentencizer(**cfg),
|
||||
"merge_noun_chunks": lambda nlp, **cfg: merge_noun_chunks,
|
||||
"merge_entities": lambda nlp, **cfg: merge_entities,
|
||||
"merge_subtokens": lambda nlp, **cfg: merge_subtokens,
|
||||
"entity_ruler": lambda nlp, **cfg: EntityRuler(nlp, **cfg),
|
||||
}
|
||||
factories = {"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp)}
|
||||
|
||||
def __init__(
|
||||
self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs
|
||||
|
@ -218,6 +201,7 @@ class Language(object):
|
|||
"name": self.vocab.vectors.name,
|
||||
}
|
||||
self._meta["pipeline"] = self.pipe_names
|
||||
self._meta["factories"] = self.pipe_factories
|
||||
self._meta["labels"] = self.pipe_labels
|
||||
return self._meta
|
||||
|
||||
|
@ -259,6 +243,17 @@ class Language(object):
|
|||
"""
|
||||
return [pipe_name for pipe_name, _ in self.pipeline]
|
||||
|
||||
@property
|
||||
def pipe_factories(self):
|
||||
"""Get the component factories for the available pipeline components.
|
||||
|
||||
RETURNS (dict): Factory names, keyed by component names.
|
||||
"""
|
||||
factories = {}
|
||||
for pipe_name, pipe in self.pipeline:
|
||||
factories[pipe_name] = getattr(pipe, "factory", pipe_name)
|
||||
return factories
|
||||
|
||||
@property
|
||||
def pipe_labels(self):
|
||||
"""Get the labels set by the pipeline components, if available (if
|
||||
|
@ -327,33 +322,30 @@ class Language(object):
|
|||
msg += Errors.E004.format(component=component)
|
||||
raise ValueError(msg)
|
||||
if name is None:
|
||||
if hasattr(component, "name"):
|
||||
name = component.name
|
||||
elif hasattr(component, "__name__"):
|
||||
name = component.__name__
|
||||
elif hasattr(component, "__class__") and hasattr(
|
||||
component.__class__, "__name__"
|
||||
):
|
||||
name = component.__class__.__name__
|
||||
else:
|
||||
name = repr(component)
|
||||
name = util.get_component_name(component)
|
||||
if name in self.pipe_names:
|
||||
raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names))
|
||||
if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2:
|
||||
raise ValueError(Errors.E006)
|
||||
pipe_index = 0
|
||||
pipe = (name, component)
|
||||
if last or not any([first, before, after]):
|
||||
pipe_index = len(self.pipeline)
|
||||
self.pipeline.append(pipe)
|
||||
elif first:
|
||||
self.pipeline.insert(0, pipe)
|
||||
elif before and before in self.pipe_names:
|
||||
pipe_index = self.pipe_names.index(before)
|
||||
self.pipeline.insert(self.pipe_names.index(before), pipe)
|
||||
elif after and after in self.pipe_names:
|
||||
pipe_index = self.pipe_names.index(after) + 1
|
||||
self.pipeline.insert(self.pipe_names.index(after) + 1, pipe)
|
||||
else:
|
||||
raise ValueError(
|
||||
Errors.E001.format(name=before or after, opts=self.pipe_names)
|
||||
)
|
||||
if ENABLE_PIPELINE_ANALYSIS:
|
||||
analyze_pipes(self.pipeline, name, component, pipe_index)
|
||||
|
||||
def has_pipe(self, name):
|
||||
"""Check if a component name is present in the pipeline. Equivalent to
|
||||
|
@ -382,6 +374,8 @@ class Language(object):
|
|||
msg += Errors.E135.format(name=name)
|
||||
raise ValueError(msg)
|
||||
self.pipeline[self.pipe_names.index(name)] = (name, component)
|
||||
if ENABLE_PIPELINE_ANALYSIS:
|
||||
analyze_all_pipes(self.pipeline)
|
||||
|
||||
def rename_pipe(self, old_name, new_name):
|
||||
"""Rename a pipeline component.
|
||||
|
@ -408,7 +402,10 @@ class Language(object):
|
|||
"""
|
||||
if name not in self.pipe_names:
|
||||
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
|
||||
return self.pipeline.pop(self.pipe_names.index(name))
|
||||
removed = self.pipeline.pop(self.pipe_names.index(name))
|
||||
if ENABLE_PIPELINE_ANALYSIS:
|
||||
analyze_all_pipes(self.pipeline)
|
||||
return removed
|
||||
|
||||
def __call__(self, text, disable=[], component_cfg=None):
|
||||
"""Apply the pipeline to some text. The text can span multiple sentences,
|
||||
|
@ -448,6 +445,8 @@ class Language(object):
|
|||
|
||||
DOCS: https://spacy.io/api/language#disable_pipes
|
||||
"""
|
||||
if len(names) == 1 and isinstance(names[0], (list, tuple)):
|
||||
names = names[0] # support list of names instead of spread
|
||||
return DisabledPipes(self, *names)
|
||||
|
||||
def make_doc(self, text):
|
||||
|
@ -999,6 +998,52 @@ class Language(object):
|
|||
return self
|
||||
|
||||
|
||||
class component(object):
|
||||
"""Decorator for pipeline components. Can decorate both function components
|
||||
and class components and will automatically register components in the
|
||||
Language.factories. If the component is a class and needs access to the
|
||||
nlp object or config parameters, it can expose a from_nlp classmethod
|
||||
that takes the nlp object and **cfg arguments and returns the initialized
|
||||
component.
|
||||
"""
|
||||
|
||||
# NB: This decorator needs to live here, because it needs to write to
|
||||
# Language.factories. All other solutions would cause circular import.
|
||||
|
||||
def __init__(self, name=None, assigns=tuple(), requires=tuple(), retokenizes=False):
|
||||
"""Decorate a pipeline component.
|
||||
|
||||
name (unicode): Default component and factory name.
|
||||
assigns (list): Attributes assigned by component, e.g. `["token.pos"]`.
|
||||
requires (list): Attributes required by component, e.g. `["token.dep"]`.
|
||||
retokenizes (bool): Whether the component changes the tokenization.
|
||||
"""
|
||||
self.name = name
|
||||
self.assigns = validate_attrs(assigns)
|
||||
self.requires = validate_attrs(requires)
|
||||
self.retokenizes = retokenizes
|
||||
|
||||
def __call__(self, *args, **kwargs):
|
||||
obj = args[0]
|
||||
args = args[1:]
|
||||
factory_name = self.name or util.get_component_name(obj)
|
||||
obj.name = factory_name
|
||||
obj.factory = factory_name
|
||||
obj.assigns = self.assigns
|
||||
obj.requires = self.requires
|
||||
obj.retokenizes = self.retokenizes
|
||||
|
||||
def factory(nlp, **cfg):
|
||||
if hasattr(obj, "from_nlp"):
|
||||
return obj.from_nlp(nlp, **cfg)
|
||||
elif isinstance(obj, class_types):
|
||||
return obj()
|
||||
return obj
|
||||
|
||||
Language.factories[obj.factory] = factory
|
||||
return obj
|
||||
|
||||
|
||||
def _fix_pretrained_vectors_name(nlp):
|
||||
# TODO: Replace this once we handle vectors consistently as static
|
||||
# data
|
||||
|
|
|
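The `component` decorator introduced in this file can be illustrated in isolation. What follows is a minimal sketch, not spaCy's implementation: `FACTORIES`, `component_sketch`, `lowercase_text` and `Prefixer` are hypothetical names, plain strings stand in for `Doc` objects, and attribute validation (`validate_attrs`) is left out.

```python
# Hypothetical stand-in for Language.factories; not part of spaCy.
FACTORIES = {}


class component_sketch(object):
    """Simplified decorator that registers a factory under a name."""

    def __init__(self, name=None, assigns=(), requires=(), retokenizes=False):
        self.name = name
        self.assigns = list(assigns)
        self.requires = list(requires)
        self.retokenizes = retokenizes

    def __call__(self, obj):
        # Fall back to the function or class name when no name is given
        # (a rough analogue of util.get_component_name in the diff).
        factory_name = self.name or obj.__name__
        obj.name = factory_name
        obj.factory = factory_name
        obj.assigns = self.assigns
        obj.requires = self.requires
        obj.retokenizes = self.retokenizes

        def factory(nlp, **cfg):
            if hasattr(obj, "from_nlp"):
                # Classes that need nlp or config expose from_nlp.
                return obj.from_nlp(nlp, **cfg)
            elif isinstance(obj, type):
                # Plain classes are instantiated without arguments.
                return obj()
            # Function components are returned unchanged.
            return obj

        FACTORIES[factory_name] = factory
        return obj


@component_sketch("lowercase_text")
def lowercase_text(doc):
    return doc.lower()


@component_sketch("prefixer")
class Prefixer(object):
    def __init__(self, prefix=">> "):
        self.prefix = prefix

    @classmethod
    def from_nlp(cls, nlp, **cfg):
        return cls(**cfg)

    def __call__(self, doc):
        return self.prefix + doc


fn_pipe = FACTORIES["lowercase_text"](nlp=None)
cls_pipe = FACTORIES["prefixer"](nlp=None, prefix="# ")
result = cls_pipe(fn_pipe("Hello"))
```

Because registration happens at decoration time, the hard-coded `factories` dict removed above can shrink to the tokenizer entry alone, provided each component module is imported before its factory is looked up.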
@ -102,7 +102,10 @@ cdef class DependencyMatcher:
|
|||
visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True
|
||||
idx = idx + 1
|
||||
|
||||
def add(self, key, on_match, *patterns):
|
||||
def add(self, key, patterns, *_patterns, on_match=None):
|
||||
if patterns is None or hasattr(patterns, "__call__"): # old API
|
||||
on_match = patterns
|
||||
patterns = _patterns
|
||||
for pattern in patterns:
|
||||
if len(pattern) == 0:
|
||||
raise ValueError(Errors.E012.format(key=key))
|
||||
|
|
|
@ -74,7 +74,7 @@ cdef class Matcher:
|
|||
"""
|
||||
return self._normalize_key(key) in self._patterns
|
||||
|
||||
def add(self, key, on_match, *patterns):
|
||||
def add(self, key, patterns, *_patterns, on_match=None):
|
||||
"""Add a match-rule to the matcher. A match-rule consists of: an ID
|
||||
key, an on_match callback, and one or more patterns.
|
||||
|
||||
|
@ -98,16 +98,29 @@ cdef class Matcher:
|
|||
operator will behave non-greedily. This quirk in the semantics makes
|
||||
the matcher more efficient, by avoiding the need for back-tracking.
|
||||
|
||||
As of spaCy v2.2.2, Matcher.add supports the future API, which makes
|
||||
the patterns the second argument and a list (instead of a variable
|
||||
number of arguments). The on_match callback becomes an optional keyword
|
||||
argument.
|
||||
|
||||
key (unicode): The match ID.
|
||||
on_match (callable): Callback executed on match.
|
||||
*patterns (list): List of token descriptions.
|
||||
patterns (list): The patterns to add for the given key.
|
||||
on_match (callable): Optional callback executed on match.
|
||||
*_patterns (list): For backwards compatibility: list of patterns to add
|
||||
as variable arguments. Will be ignored if a list of patterns is
|
||||
provided as the second argument.
|
||||
"""
|
||||
errors = {}
|
||||
if on_match is not None and not hasattr(on_match, "__call__"):
|
||||
raise ValueError(Errors.E171.format(arg_type=type(on_match)))
|
||||
if patterns is None or hasattr(patterns, "__call__"): # old API
|
||||
on_match = patterns
|
||||
patterns = _patterns
|
||||
for i, pattern in enumerate(patterns):
|
||||
if len(pattern) == 0:
|
||||
raise ValueError(Errors.E012.format(key=key))
|
||||
if not isinstance(pattern, list):
|
||||
raise ValueError(Errors.E178.format(pat=pattern, key=key))
|
||||
if self.validator:
|
||||
errors[i] = validate_json(pattern, self.validator)
|
||||
if any(err for err in errors.values()):
|
||||
|
|
|
@ -152,16 +152,27 @@ cdef class PhraseMatcher:
|
|||
del self._callbacks[key]
|
||||
del self._docs[key]
|
||||
|
||||
def add(self, key, on_match, *docs):
|
||||
def add(self, key, docs, *_docs, on_match=None):
|
||||
"""Add a match-rule to the phrase-matcher. A match-rule consists of: an ID
|
||||
key, an on_match callback, and one or more patterns.
|
||||
|
||||
As of spaCy v2.2.2, PhraseMatcher.add supports the future API, which
|
||||
makes the patterns the second argument and a list (instead of a variable
|
||||
number of arguments). The on_match callback becomes an optional keyword
|
||||
argument.
|
||||
|
||||
key (unicode): The match ID.
|
||||
docs (list): List of `Doc` objects representing match patterns.
|
||||
on_match (callable): Callback executed on match.
|
||||
*docs (Doc): `Doc` objects representing match patterns.
|
||||
*_docs (Doc): For backwards compatibility: list of patterns to add
|
||||
as variable arguments. Will be ignored if a list of patterns is
|
||||
provided as the second argument.
|
||||
|
||||
DOCS: https://spacy.io/api/phrasematcher#add
|
||||
"""
|
||||
if docs is None or hasattr(docs, "__call__"): # old API
|
||||
on_match = docs
|
||||
docs = _docs
|
||||
|
||||
_ = self.vocab[key]
|
||||
self._callbacks[key] = on_match
|
||||
|
@ -171,6 +182,8 @@ cdef class PhraseMatcher:
|
|||
cdef MapStruct* internal_node
|
||||
cdef void* result
|
||||
|
||||
if isinstance(docs, Doc):
|
||||
raise ValueError(Errors.E179.format(key=key))
|
||||
for doc in docs:
|
||||
if len(doc) == 0:
|
||||
continue
|
||||
|
|
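The backwards-compatible argument handling shared by `Matcher.add`, `PhraseMatcher.add` and `DependencyMatcher.add` can be sketched without spaCy. This is an illustrative reduction under stated assumptions: `MatcherSketch` is a hypothetical class, the error messages are placeholders for `Errors.E171`, `E012` and `E178`, and nothing is actually matched.

```python
class MatcherSketch(object):
    """Hypothetical class that mirrors only the argument handling of add()."""

    def __init__(self):
        self.patterns = {}
        self.callbacks = {}

    def add(self, key, patterns, *_patterns, on_match=None):
        if on_match is not None and not callable(on_match):
            raise ValueError("on_match must be callable or None")
        if patterns is None or callable(patterns):  # old API
            # Old call style: add(key, on_match, *patterns)
            on_match = patterns
            patterns = _patterns
        for pattern in patterns:
            if len(pattern) == 0:
                raise ValueError("empty pattern for key %r" % key)
            if not isinstance(pattern, list):
                raise ValueError("pattern for key %r must be a list" % key)
        self.patterns.setdefault(key, []).extend(patterns)
        self.callbacks[key] = on_match


def on_hello(*args):
    return None


m = MatcherSketch()
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]

m.add("OLD", None, pattern)                    # old API, no callback
m.add("OLD_CB", on_hello, pattern)             # old API, with callback
m.add("NEW", [pattern])                        # new API, no callback
m.add("NEW_CB", [pattern], on_match=on_hello)  # new API, with callback
```

Under the new call style a single pattern still has to be wrapped in a list. Passing `pattern` directly would make the loop iterate over its token dicts, which is what the `isinstance(pattern, list)` check in the diff guards against.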
5
spacy/ml/__init__.py
Normal file
|
@ -0,0 +1,5 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .tok2vec import Tok2Vec # noqa: F401
|
||||
from .common import FeedForward, LayerNormalizedMaxout # noqa: F401
|
131
spacy/ml/_legacy_tok2vec.py
Normal file
|
@ -0,0 +1,131 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
from thinc.v2v import Model, Maxout
|
||||
from thinc.i2v import HashEmbed, StaticVectors
|
||||
from thinc.t2t import ExtractWindow
|
||||
from thinc.misc import Residual
|
||||
from thinc.misc import LayerNorm as LN
|
||||
from thinc.misc import FeatureExtracter
|
||||
from thinc.api import layerize, chain, clone, concatenate, with_flatten
|
||||
from thinc.api import uniqued, wrap, noop
|
||||
|
||||
from ..attrs import ID, ORTH, NORM, PREFIX, SUFFIX, SHAPE
|
||||
|
||||
|
||||
def Tok2Vec(width, embed_size, **kwargs):
|
||||
# Circular imports :(
|
||||
from .._ml import CharacterEmbed
|
||||
from .._ml import PyTorchBiLSTM
|
||||
|
||||
pretrained_vectors = kwargs.get("pretrained_vectors", None)
|
||||
cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3)
|
||||
subword_features = kwargs.get("subword_features", True)
|
||||
char_embed = kwargs.get("char_embed", False)
|
||||
if char_embed:
|
||||
subword_features = False
|
||||
conv_depth = kwargs.get("conv_depth", 4)
|
||||
bilstm_depth = kwargs.get("bilstm_depth", 0)
|
||||
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
|
||||
with Model.define_operators({">>": chain, "|": concatenate, "**": clone}):
|
||||
norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm")
|
||||
if subword_features:
|
||||
prefix = HashEmbed(
|
||||
width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix"
|
||||
)
|
||||
suffix = HashEmbed(
|
||||
width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix"
|
||||
)
|
||||
shape = HashEmbed(
|
||||
width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape"
|
||||
)
|
||||
else:
|
||||
prefix, suffix, shape = (None, None, None)
|
||||
if pretrained_vectors is not None:
|
||||
glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
|
||||
|
||||
if subword_features:
|
||||
embed = uniqued(
|
||||
(glove | norm | prefix | suffix | shape)
|
||||
>> LN(Maxout(width, width * 5, pieces=3)),
|
||||
column=cols.index(ORTH),
|
||||
)
|
||||
else:
|
||||
embed = uniqued(
|
||||
(glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
|
||||
column=cols.index(ORTH),
|
||||
)
|
||||
elif subword_features:
|
||||
embed = uniqued(
|
||||
(norm | prefix | suffix | shape)
|
||||
>> LN(Maxout(width, width * 4, pieces=3)),
|
||||
column=cols.index(ORTH),
|
||||
)
|
||||
elif char_embed:
|
||||
embed = concatenate_lists(
|
||||
CharacterEmbed(nM=64, nC=8),
|
||||
FeatureExtracter(cols) >> with_flatten(norm),
|
||||
)
|
||||
reduce_dimensions = LN(
|
||||
Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces)
|
||||
)
|
||||
else:
|
||||
embed = norm
|
||||
|
||||
convolution = Residual(
|
||||
ExtractWindow(nW=1)
|
||||
>> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces))
|
||||
)
|
||||
if char_embed:
|
||||
tok2vec = embed >> with_flatten(
|
||||
reduce_dimensions >> convolution ** conv_depth, pad=conv_depth
|
||||
)
|
||||
else:
|
||||
tok2vec = FeatureExtracter(cols) >> with_flatten(
|
||||
embed >> convolution ** conv_depth, pad=conv_depth
|
||||
)
|
||||
|
||||
if bilstm_depth >= 1:
|
||||
tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth)
|
||||
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
|
||||
tok2vec.nO = width
|
||||
tok2vec.embed = embed
|
||||
return tok2vec
|
||||
|
||||
|
||||
@layerize
|
||||
def flatten(seqs, drop=0.0):
|
||||
ops = Model.ops
|
||||
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
|
||||
|
||||
def finish_update(d_X, sgd=None):
|
||||
return ops.unflatten(d_X, lengths, pad=0)
|
||||
|
||||
X = ops.flatten(seqs, pad=0)
|
||||
return X, finish_update
|
||||
|
||||
|
||||
def concatenate_lists(*layers, **kwargs): # pragma: no cover
|
||||
"""Compose two or more models `f`, `g`, etc, such that their outputs are
|
||||
concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
|
||||
"""
|
||||
if not layers:
|
||||
return noop()
|
||||
drop_factor = kwargs.get("drop_factor", 1.0)
|
||||
ops = layers[0].ops
|
||||
layers = [chain(layer, flatten) for layer in layers]
|
||||
concat = concatenate(*layers)
|
||||
|
||||
def concatenate_lists_fwd(Xs, drop=0.0):
|
||||
if drop is not None:
|
||||
drop *= drop_factor
|
||||
lengths = ops.asarray([len(X) for X in Xs], dtype="i")
|
||||
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
|
||||
ys = ops.unflatten(flat_y, lengths)
|
||||
|
||||
def concatenate_lists_bwd(d_ys, sgd=None):
|
||||
return bp_flat_y(ops.flatten(d_ys), sgd=sgd)
|
||||
|
||||
return ys, concatenate_lists_bwd
|
||||
|
||||
model = wrap(concatenate_lists_fwd, concat)
|
||||
return model
|
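The input width `width * 3` passed to `Maxout` after `ExtractWindow(nW=1)` follows from the window size: each token vector is concatenated with `nW` neighbours on either side, giving `2 * nW + 1` vectors per position. The sketch below shows that arithmetic on plain Python lists. It is not Thinc's implementation; `extract_window` is a hypothetical helper, zero padding at the sequence boundaries is an assumption, and the input is assumed to be non-empty.

```python
def extract_window(rows, n_w):
    """Concatenate each row with n_w neighbours on each side (zero-padded)."""
    width = len(rows[0])
    pad = [0.0] * width
    padded = [pad] * n_w + list(rows) + [pad] * n_w
    out = []
    for i in range(len(rows)):
        window = []
        # The slice covers n_w rows before, the row itself, and n_w after.
        for row in padded[i : i + 2 * n_w + 1]:
            window.extend(row)
        out.append(window)
    return out


rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three tokens, width 2
windows = extract_window(rows, n_w=1)
window_width = len(windows[0])               # width * (2 * n_w + 1)
```

With `width = 2` and `n_w = 1` every output row has six values, the same factor of three as in `Maxout(width, width * 3, ...)` in the file above.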
42
spacy/ml/_wire.py
Normal file
|
@ -0,0 +1,42 @@
|
|||
from __future__ import unicode_literals
|
||||
from thinc.api import layerize, wrap, noop, chain, concatenate
|
||||
from thinc.v2v import Model
|
||||
|
||||
|
||||
def concatenate_lists(*layers, **kwargs): # pragma: no cover
|
||||
"""Compose two or more models `f`, `g`, etc, such that their outputs are
|
||||
concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
|
||||
"""
|
||||
if not layers:
|
||||
return layerize(noop())
|
||||
drop_factor = kwargs.get("drop_factor", 1.0)
|
||||
ops = layers[0].ops
|
||||
layers = [chain(layer, flatten) for layer in layers]
|
||||
concat = concatenate(*layers)
|
||||
|
||||
def concatenate_lists_fwd(Xs, drop=0.0):
|
||||
if drop is not None:
|
||||
drop *= drop_factor
|
||||
lengths = ops.asarray([len(X) for X in Xs], dtype="i")
|
||||
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
|
||||
ys = ops.unflatten(flat_y, lengths)
|
||||
|
||||
def concatenate_lists_bwd(d_ys, sgd=None):
|
||||
return bp_flat_y(ops.flatten(d_ys), sgd=sgd)
|
||||
|
||||
return ys, concatenate_lists_bwd
|
||||
|
||||
model = wrap(concatenate_lists_fwd, concat)
|
||||
return model
|
||||
|
||||
|
||||
@layerize
|
||||
def flatten(seqs, drop=0.0):
|
||||
ops = Model.ops
|
||||
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
|
||||
|
||||
def finish_update(d_X, sgd=None):
|
||||
return ops.unflatten(d_X, lengths, pad=0)
|
||||
|
||||
X = ops.flatten(seqs, pad=0)
|
||||
return X, finish_update
|
23
spacy/ml/common.py
Normal file
|
@ -0,0 +1,23 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from thinc.api import chain
|
||||
from thinc.v2v import Maxout
|
||||
from thinc.misc import LayerNorm
|
||||
from ..util import register_architecture, make_layer
|
||||
|
||||
|
||||
@register_architecture("thinc.FeedForward.v1")
|
||||
def FeedForward(config):
|
||||
layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]]
|
||||
model = chain(*layers)
|
||||
model.cfg = config
|
||||
return model
|
||||
|
||||
|
||||
@register_architecture("spacy.LayerNormalizedMaxout.v1")
|
||||
def LayerNormalizedMaxout(config):
|
||||
width = config["width"]
|
||||
pieces = config["pieces"]
|
||||
layer = LayerNorm(Maxout(width, pieces=pieces))
|
||||
layer.nO = width
|
||||
return layer
|
176
spacy/ml/tok2vec.py
Normal file
|
@ -0,0 +1,176 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from thinc.api import chain, layerize, clone, concatenate, with_flatten, uniqued
|
||||
from thinc.api import noop, with_square_sequences
|
||||
from thinc.v2v import Maxout, Model
|
||||
from thinc.i2v import HashEmbed, StaticVectors
|
||||
from thinc.t2t import ExtractWindow
|
||||
from thinc.misc import Residual, LayerNorm, FeatureExtracter
|
||||
from ..util import make_layer, register_architecture
|
||||
from ._wire import concatenate_lists
|
||||
|
||||
|
||||
@register_architecture("spacy.Tok2Vec.v1")
|
||||
def Tok2Vec(config):
|
||||
doc2feats = make_layer(config["@doc2feats"])
|
||||
embed = make_layer(config["@embed"])
|
||||
encode = make_layer(config["@encode"])
|
||||
field_size = getattr(encode, "receptive_field", 0)
|
||||
tok2vec = chain(doc2feats, with_flatten(chain(embed, encode), pad=field_size))
|
||||
tok2vec.cfg = config
|
||||
tok2vec.nO = encode.nO
|
||||
tok2vec.embed = embed
|
||||
tok2vec.encode = encode
|
||||
return tok2vec
|
||||
|
||||
|
||||
@register_architecture("spacy.Doc2Feats.v1")
|
||||
def Doc2Feats(config):
|
||||
columns = config["columns"]
|
||||
return FeatureExtracter(columns)
|
||||
|
||||
|
||||
@register_architecture("spacy.MultiHashEmbed.v1")
|
||||
def MultiHashEmbed(config):
|
||||
# For backwards compatibility with models before the architecture registry,
|
||||
# we have to be careful to get exactly the same model structure. One subtle
|
||||
# trick is that when we define concatenation with the operator, the operator
|
||||
# is actually binary associative. So when we write (a | b | c), we're actually
|
||||
# getting concatenate(concatenate(a, b), c). That's why the implementation
|
||||
# is a bit ugly here.
|
||||
cols = config["columns"]
|
||||
width = config["width"]
|
||||
rows = config["rows"]
|
||||
|
||||
norm = HashEmbed(width, rows, column=cols.index("NORM"), name="embed_norm")
|
||||
if config["use_subwords"]:
|
||||
prefix = HashEmbed(
|
||||
width, rows // 2, column=cols.index("PREFIX"), name="embed_prefix"
|
||||
)
|
||||
suffix = HashEmbed(
|
||||
width, rows // 2, column=cols.index("SUFFIX"), name="embed_suffix"
|
||||
)
|
||||
shape = HashEmbed(
|
||||
width, rows // 2, column=cols.index("SHAPE"), name="embed_shape"
|
||||
)
|
||||
if config.get("@pretrained_vectors"):
|
||||
glove = make_layer(config["@pretrained_vectors"])
|
||||
mix = make_layer(config["@mix"])
|
||||
|
||||
with Model.define_operators({">>": chain, "|": concatenate}):
|
||||
if config["use_subwords"] and config["@pretrained_vectors"]:
|
||||
mix._layers[0].nI = width * 5
|
||||
layer = uniqued(
|
||||
(glove | norm | prefix | suffix | shape) >> mix,
|
||||
column=cols.index("ORTH"),
|
||||
)
|
||||
elif config["use_subwords"]:
|
||||
mix._layers[0].nI = width * 4
|
||||
layer = uniqued(
|
||||
(norm | prefix | suffix | shape) >> mix, column=cols.index("ORTH")
|
||||
)
|
||||
elif config["@pretrained_vectors"]:
|
||||
mix._layers[0].nI = width * 2
|
||||
layer = uniqued((glove | norm) >> mix, column=cols.index("ORTH"),)
|
||||
else:
|
||||
layer = norm
|
||||
layer.cfg = config
|
||||
return layer
|
||||
|
||||
|
||||
@register_architecture("spacy.CharacterEmbed.v1")
|
||||
def CharacterEmbed(config):
|
||||
from .. import _ml
|
||||
|
||||
width = config["width"]
|
||||
chars = config["chars"]
|
||||
|
||||
chr_embed = _ml.CharacterEmbedModel(nM=width, nC=chars)
|
||||
other_tables = make_layer(config["@embed_features"])
|
||||
mix = make_layer(config["@mix"])
|
||||
|
||||
model = chain(concatenate_lists(chr_embed, other_tables), mix)
|
||||
model.cfg = config
|
||||
return model
|
||||
|
||||
|
||||
@register_architecture("spacy.MaxoutWindowEncoder.v1")
|
||||
def MaxoutWindowEncoder(config):
|
||||
nO = config["width"]
|
||||
nW = config["window_size"]
|
||||
nP = config["pieces"]
|
||||
depth = config["depth"]
|
||||
|
||||
cnn = chain(
|
||||
ExtractWindow(nW=nW), LayerNorm(Maxout(nO, nO * ((nW * 2) + 1), pieces=nP))
|
||||
)
|
||||
model = clone(Residual(cnn), depth)
|
||||
model.nO = nO
|
||||
model.receptive_field = nW * depth
|
||||
return model
|
||||
|
||||
|
||||
@register_architecture("spacy.MishWindowEncoder.v1")
|
||||
def MishWindowEncoder(config):
|
||||
from thinc.v2v import Mish
|
||||
|
||||
nO = config["width"]
|
||||
nW = config["window_size"]
|
||||
depth = config["depth"]
|
||||
|
||||
cnn = chain(ExtractWindow(nW=nW), LayerNorm(Mish(nO, nO * ((nW * 2) + 1))))
|
||||
model = clone(Residual(cnn), depth)
|
||||
model.nO = nO
|
||||
return model
|
||||
|
||||
|
||||
@register_architecture("spacy.PretrainedVectors.v1")
|
||||
def PretrainedVectors(config):
|
||||
return StaticVectors(config["vectors_name"], config["width"], config["column"])
|
||||
|
||||
|
||||
@register_architecture("spacy.TorchBiLSTMEncoder.v1")
|
||||
def TorchBiLSTMEncoder(config):
|
||||
import torch.nn
|
||||
from thinc.extra.wrappers import PyTorchWrapperRNN
|
||||
|
||||
width = config["width"]
|
||||
depth = config["depth"]
|
||||
if depth == 0:
|
||||
return layerize(noop())
|
||||
return with_square_sequences(
|
||||
PyTorchWrapperRNN(torch.nn.LSTM(width, width // 2, depth, bidirectional=True))
|
||||
)
|
||||
|
||||
|
||||
_EXAMPLE_CONFIG = {
|
||||
"@doc2feats": {
|
||||
"arch": "Doc2Feats",
|
||||
"config": {"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]},
|
||||
},
|
||||
"@embed": {
|
||||
"arch": "spacy.MultiHashEmbed.v1",
|
||||
"config": {
|
||||
"width": 96,
|
||||
"rows": 2000,
|
||||
"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"],
|
||||
"use_subwords": True,
|
||||
"@pretrained_vectors": {
|
||||
"arch": "TransformedStaticVectors",
|
||||
"config": {
|
||||
"vectors_name": "en_vectors_web_lg.vectors",
|
||||
"width": 96,
|
||||
"column": 0,
|
||||
},
|
||||
},
|
||||
"@mix": {
|
||||
"arch": "LayerNormalizedMaxout",
|
||||
"config": {"width": 96, "pieces": 3},
|
||||
},
|
||||
},
|
||||
},
|
||||
"@encode": {
|
||||
"arch": "MaxoutWindowEncode",
|
||||
"config": {"width": 96, "window_size": 1, "depth": 4, "pieces": 3},
|
||||
},
|
||||
}
|
|
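The comment in `MultiHashEmbed` about `(a | b | c)` relies on how Python evaluates a chained binary operator: it groups to the left, so the expression builds a nested structure rather than a single three-way call. The sketch below demonstrates only that Python behaviour. `Node` and `concat_all` are hypothetical stand-ins and say nothing about how Thinc's `concatenate` treats its arguments internally.

```python
class Node(object):
    """Hypothetical stand-in for a layer; records how it was composed."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = tuple(children)

    def __or__(self, other):
        # A binary operator always combines exactly two operands.
        return Node("concat", (self, other))

    def __repr__(self):
        if not self.children:
            return self.name
        inner = ", ".join(repr(child) for child in self.children)
        return "%s(%s)" % (self.name, inner)


def concat_all(*layers):
    # Variadic call: all operands become siblings.
    return Node("concat", layers)


a, b, c = Node("a"), Node("b"), Node("c")
nested = repr(a | b | c)
flat = repr(concat_all(a, b, c))
```

Here `nested` is `concat(concat(a, b), c)` while `flat` is `concat(a, b, c)`. A model saved with one structure would not necessarily load into the other, which is consistent with the comment's concern about reproducing the exact model structure.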
@ -4,6 +4,7 @@ from __future__ import unicode_literals
|
|||
from collections import defaultdict, OrderedDict
|
||||
import srsly
|
||||
|
||||
from ..language import component
|
||||
from ..errors import Errors
|
||||
from ..compat import basestring_
|
||||
from ..util import ensure_path, to_disk, from_disk
|
||||
|
@ -13,6 +14,7 @@ from ..matcher import Matcher, PhraseMatcher
|
|||
DEFAULT_ENT_ID_SEP = "||"
|
||||
|
||||
|
||||
@component("entity_ruler", assigns=["doc.ents", "token.ent_type", "token.ent_iob"])
|
||||
class EntityRuler(object):
|
||||
"""The EntityRuler lets you add spans to the `Doc.ents` using token-based
|
||||
rules or exact phrase matches. It can be combined with the statistical
|
||||
|
@ -24,8 +26,6 @@ class EntityRuler(object):
|
|||
USAGE: https://spacy.io/usage/rule-based-matching#entityruler
|
||||
"""
|
||||
|
||||
name = "entity_ruler"
|
||||
|
||||
def __init__(self, nlp, phrase_matcher_attr=None, validate=False, **cfg):
|
||||
"""Initialize the entity ruler. If patterns are supplied here, they
|
||||
need to be a list of dictionaries with a `"label"` and `"pattern"`
|
||||
|
@ -64,10 +64,15 @@ class EntityRuler(object):
|
|||
self.phrase_matcher_attr = None
|
||||
self.phrase_matcher = PhraseMatcher(nlp.vocab, validate=validate)
|
||||
self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
|
||||
self._ent_ids = defaultdict(dict)
|
||||
patterns = cfg.get("patterns")
|
||||
if patterns is not None:
|
||||
self.add_patterns(patterns)
|
||||
|
||||
@classmethod
|
||||
def from_nlp(cls, nlp, **cfg):
|
||||
return cls(nlp, **cfg)
|
||||
|
||||
def __len__(self):
|
||||
"""The number of all patterns added to the entity ruler."""
|
||||
n_token_patterns = sum(len(p) for p in self.token_patterns.values())
|
||||
|
@ -100,10 +105,9 @@ class EntityRuler(object):
|
|||
continue
|
||||
# check for end - 1 here because boundaries are inclusive
|
||||
if start not in seen_tokens and end - 1 not in seen_tokens:
|
||||
if self.ent_ids:
|
||||
label_ = self.nlp.vocab.strings[match_id]
|
||||
ent_label, ent_id = self._split_label(label_)
|
||||
span = Span(doc, start, end, label=ent_label)
|
||||
if match_id in self._ent_ids:
|
||||
label, ent_id = self._ent_ids[match_id]
|
||||
span = Span(doc, start, end, label=label)
|
||||
if ent_id:
|
||||
for token in span:
|
||||
token.ent_id_ = ent_id
|
||||
|
@ -131,11 +135,11 @@ class EntityRuler(object):
|
|||
|
||||
@property
|
||||
def ent_ids(self):
|
||||
"""All entity ids present in the match patterns meta dicts.
|
||||
"""All entity ids present in the match patterns `id` properties.
|
||||
|
||||
RETURNS (set): The string entity ids.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#labels
|
||||
DOCS: https://spacy.io/api/entityruler#ent_ids
|
||||
"""
|
||||
all_ent_ids = set()
|
||||
for l in self.labels:
|
||||
|
@ -147,7 +151,6 @@ class EntityRuler(object):
|
|||
@property
|
||||
def patterns(self):
|
||||
"""Get all patterns that were added to the entity ruler.
|
||||
|
||||
RETURNS (list): The original patterns, one dictionary per pattern.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#patterns
|
||||
|
@ -188,11 +191,15 @@ class EntityRuler(object):
|
|||
]
|
||||
except ValueError:
|
||||
subsequent_pipes = []
|
||||
with self.nlp.disable_pipes(*subsequent_pipes):
|
||||
with self.nlp.disable_pipes(subsequent_pipes):
|
||||
for entry in patterns:
|
||||
label = entry["label"]
|
||||
if "id" in entry:
|
||||
ent_label = label
|
||||
label = self._create_label(label, entry["id"])
|
||||
key = self.matcher._normalize_key(label)
|
||||
self._ent_ids[key] = (ent_label, entry["id"])
|
||||
|
||||
pattern = entry["pattern"]
|
||||
if isinstance(pattern, basestring_):
|
||||
self.phrase_patterns[label].append(self.nlp(pattern))
|
||||
|
@ -201,9 +208,9 @@ class EntityRuler(object):
|
|||
else:
|
||||
raise ValueError(Errors.E097.format(pattern=pattern))
|
||||
for label, patterns in self.token_patterns.items():
|
||||
self.matcher.add(label, None, *patterns)
|
||||
self.matcher.add(label, patterns)
|
||||
for label, patterns in self.phrase_patterns.items():
|
||||
self.phrase_matcher.add(label, None, *patterns)
|
||||
self.phrase_matcher.add(label, patterns)
|
||||
|
||||
def _split_label(self, label):
|
||||
"""Split Entity label into ent_label and ent_id if it contains self.ent_id_sep
|
||||
|
|
|
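The `_ent_ids` bookkeeping added to `EntityRuler` replaces splitting the combined label at match time with a lookup that is filled in when patterns are added. A rough sketch of the idea follows, under several assumptions: `create_label`, `build_ent_ids` and `resolve` are hypothetical helpers, plain strings are used as keys where the diff uses `matcher._normalize_key(label)`, and the fallback for keys without an id is a guess, since that branch is not shown in the hunk.

```python
DEFAULT_ENT_ID_SEP = "||"


def create_label(label, ent_id, sep=DEFAULT_ENT_ID_SEP):
    # Hypothetical helper: joins label and id into one match key.
    return label + sep + ent_id


def build_ent_ids(patterns):
    """Map each combined key to its (label, id) pair at add time."""
    ent_ids = {}
    for entry in patterns:
        if "id" in entry:
            key = create_label(entry["label"], entry["id"])
            ent_ids[key] = (entry["label"], entry["id"])
    return ent_ids


def resolve(match_key, ent_ids):
    # Assumed fallback: a key with no stored id is itself the label.
    return ent_ids.get(match_key, (match_key, None))


patterns = [
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "GPE", "pattern": "San Francisco"},
]
ent_ids = build_ent_ids(patterns)
with_id = resolve("ORG||apple", ent_ids)
without_id = resolve("GPE", ent_ids)
```

Storing the pair up front means a label that happens to contain the separator no longer has to be parsed back out of the match key.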
@ -1,9 +1,16 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..language import component
|
||||
from ..matcher import Matcher
|
||||
from ..util import filter_spans
|
||||
|
||||
|
||||
@component(
|
||||
"merge_noun_chunks",
|
||||
requires=["token.dep", "token.tag", "token.pos"],
|
||||
retokenizes=True,
|
||||
)
|
||||
def merge_noun_chunks(doc):
|
||||
"""Merge noun chunks into a single token.
|
||||
|
||||
|
@ -21,6 +28,11 @@ def merge_noun_chunks(doc):
|
|||
return doc
|
||||
|
||||
|
||||
@component(
|
||||
"merge_entities",
|
||||
requires=["doc.ents", "token.ent_iob", "token.ent_type"],
|
||||
retokenizes=True,
|
||||
)
|
||||
def merge_entities(doc):
|
||||
"""Merge entities into a single token.
|
||||
|
||||
|
@ -36,6 +48,7 @@ def merge_entities(doc):
|
|||
return doc
|
||||
|
||||
|
||||
@component("merge_subtokens", requires=["token.dep"], retokenizes=True)
|
||||
def merge_subtokens(doc, label="subtok"):
|
||||
"""Merge subtokens into a single token.
|
||||
|
||||
|
@ -48,7 +61,7 @@ def merge_subtokens(doc, label="subtok"):
|
|||
merger = Matcher(doc.vocab)
|
||||
merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}])
|
||||
matches = merger(doc)
|
||||
spans = [doc[start : end + 1] for _, start, end in matches]
|
||||
spans = filter_spans([doc[start : end + 1] for _, start, end in matches])
|
||||
with doc.retokenize() as retokenizer:
|
||||
for span in spans:
|
||||
retokenizer.merge(span)
|
||||
|
|
|
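`merge_subtokens` now passes its matches through `filter_spans` before merging, presumably because a `"+"` pattern can return overlapping matches, and overlapping spans cannot all be merged in one retokenization pass. The body of `util.filter_spans` is not part of this diff, so the following is only a sketch of a longest-span-first strategy on `(start, end)` tuples with an exclusive end; `filter_spans_sketch` is a hypothetical name and the real function may differ in detail.

```python
def filter_spans_sketch(spans):
    """Keep the longest spans first and drop anything overlapping them."""
    # Longest first; among equal lengths, the earlier start wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        if not any(i in seen_tokens for i in range(start, end)):
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)


matches = [(0, 2), (1, 3), (0, 3), (5, 6)]
filtered = filter_spans_sketch(matches)
```

For the example input, `(0, 3)` is kept, the two shorter spans inside it are dropped, and the non-overlapping `(5, 6)` survives.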
@ -5,9 +5,11 @@ from thinc.t2v import Pooling, max_pool, mean_pool
|
|||
from thinc.neural._classes.difference import Siamese, CauchySimilarity
|
||||
|
||||
from .pipes import Pipe
|
||||
from ..language import component
|
||||
from .._ml import link_vectors_to_models
|
||||
|
||||
|
||||
@component("sentencizer_hook", assigns=["doc.user_hooks"])
|
||||
class SentenceSegmenter(object):
|
||||
"""A simple spaCy hook, to allow custom sentence boundary detection logic
|
||||
(that doesn't require the dependency parse). To change the sentence
|
||||
|
@ -17,8 +19,6 @@ class SentenceSegmenter(object):
|
|||
and yield `Span` objects for each sentence.
|
||||
"""
|
||||
|
||||
name = "sentencizer"
|
||||
|
||||
def __init__(self, vocab, strategy=None):
|
||||
self.vocab = vocab
|
||||
if strategy is None or strategy == "on_punct":
|
||||
|
@ -44,6 +44,7 @@ class SentenceSegmenter(object):
|
|||
yield doc[start : len(doc)]
|
||||
|
||||
|
||||
@component("similarity", assigns=["doc.user_hooks"])
|
||||
class SimilarityHook(Pipe):
|
||||
"""
|
||||
Experimental: A pipeline component to install a hook for supervised
|
||||
|
@ -58,8 +59,6 @@ class SimilarityHook(Pipe):
|
|||
Where W is a vector of dimension weights, initialized to 1.
|
||||
"""
|
||||
|
||||
name = "similarity"
|
||||
|
||||
def __init__(self, vocab, model=True, **cfg):
|
||||
self.vocab = vocab
|
||||
self.model = model
|
||||
|
|
|
@ -8,6 +8,7 @@ from thinc.api import chain
|
|||
from thinc.neural.util import to_categorical, copy_array, get_array_module
|
||||
from .. import util
|
||||
from .pipes import Pipe
|
||||
from ..language import component
|
||||
from .._ml import Tok2Vec, build_morphologizer_model
|
||||
from .._ml import link_vectors_to_models, zero_init, flatten
|
||||
from .._ml import create_default_optimizer
|
||||
|
@ -18,9 +19,9 @@ from ..vocab cimport Vocab
|
|||
from ..morphology cimport Morphology
|
||||
|
||||
|
||||
@component("morphologizer", assigns=["token.morph", "token.pos"])
|
||||
class Morphologizer(Pipe):
|
||||
name = 'morphologizer'
|
||||
|
||||
|
||||
@classmethod
|
||||
def Model(cls, **cfg):
|
||||
if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'):
|
||||
|
|
|
@ -13,7 +13,6 @@ from thinc.misc import LayerNorm
|
|||
from thinc.neural.util import to_categorical
|
||||
from thinc.neural.util import get_array_module
|
||||
|
||||
from .functions import merge_subtokens
|
||||
from ..tokens.doc cimport Doc
|
||||
from ..syntax.nn_parser cimport Parser
|
||||
from ..syntax.ner cimport BiluoPushDown
|
||||
|
@ -21,6 +20,8 @@ from ..syntax.arc_eager cimport ArcEager
|
|||
from ..morphology cimport Morphology
|
||||
from ..vocab cimport Vocab
|
||||
|
||||
from .functions import merge_subtokens
|
||||
from ..language import Language, component
|
||||
from ..syntax import nonproj
|
||||
from ..attrs import POS, ID
|
||||
from ..parts_of_speech import X
|
||||
|
@ -54,6 +55,10 @@ class Pipe(object):
|
|||
"""Initialize a model for the pipe."""
|
||||
raise NotImplementedError
|
||||
|
||||
@classmethod
|
||||
def from_nlp(cls, nlp, **cfg):
|
||||
return cls(nlp.vocab, **cfg)
|
||||
|
||||
def __init__(self, vocab, model=True, **cfg):
|
||||
"""Create a new pipe instance."""
|
||||
raise NotImplementedError
|
||||
|
@ -223,11 +228,10 @@ class Pipe(object):
|
|||
return self
|
||||
|
||||
|
||||
@component("tensorizer", assigns=["doc.tensor"])
|
||||
class Tensorizer(Pipe):
|
||||
"""Pre-train position-sensitive vectors for tokens."""
|
||||
|
||||
name = "tensorizer"
|
||||
|
||||
@classmethod
|
||||
def Model(cls, output_size=300, **cfg):
|
||||
"""Create a new statistical model for the class.
|
||||
|
@ -362,14 +366,13 @@ class Tensorizer(Pipe):
|
|||
return sgd
|
||||
|
||||
|
||||
@component("tagger", assigns=["token.tag", "token.pos"])
|
||||
class Tagger(Pipe):
|
||||
"""Pipeline component for part-of-speech tagging.
|
||||
|
||||
DOCS: https://spacy.io/api/tagger
|
||||
"""
|
||||
|
||||
name = "tagger"
|
||||
|
||||
def __init__(self, vocab, model=True, **cfg):
|
||||
self.vocab = vocab
|
||||
self.model = model
|
||||
|
@ -514,7 +517,6 @@ class Tagger(Pipe):
|
|||
orig_tag_map = dict(self.vocab.morphology.tag_map)
|
||||
new_tag_map = OrderedDict()
|
||||
for raw_text, annots_brackets in get_gold_tuples():
|
||||
_ = annots_brackets.pop()
|
||||
for annots, brackets in annots_brackets:
|
||||
ids, words, tags, heads, deps, ents = annots
|
||||
for tag in tags:
|
||||
|
@ -657,13 +659,12 @@ class Tagger(Pipe):
|
|||
return self
|
||||
|
||||
|
||||
@component("nn_labeller")
|
||||
class MultitaskObjective(Tagger):
|
||||
"""Experimental: Assist training of a parser or tagger, by training a
|
||||
side-objective.
|
||||
"""
|
||||
|
||||
name = "nn_labeller"
|
||||
|
||||
def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg):
|
||||
self.vocab = vocab
|
||||
self.model = model
|
||||
|
@ -898,12 +899,12 @@ class ClozeMultitask(Pipe):
|
|||
losses[self.name] += loss
|
||||
|
||||
|
||||
@component("textcat", assigns=["doc.cats"])
|
||||
class TextCategorizer(Pipe):
|
||||
"""Pipeline component for text classification.
|
||||
|
||||
DOCS: https://spacy.io/api/textcategorizer
|
||||
"""
|
||||
name = 'textcat'
|
||||
|
||||
@classmethod
|
||||
def Model(cls, nr_class=1, **cfg):
|
||||
|
@ -1032,10 +1033,10 @@ class TextCategorizer(Pipe):
|
|||
return 1
|
||||
|
||||
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
|
||||
for raw_text, annots_brackets in get_gold_tuples():
|
||||
cats = annots_brackets.pop()
|
||||
for cat in cats:
|
||||
self.add_label(cat)
|
||||
for raw_text, annot_brackets in get_gold_tuples():
|
||||
for _, (cats, _2) in annot_brackets:
|
||||
for cat in cats:
|
||||
self.add_label(cat)
|
||||
if self.model is True:
|
||||
self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
|
||||
self.require_labels()
|
||||
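Note on the `TextCategorizer.begin_training` hunk above: the new loop reads the categories from each sentence-level entry instead of popping them off the end of the list. Below is a minimal sketch of that traversal. The tuple layout is only inferred from the unpacking in the loop, and `collect_labels` and the sample data are hypothetical, not part of spaCy.

```python
def collect_labels(gold_tuples):
    """Collect category labels in first-seen order.

    Assumed layout (inferred from the loop in the diff, not confirmed):
    each item is (raw_text, [(annots, (cats, brackets)), ...]).
    """
    labels = []
    for _raw_text, annot_brackets in gold_tuples:
        for _annots, (cats, _brackets) in annot_brackets:
            # Iterating a dict of label -> score yields the label names.
            for cat in cats:
                if cat not in labels:
                    labels.append(cat)
    return labels


# Hypothetical sample data, only for illustration.
sample = [
    ("Great film", [((), ({"POSITIVE": 1.0, "NEGATIVE": 0.0}, []))]),
    ("It was fine", [((), ({"POSITIVE": 0.0, "NEUTRAL": 1.0}, []))]),
]
found = collect_labels(sample)
```

In the component itself each label is passed to `self.add_label(cat)` rather than collected in a list.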
|
@ -1051,8 +1052,11 @@ cdef class DependencyParser(Parser):
|
|||
|
||||
DOCS: https://spacy.io/api/dependencyparser
|
||||
"""
|
||||
|
||||
# cdef classes can't have decorators, so we're defining this here
|
||||
name = "parser"
|
||||
factory = "parser"
|
||||
assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
|
||||
requires = []
|
||||
TransitionSystem = ArcEager
|
||||
|
||||
@property
|
||||
|
@ -1097,8 +1101,10 @@ cdef class EntityRecognizer(Parser):
|
|||
|
||||
DOCS: https://spacy.io/api/entityrecognizer
|
||||
"""
|
||||
|
||||
name = "ner"
|
||||
factory = "ner"
|
||||
assigns = ["doc.ents", "token.ent_iob", "token.ent_type"]
|
||||
requires = []
|
||||
TransitionSystem = BiluoPushDown
|
||||
nr_feature = 6
|
||||
|
||||
|
@ -1129,12 +1135,16 @@ cdef class EntityRecognizer(Parser):
|
|||
return tuple(sorted(labels))
|
||||
|
||||
|
||||
@component(
|
||||
"entity_linker",
|
||||
requires=["doc.ents", "token.ent_iob", "token.ent_type"],
|
||||
assigns=["token.ent_kb_id"]
|
||||
)
|
||||
class EntityLinker(Pipe):
|
||||
"""Pipeline component for named entity linking.
|
||||
|
||||
DOCS: https://spacy.io/api/entitylinker
|
||||
"""
|
||||
name = 'entity_linker'
|
||||
NIL = "NIL" # string used to refer to a non-existing link
|
||||
|
||||
@classmethod
|
||||
|
@ -1298,7 +1308,8 @@ class EntityLinker(Pipe):
|
|||
for ent in sent_doc.ents:
|
||||
entity_count += 1
|
||||
|
||||
if ent.label_ in self.cfg.get("labels_discard", []):
|
||||
to_discard = self.cfg.get("labels_discard", [])
|
||||
if to_discard and ent.label_ in to_discard:
|
||||
# ignoring this entity - setting to NIL
|
||||
final_kb_ids.append(self.NIL)
|
||||
final_tensors.append(sentence_encoding)
|
||||
|
@ -1404,13 +1415,13 @@ class EntityLinker(Pipe):
|
|||
raise NotImplementedError
|
||||
|
||||
|
||||
@component("sentencizer", assigns=["token.is_sent_start", "doc.sents"])
|
||||
class Sentencizer(object):
|
||||
"""Segment the Doc into sentences using a rule-based strategy.
|
||||
|
||||
DOCS: https://spacy.io/api/sentencizer
|
||||
"""
|
||||
|
||||
name = "sentencizer"
|
||||
default_punct_chars = ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹',
|
||||
'।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄',
|
||||
'᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿',
|
||||
|
@ -1436,6 +1447,10 @@ class Sentencizer(object):
|
|||
else:
|
||||
self.punct_chars = set(self.default_punct_chars)
|
||||
|
||||
@classmethod
|
||||
def from_nlp(cls, nlp, **cfg):
|
||||
return cls(**cfg)
|
||||
|
||||
def __call__(self, doc):
|
||||
"""Apply the sentencizer to a Doc and set Token.is_sent_start.
|
||||
|
||||
|
@ -1502,4 +1517,9 @@ class Sentencizer(object):
|
|||
return self
|
||||
|
||||
|
||||
# Cython classes can't be decorated, so we need to add the factories here
|
||||
Language.factories["parser"] = lambda nlp, **cfg: DependencyParser.from_nlp(nlp, **cfg)
|
||||
Language.factories["ner"] = lambda nlp, **cfg: EntityRecognizer.from_nlp(nlp, **cfg)
|
||||
|
||||
|
||||
__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"]
|
||||
|
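Note on the `@component(...)` decorators and `from_nlp` classmethods added in this file: the pattern appears to be that the decorator attaches metadata (`name`, `assigns`, `requires`) to the class and registers a factory that defers to `from_nlp`. The sketch below illustrates that idea in plain Python. `FACTORIES`, `FakeNLP` and the simplified classes are hypothetical stand-ins, and the real decorator in `spacy.language` may differ in its details.

```python
FACTORIES = {}  # hypothetical stand-in for Language.factories


def component(name, assigns=(), requires=()):
    """Hypothetical decorator: tag a class with metadata and register a factory."""

    def decorate(cls):
        cls.name = name
        cls.factory = name
        cls.assigns = list(assigns)
        cls.requires = list(requires)

        def factory(nlp, **cfg):
            # Prefer the class's own from_nlp hook when it defines one.
            if hasattr(cls, "from_nlp"):
                return cls.from_nlp(nlp, **cfg)
            return cls(**cfg)

        FACTORIES[name] = factory
        return cls

    return decorate


class FakeNLP:
    """Minimal stand-in for a Language object; it only carries a vocab."""

    def __init__(self, vocab):
        self.vocab = vocab


@component("tagger", assigns=["token.tag", "token.pos"])
class Tagger:
    def __init__(self, vocab, **cfg):
        self.vocab = vocab
        self.cfg = cfg

    @classmethod
    def from_nlp(cls, nlp, **cfg):
        return cls(nlp.vocab, **cfg)


@component("sentencizer", assigns=["token.is_sent_start", "doc.sents"])
class Sentencizer:
    def __init__(self, punct_chars=None):
        self.punct_chars = set(punct_chars or [".", "!", "?"])

    @classmethod
    def from_nlp(cls, nlp, **cfg):
        # This component needs no vocab, so the nlp object is ignored.
        return cls(**cfg)
```

Because `cdef` classes cannot be decorated, the diff sets the same attributes by hand on `DependencyParser` and `EntityRecognizer` and assigns their factories explicitly at the end of the module.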
|
|
@ -19,7 +19,7 @@ from thinc.extra.search cimport Beam
|
|||
from thinc.api import chain, clone
|
||||
from thinc.v2v import Model, Maxout, Affine
|
||||
from thinc.misc import LayerNorm
|
||||
from thinc.neural.ops import CupyOps
|
||||
from thinc.neural.ops import CupyOps, NumpyOps
|
||||
from thinc.neural.util import get_array_module
|
||||
from thinc.linalg cimport Vec, VecVec
|
||||
cimport blis.cy
|
||||
|
@ -440,28 +440,38 @@ cdef class precompute_hiddens:
|
|||
def backward(d_state_vector_ids, sgd=None):
|
||||
d_state_vector, token_ids = d_state_vector_ids
|
||||
d_state_vector = bp_nonlinearity(d_state_vector, sgd)
|
||||
# This will usually be on GPU
|
||||
if not isinstance(d_state_vector, self.ops.xp.ndarray):
|
||||
d_state_vector = self.ops.xp.array(d_state_vector)
|
||||
d_tokens = bp_hiddens((d_state_vector, token_ids), sgd)
|
||||
return d_tokens
|
||||
return state_vector, backward
|
||||
|
||||
def _nonlinearity(self, state_vector):
|
||||
if isinstance(state_vector, numpy.ndarray):
|
||||
ops = NumpyOps()
|
||||
else:
|
||||
ops = CupyOps()
|
||||
|
||||
if self.nP == 1:
|
||||
state_vector = state_vector.reshape(state_vector.shape[:-1])
|
||||
mask = state_vector >= 0.
|
||||
state_vector *= mask
|
||||
else:
|
||||
state_vector, mask = self.ops.maxout(state_vector)
|
||||
state_vector, mask = ops.maxout(state_vector)
|
||||
|
||||
def backprop_nonlinearity(d_best, sgd=None):
|
||||
if isinstance(d_best, numpy.ndarray):
|
||||
ops = NumpyOps()
|
||||
else:
|
||||
ops = CupyOps()
|
||||
mask_ = ops.asarray(mask)
|
||||
|
||||
# This will usually be on GPU
|
||||
d_best = ops.asarray(d_best)
|
||||
# Fix nans (which can occur from unseen classes.)
|
||||
d_best[self.ops.xp.isnan(d_best)] = 0.
|
||||
d_best[ops.xp.isnan(d_best)] = 0.
|
||||
if self.nP == 1:
|
||||
d_best *= mask
|
||||
d_best *= mask_
|
||||
d_best = d_best.reshape((d_best.shape + (1,)))
|
||||
return d_best
|
||||
else:
|
||||
return self.ops.backprop_maxout(d_best, mask, self.nP)
|
||||
return ops.backprop_maxout(d_best, mask_, self.nP)
|
||||
return state_vector, backprop_nonlinearity
|
||||
|
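Note on the `_nonlinearity` hunk above: in the `nP == 1` branch the forward pass zeroes negative activations and keeps the mask, and the backward pass replaces NaN gradients with zero before applying the same mask. The sketch below shows just that logic on plain Python lists. It is an illustration under simplifying assumptions (no NumPy/CuPy arrays, no reshaping, no maxout branch), and `relu_forward` is a hypothetical name.

```python
import math


def relu_forward(state_vector):
    """Return the rectified values and a callback that backpropagates through them."""
    # Keep entries that are >= 0, as in `mask = state_vector >= 0.`
    mask = [x >= 0.0 for x in state_vector]
    output = [x if keep else 0.0 for x, keep in zip(state_vector, mask)]

    def backprop(d_best):
        # NaN gradients (which the source says can come from unseen classes)
        # are replaced with zero before the mask is applied.
        cleaned = [0.0 if math.isnan(g) else g for g in d_best]
        return [g if keep else 0.0 for g, keep in zip(cleaned, mask)]

    return output, backprop


out, bp = relu_forward([1.5, -2.0, 0.0])
grads = bp([0.5, 0.7, float("nan")])
```

The change in the diff is about which `ops` object performs these steps: it is now chosen from the type of the incoming array instead of always using `self.ops`.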
|
|
@ -342,7 +342,6 @@ cdef class ArcEager(TransitionSystem):
|
|||
actions[RIGHT][label] = 1
|
||||
actions[REDUCE][label] = 1
|
||||
for raw_text, sents in kwargs.get('gold_parses', []):
|
||||
_ = sents.pop()
|
||||
for (ids, words, tags, heads, labels, iob), ctnts in sents:
|
||||
heads, labels = nonproj.projectivize(heads, labels)
|
||||
for child, head, label in zip(ids, heads, labels):
|
||||
|
|
|
@ -73,7 +73,6 @@ cdef class BiluoPushDown(TransitionSystem):
|
|||
actions[action][entity_type] = 1
|
||||
moves = ('M', 'B', 'I', 'L', 'U')
|
||||
for raw_text, sents in kwargs.get('gold_parses', []):
|
||||
_ = sents.pop()
|
||||
for (ids, words, tags, heads, labels, biluo), _ in sents:
|
||||
for i, ner_tag in enumerate(biluo):
|
||||
if ner_tag != 'O' and ner_tag != '-':
|
||||
|
|
|
@ -57,7 +57,10 @@ cdef class Parser:
|
|||
subword_features = util.env_opt('subword_features',
|
||||
cfg.get('subword_features', True))
|
||||
conv_depth = util.env_opt('conv_depth', cfg.get('conv_depth', 4))
|
||||
conv_window = util.env_opt('conv_window', cfg.get('conv_window', 1))
|
||||
t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3))
|
||||
bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0))
|
||||
self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0))
|
||||
if depth != 1:
|
||||
raise ValueError(TempErrors.T004.format(value=depth))
|
||||
parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
|
||||
|
@ -69,6 +72,8 @@ cdef class Parser:
|
|||
pretrained_vectors = cfg.get('pretrained_vectors', None)
|
||||
tok2vec = Tok2Vec(token_vector_width, embed_size,
|
||||
conv_depth=conv_depth,
|
||||
conv_window=conv_window,
|
||||
cnn_maxout_pieces=t2v_pieces,
|
||||
subword_features=subword_features,
|
||||
pretrained_vectors=pretrained_vectors,
|
||||
bilstm_depth=bilstm_depth)
|
||||
|
@ -90,7 +95,12 @@ cdef class Parser:
|
|||
'hidden_width': hidden_width,
|
||||
'maxout_pieces': parser_maxout_pieces,
|
||||
'pretrained_vectors': pretrained_vectors,
|
||||
'bilstm_depth': bilstm_depth
|
||||
'bilstm_depth': bilstm_depth,
|
||||
'self_attn_depth': self_attn_depth,
|
||||
'conv_depth': conv_depth,
|
||||
'conv_window': conv_window,
|
||||
'embed_size': embed_size,
|
||||
'cnn_maxout_pieces': t2v_pieces
|
||||
}
|
||||
return ParserModel(tok2vec, lower, upper), cfg
|
||||
|
||||
|
@ -128,6 +138,10 @@ cdef class Parser:
|
|||
self._multitasks = []
|
||||
self._rehearsal_model = None
|
||||
|
||||
@classmethod
|
||||
def from_nlp(cls, nlp, **cfg):
|
||||
return cls(nlp.vocab, **cfg)
|
||||
|
||||
def __reduce__(self):
|
||||
return (Parser, (self.vocab, self.moves, self.model), None, None)
|
||||
|
||||
|
@ -602,12 +616,11 @@ cdef class Parser:
|
|||
doc_sample = []
|
||||
gold_sample = []
|
||||
for raw_text, annots_brackets in islice(get_gold_tuples(), 1000):
|
||||
_ = annots_brackets.pop()
|
||||
for annots, brackets in annots_brackets:
|
||||
ids, words, tags, heads, deps, ents = annots
|
||||
doc_sample.append(Doc(self.vocab, words=words))
|
||||
gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags,
|
||||
heads=heads, deps=deps, ents=ents))
|
||||
heads=heads, deps=deps, entities=ents))
|
||||
self.model.begin_training(doc_sample, gold_sample)
|
||||
if pipeline is not None:
|
||||
self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg)
|
||||
|
|
|
@ -2,9 +2,9 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.lang.sv.syntax_iterators import SYNTAX_ITERATORS
|
||||
from ...util import get_doc
|
||||
|
||||
|
||||
SV_NP_TEST_EXAMPLES = [
|
||||
(
|
||||
"En student läste en bok", # A student read a book
|
||||
|
@ -45,4 +45,3 @@ def test_sv_noun_chunks(sv_tokenizer, text, pos, deps, heads, expected_noun_chun
|
|||
assert len(noun_chunks) == len(expected_noun_chunks)
|
||||
for i, np in enumerate(noun_chunks):
|
||||
assert np.text == expected_noun_chunks[i]
|
||||
|
||||
|
|
|
@ -17,7 +17,7 @@ def matcher(en_vocab):
|
|||
}
|
||||
matcher = Matcher(en_vocab)
|
||||
for key, patterns in rules.items():
|
||||
matcher.add(key, None, *patterns)
|
||||
matcher.add(key, patterns)
|
||||
return matcher
|
||||
|
||||
|
||||
|
@ -25,11 +25,11 @@ def test_matcher_from_api_docs(en_vocab):
|
|||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"ORTH": "test"}]
|
||||
assert len(matcher) == 0
|
||||
matcher.add("Rule", None, pattern)
|
||||
matcher.add("Rule", [pattern])
|
||||
assert len(matcher) == 1
|
||||
matcher.remove("Rule")
|
||||
assert "Rule" not in matcher
|
||||
matcher.add("Rule", None, pattern)
|
||||
matcher.add("Rule", [pattern])
|
||||
assert "Rule" in matcher
|
||||
on_match, patterns = matcher.get("Rule")
|
||||
assert len(patterns[0])
|
||||
|
@ -52,7 +52,7 @@ def test_matcher_from_usage_docs(en_vocab):
|
|||
token.vocab[token.text].norm_ = "happy emoji"
|
||||
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add("HAPPY", label_sentiment, *pos_patterns)
|
||||
matcher.add("HAPPY", pos_patterns, on_match=label_sentiment)
|
||||
matcher(doc)
|
||||
assert doc.sentiment != 0
|
||||
assert doc[1].norm_ == "happy emoji"
|
||||
|
@ -60,11 +60,33 @@ def test_matcher_from_usage_docs(en_vocab):
|
|||
|
||||
def test_matcher_len_contains(matcher):
|
||||
assert len(matcher) == 3
|
||||
matcher.add("TEST", None, [{"ORTH": "test"}])
|
||||
matcher.add("TEST", [[{"ORTH": "test"}]])
|
||||
assert "TEST" in matcher
|
||||
assert "TEST2" not in matcher
|
||||
|
||||
|
||||
def test_matcher_add_new_old_api(en_vocab):
    doc = Doc(en_vocab, words=["a", "b"])
    patterns = [[{"TEXT": "a"}], [{"TEXT": "a"}, {"TEXT": "b"}]]
    matcher = Matcher(en_vocab)
    matcher.add("OLD_API", None, *patterns)
    assert len(matcher(doc)) == 2
    matcher = Matcher(en_vocab)
    on_match = Mock()
    matcher.add("OLD_API_CALLBACK", on_match, *patterns)
    assert len(matcher(doc)) == 2
    assert on_match.call_count == 2
    # New API: add(key: str, patterns: List[List[dict]], on_match: Callable)
    matcher = Matcher(en_vocab)
    matcher.add("NEW_API", patterns)
    assert len(matcher(doc)) == 2
    matcher = Matcher(en_vocab)
    on_match = Mock()
    matcher.add("NEW_API_CALLBACK", patterns, on_match=on_match)
    assert len(matcher(doc)) == 2
    assert on_match.call_count == 2
|
||||
|
||||
|
||||
def test_matcher_no_match(matcher):
|
||||
doc = Doc(matcher.vocab, words=["I", "like", "cheese", "."])
|
||||
assert matcher(doc) == []
|
||||
|
@ -100,12 +122,12 @@ def test_matcher_empty_dict(en_vocab):
|
|||
"""Test matcher allows empty token specs, meaning match on any token."""
|
||||
matcher = Matcher(en_vocab)
|
||||
doc = Doc(matcher.vocab, words=["a", "b", "c"])
|
||||
matcher.add("A.C", None, [{"ORTH": "a"}, {}, {"ORTH": "c"}])
|
||||
matcher.add("A.C", [[{"ORTH": "a"}, {}, {"ORTH": "c"}]])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 1
|
||||
assert matches[0][1:] == (0, 3)
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add("A.", None, [{"ORTH": "a"}, {}])
|
||||
matcher.add("A.", [[{"ORTH": "a"}, {}]])
|
||||
matches = matcher(doc)
|
||||
assert matches[0][1:] == (0, 2)
|
||||
|
||||
|
@ -114,7 +136,7 @@ def test_matcher_operator_shadow(en_vocab):
|
|||
matcher = Matcher(en_vocab)
|
||||
doc = Doc(matcher.vocab, words=["a", "b", "c"])
|
||||
pattern = [{"ORTH": "a"}, {"IS_ALPHA": True, "OP": "+"}, {"ORTH": "c"}]
|
||||
matcher.add("A.C", None, pattern)
|
||||
matcher.add("A.C", [pattern])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 1
|
||||
assert matches[0][1:] == (0, 3)
|
||||
|
@ -136,12 +158,12 @@ def test_matcher_match_zero(matcher):
|
|||
{"IS_PUNCT": True},
|
||||
{"ORTH": '"'},
|
||||
]
|
||||
matcher.add("Quote", None, pattern1)
|
||||
matcher.add("Quote", [pattern1])
|
||||
doc = Doc(matcher.vocab, words=words1)
|
||||
assert len(matcher(doc)) == 1
|
||||
doc = Doc(matcher.vocab, words=words2)
|
||||
assert len(matcher(doc)) == 0
|
||||
matcher.add("Quote", None, pattern2)
|
||||
matcher.add("Quote", [pattern2])
|
||||
assert len(matcher(doc)) == 0
|
||||
|
||||
|
||||
|
@ -149,7 +171,7 @@ def test_matcher_match_zero_plus(matcher):
|
|||
words = 'He said , " some words " ...'.split()
|
||||
pattern = [{"ORTH": '"'}, {"OP": "*", "IS_PUNCT": False}, {"ORTH": '"'}]
|
||||
matcher = Matcher(matcher.vocab)
|
||||
matcher.add("Quote", None, pattern)
|
||||
matcher.add("Quote", [pattern])
|
||||
doc = Doc(matcher.vocab, words=words)
|
||||
assert len(matcher(doc)) == 1
|
||||
|
||||
|
@ -160,11 +182,8 @@ def test_matcher_match_one_plus(matcher):
|
|||
doc = Doc(control.vocab, words=["Philippe", "Philippe"])
|
||||
m = control(doc)
|
||||
assert len(m) == 2
|
||||
matcher.add(
|
||||
"KleenePhilippe",
|
||||
None,
|
||||
[{"ORTH": "Philippe", "OP": "1"}, {"ORTH": "Philippe", "OP": "+"}],
|
||||
)
|
||||
pattern = [{"ORTH": "Philippe", "OP": "1"}, {"ORTH": "Philippe", "OP": "+"}]
|
||||
matcher.add("KleenePhilippe", [pattern])
|
||||
m = matcher(doc)
|
||||
assert len(m) == 1
|
||||
|
||||
|
@ -172,7 +191,7 @@ def test_matcher_match_one_plus(matcher):
|
|||
def test_matcher_any_token_operator(en_vocab):
|
||||
"""Test that patterns with "any token" {} work with operators."""
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add("TEST", None, [{"ORTH": "test"}, {"OP": "*"}])
|
||||
matcher.add("TEST", [[{"ORTH": "test"}, {"OP": "*"}]])
|
||||
doc = Doc(en_vocab, words=["test", "hello", "world"])
|
||||
matches = [doc[start:end].text for _, start, end in matcher(doc)]
|
||||
assert len(matches) == 3
|
||||
|
@ -186,7 +205,7 @@ def test_matcher_extension_attribute(en_vocab):
|
|||
get_is_fruit = lambda token: token.text in ("apple", "banana")
|
||||
Token.set_extension("is_fruit", getter=get_is_fruit, force=True)
|
||||
pattern = [{"ORTH": "an"}, {"_": {"is_fruit": True}}]
|
||||
matcher.add("HAVING_FRUIT", None, pattern)
|
||||
matcher.add("HAVING_FRUIT", [pattern])
|
||||
doc = Doc(en_vocab, words=["an", "apple"])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 1
|
||||
|
@ -198,7 +217,7 @@ def test_matcher_extension_attribute(en_vocab):
|
|||
def test_matcher_set_value(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"ORTH": {"IN": ["an", "a"]}}]
|
||||
matcher.add("A_OR_AN", None, pattern)
|
||||
matcher.add("A_OR_AN", [pattern])
|
||||
doc = Doc(en_vocab, words=["an", "a", "apple"])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
@ -210,7 +229,7 @@ def test_matcher_set_value(en_vocab):
|
|||
def test_matcher_set_value_operator(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"ORTH": {"IN": ["a", "the"]}, "OP": "?"}, {"ORTH": "house"}]
|
||||
matcher.add("DET_HOUSE", None, pattern)
|
||||
matcher.add("DET_HOUSE", [pattern])
|
||||
doc = Doc(en_vocab, words=["In", "a", "house"])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
@ -222,7 +241,7 @@ def test_matcher_set_value_operator(en_vocab):
|
|||
def test_matcher_regex(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"ORTH": {"REGEX": r"(?:a|an)"}}]
|
||||
matcher.add("A_OR_AN", None, pattern)
|
||||
matcher.add("A_OR_AN", [pattern])
|
||||
doc = Doc(en_vocab, words=["an", "a", "hi"])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
@ -234,7 +253,7 @@ def test_matcher_regex(en_vocab):
|
|||
def test_matcher_regex_shape(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"SHAPE": {"REGEX": r"^[^x]+$"}}]
|
||||
matcher.add("NON_ALPHA", None, pattern)
|
||||
matcher.add("NON_ALPHA", [pattern])
|
||||
doc = Doc(en_vocab, words=["99", "problems", "!"])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
@ -246,7 +265,7 @@ def test_matcher_regex_shape(en_vocab):
|
|||
def test_matcher_compare_length(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"LENGTH": {">=": 2}}]
|
||||
matcher.add("LENGTH_COMPARE", None, pattern)
|
||||
matcher.add("LENGTH_COMPARE", [pattern])
|
||||
doc = Doc(en_vocab, words=["a", "aa", "aaa"])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
@ -260,7 +279,7 @@ def test_matcher_extension_set_membership(en_vocab):
|
|||
get_reversed = lambda token: "".join(reversed(token.text))
|
||||
Token.set_extension("reversed", getter=get_reversed, force=True)
|
||||
pattern = [{"_": {"reversed": {"IN": ["eyb", "ih"]}}}]
|
||||
matcher.add("REVERSED", None, pattern)
|
||||
matcher.add("REVERSED", [pattern])
|
||||
doc = Doc(en_vocab, words=["hi", "bye", "hello"])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
@ -328,9 +347,9 @@ def dependency_matcher(en_vocab):
|
|||
]
|
||||
|
||||
matcher = DependencyMatcher(en_vocab)
|
||||
matcher.add("pattern1", None, pattern1)
|
||||
matcher.add("pattern2", None, pattern2)
|
||||
matcher.add("pattern3", None, pattern3)
|
||||
matcher.add("pattern1", [pattern1])
|
||||
matcher.add("pattern2", [pattern2])
|
||||
matcher.add("pattern3", [pattern3])
|
||||
|
||||
return matcher
|
||||
|
||||
|
@ -347,6 +366,14 @@ def test_dependency_matcher_compile(dependency_matcher):
|
|||
# assert matches[2][1] == [[4, 3, 2]]
|
||||
|
||||
|
||||
def test_matcher_basic_check(en_vocab):
    matcher = Matcher(en_vocab)
    # Potential mistake: pass in pattern instead of list of patterns
    pattern = [{"TEXT": "hello"}, {"TEXT": "world"}]
    with pytest.raises(ValueError):
        matcher.add("TEST", pattern)
|
||||
|
||||
|
||||
def test_attr_pipeline_checks(en_vocab):
|
||||
doc1 = Doc(en_vocab, words=["Test"])
|
||||
doc1.is_parsed = True
|
||||
|
@ -355,7 +382,7 @@ def test_attr_pipeline_checks(en_vocab):
|
|||
doc3 = Doc(en_vocab, words=["Test"])
|
||||
# DEP requires is_parsed
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add("TEST", None, [{"DEP": "a"}])
|
||||
matcher.add("TEST", [[{"DEP": "a"}]])
|
||||
matcher(doc1)
|
||||
with pytest.raises(ValueError):
|
||||
matcher(doc2)
|
||||
|
@ -364,7 +391,7 @@ def test_attr_pipeline_checks(en_vocab):
|
|||
# TAG, POS, LEMMA require is_tagged
|
||||
for attr in ("TAG", "POS", "LEMMA"):
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add("TEST", None, [{attr: "a"}])
|
||||
matcher.add("TEST", [[{attr: "a"}]])
|
||||
matcher(doc2)
|
||||
with pytest.raises(ValueError):
|
||||
matcher(doc1)
|
||||
|
@ -372,12 +399,12 @@ def test_attr_pipeline_checks(en_vocab):
|
|||
matcher(doc3)
|
||||
# TEXT/ORTH only require tokens
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add("TEST", None, [{"ORTH": "a"}])
|
||||
matcher.add("TEST", [[{"ORTH": "a"}]])
|
||||
matcher(doc1)
|
||||
matcher(doc2)
|
||||
matcher(doc3)
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add("TEST", None, [{"TEXT": "a"}])
|
||||
matcher.add("TEST", [[{"TEXT": "a"}]])
|
||||
matcher(doc1)
|
||||
matcher(doc2)
|
||||
matcher(doc3)
|
||||
|
@ -407,7 +434,7 @@ def test_attr_pipeline_checks(en_vocab):
|
|||
def test_matcher_schema_token_attributes(en_vocab, pattern, text):
|
||||
matcher = Matcher(en_vocab)
|
||||
doc = Doc(en_vocab, words=text.split(" "))
|
||||
matcher.add("Rule", None, pattern)
|
||||
matcher.add("Rule", [pattern])
|
||||
assert len(matcher) == 1
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 1
|
||||
|
@ -417,7 +444,7 @@ def test_matcher_valid_callback(en_vocab):
|
|||
"""Test that on_match can only be None or callable."""
|
||||
matcher = Matcher(en_vocab)
|
||||
with pytest.raises(ValueError):
|
||||
matcher.add("TEST", [], [{"TEXT": "test"}])
|
||||
matcher.add("TEST", [[{"TEXT": "test"}]], on_match=[])
|
||||
matcher(Doc(en_vocab, words=["test"]))
|
||||
|
||||
|
||||
|
@ -425,7 +452,7 @@ def test_matcher_callback(en_vocab):
|
|||
mock = Mock()
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"ORTH": "test"}]
|
||||
matcher.add("Rule", mock, pattern)
|
||||
matcher.add("Rule", [pattern], on_match=mock)
|
||||
doc = Doc(en_vocab, words=["This", "is", "a", "test", "."])
|
||||
matches = matcher(doc)
|
||||
mock.assert_called_once_with(matcher, doc, 0, matches)
|
||||
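Note on the matcher tests above: they exercise two call styles for `add`, the old `add(key, on_match, *patterns)` and the new `add(key, patterns, on_match=None)`, and they expect a `ValueError` when a single pattern is passed where a list of patterns belongs or when `on_match` is not callable. The sketch below shows one way such arguments could be normalized. `normalize_add_args` is a hypothetical helper written for illustration, not the code in `matcher.pyx`, and rejecting leftover positional arguments is a choice made here rather than something the diff shows.

```python
def normalize_add_args(key, patterns, *old_patterns, on_match=None):
    """Hypothetical helper: return (patterns, on_match) for either call style."""
    if on_match is not None and not callable(on_match):
        raise ValueError("on_match for %r must be None or callable" % (key,))
    if patterns is None or callable(patterns):
        # Old style: the second positional argument is the callback (or None)
        # and the patterns follow as separate positional arguments.
        on_match = patterns
        patterns = list(old_patterns)
    elif old_patterns:
        # Assumption of this sketch: extra positionals are an error in new style.
        raise ValueError("unexpected extra positional arguments for %r" % (key,))
    for pattern in patterns:
        # Each pattern is a list of token dicts; a bare dict here usually means
        # a single pattern was passed instead of a list of patterns.
        if not isinstance(pattern, list):
            raise ValueError("each pattern for %r must be a list" % (key,))
        if len(pattern) == 0:
            raise ValueError("empty pattern for %r" % (key,))
    return list(patterns), on_match


def raises_value_error(*args, **kwargs):
    """Small test helper: True if the call raises ValueError."""
    try:
        normalize_add_args(*args, **kwargs)
    except ValueError:
        return True
    return False
```

With this shape, `normalize_add_args("TEST", [{"TEXT": "hello"}, {"TEXT": "world"}])` fails because the elements are dicts rather than lists, which mirrors the mistake that `test_matcher_basic_check` guards against.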
|
|
|
@ -55,7 +55,7 @@ def test_greedy_matching(doc, text, pattern, re_pattern):
|
|||
"""Test that the greedy matching behavior of the * op is consistent with
|
||||
other re implementations."""
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add(re_pattern, None, pattern)
|
||||
matcher.add(re_pattern, [pattern])
|
||||
matches = matcher(doc)
|
||||
re_matches = [m.span() for m in re.finditer(re_pattern, text)]
|
||||
for match, re_match in zip(matches, re_matches):
|
||||
|
@ -77,7 +77,7 @@ def test_match_consuming(doc, text, pattern, re_pattern):
|
|||
"""Test that matcher.__call__ consumes tokens on a match similar to
|
||||
re.findall."""
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add(re_pattern, None, pattern)
|
||||
matcher.add(re_pattern, [pattern])
|
||||
matches = matcher(doc)
|
||||
re_matches = [m.span() for m in re.finditer(re_pattern, text)]
|
||||
assert len(matches) == len(re_matches)
|
||||
|
@ -111,7 +111,7 @@ def test_operator_combos(en_vocab):
|
|||
pattern.append({"ORTH": part[0], "OP": "+"})
|
||||
else:
|
||||
pattern.append({"ORTH": part})
|
||||
matcher.add("PATTERN", None, pattern)
|
||||
matcher.add("PATTERN", [pattern])
|
||||
matches = matcher(doc)
|
||||
if result:
|
||||
assert matches, (string, pattern_str)
|
||||
|
@ -123,7 +123,7 @@ def test_matcher_end_zero_plus(en_vocab):
|
|||
"""Test matcher works when patterns end with * operator. (issue 1450)"""
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}]
|
||||
matcher.add("TSTEND", None, pattern)
|
||||
matcher.add("TSTEND", [pattern])
|
||||
nlp = lambda string: Doc(matcher.vocab, words=string.split())
|
||||
assert len(matcher(nlp("a"))) == 1
|
||||
assert len(matcher(nlp("a b"))) == 2
|
||||
|
@ -140,7 +140,7 @@ def test_matcher_sets_return_correct_tokens(en_vocab):
|
|||
[{"LOWER": {"IN": ["one"]}}],
|
||||
[{"LOWER": {"IN": ["two"]}}],
|
||||
]
|
||||
matcher.add("TEST", None, *patterns)
|
||||
matcher.add("TEST", patterns)
|
||||
doc = Doc(en_vocab, words="zero one two three".split())
|
||||
matches = matcher(doc)
|
||||
texts = [Span(doc, s, e, label=L).text for L, s, e in matches]
|
||||
|
@ -154,7 +154,7 @@ def test_matcher_remove():
|
|||
|
||||
pattern = [{"ORTH": "test"}, {"OP": "?"}]
|
||||
assert len(matcher) == 0
|
||||
matcher.add("Rule", None, pattern)
|
||||
matcher.add("Rule", [pattern])
|
||||
assert "Rule" in matcher
|
||||
|
||||
# should give two matches
|
||||
|
|
|
@ -50,7 +50,7 @@ def validator():
|
|||
def test_matcher_pattern_validation(en_vocab, pattern):
|
||||
matcher = Matcher(en_vocab, validate=True)
|
||||
with pytest.raises(MatchPatternError):
|
||||
matcher.add("TEST", None, pattern)
|
||||
matcher.add("TEST", [pattern])
|
||||
|
||||
|
||||
@pytest.mark.parametrize("pattern,n_errors,_", TEST_PATTERNS)
|
||||
|
@ -71,6 +71,6 @@ def test_minimal_pattern_validation(en_vocab, pattern, n_errors, n_min_errors):
|
|||
matcher = Matcher(en_vocab)
|
||||
if n_min_errors > 0:
|
||||
with pytest.raises(ValueError):
|
||||
matcher.add("TEST", None, pattern)
|
||||
matcher.add("TEST", [pattern])
|
||||
elif n_errors == 0:
|
||||
matcher.add("TEST", None, pattern)
|
||||
matcher.add("TEST", [pattern])
|
||||
|
|
|
@ -13,53 +13,75 @@ def test_matcher_phrase_matcher(en_vocab):
|
|||
# intermediate phrase
|
||||
pattern = Doc(en_vocab, words=["Google", "Now"])
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("COMPANY", None, pattern)
|
||||
matcher.add("COMPANY", [pattern])
|
||||
assert len(matcher(doc)) == 1
|
||||
# initial token
|
||||
pattern = Doc(en_vocab, words=["I"])
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("I", None, pattern)
|
||||
matcher.add("I", [pattern])
|
||||
assert len(matcher(doc)) == 1
|
||||
# initial phrase
|
||||
pattern = Doc(en_vocab, words=["I", "like"])
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("ILIKE", None, pattern)
|
||||
matcher.add("ILIKE", [pattern])
|
||||
assert len(matcher(doc)) == 1
|
||||
# final token
|
||||
pattern = Doc(en_vocab, words=["best"])
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("BEST", None, pattern)
|
||||
matcher.add("BEST", [pattern])
|
||||
assert len(matcher(doc)) == 1
|
||||
# final phrase
|
||||
pattern = Doc(en_vocab, words=["Now", "best"])
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("NOWBEST", None, pattern)
|
||||
matcher.add("NOWBEST", [pattern])
|
||||
assert len(matcher(doc)) == 1
|
||||
|
||||
|
||||
def test_phrase_matcher_length(en_vocab):
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
assert len(matcher) == 0
|
||||
matcher.add("TEST", None, Doc(en_vocab, words=["test"]))
|
||||
matcher.add("TEST", [Doc(en_vocab, words=["test"])])
|
||||
assert len(matcher) == 1
|
||||
matcher.add("TEST2", None, Doc(en_vocab, words=["test2"]))
|
||||
matcher.add("TEST2", [Doc(en_vocab, words=["test2"])])
|
||||
assert len(matcher) == 2
|
||||
|
||||
|
||||
def test_phrase_matcher_contains(en_vocab):
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("TEST", None, Doc(en_vocab, words=["test"]))
|
||||
matcher.add("TEST", [Doc(en_vocab, words=["test"])])
|
||||
assert "TEST" in matcher
|
||||
assert "TEST2" not in matcher
|
||||
|
||||
|
||||
def test_phrase_matcher_add_new_api(en_vocab):
|
||||
doc = Doc(en_vocab, words=["a", "b"])
|
||||
patterns = [Doc(en_vocab, words=["a"]), Doc(en_vocab, words=["a", "b"])]
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("OLD_API", None, *patterns)
|
||||
assert len(matcher(doc)) == 2
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
on_match = Mock()
|
||||
matcher.add("OLD_API_CALLBACK", on_match, *patterns)
|
||||
assert len(matcher(doc)) == 2
|
||||
assert on_match.call_count == 2
|
||||
# New API: add(key: str, patterns: List[Doc], on_match: Callable)
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("NEW_API", patterns)
|
||||
assert len(matcher(doc)) == 2
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
on_match = Mock()
|
||||
matcher.add("NEW_API_CALLBACK", patterns, on_match=on_match)
|
||||
assert len(matcher(doc)) == 2
|
||||
assert on_match.call_count == 2
|
||||
|
||||
|
||||
def test_phrase_matcher_repeated_add(en_vocab):
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
# match ID only gets added once
|
||||
matcher.add("TEST", None, Doc(en_vocab, words=["like"]))
|
||||
matcher.add("TEST", None, Doc(en_vocab, words=["like"]))
|
||||
matcher.add("TEST", None, Doc(en_vocab, words=["like"]))
|
||||
matcher.add("TEST", None, Doc(en_vocab, words=["like"]))
|
||||
matcher.add("TEST", [Doc(en_vocab, words=["like"])])
|
||||
matcher.add("TEST", [Doc(en_vocab, words=["like"])])
|
||||
matcher.add("TEST", [Doc(en_vocab, words=["like"])])
|
||||
matcher.add("TEST", [Doc(en_vocab, words=["like"])])
|
||||
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
|
||||
assert "TEST" in matcher
|
||||
assert "TEST2" not in matcher
|
||||
|
@ -68,8 +90,8 @@ def test_phrase_matcher_repeated_add(en_vocab):
|
|||
|
||||
def test_phrase_matcher_remove(en_vocab):
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("TEST1", None, Doc(en_vocab, words=["like"]))
|
||||
matcher.add("TEST2", None, Doc(en_vocab, words=["best"]))
|
||||
matcher.add("TEST1", [Doc(en_vocab, words=["like"])])
|
||||
matcher.add("TEST2", [Doc(en_vocab, words=["best"])])
|
||||
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
|
||||
assert "TEST1" in matcher
|
||||
assert "TEST2" in matcher
|
||||
|
@ -95,9 +117,9 @@ def test_phrase_matcher_remove(en_vocab):
|
|||
|
||||
def test_phrase_matcher_overlapping_with_remove(en_vocab):
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("TEST", None, Doc(en_vocab, words=["like"]))
|
||||
matcher.add("TEST", [Doc(en_vocab, words=["like"])])
|
||||
# TEST2 is added alongside TEST
|
||||
matcher.add("TEST2", None, Doc(en_vocab, words=["like"]))
|
||||
matcher.add("TEST2", [Doc(en_vocab, words=["like"])])
|
||||
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
|
||||
assert "TEST" in matcher
|
||||
assert len(matcher) == 2
|
||||
|
@ -122,7 +144,7 @@ def test_phrase_matcher_string_attrs(en_vocab):
|
|||
pos2 = ["INTJ", "PUNCT", "PRON", "VERB", "NOUN", "ADV", "ADV"]
|
||||
pattern = get_doc(en_vocab, words=words1, pos=pos1)
|
||||
matcher = PhraseMatcher(en_vocab, attr="POS")
|
||||
matcher.add("TEST", None, pattern)
|
||||
matcher.add("TEST", [pattern])
|
||||
doc = get_doc(en_vocab, words=words2, pos=pos2)
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 1
|
||||
|
@ -140,7 +162,7 @@ def test_phrase_matcher_string_attrs_negative(en_vocab):
|
|||
pos2 = ["X", "X", "X"]
|
||||
pattern = get_doc(en_vocab, words=words1, pos=pos1)
|
||||
matcher = PhraseMatcher(en_vocab, attr="POS")
|
||||
matcher.add("TEST", None, pattern)
|
||||
matcher.add("TEST", [pattern])
|
||||
doc = get_doc(en_vocab, words=words2, pos=pos2)
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 0
|
||||
|
@ -151,7 +173,7 @@ def test_phrase_matcher_bool_attrs(en_vocab):
|
|||
words2 = ["No", "problem", ",", "he", "said", "."]
|
||||
pattern = Doc(en_vocab, words=words1)
|
||||
matcher = PhraseMatcher(en_vocab, attr="IS_PUNCT")
|
||||
matcher.add("TEST", None, pattern)
|
||||
matcher.add("TEST", [pattern])
|
||||
doc = Doc(en_vocab, words=words2)
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
@ -173,15 +195,15 @@ def test_phrase_matcher_validation(en_vocab):
|
|||
doc3 = Doc(en_vocab, words=["Test"])
|
||||
matcher = PhraseMatcher(en_vocab, validate=True)
|
||||
with pytest.warns(UserWarning):
|
||||
matcher.add("TEST1", None, doc1)
|
||||
matcher.add("TEST1", [doc1])
|
||||
with pytest.warns(UserWarning):
|
||||
matcher.add("TEST2", None, doc2)
|
||||
matcher.add("TEST2", [doc2])
|
||||
with pytest.warns(None) as record:
|
||||
matcher.add("TEST3", None, doc3)
|
||||
matcher.add("TEST3", [doc3])
|
||||
assert not record.list
|
||||
matcher = PhraseMatcher(en_vocab, attr="POS", validate=True)
|
||||
with pytest.warns(None) as record:
|
||||
matcher.add("TEST4", None, doc2)
|
||||
matcher.add("TEST4", [doc2])
|
||||
assert not record.list
|
||||
|
||||
|
||||
|
@ -198,24 +220,24 @@ def test_attr_pipeline_checks(en_vocab):
|
|||
doc3 = Doc(en_vocab, words=["Test"])
|
||||
# DEP requires is_parsed
|
||||
matcher = PhraseMatcher(en_vocab, attr="DEP")
|
||||
matcher.add("TEST1", None, doc1)
|
||||
matcher.add("TEST1", [doc1])
|
||||
with pytest.raises(ValueError):
|
||||
matcher.add("TEST2", None, doc2)
|
||||
matcher.add("TEST2", [doc2])
|
||||
with pytest.raises(ValueError):
|
||||
matcher.add("TEST3", None, doc3)
|
||||
matcher.add("TEST3", [doc3])
|
||||
# TAG, POS, LEMMA require is_tagged
|
||||
for attr in ("TAG", "POS", "LEMMA"):
|
||||
matcher = PhraseMatcher(en_vocab, attr=attr)
|
||||
matcher.add("TEST2", None, doc2)
|
||||
matcher.add("TEST2", [doc2])
|
||||
with pytest.raises(ValueError):
|
||||
matcher.add("TEST1", None, doc1)
|
||||
matcher.add("TEST1", [doc1])
|
||||
with pytest.raises(ValueError):
|
||||
matcher.add("TEST3", None, doc3)
|
||||
matcher.add("TEST3", [doc3])
|
||||
# TEXT/ORTH only require tokens
|
||||
matcher = PhraseMatcher(en_vocab, attr="ORTH")
|
||||
matcher.add("TEST3", None, doc3)
|
||||
matcher.add("TEST3", [doc3])
|
||||
matcher = PhraseMatcher(en_vocab, attr="TEXT")
|
||||
matcher.add("TEST3", None, doc3)
|
||||
matcher.add("TEST3", [doc3])
|
||||
|
||||
|
||||
def test_phrase_matcher_callback(en_vocab):
|
||||
|
@ -223,7 +245,7 @@ def test_phrase_matcher_callback(en_vocab):
|
|||
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
|
||||
pattern = Doc(en_vocab, words=["Google", "Now"])
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("COMPANY", mock, pattern)
|
||||
matcher.add("COMPANY", [pattern], on_match=mock)
|
||||
matches = matcher(doc)
|
||||
mock.assert_called_once_with(matcher, doc, 0, matches)
|
||||
|
||||
|
@ -234,5 +256,13 @@ def test_phrase_matcher_remove_overlapping_patterns(en_vocab):
|
|||
pattern2 = Doc(en_vocab, words=["this", "is"])
|
||||
pattern3 = Doc(en_vocab, words=["this", "is", "a"])
|
||||
pattern4 = Doc(en_vocab, words=["this", "is", "a", "word"])
|
||||
matcher.add("THIS", None, pattern1, pattern2, pattern3, pattern4)
|
||||
matcher.add("THIS", [pattern1, pattern2, pattern3, pattern4])
|
||||
matcher.remove("THIS")
|
||||
|
||||
|
||||
def test_phrase_matcher_basic_check(en_vocab):
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
# Potential mistake: pass in pattern instead of list of patterns
|
||||
pattern = Doc(en_vocab, words=["hello", "world"])
|
||||
with pytest.raises(ValueError):
|
||||
matcher.add("TEST", pattern)
|
||||
|
|
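Reviewer note: the hunks above move every call from `add(key, on_match, *patterns)` to `add(key, patterns, on_match=...)`, while `test_phrase_matcher_add_new_api` keeps the old call style working and `test_phrase_matcher_basic_check` expects a `ValueError` when a single pattern is passed instead of a list. Below is a minimal sketch of how one method could accept both styles. `TinyPhraseMatcher` and its tuple patterns are hypothetical stand-ins for illustration, not spaCy's implementation, and the keyword-only argument needs Python 3.

```python
class TinyPhraseMatcher:
    """Hypothetical stand-in for illustration; not spaCy's PhraseMatcher."""

    def __init__(self):
        self._patterns = {}
        self._callbacks = {}

    def add(self, key, patterns, *extra_patterns, on_match=None):
        # Old style: add(key, on_match_or_None, *patterns).
        # The second positional argument is then None or a callable.
        if patterns is None or callable(patterns):
            on_match = patterns
            patterns = list(extra_patterns)
        elif extra_patterns:
            raise ValueError("new-style calls take all patterns in one list")
        # New style: patterns must be a list, not a single pattern.
        if not isinstance(patterns, list):
            raise ValueError("patterns must be a list of patterns")
        self._patterns.setdefault(key, []).extend(patterns)
        self._callbacks[key] = on_match

    def __len__(self):
        return len(self._patterns)

    def __contains__(self, key):
        return key in self._patterns


# In this sketch a pattern is a tuple of words (assumption).
matcher = TinyPhraseMatcher()
matcher.add("OLD_API", None, ("a",), ("a", "b"))
matcher.add("NEW_API", [("a",), ("a", "b")])
```

Both calls store the same two patterns. Passing a bare tuple as the second argument is rejected, which corresponds to the mistake the basic-check test guards against.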
168
spacy/tests/pipeline/test_analysis.py
Normal file
|
@ -0,0 +1,168 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import spacy.language
|
||||
from spacy.language import Language, component
|
||||
from spacy.analysis import print_summary, validate_attrs
|
||||
from spacy.analysis import get_assigns_for_attr, get_requires_for_attr
|
||||
from spacy.compat import is_python2
|
||||
from mock import Mock, ANY
|
||||
import pytest
|
||||
|
||||
|
||||
def test_component_decorator_function():
|
||||
@component(name="test")
|
||||
def test_component(doc):
|
||||
"""docstring"""
|
||||
return doc
|
||||
|
||||
assert test_component.name == "test"
|
||||
if not is_python2:
|
||||
assert test_component.__doc__ == "docstring"
|
||||
assert test_component("foo") == "foo"
|
||||
|
||||
|
||||
def test_component_decorator_class():
|
||||
@component(name="test")
|
||||
class TestComponent(object):
|
||||
"""docstring1"""
|
||||
|
||||
foo = "bar"
|
||||
|
||||
def __call__(self, doc):
|
||||
"""docstring2"""
|
||||
return doc
|
||||
|
||||
def custom(self, x):
|
||||
"""docstring3"""
|
||||
return x
|
||||
|
||||
assert TestComponent.name == "test"
|
||||
assert TestComponent.foo == "bar"
|
||||
assert hasattr(TestComponent, "custom")
|
||||
test_component = TestComponent()
|
||||
assert test_component.foo == "bar"
|
||||
assert test_component("foo") == "foo"
|
||||
assert hasattr(test_component, "custom")
|
||||
assert test_component.custom("bar") == "bar"
|
||||
if not is_python2:
|
||||
assert TestComponent.__doc__ == "docstring1"
|
||||
assert TestComponent.__call__.__doc__ == "docstring2"
|
||||
assert TestComponent.custom.__doc__ == "docstring3"
|
||||
assert test_component.__doc__ == "docstring1"
|
||||
assert test_component.__call__.__doc__ == "docstring2"
|
||||
assert test_component.custom.__doc__ == "docstring3"
|
||||
|
||||
|
||||
def test_component_decorator_assigns():
|
||||
spacy.language.ENABLE_PIPELINE_ANALYSIS = True
|
||||
|
||||
@component("c1", assigns=["token.tag", "doc.tensor"])
|
||||
def test_component1(doc):
|
||||
return doc
|
||||
|
||||
@component(
|
||||
"c2", requires=["token.tag", "token.pos"], assigns=["token.lemma", "doc.tensor"]
|
||||
)
|
||||
def test_component2(doc):
|
||||
return doc
|
||||
|
||||
@component("c3", requires=["token.lemma"], assigns=["token._.custom_lemma"])
|
||||
def test_component3(doc):
|
||||
return doc
|
||||
|
||||
assert "c1" in Language.factories
|
||||
assert "c2" in Language.factories
|
||||
assert "c3" in Language.factories
|
||||
|
||||
nlp = Language()
|
||||
nlp.add_pipe(test_component1)
|
||||
with pytest.warns(UserWarning):
|
||||
nlp.add_pipe(test_component2)
|
||||
nlp.add_pipe(test_component3)
|
||||
assigns_tensor = get_assigns_for_attr(nlp.pipeline, "doc.tensor")
|
||||
assert [name for name, _ in assigns_tensor] == ["c1", "c2"]
|
||||
test_component4 = nlp.create_pipe("c1")
|
||||
assert test_component4.name == "c1"
|
||||
assert test_component4.factory == "c1"
|
||||
nlp.add_pipe(test_component4, name="c4")
|
||||
assert nlp.pipe_names == ["c1", "c2", "c3", "c4"]
|
||||
assert "c4" not in Language.factories
|
||||
assert nlp.pipe_factories["c1"] == "c1"
|
||||
assert nlp.pipe_factories["c4"] == "c1"
|
||||
assigns_tensor = get_assigns_for_attr(nlp.pipeline, "doc.tensor")
|
||||
assert [name for name, _ in assigns_tensor] == ["c1", "c2", "c4"]
|
||||
requires_pos = get_requires_for_attr(nlp.pipeline, "token.pos")
|
||||
assert [name for name, _ in requires_pos] == ["c2"]
|
||||
assert print_summary(nlp, no_print=True)
|
||||
assert nlp("hello world")
|
||||
|
||||
|
||||
def test_component_factories_from_nlp():
|
||||
"""Test that class components can implement a from_nlp classmethod that
|
||||
gives them access to the nlp object and config via the factory."""
|
||||
|
||||
class TestComponent5(object):
|
||||
def __call__(self, doc):
|
||||
return doc
|
||||
|
||||
mock = Mock()
|
||||
mock.return_value = TestComponent5()
|
||||
TestComponent5.from_nlp = classmethod(mock)
|
||||
TestComponent5 = component("c5")(TestComponent5)
|
||||
|
||||
assert "c5" in Language.factories
|
||||
nlp = Language()
|
||||
pipe = nlp.create_pipe("c5", config={"foo": "bar"})
|
||||
nlp.add_pipe(pipe)
|
||||
assert nlp("hello world")
|
||||
# The first argument here is the class itself, so we're accepting any here
|
||||
mock.assert_called_once_with(ANY, nlp, foo="bar")
|
||||
|
||||
|
||||
def test_analysis_validate_attrs_valid():
|
||||
attrs = ["doc.sents", "doc.ents", "token.tag", "token._.xyz", "span._.xyz"]
|
||||
assert validate_attrs(attrs)
|
||||
for attr in attrs:
|
||||
assert validate_attrs([attr])
|
||||
with pytest.raises(ValueError):
|
||||
validate_attrs(["doc.sents", "doc.xyz"])
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"attr",
|
||||
[
|
||||
"doc",
|
||||
"doc_ents",
|
||||
"doc.xyz",
|
||||
"token.xyz",
|
||||
"token.tag_",
|
||||
"token.tag.xyz",
|
||||
"token._.xyz.abc",
|
||||
"span.label",
|
||||
],
|
||||
)
|
||||
def test_analysis_validate_attrs_invalid(attr):
|
||||
with pytest.raises(ValueError):
|
||||
validate_attrs([attr])
|
||||
|
||||
|
||||
def test_analysis_validate_attrs_remove_pipe():
|
||||
"""Test that attributes are validated correctly on remove."""
|
||||
spacy.language.ENABLE_PIPELINE_ANALYSIS = True
|
||||
|
||||
@component("c1", assigns=["token.tag"])
|
||||
def c1(doc):
|
||||
return doc
|
||||
|
||||
@component("c2", requires=["token.pos"])
|
||||
def c2(doc):
|
||||
return doc
|
||||
|
||||
nlp = Language()
|
||||
nlp.add_pipe(c1)
|
||||
with pytest.warns(UserWarning):
|
||||
nlp.add_pipe(c2)
|
||||
with pytest.warns(None) as record:
|
||||
nlp.remove_pipe("c2")
|
||||
assert not record.list
|
|
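Reviewer note: the new `test_analysis.py` expects a `UserWarning` when a component is added whose `requires` entry (here `token.pos`) is not assigned by any earlier component. The check can be sketched with plain tuples as below. `check_requirements`, `assigns_for_attr` and the tuple layout are hypothetical names chosen for this sketch; the real logic lives in `spacy.analysis` and may differ.

```python
import warnings


def check_requirements(pipeline):
    """Warn for each required attribute that no earlier component assigns.

    pipeline: list of (name, requires, assigns) tuples in pipeline order.
    Returns the list of (component_name, missing_attr) pairs.
    """
    available = set()
    problems = []
    for name, requires, assigns in pipeline:
        for attr in requires:
            if attr not in available:
                problems.append((name, attr))
                warnings.warn(
                    "{} requires {}, which no earlier component assigns".format(name, attr),
                    UserWarning,
                )
        available.update(assigns)
    return problems


def assigns_for_attr(pipeline, attr):
    """Names of all components that assign the given attribute."""
    return [name for name, _, assigns in pipeline if attr in assigns]


# Same requires/assigns declarations as c1, c2, c3 in the test above.
pipeline = [
    ("c1", [], ["token.tag", "doc.tensor"]),
    ("c2", ["token.tag", "token.pos"], ["token.lemma", "doc.tensor"]),
    ("c3", ["token.lemma"], ["token._.custom_lemma"]),
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    problems = check_requirements(pipeline)
tensor_assigners = assigns_for_attr(pipeline, "doc.tensor")
```

Only `c2` triggers a warning in this sketch, because `token.tag` is provided by `c1` and `token.lemma` by `c2`, while nothing provides `token.pos`.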
@ -154,7 +154,8 @@ def test_append_alias(nlp):
|
|||
assert len(mykb.get_candidates("douglas")) == 3
|
||||
|
||||
# appending the same alias-entity pair again should not work (will throw a warning)
|
||||
mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.3)
|
||||
with pytest.warns(UserWarning):
|
||||
mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.3)
|
||||
|
||||
# test the size of the relevant candidates remained unchanged
|
||||
assert len(mykb.get_candidates("douglas")) == 3
|
||||
|
|
34
spacy/tests/pipeline/test_functions.py
Normal file
|
@ -0,0 +1,34 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.pipeline.functions import merge_subtokens
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def doc(en_tokenizer):
|
||||
# fmt: off
|
||||
text = "This is a sentence. This is another sentence. And a third."
|
||||
heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 1, 1, 1, 0]
|
||||
deps = ["nsubj", "ROOT", "subtok", "attr", "punct", "nsubj", "ROOT",
|
||||
"subtok", "attr", "punct", "subtok", "subtok", "subtok", "ROOT"]
|
||||
# fmt: on
|
||||
tokens = en_tokenizer(text)
|
||||
return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||
|
||||
|
||||
def test_merge_subtokens(doc):
|
||||
doc = merge_subtokens(doc)
|
||||
# get_doc() doesn't set spaces, so the result is "And a third ."
|
||||
assert [t.text for t in doc] == [
|
||||
"This",
|
||||
"is",
|
||||
"a sentence",
|
||||
".",
|
||||
"This",
|
||||
"is",
|
||||
"another sentence",
|
||||
".",
|
||||
"And a third .",
|
||||
]
|
|
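Reviewer note: the expected token list in `test_merge_subtokens` can be reproduced on plain lists, assuming that every run of tokens labelled `subtok` is joined with the token that immediately follows it. `merge_subtok_runs` is a hypothetical helper for this sketch. It ignores `heads` and works on strings, so it illustrates the expected output rather than how `merge_subtokens` is implemented.

```python
def merge_subtok_runs(words, deps, label="subtok"):
    """Join each run of `label` tokens with the token that follows it."""
    merged, buffer = [], []
    for word, dep in zip(words, deps):
        buffer.append(word)
        if dep != label:
            # A non-subtok token closes the current run.
            merged.append(" ".join(buffer))
            buffer = []
    if buffer:
        # Trailing subtok run with no following token to attach to.
        merged.append(" ".join(buffer))
    return merged


words = ["This", "is", "a", "sentence", ".", "This", "is", "another",
         "sentence", ".", "And", "a", "third", "."]
deps = ["nsubj", "ROOT", "subtok", "attr", "punct", "nsubj", "ROOT",
        "subtok", "attr", "punct", "subtok", "subtok", "subtok", "ROOT"]
result = merge_subtok_runs(words, deps)
```

Joining with single spaces matches the comment in the test that `get_doc()` does not set spaces, which is why the last token reads "And a third ." with a space before the period.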
@ -105,6 +105,16 @@ def test_disable_pipes_context(nlp, name):
|
|||
assert nlp.has_pipe(name)
|
||||
|
||||
|
||||
def test_disable_pipes_list_arg(nlp):
|
||||
for name in ["c1", "c2", "c3"]:
|
||||
nlp.add_pipe(new_pipe, name=name)
|
||||
assert nlp.has_pipe(name)
|
||||
with nlp.disable_pipes(["c1", "c2"]):
|
||||
assert not nlp.has_pipe("c1")
|
||||
assert not nlp.has_pipe("c2")
|
||||
assert nlp.has_pipe("c3")
|
||||
|
||||
|
||||
@pytest.mark.parametrize("n_pipes", [100])
|
||||
def test_add_lots_of_pipes(nlp, n_pipes):
|
||||
for i in range(n_pipes):
|
||||
|
|
|
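Reviewer note: `test_disable_pipes_list_arg` exercises `nlp.disable_pipes(["c1", "c2"])` alongside the existing varargs form. One way to accept both is to unwrap a single list or tuple argument, as sketched below. `normalize_names` is a hypothetical helper name, and this is an assumption about the approach, not a copy of the library code.

```python
def normalize_names(*names):
    """Accept f("a", "b") or f(["a", "b"]) and return a flat list of names."""
    if len(names) == 1 and isinstance(names[0], (list, tuple)):
        return list(names[0])
    return list(names)


from_varargs = normalize_names("c1", "c2")
from_list = normalize_names(["c1", "c2"])
```

With this shape, call sites such as the `test_issue3611` and `test_issue4030` hunks can pass a list comprehension directly instead of unpacking it with `*`.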
@ -30,7 +30,7 @@ def test_issue118(en_tokenizer, patterns):
|
|||
doc = en_tokenizer(text)
|
||||
ORG = doc.vocab.strings["ORG"]
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add("BostonCeltics", None, *patterns)
|
||||
matcher.add("BostonCeltics", patterns)
|
||||
assert len(list(doc.ents)) == 0
|
||||
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
|
||||
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
|
||||
|
@ -57,7 +57,7 @@ def test_issue118_prefix_reorder(en_tokenizer, patterns):
|
|||
doc = en_tokenizer(text)
|
||||
ORG = doc.vocab.strings["ORG"]
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add("BostonCeltics", None, *patterns)
|
||||
matcher.add("BostonCeltics", patterns)
|
||||
assert len(list(doc.ents)) == 0
|
||||
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
|
||||
doc.ents += tuple(matches)[1:]
|
||||
|
@ -78,7 +78,7 @@ def test_issue242(en_tokenizer):
|
|||
]
|
||||
doc = en_tokenizer(text)
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add("FOOD", None, *patterns)
|
||||
matcher.add("FOOD", patterns)
|
||||
matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)]
|
||||
match1, match2 = matches
|
||||
assert match1[1] == 3
|
||||
|
@ -127,17 +127,13 @@ def test_issue587(en_tokenizer):
|
|||
"""Test that Matcher doesn't segfault on particular input"""
|
||||
doc = en_tokenizer("a b; c")
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add("TEST1", None, [{ORTH: "a"}, {ORTH: "b"}])
|
||||
matcher.add("TEST1", [[{ORTH: "a"}, {ORTH: "b"}]])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 1
|
||||
matcher.add(
|
||||
"TEST2", None, [{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "c"}]
|
||||
)
|
||||
matcher.add("TEST2", [[{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "c"}]])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
matcher.add(
|
||||
"TEST3", None, [{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "d"}]
|
||||
)
|
||||
matcher.add("TEST3", [[{ORTH: "a"}, {ORTH: "b"}, {IS_PUNCT: True}, {ORTH: "d"}]])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
||||
|
@ -145,7 +141,7 @@ def test_issue587(en_tokenizer):
|
|||
def test_issue588(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
with pytest.raises(ValueError):
|
||||
matcher.add("TEST", None, [])
|
||||
matcher.add("TEST", [[]])
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
|
@ -161,11 +157,9 @@ def test_issue590(en_vocab):
|
|||
doc = Doc(en_vocab, words=["n", "=", "1", ";", "a", ":", "5", "%"])
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add(
|
||||
"ab",
|
||||
None,
|
||||
[{"IS_ALPHA": True}, {"ORTH": ":"}, {"LIKE_NUM": True}, {"ORTH": "%"}],
|
||||
"ab", [[{"IS_ALPHA": True}, {"ORTH": ":"}, {"LIKE_NUM": True}, {"ORTH": "%"}]]
|
||||
)
|
||||
matcher.add("ab", None, [{"IS_ALPHA": True}, {"ORTH": "="}, {"LIKE_NUM": True}])
|
||||
matcher.add("ab", [[{"IS_ALPHA": True}, {"ORTH": "="}, {"LIKE_NUM": True}]])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
||||
|
@ -221,7 +215,7 @@ def test_issue615(en_tokenizer):
|
|||
label = "Sport_Equipment"
|
||||
doc = en_tokenizer(text)
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add(label, merge_phrases, pattern)
|
||||
matcher.add(label, [pattern], on_match=merge_phrases)
|
||||
matcher(doc)
|
||||
entities = list(doc.ents)
|
||||
assert entities != []
|
||||
|
@ -339,7 +333,7 @@ def test_issue850():
|
|||
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
|
||||
matcher = Matcher(vocab)
|
||||
pattern = [{"LOWER": "bob"}, {"OP": "*"}, {"LOWER": "frank"}]
|
||||
matcher.add("FarAway", None, pattern)
|
||||
matcher.add("FarAway", [pattern])
|
||||
doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"])
|
||||
match = matcher(doc)
|
||||
assert len(match) == 1
|
||||
|
@ -353,7 +347,7 @@ def test_issue850_basic():
|
|||
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
|
||||
matcher = Matcher(vocab)
|
||||
pattern = [{"LOWER": "bob"}, {"OP": "*", "LOWER": "and"}, {"LOWER": "frank"}]
|
||||
matcher.add("FarAway", None, pattern)
|
||||
matcher.add("FarAway", [pattern])
|
||||
doc = Doc(matcher.vocab, words=["bob", "and", "and", "frank"])
|
||||
match = matcher(doc)
|
||||
assert len(match) == 1
|
||||
|
|
|
@ -111,7 +111,7 @@ def test_issue1434():
|
|||
hello_world = Doc(vocab, words=["Hello", "World"])
|
||||
hello = Doc(vocab, words=["Hello"])
|
||||
matcher = Matcher(vocab)
|
||||
matcher.add("MyMatcher", None, pattern)
|
||||
matcher.add("MyMatcher", [pattern])
|
||||
matches = matcher(hello_world)
|
||||
assert matches
|
||||
matches = matcher(hello)
|
||||
|
@ -133,7 +133,7 @@ def test_issue1450(string, start, end):
|
|||
"""Test matcher works when patterns end with * operator."""
|
||||
pattern = [{"ORTH": "a"}, {"ORTH": "b", "OP": "*"}]
|
||||
matcher = Matcher(Vocab())
|
||||
matcher.add("TSTEND", None, pattern)
|
||||
matcher.add("TSTEND", [pattern])
|
||||
doc = Doc(Vocab(), words=string.split())
|
||||
matches = matcher(doc)
|
||||
if start is None or end is None:
|
||||
|
|
|
@ -224,7 +224,7 @@ def test_issue1868():
|
|||
|
||||
def test_issue1883():
|
||||
matcher = Matcher(Vocab())
|
||||
matcher.add("pat1", None, [{"orth": "hello"}])
|
||||
matcher.add("pat1", [[{"orth": "hello"}]])
|
||||
doc = Doc(matcher.vocab, words=["hello"])
|
||||
assert len(matcher(doc)) == 1
|
||||
new_matcher = copy.deepcopy(matcher)
|
||||
|
@ -249,7 +249,7 @@ def test_issue1915():
|
|||
def test_issue1945():
|
||||
"""Test regression in Matcher introduced in v2.0.6."""
|
||||
matcher = Matcher(Vocab())
|
||||
matcher.add("MWE", None, [{"orth": "a"}, {"orth": "a"}])
|
||||
matcher.add("MWE", [[{"orth": "a"}, {"orth": "a"}]])
|
||||
doc = Doc(matcher.vocab, words=["a", "a", "a"])
|
||||
matches = matcher(doc) # we should see two overlapping matches here
|
||||
assert len(matches) == 2
|
||||
|
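Reviewer note: `test_issue1945` expects two overlapping matches for the pattern `a a` on the words `a a a`. For exact token sequences this is what a plain sliding window reports, as the sketch below shows. `find_sequence` is a hypothetical helper that only compares strings. It does not model operators such as `?` or `*`, or any other `Matcher` feature.

```python
def find_sequence(words, pattern):
    """Return (start, end) for every position where pattern occurs in words."""
    n = len(pattern)
    if n == 0:
        raise ValueError("pattern must contain at least one token")
    return [
        (i, i + n)
        for i in range(len(words) - n + 1)
        if words[i:i + n] == pattern
    ]


matches = find_sequence(["a", "a", "a"], ["a", "a"])
```

The two results start at positions 0 and 1 and share the middle token, so a check for overlapping matches has to count both.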
@ -285,7 +285,7 @@ def test_issue1971(en_vocab):
|
|||
{"ORTH": "!", "OP": "?"},
|
||||
]
|
||||
Token.set_extension("optional", default=False)
|
||||
matcher.add("TEST", None, pattern)
|
||||
matcher.add("TEST", [pattern])
|
||||
doc = Doc(en_vocab, words=["Hello", "John", "Doe", "!"])
|
||||
# We could also assert length 1 here, but this is more conclusive, because
|
||||
# the real problem here is that it returns a duplicate match for a match_id
|
||||
|
@ -299,7 +299,7 @@ def test_issue_1971_2(en_vocab):
|
|||
pattern1 = [{"ORTH": "EUR", "LOWER": {"IN": ["eur"]}}, {"LIKE_NUM": True}]
|
||||
pattern2 = [{"LIKE_NUM": True}, {"ORTH": "EUR"}] # {"IN": ["EUR"]}}]
|
||||
doc = Doc(en_vocab, words=["EUR", "10", "is", "10", "EUR"])
|
||||
matcher.add("TEST1", None, pattern1, pattern2)
|
||||
matcher.add("TEST1", [pattern1, pattern2])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
||||
|
@ -310,8 +310,8 @@ def test_issue_1971_3(en_vocab):
|
|||
Token.set_extension("b", default=2, force=True)
|
||||
doc = Doc(en_vocab, words=["hello", "world"])
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add("A", None, [{"_": {"a": 1}}])
|
||||
matcher.add("B", None, [{"_": {"b": 2}}])
|
||||
matcher.add("A", [[{"_": {"a": 1}}]])
|
||||
matcher.add("B", [[{"_": {"b": 2}}]])
|
||||
matches = sorted((en_vocab.strings[m_id], s, e) for m_id, s, e in matcher(doc))
|
||||
assert len(matches) == 4
|
||||
assert matches == sorted([("A", 0, 1), ("A", 1, 2), ("B", 0, 1), ("B", 1, 2)])
|
||||
|
@ -326,7 +326,7 @@ def test_issue_1971_4(en_vocab):
|
|||
matcher = Matcher(en_vocab)
|
||||
doc = Doc(en_vocab, words=["this", "is", "text"])
|
||||
pattern = [{"_": {"ext_a": "str_a", "ext_b": "str_b"}}] * 3
|
||||
matcher.add("TEST", None, pattern)
|
||||
matcher.add("TEST", [pattern])
|
||||
matches = matcher(doc)
|
||||
# Uncommenting this caused a segmentation fault
|
||||
assert len(matches) == 1
|
||||
|
|
|
@ -128,7 +128,7 @@ def test_issue2464(en_vocab):
|
|||
"""Test problem with successive ?. This is the same bug, so putting it here."""
|
||||
matcher = Matcher(en_vocab)
|
||||
doc = Doc(en_vocab, words=["a", "b"])
|
||||
matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}])
|
||||
matcher.add("4", [[{"OP": "?"}, {"OP": "?"}]])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 3
|
||||
|
||||
|
|
|
@ -37,7 +37,7 @@ def test_issue2569(en_tokenizer):
|
|||
doc = en_tokenizer("It is May 15, 1993.")
|
||||
doc.ents = [Span(doc, 2, 6, label=doc.vocab.strings["DATE"])]
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add("RULE", None, [{"ENT_TYPE": "DATE", "OP": "+"}])
|
||||
matcher.add("RULE", [[{"ENT_TYPE": "DATE", "OP": "+"}]])
|
||||
matched = [doc[start:end] for _, start, end in matcher(doc)]
|
||||
matched = sorted(matched, key=len, reverse=True)
|
||||
assert len(matched) == 10
|
||||
|
@ -89,7 +89,7 @@ def test_issue2671():
|
|||
{"IS_PUNCT": True, "OP": "?"},
|
||||
{"LOWER": "adrenaline"},
|
||||
]
|
||||
matcher.add(pattern_id, None, pattern)
|
||||
matcher.add(pattern_id, [pattern])
|
||||
doc1 = nlp("This is a high-adrenaline situation.")
|
||||
doc2 = nlp("This is a high adrenaline situation.")
|
||||
matches1 = matcher(doc1)
|
||||
|
|
|
@ -52,7 +52,7 @@ def test_issue3009(en_vocab):
|
|||
doc = get_doc(en_vocab, words=words, tags=tags)
|
||||
matcher = Matcher(en_vocab)
|
||||
for i, pattern in enumerate(patterns):
|
||||
matcher.add(str(i), None, pattern)
|
||||
matcher.add(str(i), [pattern])
|
||||
matches = matcher(doc)
|
||||
assert matches
|
||||
|
||||
|
@ -116,8 +116,8 @@ def test_issue3248_1():
|
|||
total number of patterns."""
|
||||
nlp = English()
|
||||
matcher = PhraseMatcher(nlp.vocab)
|
||||
matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
|
||||
matcher.add("TEST2", None, nlp("d"))
|
||||
matcher.add("TEST1", [nlp("a"), nlp("b"), nlp("c")])
|
||||
matcher.add("TEST2", [nlp("d")])
|
||||
assert len(matcher) == 2
|
||||
|
||||
|
||||
|
@ -125,8 +125,8 @@ def test_issue3248_2():
|
|||
"""Test that the PhraseMatcher can be pickled correctly."""
|
||||
nlp = English()
|
||||
matcher = PhraseMatcher(nlp.vocab)
|
||||
matcher.add("TEST1", None, nlp("a"), nlp("b"), nlp("c"))
|
||||
matcher.add("TEST2", None, nlp("d"))
|
||||
matcher.add("TEST1", [nlp("a"), nlp("b"), nlp("c")])
|
||||
matcher.add("TEST2", [nlp("d")])
|
||||
data = pickle.dumps(matcher)
|
||||
new_matcher = pickle.loads(data)
|
||||
assert len(new_matcher) == len(matcher)
|
||||
|
@ -170,7 +170,7 @@ def test_issue3328(en_vocab):
|
|||
[{"LOWER": {"IN": ["hello", "how"]}}],
|
||||
[{"LOWER": {"IN": ["you", "doing"]}}],
|
||||
]
|
||||
matcher.add("TEST", None, *patterns)
|
||||
matcher.add("TEST", patterns)
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 4
|
||||
matched_texts = [doc[start:end].text for _, start, end in matches]
|
||||
|
@ -183,8 +183,8 @@ def test_issue3331(en_vocab):
|
|||
matches, one per rule.
|
||||
"""
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("A", None, Doc(en_vocab, words=["Barack", "Obama"]))
|
||||
matcher.add("B", None, Doc(en_vocab, words=["Barack", "Obama"]))
|
||||
matcher.add("A", [Doc(en_vocab, words=["Barack", "Obama"])])
|
||||
matcher.add("B", [Doc(en_vocab, words=["Barack", "Obama"])])
|
||||
doc = Doc(en_vocab, words=["Barack", "Obama", "lifts", "America"])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
@ -297,8 +297,10 @@ def test_issue3410():
|
|||
def test_issue3412():
|
||||
data = numpy.asarray([[0, 0, 0], [1, 2, 3], [9, 8, 7]], dtype="f")
|
||||
vectors = Vectors(data=data)
|
||||
keys, best_rows, scores = vectors.most_similar(numpy.asarray([[9, 8, 7], [0, 0, 0]], dtype="f"))
|
||||
assert(best_rows[0] == 2)
|
||||
keys, best_rows, scores = vectors.most_similar(
|
||||
numpy.asarray([[9, 8, 7], [0, 0, 0]], dtype="f")
|
||||
)
|
||||
assert best_rows[0] == 2
|
||||
|
||||
|
||||
def test_issue3447():
|
||||
|
|
|
@ -10,6 +10,6 @@ def test_issue3549(en_vocab):
|
|||
"""Test that match pattern validation doesn't raise on empty errors."""
|
||||
matcher = Matcher(en_vocab, validate=True)
|
||||
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
|
||||
matcher.add("GOOD", None, pattern)
|
||||
matcher.add("GOOD", [pattern])
|
||||
with pytest.raises(MatchPatternError):
|
||||
matcher.add("BAD", None, [{"X": "Y"}])
|
||||
matcher.add("BAD", [[{"X": "Y"}]])
|
||||
|
|
|
@ -12,6 +12,6 @@ def test_issue3555(en_vocab):
|
|||
Token.set_extension("issue3555", default=None)
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{"LEMMA": "have"}, {"_": {"issue3555": True}}]
|
||||
matcher.add("TEST", None, pattern)
|
||||
matcher.add("TEST", [pattern])
|
||||
doc = Doc(en_vocab, words=["have", "apple"])
|
||||
matcher(doc)
|
||||
|
|
|
@ -34,8 +34,7 @@ def test_issue3611():
|
|||
nlp.add_pipe(textcat, last=True)
|
||||
|
||||
# training the network
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
|
||||
with nlp.disable_pipes(*other_pipes):
|
||||
with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]):
|
||||
optimizer = nlp.begin_training()
|
||||
for i in range(3):
|
||||
losses = {}
|
||||
|
|
|
@ -12,10 +12,10 @@ def test_issue3839(en_vocab):
|
|||
match_id = "PATTERN"
|
||||
pattern1 = [{"LOWER": "terrific"}, {"OP": "?"}, {"LOWER": "group"}]
|
||||
pattern2 = [{"LOWER": "terrific"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "group"}]
|
||||
matcher.add(match_id, None, pattern1)
|
||||
matcher.add(match_id, [pattern1])
|
||||
matches = matcher(doc)
|
||||
assert matches[0][0] == en_vocab.strings[match_id]
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add(match_id, None, pattern2)
|
||||
matcher.add(match_id, [pattern2])
|
||||
matches = matcher(doc)
|
||||
assert matches[0][0] == en_vocab.strings[match_id]
|
||||
|
|
|
@ -10,5 +10,5 @@ def test_issue3879(en_vocab):
|
|||
assert len(doc) == 5
|
||||
pattern = [{"ORTH": "This", "OP": "?"}, {"OP": "?"}, {"ORTH": "test"}]
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add("TEST", None, pattern)
|
||||
matcher.add("TEST", [pattern])
|
||||
assert len(matcher(doc)) == 2 # fails because of a FP match 'is a test'
|
||||
|
|
|
@ -14,7 +14,7 @@ def test_issue3951(en_vocab):
|
|||
{"OP": "?"},
|
||||
{"LOWER": "world"},
|
||||
]
|
||||
matcher.add("TEST", None, pattern)
|
||||
matcher.add("TEST", [pattern])
|
||||
doc = Doc(en_vocab, words=["Hello", "my", "new", "world"])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 0
|
||||
|
|
|
@ -9,8 +9,8 @@ def test_issue3972(en_vocab):
|
|||
"""Test that the PhraseMatcher returns duplicates for duplicate match IDs.
|
||||
"""
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add("A", None, Doc(en_vocab, words=["New", "York"]))
|
||||
matcher.add("B", None, Doc(en_vocab, words=["New", "York"]))
|
||||
matcher.add("A", [Doc(en_vocab, words=["New", "York"])])
|
||||
matcher.add("B", [Doc(en_vocab, words=["New", "York"])])
|
||||
doc = Doc(en_vocab, words=["I", "live", "in", "New", "York"])
|
||||
matches = matcher(doc)
|
||||
|
||||
|
|
|
@ -11,7 +11,7 @@ def test_issue4002(en_vocab):
|
|||
matcher = PhraseMatcher(en_vocab, attr="NORM")
|
||||
pattern1 = Doc(en_vocab, words=["c", "d"])
|
||||
assert [t.norm_ for t in pattern1] == ["c", "d"]
|
||||
matcher.add("TEST", None, pattern1)
|
||||
matcher.add("TEST", [pattern1])
|
||||
doc = Doc(en_vocab, words=["a", "b", "c", "d"])
|
||||
assert [t.norm_ for t in doc] == ["a", "b", "c", "d"]
|
||||
matches = matcher(doc)
|
||||
|
@ -21,6 +21,6 @@ def test_issue4002(en_vocab):
|
|||
pattern2[0].norm_ = "c"
|
||||
pattern2[1].norm_ = "d"
|
||||
assert [t.norm_ for t in pattern2] == ["c", "d"]
|
||||
matcher.add("TEST", None, pattern2)
|
||||
matcher.add("TEST", [pattern2])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 1
|
||||
|
|
|
@ -34,8 +34,7 @@ def test_issue4030():
|
|||
nlp.add_pipe(textcat, last=True)
|
||||
|
||||
# training the network
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
|
||||
with nlp.disable_pipes(*other_pipes):
|
||||
with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]):
|
||||
optimizer = nlp.begin_training()
|
||||
for i in range(3):
|
||||
losses = {}
|
||||
|
|
|
@ -8,7 +8,7 @@ from spacy.tokens import Doc
|
|||
def test_issue4120(en_vocab):
|
||||
"""Test that matches without a final {OP: ?} token are returned."""
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}])
|
||||
matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}]])
|
||||
doc1 = Doc(en_vocab, words=["a"])
|
||||
assert len(matcher(doc1)) == 1 # works
|
||||
|
||||
|
@ -16,11 +16,11 @@ def test_issue4120(en_vocab):
|
|||
assert len(matcher(doc2)) == 2 # fixed
|
||||
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b"}])
|
||||
matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b"}]])
|
||||
doc3 = Doc(en_vocab, words=["a", "b", "b", "c"])
|
||||
assert len(matcher(doc3)) == 2 # works
|
||||
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add("TEST", None, [{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b", "OP": "?"}])
|
||||
matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b", "OP": "?"}]])
|
||||
doc4 = Doc(en_vocab, words=["a", "b", "b", "c"])
|
||||
assert len(matcher(doc4)) == 3 # fixed
|
||||
|
|
96
spacy/tests/regression/test_issue4402.py
Normal file
|
@ -0,0 +1,96 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import srsly
|
||||
from spacy.gold import GoldCorpus
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.tests.util import make_tempdir
|
||||
|
||||
|
||||
def test_issue4402():
|
||||
nlp = English()
|
||||
with make_tempdir() as tmpdir:
|
||||
print("temp", tmpdir)
|
||||
json_path = tmpdir / "test4402.json"
|
||||
srsly.write_json(json_path, json_data)
|
||||
|
||||
corpus = GoldCorpus(str(json_path), str(json_path))
|
||||
|
||||
train_docs = list(corpus.train_docs(nlp, gold_preproc=True, max_length=0))
|
||||
# assert that the data got split into 4 sentences
|
||||
assert len(train_docs) == 4
|
||||
|
||||
|
||||
json_data = [
|
||||
{
|
||||
"id": 0,
|
||||
"paragraphs": [
|
||||
{
|
||||
"raw": "How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven.",
|
||||
"sentences": [
|
||||
{
|
||||
"tokens": [
|
||||
{"id": 0, "orth": "How", "ner": "O"},
|
||||
{"id": 1, "orth": "should", "ner": "O"},
|
||||
{"id": 2, "orth": "I", "ner": "O"},
|
||||
{"id": 3, "orth": "cook", "ner": "O"},
|
||||
{"id": 4, "orth": "bacon", "ner": "O"},
|
||||
{"id": 5, "orth": "in", "ner": "O"},
|
||||
{"id": 6, "orth": "an", "ner": "O"},
|
||||
{"id": 7, "orth": "oven", "ner": "O"},
|
||||
{"id": 8, "orth": "?", "ner": "O"},
|
||||
],
|
||||
"brackets": [],
|
||||
},
|
||||
{
|
||||
"tokens": [
|
||||
{"id": 9, "orth": "\n", "ner": "O"},
|
||||
{"id": 10, "orth": "I", "ner": "O"},
|
||||
{"id": 11, "orth": "'ve", "ner": "O"},
|
||||
{"id": 12, "orth": "heard", "ner": "O"},
|
||||
{"id": 13, "orth": "of", "ner": "O"},
|
||||
{"id": 14, "orth": "people", "ner": "O"},
|
||||
{"id": 15, "orth": "cooking", "ner": "O"},
|
||||
{"id": 16, "orth": "bacon", "ner": "O"},
|
||||
{"id": 17, "orth": "in", "ner": "O"},
|
||||
{"id": 18, "orth": "an", "ner": "O"},
|
||||
{"id": 19, "orth": "oven", "ner": "O"},
|
||||
{"id": 20, "orth": ".", "ner": "O"},
|
||||
],
|
||||
"brackets": [],
|
||||
},
|
||||
],
|
||||
"cats": [
|
||||
{"label": "baking", "value": 1.0},
|
||||
{"label": "not_baking", "value": 0.0},
|
||||
],
|
||||
},
|
||||
{
|
||||
"raw": "What is the difference between white and brown eggs?\n",
|
||||
"sentences": [
|
||||
{
|
||||
"tokens": [
|
||||
{"id": 0, "orth": "What", "ner": "O"},
|
||||
{"id": 1, "orth": "is", "ner": "O"},
|
||||
{"id": 2, "orth": "the", "ner": "O"},
|
||||
{"id": 3, "orth": "difference", "ner": "O"},
|
||||
{"id": 4, "orth": "between", "ner": "O"},
|
||||
{"id": 5, "orth": "white", "ner": "O"},
|
||||
{"id": 6, "orth": "and", "ner": "O"},
|
||||
{"id": 7, "orth": "brown", "ner": "O"},
|
||||
{"id": 8, "orth": "eggs", "ner": "O"},
|
||||
{"id": 9, "orth": "?", "ner": "O"},
|
||||
],
|
||||
"brackets": [],
|
||||
},
|
||||
{"tokens": [{"id": 10, "orth": "\n", "ner": "O"}], "brackets": []},
|
||||
],
|
||||
"cats": [
|
||||
{"label": "baking", "value": 0.0},
|
||||
{"label": "not_baking", "value": 1.0},
|
||||
],
|
||||
},
|
||||
],
|
||||
}
|
||||
]
|
19
spacy/tests/regression/test_issue4528.py
Normal file
|
@ -0,0 +1,19 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from spacy.tokens import Doc, DocBin
|
||||
|
||||
|
||||
def test_issue4528(en_vocab):
|
||||
"""Test that user_data is correctly serialized in DocBin."""
|
||||
doc = Doc(en_vocab, words=["hello", "world"])
|
||||
doc.user_data["foo"] = "bar"
|
||||
# This is how extension attribute values are stored in the user data
|
||||
doc.user_data[("._.", "foo", None, None)] = "bar"
|
||||
doc_bin = DocBin(store_user_data=True)
|
||||
doc_bin.add(doc)
|
||||
doc_bin_bytes = doc_bin.to_bytes()
|
||||
new_doc_bin = DocBin(store_user_data=True).from_bytes(doc_bin_bytes)
|
||||
new_doc = list(new_doc_bin.get_docs(en_vocab))[0]
|
||||
assert new_doc.user_data["foo"] == "bar"
|
||||
assert new_doc.user_data[("._.", "foo", None, None)] == "bar"
|
13
spacy/tests/regression/test_issue4529.py
Normal file
|
@ -0,0 +1,13 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.gold import GoldParse
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text,words", [("A'B C", ["A", "'", "B", "C"]), ("A-B", ["A-B"])]
|
||||
)
|
||||
def test_gold_misaligned(en_tokenizer, text, words):
|
||||
doc = en_tokenizer(text)
|
||||
GoldParse(doc, words=words)
|
|
@ -3,7 +3,7 @@ from __future__ import unicode_literals
|
|||
|
||||
from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags
|
||||
from spacy.gold import spans_from_biluo_tags, GoldParse, iob_to_biluo
|
||||
from spacy.gold import GoldCorpus, docs_to_json
|
||||
from spacy.gold import GoldCorpus, docs_to_json, align
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Doc
|
||||
from .util import make_tempdir
|
||||
|
@ -90,7 +90,7 @@ def test_gold_ner_missing_tags(en_tokenizer):
|
|||
def test_iob_to_biluo():
|
||||
good_iob = ["O", "O", "B-LOC", "I-LOC", "O", "B-PERSON"]
|
||||
good_biluo = ["O", "O", "B-LOC", "L-LOC", "O", "U-PERSON"]
|
||||
bad_iob = ["O", "O", "\"", "B-LOC", "I-LOC"]
|
||||
bad_iob = ["O", "O", '"', "B-LOC", "I-LOC"]
|
||||
converted_biluo = iob_to_biluo(good_iob)
|
||||
assert good_biluo == converted_biluo
|
||||
with pytest.raises(ValueError):
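The conversion exercised by `test_iob_to_biluo` can be sketched in plain Python. This is only an illustration under stated assumptions, not spaCy's implementation: the name `iob_to_biluo_sketch` is hypothetical, and the sketch assumes every tag is `O`, `B-LABEL` or `I-LABEL`, raising `ValueError` for anything else (which is what the test expects for the stray `"` tag).

```python
def iob_to_biluo_sketch(tags):
    """Convert IOB tags to BILUO tags (illustrative sketch, not spaCy's code)."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, _, label = tag.partition("-")
        if prefix not in ("B", "I") or not label:
            raise ValueError("Invalid IOB tag: {!r}".format(tag))
        # The entity continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label
        if prefix == "B":
            # A single-token entity becomes U-, otherwise the entity begins here.
            biluo.append(("B-" if continues else "U-") + label)
        else:
            # The last token of an entity becomes L-, otherwise it stays inside.
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


good_iob = ["O", "O", "B-LOC", "I-LOC", "O", "B-PERSON"]
converted = iob_to_biluo_sketch(good_iob)
```

With the `good_iob` input from the test, the sketch yields the same tags as `good_biluo`.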
|
||||
|
@ -99,14 +99,23 @@ def test_iob_to_biluo():
|
|||
|
||||
def test_roundtrip_docs_to_json():
|
||||
text = "I flew to Silicon Valley via London."
|
||||
tags = ["PRP", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."]
|
||||
heads = [1, 1, 1, 4, 2, 1, 5, 1]
|
||||
deps = ["nsubj", "ROOT", "prep", "compound", "pobj", "prep", "pobj", "punct"]
|
||||
biluo_tags = ["O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"]
|
||||
cats = {"TRAVEL": 1.0, "BAKING": 0.0}
|
||||
nlp = English()
|
||||
doc = nlp(text)
|
||||
for i in range(len(tags)):
|
||||
doc[i].tag_ = tags[i]
|
||||
doc[i].dep_ = deps[i]
|
||||
doc[i].head = doc[heads[i]]
|
||||
doc.ents = spans_from_biluo_tags(doc, biluo_tags)
|
||||
doc.cats = cats
|
||||
doc[0].is_sent_start = True
|
||||
for i in range(1, len(doc)):
|
||||
doc[i].is_sent_start = False
|
||||
doc.is_tagged = True
|
||||
doc.is_parsed = True
|
||||
|
||||
# roundtrip to JSON
|
||||
with make_tempdir() as tmpdir:
|
||||
json_file = tmpdir / "roundtrip.json"
|
||||
srsly.write_json(json_file, [docs_to_json(doc)])
|
||||
|
@ -116,7 +125,95 @@ def test_roundtrip_docs_to_json():
|
|||
|
||||
assert len(doc) == goldcorpus.count_train()
|
||||
assert text == reloaded_doc.text
|
||||
assert tags == goldparse.tags
|
||||
assert deps == goldparse.labels
|
||||
assert heads == goldparse.heads
|
||||
assert biluo_tags == goldparse.ner
|
||||
assert "TRAVEL" in goldparse.cats
|
||||
assert "BAKING" in goldparse.cats
|
||||
assert cats["TRAVEL"] == goldparse.cats["TRAVEL"]
|
||||
assert cats["BAKING"] == goldparse.cats["BAKING"]
|
||||
|
||||
# roundtrip to JSONL train dicts
|
||||
with make_tempdir() as tmpdir:
|
||||
jsonl_file = tmpdir / "roundtrip.jsonl"
|
||||
srsly.write_jsonl(jsonl_file, [docs_to_json(doc)])
|
||||
goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file))
|
||||
|
||||
reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp))
|
||||
|
||||
assert len(doc) == goldcorpus.count_train()
|
||||
assert text == reloaded_doc.text
|
||||
assert tags == goldparse.tags
|
||||
assert deps == goldparse.labels
|
||||
assert heads == goldparse.heads
|
||||
assert biluo_tags == goldparse.ner
|
||||
assert "TRAVEL" in goldparse.cats
|
||||
assert "BAKING" in goldparse.cats
|
||||
assert cats["TRAVEL"] == goldparse.cats["TRAVEL"]
|
||||
assert cats["BAKING"] == goldparse.cats["BAKING"]
|
||||
|
||||
# roundtrip to JSONL tuples
|
||||
with make_tempdir() as tmpdir:
|
||||
jsonl_file = tmpdir / "roundtrip.jsonl"
|
||||
# write to JSONL train dicts
|
||||
srsly.write_jsonl(jsonl_file, [docs_to_json(doc)])
|
||||
goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file))
|
||||
# load and rewrite as JSONL tuples
|
||||
srsly.write_jsonl(jsonl_file, goldcorpus.train_tuples)
|
||||
goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file))
|
||||
|
||||
reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp))
|
||||
|
||||
assert len(doc) == goldcorpus.count_train()
|
||||
assert text == reloaded_doc.text
|
||||
assert tags == goldparse.tags
|
||||
assert deps == goldparse.labels
|
||||
assert heads == goldparse.heads
|
||||
assert biluo_tags == goldparse.ner
|
||||
assert "TRAVEL" in goldparse.cats
|
||||
assert "BAKING" in goldparse.cats
|
||||
assert cats["TRAVEL"] == goldparse.cats["TRAVEL"]
|
||||
assert cats["BAKING"] == goldparse.cats["BAKING"]
|
||||
|
||||
|
||||
# xfail while we have backwards-compatible alignment
|
||||
@pytest.mark.xfail
|
||||
@pytest.mark.parametrize(
|
||||
"tokens_a,tokens_b,expected",
|
||||
[
|
||||
(["a", "b", "c"], ["ab", "c"], (3, [-1, -1, 1], [-1, 2], {0: 0, 1: 0}, {})),
|
||||
(
|
||||
["a", "b", "``", "c"],
|
||||
['ab"', "c"],
|
||||
(4, [-1, -1, -1, 1], [-1, 3], {0: 0, 1: 0, 2: 0}, {}),
|
||||
),
|
||||
(["a", "bc"], ["ab", "c"], (4, [-1, -1], [-1, -1], {0: 0}, {1: 1})),
|
||||
(
|
||||
["ab", "c", "d"],
|
||||
["a", "b", "cd"],
|
||||
(6, [-1, -1, -1], [-1, -1, -1], {1: 2, 2: 2}, {0: 0, 1: 0}),
|
||||
),
|
||||
(
|
||||
["a", "b", "cd"],
|
||||
["a", "b", "c", "d"],
|
||||
(3, [0, 1, -1], [0, 1, -1, -1], {}, {2: 2, 3: 2}),
|
||||
),
|
||||
([" ", "a"], ["a"], (1, [-1, 0], [1], {}, {})),
|
||||
],
|
||||
)
|
||||
def test_align(tokens_a, tokens_b, expected):
|
||||
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_a, tokens_b)
|
||||
assert (cost, list(a2b), list(b2a), a2b_multi, b2a_multi) == expected
|
||||
# check symmetry
|
||||
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_b, tokens_a)
|
||||
assert (cost, list(b2a), list(a2b), b2a_multi, a2b_multi) == expected
|
||||
|
||||
|
||||
def test_goldparse_startswith_space(en_tokenizer):
|
||||
text = " a"
|
||||
doc = en_tokenizer(text)
|
||||
g = GoldParse(doc, words=["a"], entities=["U-DATE"], deps=["ROOT"], heads=[0])
|
||||
assert g.words == [" ", "a"]
|
||||
assert g.ner == [None, "U-DATE"]
|
||||
assert g.labels == [None, "ROOT"]
|
||||
|
|
|
@ -95,12 +95,18 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
|
|||
|
||||
|
||||
def test_prefer_gpu():
|
||||
assert not prefer_gpu()
|
||||
try:
|
||||
import cupy # noqa: F401
|
||||
except ImportError:
|
||||
assert not prefer_gpu()
|
||||
|
||||
|
||||
def test_require_gpu():
|
||||
with pytest.raises(ValueError):
|
||||
require_gpu()
|
||||
try:
|
||||
import cupy # noqa: F401
|
||||
except ImportError:
|
||||
with pytest.raises(ValueError):
|
||||
require_gpu()
|
||||
|
||||
|
||||
def test_create_symlink_windows(
|
||||
|
|
66
spacy/tests/test_tok2vec.py
Normal file
66
spacy/tests/test_tok2vec.py
Normal file
|
@ -0,0 +1,66 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
from spacy._ml import Tok2Vec
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.tokens import Doc
|
||||
from spacy.compat import unicode_
|
||||
|
||||
|
||||
def get_batch(batch_size):
|
||||
vocab = Vocab()
|
||||
docs = []
|
||||
start = 0
|
||||
for size in range(1, batch_size + 1):
|
||||
# Make the words numbers, so that they're distinct
|
||||
# across the batch, and easy to track.
|
||||
numbers = [unicode_(i) for i in range(start, start + size)]
|
||||
docs.append(Doc(vocab, words=numbers))
|
||||
start += size
|
||||
return docs
|
||||
|
||||
|
||||
# This fails in Thinc v7.3.1. Need to push patch
|
||||
@pytest.mark.xfail
|
||||
def test_empty_doc():
|
||||
width = 128
|
||||
embed_size = 2000
|
||||
vocab = Vocab()
|
||||
doc = Doc(vocab, words=[])
|
||||
tok2vec = Tok2Vec(width, embed_size)
|
||||
vectors, backprop = tok2vec.begin_update([doc])
|
||||
assert len(vectors) == 1
|
||||
assert vectors[0].shape == (0, width)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"batch_size,width,embed_size", [[1, 128, 2000], [2, 128, 2000], [3, 8, 63]]
|
||||
)
|
||||
def test_tok2vec_batch_sizes(batch_size, width, embed_size):
|
||||
batch = get_batch(batch_size)
|
||||
tok2vec = Tok2Vec(width, embed_size)
|
||||
vectors, backprop = tok2vec.begin_update(batch)
|
||||
assert len(vectors) == len(batch)
|
||||
for doc_vec, doc in zip(vectors, batch):
|
||||
assert doc_vec.shape == (len(doc), width)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"tok2vec_config",
|
||||
[
|
||||
{"width": 8, "embed_size": 100, "char_embed": False},
|
||||
{"width": 8, "embed_size": 100, "char_embed": True},
|
||||
{"width": 8, "embed_size": 100, "conv_depth": 6},
|
||||
{"width": 8, "embed_size": 100, "conv_depth": 6},
|
||||
{"width": 8, "embed_size": 100, "subword_features": False},
|
||||
],
|
||||
)
|
||||
def test_tok2vec_configs(tok2vec_config):
|
||||
docs = get_batch(3)
|
||||
tok2vec = Tok2Vec(**tok2vec_config)
|
||||
vectors, backprop = tok2vec.begin_update(docs)
|
||||
assert len(vectors) == len(docs)
|
||||
assert vectors[0].shape == (len(docs[0]), tok2vec_config["width"])
|
||||
backprop(vectors)
|
|
@ -103,7 +103,8 @@ class DocBin(object):
|
|||
doc = Doc(vocab, words=words, spaces=spaces)
|
||||
doc = doc.from_array(self.attrs, tokens)
|
||||
if self.store_user_data:
|
||||
doc.user_data.update(srsly.msgpack_loads(self.user_data[i]))
|
||||
user_data = srsly.msgpack_loads(self.user_data[i], use_list=False)
|
||||
doc.user_data.update(user_data)
|
||||
yield doc
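The `use_list=False` change matters because extension attribute values are stored under tuple keys such as `("._.", "foo", None, None)`. A rough stdlib-only sketch of the underlying problem, assuming a loader that returns serialized arrays as Python lists (the variable names here are hypothetical):

```python
# How a list-decoding loader would hand back the serialized key (assumption).
decoded_key = ["._.", "foo", None, None]

try:
    user_data = {decoded_key: "bar"}  # lists are unhashable
    list_key_ok = True
except TypeError:
    list_key_ok = False

# Decoding arrays as tuples instead gives a hashable key that matches
# the key format used for extension attributes.
tuple_key = tuple(decoded_key)
user_data = {tuple_key: "bar"}
```

Looking up `user_data[("._.", "foo", None, None)]` then works, which is what the new regression test for issue 4528 checks after a `DocBin` round trip.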
|
||||
|
||||
def merge(self, other):
|
||||
|
@ -155,9 +156,9 @@ class DocBin(object):
|
|||
msg = srsly.msgpack_loads(zlib.decompress(bytes_data))
|
||||
self.attrs = msg["attrs"]
|
||||
self.strings = set(msg["strings"])
|
||||
lengths = numpy.fromstring(msg["lengths"], dtype="int32")
|
||||
flat_spaces = numpy.fromstring(msg["spaces"], dtype=bool)
|
||||
flat_tokens = numpy.fromstring(msg["tokens"], dtype="uint64")
|
||||
lengths = numpy.frombuffer(msg["lengths"], dtype="int32")
|
||||
flat_spaces = numpy.frombuffer(msg["spaces"], dtype=bool)
|
||||
flat_tokens = numpy.frombuffer(msg["tokens"], dtype="uint64")
|
||||
shape = (flat_tokens.size // len(self.attrs), len(self.attrs))
|
||||
flat_tokens = flat_tokens.reshape(shape)
|
||||
flat_spaces = flat_spaces.reshape((flat_spaces.size, 1))
|
||||
|
|
|
@ -142,6 +142,11 @@ def register_architecture(name, arch=None):
|
|||
return do_registration
|
||||
|
||||
|
||||
def make_layer(arch_config):
|
||||
arch_func = get_architecture(arch_config["arch"])
|
||||
return arch_func(arch_config["config"])
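`make_layer` resolves an architecture function by name and calls it with the accompanying config. A minimal sketch of that lookup pattern, assuming a plain dict as the registry; `ARCHITECTURES`, the `*_sketch` functions and the `"toy_arch"` entry are hypothetical names for illustration, not spaCy's internals:

```python
ARCHITECTURES = {}  # hypothetical registry: name -> factory function


def register_architecture_sketch(name):
    def do_registration(arch):
        ARCHITECTURES[name] = arch
        return arch

    return do_registration


def get_architecture_sketch(name):
    # Raises KeyError if the name was never registered.
    return ARCHITECTURES[name]


def make_layer_sketch(arch_config):
    # Same shape as make_layer: look up by "arch", call with "config".
    arch_func = get_architecture_sketch(arch_config["arch"])
    return arch_func(arch_config["config"])


@register_architecture_sketch("toy_arch")
def toy_arch(config):
    # Stand-in for a real model constructor.
    return {"width": config.get("width", 96)}


layer = make_layer_sketch({"arch": "toy_arch", "config": {"width": 8}})
```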
|
||||
|
||||
|
||||
def get_architecture(name):
|
||||
"""Get a model architecture function by name. Raises a KeyError if the
|
||||
architecture is not found.
|
||||
|
@ -242,6 +247,7 @@ def load_model_from_path(model_path, meta=False, **overrides):
|
|||
cls = get_lang_class(lang)
|
||||
nlp = cls(meta=meta, **overrides)
|
||||
pipeline = meta.get("pipeline", [])
|
||||
factories = meta.get("factories", {})
|
||||
disable = overrides.get("disable", [])
|
||||
if pipeline is True:
|
||||
pipeline = nlp.Defaults.pipe_names
|
||||
|
@ -250,7 +256,8 @@ def load_model_from_path(model_path, meta=False, **overrides):
|
|||
for name in pipeline:
|
||||
if name not in disable:
|
||||
config = meta.get("pipeline_args", {}).get(name, {})
|
||||
component = nlp.create_pipe(name, config=config)
|
||||
factory = factories.get(name, name)
|
||||
component = nlp.create_pipe(factory, config=config)
|
||||
nlp.add_pipe(component, name=name)
|
||||
return nlp.from_disk(model_path)
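The new `factories` lookup lets a pipeline component be registered under a name that differs from the factory that creates it, falling back to the component name when no mapping exists. A small sketch of just that resolution step; the `meta` values below are made up for illustration:

```python
# Hypothetical meta.json contents: "my_ner" is created by the "ner" factory.
meta = {"pipeline": ["tagger", "my_ner"], "factories": {"my_ner": "ner"}}

pipeline = meta.get("pipeline", [])
factories = meta.get("factories", {})

# Each component name maps to its factory, defaulting to the name itself.
resolved = [(name, factories.get(name, name)) for name in pipeline]
```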
|
||||
|
||||
|
@ -363,6 +370,16 @@ def is_in_jupyter():
|
|||
return False
|
||||
|
||||
|
||||
def get_component_name(component):
|
||||
if hasattr(component, "name"):
|
||||
return component.name
|
||||
if hasattr(component, "__name__"):
|
||||
return component.__name__
|
||||
if hasattr(component, "__class__") and hasattr(component.__class__, "__name__"):
|
||||
return component.__class__.__name__
|
||||
return repr(component)
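To see which branch of `get_component_name` applies to which kind of component, the function from this hunk can be run on a few stand-in objects. The classes and the function `my_component` below are hypothetical examples, not part of spaCy:

```python
def get_component_name(component):
    # Reproduced from the hunk above so the example is self-contained.
    if hasattr(component, "name"):
        return component.name
    if hasattr(component, "__name__"):
        return component.__name__
    if hasattr(component, "__class__") and hasattr(component.__class__, "__name__"):
        return component.__class__.__name__
    return repr(component)


class Named(object):
    name = "custom_name"  # explicit name attribute wins


def my_component(doc):  # plain functions fall back to __name__
    return doc


class Unnamed(object):  # instances without a name use the class name
    def __call__(self, doc):
        return doc


names = [
    get_component_name(Named()),
    get_component_name(my_component),
    get_component_name(Unnamed()),
]
```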
|
||||
|
||||
|
||||
def get_cuda_stream(require=False):
|
||||
if CudaStream is None:
|
||||
return None
|
||||
|
@ -404,7 +421,7 @@ def env_opt(name, default=None):
|
|||
|
||||
def read_regex(path):
|
||||
path = ensure_path(path)
|
||||
with path.open() as file_:
|
||||
with path.open(encoding="utf8") as file_:
|
||||
entries = file_.read().split("\n")
|
||||
expression = "|".join(
|
||||
["^" + re.escape(piece) for piece in entries if piece.strip()]
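The expression built in `read_regex` anchors each non-empty line of the file at the start of the string and joins the alternatives with `|`. A sketch of that construction on in-memory data; the sample entries are made up, and compiling the result is an assumption, since the rest of the function is not shown in this hunk:

```python
import re

# Stand-in for file_.read().split("\n"); blank lines are skipped below.
entries = "un\nre\n\npre.\n".split("\n")

expression = "|".join(
    ["^" + re.escape(piece) for piece in entries if piece.strip()]
)

# Assumption: the expression is compiled before use.
pattern = re.compile(expression)
```

Because each piece goes through `re.escape`, the `.` in `pre.` matches only a literal dot.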
|
||||
|
|
|
@ -48,14 +48,14 @@ be installed if needed via `pip install spacy[lookups]`. Some languages provide
|
|||
full lemmatization rules and exceptions, while other languages currently only
|
||||
rely on simple lookup tables.
|
||||
|
||||
<Infobox title="About spaCy's custom pronoun lemma" variant="warning">
|
||||
<Infobox title="About spaCy's custom pronoun lemma for English" variant="warning">
|
||||
|
||||
spaCy adds a **special case for pronouns**: all pronouns are lemmatized to the
|
||||
special token `-PRON-`. Unlike verbs and common nouns, there's no clear base
|
||||
form of a personal pronoun. Should the lemma of "me" be "I", or should we
|
||||
normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to
|
||||
introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal
|
||||
pronouns.
|
||||
spaCy adds a **special case for English pronouns**: all English pronouns are
|
||||
lemmatized to the special token `-PRON-`. Unlike verbs and common nouns,
|
||||
there's no clear base form of a personal pronoun. Should the lemma of "me" be
|
||||
"I", or should we normalize person as well, giving "it" — or maybe "he"?
|
||||
spaCy's solution is to introduce a novel symbol, `-PRON-`, which is used as the
|
||||
lemma for all personal pronouns.
|
||||
|
||||
</Infobox>
|
||||
|
||||
|
@ -117,76 +117,72 @@ type. They're available as the [`Token.pos`](/api/token#attributes) and
|
|||
|
||||
The English part-of-speech tagger uses the
|
||||
[OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) version of the Penn
|
||||
Treebank tag set. We also map the tags to the simpler Google Universal POS tag
|
||||
set.
|
||||
|
||||
| Tag | POS | Morphology | Description |
|
||||
| ----------------------------------- | ------- | ---------------------------------------------- | ----------------------------------------- |
|
||||
| `-LRB-` | `PUNCT` | `PunctType=brck PunctSide=ini` | left round bracket |
|
||||
| `-RRB-` | `PUNCT` | `PunctType=brck PunctSide=fin` | right round bracket |
|
||||
| `,` | `PUNCT` | `PunctType=comm` | punctuation mark, comma |
|
||||
| `:` | `PUNCT` | | punctuation mark, colon or ellipsis |
|
||||
| `.` | `PUNCT` | `PunctType=peri` | punctuation mark, sentence closer |
|
||||
| `''` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark |
|
||||
| `""` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark |
|
||||
| <InlineCode>``</InlineCode> | `PUNCT` | `PunctType=quot PunctSide=ini` | opening quotation mark |
|
||||
| `#` | `SYM` | `SymType=numbersign` | symbol, number sign |
|
||||
| `$` | `SYM` | `SymType=currency` | symbol, currency |
|
||||
| `ADD` | `X` | | email |
|
||||
| `AFX` | `ADJ` | `Hyph=yes` | affix |
|
||||
| `BES` | `VERB` | | auxiliary "be" |
|
||||
| `CC` | `CONJ` | `ConjType=coor` | conjunction, coordinating |
|
||||
| `CD` | `NUM` | `NumType=card` | cardinal number |
|
||||
| `DT` | `DET` | | determiner |
|
||||
| `EX` | `ADV` | `AdvType=ex` | existential there |
|
||||
| `FW` | `X` | `Foreign=yes` | foreign word |
|
||||
| `GW` | `X` | | additional word in multi-word expression |
|
||||
| `HVS` | `VERB` | | forms of "have" |
|
||||
| `HYPH` | `PUNCT` | `PunctType=dash` | punctuation mark, hyphen |
|
||||
| `IN` | `ADP` | | conjunction, subordinating or preposition |
|
||||
| `JJ` | `ADJ` | `Degree=pos` | adjective |
|
||||
| `JJR` | `ADJ` | `Degree=comp` | adjective, comparative |
|
||||
| `JJS` | `ADJ` | `Degree=sup` | adjective, superlative |
|
||||
| `LS` | `PUNCT` | `NumType=ord` | list item marker |
|
||||
| `MD` | `VERB` | `VerbType=mod` | verb, modal auxiliary |
|
||||
| `NFP` | `PUNCT` | | superfluous punctuation |
|
||||
| `NIL` | | | missing tag |
|
||||
| `NN` | `NOUN` | `Number=sing` | noun, singular or mass |
|
||||
| `NNP` | `PROPN` | `NounType=prop Number=sign` | noun, proper singular |
|
||||
| `NNPS` | `PROPN` | `NounType=prop Number=plur` | noun, proper plural |
|
||||
| `NNS` | `NOUN` | `Number=plur` | noun, plural |
|
||||
| `PDT` | `ADJ` | `AdjType=pdt PronType=prn` | predeterminer |
|
||||
| `POS` | `PART` | `Poss=yes` | possessive ending |
|
||||
| `PRP` | `PRON` | `PronType=prs` | pronoun, personal |
|
||||
| `PRP$` | `ADJ` | `PronType=prs Poss=yes` | pronoun, possessive |
|
||||
| `RB` | `ADV` | `Degree=pos` | adverb |
|
||||
| `RBR` | `ADV` | `Degree=comp` | adverb, comparative |
|
||||
| `RBS` | `ADV` | `Degree=sup` | adverb, superlative |
|
||||
| `RP` | `PART` | | adverb, particle |
|
||||
| `_SP` | `SPACE` | | space |
|
||||
| `SYM` | `SYM` | | symbol |
|
||||
| `TO` | `PART` | `PartType=inf VerbForm=inf` | infinitival "to" |
|
||||
| `UH` | `INTJ` | | interjection |
|
||||
| `VB` | `VERB` | `VerbForm=inf` | verb, base form |
|
||||
| `VBD` | `VERB` | `VerbForm=fin Tense=past` | verb, past tense |
|
||||
| `VBG` | `VERB` | `VerbForm=part Tense=pres Aspect=prog` | verb, gerund or present participle |
|
||||
| `VBN` | `VERB` | `VerbForm=part Tense=past Aspect=perf` | verb, past participle |
|
||||
| `VBP` | `VERB` | `VerbForm=fin Tense=pres` | verb, non-3rd person singular present |
|
||||
| `VBZ` | `VERB` | `VerbForm=fin Tense=pres Number=sing Person=3` | verb, 3rd person singular present |
|
||||
| `WDT` | `ADJ` | `PronType=int|rel` | wh-determiner |
|
||||
| `WP` | `NOUN` | `PronType=int|rel` | wh-pronoun, personal |
|
||||
| `WP$` | `ADJ` | `Poss=yes PronType=int|rel` | wh-pronoun, possessive |
|
||||
| `WRB` | `ADV` | `PronType=int|rel` | wh-adverb |
|
||||
| `XX` | `X` | | unknown |
|
||||
Treebank tag set. We also map the tags to the simpler Universal Dependencies v2
|
||||
POS tag set.
|
||||
|
||||
| Tag | POS | Morphology | Description |
|
||||
| ------------------------------------- | ------- | --------------------------------------- | ----------------------------------------- |
|
||||
| `$` | `SYM` | | symbol, currency |
|
||||
| <InlineCode>``</InlineCode> | `PUNCT` | `PunctType=quot PunctSide=ini` | opening quotation mark |
|
||||
| `''` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark |
|
||||
| `,` | `PUNCT` | `PunctType=comm` | punctuation mark, comma |
|
||||
| `-LRB-` | `PUNCT` | `PunctType=brck PunctSide=ini` | left round bracket |
|
||||
| `-RRB-` | `PUNCT` | `PunctType=brck PunctSide=fin` | right round bracket |
|
||||
| `.` | `PUNCT` | `PunctType=peri` | punctuation mark, sentence closer |
|
||||
| `:` | `PUNCT` | | punctuation mark, colon or ellipsis |
|
||||
| `ADD` | `X` | | email |
|
||||
| `AFX` | `ADJ` | `Hyph=yes` | affix |
|
||||
| `CC` | `CCONJ` | `ConjType=comp` | conjunction, coordinating |
|
||||
| `CD` | `NUM` | `NumType=card` | cardinal number |
|
||||
| `DT` | `DET` | | determiner |
|
||||
| `EX` | `PRON` | `AdvType=ex` | existential there |
|
||||
| `FW` | `X` | `Foreign=yes` | foreign word |
|
||||
| `GW` | `X` | | additional word in multi-word expression |
|
||||
| `HYPH` | `PUNCT` | `PunctType=dash` | punctuation mark, hyphen |
|
||||
| `IN` | `ADP` | | conjunction, subordinating or preposition |
|
||||
| `JJ` | `ADJ` | `Degree=pos` | adjective |
|
||||
| `JJR` | `ADJ` | `Degree=comp` | adjective, comparative |
|
||||
| `JJS` | `ADJ` | `Degree=sup` | adjective, superlative |
|
||||
| `LS` | `X` | `NumType=ord` | list item marker |
|
||||
| `MD` | `VERB` | `VerbType=mod` | verb, modal auxiliary |
|
||||
| `NFP` | `PUNCT` | | superfluous punctuation |
|
||||
| `NIL` | `X` | | missing tag |
|
||||
| `NN` | `NOUN` | `Number=sing` | noun, singular or mass |
|
||||
| `NNP` | `PROPN` | `NounType=prop Number=sing` | noun, proper singular |
|
||||
| `NNPS` | `PROPN` | `NounType=prop Number=plur` | noun, proper plural |
|
||||
| `NNS` | `NOUN` | `Number=plur` | noun, plural |
|
||||
| `PDT` | `DET` | | predeterminer |
|
||||
| `POS` | `PART` | `Poss=yes` | possessive ending |
|
||||
| `PRP` | `PRON` | `PronType=prs` | pronoun, personal |
|
||||
| `PRP$` | `DET` | `PronType=prs Poss=yes` | pronoun, possessive |
|
||||
| `RB` | `ADV` | `Degree=pos` | adverb |
|
||||
| `RBR` | `ADV` | `Degree=comp` | adverb, comparative |
|
||||
| `RBS` | `ADV` | `Degree=sup` | adverb, superlative |
|
||||
| `RP` | `ADP` | | adverb, particle |
|
||||
| `SP` | `SPACE` | | space |
|
||||
| `SYM` | `SYM` | | symbol |
|
||||
| `TO` | `PART` | `PartType=inf VerbForm=inf` | infinitival "to" |
|
||||
| `UH` | `INTJ` | | interjection |
|
||||
| `VB` | `VERB` | `VerbForm=inf` | verb, base form |
|
||||
| `VBD` | `VERB` | `VerbForm=fin Tense=past` | verb, past tense |
|
||||
| `VBG` | `VERB` | `VerbForm=part Tense=pres Aspect=prog` | verb, gerund or present participle |
|
||||
| `VBN` | `VERB` | `VerbForm=part Tense=past Aspect=perf` | verb, past participle |
|
||||
| `VBP` | `VERB` | `VerbForm=fin Tense=pres` | verb, non-3rd person singular present |
|
||||
| `VBZ` | `VERB` | `VerbForm=fin Tense=pres Number=sing Person=three` | verb, 3rd person singular present |
|
||||
| `WDT` | `DET` | | wh-determiner |
|
||||
| `WP` | `PRON` | | wh-pronoun, personal |
|
||||
| `WP$` | `DET` | `Poss=yes` | wh-pronoun, possessive |
|
||||
| `WRB` | `ADV` | | wh-adverb |
|
||||
| `XX` | `X` | | unknown |
|
||||
| `_SP` | `SPACE` | | |
|
||||
</Accordion>
|
||||
|
||||
<Accordion title="German" id="pos-de">
|
||||
|
||||
The German part-of-speech tagger uses the
|
||||
[TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html)
|
||||
annotation scheme. We also map the tags to the simpler Google Universal POS tag
|
||||
set.
|
||||
annotation scheme. We also map the tags to the simpler Universal Dependencies
|
||||
v2 POS tag set.
|
||||
|
||||
| Tag | POS | Morphology | Description |
|
||||
| --------- | ------- | ---------------------------------------- | ------------------------------------------------- |
|
||||
|
@ -194,7 +190,7 @@ set.
|
|||
| `$,` | `PUNCT` | `PunctType=comm` | comma |
|
||||
| `$.` | `PUNCT` | `PunctType=peri` | sentence-final punctuation mark |
|
||||
| `ADJA` | `ADJ` | | adjective, attributive |
|
||||
| `ADJD` | `ADJ` | `Variant=short` | adjective, adverbial or predicative |
|
||||
| `ADJD` | `ADJ` | | adjective, adverbial or predicative |
|
||||
| `ADV` | `ADV` | | adverb |
|
||||
| `APPO` | `ADP` | `AdpType=post` | postposition |
|
||||
| `APPR` | `ADP` | `AdpType=prep` | preposition; circumposition left |
|
||||
|
@ -204,28 +200,28 @@ set.
|
|||
| `CARD` | `NUM` | `NumType=card` | cardinal number |
|
||||
| `FM` | `X` | `Foreign=yes` | foreign language material |
|
||||
| `ITJ` | `INTJ` | | interjection |
|
||||
| `KOKOM` | `CONJ` | `ConjType=comp` | comparative conjunction |
|
||||
| `KON` | `CONJ` | | coordinate conjunction |
|
||||
| `KOKOM` | `CCONJ` | `ConjType=comp` | comparative conjunction |
|
||||
| `KON` | `CCONJ` | | coordinate conjunction |
|
||||
| `KOUI` | `SCONJ` | | subordinate conjunction with "zu" and infinitive |
|
||||
| `KOUS` | `SCONJ` | | subordinate conjunction with sentence |
|
||||
| `NE` | `PROPN` | | proper noun |
|
||||
| `NNE` | `PROPN` | | proper noun |
|
||||
| `NN` | `NOUN` | | noun, singular or mass |
|
||||
| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb |
|
||||
| `NNE` | `PROPN` | | proper noun |
|
||||
| `PDAT` | `DET` | `PronType=dem` | attributive demonstrative pronoun |
|
||||
| `PDS` | `PRON` | `PronType=dem` | substituting demonstrative pronoun |
|
||||
| `PIAT` | `DET` | `PronType=ind\|neg\|tot` | attributive indefinite pronoun without determiner |
|
||||
| `PIS` | `PRON` | `PronType=ind\|neg\|tot` | substituting indefinite pronoun |
|
||||
| `PIAT` | `DET` | `PronType=ind|neg|tot` | attributive indefinite pronoun without determiner |
|
||||
| `PIS` | `PRON` | `PronType=ind|neg|tot` | substituting indefinite pronoun |
|
||||
| `PPER` | `PRON` | `PronType=prs` | non-reflexive personal pronoun |
|
||||
| `PPOSAT` | `DET` | `Poss=yes PronType=prs` | attributive possessive pronoun |
|
||||
| `PPOSS` | `PRON` | `PronType=rel` | substituting possessive pronoun |
|
||||
| `PPOSS` | `PRON` | `Poss=yes PronType=prs` | substituting possessive pronoun |
|
||||
| `PRELAT` | `DET` | `PronType=rel` | attributive relative pronoun |
|
||||
| `PRELS` | `PRON` | `PronType=rel` | substituting relative pronoun |
|
||||
| `PRF` | `PRON` | `PronType=prs Reflex=yes` | reflexive personal pronoun |
|
||||
| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb |
|
||||
| `PTKA` | `PART` | | particle with adjective or adverb |
|
||||
| `PTKANT` | `PART` | `PartType=res` | answer particle |
|
||||
| `PTKNEG` | `PART` | `Negative=yes` | negative particle |
|
||||
| `PTKVZ` | `PART` | `PartType=vbp` | separable verbal particle |
|
||||
| `PTKNEG` | `PART` | `Polarity=neg` | negative particle |
|
||||
| `PTKVZ` | `ADP` | `PartType=vbp` | separable verbal particle |
|
||||
| `PTKZU` | `PART` | `PartType=inf` | "zu" before infinitive |
|
||||
| `PWAT` | `DET` | `PronType=int` | attributive interrogative pronoun |
|
||||
| `PWAV` | `ADV` | `PronType=int` | adverbial interrogative or relative pronoun |
|
||||
|
@ -234,9 +230,9 @@ set.
|
|||
| `VAFIN` | `AUX` | `Mood=ind VerbForm=fin` | finite verb, auxiliary |
|
||||
| `VAIMP` | `AUX` | `Mood=imp VerbForm=fin` | imperative, auxiliary |
|
||||
| `VAINF` | `AUX` | `VerbForm=inf` | infinitive, auxiliary |
|
||||
| `VAPP` | `AUX` | `Aspect=perf VerbForm=fin` | perfect participle, auxiliary |
|
||||
| `VAPP` | `AUX` | `Aspect=perf VerbForm=part` | perfect participle, auxiliary |
|
||||
| `VMFIN` | `VERB` | `Mood=ind VerbForm=fin VerbType=mod` | finite verb, modal |
|
||||
| `VMINF` | `VERB` | `VerbForm=fin VerbType=mod` | infinitive, modal |
|
||||
| `VMINF` | `VERB` | `VerbForm=inf VerbType=mod` | infinitive, modal |
|
||||
| `VMPP` | `VERB` | `Aspect=perf VerbForm=part VerbType=mod` | perfect participle, modal |
|
||||
| `VVFIN` | `VERB` | `Mood=ind VerbForm=fin` | finite verb, full |
|
||||
| `VVIMP` | `VERB` | `Mood=imp VerbForm=fin` | imperative, full |
|
||||
|
@ -244,8 +240,7 @@ set.
|
|||
| `VVIZU` | `VERB` | `VerbForm=inf` | infinitive with "zu", full |
|
||||
| `VVPP` | `VERB` | `Aspect=perf VerbForm=part` | perfect participle, full |
|
||||
| `XY` | `X` | | non-word containing non-letter |
|
||||
| `SP` | `SPACE` | | space |
|
||||
|
||||
| `_SP` | `SPACE` | | |
|
||||
</Accordion>
|
||||
|
||||
---
|
||||
|
|
|
@ -155,21 +155,14 @@ $ python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
|
|||
|
||||
### Output file types {new="2.1"}
|
||||
|
||||
> #### Which format should I choose?
|
||||
>
|
||||
> If you're not sure, go with the default `jsonl`. Newline-delimited JSON means
|
||||
> that there's one JSON object per line. Unlike a regular JSON file, it can also
|
||||
> be read in line-by-line and you won't have to parse the _entire file_ first.
|
||||
> This makes it a very convenient format for larger corpora.
|
||||
|
||||
All output files generated by this command are compatible with
|
||||
[`spacy train`](/api/cli#train).
|
||||
|
||||
| ID | Description |
|
||||
| ------- | --------------------------------- |
|
||||
| `jsonl` | Newline-delimited JSON (default). |
|
||||
| `json` | Regular JSON. |
|
||||
| `msg` | Binary MessagePack format. |
|
||||
| ID | Description |
|
||||
| ------- | -------------------------- |
|
||||
| `json` | Regular JSON (default). |
|
||||
| `jsonl` | Newline-delimited JSON. |
|
||||
| `msg` | Binary MessagePack format. |
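The practical difference between the `json` and `jsonl` outputs can be illustrated with the standard library. This is a generic sketch: the records below are made up and do not follow spaCy's training data schema.

```python
import io
import json

records = [{"id": 0, "text": "first"}, {"id": 1, "text": "second"}]

# Regular JSON: a single document, parsed as a whole.
json_text = json.dumps(records)
loaded_json = json.loads(json_text)

# Newline-delimited JSON: one object per line, readable line by line.
jsonl_text = "\n".join(json.dumps(record) for record in records) + "\n"
loaded_jsonl = [
    json.loads(line) for line in io.StringIO(jsonl_text) if line.strip()
]
```

Both variants recover the same records; the line-by-line form avoids parsing the entire file first, which helps with larger corpora.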
|
||||
|
||||
### Converter options
|
||||
|
||||
|
@ -453,8 +446,10 @@ improvement.
|
|||
|
||||
```bash
|
||||
$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
|
||||
[--width] [--depth] [--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length] [--min-length]
|
||||
[--seed] [--n-iter] [--use-vectors] [--n-save_every] [--init-tok2vec] [--epoch-start]
|
||||
[--width] [--depth] [--cnn-window] [--cnn-pieces] [--use-chars] [--sa-depth]
|
||||
[--embed-rows] [--loss-func] [--dropout] [--batch-size] [--max-length]
|
||||
[--min-length] [--seed] [--n-iter] [--use-vectors] [--n-save-every]
|
||||
[--init-tok2vec] [--epoch-start]
|
||||
```
|
||||
|
||||
| Argument | Type | Description |
|
||||
|
@ -464,6 +459,10 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
|
|||
| `output_dir` | positional | Directory to write models to on each epoch. |
|
||||
| `--width`, `-cw` | option | Width of CNN layers. |
|
||||
| `--depth`, `-cd` | option | Depth of CNN layers. |
|
||||
| `--cnn-window`, `-cW` <Tag variant="new">2.2.2</Tag> | option | Window size for CNN layers. |
|
||||
| `--cnn-pieces`, `-cP` <Tag variant="new">2.2.2</Tag> | option | Maxout size for CNN layers. `1` for [Mish](https://github.com/digantamisra98/Mish). |
|
||||
| `--use-chars`, `-chr` <Tag variant="new">2.2.2</Tag> | flag | Whether to use character-based embedding. |
|
||||
| `--sa-depth`, `-sa` <Tag variant="new">2.2.2</Tag> | option | Depth of self-attention layers. |
|
||||
| `--embed-rows`, `-er` | option | Number of embedding rows. |
|
||||
| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"L2"` or `"cosine"`. |
|
||||
| `--dropout`, `-d` | option | Dropout rate. |
|
||||
|
@ -476,7 +475,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
|
|||
| `--n-save-every`, `-se` | option | Save model every X batches. |
|
||||
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
|
||||
| `--epoch-start`, `-es` <Tag variant="new">2.1.5</Tag> | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files. |
|
||||
| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. |
|
||||
| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. |
|
||||
|
||||
### JSONL format for raw text {#pretrain-jsonl}
|
||||
|
||||
|
|
|
@ -202,6 +202,14 @@ All labels present in the match patterns.
|
|||
| ----------- | ----- | ------------------ |
|
||||
| **RETURNS** | tuple | The string labels. |
|
||||
|
||||
## EntityRuler.ent_ids {#ent_ids tag="property" new="2.2.2"}
|
||||
|
||||
All entity IDs present in the `id` properties of the match patterns.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------- |
|
||||
| **RETURNS** | tuple | The string ent_ids. |
|
||||
|
||||
## EntityRuler.patterns {#patterns tag="property"}
|
||||
|
||||
Get all patterns that were added to the entity ruler.
|
||||
|
|
|
@ -323,18 +323,38 @@ you can use to undo your changes.
|
|||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> with nlp.disable_pipes('tagger', 'parser'):
|
||||
> # New API as of v2.2.2
|
||||
> with nlp.disable_pipes(["tagger", "parser"]):
|
||||
> nlp.begin_training()
|
||||
>
|
||||
> with nlp.disable_pipes("tagger", "parser"):
|
||||
> nlp.begin_training()
|
||||
>
|
||||
> disabled = nlp.disable_pipes('tagger', 'parser')
|
||||
> disabled = nlp.disable_pipes("tagger", "parser")
|
||||
> nlp.begin_training()
|
||||
> disabled.restore()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | --------------- | ------------------------------------------------------------------------------------ |
|
||||
| `*disabled` | unicode | Names of pipeline components to disable. |
|
||||
| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. |
|
||||
| Name | Type | Description |
|
||||
| ----------------------------------------- | --------------- | ------------------------------------------------------------------------------------ |
|
||||
| `disabled` <Tag variant="new">2.2.2</Tag> | list | Names of pipeline components to disable. |
|
||||
| `*disabled` | unicode | Names of pipeline components to disable. |
|
||||
| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. |
|
||||
|
||||
<Infobox title="Changed in v2.2.2" variant="warning">
|
||||
|
||||
As of spaCy v2.2.2, the `Language.disable_pipes` method can also take a list of
|
||||
component names as its first argument (instead of a variable number of
|
||||
arguments). This is especially useful if you're generating the component names
|
||||
to disable programmatically. The new syntax will become the default in the
|
||||
future.
|
||||
|
||||
```diff
|
||||
- disabled = nlp.disable_pipes("tagger", "parser")
|
||||
+ disabled = nlp.disable_pipes(["tagger", "parser"])
|
||||
```
|
||||
|
||||
</Infobox>
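
As a rough illustration of how one function can accept both call styles, here is a minimal sketch in plain Python. `normalize_names` is a hypothetical helper. It is not part of spaCy and not how `Language.disable_pipes` is implemented.

```python
def normalize_names(*names):
    """Return component names as a flat list for both call styles.

    Hypothetical helper: accepts either several string arguments or a
    single list/tuple of strings as the first argument.
    """
    if len(names) == 1 and isinstance(names[0], (list, tuple)):
        return list(names[0])
    return list(names)


# Variable number of arguments.
old_style = normalize_names("tagger", "parser")

# A single list as the first argument, here built programmatically.
to_disable = [name for name in ["tagger", "parser", "ner"] if name != "ner"]
new_style = normalize_names(to_disable)
```

Building `to_disable` with a list comprehension is the kind of programmatic use the list form is meant to simplify, since no argument unpacking is needed at the call site.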
|
||||
|
||||
## Language.to_disk {#to_disk tag="method" new="2"}
|
||||
|
||||
|
|
|
@ -157,16 +157,19 @@ overwritten.
|
|||
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
|
||||
| `*patterns` | list | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. |
|
||||
|
||||
<Infobox title="Changed in v2.0" variant="warning">
|
||||
<Infobox title="Changed in v2.2.2" variant="warning">
|
||||
|
||||
As of spaCy 2.0, `Matcher.add_pattern` and `Matcher.add_entity` are deprecated
|
||||
and have been replaced with a simpler [`Matcher.add`](/api/matcher#add) that
|
||||
lets you add a list of patterns and a callback for a given match ID.
|
||||
As of spaCy 2.2.2, `Matcher.add` also supports the new API, which will become
|
||||
the default in the future. The patterns are now the second argument and a list
|
||||
(instead of a variable number of arguments). The `on_match` callback becomes an
|
||||
optional keyword argument.
|
||||
|
||||
```diff
|
||||
- matcher.add_entity("GoogleNow", on_match=merge_phrases)
|
||||
- matcher.add_pattern("GoogleNow", [{ORTH: "Google"}, {ORTH: "Now"}])
|
||||
+ matcher.add('GoogleNow', merge_phrases, [{"ORTH": "Google"}, {"ORTH": "Now"}])
|
||||
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
|
||||
- matcher.add("GoogleNow", None, *patterns)
|
||||
+ matcher.add("GoogleNow", patterns)
|
||||
- matcher.add("GoogleNow", on_match, *patterns)
|
||||
+ matcher.add("GoogleNow", patterns, on_match=on_match)
|
||||
```
|
||||
|
||||
</Infobox>
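
The following sketch shows one way the two calling conventions from the diff above could be told apart. `resolve_add_args` and the placeholder `on_match` callback are hypothetical and written in plain Python. This is an assumption about how such a compatibility layer might look, not spaCy's actual implementation of `Matcher.add`.

```python
def resolve_add_args(key, patterns, *extra_patterns, on_match=None):
    """Return (key, patterns, on_match) for either call style.

    Hypothetical helper, not spaCy's implementation. If the second
    positional argument is None or callable, the call is treated as the
    older style: (key, on_match, *patterns).
    """
    if patterns is None or callable(patterns):
        on_match = patterns
        patterns = list(extra_patterns)
    return key, list(patterns), on_match


def on_match(matcher, doc, i, matches):
    """Placeholder callback with the documented argument names."""


patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]

# Older style: callback second, patterns as variable arguments.
old_call = resolve_add_args("GoogleNow", on_match, *patterns)
# Newer style: list of patterns second, callback as keyword argument.
new_call = resolve_add_args("GoogleNow", patterns, on_match=on_match)

old_without_callback = resolve_add_args("GoogleNow", None, *patterns)
new_without_callback = resolve_add_args("GoogleNow", patterns)
```

Each old-style call resolves to the same `(key, patterns, on_match)` triple as its new-style counterpart, which is what lets both styles coexist during a transition.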
|
||||
|
|
|
@ -153,6 +153,23 @@ overwritten.
|
|||
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
|
||||
| `*docs` | `Doc` | `Doc` objects of the phrases to match. |
|
||||
|
||||
<Infobox title="Changed in v2.2.2" variant="warning">
|
||||
|
||||
As of spaCy 2.2.2, `PhraseMatcher.add` also supports the new API, which will
|
||||
become the default in the future. The `Doc` patterns are now the second argument
|
||||
and a list (instead of a variable number of arguments). The `on_match` callback
|
||||
becomes an optional keyword argument.
|
||||
|
||||
```diff
|
||||
patterns = [nlp("health care reform"), nlp("healthcare reform")]
|
||||
- matcher.add("HEALTH", None, *patterns)
|
||||
+ matcher.add("HEALTH", patterns)
|
||||
- matcher.add("HEALTH", on_match, *patterns)
|
||||
+ matcher.add("HEALTH", patterns, on_match=on_match)
|
||||
```
|
||||
|
||||
</Infobox>
|
||||
|
||||
## PhraseMatcher.remove {#remove tag="method" new="2.2"}
|
||||
|
||||
Remove a rule from the matcher by match ID. A `KeyError` is raised if the key
|
||||
|
|
|
@ -1,9 +1,33 @@
|
|||
<div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px">But
|
||||
<mark class="entity" style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Google
|
||||
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>is starting from behind. The company made a late push into hardware,
|
||||
and
|
||||
<mark class="entity" style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Apple
|
||||
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>’s Siri, available on iPhones, and
|
||||
<mark class="entity" style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Amazon
|
||||
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>’s Alexa software, which runs on its Echo and Dot devices, have clear
|
||||
leads in consumer adoption.</div>
|
||||
<div
|
||||
class="entities"
|
||||
style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px"
|
||||
>But
|
||||
<mark
|
||||
class="entity"
|
||||
style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
|
||||
>Google
|
||||
<span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>ORG</span
|
||||
></mark
|
||||
>is starting from behind. The company made a late push into hardware, and
|
||||
<mark
|
||||
class="entity"
|
||||
style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
|
||||
>Apple
|
||||
<span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>ORG</span
|
||||
></mark
|
||||
>’s Siri, available on iPhones, and
|
||||
<mark
|
||||
class="entity"
|
||||
style="background: linear-gradient(90deg, #AA9CFC, #FC9CE7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
|
||||
>Amazon
|
||||
<span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>ORG</span
|
||||
></mark
|
||||
>’s Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer
|
||||
adoption.</div
|
||||
>
|
||||
|
|
|
@ -2,17 +2,25 @@
|
|||
class="entities"
|
||||
style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px"
|
||||
>
|
||||
🌱🌿 <mark
|
||||
class="entity"
|
||||
style="background: #3dff74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"
|
||||
>🐍 <span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>SNEK</span
|
||||
></mark> ____ 🌳🌲 ____ <mark
|
||||
class="entity"
|
||||
style="background: #cfc5ff; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone"
|
||||
>👨🌾 <span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>HUMAN</span
|
||||
></mark> 🏘️
|
||||
🌱🌿
|
||||
<mark
|
||||
class="entity"
|
||||
style="background: #3dff74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
|
||||
>🐍
|
||||
<span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>SNEK</span
|
||||
></mark
|
||||
>
|
||||
____ 🌳🌲 ____
|
||||
<mark
|
||||
class="entity"
|
||||
style="background: #cfc5ff; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
|
||||
>👨🌾
|
||||
<span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>HUMAN</span
|
||||
></mark
|
||||
>
|
||||
🏘️
|
||||
</div>
|
||||
|
|
|
@ -1,16 +1,37 @@
|
|||
<div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px">
|
||||
<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
|
||||
<div
|
||||
class="entities"
|
||||
style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px"
|
||||
>
|
||||
<mark
|
||||
class="entity"
|
||||
style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
|
||||
>
|
||||
Apple
|
||||
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span>
|
||||
<span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>ORG</span
|
||||
>
|
||||
</mark>
|
||||
is looking at buying
|
||||
<mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
|
||||
<mark
|
||||
class="entity"
|
||||
style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
|
||||
>
|
||||
U.K.
|
||||
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">GPE</span>
|
||||
<span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>GPE</span
|
||||
>
|
||||
</mark>
|
||||
startup for
|
||||
<mark class="entity" style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
|
||||
<mark
|
||||
class="entity"
|
||||
style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
|
||||
>
|
||||
$1 billion
|
||||
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">MONEY</span>
|
||||
<span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>MONEY</span
|
||||
>
|
||||
</mark>
|
||||
</div>
|
||||
|
|
|
@ -1,18 +1,39 @@
|
|||
<div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px">
|
||||
<div
|
||||
class="entities"
|
||||
style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px"
|
||||
>
|
||||
When
|
||||
<mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
|
||||
<mark
|
||||
class="entity"
|
||||
style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
|
||||
>
|
||||
Sebastian Thrun
|
||||
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PERSON</span>
|
||||
<span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>PERSON</span
|
||||
>
|
||||
</mark>
|
||||
started working on self-driving cars at
|
||||
<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
|
||||
<mark
|
||||
class="entity"
|
||||
style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
|
||||
>
|
||||
Google
|
||||
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span>
|
||||
<span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>ORG</span
|
||||
>
|
||||
</mark>
|
||||
in
|
||||
<mark class="entity" style="background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
|
||||
<mark
|
||||
class="entity"
|
||||
style="background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"
|
||||
>
|
||||
2007
|
||||
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">DATE</span>
|
||||
<span
|
||||
style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem"
|
||||
>DATE</span
|
||||
>
|
||||
</mark>
|
||||
, few people outside of the company took him seriously.
|
||||
</div>
|
||||
|
|
|
@ -986,6 +986,37 @@ doc = nlp("Apple is opening its first big office in San Francisco.")
|
|||
print([(ent.text, ent.label_) for ent in doc.ents])
|
||||
```
|
||||
|
||||
### Adding IDs to patterns {#entityruler-ent-ids new="2.2.2"}
|
||||
|
||||
The [`EntityRuler`](/api/entityruler) can also accept an `id` attribute for each
|
||||
pattern. Using the `id` attribute allows multiple patterns to be associated with
|
||||
the same entity.
|
||||
|
||||
```python
|
||||
### {executable="true"}
|
||||
from spacy.lang.en import English
|
||||
from spacy.pipeline import EntityRuler
|
||||
|
||||
nlp = English()
|
||||
ruler = EntityRuler(nlp)
|
||||
patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple"},
|
||||
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"},
|
||||
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"}]
|
||||
ruler.add_patterns(patterns)
|
||||
nlp.add_pipe(ruler)
|
||||
|
||||
doc1 = nlp("Apple is opening its first big office in San Francisco.")
|
||||
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc1.ents])
|
||||
|
||||
doc2 = nlp("Apple is opening its first big office in San Fran.")
|
||||
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc2.ents])
|
||||
```
|
||||
|
||||
If the `id` attribute is included in the [`EntityRuler`](/api/entityruler)
|
||||
patterns, the `ent_id_` property of the matched entity is set to the `id` given
|
||||
in the patterns. So in the example above it's easy to identify that "San
|
||||
Francisco" and "San Fran" refer to the same entity.
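
One use of the shared `id` is to collect the different surface forms of an entity. The sketch below does this in plain Python. The `mentions` tuples are hand-written stand-ins in the `(text, label, ent_id)` shape built by the example above, not captured output from spaCy.

```python
from collections import defaultdict

# Hand-written stand-ins in the shape (text, label, ent_id); not captured output.
mentions = [
    ("Apple", "ORG", "apple"),
    ("San Francisco", "GPE", "san-francisco"),
    ("Apple", "ORG", "apple"),
    ("San Fran", "GPE", "san-francisco"),
]

# Map each entity ID to the set of texts it was matched as.
surface_forms = defaultdict(set)
for text, label, ent_id in mentions:
    surface_forms[ent_id].add(text)
```

With real documents, the same loop could run over `(ent.text, ent.label_, ent.ent_id_)` for each entity in `doc.ents`.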
|
||||
|
||||
The entity ruler is designed to integrate with spaCy's existing statistical
|
||||
models and enhance the named entity recognizer. If it's added **before the
|
||||
`"ner"` component**, the entity recognizer will respect the existing entity
|
||||
|
|
|
@ -127,6 +127,7 @@
|
|||
{ "code": "sr", "name": "Serbian" },
|
||||
{ "code": "sk", "name": "Slovak" },
|
||||
{ "code": "sl", "name": "Slovenian" },
|
||||
{ "code": "lb", "name": "Luxembourgish" },
|
||||
{
|
||||
"code": "sq",
|
||||
"name": "Albanian",
|
||||
|
|