Mirror of https://github.com/explosion/spaCy.git (synced 2025-02-21 22:10:34 +03:00)

commit 3141ea0931: Merge remote-tracking branch 'origin/develop' into rliaw-develop

.github/contributors/tiangolo.md (new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                           | Entry                |
|------------------------------- | -------------------- |
| Name                            | Sebastián Ramírez    |
| Company name (if applicable)    |                      |
| Title or role (if applicable)   |                      |
| Date                            | 2020-07-01           |
| GitHub username                 | tiangolo             |
| Website (optional)              |                      |
@@ -2,7 +2,6 @@ recursive-include include *.h
recursive-include spacy *.pyx *.pxd *.txt *.cfg
include LICENSE
include README.md
include bin/spacy
include pyproject.toml
recursive-exclude spacy/lang *.json
recursive-include spacy/lang *.json.gz
@@ -194,7 +194,7 @@ pip install https://github.com/explosion/spacy-models/releases/download/en_core_

### Loading and using models

To load a model, use `spacy.load()` with the model name, a shortcut link or a
To load a model, use `spacy.load()` with the model name or a
path to the model data directory.

```python
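
As context for the README change above, a minimal usage sketch of the loader (it assumes the separately installed `en_core_web_sm` package, which is not part of this diff):

```python
import spacy

# Load an installed model package by name, or pass a path to a model directory.
nlp = spacy.load("en_core_web_sm")

doc = nlp("spaCy parses this sentence into tokens and entities.")
print([(token.text, token.pos_) for token in doc])
print([(ent.text, ent.label_) for ent in doc.ents])
```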
@@ -54,6 +54,10 @@ seed = ${training:seed}
use_pytorch_for_gpu_memory = ${training:use_pytorch_for_gpu_memory}
tok2vec_model = "nlp.pipeline.tok2vec.model"

[pretraining.objective]
type = "characters"
n_characters = 4

[pretraining.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9

@@ -65,10 +69,6 @@ use_averages = true
eps = 1e-8
learn_rate = 0.001

[pretraining.loss_func]
@losses = "CosineDistance.v1"
normalize = true

[nlp]
lang = "en"
vectors = null
@@ -33,7 +33,7 @@ def read_raw_data(nlp, jsonl_loc):
for json_obj in srsly.read_jsonl(jsonl_loc):
if json_obj["text"].strip():
doc = nlp.make_doc(json_obj["text"])
yield doc
yield Example.from_dict(doc, {})


def read_gold_data(nlp, gold_loc):

@@ -52,7 +52,7 @@ def main(model_name, unlabelled_loc):
batch_size = 4
nlp = spacy.load(model_name)
nlp.get_pipe("ner").add_label(LABEL)
raw_docs = list(read_raw_data(nlp, unlabelled_loc))
raw_examples = list(read_raw_data(nlp, unlabelled_loc))
optimizer = nlp.resume_training()
# Avoid use of Adam when resuming training. I don't understand this well
# yet, but I'm getting weird results from Adam. Try commenting out the

@@ -61,20 +61,24 @@ def main(model_name, unlabelled_loc):
optimizer.learn_rate = 0.1
optimizer.b1 = 0.0
optimizer.b2 = 0.0

sizes = compounding(1.0, 4.0, 1.001)

train_examples = []
for text, annotations in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))

with nlp.select_pipes(enable="ner") and warnings.catch_warnings():
# show warnings for misaligned entity spans once
warnings.filterwarnings("once", category=UserWarning, module="spacy")

for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
random.shuffle(raw_docs)
random.shuffle(train_examples)
random.shuffle(raw_examples)
losses = {}
r_losses = {}
# batch up the examples using spaCy's minibatch
raw_batches = minibatch(raw_docs, size=4)
for batch in minibatch(TRAIN_DATA, size=sizes):
raw_batches = minibatch(raw_examples, size=4)
for batch in minibatch(train_examples, size=sizes):
nlp.update(batch, sgd=optimizer, drop=dropout, losses=losses)
raw_batch = list(next(raw_batches))
nlp.rehearse(raw_batch, sgd=optimizer, losses=r_losses)
@@ -20,6 +20,8 @@ from pathlib import Path
from spacy.vocab import Vocab
import spacy
from spacy.kb import KnowledgeBase

from spacy.gold import Example
from spacy.pipeline import EntityRuler
from spacy.util import minibatch, compounding

@@ -94,7 +96,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
# Convert the texts to docs to make sure we have doc.ents set for the training examples.
# Also ensure that the annotated examples correspond to known identifiers in the knowledge base.
kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
TRAIN_DOCS = []
train_examples = []
for text, annotation in TRAIN_DATA:
with nlp.select_pipes(disable="entity_linker"):
doc = nlp(text)

@@ -109,17 +111,17 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
"Removed", kb_id, "from training because it is not in the KB."
)
annotation_clean["links"][offset] = new_dict
TRAIN_DOCS.append((doc, annotation_clean))
train_examples.append(Example.from_dict(doc, annotation_clean))

with nlp.select_pipes(enable="entity_linker"):  # only train entity linker
# reset and initialize the weights randomly
optimizer = nlp.begin_training()

for itn in range(n_iter):
random.shuffle(TRAIN_DOCS)
random.shuffle(train_examples)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))
batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
nlp.update(
batch,
@@ -23,6 +23,7 @@ import plac
import random
from pathlib import Path
import spacy
from spacy.gold import Example
from spacy.util import minibatch, compounding

@@ -120,17 +121,19 @@ def main(model=None, output_dir=None, n_iter=15):
parser = nlp.create_pipe("parser")
nlp.add_pipe(parser, first=True)

train_examples = []
for text, annotations in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
for dep in annotations.get("deps", []):
parser.add_label(dep)

with nlp.select_pipes(enable="parser"):  # only train parser
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
random.shuffle(train_examples)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
nlp.update(batch, sgd=optimizer, losses=losses)
print("Losses", losses)
@@ -14,6 +14,7 @@ import plac
import random
from pathlib import Path
import spacy
from spacy.gold import Example
from spacy.util import minibatch, compounding
from spacy.morphology import Morphology

@@ -84,8 +85,10 @@ def main(lang="en", output_dir=None, n_iter=25):
morphologizer = nlp.create_pipe("morphologizer")
nlp.add_pipe(morphologizer)

# add labels
for _, annotations in TRAIN_DATA:
# add labels and create the Example instances
train_examples = []
for text, annotations in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
morph_labels = annotations.get("morphs")
pos_labels = annotations.get("pos", [""] * len(annotations.get("morphs")))
assert len(morph_labels) == len(pos_labels)

@@ -98,10 +101,10 @@ def main(lang="en", output_dir=None, n_iter=25):

optimizer = nlp.begin_training()
for i in range(n_iter):
random.shuffle(TRAIN_DATA)
random.shuffle(train_examples)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
nlp.update(batch, sgd=optimizer, losses=losses)
print("Losses", losses)
@@ -17,6 +17,7 @@ import random
import warnings
from pathlib import Path
import spacy
from spacy.gold import Example
from spacy.util import minibatch, compounding

@@ -50,8 +51,10 @@ def main(model=None, output_dir=None, n_iter=100):
else:
ner = nlp.get_pipe("simple_ner")

# add labels
for _, annotations in TRAIN_DATA:
# add labels and create Example objects
train_examples = []
for text, annotations in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
for ent in annotations.get("entities"):
print("Add label", ent[2])
ner.add_label(ent[2])

@@ -68,10 +71,10 @@ def main(model=None, output_dir=None, n_iter=100):
"Transitions", list(enumerate(nlp.get_pipe("simple_ner").get_tag_names()))
)
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
random.shuffle(train_examples)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
nlp.update(
batch,
@@ -80,6 +80,10 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
print("Created blank 'en' model")
# Add entity recognizer to model if it's not in the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
train_examples = []
for text, annotation in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotation))

if "ner" not in nlp.pipe_names:
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

@@ -102,8 +106,8 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
sizes = compounding(1.0, 4.0, 1.001)
# batch up the examples using spaCy's minibatch
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
batches = minibatch(TRAIN_DATA, size=sizes)
random.shuffle(train_examples)
batches = minibatch(train_examples, size=sizes)
losses = {}
for batch in batches:
nlp.update(batch, sgd=optimizer, drop=0.35, losses=losses)
@@ -14,6 +14,7 @@ import plac
import random
from pathlib import Path
import spacy
from spacy.gold import Example
from spacy.util import minibatch, compounding

@@ -59,18 +60,20 @@ def main(model=None, output_dir=None, n_iter=15):
else:
parser = nlp.get_pipe("parser")

# add labels to the parser
for _, annotations in TRAIN_DATA:
# add labels to the parser and create the Example objects
train_examples = []
for text, annotations in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
for dep in annotations.get("deps", []):
parser.add_label(dep)

with nlp.select_pipes(enable="parser"):  # only train parser
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
random.shuffle(train_examples)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
nlp.update(batch, sgd=optimizer, losses=losses)
print("Losses", losses)
@@ -17,6 +17,7 @@ import plac
import random
from pathlib import Path
import spacy
from spacy.gold import Example
from spacy.util import minibatch, compounding

@@ -58,12 +59,16 @@ def main(lang="en", output_dir=None, n_iter=25):
tagger.add_label(tag, values)
nlp.add_pipe(tagger)

train_examples = []
for text, annotations in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))

optimizer = nlp.begin_training()
for i in range(n_iter):
random.shuffle(TRAIN_DATA)
random.shuffle(train_examples)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
nlp.update(batch, sgd=optimizer, losses=losses)
print("Losses", losses)
@@ -2,7 +2,7 @@ redirects = [
# Netlify
{from = "https://spacy.netlify.com/*", to="https://spacy.io/:splat", force = true },
# Subdomain for branches
{from = "https://nightly.spacy.io/*", to="https://spacy-io-develop.spacy.io/:splat", force = true, status = 200},
{from = "https://nightly.spacy.io/*", to="https://nightly-spacy-io.spacy.io/:splat", force = true, status = 200},
# Old subdomains
{from = "https://survey.spacy.io/*", to = "https://spacy.io", force = true},
{from = "http://survey.spacy.io/*", to = "https://spacy.io", force = true},

@@ -38,6 +38,13 @@ redirects = [
{from = "/docs/usage/showcase", to = "/universe", force = true},
{from = "/tutorials/load-new-word-vectors", to = "/usage/vectors-similarity#custom", force = true},
{from = "/tutorials", to = "/usage/examples", force = true},
# Old documentation pages (v2.x)
{from = "/usage/adding-languages", to = "/usage/linguistic-features", force = true},
{from = "/usage/vectors-similarity", to = "/usage/vectors-embeddings", force = true},
{from = "/api/goldparse", to = "/api/top-level", force = true},
{from = "/api/goldcorpus", to = "/api/corpus", force = true},
{from = "/api/annotation", to = "/api/data-formats", force = true},
{from = "/usage/examples", to = "/usage/projects", force = true},
# Rewrite all other docs pages to /
{from = "/docs/*", to = "/:splat"},
# Updated documentation pages
@@ -6,7 +6,8 @@ requires = [
"cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0",
"thinc==8.0.0a11",
"blis>=0.4.0,<0.5.0"
"thinc>=8.0.0a12,<8.0.0a20",
"blis>=0.4.0,<0.5.0",
"pytokenizations"
]
build-backend = "setuptools.build_meta"
@@ -1,19 +1,20 @@
# Our libraries
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc==8.0.0a11
thinc>=8.0.0a12,<8.0.0a20
blis>=0.4.0,<0.5.0
ml_datasets>=0.1.1
murmurhash>=0.28.0,<1.1.0
wasabi>=0.7.0,<1.1.0
srsly>=2.1.0,<3.0.0
catalogue>=0.0.7,<1.1.0
typer>=0.3.0,<1.0.0
typer>=0.3.0,<0.4.0
# Third party dependencies
numpy>=1.15.0
requests>=2.13.0,<3.0.0
tqdm>=4.38.0,<5.0.0
pydantic>=1.3.0,<2.0.0
pytokenizations
# Official Python utilities
setuptools
packaging
setup.cfg (13 changed lines)

@@ -25,8 +25,6 @@ classifiers =
[options]
zip_safe = false
include_package_data = true
scripts =
bin/spacy
python_requires = >=3.6
setup_requires =
wheel

@@ -36,28 +34,33 @@ setup_requires =
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0
thinc==8.0.0a11
thinc>=8.0.0a12,<8.0.0a20
install_requires =
# Our libraries
murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc==8.0.0a11
thinc>=8.0.0a12,<8.0.0a20
blis>=0.4.0,<0.5.0
wasabi>=0.7.0,<1.1.0
srsly>=2.1.0,<3.0.0
catalogue>=0.0.7,<1.1.0
typer>=0.3.0,<1.0.0
typer>=0.3.0,<0.4.0
# Third-party dependencies
tqdm>=4.38.0,<5.0.0
numpy>=1.15.0
requests>=2.13.0,<3.0.0
pydantic>=1.3.0,<2.0.0
pytokenizations
# Official Python utilities
setuptools
packaging
importlib_metadata>=0.20; python_version < "3.8"

[options.entry_points]
console_scripts =
spacy = spacy.cli:app

[options.extras_require]
lookups =
spacy_lookups_data>=0.3.2,<0.4.0
setup.py (3 changed lines)

@@ -1,11 +1,11 @@
#!/usr/bin/env python
from setuptools import Extension, setup, find_packages
import sys
import platform
from distutils.command.build_ext import build_ext
from distutils.sysconfig import get_python_inc
import distutils.util
from distutils import ccompiler, msvccompiler
from setuptools import Extension, setup, find_packages
import numpy
from pathlib import Path
import shutil

@@ -23,7 +23,6 @@ Options.docstrings = True

PACKAGES = find_packages()
MOD_NAMES = [
"spacy.gold.align",
"spacy.gold.example",
"spacy.parts_of_speech",
"spacy.strings",
@@ -25,9 +25,6 @@ config = registry


def load(name, **overrides):
depr_path = overrides.get("path")
if depr_path not in (True, False, None):
warnings.warn(Warnings.W001.format(path=depr_path), DeprecationWarning)
return util.load_model(name, **overrides)
@@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy-nightly"
__version__ = "3.0.0a0"
__version__ = "3.0.0a2"
__release__ = True
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
@@ -9,7 +9,7 @@ import sys
from ._app import app, Arg, Opt
from ..gold import docs_to_json
from ..tokens import DocBin
from ..gold.converters import iob2docs, conll_ner2docs, json2docs
from ..gold.converters import iob2docs, conll_ner2docs, json2docs, conllu2docs


# Converters are matched by file extension except for ner/iob, which are

@@ -18,9 +18,9 @@ from ..gold.converters import iob2docs, conll_ner2docs, json2docs
# imported from /converters.

CONVERTERS = {
# "conllubio": conllu2docs, TODO
# "conllu": conllu2docs, TODO
# "conll": conllu2docs, TODO
"conllubio": conllu2docs,
"conllu": conllu2docs,
"conll": conllu2docs,
"ner": conll_ner2docs,
"iob": iob2docs,
"json": json2docs,

@@ -28,7 +28,7 @@ CONVERTERS = {

# File types that can be written to stdout
FILE_TYPES_STDOUT = ("json")
FILE_TYPES_STDOUT = ("json",)

class FileTypes(str, Enum):
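
The added trailing comma is the substantive part of that change: `("json")` is just a parenthesized string, while `("json",)` is a one-element tuple, so `in` checks behave differently. A quick illustration:

```python
# Without the comma, the parentheses only group; the value is a str.
assert ("json") == "json"
assert ("json",) == tuple(["json"])

print("js" in ("json"))   # True: substring test against a string
print("js" in ("json",))  # False: membership test against a tuple
```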
@@ -48,7 +48,7 @@ def convert_cli(
morphology: bool = Opt(False, "--morphology", "-m", help="Enable appending morphology to tags"),
merge_subtokens: bool = Opt(False, "--merge-subtokens", "-T", help="Merge CoNLL-U subtokens"),
converter: str = Opt("auto", "--converter", "-c", help=f"Converter: {tuple(CONVERTERS.keys())}"),
ner_map: Optional[Path] = Opt(None, "--ner-map", "-N", help="NER tag mapping (as JSON-encoded dict of entity types)", exists=True),
ner_map: Optional[Path] = Opt(None, "--ner-map", "-nm", help="NER tag mapping (as JSON-encoded dict of entity types)", exists=True),
lang: Optional[str] = Opt(None, "--lang", "-l", help="Language (if tokenizer required)"),
# fmt: on
):

@@ -86,20 +86,20 @@ def convert_cli(


def convert(
input_path: Path,
output_dir: Path,
*,
file_type: str = "json",
n_sents: int = 1,
seg_sents: bool = False,
model: Optional[str] = None,
morphology: bool = False,
merge_subtokens: bool = False,
converter: str = "auto",
ner_map: Optional[Path] = None,
lang: Optional[str] = None,
silent: bool = True,
msg: Optional[Path] = None,
input_path: Path,
output_dir: Path,
*,
file_type: str = "json",
n_sents: int = 1,
seg_sents: bool = False,
model: Optional[str] = None,
morphology: bool = False,
merge_subtokens: bool = False,
converter: str = "auto",
ner_map: Optional[Path] = None,
lang: Optional[str] = None,
silent: bool = True,
msg: Optional[Path] = None,
) -> None:
if not msg:
msg = Printer(no_print=silent)

@@ -135,18 +135,18 @@ def convert(

def _print_docs_to_stdout(docs, output_type):
if output_type == "json":
srsly.write_json("-", docs_to_json(docs))
srsly.write_json("-", [docs_to_json(docs)])
else:
sys.stdout.buffer.write(DocBin(docs=docs).to_bytes())
sys.stdout.buffer.write(DocBin(docs=docs, store_user_data=True).to_bytes())


def _write_docs_to_file(docs, output_file, output_type):
if not output_file.parent.exists():
output_file.parent.mkdir(parents=True)
if output_type == "json":
srsly.write_json(output_file, docs_to_json(docs))
srsly.write_json(output_file, [docs_to_json(docs)])
else:
data = DocBin(docs=docs).to_bytes()
data = DocBin(docs=docs, store_user_data=True).to_bytes()
with output_file.open("wb") as file_:
file_.write(data)
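
For reference, a minimal sketch of the binary round-trip these helpers rely on (a small standalone example, not taken from this diff; `store_user_data=True` keeps `Doc.user_data`, e.g. custom extension attributes, in the serialized payload):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = [nlp("First doc."), nlp("Second doc.")]

# Serialize, as _write_docs_to_file / _print_docs_to_stdout do above.
data = DocBin(docs=docs, store_user_data=True).to_bytes()

# Deserialize and rebuild Doc objects against a vocab.
doc_bin = DocBin(store_user_data=True).from_bytes(data)
restored = list(doc_bin.get_docs(nlp.vocab))
print([doc.text for doc in restored])
```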
@@ -118,7 +118,9 @@ def debug_data(

# Create all gold data here to avoid iterating over the train_dataset constantly
gold_train_data = _compile_gold(train_dataset, pipeline, nlp, make_proj=True)
gold_train_unpreprocessed_data = _compile_gold(train_dataset, pipeline, nlp, make_proj=False)
gold_train_unpreprocessed_data = _compile_gold(
train_dataset, pipeline, nlp, make_proj=False
)
gold_dev_data = _compile_gold(dev_dataset, pipeline, nlp, make_proj=True)

train_texts = gold_train_data["texts"]
@@ -16,7 +16,7 @@ from ..util import is_package, get_base_version, run_command
def download_cli(
# fmt: off
ctx: typer.Context,
model: str = Arg(..., help="Model to download (shortcut or name)"),
model: str = Arg(..., help="Name of model to download"),
direct: bool = Opt(False, "--direct", "-d", "-D", help="Force direct download of name + version"),
# fmt: on
):
@@ -4,6 +4,7 @@ from wasabi import Printer
from pathlib import Path
import re
import srsly
from thinc.api import require_gpu, fix_random_seed

from ..gold import Corpus
from ..tokens import Doc

@@ -52,9 +53,9 @@ def evaluate(
silent: bool = True,
) -> Scorer:
msg = Printer(no_print=silent, pretty=not silent)
util.fix_random_seed()
fix_random_seed()
if gpu_id >= 0:
util.use_gpu(gpu_id)
require_gpu(gpu_id)
util.set_env_log(False)
data_path = util.ensure_path(data_path)
output_path = util.ensure_path(output)
@@ -37,7 +37,7 @@ def init_model_cli(
clusters_loc: Optional[Path] = Opt(None, "--clusters-loc", "-c", help="Optional location of brown clusters data", exists=True),
jsonl_loc: Optional[Path] = Opt(None, "--jsonl-loc", "-j", help="Location of JSONL-formatted attributes file", exists=True),
vectors_loc: Optional[Path] = Opt(None, "--vectors-loc", "-v", help="Optional vectors file in Word2Vec format", exists=True),
prune_vectors: int = Opt(-1 , "--prune-vectors", "-V", help="Optional number of vectors to prune to"),
prune_vectors: int = Opt(-1, "--prune-vectors", "-V", help="Optional number of vectors to prune to"),
truncate_vectors: int = Opt(0, "--truncate-vectors", "-t", help="Optional number of vectors to truncate to when reading in vectors file"),
vectors_name: Optional[str] = Opt(None, "--vectors-name", "-vn", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"),
model_name: Optional[str] = Opt(None, "--model-name", "-mn", help="Optional name for the model meta"),

@@ -56,6 +56,7 @@ def init_model_cli(
freqs_loc=freqs_loc,
clusters_loc=clusters_loc,
jsonl_loc=jsonl_loc,
vectors_loc=vectors_loc,
prune_vectors=prune_vectors,
truncate_vectors=truncate_vectors,
vectors_name=vectors_name,

@@ -228,7 +229,9 @@ def add_vectors(
else:
if vectors_loc:
with msg.loading(f"Reading vectors from {vectors_loc}"):
vectors_data, vector_keys = read_vectors(msg, vectors_loc)
vectors_data, vector_keys = read_vectors(
msg, vectors_loc, truncate_vectors
)
msg.good(f"Loaded vectors from {vectors_loc}")
else:
vectors_data, vector_keys = (None, None)

@@ -247,7 +250,7 @@ def add_vectors(
nlp.vocab.prune_vectors(prune_vectors)


def read_vectors(msg: Printer, vectors_loc: Path, truncate_vectors: int = 0):
def read_vectors(msg: Printer, vectors_loc: Path, truncate_vectors: int):
f = open_file(vectors_loc)
shape = tuple(int(size) for size in next(f).split())
if truncate_vectors >= 1:
@@ -5,24 +5,26 @@ import time
import re
from collections import Counter
from pathlib import Path
from thinc.api import Linear, Maxout, chain, list2array, use_pytorch_for_gpu_memory
from thinc.api import use_pytorch_for_gpu_memory, require_gpu
from thinc.api import set_dropout_rate, to_categorical, fix_random_seed
from thinc.api import CosineDistance, L2Distance
from wasabi import msg
import srsly
from functools import partial

from ._app import app, Arg, Opt
from ..errors import Errors
from ..ml.models.multi_task import build_masked_language_model
from ..ml.models.multi_task import build_cloze_multi_task_model
from ..ml.models.multi_task import build_cloze_characters_multi_task_model
from ..tokens import Doc
from ..attrs import ID, HEAD
from .. import util
from ..gold import Example


@app.command("pretrain")
def pretrain_cli(
# fmt: off
texts_loc: Path = Arg(..., help="Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the key 'tokens'", exists=True),
vectors_model: str = Arg(..., help="Name or path to spaCy model with vectors to learn from"),
output_dir: Path = Arg(..., help="Directory to write models to on each epoch"),
config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False),
use_gpu: int = Opt(-1, "--use-gpu", "-g", help="Use GPU"),
@@ -32,11 +34,15 @@ def pretrain_cli(
):
"""
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
using an approximate language-modelling objective. Specifically, we load
pretrained vectors, and train a component like a CNN, BiLSTM, etc to predict
vectors which match the pretrained ones. The weights are saved to a directory
after each epoch. You can then pass a path to one of these pretrained weights
files to the 'spacy train' command.
using an approximate language-modelling objective. Two objective types
are available, vector-based and character-based.

In the vector-based objective, we load word vectors that have been trained
using a word2vec-style distributional similarity algorithm, and train a
component like a CNN, BiLSTM, etc to predict vectors which match the
pretrained ones. The weights are saved to a directory after each epoch. You
can then pass a path to one of these pretrained weights files to the
'spacy train' command.

This technique may be especially helpful if you have little labelled data.
However, it's still quite experimental, so your mileage may vary.

@@ -47,7 +53,6 @@ def pretrain_cli(
"""
pretrain(
texts_loc,
vectors_model,
output_dir,
config_path,
use_gpu=use_gpu,
@@ -58,101 +63,55 @@ def pretrain_cli(

def pretrain(
texts_loc: Path,
vectors_model: str,
output_dir: Path,
config_path: Path,
use_gpu: int = -1,
resume_path: Optional[Path] = None,
epoch_resume: Optional[int] = None,
):
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
verify_cli_args(**locals())
if not output_dir.exists():
output_dir.mkdir()
msg.good(f"Created output directory: {output_dir}")

if use_gpu >= 0:
msg.info("Using GPU")
util.use_gpu(use_gpu)
require_gpu(use_gpu)
else:
msg.info("Using CPU")

msg.info(f"Loading config from: {config_path}")
config = util.load_config(config_path, create_objects=False)
util.fix_random_seed(config["pretraining"]["seed"])
if config["pretraining"]["use_pytorch_for_gpu_memory"]:
fix_random_seed(config["pretraining"]["seed"])
if use_gpu >= 0 and config["pretraining"]["use_pytorch_for_gpu_memory"]:
use_pytorch_for_gpu_memory()

if output_dir.exists() and [p for p in output_dir.iterdir()]:
if resume_path:
msg.warn(
"Output directory is not empty. ",
"If you're resuming a run from a previous model in this directory, "
"the old models for the consecutive epochs will be overwritten "
"with the new ones.",
)
else:
msg.warn(
"Output directory is not empty. ",
"It is better to use an empty directory or refer to a new output path, "
"then the new directory will be created for you.",
)
if not output_dir.exists():
output_dir.mkdir()
msg.good(f"Created output directory: {output_dir}")
nlp_config = config["nlp"]
srsly.write_json(output_dir / "config.json", config)
msg.good("Saved config file in the output directory")

config = util.load_config(config_path, create_objects=True)
nlp = util.load_model_from_config(nlp_config)
pretrain_config = config["pretraining"]

# Load texts from file or stdin
if texts_loc != "-":  # reading from a file
texts_loc = Path(texts_loc)
if not texts_loc.exists():
msg.fail("Input text file doesn't exist", texts_loc, exits=1)
with msg.loading("Loading input texts..."):
texts = list(srsly.read_jsonl(texts_loc))
if not texts:
msg.fail("Input file is empty", texts_loc, exits=1)
msg.good("Loaded input texts")
random.shuffle(texts)
else:  # reading from stdin
msg.info("Reading input text from stdin...")
texts = srsly.read_jsonl("-")

with msg.loading(f"Loading model '{vectors_model}'..."):
nlp = util.load_model(vectors_model)
msg.good(f"Loaded model '{vectors_model}'")
tok2vec_path = pretrain_config["tok2vec_model"]
tok2vec = config
for subpath in tok2vec_path.split("."):
tok2vec = tok2vec.get(subpath)
model = create_pretraining_model(nlp, tok2vec)
model = create_pretraining_model(nlp, tok2vec, pretrain_config)
optimizer = pretrain_config["optimizer"]

# Load in pretrained weights to resume from
if resume_path is not None:
msg.info(f"Resume training tok2vec from: {resume_path}")
with resume_path.open("rb") as file_:
weights_data = file_.read()
model.get_ref("tok2vec").from_bytes(weights_data)
# Parse the epoch number from the given weight file
model_name = re.search(r"model\d+\.bin", str(resume_path))
if model_name:
# Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
epoch_resume = int(model_name.group(0)[5:][:-4]) + 1
msg.info(f"Resuming from epoch: {epoch_resume}")
else:
if not epoch_resume:
msg.fail(
"You have to use the --epoch-resume setting when using a renamed weight file for --resume-path",
exits=True,
)
elif epoch_resume < 0:
msg.fail(
f"The argument --epoch-resume has to be greater or equal to 0. {epoch_resume} is invalid",
exits=True,
)
else:
msg.info(f"Resuming from epoch: {epoch_resume}")
_resume_model(model, resume_path, epoch_resume)
else:
# Without '--resume-path' the '--epoch-resume' argument is ignored
epoch_resume = 0
@@ -177,18 +136,18 @@ def pretrain(
file_.write(srsly.json_dumps(log) + "\n")

skip_counter = 0
loss_func = pretrain_config["loss_func"]
objective = create_objective(pretrain_config["objective"])
for epoch in range(epoch_resume, pretrain_config["max_epochs"]):
batches = util.minibatch_by_words(texts, size=pretrain_config["batch_size"])
for batch_id, batch in enumerate(batches):
docs, count = make_docs(
nlp,
[ex.doc for ex in batch],
batch,
max_length=pretrain_config["max_length"],
min_length=pretrain_config["min_length"],
)
skip_counter += count
loss = make_update(model, docs, optimizer, distance=loss_func)
loss = make_update(model, docs, optimizer, objective)
progress = tracker.update(epoch, loss, docs)
if progress:
msg.row(progress, **row_settings)
|
|||
msg.good("Successfully finished pretrain")
|
||||
|
||||
|
||||
def make_update(model, docs, optimizer, distance):
|
||||
def _resume_model(model, resume_path, epoch_resume):
|
||||
msg.info(f"Resume training tok2vec from: {resume_path}")
|
||||
with resume_path.open("rb") as file_:
|
||||
weights_data = file_.read()
|
||||
model.get_ref("tok2vec").from_bytes(weights_data)
|
||||
# Parse the epoch number from the given weight file
|
||||
model_name = re.search(r"model\d+\.bin", str(resume_path))
|
||||
if model_name:
|
||||
# Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
|
||||
epoch_resume = int(model_name.group(0)[5:][:-4]) + 1
|
||||
msg.info(f"Resuming from epoch: {epoch_resume}")
|
||||
else:
|
||||
msg.info(f"Resuming from epoch: {epoch_resume}")
|
||||
|
||||
|
||||
def make_update(model, docs, optimizer, objective_func):
|
||||
"""Perform an update over a single batch of documents.
|
||||
|
||||
docs (iterable): A batch of `Doc` objects.
|
||||
|
@ -216,7 +190,7 @@ def make_update(model, docs, optimizer, distance):
|
|||
RETURNS loss: A float for the loss.
|
||||
"""
|
||||
predictions, backprop = model.begin_update(docs)
|
||||
loss, gradients = get_vectors_loss(model.ops, docs, predictions, distance)
|
||||
loss, gradients = objective_func(model.ops, docs, predictions)
|
||||
backprop(gradients)
|
||||
model.finish_update(optimizer)
|
||||
# Don't want to return a cupy object here
|
||||
|
@@ -255,13 +229,38 @@ def make_docs(nlp, batch, min_length, max_length):
return docs, skip_count


def get_vectors_loss(ops, docs, prediction, distance):
"""Compute a mean-squared error loss between the documents' vectors and
the prediction.
def create_objective(config):
"""Create the objective for pretraining.

Note that this is ripe for customization! We could compute the vectors
in some other word, e.g. with an LSTM language model, or use some other
type of objective.
We'd like to replace this with a registry function but it's tricky because
we're also making a model choice based on this. For now we hard-code support
for two types (characters, vectors). For characters you can specify
n_characters, for vectors you can specify the loss.

Bleh.
"""
objective_type = config["type"]
if objective_type == "characters":
return partial(get_characters_loss, nr_char=config["n_characters"])
elif objective_type == "vectors":
if config["loss"] == "cosine":
return partial(
get_vectors_loss,
distance=CosineDistance(normalize=True, ignore_zeros=True),
)
elif config["loss"] == "L2":
return partial(
get_vectors_loss, distance=L2Distance(normalize=True, ignore_zeros=True)
)
else:
raise ValueError("Unexpected loss type", config["loss"])
else:
raise ValueError("Unexpected objective_type", objective_type)
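
A small sketch of how `create_objective` above is driven by the config: the dicts mirror the keys it reads (`type`, `n_characters`, `loss`); the concrete values are illustrative.

```python
# Character objective: predict leading/trailing UTF-8 bytes of each token.
char_loss = create_objective({"type": "characters", "n_characters": 4})
# -> partial(get_characters_loss, nr_char=4)

# Vector objective: predict pretrained word vectors; "loss" is "cosine" or "L2".
vec_loss = create_objective({"type": "vectors", "loss": "cosine"})
# -> partial(get_vectors_loss, distance=CosineDistance(normalize=True, ignore_zeros=True))
```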
def get_vectors_loss(ops, docs, prediction, distance):
"""Compute a loss based on a distance between the documents' vectors and
the prediction.
"""
# The simplest way to implement this would be to vstack the
# token.vector values, but that's a bit inefficient, especially on GPU.

@@ -273,7 +272,19 @@ def get_vectors_loss(ops, docs, prediction, distance):
return loss, d_target


def create_pretraining_model(nlp, tok2vec):
def get_characters_loss(ops, docs, prediction, nr_char):
"""Compute a loss based on a number of characters predicted from the docs."""
target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs])
target_ids = target_ids.reshape((-1,))
target = ops.asarray(to_categorical(target_ids, n_classes=256), dtype="f")
target = target.reshape((-1, 256 * nr_char))
diff = prediction - target
loss = (diff ** 2).sum()
d_target = diff / float(prediction.shape[0])
return loss, d_target
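
To see the shapes involved in `get_characters_loss`, here is a toy sketch with made-up byte IDs standing in for the `doc.to_utf8_array(nr_char=4)` output (two "tokens", four character slots each); it is only meant to illustrate the reshaping into one one-hot block per character slot:

```python
import numpy
from thinc.api import to_categorical

nr_char = 4
target_ids = numpy.array([[72, 105, 0, 0], [116, 104, 101, 114]], dtype="i")  # stand-in byte values
target_ids = target_ids.reshape((-1,))                           # (8,)
target = to_categorical(target_ids, n_classes=256).astype("f")   # (8, 256) one-hot rows
target = target.reshape((-1, 256 * nr_char))                     # (2, 1024): one block per token
print(target.shape)
```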
def create_pretraining_model(nlp, tok2vec, pretrain_config):
"""Define a network for the pretraining. We simply add an output layer onto
the tok2vec input model. The tok2vec input model needs to be a model that
takes a batch of Doc objects (as a list), and returns a list of arrays.

@@ -281,18 +292,24 @@ def create_pretraining_model(nlp, tok2vec):
The actual tok2vec layer is stored as a reference, and only this bit will be
serialized to file and read back in when calling the 'train' command.
"""
output_size = nlp.vocab.vectors.data.shape[1]
output_layer = chain(
Maxout(nO=300, nP=3, normalize=True, dropout=0.0), Linear(output_size)
)
model = chain(tok2vec, list2array())
model = chain(model, output_layer)
# TODO
maxout_pieces = 3
hidden_size = 300
if pretrain_config["objective"]["type"] == "vectors":
model = build_cloze_multi_task_model(
nlp.vocab, tok2vec, hidden_size=hidden_size, maxout_pieces=maxout_pieces
)
elif pretrain_config["objective"]["type"] == "characters":
model = build_cloze_characters_multi_task_model(
nlp.vocab,
tok2vec,
hidden_size=hidden_size,
maxout_pieces=maxout_pieces,
nr_char=pretrain_config["objective"]["n_characters"],
)
model.initialize(X=[nlp.make_doc("Give it a doc to infer shapes")])
mlm_model = build_masked_language_model(nlp.vocab, model)
mlm_model.set_ref("tok2vec", tok2vec)
mlm_model.set_ref("output_layer", output_layer)
mlm_model.initialize(X=[nlp.make_doc("Give it a doc to infer shapes")])
return mlm_model
set_dropout_rate(model, pretrain_config["dropout"])
return model


class ProgressTracker(object):
@@ -341,3 +358,53 @@ def _smart_round(figure, width=10, max_decimal=4):
n_decimal = min(n_decimal, max_decimal)
format_str = "%." + str(n_decimal) + "f"
return format_str % figure


def verify_cli_args(
texts_loc, output_dir, config_path, use_gpu, resume_path, epoch_resume
):
if not config_path or not config_path.exists():
msg.fail("Config file not found", config_path, exits=1)
if output_dir.exists() and [p for p in output_dir.iterdir()]:
if resume_path:
msg.warn(
"Output directory is not empty. ",
"If you're resuming a run from a previous model in this directory, "
"the old models for the consecutive epochs will be overwritten "
"with the new ones.",
)
else:
msg.warn(
"Output directory is not empty. ",
"It is better to use an empty directory or refer to a new output path, "
"then the new directory will be created for you.",
)
if texts_loc != "-":  # reading from a file
texts_loc = Path(texts_loc)
if not texts_loc.exists():
msg.fail("Input text file doesn't exist", texts_loc, exits=1)

for text in srsly.read_jsonl(texts_loc):
break
else:
msg.fail("Input file is empty", texts_loc, exits=1)

if resume_path is not None:
model_name = re.search(r"model\d+\.bin", str(resume_path))
if not model_name and not epoch_resume:
msg.fail(
"You have to use the --epoch-resume setting when using a renamed weight file for --resume-path",
exits=True,
)
elif not model_name and epoch_resume < 0:
msg.fail(
f"The argument --epoch-resume has to be greater or equal to 0. {epoch_resume} is invalid",
exits=True,
)
config = util.load_config(config_path, create_objects=False)
if config["pretraining"]["objective"]["type"] == "vectors":
if not config["nlp"]["vectors"]:
msg.fail(
"Must specify nlp.vectors if pretraining.objective.type is vectors",
exits=True,
)
@@ -31,17 +31,20 @@ def profile_cli(


def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) -> None:
try:
import ml_datasets
except ImportError:
msg.fail(
"This command requires the ml_datasets library to be installed:"
"pip install ml_datasets",
exits=1,
)

if inputs is not None:
inputs = _read_inputs(inputs, msg)
if inputs is None:
try:
import ml_datasets
except ImportError:
msg.fail(
"This command, when run without an input file, "
"requires the ml_datasets library to be installed: "
"pip install ml_datasets",
exits=1,
)

n_inputs = 25000
with msg.loading("Loading IMDB dataset via Thinc..."):
imdb_train, _ = ml_datasets.imdb()
@@ -1,6 +1,5 @@
from typing import Optional, Dict, List, Union, Sequence
from timeit import default_timer as timer

import srsly
import tqdm
from pydantic import BaseModel, FilePath

@@ -8,11 +7,11 @@ from pathlib import Path
from wasabi import msg
import thinc
import thinc.schedules
from thinc.api import Model, use_pytorch_for_gpu_memory
from thinc.api import Model, use_pytorch_for_gpu_memory, require_gpu, fix_random_seed
import random

from ._app import app, Arg, Opt
from ..gold import Corpus
from ..gold import Corpus, Example
from ..lookups import Lookups
from .. import util
from ..errors import Errors
@@ -125,7 +124,7 @@ def train_cli(
train_path: Path = Arg(..., help="Location of JSON-formatted training data", exists=True),
dev_path: Path = Arg(..., help="Location of JSON-formatted development data", exists=True),
config_path: Path = Arg(..., help="Path to config file", exists=True),
output_path: Optional[Path] = Opt(None, "--output-path", "-o", help="Output directory to store model in"),
output_path: Optional[Path] = Opt(None, "--output", "--output-path", "-o", help="Output directory to store model in"),
code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
init_tok2vec: Optional[Path] = Opt(None, "--init-tok2vec", "-t2v", help="Path to pretrained weights for the tok2vec components. See 'spacy pretrain'. Experimental."),
raw_text: Optional[Path] = Opt(None, "--raw-text", "-rt", help="Path to jsonl file with unlabelled text documents."),

@@ -156,7 +155,6 @@ def train_cli(
if init_tok2vec is not None:
with init_tok2vec.open("rb") as file_:
weights_data = file_.read()

train_args = dict(
config_path=config_path,
data_paths={"train": train_path, "dev": dev_path},

@@ -193,7 +191,7 @@ def train(
msg.info(f"Loading config from: {config_path}")
# Read the config first without creating objects, to get to the original nlp_config
config = util.load_config(config_path, create_objects=False)
util.fix_random_seed(config["training"]["seed"])
fix_random_seed(config["training"]["seed"])
if config["training"].get("use_pytorch_for_gpu_memory"):
# It feels kind of weird to not have a default for this.
use_pytorch_for_gpu_memory()

@@ -216,11 +214,11 @@ def train(
nlp.resume_training()
else:
msg.info(f"Initializing the nlp pipeline: {nlp.pipe_names}")
train_examples = list(corpus.train_dataset(
nlp,
shuffle=False,
gold_preproc=training["gold_preproc"]
))
train_examples = list(
corpus.train_dataset(
nlp, shuffle=False, gold_preproc=training["gold_preproc"]
)
)
nlp.begin_training(lambda: train_examples)

# Update tag map with provided mapping
@@ -307,12 +305,14 @@ def train(

def create_train_batches(nlp, corpus, cfg, randomization_index):
max_epochs = cfg.get("max_epochs", 0)
train_examples = list(corpus.train_dataset(
nlp,
shuffle=True,
gold_preproc=cfg["gold_preproc"],
max_length=cfg["max_length"]
))
train_examples = list(
corpus.train_dataset(
nlp,
shuffle=True,
gold_preproc=cfg["gold_preproc"],
max_length=cfg["max_length"],
)
)

epoch = 0
while True:

@@ -440,9 +440,8 @@ def train_while_improving(

if raw_text:
random.shuffle(raw_text)
raw_batches = util.minibatch(
(nlp.make_doc(rt["text"]) for rt in raw_text), size=8
)
raw_examples = [Example.from_dict(nlp.make_doc(rt["text"]), {}) for rt in raw_text]
raw_batches = util.minibatch(raw_examples, size=8)

for step, (epoch, batch) in enumerate(train_data):
dropout = next(dropouts)

@@ -539,7 +538,10 @@ def setup_printer(training, nlp):
)
)
data = (
[info["epoch"], info["step"]] + losses + scores + ["{0:.2f}".format(float(info["score"]))]
[info["epoch"], info["step"]]
+ losses
+ scores
+ ["{0:.2f}".format(float(info["score"]))]
)
msg.row(data, widths=table_widths, aligns=table_aligns)
@ -16,16 +16,6 @@ def add_codes(err_cls):
|
|||
|
||||
@add_codes
|
||||
class Warnings(object):
|
||||
W001 = ("As of spaCy v2.0, the keyword argument `path=` is deprecated. "
|
||||
"You can now call spacy.load with the path as its first argument, "
|
||||
"and the model's meta.json will be used to determine the language "
|
||||
"to load. For example:\nnlp = spacy.load('{path}')")
|
||||
W002 = ("Tokenizer.from_list is now deprecated. Create a new Doc object "
|
||||
"instead and pass in the strings as the `words` keyword argument, "
|
||||
"for example:\nfrom spacy.tokens import Doc\n"
|
||||
"doc = Doc(nlp.vocab, words=[...])")
|
||||
W003 = ("Positional arguments to Doc.merge are deprecated. Instead, use "
|
||||
"the keyword arguments, for example tag=, lemma= or ent_type=.")
|
||||
W004 = ("No text fixing enabled. Run `pip install ftfy` to enable fixing "
|
||||
"using ftfy.fix_text if necessary.")
|
||||
W005 = ("Doc object not parsed. This means displaCy won't be able to "
|
||||
|
@ -45,12 +35,6 @@ class Warnings(object):
|
|||
"use context-sensitive tensors. You can always add your own word "
|
||||
"vectors, or use one of the larger models instead if available.")
|
||||
W008 = ("Evaluating {obj}.similarity based on empty vectors.")
|
||||
W009 = ("Custom factory '{name}' provided by entry points of another "
|
||||
"package overwrites built-in factory.")
|
||||
W010 = ("As of v2.1.0, the PhraseMatcher doesn't have a phrase length "
|
||||
"limit anymore, so the max_length argument is now deprecated. "
|
||||
"If you did not specify this parameter, make sure you call the "
|
||||
"constructor with named arguments instead of positional ones.")
|
||||
W011 = ("It looks like you're calling displacy.serve from within a "
|
||||
"Jupyter notebook or a similar environment. This likely means "
|
||||
"you're already running a local web server, so there's no need to "
|
||||
|
@ -64,23 +48,9 @@ class Warnings(object):
|
|||
"components are applied. To only create tokenized Doc objects, "
|
||||
"try using `nlp.make_doc(text)` or process all texts as a stream "
|
||||
"using `list(nlp.tokenizer.pipe(all_texts))`.")
|
||||
W013 = ("As of v2.1.0, {obj}.merge is deprecated. Please use the more "
|
||||
"efficient and less error-prone Doc.retokenize context manager "
|
||||
"instead.")
|
||||
W014 = ("As of v2.1.0, the `disable` keyword argument on the serialization "
|
||||
"methods is and should be replaced with `exclude`. This makes it "
|
||||
"consistent with the other serializable objects.")
|
||||
W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
|
||||
"being serialized or deserialized is deprecated. Please use the "
|
||||
"`exclude` argument instead. For example: exclude=['{arg}'].")
|
||||
W016 = ("The keyword argument `n_threads` is now deprecated. As of v2.2.2, "
|
||||
"the argument `n_process` controls parallel inference via "
|
||||
"multiprocessing.")
|
||||
W017 = ("Alias '{alias}' already exists in the Knowledge Base.")
|
||||
W018 = ("Entity '{entity}' already exists in the Knowledge Base - "
|
||||
"ignoring the duplicate entry.")
|
||||
W019 = ("Changing vectors name from {old} to {new}, to avoid clash with "
|
||||
"previously loaded vectors. See Issue #3853.")
|
||||
W020 = ("Unnamed vectors. This won't allow multiple vectors models to be "
|
||||
"loaded. (Shape: {shape})")
|
||||
W021 = ("Unexpected hash collision in PhraseMatcher. Matches may be "
|
||||
|
@ -91,8 +61,6 @@ class Warnings(object):
|
|||
"or the language you're using doesn't have lemmatization data, "
|
||||
"you can ignore this warning. If this is surprising, make sure you "
|
||||
"have the spacy-lookups-data package installed.")
|
||||
W023 = ("Multiprocessing of Language.pipe is not supported in Python 2. "
|
||||
"'n_process' will be set to 1.")
|
||||
W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in "
|
||||
"the Knowledge Base.")
|
||||
W025 = ("'{name}' requires '{attr}' to be assigned, but none of the "
|
||||
|
@ -101,28 +69,11 @@ class Warnings(object):
|
|||
W027 = ("Found a large training file of {size} bytes. Note that it may "
|
||||
"be more efficient to split your training data into multiple "
|
||||
"smaller JSON files instead.")
|
||||
W028 = ("Doc.from_array was called with a vector of type '{type}', "
|
||||
"but is expecting one of type 'uint64' instead. This may result "
|
||||
"in problems with the vocab further on in the pipeline.")
|
||||
W029 = ("Unable to align tokens with entities from character offsets. "
|
||||
"Discarding entity annotation for the text: {text}.")
|
||||
W030 = ("Some entities could not be aligned in the text \"{text}\" with "
|
||||
"entities \"{entities}\". Use "
|
||||
"`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`"
|
||||
" to check the alignment. Misaligned entities ('-') will be "
|
||||
"ignored during training.")
|
||||
W031 = ("Model '{model}' ({model_version}) requires spaCy {version} and "
|
||||
"is incompatible with the current spaCy version ({current}). This "
|
||||
"may lead to unexpected results or runtime errors. To resolve "
|
||||
"this, download a newer compatible model or retrain your custom "
|
||||
"model with the current spaCy version. For more details and "
|
||||
"available updates, run: python -m spacy validate")
|
||||
W032 = ("Unable to determine model compatibility for model '{model}' "
|
||||
"({model_version}) with the current spaCy version ({current}). "
|
||||
"This may lead to unexpected results or runtime errors. To resolve "
|
||||
"this, download a newer compatible model or retrain your custom "
|
||||
"model with the current spaCy version. For more details and "
|
||||
"available updates, run: python -m spacy validate")
|
||||
W033 = ("Training a new {model} using a model with no lexeme normalization "
|
||||
"table. This may degrade the performance of the model to some "
|
||||
"degree. If this is intentional or the language you're using "
|
||||
|
@ -159,6 +110,8 @@ class Warnings(object):
|
|||
W100 = ("Skipping unsupported morphological feature(s): '{feature}'. "
|
||||
"Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
|
||||
"string \"Field1=Value1,Value2|Field2=Value3\".")
|
||||
W101 = ("Skipping `Doc` custom extension '{name}' while merging docs.")
|
||||
W102 = ("Skipping unsupported user data '{key}: {value}' while merging docs.")
|
||||
|
||||
|
||||
@add_codes
|
||||
|
@ -234,9 +187,6 @@ class Errors(object):
|
|||
"the HEAD attribute would potentially override the sentence "
|
||||
"boundaries set by SENT_START.")
|
||||
E033 = ("Cannot load into non-empty Doc of length {length}.")
|
||||
E034 = ("Doc.merge received {n_args} non-keyword arguments. Expected "
|
||||
"either 3 arguments (deprecated), or 0 (use keyword arguments).\n"
|
||||
"Arguments supplied:\n{args}\nKeyword arguments:{kwargs}")
|
||||
E035 = ("Error creating span with start {start} and end {end} for Doc of "
|
||||
"length {length}.")
|
||||
E036 = ("Error calculating span: Can't find a token starting at character "
|
||||
|
@ -345,14 +295,9 @@ class Errors(object):
|
|||
E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
|
||||
"token can only be part of one entity, so make sure the entities "
|
||||
"you're setting don't overlap.")
|
||||
E105 = ("The Doc.print_tree() method is now deprecated. Please use "
|
||||
"Doc.to_json() instead or write your own function.")
|
||||
E106 = ("Can't find doc._.{attr} attribute specified in the underscore "
|
||||
"settings: {opts}")
|
||||
E107 = ("Value of doc._.{attr} is not JSON-serializable: {value}")
|
||||
E108 = ("As of spaCy v2.1, the pipe name `sbd` has been deprecated "
|
||||
"in favor of the pipe name `sentencizer`, which does the same "
|
||||
"thing. For example, use `nlp.create_pipeline('sentencizer')`")
|
||||
E109 = ("Component '{name}' could not be run. Did you forget to "
|
||||
"call begin_training()?")
|
||||
E110 = ("Invalid displaCy render wrapper. Expected callable, got: {obj}")
|
||||
|
@ -392,10 +337,6 @@ class Errors(object):
|
|||
E125 = ("Unexpected value: {value}")
|
||||
E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. "
|
||||
"This is likely a bug in spaCy, so feel free to open an issue.")
|
||||
E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword "
|
||||
"arguments to exclude fields from being serialized or deserialized "
|
||||
"is now deprecated. Please use the `exclude` argument instead. "
|
||||
"For example: exclude=['{arg}'].")
|
||||
E129 = ("Cannot write the label of an existing Span object because a Span "
|
||||
"is a read-only view of the underlying Token objects stored in the "
|
||||
"Doc. Instead, create a new Span object and specify the `label` "
|
||||
|
@ -487,9 +428,6 @@ class Errors(object):
|
|||
E172 = ("The Lemmatizer.load classmethod is deprecated. To create a "
|
||||
"Lemmatizer, initialize the class directly. See the docs for "
|
||||
"details: https://spacy.io/api/lemmatizer")
|
||||
E173 = ("As of v2.2, the Lemmatizer is initialized with an instance of "
|
||||
"Lookups containing the lemmatization tables. See the docs for "
|
||||
"details: https://spacy.io/api/lemmatizer#init")
|
||||
E175 = ("Can't remove rule for unknown match pattern ID: {key}")
|
||||
E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
|
||||
E177 = ("Ill-formed IOB input detected: {tag}")
|
||||
|
@ -545,19 +483,19 @@ class Errors(object):
|
|||
E972 = ("Example.__init__ got None for '{arg}'. Requires Doc.")
|
||||
E973 = ("Unexpected type for NER data")
|
||||
E974 = ("Unknown {obj} attribute: {key}")
|
||||
E975 = ("The method Example.from_dict expects a Doc as first argument, "
|
||||
E975 = ("The method 'Example.from_dict' expects a Doc as first argument, "
|
||||
"but got {type}")
|
||||
E976 = ("The method Example.from_dict expects a dict as second argument, "
|
||||
E976 = ("The method 'Example.from_dict' expects a dict as second argument, "
|
||||
"but received None.")
|
||||
E977 = ("Can not compare a MorphAnalysis with a string object. "
|
||||
"This is likely a bug in spaCy, so feel free to open an issue.")
|
||||
E978 = ("The {method} method of component {name} takes a list of Example objects, "
|
||||
E978 = ("The '{method}' method of {name} takes a list of Example objects, "
|
||||
"but found {types} instead.")
|
||||
E979 = ("Cannot convert {type} to an Example object.")
|
||||
E980 = ("Each link annotation should refer to a dictionary with at most one "
|
||||
"identifier mapping to 1.0, and all others to 0.0.")
|
||||
E981 = ("The offsets of the annotations for 'links' need to refer exactly "
|
||||
"to the offsets of the 'entities' annotations.")
|
||||
E981 = ("The offsets of the annotations for 'links' could not be aligned "
|
||||
"to token boundaries.")
|
||||
E982 = ("The 'ent_iob' attribute of a Token should be an integer indexing "
|
||||
"into {values}, but found {value}.")
|
||||
E983 = ("Invalid key for '{dict}': {key}. Available keys: "
|
||||
|
@ -593,6 +531,8 @@ class Errors(object):
|
|||
E997 = ("Tokenizer special cases are not allowed to modify the text. "
|
||||
"This would map '{chunk}' to '{orth}' given token attributes "
|
||||
"'{token_attrs}'.")
|
||||
E999 = ("Unable to merge the `Doc` objects because they do not all share "
|
||||
"the same `Vocab`.")
|
||||
|
||||
|
||||
@add_codes
|
||||
|

@ -1,6 +1,6 @@
from .corpus import Corpus
from .example import Example
from .align import align
from .align import Alignment

from .iob_utils import iob_to_biluo, biluo_to_iob
from .iob_utils import biluo_tags_from_offsets, offsets_from_biluo_tags

@ -1,8 +0,0 @@
cdef class Alignment:
    cdef public object cost
    cdef public object i2j
    cdef public object j2i
    cdef public object i2j_multi
    cdef public object j2i_multi
    cdef public object cand_to_gold
    cdef public object gold_to_cand

30
spacy/gold/align.py
Normal file
@ -0,0 +1,30 @@
from typing import List
import numpy
from thinc.types import Ragged
from dataclasses import dataclass
import tokenizations


@dataclass
class Alignment:
    x2y: Ragged
    y2x: Ragged

    @classmethod
    def from_indices(cls, x2y: List[List[int]], y2x: List[List[int]]) -> "Alignment":
        x2y = _make_ragged(x2y)
        y2x = _make_ragged(y2x)
        return Alignment(x2y=x2y, y2x=y2x)

    @classmethod
    def from_strings(cls, A: List[str], B: List[str]) -> "Alignment":
        x2y, y2x = tokenizations.get_alignments(A, B)
        return Alignment.from_indices(x2y=x2y, y2x=y2x)


def _make_ragged(indices):
    lengths = numpy.array([len(x) for x in indices], dtype="i")
    flat = []
    for x in indices:
        flat.extend(x)
    return Ragged(numpy.array(flat, dtype="i"), lengths)
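For orientation, a minimal usage sketch of the new Alignment class (not part of the diff), assuming the pytokenizations package providing tokenizations.get_alignments is installed and the module lives at spacy.gold.align as shown above:

    from spacy.gold.align import Alignment

    spacy_tokens = ["obama", "'s", "podcasts", "."]
    gold_tokens = ["obama", "'", "s", "podcasts", "."]
    align = Alignment.from_strings(spacy_tokens, gold_tokens)
    # x2y is a Ragged array: row i holds the gold indices aligned to candidate token i
    for i, word in enumerate(spacy_tokens):
        print(word, "->", align.x2y[i].dataXd.ravel().tolist())
    # expected output along the lines of: "'s" -> [1, 2], i.e. one candidate
    # token covering two gold tokens
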
@ -1,101 +0,0 @@
|
|||
import numpy
|
||||
from ..errors import Errors, AlignmentError
|
||||
|
||||
|
||||
cdef class Alignment:
|
||||
def __init__(self, spacy_words, gold_words):
|
||||
# Do many-to-one alignment for misaligned tokens.
|
||||
# If we over-segment, we'll have one gold word that covers a sequence
|
||||
# of predicted words
|
||||
# If we under-segment, we'll have one predicted word that covers a
|
||||
# sequence of gold words.
|
||||
# If we "mis-segment", we'll have a sequence of predicted words covering
|
||||
# a sequence of gold words. That's many-to-many -- we don't do that
|
||||
# except for NER spans where the start and end can be aligned.
|
||||
cost, i2j, j2i, i2j_multi, j2i_multi = align(spacy_words, gold_words)
|
||||
self.cost = cost
|
||||
self.i2j = i2j
|
||||
self.j2i = j2i
|
||||
self.i2j_multi = i2j_multi
|
||||
self.j2i_multi = j2i_multi
|
||||
self.cand_to_gold = [(j if j >= 0 else None) for j in i2j]
|
||||
self.gold_to_cand = [(i if i >= 0 else None) for i in j2i]
|
||||
|
||||
|
||||
def align(tokens_a, tokens_b):
|
||||
"""Calculate alignment tables between two tokenizations.
|
||||
|
||||
tokens_a (List[str]): The candidate tokenization.
|
||||
tokens_b (List[str]): The reference tokenization.
|
||||
RETURNS: (tuple): A 5-tuple consisting of the following information:
|
||||
* cost (int): The number of misaligned tokens.
|
||||
* a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`.
|
||||
For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns
|
||||
to `tokens_b[6]`. If there's no one-to-one alignment for a token,
|
||||
it has the value -1.
|
||||
* b2a (List[int]): The same as `a2b`, but mapping the other direction.
|
||||
* a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a`
|
||||
to indices in `tokens_b`, where multiple tokens of `tokens_a` align to
|
||||
the same token of `tokens_b`.
|
||||
* b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
|
||||
direction.
|
||||
"""
|
||||
tokens_a = _normalize_for_alignment(tokens_a)
|
||||
tokens_b = _normalize_for_alignment(tokens_b)
|
||||
cost = 0
|
||||
a2b = numpy.empty(len(tokens_a), dtype="i")
|
||||
b2a = numpy.empty(len(tokens_b), dtype="i")
|
||||
a2b.fill(-1)
|
||||
b2a.fill(-1)
|
||||
a2b_multi = {}
|
||||
b2a_multi = {}
|
||||
i = 0
|
||||
j = 0
|
||||
offset_a = 0
|
||||
offset_b = 0
|
||||
while i < len(tokens_a) and j < len(tokens_b):
|
||||
a = tokens_a[i][offset_a:]
|
||||
b = tokens_b[j][offset_b:]
|
||||
if a == b:
|
||||
if offset_a == offset_b == 0:
|
||||
a2b[i] = j
|
||||
b2a[j] = i
|
||||
elif offset_a == 0:
|
||||
cost += 2
|
||||
a2b_multi[i] = j
|
||||
elif offset_b == 0:
|
||||
cost += 2
|
||||
b2a_multi[j] = i
|
||||
offset_a = offset_b = 0
|
||||
i += 1
|
||||
j += 1
|
||||
elif a == "":
|
||||
assert offset_a == 0
|
||||
cost += 1
|
||||
i += 1
|
||||
elif b == "":
|
||||
assert offset_b == 0
|
||||
cost += 1
|
||||
j += 1
|
||||
elif b.startswith(a):
|
||||
cost += 1
|
||||
if offset_a == 0:
|
||||
a2b_multi[i] = j
|
||||
i += 1
|
||||
offset_a = 0
|
||||
offset_b += len(a)
|
||||
elif a.startswith(b):
|
||||
cost += 1
|
||||
if offset_b == 0:
|
||||
b2a_multi[j] = i
|
||||
j += 1
|
||||
offset_b = 0
|
||||
offset_a += len(b)
|
||||
else:
|
||||
assert "".join(tokens_a) != "".join(tokens_b)
|
||||
raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b))
|
||||
return cost, a2b, b2a, a2b_multi, b2a_multi
|
||||
|
||||
|
||||
def _normalize_for_alignment(tokens):
|
||||
return [w.replace(" ", "").lower() for w in tokens]
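
For contrast with the new Ragged-based Alignment, a small worked example of the removed align() helper (values derived from the code above; "New York" is over-segmented relative to "NewYork"):

    cost, a2b, b2a, a2b_multi, b2a_multi = align(["New", "York"], ["NewYork"])
    # a2b == [-1, -1] and b2a == [-1]: no one-to-one alignment exists
    # a2b_multi == {0: 0, 1: 0}: both candidate tokens fold into gold token 0
    # b2a_multi == {}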

@ -1,6 +1,4 @@
from .iob2docs import iob2docs  # noqa: F401
from .conll_ner2docs import conll_ner2docs  # noqa: F401
from .json2docs import json2docs

# TODO: Update this one
# from .conllu2docs import conllu2docs  # noqa: F401
from .conllu2docs import conllu2docs  # noqa: F401
|
@ -4,11 +4,11 @@ from .conll_ner2docs import n_sents_info
|
|||
from ...gold import Example
|
||||
from ...gold import iob_to_biluo, spans_from_biluo_tags
|
||||
from ...language import Language
|
||||
from ...tokens import Doc, Token
|
||||
from ...tokens import Doc, Token, Span
|
||||
from wasabi import Printer
|
||||
|
||||
|
||||
def conllu2json(
|
||||
def conllu2docs(
|
||||
input_data,
|
||||
n_sents=10,
|
||||
append_morphology=False,
|
||||
|
@ -28,34 +28,22 @@ def conllu2json(
|
|||
MISC_NER_PATTERN = "^((?:name|NE)=)?([BILU])-([A-Z_]+)|O$"
|
||||
msg = Printer(no_print=no_print)
|
||||
n_sents_info(msg, n_sents)
|
||||
docs = []
|
||||
raw = ""
|
||||
sentences = []
|
||||
conll_data = read_conllx(
|
||||
sent_docs = read_conllx(
|
||||
input_data,
|
||||
append_morphology=append_morphology,
|
||||
ner_tag_pattern=MISC_NER_PATTERN,
|
||||
ner_map=ner_map,
|
||||
merge_subtokens=merge_subtokens,
|
||||
)
|
||||
has_ner_tags = has_ner(input_data, MISC_NER_PATTERN)
|
||||
for i, example in enumerate(conll_data):
|
||||
raw += example.text
|
||||
sentences.append(
|
||||
generate_sentence(
|
||||
example.to_dict(), has_ner_tags, MISC_NER_PATTERN, ner_map=ner_map,
|
||||
)
|
||||
)
|
||||
# Real-sized documents could be extracted using the comments on the
|
||||
# conllu document
|
||||
if len(sentences) % n_sents == 0:
|
||||
doc = create_json_doc(raw, sentences, i)
|
||||
docs.append(doc)
|
||||
raw = ""
|
||||
sentences = []
|
||||
if sentences:
|
||||
doc = create_json_doc(raw, sentences, i)
|
||||
docs.append(doc)
|
||||
docs = []
|
||||
sent_docs_to_merge = []
|
||||
for sent_doc in sent_docs:
|
||||
sent_docs_to_merge.append(sent_doc)
|
||||
if len(sent_docs_to_merge) % n_sents == 0:
|
||||
docs.append(Doc.from_docs(sent_docs_to_merge))
|
||||
sent_docs_to_merge = []
|
||||
if sent_docs_to_merge:
|
||||
docs.append(Doc.from_docs(sent_docs_to_merge))
|
||||
return docs
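A hedged usage sketch of the converter above (file name hypothetical; the remaining keyword arguments are assumed to keep their defaults):

    with open("train.conllu", encoding="utf8") as f:
        conllu_text = f.read()
    docs = conllu2docs(conllu_text, n_sents=10)
    # each returned Doc bundles up to n_sents sentence Docs, merged via Doc.from_docs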
|
||||
|
||||
|
||||
|
@ -84,14 +72,14 @@ def read_conllx(
|
|||
ner_tag_pattern="",
|
||||
ner_map=None,
|
||||
):
|
||||
""" Yield examples, one for each sentence """
|
||||
""" Yield docs, one for each sentence """
|
||||
vocab = Language.Defaults.create_vocab() # need vocab to make a minimal Doc
|
||||
for sent in input_data.strip().split("\n\n"):
|
||||
lines = sent.strip().split("\n")
|
||||
if lines:
|
||||
while lines[0].startswith("#"):
|
||||
lines.pop(0)
|
||||
example = example_from_conllu_sentence(
|
||||
doc = doc_from_conllu_sentence(
|
||||
vocab,
|
||||
lines,
|
||||
ner_tag_pattern,
|
||||
|
@ -99,7 +87,7 @@ def read_conllx(
|
|||
append_morphology=append_morphology,
|
||||
ner_map=ner_map,
|
||||
)
|
||||
yield example
|
||||
yield doc
|
||||
|
||||
|
||||
def get_entities(lines, tag_pattern, ner_map=None):
|
||||
|
@ -141,39 +129,7 @@ def get_entities(lines, tag_pattern, ner_map=None):
|
|||
return iob_to_biluo(iob)
|
||||
|
||||
|
||||
def generate_sentence(example_dict, has_ner_tags, tag_pattern, ner_map=None):
|
||||
sentence = {}
|
||||
tokens = []
|
||||
token_annotation = example_dict["token_annotation"]
|
||||
for i, id_ in enumerate(token_annotation["ids"]):
|
||||
token = {}
|
||||
token["id"] = id_
|
||||
token["orth"] = token_annotation["words"][i]
|
||||
token["tag"] = token_annotation["tags"][i]
|
||||
token["pos"] = token_annotation["pos"][i]
|
||||
token["lemma"] = token_annotation["lemmas"][i]
|
||||
token["morph"] = token_annotation["morphs"][i]
|
||||
token["head"] = token_annotation["heads"][i] - i
|
||||
token["dep"] = token_annotation["deps"][i]
|
||||
if has_ner_tags:
|
||||
token["ner"] = example_dict["doc_annotation"]["entities"][i]
|
||||
tokens.append(token)
|
||||
sentence["tokens"] = tokens
|
||||
return sentence
|
||||
|
||||
|
||||
def create_json_doc(raw, sentences, id_):
|
||||
doc = {}
|
||||
paragraph = {}
|
||||
doc["id"] = id_
|
||||
doc["paragraphs"] = []
|
||||
paragraph["raw"] = raw.strip()
|
||||
paragraph["sentences"] = sentences
|
||||
doc["paragraphs"].append(paragraph)
|
||||
return doc
|
||||
|
||||
|
||||
def example_from_conllu_sentence(
|
||||
def doc_from_conllu_sentence(
|
||||
vocab,
|
||||
lines,
|
||||
ner_tag_pattern,
|
||||
|
@ -263,8 +219,9 @@ def example_from_conllu_sentence(
|
|||
if merge_subtokens:
|
||||
doc = merge_conllu_subtokens(lines, doc)
|
||||
|
||||
# create Example from custom Doc annotation
|
||||
words, spaces, tags, morphs, lemmas = [], [], [], [], []
|
||||
# create final Doc from custom Doc annotation
|
||||
words, spaces, tags, morphs, lemmas, poses = [], [], [], [], [], []
|
||||
heads, deps = [], []
|
||||
for i, t in enumerate(doc):
|
||||
words.append(t._.merged_orth)
|
||||
lemmas.append(t._.merged_lemma)
|
||||
|
@ -274,16 +231,23 @@ def example_from_conllu_sentence(
|
|||
tags.append(t.tag_ + "__" + t._.merged_morph)
|
||||
else:
|
||||
tags.append(t.tag_)
|
||||
poses.append(t.pos_)
|
||||
heads.append(t.head.i)
|
||||
deps.append(t.dep_)
|
||||
|
||||
doc_x = Doc(vocab, words=words, spaces=spaces)
|
||||
ref_dict = Example(doc_x, reference=doc).to_dict()
|
||||
ref_dict["words"] = words
|
||||
ref_dict["lemmas"] = lemmas
|
||||
ref_dict["spaces"] = spaces
|
||||
ref_dict["tags"] = tags
|
||||
ref_dict["morphs"] = morphs
|
||||
example = Example.from_dict(doc_x, ref_dict)
|
||||
return example
|
||||
for i in range(len(doc)):
|
||||
doc_x[i].tag_ = tags[i]
|
||||
doc_x[i].morph_ = morphs[i]
|
||||
doc_x[i].lemma_ = lemmas[i]
|
||||
doc_x[i].pos_ = poses[i]
|
||||
doc_x[i].dep_ = deps[i]
|
||||
doc_x[i].head = doc_x[heads[i]]
|
||||
doc_x.ents = [Span(doc_x, ent.start, ent.end, label=ent.label) for ent in doc.ents]
|
||||
doc_x.is_parsed = True
|
||||
doc_x.is_tagged = True
|
||||
|
||||
return doc_x
|
||||
|
||||
|
||||
def merge_conllu_subtokens(lines, doc):
|
|
@ -59,6 +59,6 @@ def read_iob(raw_sents, vocab, n_sents):
|
|||
doc[i].is_sent_start = sent_start
|
||||
biluo = iob_to_biluo(iob)
|
||||
entities = tags_to_entities(biluo)
|
||||
doc.ents = [Span(doc, start=s, end=e+1, label=L) for (L, s, e) in entities]
|
||||
doc.ents = [Span(doc, start=s, end=e + 1, label=L) for (L, s, e) in entities]
|
||||
docs.append(doc)
|
||||
return docs
|
||||
|
|

@ -17,8 +17,6 @@ def json2docs(input_data, model=None, **kwargs):
        for json_para in json_to_annotations(json_doc):
            example_dict = _fix_legacy_dict_data(json_para)
            tok_dict, doc_dict = _parse_example_dict_data(example_dict)
            if json_para.get("raw"):
                assert tok_dict.get("SPACY")
            doc = annotations2doc(nlp.vocab, tok_dict, doc_dict)
            docs.append(doc)
    return docs

@ -8,7 +8,7 @@ class Corpus:
    """An annotated corpus, reading train and dev datasets from
    the DocBin (.spacy) format.

    DOCS: https://spacy.io/api/goldcorpus
    DOCS: https://spacy.io/api/corpus
    """

    def __init__(self, train_loc, dev_loc, limit=0):
@ -43,24 +43,31 @@ class Corpus:
|
|||
locs.append(path)
|
||||
return locs
|
||||
|
||||
def _make_example(self, nlp, reference, gold_preproc):
|
||||
if gold_preproc or reference.has_unknown_spaces:
|
||||
return Example(
|
||||
Doc(
|
||||
nlp.vocab,
|
||||
words=[word.text for word in reference],
|
||||
spaces=[bool(word.whitespace_) for word in reference],
|
||||
),
|
||||
reference,
|
||||
)
|
||||
else:
|
||||
return Example(nlp.make_doc(reference.text), reference)
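In other words, _make_example either copies the reference tokenization (gold_preproc or unknown spaces) or re-tokenizes the raw text. A minimal sketch of the resulting pairing, with nlp and an annotated reference Doc assumed to exist:

    ref = reference_doc                        # Doc carrying the gold annotations (assumed)
    eg = Example(nlp.make_doc(ref.text), ref)
    assert eg.x.text == eg.y.text              # predicted (x) and reference (y) share the text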
|
||||
|
||||
def make_examples(self, nlp, reference_docs, max_length=0):
|
||||
for reference in reference_docs:
|
||||
if len(reference) == 0:
|
||||
continue
|
||||
elif max_length == 0 or len(reference) < max_length:
|
||||
yield Example(
|
||||
nlp.make_doc(reference.text),
|
||||
reference
|
||||
)
|
||||
yield self._make_example(nlp, reference, False)
|
||||
elif reference.is_sentenced:
|
||||
for ref_sent in reference.sents:
|
||||
if len(ref_sent) == 0:
|
||||
continue
|
||||
elif max_length == 0 or len(ref_sent) < max_length:
|
||||
yield Example(
|
||||
nlp.make_doc(ref_sent.text),
|
||||
ref_sent.as_doc()
|
||||
)
|
||||
yield self._make_example(nlp, ref_sent.as_doc(), False)
|
||||
|
||||
def make_examples_gold_preproc(self, nlp, reference_docs):
|
||||
for reference in reference_docs:
|
||||
|
@ -69,14 +76,7 @@ class Corpus:
|
|||
else:
|
||||
ref_sents = [reference]
|
||||
for ref_sent in ref_sents:
|
||||
eg = Example(
|
||||
Doc(
|
||||
nlp.vocab,
|
||||
words=[w.text for w in ref_sent],
|
||||
spaces=[bool(w.whitespace_) for w in ref_sent]
|
||||
),
|
||||
ref_sent
|
||||
)
|
||||
eg = self._make_example(nlp, ref_sent, True)
|
||||
if len(eg.x):
|
||||
yield eg
|
||||
|
||||
|
@ -107,8 +107,9 @@ class Corpus:
|
|||
i += 1
|
||||
return n
|
||||
|
||||
def train_dataset(self, nlp, *, shuffle=True, gold_preproc=False,
|
||||
max_length=0, **kwargs):
|
||||
def train_dataset(
|
||||
self, nlp, *, shuffle=True, gold_preproc=False, max_length=0, **kwargs
|
||||
):
|
||||
ref_docs = self.read_docbin(nlp.vocab, self.walk_corpus(self.train_loc))
|
||||
if gold_preproc:
|
||||
examples = self.make_examples_gold_preproc(nlp, ref_docs)
|
||||
|
|

@ -1,8 +1,7 @@
from ..tokens.doc cimport Doc
from .align cimport Alignment


cdef class Example:
    cdef readonly Doc x
    cdef readonly Doc y
    cdef readonly Alignment _alignment
    cdef readonly object _alignment
|
@ -6,16 +6,15 @@ from ..tokens.doc cimport Doc
|
|||
from ..tokens.span cimport Span
|
||||
from ..tokens.span import Span
|
||||
from ..attrs import IDS
|
||||
from .align cimport Alignment
|
||||
from .align import Alignment
|
||||
from .iob_utils import biluo_to_iob, biluo_tags_from_offsets, biluo_tags_from_doc
|
||||
from .iob_utils import spans_from_biluo_tags
|
||||
from .align import Alignment
|
||||
from ..errors import Errors, Warnings
|
||||
from ..syntax import nonproj
|
||||
|
||||
|
||||
cpdef Doc annotations2doc(vocab, tok_annot, doc_annot):
|
||||
""" Create a Doc from dictionaries with token and doc annotations. Assumes ORTH & SPACY are set. """
|
||||
""" Create a Doc from dictionaries with token and doc annotations. """
|
||||
attrs, array = _annot2array(vocab, tok_annot, doc_annot)
|
||||
output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"])
|
||||
if "entities" in doc_annot:
|
||||
|
@ -28,7 +27,7 @@ cpdef Doc annotations2doc(vocab, tok_annot, doc_annot):
|
|||
|
||||
|
||||
cdef class Example:
|
||||
def __init__(self, Doc predicted, Doc reference, *, Alignment alignment=None):
|
||||
def __init__(self, Doc predicted, Doc reference, *, alignment=None):
|
||||
""" Doc can either be text, or an actual Doc """
|
||||
if predicted is None:
|
||||
raise TypeError(Errors.E972.format(arg="predicted"))
|
||||
|
@ -83,34 +82,38 @@ cdef class Example:
|
|||
gold_words = [token.orth_ for token in self.reference]
|
||||
if gold_words == []:
|
||||
gold_words = spacy_words
|
||||
self._alignment = Alignment(spacy_words, gold_words)
|
||||
self._alignment = Alignment.from_strings(spacy_words, gold_words)
|
||||
return self._alignment
|
||||
|
||||
def get_aligned(self, field, as_string=False):
|
||||
"""Return an aligned array for a token attribute."""
|
||||
i2j_multi = self.alignment.i2j_multi
|
||||
cand_to_gold = self.alignment.cand_to_gold
|
||||
align = self.alignment.x2y
|
||||
|
||||
vocab = self.reference.vocab
|
||||
gold_values = self.reference.to_array([field])
|
||||
output = [None] * len(self.predicted)
|
||||
for i, gold_i in enumerate(cand_to_gold):
|
||||
if self.predicted[i].text.isspace():
|
||||
output[i] = None
|
||||
if gold_i is None:
|
||||
if i in i2j_multi:
|
||||
output[i] = gold_values[i2j_multi[i]]
|
||||
else:
|
||||
output[i] = None
|
||||
for token in self.predicted:
|
||||
if token.is_space:
|
||||
output[token.i] = None
|
||||
else:
|
||||
output[i] = gold_values[gold_i]
|
||||
values = gold_values[align[token.i].dataXd]
|
||||
values = values.ravel()
|
||||
if len(values) == 0:
|
||||
output[token.i] = None
|
||||
elif len(values) == 1:
|
||||
output[token.i] = values[0]
|
||||
elif len(set(list(values))) == 1:
|
||||
# If all aligned tokens have the same value, use it.
|
||||
output[token.i] = values[0]
|
||||
else:
|
||||
output[token.i] = None
|
||||
if as_string and field not in ["ENT_IOB", "SENT_START"]:
|
||||
output = [vocab.strings[o] if o is not None else o for o in output]
|
||||
return output
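The effect of the Ragged lookup above, sketched with hypothetical data: when a predicted token aligns to several reference tokens, the value is projected only if all of them agree, otherwise the position stays None.

    tags = example.get_aligned("TAG", as_string=True)
    # e.g. a single predicted token "New York" aligned to reference tokens
    # ["New", "York"] both tagged "PROPN" -> "PROPN"; if the reference tags
    # disagreed, the aligned value would be None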
|
||||
|
||||
def get_aligned_parse(self, projectivize=True):
|
||||
cand_to_gold = self.alignment.cand_to_gold
|
||||
gold_to_cand = self.alignment.gold_to_cand
|
||||
cand_to_gold = self.alignment.x2y
|
||||
gold_to_cand = self.alignment.y2x
|
||||
aligned_heads = [None] * self.x.length
|
||||
aligned_deps = [None] * self.x.length
|
||||
heads = [token.head.i for token in self.y]
|
||||
|
@ -118,52 +121,51 @@ cdef class Example:
|
|||
if projectivize:
|
||||
heads, deps = nonproj.projectivize(heads, deps)
|
||||
for cand_i in range(self.x.length):
|
||||
gold_i = cand_to_gold[cand_i]
|
||||
if gold_i is not None: # Alignment found
|
||||
gold_head = gold_to_cand[heads[gold_i]]
|
||||
if gold_head is not None:
|
||||
aligned_heads[cand_i] = gold_head
|
||||
if cand_to_gold.lengths[cand_i] == 1:
|
||||
gold_i = cand_to_gold[cand_i].dataXd[0, 0]
|
||||
if gold_to_cand.lengths[heads[gold_i]] == 1:
|
||||
aligned_heads[cand_i] = int(gold_to_cand[heads[gold_i]].dataXd[0, 0])
|
||||
aligned_deps[cand_i] = deps[gold_i]
|
||||
return aligned_heads, aligned_deps
|
||||
|
||||
def get_aligned_spans_x2y(self, x_spans):
|
||||
return self._get_aligned_spans(self.y, x_spans, self.alignment.x2y)
|
||||
|
||||
def get_aligned_spans_y2x(self, y_spans):
|
||||
return self._get_aligned_spans(self.x, y_spans, self.alignment.y2x)
|
||||
|
||||
def _get_aligned_spans(self, doc, spans, align):
|
||||
seen = set()
|
||||
output = []
|
||||
for span in spans:
|
||||
indices = align[span.start : span.end].data.ravel()
|
||||
indices = [idx for idx in indices if idx not in seen]
|
||||
if len(indices) >= 1:
|
||||
aligned_span = Span(doc, indices[0], indices[-1] + 1, label=span.label)
|
||||
target_text = span.text.lower().strip().replace(" ", "")
|
||||
our_text = aligned_span.text.lower().strip().replace(" ", "")
|
||||
if our_text == target_text:
|
||||
output.append(aligned_span)
|
||||
seen.update(indices)
|
||||
return output
|
||||
|
||||
def get_aligned_ner(self):
|
||||
if not self.y.is_nered:
|
||||
return [None] * len(self.x) # should this be 'missing' instead of 'None' ?
|
||||
x_text = self.x.text
|
||||
# Get a list of entities, and make spans for non-entity tokens.
|
||||
# We then work through the spans in order, trying to find them in
|
||||
# the text and using that to get the offset. Any token that doesn't
|
||||
# get a tag set this way is tagged None.
|
||||
# This could maybe be improved? It at least feels easy to reason about.
|
||||
y_spans = list(self.y.ents)
|
||||
y_spans.sort()
|
||||
x_text_offset = 0
|
||||
x_spans = []
|
||||
for y_span in y_spans:
|
||||
if x_text.count(y_span.text) >= 1:
|
||||
start_char = x_text.index(y_span.text) + x_text_offset
|
||||
end_char = start_char + len(y_span.text)
|
||||
x_span = self.x.char_span(start_char, end_char, label=y_span.label)
|
||||
if x_span is not None:
|
||||
x_spans.append(x_span)
|
||||
x_text = self.x.text[end_char:]
|
||||
x_text_offset = end_char
|
||||
x_ents = self.get_aligned_spans_y2x(self.y.ents)
|
||||
# Default to 'None' for missing values
|
||||
x_tags = biluo_tags_from_offsets(
|
||||
self.x,
|
||||
[(e.start_char, e.end_char, e.label_) for e in x_spans],
|
||||
[(e.start_char, e.end_char, e.label_) for e in x_ents],
|
||||
missing=None
|
||||
)
|
||||
gold_to_cand = self.alignment.gold_to_cand
|
||||
for token in self.y:
|
||||
if token.ent_iob_ == "O":
|
||||
cand_i = gold_to_cand[token.i]
|
||||
if cand_i is not None and x_tags[cand_i] is None:
|
||||
x_tags[cand_i] = "O"
|
||||
i2j_multi = self.alignment.i2j_multi
|
||||
for i, tag in enumerate(x_tags):
|
||||
if tag is None and i in i2j_multi:
|
||||
gold_i = i2j_multi[i]
|
||||
if gold_i is not None and self.y[gold_i].ent_iob_ == "O":
|
||||
# Now fill the tokens we can align to O.
|
||||
O = 2 # I=1, O=2, B=3
|
||||
for i, ent_iob in enumerate(self.get_aligned("ENT_IOB")):
|
||||
if x_tags[i] is None:
|
||||
if ent_iob == O:
|
||||
x_tags[i] = "O"
|
||||
elif self.x[i].is_space:
|
||||
x_tags[i] = "O"
|
||||
return x_tags
|
||||
|
||||
|
@ -194,25 +196,22 @@ cdef class Example:
|
|||
links[(ent.start_char, ent.end_char)] = {ent.kb_id_: 1.0}
|
||||
return links
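For reference, the links dict built above maps an entity's character offsets to candidate KB identifiers, at most one of which is 1.0 (cf. E980). A hypothetical instance:

    links = {(7, 19): {"Q312": 1.0, "Q4837": 0.0}}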
|
||||
|
||||
|
||||
def split_sents(self):
|
||||
""" Split the token annotations into multiple Examples based on
|
||||
sent_starts and return a list of the new Examples"""
|
||||
if not self.reference.is_sentenced:
|
||||
return [self]
|
||||
|
||||
sent_starts = self.get_aligned("SENT_START")
|
||||
sent_starts.append(1) # appending virtual start of a next sentence to facilitate search
|
||||
|
||||
align = self.alignment.y2x
|
||||
seen_indices = set()
|
||||
output = []
|
||||
pred_start = 0
|
||||
for sent in self.reference.sents:
|
||||
new_ref = sent.as_doc()
|
||||
pred_end = sent_starts.index(1, pred_start+1) # find where the next sentence starts
|
||||
new_pred = self.predicted[pred_start : pred_end].as_doc()
|
||||
output.append(Example(new_pred, new_ref))
|
||||
pred_start = pred_end
|
||||
|
||||
for y_sent in self.reference.sents:
|
||||
indices = align[y_sent.start : y_sent.end].data.ravel()
|
||||
indices = [idx for idx in indices if idx not in seen_indices]
|
||||
if indices:
|
||||
x_sent = self.predicted[indices[0] : indices[-1] + 1]
|
||||
output.append(Example(x_sent.as_doc(), y_sent.as_doc()))
|
||||
seen_indices.update(indices)
|
||||
return output
|
||||
|
||||
property text:
|
||||
|
@ -235,10 +234,7 @@ def _annot2array(vocab, tok_annot, doc_annot):
|
|||
if key == "entities":
|
||||
pass
|
||||
elif key == "links":
|
||||
entities = doc_annot.get("entities", {})
|
||||
if not entities:
|
||||
raise ValueError(Errors.E981)
|
||||
ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], value, entities)
|
||||
ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], tok_annot["SPACY"], value)
|
||||
tok_annot["ENT_KB_ID"] = ent_kb_ids
|
||||
elif key == "cats":
|
||||
pass
|
||||
|
@ -381,18 +377,11 @@ def _parse_ner_tags(biluo_or_offsets, vocab, words, spaces):
|
|||
ent_types.append("")
|
||||
return ent_iobs, ent_types
|
||||
|
||||
def _parse_links(vocab, words, links, entities):
|
||||
reference = Doc(vocab, words=words)
|
||||
def _parse_links(vocab, words, spaces, links):
|
||||
reference = Doc(vocab, words=words, spaces=spaces)
|
||||
starts = {token.idx: token.i for token in reference}
|
||||
ends = {token.idx + len(token): token.i for token in reference}
|
||||
ent_kb_ids = ["" for _ in reference]
|
||||
entity_map = [(ent[0], ent[1]) for ent in entities]
|
||||
|
||||
# links annotations need to refer 1-1 to entity annotations - throw error otherwise
|
||||
for index, annot_dict in links.items():
|
||||
start_char, end_char = index
|
||||
if (start_char, end_char) not in entity_map:
|
||||
raise ValueError(Errors.E981)
|
||||
|
||||
for index, annot_dict in links.items():
|
||||
true_kb_ids = []
|
||||
|
@ -406,6 +395,8 @@ def _parse_links(vocab, words, links, entities):
|
|||
start_char, end_char = index
|
||||
start_token = starts.get(start_char)
|
||||
end_token = ends.get(end_char)
|
||||
if start_token is None or end_token is None:
|
||||
raise ValueError(Errors.E981)
|
||||
for i in range(start_token, end_token+1):
|
||||
ent_kb_ids[i] = true_kb_ids[0]
|
||||
|
||||
|
@ -414,7 +405,7 @@ def _parse_links(vocab, words, links, entities):
|
|||
|
||||
def _guess_spaces(text, words):
|
||||
if text is None:
|
||||
return [True] * len(words)
|
||||
return None
|
||||
spaces = []
|
||||
text_pos = 0
|
||||
# align words with text
|
||||
|
|
|
@ -24,14 +24,15 @@ def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
|
|||
for cat, val in doc.cats.items():
|
||||
json_cat = {"label": cat, "value": val}
|
||||
json_para["cats"].append(json_cat)
|
||||
# warning: entities information is currently duplicated as
|
||||
# doc-level "entities" and token-level "ner"
|
||||
for ent in doc.ents:
|
||||
ent_tuple = (ent.start_char, ent.end_char, ent.label_)
|
||||
json_para["entities"].append(ent_tuple)
|
||||
if ent.kb_id_:
|
||||
link_dict = {(ent.start_char, ent.end_char): {ent.kb_id_: 1.0}}
|
||||
json_para["links"].append(link_dict)
|
||||
ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
|
||||
biluo_tags = biluo_tags_from_offsets(doc, ent_offsets, missing=ner_missing_tag)
|
||||
biluo_tags = biluo_tags_from_offsets(doc, json_para["entities"], missing=ner_missing_tag)
|
||||
for j, sent in enumerate(doc.sents):
|
||||
json_sent = {"tokens": [], "brackets": []}
|
||||
for token in sent:
|
||||
|
@ -44,6 +45,7 @@ def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
|
|||
if doc.is_parsed:
|
||||
json_token["head"] = token.head.i-token.i
|
||||
json_token["dep"] = token.dep_
|
||||
json_token["ner"] = biluo_tags[token.i]
|
||||
json_sent["tokens"].append(json_token)
|
||||
json_para["sentences"].append(json_sent)
|
||||
json_doc["paragraphs"].append(json_para)
|
||||
|
|
|
@ -92,7 +92,7 @@ def biluo_tags_from_offsets(doc, entities, missing="O"):
|
|||
# Handle entity cases
|
||||
for start_char, end_char, label in entities:
|
||||
if not label:
|
||||
for s in starts: # account for many-to-one
|
||||
for s in starts: # account for many-to-one
|
||||
if s >= start_char and s < end_char:
|
||||
biluo[starts[s]] = "O"
|
||||
else:
|
||||
|
|
|
@ -2,12 +2,13 @@ import random
|
|||
import itertools
|
||||
import weakref
|
||||
import functools
|
||||
from collections import Iterable
|
||||
from contextlib import contextmanager
|
||||
from copy import copy, deepcopy
|
||||
from pathlib import Path
|
||||
import warnings
|
||||
|
||||
from thinc.api import get_current_ops, Config
|
||||
from thinc.api import get_current_ops, Config, require_gpu
|
||||
import srsly
|
||||
import multiprocessing as mp
|
||||
from itertools import chain, cycle
|
||||
|
@ -232,32 +233,6 @@ class Language(object):
|
|||
def config(self):
|
||||
return self._config
|
||||
|
||||
# Conveniences to access pipeline components
|
||||
# Shouldn't be used anymore!
|
||||
@property
|
||||
def tagger(self):
|
||||
return self.get_pipe("tagger")
|
||||
|
||||
@property
|
||||
def parser(self):
|
||||
return self.get_pipe("parser")
|
||||
|
||||
@property
|
||||
def entity(self):
|
||||
return self.get_pipe("ner")
|
||||
|
||||
@property
|
||||
def linker(self):
|
||||
return self.get_pipe("entity_linker")
|
||||
|
||||
@property
|
||||
def senter(self):
|
||||
return self.get_pipe("senter")
|
||||
|
||||
@property
|
||||
def matcher(self):
|
||||
return self.get_pipe("matcher")
|
||||
|
||||
@property
|
||||
def pipe_names(self):
|
||||
"""Get names of available pipeline components.
|
||||
|
@ -313,10 +288,7 @@ class Language(object):
|
|||
DOCS: https://spacy.io/api/language#create_pipe
|
||||
"""
|
||||
if name not in self.factories:
|
||||
if name == "sbd":
|
||||
raise KeyError(Errors.E108.format(name=name))
|
||||
else:
|
||||
raise KeyError(Errors.E002.format(name=name))
|
||||
raise KeyError(Errors.E002.format(name=name))
|
||||
factory = self.factories[name]
|
||||
|
||||
# transform the model's config to an actual Model
|
||||
|
@ -529,22 +501,6 @@ class Language(object):
|
|||
def make_doc(self, text):
|
||||
return self.tokenizer(text)
|
||||
|
||||
def _convert_examples(self, examples):
|
||||
converted_examples = []
|
||||
if isinstance(examples, tuple):
|
||||
examples = [examples]
|
||||
for eg in examples:
|
||||
if isinstance(eg, Example):
|
||||
converted_examples.append(eg.copy())
|
||||
elif isinstance(eg, tuple):
|
||||
doc, annot = eg
|
||||
if isinstance(doc, str):
|
||||
doc = self.make_doc(doc)
|
||||
converted_examples.append(Example.from_dict(doc, annot))
|
||||
else:
|
||||
raise ValueError(Errors.E979.format(type=type(eg)))
|
||||
return converted_examples
|
||||
|
||||
def update(
|
||||
self,
|
||||
examples,
|
||||
|
@ -557,7 +513,7 @@ class Language(object):
|
|||
):
|
||||
"""Update the models in the pipeline.
|
||||
|
||||
examples (iterable): A batch of `Example` or `Doc` objects.
|
||||
examples (iterable): A batch of `Example` objects.
|
||||
dummy: Should not be set - serves to catch backwards-incompatible scripts.
|
||||
drop (float): The dropout rate.
|
||||
sgd (callable): An optimizer.
|
||||
|
@ -569,10 +525,13 @@ class Language(object):
|
|||
"""
|
||||
if dummy is not None:
|
||||
raise ValueError(Errors.E989)
|
||||
|
||||
if len(examples) == 0:
|
||||
return
|
||||
examples = self._convert_examples(examples)
|
||||
if not isinstance(examples, Iterable):
|
||||
raise TypeError(Errors.E978.format(name="language", method="update", types=type(examples)))
|
||||
wrong_types = set([type(eg) for eg in examples if not isinstance(eg, Example)])
|
||||
if wrong_types:
|
||||
raise TypeError(Errors.E978.format(name="language", method="update", types=wrong_types))
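Since the old tuple-conversion path (_convert_examples) is gone, callers are expected to build Example objects up front. A minimal sketch (annotation dict hypothetical, following the Example.from_dict pattern used in the rehearse docstring below):

    train_data = [("I like eggs", {"tags": ["PRON", "VERB", "NOUN"]})]
    examples = [Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data]
    nlp.update(examples)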
|
||||
|
||||
if sgd is None:
|
||||
if self._optimizer is None:
|
||||
|
@ -605,22 +564,26 @@ class Language(object):
|
|||
initial ones. This is useful for keeping a pretrained model on-track,
|
||||
even if you're updating it with a smaller set of examples.
|
||||
|
||||
examples (iterable): A batch of `Doc` objects.
|
||||
examples (iterable): A batch of `Example` objects.
|
||||
drop (float): The dropout rate.
|
||||
sgd (callable): An optimizer.
|
||||
RETURNS (dict): Results from the update.
|
||||
|
||||
EXAMPLE:
|
||||
>>> raw_text_batches = minibatch(raw_texts)
|
||||
>>> for labelled_batch in minibatch(zip(train_docs, train_golds)):
|
||||
>>> for labelled_batch in minibatch(examples):
|
||||
>>> nlp.update(labelled_batch)
|
||||
>>> raw_batch = [nlp.make_doc(text) for text in next(raw_text_batches)]
|
||||
>>> raw_batch = [Example.from_dict(nlp.make_doc(text), {}) for text in next(raw_text_batches)]
|
||||
>>> nlp.rehearse(raw_batch)
|
||||
"""
|
||||
# TODO: document
|
||||
if len(examples) == 0:
|
||||
return
|
||||
examples = self._convert_examples(examples)
|
||||
if not isinstance(examples, Iterable):
|
||||
raise TypeError(Errors.E978.format(name="language", method="rehearse", types=type(examples)))
|
||||
wrong_types = set([type(eg) for eg in examples if not isinstance(eg, Example)])
|
||||
if wrong_types:
|
||||
raise TypeError(Errors.E978.format(name="language", method="rehearse", types=wrong_types))
|
||||
if sgd is None:
|
||||
if self._optimizer is None:
|
||||
self._optimizer = create_default_optimizer()
|
||||
|
@ -669,7 +632,7 @@ class Language(object):
|
|||
_ = self.vocab[word] # noqa: F841
|
||||
|
||||
if cfg.get("device", -1) >= 0:
|
||||
util.use_gpu(cfg["device"])
|
||||
require_gpu(cfg["device"])
|
||||
if self.vocab.vectors.data.shape[1] >= 1:
|
||||
ops = get_current_ops()
|
||||
self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data)
|
||||
|
@ -696,10 +659,10 @@ class Language(object):
|
|||
component that has a .rehearse() method. Rehearsal is used to prevent
|
||||
models from "forgetting" their initialised "knowledge". To perform
|
||||
rehearsal, collect samples of text you want the models to retain performance
|
||||
on, and call nlp.rehearse() with a batch of Doc objects.
|
||||
on, and call nlp.rehearse() with a batch of Example objects.
|
||||
"""
|
||||
if cfg.get("device", -1) >= 0:
|
||||
util.use_gpu(cfg["device"])
|
||||
require_gpu(cfg["device"])
|
||||
ops = get_current_ops()
|
||||
if self.vocab.vectors.data.shape[1] >= 1:
|
||||
self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data)
|
||||
|
@ -728,7 +691,11 @@ class Language(object):
|
|||
|
||||
DOCS: https://spacy.io/api/language#evaluate
|
||||
"""
|
||||
examples = self._convert_examples(examples)
|
||||
if not isinstance(examples, Iterable):
|
||||
raise TypeError(Errors.E978.format(name="language", method="evaluate", types=type(examples)))
|
||||
wrong_types = set([type(eg) for eg in examples if not isinstance(eg, Example)])
|
||||
if wrong_types:
|
||||
raise TypeError(Errors.E978.format(name="language", method="evaluate", types=wrong_types))
|
||||
if scorer is None:
|
||||
scorer = Scorer(pipeline=self.pipeline)
|
||||
if component_cfg is None:
|
||||
|
@ -786,7 +753,6 @@ class Language(object):
|
|||
self,
|
||||
texts,
|
||||
as_tuples=False,
|
||||
n_threads=-1,
|
||||
batch_size=1000,
|
||||
disable=[],
|
||||
cleanup=False,
|
||||
|
@ -811,8 +777,6 @@ class Language(object):
|
|||
|
||||
DOCS: https://spacy.io/api/language#pipe
|
||||
"""
|
||||
if n_threads != -1:
|
||||
warnings.warn(Warnings.W016, DeprecationWarning)
|
||||
if n_process == -1:
|
||||
n_process = mp.cpu_count()
|
||||
if as_tuples:
|
||||
|
@ -939,7 +903,7 @@ class Language(object):
|
|||
if hasattr(proc2, "model"):
|
||||
proc1.find_listeners(proc2.model)
|
||||
|
||||
def to_disk(self, path, exclude=tuple(), disable=None):
|
||||
def to_disk(self, path, exclude=tuple()):
|
||||
"""Save the current state to a directory. If a model is loaded, this
|
||||
will include the model.
|
||||
|
||||
|
@ -949,9 +913,6 @@ class Language(object):
|
|||
|
||||
DOCS: https://spacy.io/api/language#to_disk
|
||||
"""
|
||||
if disable is not None:
|
||||
warnings.warn(Warnings.W014, DeprecationWarning)
|
||||
exclude = disable
|
||||
path = util.ensure_path(path)
|
||||
serializers = {}
|
||||
serializers["tokenizer"] = lambda p: self.tokenizer.to_disk(
|
||||
|
@ -970,7 +931,7 @@ class Language(object):
|
|||
serializers["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||
util.to_disk(path, serializers, exclude)
|
||||
|
||||
def from_disk(self, path, exclude=tuple(), disable=None):
|
||||
def from_disk(self, path, exclude=tuple()):
|
||||
"""Loads state from a directory. Modifies the object in place and
|
||||
returns it. If the saved `Language` object contains a model, the
|
||||
model will be loaded.
|
||||
|
@ -995,9 +956,6 @@ class Language(object):
|
|||
self.vocab.from_disk(path)
|
||||
_fix_pretrained_vectors_name(self)
|
||||
|
||||
if disable is not None:
|
||||
warnings.warn(Warnings.W014, DeprecationWarning)
|
||||
exclude = disable
|
||||
path = util.ensure_path(path)
|
||||
|
||||
deserializers = {}
|
||||
|
@ -1024,7 +982,7 @@ class Language(object):
|
|||
self._link_components()
|
||||
return self
|
||||
|
||||
def to_bytes(self, exclude=tuple(), disable=None, **kwargs):
|
||||
def to_bytes(self, exclude=tuple()):
|
||||
"""Serialize the current state to a binary string.
|
||||
|
||||
exclude (list): Names of components or serialization fields to exclude.
|
||||
|
@ -1032,9 +990,6 @@ class Language(object):
|
|||
|
||||
DOCS: https://spacy.io/api/language#to_bytes
|
||||
"""
|
||||
if disable is not None:
|
||||
warnings.warn(Warnings.W014, DeprecationWarning)
|
||||
exclude = disable
|
||||
serializers = {}
|
||||
serializers["vocab"] = lambda: self.vocab.to_bytes()
|
||||
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
|
||||
|
@ -1046,10 +1001,9 @@ class Language(object):
|
|||
if not hasattr(proc, "to_bytes"):
|
||||
continue
|
||||
serializers[name] = lambda proc=proc: proc.to_bytes(exclude=["vocab"])
|
||||
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), disable=None, **kwargs):
|
||||
def from_bytes(self, bytes_data, exclude=tuple()):
|
||||
"""Load state from a binary string.
|
||||
|
||||
bytes_data (bytes): The data to load from.
|
||||
|
@ -1070,9 +1024,6 @@ class Language(object):
|
|||
self.vocab.from_bytes(b)
|
||||
_fix_pretrained_vectors_name(self)
|
||||
|
||||
if disable is not None:
|
||||
warnings.warn(Warnings.W014, DeprecationWarning)
|
||||
exclude = disable
|
||||
deserializers = {}
|
||||
deserializers["config.cfg"] = lambda b: self.config.from_bytes(b)
|
||||
deserializers["meta.json"] = deserialize_meta
|
||||
|
@ -1088,7 +1039,6 @@ class Language(object):
|
|||
deserializers[name] = lambda b, proc=proc: proc.from_bytes(
|
||||
b, exclude=["vocab"]
|
||||
)
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
util.from_bytes(bytes_data, deserializers, exclude)
|
||||
self._link_components()
|
||||
return self
|
||||
|
@ -1210,7 +1160,7 @@ class DisabledPipes(list):
|
|||
def _pipe(examples, proc, kwargs):
|
||||
# We added some args for pipe that __call__ doesn't expect.
|
||||
kwargs = dict(kwargs)
|
||||
for arg in ["n_threads", "batch_size"]:
|
||||
for arg in ["batch_size"]:
|
||||
if arg in kwargs:
|
||||
kwargs.pop(arg)
|
||||
for eg in examples:
|
||||
|
|
|
@ -1,5 +1,4 @@
|
|||
from .errors import Errors
|
||||
from .lookups import Lookups
|
||||
from .parts_of_speech import NAMES as UPOS_NAMES
|
||||
|
||||
|
||||
|
@ -15,15 +14,13 @@ class Lemmatizer(object):
|
|||
def load(cls, *args, **kwargs):
|
||||
raise NotImplementedError(Errors.E172)
|
||||
|
||||
def __init__(self, lookups, *args, **kwargs):
|
||||
def __init__(self, lookups):
|
||||
"""Initialize a Lemmatizer.
|
||||
|
||||
lookups (Lookups): The lookups object containing the (optional) tables
|
||||
"lemma_rules", "lemma_index", "lemma_exc" and "lemma_lookup".
|
||||
RETURNS (Lemmatizer): The newly constructed object.
|
||||
"""
|
||||
if args or kwargs or not isinstance(lookups, Lookups):
|
||||
raise ValueError(Errors.E173)
|
||||
self.lookups = lookups
|
||||
|
||||
def __call__(self, string, univ_pos, morphology=None):
|
||||
|
|

@ -174,8 +174,7 @@ cdef class Matcher:
            return default
        return (self._callbacks[key], self._patterns[key])

    def pipe(self, docs, batch_size=1000, n_threads=-1, return_matches=False,
             as_tuples=False):
    def pipe(self, docs, batch_size=1000, return_matches=False, as_tuples=False):
        """Match a stream of documents, yielding them in turn.

        docs (iterable): A stream of documents.

@ -188,9 +187,6 @@ cdef class Matcher:
            be a sequence of ((doc, matches), context) tuples.
        YIELDS (Doc): Documents, in order.
        """
        if n_threads != -1:
            warnings.warn(Warnings.W016, DeprecationWarning)

        if as_tuples:
            for doc, context in docs:
                matches = self(doc)
|
@ -26,7 +26,7 @@ cdef class PhraseMatcher:
|
|||
Copyright (c) 2017 Vikash Singh (vikash.duliajan@gmail.com)
|
||||
"""
|
||||
|
||||
def __init__(self, Vocab vocab, max_length=0, attr="ORTH", validate=False):
|
||||
def __init__(self, Vocab vocab, attr="ORTH", validate=False):
|
||||
"""Initialize the PhraseMatcher.
|
||||
|
||||
vocab (Vocab): The shared vocabulary.
|
||||
|
@ -36,8 +36,6 @@ cdef class PhraseMatcher:
|
|||
|
||||
DOCS: https://spacy.io/api/phrasematcher#init
|
||||
"""
|
||||
if max_length != 0:
|
||||
warnings.warn(Warnings.W010, DeprecationWarning)
|
||||
self.vocab = vocab
|
||||
self._callbacks = {}
|
||||
self._docs = {}
|
||||
|
@ -287,8 +285,7 @@ cdef class PhraseMatcher:
|
|||
current_node = self.c_map
|
||||
idx += 1
|
||||
|
||||
def pipe(self, stream, batch_size=1000, n_threads=-1, return_matches=False,
|
||||
as_tuples=False):
|
||||
def pipe(self, stream, batch_size=1000, return_matches=False, as_tuples=False):
|
||||
"""Match a stream of documents, yielding them in turn.
|
||||
|
||||
docs (iterable): A stream of documents.
|
||||
|
@ -303,8 +300,6 @@ cdef class PhraseMatcher:
|
|||
|
||||
DOCS: https://spacy.io/api/phrasematcher#pipe
|
||||
"""
|
||||
if n_threads != -1:
|
||||
warnings.warn(Warnings.W016, DeprecationWarning)
|
||||
if as_tuples:
|
||||
for doc, context in stream:
|
||||
matches = self(doc)
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
import numpy
|
||||
|
||||
from thinc.api import chain, Maxout, LayerNorm, Softmax, Linear, zero_init, Model
|
||||
from thinc.api import MultiSoftmax, list2array
|
||||
|
||||
|
||||
def build_multi_task_model(tok2vec, maxout_pieces, token_vector_width, nO=None):
|
||||
|
@ -21,9 +22,10 @@ def build_multi_task_model(tok2vec, maxout_pieces, token_vector_width, nO=None):
|
|||
return model
|
||||
|
||||
|
||||
def build_cloze_multi_task_model(vocab, tok2vec, maxout_pieces, nO=None):
|
||||
def build_cloze_multi_task_model(vocab, tok2vec, maxout_pieces, hidden_size, nO=None):
|
||||
# nO = vocab.vectors.data.shape[1]
|
||||
output_layer = chain(
|
||||
list2array(),
|
||||
Maxout(
|
||||
nO=nO,
|
||||
nI=tok2vec.get_dim("nO"),
|
||||
|
@ -40,6 +42,22 @@ def build_cloze_multi_task_model(vocab, tok2vec, maxout_pieces, nO=None):
|
|||
return model
|
||||
|
||||
|
||||
def build_cloze_characters_multi_task_model(
|
||||
vocab, tok2vec, maxout_pieces, hidden_size, nr_char
|
||||
):
|
||||
output_layer = chain(
|
||||
list2array(),
|
||||
Maxout(hidden_size, nP=maxout_pieces),
|
||||
LayerNorm(nI=hidden_size),
|
||||
MultiSoftmax([256] * nr_char, nI=hidden_size),
|
||||
)
|
||||
|
||||
model = build_masked_language_model(vocab, chain(tok2vec, output_layer))
|
||||
model.set_ref("tok2vec", tok2vec)
|
||||
model.set_ref("output_layer", output_layer)
|
||||
return model
|
||||
|
||||
|
||||
def build_masked_language_model(vocab, wrapped_model, mask_prob=0.15):
|
||||
"""Convert a model into a BERT-style masked language model"""
|
||||
|
||||
|
@ -48,7 +66,7 @@ def build_masked_language_model(vocab, wrapped_model, mask_prob=0.15):
|
|||
def mlm_forward(model, docs, is_train):
|
||||
mask, docs = _apply_mask(docs, random_words, mask_prob=mask_prob)
|
||||
mask = model.ops.asarray(mask).reshape((mask.shape[0], 1))
|
||||
output, backprop = model.get_ref("wrapped-model").begin_update(docs)
|
||||
output, backprop = model.layers[0](docs, is_train)
|
||||
|
||||
def mlm_backward(d_output):
|
||||
d_output *= 1 - mask
|
||||
|
@ -56,8 +74,22 @@ def build_masked_language_model(vocab, wrapped_model, mask_prob=0.15):
|
|||
|
||||
return output, mlm_backward
|
||||
|
||||
mlm_model = Model("masked-language-model", mlm_forward, layers=[wrapped_model])
|
||||
mlm_model.set_ref("wrapped-model", wrapped_model)
|
||||
def mlm_initialize(model, X=None, Y=None):
|
||||
wrapped = model.layers[0]
|
||||
wrapped.initialize(X=X, Y=Y)
|
||||
for dim in wrapped.dim_names:
|
||||
if wrapped.has_dim(dim):
|
||||
model.set_dim(dim, wrapped.get_dim(dim))
|
||||
|
||||
mlm_model = Model(
|
||||
"masked-language-model",
|
||||
mlm_forward,
|
||||
layers=[wrapped_model],
|
||||
init=mlm_initialize,
|
||||
refs={"wrapped": wrapped_model},
|
||||
dims={dim: None for dim in wrapped_model.dim_names},
|
||||
)
|
||||
mlm_model.set_ref("wrapped", wrapped_model)
|
||||
|
||||
return mlm_model
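A hedged usage sketch of the refactored wrapper (vocab, tok2vec, output_layer and sample_docs are assumed to be defined as in the builders above):

    mlm = build_masked_language_model(vocab, chain(tok2vec, output_layer), mask_prob=0.15)
    mlm.initialize(X=sample_docs)   # sample_docs: a small batch of Doc objects (assumed)
    # mlm_initialize copies the wrapped layer's dims onto the outer model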
|
||||
|
||||
|
|

@ -17,11 +17,7 @@ def build_tb_parser_model(
    nO=None,
):
    t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None
    tok2vec = chain(
        tok2vec,
        list2array(),
        Linear(hidden_width, t2v_width),
    )
    tok2vec = chain(tok2vec, list2array(), Linear(hidden_width, t2v_width),)
    tok2vec.set_dim("nO", hidden_width)

    lower = PrecomputableAffine(
|
@ -263,17 +263,21 @@ def build_Tok2Vec_model(
|
|||
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
|
||||
with Model.define_operators({">>": chain, "|": concatenate, "**": clone}):
|
||||
norm = HashEmbed(
|
||||
nO=width, nV=embed_size, column=cols.index(NORM), dropout=dropout
|
||||
nO=width, nV=embed_size, column=cols.index(NORM), dropout=dropout,
|
||||
seed=0
|
||||
)
|
||||
if subword_features:
|
||||
prefix = HashEmbed(
|
||||
nO=width, nV=embed_size // 2, column=cols.index(PREFIX), dropout=dropout
|
||||
nO=width, nV=embed_size // 2, column=cols.index(PREFIX), dropout=dropout,
|
||||
seed=1
|
||||
)
|
||||
suffix = HashEmbed(
|
||||
nO=width, nV=embed_size // 2, column=cols.index(SUFFIX), dropout=dropout
|
||||
nO=width, nV=embed_size // 2, column=cols.index(SUFFIX), dropout=dropout,
|
||||
seed=2
|
||||
)
|
||||
shape = HashEmbed(
|
||||
nO=width, nV=embed_size // 2, column=cols.index(SHAPE), dropout=dropout
|
||||
nO=width, nV=embed_size // 2, column=cols.index(SHAPE), dropout=dropout,
|
||||
seed=3
|
||||
)
|
||||
else:
|
||||
prefix, suffix, shape = (None, None, None)
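A short sketch of the seeded embeddings introduced above (widths hypothetical): fixed, distinct seeds make the hashed tables reproducible across runs and keep the NORM/PREFIX/SUFFIX/SHAPE tables decorrelated from each other.

    from thinc.api import HashEmbed

    norm = HashEmbed(nO=96, nV=2000, column=0, dropout=0.0, seed=0)
    prefix = HashEmbed(nO=96, nV=1000, column=1, dropout=0.0, seed=1)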
|
||||
|
|
|
@ -120,15 +120,14 @@ class Morphologizer(Tagger):
|
|||
d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs])
|
||||
return float(loss), d_scores
|
||||
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
def to_bytes(self, exclude=tuple()):
|
||||
serialize = {}
|
||||
serialize["model"] = self.model.to_bytes
|
||||
serialize["vocab"] = self.vocab.to_bytes
|
||||
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
||||
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
def from_bytes(self, bytes_data, exclude=tuple()):
|
||||
def load_model(b):
|
||||
try:
|
||||
self.model.from_bytes(b)
|
||||
|
@ -140,20 +139,18 @@ class Morphologizer(Tagger):
|
|||
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
|
||||
"model": lambda b: load_model(b),
|
||||
}
|
||||
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
return self
|
||||
|
||||
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||
def to_disk(self, path, exclude=tuple()):
|
||||
serialize = {
|
||||
"vocab": lambda p: self.vocab.to_disk(p),
|
||||
"model": lambda p: p.open("wb").write(self.model.to_bytes()),
|
||||
"cfg": lambda p: srsly.write_json(p, self.cfg),
|
||||
}
|
||||
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||
util.to_disk(path, serialize, exclude)
|
||||
|
||||
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||
def from_disk(self, path, exclude=tuple()):
|
||||
def load_model(p):
|
||||
with p.open("rb") as file_:
|
||||
try:
|
||||
|
@ -166,6 +163,5 @@ class Morphologizer(Tagger):
|
|||
"cfg": lambda p: self.cfg.update(_load_cfg(p)),
|
||||
"model": load_model,
|
||||
}
|
||||
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
return self
|
||||
|
|
|
@@ -66,7 +66,7 @@ class Pipe(object):
self.set_annotations([doc], predictions)
return doc
def pipe(self, stream, batch_size=128, n_threads=-1):
def pipe(self, stream, batch_size=128):
"""Apply the pipe to a stream of documents.
Both __call__ and pipe should delegate to the `predict()`

@@ -151,7 +151,7 @@ class Pipe(object):
with self.model.use_params(params):
yield
def to_bytes(self, exclude=tuple(), **kwargs):
def to_bytes(self, exclude=tuple()):
"""Serialize the pipe to a bytestring.
exclude (list): String names of serialization fields to exclude.

@@ -162,10 +162,9 @@ class Pipe(object):
serialize["model"] = self.model.to_bytes
if hasattr(self, "vocab"):
serialize["vocab"] = self.vocab.to_bytes
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude)
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
def from_bytes(self, bytes_data, exclude=tuple()):
"""Load the pipe from a bytestring."""
def load_model(b):

@@ -179,20 +178,18 @@ class Pipe(object):
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
deserialize["model"] = load_model
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_bytes(bytes_data, deserialize, exclude)
return self
def to_disk(self, path, exclude=tuple(), **kwargs):
def to_disk(self, path, exclude=tuple()):
"""Serialize the pipe to disk."""
serialize = {}
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
serialize["model"] = lambda p: self.model.to_disk(p)
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude)
def from_disk(self, path, exclude=tuple(), **kwargs):
def from_disk(self, path, exclude=tuple()):
"""Load the pipe from disk."""
def load_model(p):

@@ -205,7 +202,6 @@ class Pipe(object):
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
deserialize["model"] = load_model
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_disk(path, deserialize, exclude)
return self
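A minimal usage sketch of the serialization API after the hunks above: fields to skip are passed only through the `exclude` list, since the old `**kwargs` flags and the `util.get_serialization_exclude` check are gone. The blank "en" pipeline and the "tagger" component are illustrative choices, not part of the diff.

# Illustrative sketch, not part of the diff: the exclude-only signatures shown above.
import spacy

nlp = spacy.blank("en")
tagger = nlp.create_pipe("tagger")
data = tagger.to_bytes(exclude=["vocab"])   # skip fields via the exclude list
tagger.from_bytes(data, exclude=["vocab"])  # keyword flags like vocab=False are no longer accepted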
@@ -232,7 +228,7 @@ class Tagger(Pipe):
self.set_annotations([doc], tags)
return doc
def pipe(self, stream, batch_size=128, n_threads=-1):
def pipe(self, stream, batch_size=128):
for docs in util.minibatch(stream, size=batch_size):
tag_ids = self.predict(docs)
self.set_annotations(docs, tag_ids)

@@ -295,7 +291,7 @@ class Tagger(Pipe):
return
except AttributeError:
types = set([type(eg) for eg in examples])
raise ValueError(Errors.E978.format(name="Tagger", method="update", types=types))
raise TypeError(Errors.E978.format(name="Tagger", method="update", types=types))
set_dropout_rate(self.model, drop)
tag_scores, bp_tag_scores = self.model.begin_update(
[eg.predicted for eg in examples])

@@ -321,7 +317,7 @@ class Tagger(Pipe):
docs = [eg.predicted for eg in examples]
except AttributeError:
types = set([type(eg) for eg in examples])
raise ValueError(Errors.E978.format(name="Tagger", method="rehearse", types=types))
raise TypeError(Errors.E978.format(name="Tagger", method="rehearse", types=types))
if self._rehearsal_model is None:
return
if not any(len(doc) for doc in docs):

@@ -358,7 +354,7 @@ class Tagger(Pipe):
try:
y = example.y
except AttributeError:
raise ValueError(Errors.E978.format(name="Tagger", method="begin_training", types=type(example)))
raise TypeError(Errors.E978.format(name="Tagger", method="begin_training", types=type(example)))
for token in y:
tag = token.tag_
if tag in orig_tag_map:

@@ -421,17 +417,16 @@ class Tagger(Pipe):
with self.model.use_params(params):
yield
def to_bytes(self, exclude=tuple(), **kwargs):
def to_bytes(self, exclude=tuple()):
serialize = {}
serialize["model"] = self.model.to_bytes
serialize["vocab"] = self.vocab.to_bytes
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
tag_map = dict(sorted(self.vocab.morphology.tag_map.items()))
serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map)
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude)
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
def from_bytes(self, bytes_data, exclude=tuple()):
def load_model(b):
try:
self.model.from_bytes(b)

@@ -451,11 +446,10 @@ class Tagger(Pipe):
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
"model": lambda b: load_model(b),
}
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_bytes(bytes_data, deserialize, exclude)
return self
def to_disk(self, path, exclude=tuple(), **kwargs):
def to_disk(self, path, exclude=tuple()):
tag_map = dict(sorted(self.vocab.morphology.tag_map.items()))
serialize = {
"vocab": lambda p: self.vocab.to_disk(p),

@@ -463,10 +457,9 @@ class Tagger(Pipe):
"model": lambda p: self.model.to_disk(p),
"cfg": lambda p: srsly.write_json(p, self.cfg),
}
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude)
def from_disk(self, path, exclude=tuple(), **kwargs):
def from_disk(self, path, exclude=tuple()):
def load_model(p):
with p.open("rb") as file_:
try:

@@ -487,7 +480,6 @@ class Tagger(Pipe):
"tag_map": load_tag_map,
"model": load_model,
}
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_disk(path, deserialize, exclude)
return self
@@ -566,15 +558,14 @@ class SentenceRecognizer(Tagger):
def add_label(self, label, values=None):
raise NotImplementedError
def to_bytes(self, exclude=tuple(), **kwargs):
def to_bytes(self, exclude=tuple()):
serialize = {}
serialize["model"] = self.model.to_bytes
serialize["vocab"] = self.vocab.to_bytes
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude)
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
def from_bytes(self, bytes_data, exclude=tuple()):
def load_model(b):
try:
self.model.from_bytes(b)

@@ -586,20 +577,18 @@ class SentenceRecognizer(Tagger):
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
"model": lambda b: load_model(b),
}
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_bytes(bytes_data, deserialize, exclude)
return self
def to_disk(self, path, exclude=tuple(), **kwargs):
def to_disk(self, path, exclude=tuple()):
serialize = {
"vocab": lambda p: self.vocab.to_disk(p),
"model": lambda p: p.open("wb").write(self.model.to_bytes()),
"cfg": lambda p: srsly.write_json(p, self.cfg),
}
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude)
def from_disk(self, path, exclude=tuple(), **kwargs):
def from_disk(self, path, exclude=tuple()):
def load_model(p):
with p.open("rb") as file_:
try:

@@ -612,7 +601,6 @@ class SentenceRecognizer(Tagger):
"cfg": lambda p: self.cfg.update(_load_cfg(p)),
"model": load_model,
}
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_disk(path, deserialize, exclude)
return self
@@ -790,7 +778,7 @@ class ClozeMultitask(Pipe):
predictions, bp_predictions = self.model.begin_update([eg.predicted for eg in examples])
except AttributeError:
types = set([type(eg) for eg in examples])
raise ValueError(Errors.E978.format(name="ClozeMultitask", method="rehearse", types=types))
raise TypeError(Errors.E978.format(name="ClozeMultitask", method="rehearse", types=types))
loss, d_predictions = self.get_loss(examples, self.vocab.vectors.data, predictions)
bp_predictions(d_predictions)
if sgd is not None:
@@ -825,7 +813,7 @@ class TextCategorizer(Pipe):
def labels(self, value):
self.cfg["labels"] = tuple(value)
def pipe(self, stream, batch_size=128, n_threads=-1):
def pipe(self, stream, batch_size=128):
for docs in util.minibatch(stream, size=batch_size):
scores, tensors = self.predict(docs)
self.set_annotations(docs, scores, tensors=tensors)

@@ -856,7 +844,7 @@ class TextCategorizer(Pipe):
return
except AttributeError:
types = set([type(eg) for eg in examples])
raise ValueError(Errors.E978.format(name="TextCategorizer", method="update", types=types))
raise TypeError(Errors.E978.format(name="TextCategorizer", method="update", types=types))
set_dropout_rate(self.model, drop)
scores, bp_scores = self.model.begin_update(
[eg.predicted for eg in examples]

@@ -879,7 +867,7 @@ class TextCategorizer(Pipe):
docs = [eg.predicted for eg in examples]
except AttributeError:
types = set([type(eg) for eg in examples])
raise ValueError(Errors.E978.format(name="TextCategorizer", method="rehearse", types=types))
raise TypeError(Errors.E978.format(name="TextCategorizer", method="rehearse", types=types))
if not any(len(doc) for doc in docs):
# Handle cases where there are no tokens in any docs.
return

@@ -940,7 +928,7 @@ class TextCategorizer(Pipe):
try:
y = example.y
except AttributeError:
raise ValueError(Errors.E978.format(name="TextCategorizer", method="update", types=type(example)))
raise TypeError(Errors.E978.format(name="TextCategorizer", method="update", types=type(example)))
for cat in y.cats:
self.add_label(cat)
self.require_labels()
@@ -1105,7 +1093,7 @@ class EntityLinker(Pipe):
docs = [eg.predicted for eg in examples]
except AttributeError:
types = set([type(eg) for eg in examples])
raise ValueError(Errors.E978.format(name="EntityLinker", method="update", types=types))
raise TypeError(Errors.E978.format(name="EntityLinker", method="update", types=types))
if set_annotations:
# This seems simpler than other ways to get that exact output -- but
# it does run the model twice :(

@@ -1198,7 +1186,7 @@ class EntityLinker(Pipe):
self.set_annotations([doc], kb_ids, tensors=tensors)
return doc
def pipe(self, stream, batch_size=128, n_threads=-1):
def pipe(self, stream, batch_size=128):
for docs in util.minibatch(stream, size=batch_size):
kb_ids, tensors = self.predict(docs)
self.set_annotations(docs, kb_ids, tensors=tensors)

@@ -1309,17 +1297,16 @@ class EntityLinker(Pipe):
for token in ent:
token.ent_kb_id_ = kb_id
def to_disk(self, path, exclude=tuple(), **kwargs):
def to_disk(self, path, exclude=tuple()):
serialize = {}
self.cfg["entity_width"] = self.kb.entity_vector_length
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
serialize["kb"] = lambda p: self.kb.dump(p)
serialize["model"] = lambda p: self.model.to_disk(p)
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude)
def from_disk(self, path, exclude=tuple(), **kwargs):
def from_disk(self, path, exclude=tuple()):
def load_model(p):
try:
self.model.from_bytes(p.open("rb").read())

@@ -1335,7 +1322,6 @@ class EntityLinker(Pipe):
deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
deserialize["kb"] = load_kb
deserialize["model"] = load_model
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_disk(path, deserialize, exclude)
return self
@@ -1411,7 +1397,7 @@ class Sentencizer(Pipe):
doc[start].is_sent_start = True
return doc
def pipe(self, stream, batch_size=128, n_threads=-1):
def pipe(self, stream, batch_size=128):
for docs in util.minibatch(stream, size=batch_size):
predictions = self.predict(docs)
if isinstance(predictions, tuple) and len(tuple) == 2:
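All of the `pipe()` hunks in this file drop the deprecated `n_threads` argument, so batching is controlled by `batch_size` alone. A hedged sketch of the calling convention follows; the texts and blank "en" pipeline are made up, and component-level `pipe()` methods take the same `batch_size`-only signature.

# Illustrative sketch, not part of the diff: streaming with batch_size only.
import spacy

nlp = spacy.blank("en")
texts = ["First text.", "Second text."]
for doc in nlp.pipe(texts, batch_size=2):  # n_threads is no longer a parameter
    print([token.text for token in doc])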
@@ -51,11 +51,10 @@ class Tok2Vec(Pipe):
self.set_annotations([doc], tokvecses)
return doc
def pipe(self, stream, batch_size=128, n_threads=-1):
def pipe(self, stream, batch_size=128):
"""Process `Doc` objects as a stream.
stream (iterator): A sequence of `Doc` objects to process.
batch_size (int): Number of `Doc` objects to group.
n_threads (int): Number of threads.
YIELDS (iterator): A sequence of `Doc` objects, in order of input.
"""
for docs in minibatch(stream, batch_size):
@@ -326,10 +326,11 @@ class Scorer(object):
for token in doc:
if token.orth_.isspace():
continue
gold_i = align.cand_to_gold[token.i]
if gold_i is None:
if align.x2y.lengths[token.i] != 1:
self.tokens.fp += 1
gold_i = None
else:
gold_i = align.x2y[token.i].dataXd[0, 0]
self.tokens.tp += 1
cand_tags.add((gold_i, token.tag_))
cand_pos.add((gold_i, token.pos_))

@@ -345,7 +346,10 @@ class Scorer(object):
if token.is_sent_start:
cand_sent_starts.add(gold_i)
if token.dep_.lower() not in punct_labels and token.orth_.strip():
gold_head = align.cand_to_gold[token.head.i]
if align.x2y.lengths[token.head.i] == 1:
gold_head = align.x2y[token.head.i].dataXd[0, 0]
else:
gold_head = None
# None is indistinct, so we can't just add it to the set
# Multiple (None, None) deps are possible
if gold_i is None or gold_head is None:

@@ -381,15 +385,9 @@ class Scorer(object):
gold_ents.add(gold_ent)
gold_per_ents[ent.label_].add((ent.label_, ent.start, ent.end - 1))
cand_per_ents = {ent_label: set() for ent_label in ent_labels}
for ent in doc.ents:
first = align.cand_to_gold[ent.start]
last = align.cand_to_gold[ent.end - 1]
if first is None or last is None:
self.ner.fp += 1
self.ner_per_ents[ent.label_].fp += 1
else:
cand_ents.add((ent.label_, first, last))
cand_per_ents[ent.label_].add((ent.label_, first, last))
for ent in example.get_aligned_spans_x2y(doc.ents):
cand_ents.add((ent.label_, ent.start, ent.end - 1))
cand_per_ents[ent.label_].add((ent.label_, ent.start, ent.end - 1))
# Scores per ent
for k, v in self.ner_per_ents.items():
if k in cand_per_ents:
@@ -157,7 +157,7 @@ cdef class Parser:
self.set_annotations([doc], states, tensors=None)
return doc
def pipe(self, docs, int batch_size=256, int n_threads=-1):
def pipe(self, docs, int batch_size=256):
"""Process a stream of documents.
stream: The sequence of documents to process.

@@ -461,24 +461,22 @@ cdef class Parser:
link_vectors_to_models(self.vocab)
return sgd
def to_disk(self, path, exclude=tuple(), **kwargs):
def to_disk(self, path, exclude=tuple()):
serializers = {
'model': lambda p: (self.model.to_disk(p) if self.model is not True else True),
'vocab': lambda p: self.vocab.to_disk(p),
'moves': lambda p: self.moves.to_disk(p, exclude=["strings"]),
'cfg': lambda p: srsly.write_json(p, self.cfg)
}
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
util.to_disk(path, serializers, exclude)
def from_disk(self, path, exclude=tuple(), **kwargs):
def from_disk(self, path, exclude=tuple()):
deserializers = {
'vocab': lambda p: self.vocab.from_disk(p),
'moves': lambda p: self.moves.from_disk(p, exclude=["strings"]),
'cfg': lambda p: self.cfg.update(srsly.read_json(p)),
'model': lambda p: None,
}
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
util.from_disk(path, deserializers, exclude)
if 'model' not in exclude:
path = util.ensure_path(path)

@@ -491,24 +489,22 @@ cdef class Parser:
raise ValueError(Errors.E149)
return self
def to_bytes(self, exclude=tuple(), **kwargs):
def to_bytes(self, exclude=tuple()):
serializers = {
"model": lambda: (self.model.to_bytes()),
"vocab": lambda: self.vocab.to_bytes(),
"moves": lambda: self.moves.to_bytes(exclude=["strings"]),
"cfg": lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True)
}
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
def from_bytes(self, bytes_data, exclude=tuple()):
deserializers = {
"vocab": lambda b: self.vocab.from_bytes(b),
"moves": lambda b: self.moves.from_bytes(b, exclude=["strings"]),
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
"model": lambda b: None,
}
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
msg = util.from_bytes(bytes_data, deserializers, exclude)
if 'model' not in exclude:
if 'model' in msg:
@@ -227,22 +227,20 @@ cdef class TransitionSystem:
self.from_bytes(byte_data, **kwargs)
return self
def to_bytes(self, exclude=tuple(), **kwargs):
def to_bytes(self, exclude=tuple()):
transitions = []
serializers = {
'moves': lambda: srsly.json_dumps(self.labels),
'strings': lambda: self.strings.to_bytes()
}
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
def from_bytes(self, bytes_data, exclude=tuple()):
labels = {}
deserializers = {
'moves': lambda b: labels.update(srsly.json_loads(b)),
'strings': lambda b: self.strings.from_bytes(b)
}
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
msg = util.from_bytes(bytes_data, deserializers, exclude)
self.initialize_actions(labels)
return self
@ -179,22 +179,9 @@ def test_doc_api_right_edge(en_tokenizer):
|
|||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
assert doc[6].text == "for"
|
||||
subtree = [w.text for w in doc[6].subtree]
|
||||
assert subtree == [
|
||||
"for",
|
||||
"the",
|
||||
"sake",
|
||||
"of",
|
||||
"such",
|
||||
"as",
|
||||
"live",
|
||||
"under",
|
||||
"the",
|
||||
"government",
|
||||
"of",
|
||||
"the",
|
||||
"Romans",
|
||||
",",
|
||||
]
|
||||
# fmt: off
|
||||
assert subtree == ["for", "the", "sake", "of", "such", "as", "live", "under", "the", "government", "of", "the", "Romans", ","]
|
||||
# fmt: on
|
||||
assert doc[6].right_edge.text == ","
|
||||
|
||||
|
||||
|
@ -303,6 +290,68 @@ def test_doc_from_array_sent_starts(en_vocab):
|
|||
assert new_doc.is_parsed
|
||||
|
||||
|
||||
def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
|
||||
en_texts = ["Merging the docs is fun.", "They don't think alike."]
|
||||
de_text = "Wie war die Frage?"
|
||||
en_docs = [en_tokenizer(text) for text in en_texts]
|
||||
docs_idx = en_texts[0].index("docs")
|
||||
de_doc = de_tokenizer(de_text)
|
||||
en_docs[0].user_data[("._.", "is_ambiguous", docs_idx, None)] = (
|
||||
True,
|
||||
None,
|
||||
None,
|
||||
None,
|
||||
)
|
||||
|
||||
assert Doc.from_docs([]) is None
|
||||
|
||||
assert de_doc is not Doc.from_docs([de_doc])
|
||||
assert str(de_doc) == str(Doc.from_docs([de_doc]))
|
||||
|
||||
with pytest.raises(ValueError):
|
||||
Doc.from_docs(en_docs + [de_doc])
|
||||
|
||||
m_doc = Doc.from_docs(en_docs)
|
||||
assert len(en_docs) == len(list(m_doc.sents))
|
||||
assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1])
|
||||
assert str(m_doc) == " ".join(en_texts)
|
||||
p_token = m_doc[len(en_docs[0]) - 1]
|
||||
assert p_token.text == "." and bool(p_token.whitespace_)
|
||||
en_docs_tokens = [t for doc in en_docs for t in doc]
|
||||
assert len(m_doc) == len(en_docs_tokens)
|
||||
think_idx = len(en_texts[0]) + 1 + en_texts[1].index("think")
|
||||
assert m_doc[9].idx == think_idx
|
||||
with pytest.raises(AttributeError):
|
||||
# not callable, because it was not set via set_extension
|
||||
m_doc[2]._.is_ambiguous
|
||||
assert len(m_doc.user_data) == len(en_docs[0].user_data) # but it's there
|
||||
|
||||
m_doc = Doc.from_docs(en_docs, ensure_whitespace=False)
|
||||
assert len(en_docs) == len(list(m_doc.sents))
|
||||
assert len(str(m_doc)) == len(en_texts[0]) + len(en_texts[1])
|
||||
assert str(m_doc) == "".join(en_texts)
|
||||
p_token = m_doc[len(en_docs[0]) - 1]
|
||||
assert p_token.text == "." and not bool(p_token.whitespace_)
|
||||
en_docs_tokens = [t for doc in en_docs for t in doc]
|
||||
assert len(m_doc) == len(en_docs_tokens)
|
||||
think_idx = len(en_texts[0]) + 0 + en_texts[1].index("think")
|
||||
assert m_doc[9].idx == think_idx
|
||||
|
||||
m_doc = Doc.from_docs(en_docs, attrs=["lemma", "length", "pos"])
|
||||
with pytest.raises(ValueError):
|
||||
# important attributes from sentenziser or parser are missing
|
||||
assert list(m_doc.sents)
|
||||
assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1])
|
||||
# space delimiter considered, although spacy attribute was missing
|
||||
assert str(m_doc) == " ".join(en_texts)
|
||||
p_token = m_doc[len(en_docs[0]) - 1]
|
||||
assert p_token.text == "." and bool(p_token.whitespace_)
|
||||
en_docs_tokens = [t for doc in en_docs for t in doc]
|
||||
assert len(m_doc) == len(en_docs_tokens)
|
||||
think_idx = len(en_texts[0]) + 1 + en_texts[1].index("think")
|
||||
assert m_doc[9].idx == think_idx
|
||||
|
||||
|
||||
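A hedged sketch of the `Doc.from_docs` behaviour that `test_doc_api_from_docs` above exercises; the example texts are the ones from the test, and the blank "en" pipeline is an assumption of this sketch, not part of the diff.

# Illustrative sketch, not part of the diff: merging Docs with Doc.from_docs.
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
docs = [nlp("Merging the docs is fun."), nlp("They don't think alike.")]
merged = Doc.from_docs(docs)  # ensure_whitespace=True joins the texts with a space
assert merged.text == "Merging the docs is fun. They don't think alike."
tight = Doc.from_docs(docs, ensure_whitespace=False)  # no space is inserted
assert tight.text == "Merging the docs is fun.They don't think alike."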
def test_doc_lang(en_vocab):
|
||||
doc = Doc(en_vocab, words=["Hello", "world"])
|
||||
assert doc.lang_ == "en"
|
||||
|
|
|
@ -66,8 +66,6 @@ def test_spans_string_fn(doc):
|
|||
span = doc[0:4]
|
||||
assert len(span) == 4
|
||||
assert span.text == "This is a sentence"
|
||||
assert span.upper_ == "THIS IS A SENTENCE"
|
||||
assert span.lower_ == "this is a sentence"
|
||||
|
||||
|
||||
def test_spans_root2(en_tokenizer):
|
||||
|
|
|
@ -1,13 +1,11 @@
|
|||
import pytest
|
||||
from thinc.api import Adam
|
||||
from thinc.api import Adam, fix_random_seed
|
||||
from spacy.attrs import NORM
|
||||
from spacy.vocab import Vocab
|
||||
|
||||
from spacy.gold import Example
|
||||
from spacy.pipeline.defaults import default_parser, default_ner
|
||||
from spacy.tokens import Doc
|
||||
from spacy.pipeline import DependencyParser, EntityRecognizer
|
||||
from spacy.util import fix_random_seed
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
|
|
@ -118,6 +118,7 @@ def test_oracle_moves_missing_B(en_vocab):
|
|||
moves.add_action(move_types.index("U"), label)
|
||||
moves.get_oracle_sequence(example)
|
||||
|
||||
|
||||
# We can't easily represent this on a Doc object. Not sure what the best solution
|
||||
# would be, but I don't think it's an important use case?
|
||||
@pytest.mark.xfail(reason="No longer supported")
|
||||
|
@ -208,6 +209,10 @@ def test_train_empty():
|
|||
]
|
||||
|
||||
nlp = English()
|
||||
train_examples = []
|
||||
for t in train_data:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
|
||||
ner = nlp.create_pipe("ner")
|
||||
ner.add_label("PERSON")
|
||||
nlp.add_pipe(ner, last=True)
|
||||
|
@ -215,10 +220,9 @@ def test_train_empty():
|
|||
nlp.begin_training()
|
||||
for itn in range(2):
|
||||
losses = {}
|
||||
batches = util.minibatch(train_data)
|
||||
batches = util.minibatch(train_examples)
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(train_data, losses=losses)
|
||||
nlp.update(batch, losses=losses)
|
||||
|
||||
|
||||
def test_overwrite_token():
|
||||
|
@ -327,7 +331,9 @@ def test_overfitting_IO():
|
|||
# Simple test to try and quickly overfit the NER component - ensuring the ML models work correctly
|
||||
nlp = English()
|
||||
ner = nlp.create_pipe("ner")
|
||||
for _, annotations in TRAIN_DATA:
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
for ent in annotations.get("entities"):
|
||||
ner.add_label(ent[2])
|
||||
nlp.add_pipe(ner)
|
||||
|
@ -335,7 +341,7 @@ def test_overfitting_IO():
|
|||
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["ner"] < 0.00001
|
||||
|
||||
# test the trained model
|
||||
|
|
|
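The test changes above all follow the same migration: training batches are now lists of `Example` objects built with `Example.from_dict`, not `(text, annotation)` tuples. A hedged sketch of that loop follows; the PERSON training sentence and the iteration count are made up for illustration.

# Illustrative sketch, not part of the diff: the Example-based training loop.
import spacy
from spacy.gold import Example

TRAIN_DATA = [("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]})]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
train_examples = []
for text, annotations in TRAIN_DATA:
    train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])
nlp.add_pipe(ner)

optimizer = nlp.begin_training()
for i in range(10):
    losses = {}
    nlp.update(train_examples, sgd=optimizer, losses=losses)  # batch of Example objects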
@ -3,6 +3,7 @@ import pytest
|
|||
from spacy.lang.en import English
|
||||
from ..util import get_doc, apply_transition_sequence, make_tempdir
|
||||
from ... import util
|
||||
from ...gold import Example
|
||||
|
||||
TRAIN_DATA = [
|
||||
(
|
||||
|
@ -91,6 +92,7 @@ def test_parser_merge_pp(en_tokenizer):
|
|||
assert doc[2].text == "another phrase"
|
||||
assert doc[3].text == "occurs"
|
||||
|
||||
|
||||
# We removed the step_through API a while ago. we should bring it back though
|
||||
@pytest.mark.xfail(reason="Unsupported")
|
||||
def test_parser_arc_eager_finalize_state(en_tokenizer, en_parser):
|
||||
|
@ -188,7 +190,9 @@ def test_overfitting_IO():
|
|||
# Simple test to try and quickly overfit the dependency parser - ensuring the ML models work correctly
|
||||
nlp = English()
|
||||
parser = nlp.create_pipe("parser")
|
||||
for _, annotations in TRAIN_DATA:
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
for dep in annotations.get("deps", []):
|
||||
parser.add_label(dep)
|
||||
nlp.add_pipe(parser)
|
||||
|
@ -196,7 +200,7 @@ def test_overfitting_IO():
|
|||
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["parser"] < 0.00001
|
||||
|
||||
# test the trained model
|
||||
|
|
|
@ -3,6 +3,7 @@ import pytest
|
|||
from spacy.kb import KnowledgeBase
|
||||
|
||||
from spacy import util
|
||||
from spacy.gold import Example
|
||||
from spacy.lang.en import English
|
||||
from spacy.pipeline import EntityRuler
|
||||
from spacy.tests.util import make_tempdir
|
||||
|
@ -283,11 +284,10 @@ def test_overfitting_IO():
|
|||
nlp.add_pipe(ruler)
|
||||
|
||||
# Convert the texts to docs to make sure we have doc.ents set for the training examples
|
||||
TRAIN_DOCS = []
|
||||
train_examples = []
|
||||
for text, annotation in TRAIN_DATA:
|
||||
doc = nlp(text)
|
||||
annotation_clean = annotation
|
||||
TRAIN_DOCS.append((doc, annotation_clean))
|
||||
train_examples.append(Example.from_dict(doc, annotation))
|
||||
|
||||
# create artificial KB - assign same prior weight to the two russ cochran's
|
||||
# Q2146908 (Russ Cochran): American golfer
|
||||
|
@ -309,7 +309,7 @@ def test_overfitting_IO():
|
|||
optimizer = nlp.begin_training()
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(TRAIN_DOCS, sgd=optimizer, losses=losses)
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["entity_linker"] < 0.001
|
||||
|
||||
# test the trained model
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
import pytest
|
||||
|
||||
from spacy import util
|
||||
from spacy.gold import Example
|
||||
from spacy.lang.en import English
|
||||
from spacy.language import Language
|
||||
from spacy.tests.util import make_tempdir
|
||||
|
@ -33,7 +34,9 @@ def test_overfitting_IO():
|
|||
# Simple test to try and quickly overfit the morphologizer - ensuring the ML models work correctly
|
||||
nlp = English()
|
||||
morphologizer = nlp.create_pipe("morphologizer")
|
||||
train_examples = []
|
||||
for inst in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(inst[0]), inst[1]))
|
||||
for morph, pos in zip(inst[1]["morphs"], inst[1]["pos"]):
|
||||
morphologizer.add_label(morph + "|POS=" + pos)
|
||||
nlp.add_pipe(morphologizer)
|
||||
|
@ -41,7 +44,7 @@ def test_overfitting_IO():
|
|||
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["morphologizer"] < 0.00001
|
||||
|
||||
# test the trained model
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
import pytest
|
||||
|
||||
from spacy import util
|
||||
from spacy.gold import Example
|
||||
from spacy.lang.en import English
|
||||
from spacy.language import Language
|
||||
from spacy.tests.util import make_tempdir
|
||||
|
@ -34,12 +35,15 @@ def test_overfitting_IO():
|
|||
# Simple test to try and quickly overfit the senter - ensuring the ML models work correctly
|
||||
nlp = English()
|
||||
senter = nlp.create_pipe("senter")
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
nlp.add_pipe(senter)
|
||||
optimizer = nlp.begin_training()
|
||||
|
||||
for i in range(200):
|
||||
losses = {}
|
||||
nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["senter"] < 0.001
|
||||
|
||||
# test the trained model
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
import pytest
|
||||
|
||||
from spacy import util
|
||||
from spacy.gold import Example
|
||||
from spacy.lang.en import English
|
||||
from spacy.language import Language
|
||||
from spacy.tests.util import make_tempdir
|
||||
|
@ -28,12 +29,15 @@ def test_overfitting_IO():
|
|||
tagger = nlp.create_pipe("tagger")
|
||||
for tag, values in TAG_MAP.items():
|
||||
tagger.add_label(tag, values)
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
nlp.add_pipe(tagger)
|
||||
optimizer = nlp.begin_training()
|
||||
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["tagger"] < 0.00001
|
||||
|
||||
# test the trained model
|
||||
|
|
|
@ -1,18 +1,18 @@
|
|||
import pytest
|
||||
import random
|
||||
import numpy.random
|
||||
|
||||
from thinc.api import fix_random_seed
|
||||
from spacy import util
|
||||
from spacy.lang.en import English
|
||||
from spacy.language import Language
|
||||
from spacy.pipeline import TextCategorizer
|
||||
from spacy.tokens import Doc
|
||||
from spacy.util import fix_random_seed
|
||||
from spacy.pipeline.defaults import default_tok2vec
|
||||
|
||||
from ..util import make_tempdir
|
||||
from spacy.pipeline.defaults import default_tok2vec
|
||||
from ...gold import Example
|
||||
|
||||
|
||||
TRAIN_DATA = [
|
||||
("I'm so happy.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
|
||||
("I'm so angry", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
|
||||
|
@ -85,7 +85,9 @@ def test_overfitting_IO():
|
|||
fix_random_seed(0)
|
||||
nlp = English()
|
||||
textcat = nlp.create_pipe("textcat")
|
||||
for _, annotations in TRAIN_DATA:
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
for label, value in annotations.get("cats").items():
|
||||
textcat.add_label(label)
|
||||
nlp.add_pipe(textcat)
|
||||
|
@ -93,7 +95,7 @@ def test_overfitting_IO():
|
|||
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["textcat"] < 0.01
|
||||
|
||||
# test the trained model
|
||||
|
@ -134,11 +136,13 @@ def test_textcat_configs(textcat_config):
|
|||
pipe_config = {"model": textcat_config}
|
||||
nlp = English()
|
||||
textcat = nlp.create_pipe("textcat", pipe_config)
|
||||
for _, annotations in TRAIN_DATA:
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
for label, value in annotations.get("cats").items():
|
||||
textcat.add_label(label)
|
||||
nlp.add_pipe(textcat)
|
||||
optimizer = nlp.begin_training()
|
||||
for i in range(5):
|
||||
losses = {}
|
||||
nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
|
|
|
@ -1,5 +1,6 @@
|
|||
import pytest
|
||||
from spacy import displacy
|
||||
from spacy.gold import Example
|
||||
from spacy.lang.en import English
|
||||
from spacy.lang.ja import Japanese
|
||||
from spacy.lang.xx import MultiLanguage
|
||||
|
@ -141,10 +142,10 @@ def test_issue2800():
|
|||
"""Test issue that arises when too many labels are added to NER model.
|
||||
Used to cause segfault.
|
||||
"""
|
||||
train_data = []
|
||||
train_data.extend([("One sentence", {"entities": []})])
|
||||
entity_types = [str(i) for i in range(1000)]
|
||||
nlp = English()
|
||||
train_data = []
|
||||
train_data.extend([Example.from_dict(nlp.make_doc("One sentence"), {"entities": []})])
|
||||
entity_types = [str(i) for i in range(1000)]
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner)
|
||||
for entity_type in list(entity_types):
|
||||
|
@ -153,8 +154,8 @@ def test_issue2800():
|
|||
for i in range(20):
|
||||
losses = {}
|
||||
random.shuffle(train_data)
|
||||
for statement, entities in train_data:
|
||||
nlp.update((statement, entities), sgd=optimizer, losses=losses, drop=0.5)
|
||||
for example in train_data:
|
||||
nlp.update([example], sgd=optimizer, losses=losses, drop=0.5)
|
||||
|
||||
|
||||
def test_issue2822(it_tokenizer):
|
||||
|
|
|
@ -9,7 +9,6 @@ from spacy.vocab import Vocab
|
|||
from spacy.attrs import ENT_IOB, ENT_TYPE
|
||||
from spacy.compat import pickle
|
||||
from spacy import displacy
|
||||
from spacy.util import decaying
|
||||
import numpy
|
||||
|
||||
from spacy.vectors import Vectors
|
||||
|
@ -216,21 +215,6 @@ def test_issue3345():
|
|||
assert ner.moves.is_valid(state, "B-GPE")
|
||||
|
||||
|
||||
def test_issue3410():
|
||||
texts = ["Hello world", "This is a test"]
|
||||
nlp = English()
|
||||
matcher = Matcher(nlp.vocab)
|
||||
phrasematcher = PhraseMatcher(nlp.vocab)
|
||||
with pytest.deprecated_call():
|
||||
docs = list(nlp.pipe(texts, n_threads=4))
|
||||
with pytest.deprecated_call():
|
||||
docs = list(nlp.tokenizer.pipe(texts, n_threads=4))
|
||||
with pytest.deprecated_call():
|
||||
list(matcher.pipe(docs, n_threads=4))
|
||||
with pytest.deprecated_call():
|
||||
list(phrasematcher.pipe(docs, n_threads=4))
|
||||
|
||||
|
||||
def test_issue3412():
|
||||
data = numpy.asarray([[0, 0, 0], [1, 2, 3], [9, 8, 7]], dtype="f")
|
||||
vectors = Vectors(data=data, keys=["A", "B", "C"])
|
||||
|
@ -240,16 +224,6 @@ def test_issue3412():
|
|||
assert best_rows[0] == 2
|
||||
|
||||
|
||||
def test_issue3447():
|
||||
sizes = decaying(10.0, 1.0, 0.5)
|
||||
size = next(sizes)
|
||||
assert size == 10.0
|
||||
size = next(sizes)
|
||||
assert size == 10.0 - 0.5
|
||||
size = next(sizes)
|
||||
assert size == 10.0 - 0.5 - 0.5
|
||||
|
||||
|
||||
@pytest.mark.xfail(reason="default suffix rules avoid one upper-case letter before dot")
|
||||
def test_issue3449():
|
||||
nlp = English()
|
||||
|
|
|
@ -1,5 +1,7 @@
|
|||
import spacy
|
||||
from spacy.util import minibatch, compounding
|
||||
from spacy.util import minibatch
|
||||
from thinc.api import compounding
|
||||
from spacy.gold import Example
|
||||
|
||||
|
||||
def test_issue3611():
|
||||
|
@ -12,15 +14,15 @@ def test_issue3611():
|
|||
]
|
||||
y_train = ["offensive", "offensive", "inoffensive"]
|
||||
|
||||
# preparing the data
|
||||
pos_cats = list()
|
||||
for train_instance in y_train:
|
||||
pos_cats.append({label: label == train_instance for label in unique_classes})
|
||||
train_data = list(zip(x_train, [{"cats": cats} for cats in pos_cats]))
|
||||
|
||||
# set up the spacy model with a text categorizer component
|
||||
nlp = spacy.blank("en")
|
||||
|
||||
# preparing the data
|
||||
train_data = []
|
||||
for text, train_instance in zip(x_train, y_train):
|
||||
cat_dict = {label: label == train_instance for label in unique_classes}
|
||||
train_data.append(Example.from_dict(nlp.make_doc(text), {"cats": cat_dict}))
|
||||
|
||||
# add a text categorizer component
|
||||
textcat = nlp.create_pipe(
|
||||
"textcat",
|
||||
config={"exclusive_classes": True, "architecture": "bow", "ngram_size": 2},
|
||||
|
|
|
@ -1,5 +1,7 @@
|
|||
import spacy
|
||||
from spacy.util import minibatch, compounding
|
||||
from spacy.util import minibatch
|
||||
from thinc.api import compounding
|
||||
from spacy.gold import Example
|
||||
|
||||
|
||||
def test_issue4030():
|
||||
|
@ -12,15 +14,15 @@ def test_issue4030():
|
|||
]
|
||||
y_train = ["offensive", "offensive", "inoffensive"]
|
||||
|
||||
# preparing the data
|
||||
pos_cats = list()
|
||||
for train_instance in y_train:
|
||||
pos_cats.append({label: label == train_instance for label in unique_classes})
|
||||
train_data = list(zip(x_train, [{"cats": cats} for cats in pos_cats]))
|
||||
|
||||
# set up the spacy model with a text categorizer component
|
||||
nlp = spacy.blank("en")
|
||||
|
||||
# preparing the data
|
||||
train_data = []
|
||||
for text, train_instance in zip(x_train, y_train):
|
||||
cat_dict = {label: label == train_instance for label in unique_classes}
|
||||
train_data.append(Example.from_dict(nlp.make_doc(text), {"cats": cat_dict}))
|
||||
|
||||
# add a text categorizer component
|
||||
textcat = nlp.create_pipe(
|
||||
"textcat",
|
||||
config={"exclusive_classes": True, "architecture": "bow", "ngram_size": 2},
|
||||
|
|
|
@ -1,5 +1,7 @@
|
|||
from spacy.gold import Example
|
||||
from spacy.lang.en import English
|
||||
from spacy.util import minibatch, compounding
|
||||
from spacy.util import minibatch
|
||||
from thinc.api import compounding
|
||||
import pytest
|
||||
|
||||
|
||||
|
@ -7,9 +9,10 @@ import pytest
|
|||
def test_issue4348():
|
||||
"""Test that training the tagger with empty data, doesn't throw errors"""
|
||||
|
||||
TRAIN_DATA = [("", {"tags": []}), ("", {"tags": []})]
|
||||
|
||||
nlp = English()
|
||||
example = Example.from_dict(nlp.make_doc(""), {"tags": []})
|
||||
TRAIN_DATA = [example, example]
|
||||
|
||||
tagger = nlp.create_pipe("tagger")
|
||||
nlp.add_pipe(tagger)
|
||||
|
||||
|
|
|
@ -8,10 +8,11 @@ from ...tokens import DocBin
|
|||
|
||||
def test_issue4402():
|
||||
nlp = English()
|
||||
attrs = ["ORTH", "SENT_START", "ENT_IOB", "ENT_TYPE"]
|
||||
with make_tempdir() as tmpdir:
|
||||
output_file = tmpdir / "test4402.spacy"
|
||||
docs = json2docs([json_data])
|
||||
data = DocBin(docs=docs, attrs =["ORTH", "SENT_START", "ENT_IOB", "ENT_TYPE"]).to_bytes()
|
||||
data = DocBin(docs=docs, attrs=attrs).to_bytes()
|
||||
with output_file.open("wb") as file_:
|
||||
file_.write(data)
|
||||
corpus = Corpus(train_loc=str(output_file), dev_loc=str(output_file))
|
||||
|
@ -25,74 +26,73 @@ def test_issue4402():
|
|||
assert len(split_train_data) == 4
|
||||
|
||||
|
||||
json_data =\
|
||||
{
|
||||
"id": 0,
|
||||
"paragraphs": [
|
||||
{
|
||||
"raw": "How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven.",
|
||||
"sentences": [
|
||||
{
|
||||
"tokens": [
|
||||
{"id": 0, "orth": "How", "ner": "O"},
|
||||
{"id": 1, "orth": "should", "ner": "O"},
|
||||
{"id": 2, "orth": "I", "ner": "O"},
|
||||
{"id": 3, "orth": "cook", "ner": "O"},
|
||||
{"id": 4, "orth": "bacon", "ner": "O"},
|
||||
{"id": 5, "orth": "in", "ner": "O"},
|
||||
{"id": 6, "orth": "an", "ner": "O"},
|
||||
{"id": 7, "orth": "oven", "ner": "O"},
|
||||
{"id": 8, "orth": "?", "ner": "O"},
|
||||
],
|
||||
"brackets": [],
|
||||
},
|
||||
{
|
||||
"tokens": [
|
||||
{"id": 9, "orth": "\n", "ner": "O"},
|
||||
{"id": 10, "orth": "I", "ner": "O"},
|
||||
{"id": 11, "orth": "'ve", "ner": "O"},
|
||||
{"id": 12, "orth": "heard", "ner": "O"},
|
||||
{"id": 13, "orth": "of", "ner": "O"},
|
||||
{"id": 14, "orth": "people", "ner": "O"},
|
||||
{"id": 15, "orth": "cooking", "ner": "O"},
|
||||
{"id": 16, "orth": "bacon", "ner": "O"},
|
||||
{"id": 17, "orth": "in", "ner": "O"},
|
||||
{"id": 18, "orth": "an", "ner": "O"},
|
||||
{"id": 19, "orth": "oven", "ner": "O"},
|
||||
{"id": 20, "orth": ".", "ner": "O"},
|
||||
],
|
||||
"brackets": [],
|
||||
},
|
||||
],
|
||||
"cats": [
|
||||
{"label": "baking", "value": 1.0},
|
||||
{"label": "not_baking", "value": 0.0},
|
||||
],
|
||||
},
|
||||
{
|
||||
"raw": "What is the difference between white and brown eggs?\n",
|
||||
"sentences": [
|
||||
{
|
||||
"tokens": [
|
||||
{"id": 0, "orth": "What", "ner": "O"},
|
||||
{"id": 1, "orth": "is", "ner": "O"},
|
||||
{"id": 2, "orth": "the", "ner": "O"},
|
||||
{"id": 3, "orth": "difference", "ner": "O"},
|
||||
{"id": 4, "orth": "between", "ner": "O"},
|
||||
{"id": 5, "orth": "white", "ner": "O"},
|
||||
{"id": 6, "orth": "and", "ner": "O"},
|
||||
{"id": 7, "orth": "brown", "ner": "O"},
|
||||
{"id": 8, "orth": "eggs", "ner": "O"},
|
||||
{"id": 9, "orth": "?", "ner": "O"},
|
||||
],
|
||||
"brackets": [],
|
||||
},
|
||||
{"tokens": [{"id": 10, "orth": "\n", "ner": "O"}], "brackets": []},
|
||||
],
|
||||
"cats": [
|
||||
{"label": "baking", "value": 0.0},
|
||||
{"label": "not_baking", "value": 1.0},
|
||||
],
|
||||
},
|
||||
],
|
||||
}
|
||||
json_data = {
|
||||
"id": 0,
|
||||
"paragraphs": [
|
||||
{
|
||||
"raw": "How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven.",
|
||||
"sentences": [
|
||||
{
|
||||
"tokens": [
|
||||
{"id": 0, "orth": "How", "ner": "O"},
|
||||
{"id": 1, "orth": "should", "ner": "O"},
|
||||
{"id": 2, "orth": "I", "ner": "O"},
|
||||
{"id": 3, "orth": "cook", "ner": "O"},
|
||||
{"id": 4, "orth": "bacon", "ner": "O"},
|
||||
{"id": 5, "orth": "in", "ner": "O"},
|
||||
{"id": 6, "orth": "an", "ner": "O"},
|
||||
{"id": 7, "orth": "oven", "ner": "O"},
|
||||
{"id": 8, "orth": "?", "ner": "O"},
|
||||
],
|
||||
"brackets": [],
|
||||
},
|
||||
{
|
||||
"tokens": [
|
||||
{"id": 9, "orth": "\n", "ner": "O"},
|
||||
{"id": 10, "orth": "I", "ner": "O"},
|
||||
{"id": 11, "orth": "'ve", "ner": "O"},
|
||||
{"id": 12, "orth": "heard", "ner": "O"},
|
||||
{"id": 13, "orth": "of", "ner": "O"},
|
||||
{"id": 14, "orth": "people", "ner": "O"},
|
||||
{"id": 15, "orth": "cooking", "ner": "O"},
|
||||
{"id": 16, "orth": "bacon", "ner": "O"},
|
||||
{"id": 17, "orth": "in", "ner": "O"},
|
||||
{"id": 18, "orth": "an", "ner": "O"},
|
||||
{"id": 19, "orth": "oven", "ner": "O"},
|
||||
{"id": 20, "orth": ".", "ner": "O"},
|
||||
],
|
||||
"brackets": [],
|
||||
},
|
||||
],
|
||||
"cats": [
|
||||
{"label": "baking", "value": 1.0},
|
||||
{"label": "not_baking", "value": 0.0},
|
||||
],
|
||||
},
|
||||
{
|
||||
"raw": "What is the difference between white and brown eggs?\n",
|
||||
"sentences": [
|
||||
{
|
||||
"tokens": [
|
||||
{"id": 0, "orth": "What", "ner": "O"},
|
||||
{"id": 1, "orth": "is", "ner": "O"},
|
||||
{"id": 2, "orth": "the", "ner": "O"},
|
||||
{"id": 3, "orth": "difference", "ner": "O"},
|
||||
{"id": 4, "orth": "between", "ner": "O"},
|
||||
{"id": 5, "orth": "white", "ner": "O"},
|
||||
{"id": 6, "orth": "and", "ner": "O"},
|
||||
{"id": 7, "orth": "brown", "ner": "O"},
|
||||
{"id": 8, "orth": "eggs", "ner": "O"},
|
||||
{"id": 9, "orth": "?", "ner": "O"},
|
||||
],
|
||||
"brackets": [],
|
||||
},
|
||||
{"tokens": [{"id": 10, "orth": "\n", "ner": "O"}], "brackets": []},
|
||||
],
|
||||
"cats": [
|
||||
{"label": "baking", "value": 0.0},
|
||||
{"label": "not_baking", "value": 1.0},
|
||||
],
|
||||
},
|
||||
],
|
||||
}
|
||||
|
|
|
@ -1,7 +1,8 @@
|
|||
from spacy.gold import Example
|
||||
from spacy.language import Language
|
||||
|
||||
|
||||
def test_issue4924():
|
||||
nlp = Language()
|
||||
docs_golds = [("", {})]
|
||||
nlp.evaluate(docs_golds)
|
||||
example = Example.from_dict(nlp.make_doc(""), {})
|
||||
nlp.evaluate([example])
|
||||
|
|
|
@ -52,10 +52,6 @@ def test_serialize_doc_exclude(en_vocab):
|
|||
assert not new_doc.user_data
|
||||
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(exclude=["user_data"]))
|
||||
assert not new_doc.user_data
|
||||
with pytest.raises(ValueError):
|
||||
doc.to_bytes(user_data=False)
|
||||
with pytest.raises(ValueError):
|
||||
Doc(en_vocab).from_bytes(doc.to_bytes(), tensor=False)
|
||||
|
||||
|
||||
def test_serialize_doc_bin():
|
||||
|
@ -75,3 +71,19 @@ def test_serialize_doc_bin():
|
|||
for i, doc in enumerate(reloaded_docs):
|
||||
assert doc.text == texts[i]
|
||||
assert doc.cats == cats
|
||||
|
||||
|
||||
def test_serialize_doc_bin_unknown_spaces(en_vocab):
|
||||
doc1 = Doc(en_vocab, words=["that", "'s"])
|
||||
assert doc1.has_unknown_spaces
|
||||
assert doc1.text == "that 's "
|
||||
doc2 = Doc(en_vocab, words=["that", "'s"], spaces=[False, False])
|
||||
assert not doc2.has_unknown_spaces
|
||||
assert doc2.text == "that's"
|
||||
|
||||
doc_bin = DocBin().from_bytes(DocBin(docs=[doc1, doc2]).to_bytes())
|
||||
re_doc1, re_doc2 = doc_bin.get_docs(en_vocab)
|
||||
assert re_doc1.has_unknown_spaces
|
||||
assert re_doc1.text == "that 's "
|
||||
assert not re_doc2.has_unknown_spaces
|
||||
assert re_doc2.text == "that's"
|
||||
|
|
|
@ -62,7 +62,3 @@ def test_serialize_language_exclude(meta_data):
|
|||
assert not new_nlp.meta["name"] == name
|
||||
new_nlp = Language().from_bytes(nlp.to_bytes(exclude=["meta"]))
|
||||
assert not new_nlp.meta["name"] == name
|
||||
with pytest.raises(ValueError):
|
||||
nlp.to_bytes(meta=False)
|
||||
with pytest.raises(ValueError):
|
||||
Language().from_bytes(nlp.to_bytes(), meta=False)
|
||||
|
|
|
@ -127,10 +127,6 @@ def test_serialize_pipe_exclude(en_vocab, Parser):
|
|||
parser.to_bytes(exclude=["cfg"]), exclude=["vocab"]
|
||||
)
|
||||
assert "foo" not in new_parser.cfg
|
||||
with pytest.raises(ValueError):
|
||||
parser.to_bytes(cfg=False, exclude=["vocab"])
|
||||
with pytest.raises(ValueError):
|
||||
get_new_parser().from_bytes(parser.to_bytes(exclude=["vocab"]), cfg=False)
|
||||
|
||||
|
||||
def test_serialize_sentencerecognizer(en_vocab):
|
||||
|
|
|
@ -1,14 +1,10 @@
|
|||
import pytest
|
||||
|
||||
from spacy.gold import docs_to_json
|
||||
from spacy.gold.converters import iob2docs, conll_ner2docs
|
||||
from spacy.gold.converters.conllu2json import conllu2json
|
||||
from spacy.gold import docs_to_json, biluo_tags_from_offsets
|
||||
from spacy.gold.converters import iob2docs, conll_ner2docs, conllu2docs
|
||||
from spacy.lang.en import English
|
||||
from spacy.cli.pretrain import make_docs
|
||||
|
||||
# TODO
|
||||
# from spacy.gold.converters import conllu2docs
|
||||
|
||||
|
||||
def test_cli_converters_conllu2json():
|
||||
# from NorNE: https://github.com/ltgoslo/norne/blob/3d23274965f513f23aa48455b28b1878dad23c05/ud/nob/no_bokmaal-ud-dev.conllu
|
||||
|
@ -19,8 +15,9 @@ def test_cli_converters_conllu2json():
|
|||
"4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tO",
|
||||
]
|
||||
input_data = "\n".join(lines)
|
||||
converted = conllu2json(input_data, n_sents=1)
|
||||
assert len(converted) == 1
|
||||
converted_docs = conllu2docs(input_data, n_sents=1)
|
||||
assert len(converted_docs) == 1
|
||||
converted = [docs_to_json(converted_docs)]
|
||||
assert converted[0]["id"] == 0
|
||||
assert len(converted[0]["paragraphs"]) == 1
|
||||
assert len(converted[0]["paragraphs"][0]["sentences"]) == 1
|
||||
|
@ -31,7 +28,11 @@ def test_cli_converters_conllu2json():
|
|||
assert [t["tag"] for t in tokens] == ["NOUN", "PROPN", "PROPN", "VERB"]
|
||||
assert [t["head"] for t in tokens] == [1, 2, -1, 0]
|
||||
assert [t["dep"] for t in tokens] == ["appos", "nsubj", "name", "ROOT"]
|
||||
assert [t["ner"] for t in tokens] == ["O", "B-PER", "L-PER", "O"]
|
||||
ent_offsets = [
|
||||
(e[0], e[1], e[2]) for e in converted[0]["paragraphs"][0]["entities"]
|
||||
]
|
||||
biluo_tags = biluo_tags_from_offsets(converted_docs[0], ent_offsets, missing="O")
|
||||
assert biluo_tags == ["O", "B-PER", "L-PER", "O"]
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
|
@ -55,11 +56,14 @@ def test_cli_converters_conllu2json():
|
|||
)
|
||||
def test_cli_converters_conllu2json_name_ner_map(lines):
|
||||
input_data = "\n".join(lines)
|
||||
converted = conllu2json(input_data, n_sents=1, ner_map={"PER": "PERSON", "BAD": ""})
|
||||
assert len(converted) == 1
|
||||
converted_docs = conllu2docs(
|
||||
input_data, n_sents=1, ner_map={"PER": "PERSON", "BAD": ""}
|
||||
)
|
||||
assert len(converted_docs) == 1
|
||||
converted = [docs_to_json(converted_docs)]
|
||||
assert converted[0]["id"] == 0
|
||||
assert len(converted[0]["paragraphs"]) == 1
|
||||
assert converted[0]["paragraphs"][0]["raw"] == "Dommer FinnEilertsen avstår."
|
||||
assert converted[0]["paragraphs"][0]["raw"] == "Dommer FinnEilertsen avstår. "
|
||||
assert len(converted[0]["paragraphs"][0]["sentences"]) == 1
|
||||
sent = converted[0]["paragraphs"][0]["sentences"][0]
|
||||
assert len(sent["tokens"]) == 5
|
||||
|
@ -68,7 +72,11 @@ def test_cli_converters_conllu2json_name_ner_map(lines):
|
|||
assert [t["tag"] for t in tokens] == ["NOUN", "PROPN", "PROPN", "VERB", "PUNCT"]
|
||||
assert [t["head"] for t in tokens] == [1, 2, -1, 0, -1]
|
||||
assert [t["dep"] for t in tokens] == ["appos", "nsubj", "name", "ROOT", "punct"]
|
||||
assert [t["ner"] for t in tokens] == ["O", "B-PERSON", "L-PERSON", "O", "O"]
|
||||
ent_offsets = [
|
||||
(e[0], e[1], e[2]) for e in converted[0]["paragraphs"][0]["entities"]
|
||||
]
|
||||
biluo_tags = biluo_tags_from_offsets(converted_docs[0], ent_offsets, missing="O")
|
||||
assert biluo_tags == ["O", "B-PERSON", "L-PERSON", "O", "O"]
|
||||
|
||||
|
||||
def test_cli_converters_conllu2json_subtokens():
|
||||
|
@ -82,13 +90,15 @@ def test_cli_converters_conllu2json_subtokens():
|
|||
"5\t.\t$.\tPUNCT\t_\t_\t4\tpunct\t_\tname=O",
|
||||
]
|
||||
input_data = "\n".join(lines)
|
||||
converted = conllu2json(
|
||||
converted_docs = conllu2docs(
|
||||
input_data, n_sents=1, merge_subtokens=True, append_morphology=True
|
||||
)
|
||||
assert len(converted) == 1
|
||||
assert len(converted_docs) == 1
|
||||
converted = [docs_to_json(converted_docs)]
|
||||
|
||||
assert converted[0]["id"] == 0
|
||||
assert len(converted[0]["paragraphs"]) == 1
|
||||
assert converted[0]["paragraphs"][0]["raw"] == "Dommer FE avstår."
|
||||
assert converted[0]["paragraphs"][0]["raw"] == "Dommer FE avstår. "
|
||||
assert len(converted[0]["paragraphs"][0]["sentences"]) == 1
|
||||
sent = converted[0]["paragraphs"][0]["sentences"][0]
|
||||
assert len(sent["tokens"]) == 4
|
||||
|
@ -111,7 +121,11 @@ def test_cli_converters_conllu2json_subtokens():
|
|||
assert [t["lemma"] for t in tokens] == ["dommer", "Finn Eilertsen", "avstå", "$."]
|
||||
assert [t["head"] for t in tokens] == [1, 1, 0, -1]
|
||||
assert [t["dep"] for t in tokens] == ["appos", "nsubj", "ROOT", "punct"]
|
||||
assert [t["ner"] for t in tokens] == ["O", "U-PER", "O", "O"]
|
||||
ent_offsets = [
|
||||
(e[0], e[1], e[2]) for e in converted[0]["paragraphs"][0]["entities"]
|
||||
]
|
||||
biluo_tags = biluo_tags_from_offsets(converted_docs[0], ent_offsets, missing="O")
|
||||
assert biluo_tags == ["O", "U-PER", "O", "O"]
|
||||
|
||||
|
||||
def test_cli_converters_iob2json(en_vocab):
|
||||
|
@ -132,11 +146,11 @@ def test_cli_converters_iob2json(en_vocab):
|
|||
sent = converted["paragraphs"][0]["sentences"][i]
|
||||
assert len(sent["tokens"]) == 8
|
||||
tokens = sent["tokens"]
|
||||
# fmt: off
|
||||
assert [t["orth"] for t in tokens] == ["I", "like", "London", "and", "New", "York", "City", "."]
|
||||
expected = ["I", "like", "London", "and", "New", "York", "City", "."]
|
||||
assert [t["orth"] for t in tokens] == expected
|
||||
assert len(converted_docs[0].ents) == 8
|
||||
for ent in converted_docs[0].ents:
|
||||
assert(ent.text in ["New York City", "London"])
|
||||
assert ent.text in ["New York City", "London"]
|
||||
|
||||
|
||||
def test_cli_converters_conll_ner2json():
|
||||
|
@ -204,7 +218,7 @@ def test_cli_converters_conll_ner2json():
|
|||
# fmt: on
|
||||
assert len(converted_docs[0].ents) == 10
|
||||
for ent in converted_docs[0].ents:
|
||||
assert (ent.text in ["New York City", "London"])
|
||||
assert ent.text in ["New York City", "London"]
|
||||
|
||||
|
||||
def test_pretrain_make_docs():
|
||||
|
|
|
@ -1,13 +1,13 @@
|
|||
from spacy.errors import AlignmentError
|
||||
from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags
|
||||
from spacy.gold import spans_from_biluo_tags, iob_to_biluo, align
|
||||
from spacy.gold import spans_from_biluo_tags, iob_to_biluo
|
||||
from spacy.gold import Corpus, docs_to_json
|
||||
from spacy.gold.example import Example
|
||||
from spacy.gold.converters import json2docs
|
||||
from spacy.lang.en import English
|
||||
from spacy.syntax.nonproj import is_nonproj_tree
|
||||
from spacy.tokens import Doc, DocBin
|
||||
from spacy.util import get_words_and_spaces, compounding, minibatch
|
||||
from spacy.util import get_words_and_spaces, minibatch
|
||||
from thinc.api import compounding
|
||||
import pytest
|
||||
import srsly
|
||||
|
||||
|
@ -161,65 +161,54 @@ def test_example_from_dict_no_ner(en_vocab):
|
|||
ner_tags = example.get_aligned_ner()
|
||||
assert ner_tags == [None, None, None, None]
|
||||
|
||||
|
||||
def test_example_from_dict_some_ner(en_vocab):
|
||||
words = ["a", "b", "c", "d"]
|
||||
spaces = [True, True, False, True]
|
||||
predicted = Doc(en_vocab, words=words, spaces=spaces)
|
||||
example = Example.from_dict(
|
||||
predicted,
|
||||
{
|
||||
"words": words,
|
||||
"entities": ["U-LOC", None, None, None]
|
||||
}
|
||||
predicted, {"words": words, "entities": ["U-LOC", None, None, None]}
|
||||
)
|
||||
ner_tags = example.get_aligned_ner()
|
||||
assert ner_tags == ["U-LOC", None, None, None]
|
||||
|
||||
|
||||
def test_json2docs_no_ner(en_vocab):
|
||||
data = [{
|
||||
"id":1,
|
||||
"paragraphs":[
|
||||
{
|
||||
"sentences":[
|
||||
{
|
||||
"tokens":[
|
||||
{
|
||||
"dep":"nn",
|
||||
"head":1,
|
||||
"tag":"NNP",
|
||||
"orth":"Ms."
|
||||
},
|
||||
{
|
||||
"dep":"nsubj",
|
||||
"head":1,
|
||||
"tag":"NNP",
|
||||
"orth":"Haag"
|
||||
},
|
||||
{
|
||||
"dep":"ROOT",
|
||||
"head":0,
|
||||
"tag":"VBZ",
|
||||
"orth":"plays"
|
||||
},
|
||||
{
|
||||
"dep":"dobj",
|
||||
"head":-1,
|
||||
"tag":"NNP",
|
||||
"orth":"Elianti"
|
||||
},
|
||||
{
|
||||
"dep":"punct",
|
||||
"head":-2,
|
||||
"tag":".",
|
||||
"orth":"."
|
||||
}
|
||||
data = [
|
||||
{
|
||||
"id": 1,
|
||||
"paragraphs": [
|
||||
{
|
||||
"sentences": [
|
||||
{
|
||||
"tokens": [
|
||||
{"dep": "nn", "head": 1, "tag": "NNP", "orth": "Ms."},
|
||||
{
|
||||
"dep": "nsubj",
|
||||
"head": 1,
|
||||
"tag": "NNP",
|
||||
"orth": "Haag",
|
||||
},
|
||||
{
|
||||
"dep": "ROOT",
|
||||
"head": 0,
|
||||
"tag": "VBZ",
|
||||
"orth": "plays",
|
||||
},
|
||||
{
|
||||
"dep": "dobj",
|
||||
"head": -1,
|
||||
"tag": "NNP",
|
||||
"orth": "Elianti",
|
||||
},
|
||||
{"dep": "punct", "head": -2, "tag": ".", "orth": "."},
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}]
|
||||
}
|
||||
],
|
||||
}
|
||||
]
|
||||
docs = json2docs(data)
|
||||
assert len(docs) == 1
|
||||
for doc in docs:
|
||||
|
@ -282,75 +271,76 @@ def test_split_sentences(en_vocab):
|
|||
assert split_examples[1].text == "had loads of fun "
|
||||
|
||||
|
||||
@pytest.mark.xfail(reason="Alignment should be fixed after example refactor")
|
||||
def test_gold_biluo_one_to_many(en_vocab, en_tokenizer):
|
||||
words = ["I", "flew to", "San Francisco Valley", "."]
|
||||
spaces = [True, True, False, False]
|
||||
words = ["Mr. and ", "Mrs. Smith", "flew to", "San Francisco Valley", "."]
|
||||
spaces = [True, True, True, False, False]
|
||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")]
|
||||
gold_words = ["I", "flew", "to", "San", "Francisco", "Valley", "."]
|
||||
prefix = "Mr. and Mrs. Smith flew to "
|
||||
entities = [(len(prefix), len(prefix + "San Francisco Valley"), "LOC")]
|
||||
gold_words = ["Mr. and Mrs. Smith", "flew", "to", "San", "Francisco", "Valley", "."]
|
||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||
ner_tags = example.get_aligned_ner()
|
||||
assert ner_tags == ["O", "O", "U-LOC", "O"]
|
||||
assert ner_tags == ["O", "O", "O", "U-LOC", "O"]
|
||||
|
||||
entities = [
|
||||
(len("I "), len("I flew to"), "ORG"),
|
||||
(len("I flew to "), len("I flew to San Francisco Valley"), "LOC"),
|
||||
(len("Mr. and "), len("Mr. and Mrs. Smith"), "PERSON"), # "Mrs. Smith" is a PERSON
|
||||
(len(prefix), len(prefix + "San Francisco Valley"), "LOC"),
|
||||
]
|
||||
gold_words = ["I", "flew", "to", "San", "Francisco", "Valley", "."]
|
||||
gold_words = ["Mr. and", "Mrs.", "Smith", "flew", "to", "San", "Francisco", "Valley", "."]
|
||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||
ner_tags = example.get_aligned_ner()
|
||||
assert ner_tags == ["O", "U-ORG", "U-LOC", "O"]
|
||||
assert ner_tags == ["O", "U-PERSON", "O", "U-LOC", "O"]
|
||||
|
||||
entities = [
|
||||
(len("I "), len("I flew"), "ORG"),
|
||||
(len("I flew to "), len("I flew to San Francisco Valley"), "LOC"),
|
||||
(len("Mr. and "), len("Mr. and Mrs."), "PERSON"), # "Mrs." is a Person
|
||||
(len(prefix), len(prefix + "San Francisco Valley"), "LOC"),
|
||||
]
|
||||
gold_words = ["I", "flew", "to", "San", "Francisco", "Valley", "."]
|
||||
gold_words = ["Mr. and", "Mrs.", "Smith", "flew", "to", "San", "Francisco", "Valley", "."]
|
||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||
ner_tags = example.get_aligned_ner()
|
||||
assert ner_tags == ["O", None, "U-LOC", "O"]
|
||||
assert ner_tags == ["O", None, "O", "U-LOC", "O"]
|
||||
|
||||
|
||||
def test_gold_biluo_many_to_one(en_vocab, en_tokenizer):
|
||||
words = ["I", "flew", "to", "San", "Francisco", "Valley", "."]
|
||||
words = ["Mr. and", "Mrs.", "Smith", "flew", "to", "San", "Francisco", "Valley", "."]
|
||||
spaces = [True, True, True, True, True, True, True, False, False]
|
||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||
prefix = "Mr. and Mrs. Smith flew to "
|
||||
entities = [(len(prefix), len(prefix + "San Francisco Valley"), "LOC")]
|
||||
gold_words = ["Mr. and Mrs. Smith", "flew to", "San Francisco Valley", "."]
|
||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||
ner_tags = example.get_aligned_ner()
|
||||
assert ner_tags == ["O", "O", "O", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"]
|
||||
|
||||
entities = [
|
||||
(len("Mr. and "), len("Mr. and Mrs. Smith"), "PERSON"), # "Mrs. Smith" is a PERSON
|
||||
(len(prefix), len(prefix + "San Francisco Valley"), "LOC"),
|
||||
]
|
||||
gold_words = ["Mr. and", "Mrs. Smith", "flew to", "San Francisco Valley", "."]
|
||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||
ner_tags = example.get_aligned_ner()
|
||||
assert ner_tags == ["O", "B-PERSON", "L-PERSON", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"]
|
||||
|
||||
|
||||
def test_gold_biluo_misaligned(en_vocab, en_tokenizer):
|
||||
words = ["Mr. and Mrs.", "Smith", "flew", "to", "San Francisco", "Valley", "."]
|
||||
spaces = [True, True, True, True, True, False, False]
|
||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")]
|
||||
gold_words = ["I", "flew to", "San Francisco Valley", "."]
|
||||
prefix = "Mr. and Mrs. Smith flew to "
|
||||
entities = [(len(prefix), len(prefix + "San Francisco Valley"), "LOC")]
|
||||
gold_words = ["Mr.", "and Mrs. Smith", "flew to", "San", "Francisco Valley", "."]
|
||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||
ner_tags = example.get_aligned_ner()
|
||||
assert ner_tags == ["O", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"]
|
||||
assert ner_tags == ["O", "O", "O", "O", "B-LOC", "L-LOC", "O"]
|
||||
|
||||
entities = [
|
||||
(len("I "), len("I flew to"), "ORG"),
|
||||
(len("I flew to "), len("I flew to San Francisco Valley"), "LOC"),
|
||||
(len("Mr. and "), len("Mr. and Mrs. Smith"), "PERSON"), # "Mrs. Smith" is a PERSON
|
||||
(len(prefix), len(prefix + "San Francisco Valley"), "LOC"),
|
||||
]
|
||||
gold_words = ["I", "flew to", "San Francisco Valley", "."]
|
||||
gold_words = ["Mr. and", "Mrs. Smith", "flew to", "San", "Francisco Valley", "."]
|
||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||
ner_tags = example.get_aligned_ner()
|
||||
assert ner_tags == ["O", "B-ORG", "L-ORG", "B-LOC", "I-LOC", "L-LOC", "O"]
|
||||
|
||||
|
||||
@pytest.mark.xfail(reason="Alignment should be fixed after example refactor")
|
||||
def test_gold_biluo_misaligned(en_vocab, en_tokenizer):
|
||||
words = ["I flew", "to", "San Francisco", "Valley", "."]
|
||||
spaces = [True, True, True, False, False]
|
||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")]
|
||||
gold_words = ["I", "flew to", "San", "Francisco Valley", "."]
|
||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||
ner_tags = example.get_aligned_ner()
|
||||
assert ner_tags == ["O", "O", "B-LOC", "L-LOC", "O"]
|
||||
|
||||
entities = [
|
||||
(len("I "), len("I flew to"), "ORG"),
|
||||
(len("I flew to "), len("I flew to San Francisco Valley"), "LOC"),
|
||||
]
|
||||
gold_words = ["I", "flew to", "San", "Francisco Valley", "."]
|
||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||
ner_tags = example.get_aligned_ner()
|
||||
assert ner_tags == [None, None, "B-LOC", "L-LOC", "O"]
|
||||
assert ner_tags == [None, None, "O", "O", "B-LOC", "L-LOC", "O"]
|
||||
|
||||
|
||||
def test_gold_biluo_additional_whitespace(en_vocab, en_tokenizer):
|
||||
|
@ -360,7 +350,8 @@ def test_gold_biluo_additional_whitespace(en_vocab, en_tokenizer):
|
|||
"I flew to San Francisco Valley.",
|
||||
)
|
||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")]
|
||||
prefix = "I flew to "
|
||||
entities = [(len(prefix), len(prefix + "San Francisco Valley"), "LOC")]
|
||||
gold_words = ["I", "flew", " ", "to", "San Francisco Valley", "."]
|
||||
gold_spaces = [True, True, False, True, False, False]
|
||||
example = Example.from_dict(
|
||||
|
@ -522,11 +513,10 @@ def test_make_orth_variants(doc):
|
|||
|
||||
# due to randomness, test only that this runs with no errors for now
|
||||
train_example = next(goldcorpus.train_dataset(nlp))
|
||||
variant_example = make_orth_variants_example(
|
||||
nlp, train_example, orth_variant_level=0.2
|
||||
)
|
||||
make_orth_variants_example(nlp, train_example, orth_variant_level=0.2)
|
||||
|
||||
|
||||
@pytest.mark.skip("Outdated")
|
||||
@pytest.mark.parametrize(
|
||||
"tokens_a,tokens_b,expected",
|
||||
[
|
||||
|
@ -550,12 +540,12 @@ def test_make_orth_variants(doc):
|
|||
([" ", "a"], ["a"], (1, [-1, 0], [1], {}, {})),
|
||||
],
|
||||
)
|
||||
def test_align(tokens_a, tokens_b, expected):
|
||||
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_a, tokens_b)
|
||||
assert (cost, list(a2b), list(b2a), a2b_multi, b2a_multi) == expected
|
||||
def test_align(tokens_a, tokens_b, expected): # noqa
|
||||
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_a, tokens_b) # noqa
|
||||
assert (cost, list(a2b), list(b2a), a2b_multi, b2a_multi) == expected # noqa
|
||||
# check symmetry
|
||||
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_b, tokens_a)
|
||||
assert (cost, list(b2a), list(a2b), b2a_multi, a2b_multi) == expected
|
||||
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_b, tokens_a) # noqa
|
||||
assert (cost, list(b2a), list(a2b), b2a_multi, a2b_multi) == expected # noqa
|
||||
|
||||
|
||||
def test_goldparse_startswith_space(en_tokenizer):
|
||||
|
@ -569,7 +559,7 @@ def test_goldparse_startswith_space(en_tokenizer):
|
|||
doc, {"words": gold_words, "entities": entities, "deps": deps, "heads": heads}
|
||||
)
|
||||
ner_tags = example.get_aligned_ner()
|
||||
assert ner_tags == [None, "U-DATE"]
|
||||
assert ner_tags == ["O", "U-DATE"]
|
||||
assert example.get_aligned("DEP", as_string=True) == [None, "ROOT"]
|
||||
|
||||
|
||||
|
@ -600,7 +590,7 @@ def test_tuple_format_implicit():
|
|||
("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
|
||||
]
|
||||
|
||||
_train(train_data)
|
||||
_train_tuples(train_data)
|
||||
|
||||
|
||||
def test_tuple_format_implicit_invalid():
|
||||
|
@ -616,20 +606,24 @@ def test_tuple_format_implicit_invalid():
|
|||
]
|
||||
|
||||
with pytest.raises(KeyError):
|
||||
_train(train_data)
|
||||
_train_tuples(train_data)
|
||||
|
||||
|
||||
def _train(train_data):
|
||||
def _train_tuples(train_data):
|
||||
nlp = English()
|
||||
ner = nlp.create_pipe("ner")
|
||||
ner.add_label("ORG")
|
||||
ner.add_label("LOC")
|
||||
nlp.add_pipe(ner)
|
||||
|
||||
train_examples = []
|
||||
for t in train_data:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
|
||||
optimizer = nlp.begin_training()
|
||||
for i in range(5):
|
||||
losses = {}
|
||||
batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
|
||||
batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
nlp.update(batch, sgd=optimizer, losses=losses)
|
||||
|
||||
|
|
|
@ -5,6 +5,7 @@ from spacy.tokens import Doc, Span
|
|||
from spacy.vocab import Vocab
|
||||
|
||||
from .util import add_vecs_to_vocab, assert_docs_equal
|
||||
from ..gold import Example
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
@ -23,26 +24,45 @@ def test_language_update(nlp):
|
|||
annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
|
||||
wrongkeyannots = {"LABEL": True}
|
||||
doc = Doc(nlp.vocab, words=text.split(" "))
|
||||
# Update with text and dict
|
||||
nlp.update((text, annots))
|
||||
example = Example.from_dict(doc, annots)
|
||||
nlp.update([example])
|
||||
|
||||
# Not allowed to call with just one Example
|
||||
with pytest.raises(TypeError):
|
||||
nlp.update(example)
|
||||
|
||||
# Update with text and dict: not supported anymore since v.3
|
||||
with pytest.raises(TypeError):
|
||||
nlp.update((text, annots))
|
||||
# Update with doc object and dict
|
||||
nlp.update((doc, annots))
|
||||
# Update badly
|
||||
with pytest.raises(TypeError):
|
||||
nlp.update((doc, annots))
|
||||
|
||||
# Create examples badly
|
||||
with pytest.raises(ValueError):
|
||||
nlp.update((doc, None))
|
||||
example = Example.from_dict(doc, None)
|
||||
with pytest.raises(KeyError):
|
||||
nlp.update((text, wrongkeyannots))
|
||||
example = Example.from_dict(doc, wrongkeyannots)
|
||||
|
||||
|
||||
def test_language_evaluate(nlp):
|
||||
text = "hello world"
|
||||
annots = {"doc_annotation": {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}}
|
||||
doc = Doc(nlp.vocab, words=text.split(" "))
|
||||
# Evaluate with text and dict
|
||||
nlp.evaluate([(text, annots)])
|
||||
example = Example.from_dict(doc, annots)
|
||||
nlp.evaluate([example])
|
||||
|
||||
# Not allowed to call with just one Example
|
||||
with pytest.raises(TypeError):
|
||||
nlp.evaluate(example)
|
||||
|
||||
# Evaluate with text and dict: not supported anymore since v.3
|
||||
with pytest.raises(TypeError):
|
||||
nlp.evaluate([(text, annots)])
|
||||
# Evaluate with doc object and dict
|
||||
nlp.evaluate([(doc, annots)])
|
||||
with pytest.raises(Exception):
|
||||
with pytest.raises(TypeError):
|
||||
nlp.evaluate([(doc, annots)])
|
||||
with pytest.raises(TypeError):
|
||||
nlp.evaluate([text, annots])
|
||||
|
||||
|
||||
|
@ -56,8 +76,9 @@ def test_evaluate_no_pipe(nlp):
|
|||
text = "hello world"
|
||||
annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
|
||||
nlp = Language(Vocab())
|
||||
doc = nlp(text)
|
||||
nlp.add_pipe(pipe)
|
||||
nlp.evaluate([(text, annots)])
|
||||
nlp.evaluate([Example.from_dict(doc, annots)])
|
||||
|
||||
|
||||
def vector_modification_pipe(doc):
|
||||
|
|
|
@ -55,7 +55,7 @@ def test_aligned_tags():
|
|||
predicted = Doc(vocab, words=pred_words)
|
||||
example = Example.from_dict(predicted, annots)
|
||||
aligned_tags = example.get_aligned("tag", as_string=True)
|
||||
assert aligned_tags == ["VERB", "DET", None, "SCONJ", "PRON", "VERB", "VERB"]
|
||||
assert aligned_tags == ["VERB", "DET", "NOUN", "SCONJ", "PRON", "VERB", "VERB"]
|
||||
|
||||
|
||||
def test_aligned_tags_multi():
|
||||
|
@ -230,8 +230,7 @@ def test_Example_from_dict_with_links(annots):
|
|||
[
|
||||
{
|
||||
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
|
||||
"entities": [(7, 15, "LOC"), (20, 26, "LOC")],
|
||||
"links": {(0, 1): {"Q7381115": 1.0, "Q2146908": 0.0}},
|
||||
"links": {(7, 14): {"Q7381115": 1.0, "Q2146908": 0.0}},
|
||||
}
|
||||
],
|
||||
)
|
||||
|
|
|
@ -26,8 +26,6 @@ cdef class Tokenizer:
|
|||
cdef int _property_init_count
|
||||
cdef int _property_init_max
|
||||
|
||||
cpdef Doc tokens_from_list(self, list strings)
|
||||
|
||||
cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases)
|
||||
cdef int _apply_special_cases(self, Doc doc) except -1
|
||||
cdef void _filter_special_spans(self, vector[SpanC] &original,
|
||||
|
|
|
@ -140,10 +140,6 @@ cdef class Tokenizer:
|
|||
self.url_match)
|
||||
return (self.__class__, args, None, None)
|
||||
|
||||
cpdef Doc tokens_from_list(self, list strings):
|
||||
warnings.warn(Warnings.W002, DeprecationWarning)
|
||||
return Doc(self.vocab, words=strings)
|
||||
|
||||
def __call__(self, unicode string):
|
||||
"""Tokenize a string.
|
||||
|
||||
|
@ -218,7 +214,7 @@ cdef class Tokenizer:
|
|||
doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws
|
||||
return doc
|
||||
|
||||
def pipe(self, texts, batch_size=1000, n_threads=-1):
|
||||
def pipe(self, texts, batch_size=1000):
|
||||
"""Tokenize a stream of texts.
|
||||
|
||||
texts: A sequence of unicode texts.
|
||||
|
@ -228,8 +224,6 @@ cdef class Tokenizer:
|
|||
|
||||
DOCS: https://spacy.io/api/tokenizer#pipe
|
||||
"""
|
||||
if n_threads != -1:
|
||||
warnings.warn(Warnings.W016, DeprecationWarning)
|
||||
for text in texts:
|
||||
yield self(text)
|
||||
|
||||
|
@ -746,7 +740,7 @@ cdef class Tokenizer:
|
|||
self.from_bytes(bytes_data, **kwargs)
|
||||
return self
|
||||
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
def to_bytes(self, exclude=tuple()):
|
||||
"""Serialize the current state to a binary string.
|
||||
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
|
@ -763,10 +757,9 @@ cdef class Tokenizer:
|
|||
"url_match": lambda: _get_regex_pattern(self.url_match),
|
||||
"exceptions": lambda: dict(sorted(self._rules.items()))
|
||||
}
|
||||
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
def from_bytes(self, bytes_data, exclude=tuple()):
|
||||
"""Load state from a binary string.
|
||||
|
||||
bytes_data (bytes): The data to load from.
|
||||
|
@ -785,7 +778,6 @@ cdef class Tokenizer:
|
|||
"url_match": lambda b: data.setdefault("url_match", b),
|
||||
"exceptions": lambda b: data.setdefault("rules", b)
|
||||
}
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
if "prefix_search" in data and isinstance(data["prefix_search"], str):
|
||||
self.prefix_search = re.compile(data["prefix_search"]).search
|
||||
|
|
|
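The `to_bytes`/`from_bytes` (and `pipe`) signatures above drop the deprecated `**kwargs`/`n_threads` arguments; from the caller's side a serialization round trip is unchanged. A minimal sketch (the blank English pipeline and example text are assumptions, not taken from the diff):

```python
import spacy

nlp = spacy.blank("en")

# Serialize the tokenizer state (exceptions, prefix/suffix/infix patterns,
# url_match) and restore it into the same pipeline.
tok_bytes = nlp.tokenizer.to_bytes()
nlp.tokenizer.from_bytes(tok_bytes)
assert nlp("Hello, world!").text == "Hello, world!"
```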
@ -8,8 +8,9 @@ from ..tokens import Doc
|
|||
from ..attrs import SPACY, ORTH, intify_attr
|
||||
from ..errors import Errors
|
||||
|
||||
|
||||
ALL_ATTRS = ("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "LEMMA", "MORPH")
|
||||
# fmt: off
|
||||
ALL_ATTRS = ("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "LEMMA", "MORPH", "POS")
|
||||
# fmt: on
|
||||
|
||||
|
||||
class DocBin(object):
|
||||
|
@ -31,6 +32,7 @@ class DocBin(object):
|
|||
"spaces": bytes, # Serialized numpy boolean array with spaces data
|
||||
"lengths": bytes, # Serialized numpy int32 array with the doc lengths
|
||||
"strings": List[unicode] # List of unique strings in the token data
|
||||
"version": str, # DocBin version number
|
||||
}
|
||||
|
||||
Strings for the words, tags, labels etc are represented by 64-bit hashes in
|
||||
|
@ -53,12 +55,14 @@ class DocBin(object):
|
|||
DOCS: https://spacy.io/api/docbin#init
|
||||
"""
|
||||
attrs = sorted([intify_attr(attr) for attr in attrs])
|
||||
self.version = "0.1"
|
||||
self.attrs = [attr for attr in attrs if attr != ORTH and attr != SPACY]
|
||||
self.attrs.insert(0, ORTH) # Ensure ORTH is always attrs[0]
|
||||
self.tokens = []
|
||||
self.spaces = []
|
||||
self.cats = []
|
||||
self.user_data = []
|
||||
self.flags = []
|
||||
self.strings = set()
|
||||
self.store_user_data = store_user_data
|
||||
for doc in docs:
|
||||
|
@ -83,12 +87,15 @@ class DocBin(object):
|
|||
assert array.shape[0] == spaces.shape[0] # this should never happen
|
||||
spaces = spaces.reshape((spaces.shape[0], 1))
|
||||
self.spaces.append(numpy.asarray(spaces, dtype=bool))
|
||||
self.flags.append({"has_unknown_spaces": doc.has_unknown_spaces})
|
||||
for token in doc:
|
||||
self.strings.add(token.text)
|
||||
self.strings.add(token.tag_)
|
||||
self.strings.add(token.lemma_)
|
||||
self.strings.add(token.morph_)
|
||||
self.strings.add(token.dep_)
|
||||
self.strings.add(token.ent_type_)
|
||||
self.strings.add(token.ent_kb_id_)
|
||||
self.cats.append(doc.cats)
|
||||
if self.store_user_data:
|
||||
self.user_data.append(srsly.msgpack_dumps(doc.user_data))
|
||||
|
@ -105,8 +112,11 @@ class DocBin(object):
|
|||
vocab[string]
|
||||
orth_col = self.attrs.index(ORTH)
|
||||
for i in range(len(self.tokens)):
|
||||
flags = self.flags[i]
|
||||
tokens = self.tokens[i]
|
||||
spaces = self.spaces[i]
|
||||
if flags.get("has_unknown_spaces"):
|
||||
spaces = None
|
||||
doc = Doc(vocab, words=tokens[:, orth_col], spaces=spaces)
|
||||
doc = doc.from_array(self.attrs, tokens)
|
||||
doc.cats = self.cats[i]
|
||||
|
@ -130,6 +140,7 @@ class DocBin(object):
|
|||
self.spaces.extend(other.spaces)
|
||||
self.strings.update(other.strings)
|
||||
self.cats.extend(other.cats)
|
||||
self.flags.extend(other.flags)
|
||||
if self.store_user_data:
|
||||
self.user_data.extend(other.user_data)
|
||||
|
||||
|
@ -147,12 +158,14 @@ class DocBin(object):
|
|||
spaces = numpy.vstack(self.spaces) if self.spaces else numpy.asarray([])
|
||||
|
||||
msg = {
|
||||
"version": self.version,
|
||||
"attrs": self.attrs,
|
||||
"tokens": tokens.tobytes("C"),
|
||||
"spaces": spaces.tobytes("C"),
|
||||
"lengths": numpy.asarray(lengths, dtype="int32").tobytes("C"),
|
||||
"strings": list(self.strings),
|
||||
"cats": self.cats,
|
||||
"flags": self.flags,
|
||||
}
|
||||
if self.store_user_data:
|
||||
msg["user_data"] = self.user_data
|
||||
|
@ -178,6 +191,7 @@ class DocBin(object):
|
|||
self.tokens = NumpyOps().unflatten(flat_tokens, lengths)
|
||||
self.spaces = NumpyOps().unflatten(flat_spaces, lengths)
|
||||
self.cats = msg["cats"]
|
||||
self.flags = msg.get("flags", [{} for _ in lengths])
|
||||
if self.store_user_data and "user_data" in msg:
|
||||
self.user_data = list(msg["user_data"])
|
||||
for tokens in self.tokens:
|
||||
|
|
|
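The `DocBin` changes above add a format `version`, per-doc `flags` for unknown spaces, and `POS` to the default attribute set. For orientation, a hedged round-trip sketch (the attribute list and example text are assumptions, not taken from the diff):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp("Ms. Haag plays Elianti .")

# Collect docs and serialize them into a single byte string.
doc_bin = DocBin(attrs=["ORTH", "TAG", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()

# Restore the docs elsewhere, sharing the same vocab.
restored = list(DocBin().from_bytes(bytes_data).get_docs(nlp.vocab))
assert restored[0].text == doc.text
```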
@ -59,11 +59,14 @@ cdef class Doc:
|
|||
cdef public dict user_token_hooks
|
||||
cdef public dict user_span_hooks
|
||||
|
||||
cdef public bint has_unknown_spaces
|
||||
|
||||
cdef public list _py_tokens
|
||||
|
||||
cdef int length
|
||||
cdef int max_length
|
||||
|
||||
|
||||
cdef public object noun_chunks_iterator
|
||||
|
||||
cdef object __weakref__
|
||||
|
|
|
@ -5,6 +5,7 @@ from libc.string cimport memcpy, memset
|
|||
from libc.math cimport sqrt
|
||||
from libc.stdint cimport int32_t, uint64_t
|
||||
|
||||
import copy
|
||||
from collections import Counter
|
||||
import numpy
|
||||
import numpy.linalg
|
||||
|
@ -24,7 +25,7 @@ from ..attrs cimport LENGTH, POS, LEMMA, TAG, MORPH, DEP, HEAD, SPACY, ENT_IOB
|
|||
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
|
||||
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
||||
|
||||
from ..attrs import intify_attrs, IDS
|
||||
from ..attrs import intify_attr, intify_attrs, IDS
|
||||
from ..util import normalize_slice
|
||||
from ..compat import copy_reg, pickle
|
||||
from ..errors import Errors, Warnings
|
||||
|
@ -171,8 +172,7 @@ cdef class Doc:
|
|||
raise ValueError(Errors.E046.format(name=name))
|
||||
return Underscore.doc_extensions.pop(name)
|
||||
|
||||
def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None,
|
||||
orths_and_spaces=None):
|
||||
def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None):
|
||||
"""Create a Doc object.
|
||||
|
||||
vocab (Vocab): A vocabulary object, which must match any models you
|
||||
|
@ -214,28 +214,25 @@ cdef class Doc:
|
|||
self._vector = None
|
||||
self.noun_chunks_iterator = _get_chunker(self.vocab.lang)
|
||||
cdef bint has_space
|
||||
if orths_and_spaces is None and words is not None:
|
||||
if spaces is None:
|
||||
spaces = [True] * len(words)
|
||||
elif len(spaces) != len(words):
|
||||
raise ValueError(Errors.E027)
|
||||
orths_and_spaces = zip(words, spaces)
|
||||
if words is None and spaces is not None:
|
||||
raise ValueError("words must be set if spaces is set")
|
||||
elif spaces is None and words is not None:
|
||||
self.has_unknown_spaces = True
|
||||
else:
|
||||
self.has_unknown_spaces = False
|
||||
words = words if words is not None else []
|
||||
spaces = spaces if spaces is not None else ([True] * len(words))
|
||||
if len(spaces) != len(words):
|
||||
raise ValueError(Errors.E027)
|
||||
cdef const LexemeC* lexeme
|
||||
if orths_and_spaces is not None:
|
||||
orths_and_spaces = list(orths_and_spaces)
|
||||
for orth_space in orths_and_spaces:
|
||||
if isinstance(orth_space, unicode):
|
||||
lexeme = self.vocab.get(self.mem, orth_space)
|
||||
has_space = True
|
||||
elif isinstance(orth_space, bytes):
|
||||
raise ValueError(Errors.E028.format(value=orth_space))
|
||||
elif isinstance(orth_space[0], unicode):
|
||||
lexeme = self.vocab.get(self.mem, orth_space[0])
|
||||
has_space = orth_space[1]
|
||||
else:
|
||||
lexeme = self.vocab.get_by_orth(self.mem, orth_space[0])
|
||||
has_space = orth_space[1]
|
||||
self.push_back(lexeme, has_space)
|
||||
for word, has_space in zip(words, spaces):
|
||||
if isinstance(word, unicode):
|
||||
lexeme = self.vocab.get(self.mem, word)
|
||||
elif isinstance(word, bytes):
|
||||
raise ValueError(Errors.E028.format(value=word))
|
||||
else:
|
||||
lexeme = self.vocab.get_by_orth(self.mem, word)
|
||||
self.push_back(lexeme, has_space)
|
||||
# Tough to decide on policy for this. Is an empty doc tagged and parsed?
|
||||
# There's no information we'd like to add to it, so I guess so?
|
||||
if self.length == 0:
|
||||
|
@ -806,7 +803,7 @@ cdef class Doc:
|
|||
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
|
||||
for id_ in attrs]
|
||||
if array.dtype != numpy.uint64:
|
||||
warnings.warn(Warnings.W028.format(type=array.dtype))
|
||||
warnings.warn(Warnings.W101.format(type=array.dtype))
|
||||
|
||||
if SENT_START in attrs and HEAD in attrs:
|
||||
raise ValueError(Errors.E032)
|
||||
|
@ -882,6 +879,87 @@ cdef class Doc:
|
|||
set_children_from_heads(self.c, length)
|
||||
return self
|
||||
|
||||
@staticmethod
|
||||
def from_docs(docs, ensure_whitespace=True, attrs=None):
|
||||
"""Concatenate multiple Doc objects to form a new one. Raises an error if the `Doc` objects do not all share
|
||||
the same `Vocab`.
|
||||
|
||||
docs (list): A list of Doc objects.
|
||||
ensure_whitespace (bool): Insert a space between two adjacent docs whenever the first doc does not end in whitespace.
|
||||
attrs (list): Optional list of attribute ID ints or attribute name strings.
|
||||
RETURNS (Doc): A doc that contains the concatenated docs, or None if no docs were given.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#from_docs
|
||||
"""
|
||||
if not docs:
|
||||
return None
|
||||
|
||||
vocab = {doc.vocab for doc in docs}
|
||||
if len(vocab) > 1:
|
||||
raise ValueError(Errors.E999)
|
||||
(vocab,) = vocab
|
||||
|
||||
if attrs is None:
|
||||
attrs = [LEMMA, NORM]
|
||||
if all(doc.is_nered for doc in docs):
|
||||
attrs.extend([ENT_IOB, ENT_KB_ID, ENT_TYPE])
|
||||
# TODO: separate for is_morphed?
|
||||
if all(doc.is_tagged for doc in docs):
|
||||
attrs.extend([TAG, POS, MORPH])
|
||||
if all(doc.is_parsed for doc in docs):
|
||||
attrs.extend([HEAD, DEP])
|
||||
else:
|
||||
attrs.append(SENT_START)
|
||||
else:
|
||||
if any(isinstance(attr, str) for attr in attrs): # resolve attribute names
|
||||
attrs = [intify_attr(attr) for attr in attrs] # intify_attr returns None for invalid attrs
|
||||
attrs = list(attr for attr in set(attrs) if attr) # filter duplicates, remove None if present
|
||||
if SPACY not in attrs:
|
||||
attrs.append(SPACY)
|
||||
|
||||
concat_words = []
|
||||
concat_spaces = []
|
||||
concat_user_data = {}
|
||||
char_offset = 0
|
||||
for doc in docs:
|
||||
concat_words.extend(t.text for t in doc)
|
||||
concat_spaces.extend(bool(t.whitespace_) for t in doc)
|
||||
|
||||
for key, value in doc.user_data.items():
|
||||
if isinstance(key, tuple) and len(key) == 4:
|
||||
data_type, name, start, end = key
|
||||
if start is not None or end is not None:
|
||||
start += char_offset
|
||||
if end is not None:
|
||||
end += char_offset
|
||||
concat_user_data[(data_type, name, start, end)] = copy.copy(value)
|
||||
else:
|
||||
warnings.warn(Warnings.W101.format(name=name))
|
||||
else:
|
||||
warnings.warn(Warnings.W102.format(key=key, value=value))
|
||||
char_offset += len(doc.text) if not ensure_whitespace or doc[-1].is_space else len(doc.text) + 1
|
||||
|
||||
arrays = [doc.to_array(attrs) for doc in docs]
|
||||
|
||||
if ensure_whitespace:
|
||||
spacy_index = attrs.index(SPACY)
|
||||
for i, array in enumerate(arrays[:-1]):
|
||||
if len(array) > 0 and not docs[i][-1].is_space:
|
||||
array[-1][spacy_index] = 1
|
||||
token_offset = -1
|
||||
for doc in docs[:-1]:
|
||||
token_offset += len(doc)
|
||||
if not doc[-1].is_space:
|
||||
concat_spaces[token_offset] = True
|
||||
|
||||
concat_array = numpy.concatenate(arrays)
|
||||
|
||||
concat_doc = Doc(vocab, words=concat_words, spaces=concat_spaces, user_data=concat_user_data)
|
||||
|
||||
concat_doc.from_array(attrs, concat_array)
|
||||
|
||||
return concat_doc
|
||||
|
||||
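`Doc.from_docs` above concatenates docs that share a vocab, optionally padding a space between them. A hedged usage sketch (the texts are assumptions; `ensure_whitespace` defaults to `True` as in the signature above):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc1 = nlp("Mr. and Mrs. Smith flew to San Francisco Valley.")
doc2 = nlp("They had loads of fun.")

# A space is inserted after doc1 because it does not end in whitespace.
merged = Doc.from_docs([doc1, doc2])
assert merged.text == doc1.text + " " + doc2.text
```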
def get_lca_matrix(self):
|
||||
"""Calculates a matrix of Lowest Common Ancestors (LCA) for a given
|
||||
`Doc`, where LCA[i, j] is the index of the lowest common ancestor among
|
||||
|
@ -905,6 +983,7 @@ cdef class Doc:
|
|||
other.is_parsed = self.is_parsed
|
||||
other.is_morphed = self.is_morphed
|
||||
other.sentiment = self.sentiment
|
||||
other.has_unknown_spaces = self.has_unknown_spaces
|
||||
other.user_hooks = dict(self.user_hooks)
|
||||
other.user_token_hooks = dict(self.user_token_hooks)
|
||||
other.user_span_hooks = dict(self.user_span_hooks)
|
||||
|
@ -1000,10 +1079,8 @@ cdef class Doc:
|
|||
"sentiment": lambda: self.sentiment,
|
||||
"tensor": lambda: self.tensor,
|
||||
"cats": lambda: self.cats,
|
||||
"has_unknown_spaces": lambda: self.has_unknown_spaces
|
||||
}
|
||||
for key in kwargs:
|
||||
if key in serializers or key in ("user_data", "user_data_keys", "user_data_values"):
|
||||
raise ValueError(Errors.E128.format(arg=key))
|
||||
if "user_data" not in exclude and self.user_data:
|
||||
user_data_keys, user_data_values = list(zip(*self.user_data.items()))
|
||||
if "user_data_keys" not in exclude:
|
||||
|
@ -1032,10 +1109,8 @@ cdef class Doc:
|
|||
"cats": lambda b: None,
|
||||
"user_data_keys": lambda b: None,
|
||||
"user_data_values": lambda b: None,
|
||||
"has_unknown_spaces": lambda b: None
|
||||
}
|
||||
for key in kwargs:
|
||||
if key in deserializers or key in ("user_data",):
|
||||
raise ValueError(Errors.E128.format(arg=key))
|
||||
# Msgpack doesn't distinguish between lists and tuples, which is
|
||||
# vexing for user data. As a best guess, we *know* that within
|
||||
# keys, we must have tuples. In values we just have to hope
|
||||
|
@ -1052,6 +1127,8 @@ cdef class Doc:
|
|||
self.tensor = msg["tensor"]
|
||||
if "cats" not in exclude and "cats" in msg:
|
||||
self.cats = msg["cats"]
|
||||
if "has_unknown_spaces" not in exclude and "has_unknown_spaces" in msg:
|
||||
self.has_unknown_spaces = msg["has_unknown_spaces"]
|
||||
start = 0
|
||||
cdef const LexemeC* lex
|
||||
cdef unicode orth_
|
||||
|
@ -1123,50 +1200,6 @@ cdef class Doc:
|
|||
remove_label_if_necessary(attributes[i])
|
||||
retokenizer.merge(span, attributes[i])
|
||||
|
||||
def merge(self, int start_idx, int end_idx, *args, **attributes):
|
||||
"""Retokenize the document, such that the span at
|
||||
`doc.text[start_idx : end_idx]` is merged into a single token. If
|
||||
`start_idx` and `end_idx` do not mark start and end token boundaries,
|
||||
the document remains unchanged.
|
||||
|
||||
start_idx (int): Character index of the start of the slice to merge.
|
||||
end_idx (int): Character index after the end of the slice to merge.
|
||||
**attributes: Attributes to assign to the merged token. By default,
|
||||
attributes are inherited from the syntactic root of the span.
|
||||
RETURNS (Token): The newly merged token, or `None` if the start and end
|
||||
indices did not fall at token boundaries.
|
||||
"""
|
||||
cdef unicode tag, lemma, ent_type
|
||||
warnings.warn(Warnings.W013.format(obj="Doc"), DeprecationWarning)
|
||||
# TODO: ENT_KB_ID ?
|
||||
if len(args) == 3:
|
||||
warnings.warn(Warnings.W003, DeprecationWarning)
|
||||
tag, lemma, ent_type = args
|
||||
attributes[TAG] = tag
|
||||
attributes[LEMMA] = lemma
|
||||
attributes[ENT_TYPE] = ent_type
|
||||
elif not args:
|
||||
fix_attributes(self, attributes)
|
||||
elif args:
|
||||
raise ValueError(Errors.E034.format(n_args=len(args), args=repr(args),
|
||||
kwargs=repr(attributes)))
|
||||
remove_label_if_necessary(attributes)
|
||||
attributes = intify_attrs(attributes, strings_map=self.vocab.strings)
|
||||
cdef int start = token_by_start(self.c, self.length, start_idx)
|
||||
if start == -1:
|
||||
return None
|
||||
cdef int end = token_by_end(self.c, self.length, end_idx)
|
||||
if end == -1:
|
||||
return None
|
||||
# Currently we have the token index, we want the range-end index
|
||||
end += 1
|
||||
with self.retokenize() as retokenizer:
|
||||
retokenizer.merge(self[start:end], attrs=attributes)
|
||||
return self[start]
|
||||
|
||||
def print_tree(self, light=False, flat=False):
|
||||
raise ValueError(Errors.E105)
|
||||
|
||||
def to_json(self, underscore=None):
|
||||
"""Convert a Doc to JSON. The format it produces will be the new format
|
||||
for the `spacy train` command (not implemented yet).
|
||||
|
|
|
@ -280,18 +280,6 @@ cdef class Span:
|
|||
|
||||
return array
|
||||
|
||||
def merge(self, *args, **attributes):
|
||||
"""Retokenize the document, such that the span is merged into a single
|
||||
token.
|
||||
|
||||
**attributes: Attributes to assign to the merged token. By default,
|
||||
attributes are inherited from the syntactic root token of the span.
|
||||
RETURNS (Token): The newly merged token.
|
||||
"""
|
||||
warnings.warn(Warnings.W013.format(obj="Span"), DeprecationWarning)
|
||||
return self.doc.merge(self.start_char, self.end_char, *args,
|
||||
**attributes)
|
||||
|
||||
def get_lca_matrix(self):
|
||||
"""Calculates a matrix of Lowest Common Ancestors (LCA) for a given
|
||||
`Span`, where LCA[i, j] is the index of the lowest common ancestor among
|
||||
|
@ -698,21 +686,6 @@ cdef class Span:
|
|||
"""RETURNS (str): The span's lemma."""
|
||||
return " ".join([t.lemma_ for t in self]).strip()
|
||||
|
||||
@property
|
||||
def upper_(self):
|
||||
"""Deprecated. Use `Span.text.upper()` instead."""
|
||||
return "".join([t.text_with_ws.upper() for t in self]).strip()
|
||||
|
||||
@property
|
||||
def lower_(self):
|
||||
"""Deprecated. Use `Span.text.lower()` instead."""
|
||||
return "".join([t.text_with_ws.lower() for t in self]).strip()
|
||||
|
||||
@property
|
||||
def string(self):
|
||||
"""Deprecated: Use `Span.text_with_ws` instead."""
|
||||
return "".join([t.text_with_ws for t in self])
|
||||
|
||||
property label_:
|
||||
"""RETURNS (str): The span's label."""
|
||||
def __get__(self):
|
||||
|
|
|
@ -237,11 +237,6 @@ cdef class Token:
|
|||
index into tables, e.g. for word vectors."""
|
||||
return self.c.lex.id
|
||||
|
||||
@property
|
||||
def string(self):
|
||||
"""Deprecated: Use Token.text_with_ws instead."""
|
||||
return self.text_with_ws
|
||||
|
||||
@property
|
||||
def text(self):
|
||||
"""RETURNS (str): The original verbatim text of the token."""
|
||||
|
|
106  spacy/util.py
|
@ -4,9 +4,8 @@ import importlib
|
|||
import importlib.util
|
||||
import re
|
||||
from pathlib import Path
|
||||
import random
|
||||
import thinc
|
||||
from thinc.api import NumpyOps, get_current_ops, Adam, require_gpu, Config
|
||||
from thinc.api import NumpyOps, get_current_ops, Adam, Config
|
||||
import functools
|
||||
import itertools
|
||||
import numpy.random
|
||||
|
@ -34,6 +33,13 @@ try: # Python 3.8
|
|||
except ImportError:
|
||||
import importlib_metadata
|
||||
|
||||
# These are functions that were previously (v2.x) available from spacy.util
|
||||
# and have since moved to Thinc. We're importing them here so people's code
|
||||
# doesn't break, but they should always be imported from Thinc from now on,
|
||||
# not from spacy.util.
|
||||
from thinc.api import fix_random_seed, compounding, decaying # noqa: F401
|
||||
|
||||
|
||||
from .symbols import ORTH
|
||||
from .compat import cupy, CudaStream, is_windows
|
||||
from .errors import Errors, Warnings
|
||||
|
@ -595,15 +601,8 @@ def compile_prefix_regex(entries):
|
|||
entries (tuple): The prefix rules, e.g. spacy.lang.punctuation.TOKENIZER_PREFIXES.
|
||||
RETURNS (regex object): The regex object. to be used for Tokenizer.prefix_search.
|
||||
"""
|
||||
if "(" in entries:
|
||||
# Handle deprecated data
|
||||
expression = "|".join(
|
||||
["^" + re.escape(piece) for piece in entries if piece.strip()]
|
||||
)
|
||||
return re.compile(expression)
|
||||
else:
|
||||
expression = "|".join(["^" + piece for piece in entries if piece.strip()])
|
||||
return re.compile(expression)
|
||||
expression = "|".join(["^" + piece for piece in entries if piece.strip()])
|
||||
return re.compile(expression)
|
||||
|
||||
|
||||
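With the deprecated escaping branch removed, `compile_prefix_regex` simply anchors and joins the entries, which are expected to already be regex pieces. A hedged sketch of the usual customization pattern (the extra `\*\*` prefix and the blank pipeline are assumptions):

```python
import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.blank("en")

# Add a custom prefix rule ahead of the defaults and rebuild prefix_search.
prefixes = [r"\*\*"] + list(nlp.Defaults.prefixes)
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

doc = nlp("**important")
assert doc[0].text == "**"  # the custom prefix is split off first
```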
def compile_suffix_regex(entries):
|
||||
|
@ -723,59 +722,6 @@ def minibatch(items, size=8):
|
|||
yield list(batch)
|
||||
|
||||
|
||||
def compounding(start, stop, compound):
|
||||
"""Yield an infinite series of compounding values. Each time the
|
||||
generator is called, a value is produced by multiplying the previous
|
||||
value by the compound rate.
|
||||
|
||||
EXAMPLE:
|
||||
>>> sizes = compounding(1., 10., 1.5)
|
||||
>>> assert next(sizes) == 1.
|
||||
>>> assert next(sizes) == 1 * 1.5
|
||||
>>> assert next(sizes) == 1.5 * 1.5
|
||||
"""
|
||||
|
||||
def clip(value):
|
||||
return max(value, stop) if (start > stop) else min(value, stop)
|
||||
|
||||
curr = float(start)
|
||||
while True:
|
||||
yield clip(curr)
|
||||
curr *= compound
|
||||
|
||||
|
||||
def stepping(start, stop, steps):
|
||||
"""Yield an infinite series of values that step from a start value to a
|
||||
final value over some number of steps. Each step is (stop-start)/steps.
|
||||
|
||||
After the final value is reached, the generator continues yielding that
|
||||
value.
|
||||
|
||||
EXAMPLE:
|
||||
>>> sizes = stepping(1., 200., 100)
|
||||
>>> assert next(sizes) == 1.
|
||||
>>> assert next(sizes) == 1 + (200.-1.) / 100
|
||||
>>> assert next(sizes) == 1 + (200.-1.) / 100 + (200.-1.) / 100
|
||||
"""
|
||||
|
||||
def clip(value):
|
||||
return max(value, stop) if (start > stop) else min(value, stop)
|
||||
|
||||
curr = float(start)
|
||||
while True:
|
||||
yield clip(curr)
|
||||
curr += (stop - start) / steps
|
||||
|
||||
|
||||
def decaying(start, stop, decay):
|
||||
"""Yield an infinite series of linearly decaying values."""
|
||||
|
||||
curr = float(start)
|
||||
while True:
|
||||
yield max(curr, stop)
|
||||
curr -= decay
|
||||
|
||||
|
||||
def minibatch_by_words(docs, size, tolerance=0.2, discard_oversize=False):
|
||||
"""Create minibatches of roughly a given number of words. If any examples
|
||||
are longer than the specified batch length, they will appear in a batch by
|
||||
|
@ -854,35 +800,6 @@ def minibatch_by_words(docs, size, tolerance=0.2, discard_oversize=False):
|
|||
yield batch
|
||||
|
||||
|
||||
def itershuffle(iterable, bufsize=1000):
|
||||
"""Shuffle an iterator. This works by holding `bufsize` items back
|
||||
and yielding them sometime later. Obviously, this is not unbiased –
|
||||
but should be good enough for batching. Larger bufsize means less bias.
|
||||
From https://gist.github.com/andres-erbsen/1307752
|
||||
|
||||
iterable (iterable): Iterator to shuffle.
|
||||
bufsize (int): Items to hold back.
|
||||
YIELDS (iterable): The shuffled iterator.
|
||||
"""
|
||||
iterable = iter(iterable)
|
||||
buf = []
|
||||
try:
|
||||
while True:
|
||||
for i in range(random.randint(1, bufsize - len(buf))):
|
||||
buf.append(next(iterable))
|
||||
random.shuffle(buf)
|
||||
for i in range(random.randint(1, bufsize)):
|
||||
if buf:
|
||||
yield buf.pop()
|
||||
else:
|
||||
break
|
||||
except StopIteration:
|
||||
random.shuffle(buf)
|
||||
while buf:
|
||||
yield buf.pop()
|
||||
raise StopIteration
|
||||
|
||||
|
||||
def filter_spans(spans):
|
||||
"""Filter a sequence of spans and remove duplicates or overlaps. Useful for
|
||||
creating named entities (where one token can only be part of one entity) or
|
||||
|
@ -989,6 +906,7 @@ def escape_html(text):
|
|||
return text
|
||||
|
||||
|
||||
<<<<<<< HEAD
|
||||
def use_gpu(gpu_id):
|
||||
return require_gpu(gpu_id)
|
||||
|
||||
|
@ -1025,6 +943,8 @@ def get_serialization_exclude(serializers, exclude, kwargs):
|
|||
return exclude
|
||||
|
||||
|
||||
=======
|
||||
>>>>>>> 19d42f42de30ba57e17427798ea2562cdab2c9f8
|
||||
def get_words_and_spaces(words, text):
|
||||
if "".join("".join(words).split()) != "".join(text.split()):
|
||||
raise ValueError(Errors.E194.format(text=text, words=words))
|
||||
|
|
|
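Only the validation check of `get_words_and_spaces` is visible in this hunk. For orientation, the helper aligns a list of words against the raw text and returns matching `(words, spaces)` lists, raising `E194` when the non-whitespace characters don't match. A hedged sketch:

```python
from spacy.util import get_words_and_spaces

words, spaces = get_words_and_spaces(["Hello", "world", "!"], "Hello world!")
assert words == ["Hello", "world", "!"]
assert spaces == [True, False, False]  # "Hello" is followed by a space, the rest are not
```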
@ -426,7 +426,7 @@ cdef class Vocab:
|
|||
orth = self.strings.add(orth)
|
||||
return orth in self.vectors
|
||||
|
||||
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||
def to_disk(self, path, exclude=tuple()):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
path (unicode or Path): A path to a directory, which will be created if
|
||||
|
@ -439,7 +439,6 @@ cdef class Vocab:
|
|||
if not path.exists():
|
||||
path.mkdir()
|
||||
setters = ["strings", "vectors"]
|
||||
exclude = util.get_serialization_exclude(setters, exclude, kwargs)
|
||||
if "strings" not in exclude:
|
||||
self.strings.to_disk(path / "strings.json")
|
||||
if "vectors" not in "exclude" and self.vectors is not None:
|
||||
|
@ -449,7 +448,7 @@ cdef class Vocab:
|
|||
if "lookups_extra" not in "exclude" and self.lookups_extra is not None:
|
||||
self.lookups_extra.to_disk(path, filename="lookups_extra.bin")
|
||||
|
||||
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||
def from_disk(self, path, exclude=tuple()):
|
||||
"""Loads state from a directory. Modifies the object in place and
|
||||
returns it.
|
||||
|
||||
|
@ -461,7 +460,6 @@ cdef class Vocab:
|
|||
"""
|
||||
path = util.ensure_path(path)
|
||||
getters = ["strings", "vectors"]
|
||||
exclude = util.get_serialization_exclude(getters, exclude, kwargs)
|
||||
if "strings" not in exclude:
|
||||
self.strings.from_disk(path / "strings.json") # TODO: add exclude?
|
||||
if "vectors" not in exclude:
|
||||
|
@ -481,7 +479,7 @@ cdef class Vocab:
|
|||
self._by_orth = PreshMap()
|
||||
return self
|
||||
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
def to_bytes(self, exclude=tuple()):
|
||||
"""Serialize the current state to a binary string.
|
||||
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
|
@ -501,10 +499,9 @@ cdef class Vocab:
|
|||
"lookups": lambda: self.lookups.to_bytes(),
|
||||
"lookups_extra": lambda: self.lookups_extra.to_bytes()
|
||||
}
|
||||
exclude = util.get_serialization_exclude(getters, exclude, kwargs)
|
||||
return util.to_bytes(getters, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
def from_bytes(self, bytes_data, exclude=tuple()):
|
||||
"""Load state from a binary string.
|
||||
|
||||
bytes_data (bytes): The data to load from.
|
||||
|
@ -526,7 +523,6 @@ cdef class Vocab:
|
|||
"lookups": lambda b: self.lookups.from_bytes(b),
|
||||
"lookups_extra": lambda b: self.lookups_extra.from_bytes(b)
|
||||
}
|
||||
exclude = util.get_serialization_exclude(setters, exclude, kwargs)
|
||||
util.from_bytes(bytes_data, setters, exclude)
|
||||
if "lexeme_norm" in self.lookups:
|
||||
self.lex_attr_getters[NORM] = util.add_lookups(
|
||||
|
|
|
@ -1,621 +0,0 @@
|
|||
---
|
||||
title: Annotation Specifications
|
||||
teaser: Schemes used for labels, tags and training data
|
||||
menu:
|
||||
- ['Text Processing', 'text-processing']
|
||||
- ['POS Tagging', 'pos-tagging']
|
||||
- ['Dependencies', 'dependency-parsing']
|
||||
- ['Named Entities', 'named-entities']
|
||||
- ['Models & Training', 'training']
|
||||
---
|
||||
|
||||
## Text processing {#text-processing}
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> from spacy.lang.en import English
|
||||
> nlp = English()
|
||||
> tokens = nlp("Some\\nspaces and\\ttab characters")
|
||||
> tokens_text = [t.text for t in tokens]
|
||||
> assert tokens_text == ["Some", "\\n", "spaces", " ", "and", "\\t", "tab", "characters"]
|
||||
> ```
|
||||
|
||||
Tokenization standards are based on the
|
||||
[OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) corpus. The tokenizer
|
||||
differs from most by including **tokens for significant whitespace**. Any
|
||||
sequence of whitespace characters beyond a single space (`' '`) is included as a
|
||||
token. The whitespace tokens are useful for much the same reason punctuation is
|
||||
– it's often an important delimiter in the text. By preserving it in the token
|
||||
output, we are able to maintain a simple alignment between the tokens and the
|
||||
original string, and we ensure that **no information is lost** during
|
||||
processing.
|
||||
|
||||
### Lemmatization {#lemmatization}
|
||||
|
||||
> #### Examples
|
||||
>
|
||||
> In English, this means:
|
||||
>
|
||||
> - **Adjectives**: happier, happiest → happy
|
||||
> - **Adverbs**: worse, worst → badly
|
||||
> - **Nouns**: dogs, children → dog, child
|
||||
> - **Verbs**: writes, writing, wrote, written → write
|
||||
|
||||
As of v2.2, lemmatization data is stored in a separate package,
|
||||
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) that can
|
||||
be installed if needed via `pip install spacy[lookups]`. Some languages provide
|
||||
full lemmatization rules and exceptions, while other languages currently only
|
||||
rely on simple lookup tables.
|
||||
|
||||
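A hedged illustration of the lemmas described above (assumes an installed English pipeline such as `en_core_web_sm`; the exact output depends on the model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were writing letters")
print([(token.text, token.lemma_) for token in doc])
# e.g. [('The', 'the'), ('children', 'child'), ('were', 'be'),
#       ('writing', 'write'), ('letters', 'letter')]
```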
<Infobox title="About spaCy's custom pronoun lemma for English" variant="warning">
|
||||
|
||||
spaCy adds a **special case for English pronouns**: all English pronouns are
|
||||
lemmatized to the special token `-PRON-`. Unlike verbs and common nouns,
|
||||
there's no clear base form of a personal pronoun. Should the lemma of "me" be
|
||||
"I", or should we normalize person as well, giving "it" — or maybe "he"?
|
||||
spaCy's solution is to introduce a novel symbol, `-PRON-`, which is used as the
|
||||
lemma for all personal pronouns.
|
||||
|
||||
</Infobox>
|
||||
|
||||
### Sentence boundary detection {#sentence-boundary}
|
||||
|
||||
Sentence boundaries are calculated from the syntactic parse tree, so features
|
||||
such as punctuation and capitalization play an important but non-decisive role
|
||||
in determining the sentence boundaries. Usually this means that the sentence
|
||||
boundaries will at least coincide with clause boundaries, even given poorly
|
||||
punctuated text.
|
||||
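Because the boundaries come from the parse, sentences are only available after the parser has run. A hedged example (the model name is an assumption; any pipeline with a dependency parser works):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence This is another sentence")
# Even without final punctuation, the parser can often recover the boundary.
for sent in doc.sents:
    print(sent.text)
```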
|
||||
## Part-of-speech tagging {#pos-tagging}
|
||||
|
||||
> #### Tip: Understanding tags
|
||||
>
|
||||
> You can also use `spacy.explain` to get the description for the string
|
||||
> representation of a tag. For example, `spacy.explain("RB")` will return
|
||||
> "adverb".
|
||||
|
||||
This section lists the fine-grained and coarse-grained part-of-speech tags
|
||||
assigned by spaCy's [models](/models). The individual mapping is specific to the
|
||||
training corpus and can be defined in the respective language data's
|
||||
[`tag_map.py`](/usage/adding-languages#tag-map).
|
||||
|
||||
<Accordion title="Universal Part-of-speech Tags" id="pos-universal">
|
||||
|
||||
spaCy maps all language-specific part-of-speech tags to a small, fixed set of
|
||||
word type tags following the
|
||||
[Universal Dependencies scheme](http://universaldependencies.org/u/pos/). The
|
||||
universal tags don't code for any morphological features and only cover the word
|
||||
type. They're available as the [`Token.pos`](/api/token#attributes) and
|
||||
[`Token.pos_`](/api/token#attributes) attributes.
|
||||
|
||||
| POS | Description | Examples |
|
||||
| ------- | ------------------------- | --------------------------------------------- |
|
||||
| `ADJ` | adjective | big, old, green, incomprehensible, first |
|
||||
| `ADP` | adposition | in, to, during |
|
||||
| `ADV` | adverb | very, tomorrow, down, where, there |
|
||||
| `AUX` | auxiliary | is, has (done), will (do), should (do) |
|
||||
| `CONJ` | conjunction | and, or, but |
|
||||
| `CCONJ` | coordinating conjunction | and, or, but |
|
||||
| `DET` | determiner | a, an, the |
|
||||
| `INTJ` | interjection | psst, ouch, bravo, hello |
|
||||
| `NOUN` | noun | girl, cat, tree, air, beauty |
|
||||
| `NUM` | numeral | 1, 2017, one, seventy-seven, IV, MMXIV |
|
||||
| `PART` | particle | 's, not, |
|
||||
| `PRON` | pronoun | I, you, he, she, myself, themselves, somebody |
|
||||
| `PROPN` | proper noun | Mary, John, London, NATO, HBO |
|
||||
| `PUNCT` | punctuation | ., (, ), ? |
|
||||
| `SCONJ` | subordinating conjunction | if, while, that |
|
||||
| `SYM` | symbol | \$, %, §, ©, +, −, ×, ÷, =, :), 😝 |
|
||||
| `VERB` | verb | run, runs, running, eat, ate, eating |
|
||||
| `X` | other | sfpksdpsxmsa |
|
||||
| `SPACE` | space                     |                                               |
|
||||
|
||||
</Accordion>
|
||||
|
||||
<Accordion title="English" id="pos-en">
|
||||
|
||||
The English part-of-speech tagger uses the
|
||||
[OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) version of the Penn
|
||||
Treebank tag set. We also map the tags to the simpler Universal Dependencies v2
|
||||
POS tag set.
|
||||
|
||||
| Tag | POS | Morphology | Description |
|
||||
| ------------------------------------- | ------- | --------------------------------------- | ----------------------------------------- |
|
||||
| `$` | `SYM` | | symbol, currency |
|
||||
| <InlineCode>``</InlineCode> | `PUNCT` | `PunctType=quot PunctSide=ini` | opening quotation mark |
|
||||
| `''` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark |
|
||||
| `,` | `PUNCT` | `PunctType=comm` | punctuation mark, comma |
|
||||
| `-LRB-` | `PUNCT` | `PunctType=brck PunctSide=ini` | left round bracket |
|
||||
| `-RRB-` | `PUNCT` | `PunctType=brck PunctSide=fin` | right round bracket |
|
||||
| `.` | `PUNCT` | `PunctType=peri` | punctuation mark, sentence closer |
|
||||
| `:` | `PUNCT` | | punctuation mark, colon or ellipsis |
|
||||
| `ADD` | `X` | | email |
|
||||
| `AFX` | `ADJ` | `Hyph=yes` | affix |
|
||||
| `CC` | `CCONJ` | `ConjType=comp` | conjunction, coordinating |
|
||||
| `CD` | `NUM` | `NumType=card` | cardinal number |
|
||||
| `DT` | `DET` | | determiner |
|
||||
| `EX` | `PRON` | `AdvType=ex` | existential there |
|
||||
| `FW` | `X` | `Foreign=yes` | foreign word |
|
||||
| `GW` | `X` | | additional word in multi-word expression |
|
||||
| `HYPH` | `PUNCT` | `PunctType=dash` | punctuation mark, hyphen |
|
||||
| `IN` | `ADP` | | conjunction, subordinating or preposition |
|
||||
| `JJ` | `ADJ` | `Degree=pos` | adjective |
|
||||
| `JJR` | `ADJ` | `Degree=comp` | adjective, comparative |
|
||||
| `JJS` | `ADJ` | `Degree=sup` | adjective, superlative |
|
||||
| `LS` | `X` | `NumType=ord` | list item marker |
|
||||
| `MD` | `VERB` | `VerbType=mod` | verb, modal auxiliary |
|
||||
| `NFP` | `PUNCT` | | superfluous punctuation |
|
||||
| `NIL` | `X` | | missing tag |
|
||||
| `NN` | `NOUN` | `Number=sing` | noun, singular or mass |
|
||||
| `NNP` | `PROPN` | `NounType=prop Number=sing` | noun, proper singular |
|
||||
| `NNPS` | `PROPN` | `NounType=prop Number=plur` | noun, proper plural |
|
||||
| `NNS` | `NOUN` | `Number=plur` | noun, plural |
|
||||
| `PDT` | `DET` | | predeterminer |
|
||||
| `POS` | `PART` | `Poss=yes` | possessive ending |
|
||||
| `PRP` | `PRON` | `PronType=prs` | pronoun, personal |
|
||||
| `PRP$` | `DET` | `PronType=prs Poss=yes` | pronoun, possessive |
|
||||
| `RB` | `ADV` | `Degree=pos` | adverb |
|
||||
| `RBR` | `ADV` | `Degree=comp` | adverb, comparative |
|
||||
| `RBS` | `ADV` | `Degree=sup` | adverb, superlative |
|
||||
| `RP` | `ADP` | | adverb, particle |
|
||||
| `SP` | `SPACE` | | space |
|
||||
| `SYM` | `SYM` | | symbol |
|
||||
| `TO` | `PART` | `PartType=inf VerbForm=inf` | infinitival "to" |
|
||||
| `UH` | `INTJ` | | interjection |
|
||||
| `VB` | `VERB` | `VerbForm=inf` | verb, base form |
|
||||
| `VBD` | `VERB` | `VerbForm=fin Tense=past` | verb, past tense |
|
||||
| `VBG` | `VERB` | `VerbForm=part Tense=pres Aspect=prog` | verb, gerund or present participle |
|
||||
| `VBN` | `VERB` | `VerbForm=part Tense=past Aspect=perf` | verb, past participle |
|
||||
| `VBP` | `VERB` | `VerbForm=fin Tense=pres` | verb, non-3rd person singular present |
|
||||
| `VBZ` | `VERB` | `VerbForm=fin Tense=pres Number=sing Person=three` | verb, 3rd person singular present |
|
||||
| `WDT` | `DET` | | wh-determiner |
|
||||
| `WP` | `PRON` | | wh-pronoun, personal |
|
||||
| `WP$` | `DET` | `Poss=yes` | wh-pronoun, possessive |
|
||||
| `WRB` | `ADV` | | wh-adverb |
|
||||
| `XX` | `X` | | unknown |
|
||||
| `_SP` | `SPACE` | | |
|
||||
</Accordion>
|
||||
|
||||
<Accordion title="German" id="pos-de">
|
||||
|
||||
The German part-of-speech tagger uses the
|
||||
[TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html)
|
||||
annotation scheme. We also map the tags to the simpler Universal Dependencies
|
||||
v2 POS tag set.
|
||||
|
||||
| Tag | POS | Morphology | Description |
|
||||
| --------- | ------- | ---------------------------------------- | ------------------------------------------------- |
|
||||
| `$(` | `PUNCT` | `PunctType=brck` | other sentence-internal punctuation mark |
|
||||
| `$,` | `PUNCT` | `PunctType=comm` | comma |
|
||||
| `$.` | `PUNCT` | `PunctType=peri` | sentence-final punctuation mark |
|
||||
| `ADJA` | `ADJ` | | adjective, attributive |
|
||||
| `ADJD` | `ADJ` | | adjective, adverbial or predicative |
|
||||
| `ADV` | `ADV` | | adverb |
|
||||
| `APPO` | `ADP` | `AdpType=post` | postposition |
|
||||
| `APPR` | `ADP` | `AdpType=prep` | preposition; circumposition left |
|
||||
| `APPRART` | `ADP` | `AdpType=prep PronType=art` | preposition with article |
|
||||
| `APZR` | `ADP` | `AdpType=circ` | circumposition right |
|
||||
| `ART` | `DET` | `PronType=art` | definite or indefinite article |
|
||||
| `CARD` | `NUM` | `NumType=card` | cardinal number |
|
||||
| `FM` | `X` | `Foreign=yes` | foreign language material |
|
||||
| `ITJ` | `INTJ` | | interjection |
|
||||
| `KOKOM` | `CCONJ` | `ConjType=comp` | comparative conjunction |
|
||||
| `KON` | `CCONJ` | | coordinate conjunction |
|
||||
| `KOUI` | `SCONJ` | | subordinate conjunction with "zu" and infinitive |
|
||||
| `KOUS` | `SCONJ` | | subordinate conjunction with sentence |
|
||||
| `NE` | `PROPN` | | proper noun |
|
||||
| `NN` | `NOUN` | | noun, singular or mass |
|
||||
| `NNE` | `PROPN` | | proper noun |
|
||||
| `PDAT` | `DET` | `PronType=dem` | attributive demonstrative pronoun |
|
||||
| `PDS` | `PRON` | `PronType=dem` | substituting demonstrative pronoun |
|
||||
| `PIAT`    | `DET`   | `PronType=ind\|neg\|tot`                 | attributive indefinite pronoun without determiner  |
|
||||
| `PIS`     | `PRON`  | `PronType=ind\|neg\|tot`                 | substituting indefinite pronoun                     |
|
||||
| `PPER` | `PRON` | `PronType=prs` | non-reflexive personal pronoun |
|
||||
| `PPOSAT` | `DET` | `Poss=yes PronType=prs` | attributive possessive pronoun |
|
||||
| `PPOSS` | `PRON` | `Poss=yes PronType=prs` | substituting possessive pronoun |
|
||||
| `PRELAT` | `DET` | `PronType=rel` | attributive relative pronoun |
|
||||
| `PRELS` | `PRON` | `PronType=rel` | substituting relative pronoun |
|
||||
| `PRF` | `PRON` | `PronType=prs Reflex=yes` | reflexive personal pronoun |
|
||||
| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb |
|
||||
| `PTKA` | `PART` | | particle with adjective or adverb |
|
||||
| `PTKANT` | `PART` | `PartType=res` | answer particle |
|
||||
| `PTKNEG` | `PART` | `Polarity=neg` | negative particle |
|
||||
| `PTKVZ` | `ADP` | `PartType=vbp` | separable verbal particle |
|
||||
| `PTKZU` | `PART` | `PartType=inf` | "zu" before infinitive |
|
||||
| `PWAT` | `DET` | `PronType=int` | attributive interrogative pronoun |
|
||||
| `PWAV` | `ADV` | `PronType=int` | adverbial interrogative or relative pronoun |
|
||||
| `PWS` | `PRON` | `PronType=int` | substituting interrogative pronoun |
|
||||
| `TRUNC` | `X` | `Hyph=yes` | word remnant |
|
||||
| `VAFIN` | `AUX` | `Mood=ind VerbForm=fin` | finite verb, auxiliary |
|
||||
| `VAIMP` | `AUX` | `Mood=imp VerbForm=fin` | imperative, auxiliary |
|
||||
| `VAINF` | `AUX` | `VerbForm=inf` | infinitive, auxiliary |
|
||||
| `VAPP` | `AUX` | `Aspect=perf VerbForm=part` | perfect participle, auxiliary |
|
||||
| `VMFIN` | `VERB` | `Mood=ind VerbForm=fin VerbType=mod` | finite verb, modal |
|
||||
| `VMINF` | `VERB` | `VerbForm=inf VerbType=mod` | infinitive, modal |
|
||||
| `VMPP` | `VERB` | `Aspect=perf VerbForm=part VerbType=mod` | perfect participle, modal |
|
||||
| `VVFIN` | `VERB` | `Mood=ind VerbForm=fin` | finite verb, full |
|
||||
| `VVIMP` | `VERB` | `Mood=imp VerbForm=fin` | imperative, full |
|
||||
| `VVINF` | `VERB` | `VerbForm=inf` | infinitive, full |
|
||||
| `VVIZU` | `VERB` | `VerbForm=inf` | infinitive with "zu", full |
|
||||
| `VVPP` | `VERB` | `Aspect=perf VerbForm=part` | perfect participle, full |
|
||||
| `XY` | `X` | | non-word containing non-letter |
|
||||
| `_SP` | `SPACE` | | |
|
||||
</Accordion>
|
||||
|
||||
---
|
||||
|
||||
<Infobox title="Annotation schemes for other models">
|
||||
|
||||
For the label schemes used by the other models, see the respective `tag_map.py`
|
||||
in [`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang).
|
||||
|
||||
</Infobox>
|
||||
|
||||
## Syntactic Dependency Parsing {#dependency-parsing}
|
||||
|
||||
> #### Tip: Understanding labels
|
||||
>
|
||||
> You can also use `spacy.explain` to get the description for the string
|
||||
> representation of a label. For example, `spacy.explain("prt")` will return
|
||||
> "particle".
|
||||
|
||||
This section lists the syntactic dependency labels assigned by spaCy's
|
||||
[models](/models). The individual labels are language-specific and depend on the
|
||||
training corpus.
|
||||
|
||||
<Accordion title="Universal Dependency Labels" id="dependency-parsing-universal">
|
||||
|
||||
The [Universal Dependencies scheme](http://universaldependencies.org/u/dep/) is
|
||||
used in all languages trained on Universal Dependency Corpora.
|
||||
|
||||
| Label | Description |
|
||||
| ------------ | -------------------------------------------- |
|
||||
| `acl` | clausal modifier of noun (adjectival clause) |
|
||||
| `advcl` | adverbial clause modifier |
|
||||
| `advmod` | adverbial modifier |
|
||||
| `amod` | adjectival modifier |
|
||||
| `appos` | appositional modifier |
|
||||
| `aux` | auxiliary |
|
||||
| `case` | case marking |
|
||||
| `cc` | coordinating conjunction |
|
||||
| `ccomp` | clausal complement |
|
||||
| `clf` | classifier |
|
||||
| `compound` | compound |
|
||||
| `conj` | conjunct |
|
||||
| `cop` | copula |
|
||||
| `csubj` | clausal subject |
|
||||
| `dep` | unspecified dependency |
|
||||
| `det` | determiner |
|
||||
| `discourse` | discourse element |
|
||||
| `dislocated` | dislocated elements |
|
||||
| `expl` | expletive |
|
||||
| `fixed` | fixed multiword expression |
|
||||
| `flat` | flat multiword expression |
|
||||
| `goeswith` | goes with |
|
||||
| `iobj` | indirect object |
|
||||
| `list` | list |
|
||||
| `mark` | marker |
|
||||
| `nmod` | nominal modifier |
|
||||
| `nsubj` | nominal subject |
|
||||
| `nummod` | numeric modifier |
|
||||
| `obj` | object |
|
||||
| `obl` | oblique nominal |
|
||||
| `orphan` | orphan |
|
||||
| `parataxis` | parataxis |
|
||||
| `punct` | punctuation |
|
||||
| `reparandum` | overridden disfluency |
|
||||
| `root` | root |
|
||||
| `vocative` | vocative |
|
||||
| `xcomp` | open clausal complement |
|
||||
|
||||
</Accordion>
|
||||
|
||||
<Accordion title="English" id="dependency-parsing-english">
|
||||
|
||||
The English dependency labels use the
|
||||
[CLEAR Style](https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md)
|
||||
by [ClearNLP](http://www.clearnlp.com).
|
||||
|
||||
| Label | Description |
|
||||
| ----------- | -------------------------------------------- |
|
||||
| `acl` | clausal modifier of noun (adjectival clause) |
|
||||
| `acomp` | adjectival complement |
|
||||
| `advcl` | adverbial clause modifier |
|
||||
| `advmod` | adverbial modifier |
|
||||
| `agent` | agent |
|
||||
| `amod` | adjectival modifier |
|
||||
| `appos` | appositional modifier |
|
||||
| `attr` | attribute |
|
||||
| `aux` | auxiliary |
|
||||
| `auxpass` | auxiliary (passive) |
|
||||
| `case` | case marking |
|
||||
| `cc` | coordinating conjunction |
|
||||
| `ccomp` | clausal complement |
|
||||
| `compound` | compound |
|
||||
| `conj` | conjunct |
|
||||
| `cop` | copula |
|
||||
| `csubj` | clausal subject |
|
||||
| `csubjpass` | clausal subject (passive) |
|
||||
| `dative` | dative |
|
||||
| `dep` | unclassified dependent |
|
||||
| `det` | determiner |
|
||||
| `dobj` | direct object |
|
||||
| `expl` | expletive |
|
||||
| `intj` | interjection |
|
||||
| `mark` | marker |
|
||||
| `meta` | meta modifier |
|
||||
| `neg` | negation modifier |
|
||||
| `nn` | noun compound modifier |
|
||||
| `nounmod` | modifier of nominal |
|
||||
| `npmod` | noun phrase as adverbial modifier |
|
||||
| `nsubj` | nominal subject |
|
||||
| `nsubjpass` | nominal subject (passive) |
|
||||
| `nummod` | numeric modifier |
|
||||
| `oprd` | object predicate |
|
||||
| `obj` | object |
|
||||
| `obl` | oblique nominal |
|
||||
| `parataxis` | parataxis |
|
||||
| `pcomp` | complement of preposition |
|
||||
| `pobj` | object of preposition |
|
||||
| `poss` | possession modifier |
|
||||
| `preconj` | pre-correlative conjunction |
|
||||
| `prep` | prepositional modifier |
|
||||
| `prt` | particle |
|
||||
| `punct` | punctuation |
|
||||
| `quantmod` | modifier of quantifier |
|
||||
| `relcl` | relative clause modifier |
|
||||
| `root` | root |
|
||||
| `xcomp` | open clausal complement |
|
||||
|
||||
</Accordion>
|
||||
|
||||
<Accordion title="German" id="dependency-parsing-german">
|
||||
|
||||
The German dependency labels use the
|
||||
[TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html)
|
||||
annotation scheme.
|
||||
|
||||
| Label | Description |
|
||||
| ------- | ------------------------------- |
|
||||
| `ac` | adpositional case marker |
|
||||
| `adc` | adjective component |
|
||||
| `ag` | genitive attribute |
|
||||
| `ams` | measure argument of adjective |
|
||||
| `app` | apposition |
|
||||
| `avc` | adverbial phrase component |
|
||||
| `cc` | comparative complement |
|
||||
| `cd` | coordinating conjunction |
|
||||
| `cj` | conjunct |
|
||||
| `cm` | comparative conjunction |
|
||||
| `cp` | complementizer |
|
||||
| `cvc` | collocational verb construction |
|
||||
| `da` | dative |
|
||||
| `dm` | discourse marker |
|
||||
| `ep` | expletive es |
|
||||
| `ju` | junctor |
|
||||
| `mnr` | postnominal modifier |
|
||||
| `mo` | modifier |
|
||||
| `ng` | negation |
|
||||
| `nk` | noun kernel element |
|
||||
| `nmc` | numerical component |
|
||||
| `oa` | accusative object |
|
||||
| `oa2` | second accusative object |
|
||||
| `oc` | clausal object |
|
||||
| `og` | genitive object |
|
||||
| `op` | prepositional object |
|
||||
| `par` | parenthetical element |
|
||||
| `pd` | predicate |
|
||||
| `pg` | phrasal genitive |
|
||||
| `ph` | placeholder |
|
||||
| `pm` | morphological particle |
|
||||
| `pnc` | proper noun component |
|
||||
| `punct` | punctuation |
|
||||
| `rc` | relative clause |
|
||||
| `re` | repeated element |
|
||||
| `rs` | reported speech |
|
||||
| `sb` | subject |
|
||||
| `sbp` | passivized subject (PP) |
|
||||
| `sp` | subject or predicate |
|
||||
| `svp` | separable verb prefix |
|
||||
| `uc` | unit component |
|
||||
| `vo` | vocative |
|
||||
| `ROOT` | root |
|
||||
|
||||
</Accordion>
|
||||
|
||||
## Named Entity Recognition {#named-entities}
|
||||
|
||||
> #### Tip: Understanding entity types
|
||||
>
|
||||
> You can also use `spacy.explain` to get the description for the string
|
||||
> representation of an entity label. For example, `spacy.explain("LANGUAGE")`
|
||||
> will return "any named language".
|
||||
|
||||
Models trained on the [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19)
|
||||
corpus support the following entity types:
|
||||
|
||||
| Type | Description |
|
||||
| ------------- | ---------------------------------------------------- |
|
||||
| `PERSON` | People, including fictional. |
|
||||
| `NORP` | Nationalities or religious or political groups. |
|
||||
| `FAC` | Buildings, airports, highways, bridges, etc. |
|
||||
| `ORG` | Companies, agencies, institutions, etc. |
|
||||
| `GPE` | Countries, cities, states. |
|
||||
| `LOC` | Non-GPE locations, mountain ranges, bodies of water. |
|
||||
| `PRODUCT` | Objects, vehicles, foods, etc. (Not services.) |
|
||||
| `EVENT` | Named hurricanes, battles, wars, sports events, etc. |
|
||||
| `WORK_OF_ART` | Titles of books, songs, etc. |
|
||||
| `LAW` | Named documents made into laws. |
|
||||
| `LANGUAGE` | Any named language. |
|
||||
| `DATE` | Absolute or relative dates or periods. |
|
||||
| `TIME` | Times smaller than a day. |
|
||||
| `PERCENT` | Percentage, including "%". |
|
||||
| `MONEY` | Monetary values, including unit. |
|
||||
| `QUANTITY` | Measurements, as of weight or distance. |
|
||||
| `ORDINAL` | "first", "second", etc. |
|
||||
| `CARDINAL` | Numerals that do not fall under another type. |
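
The predicted label is available as `ent.label_` on the spans in `doc.ents`. A
minimal sketch, assuming the `en_core_web_sm` model (trained on OntoNotes 5) is
installed:

```python
### Accessing OntoNotes entity types
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    # ent.label_ is one of the types listed above, e.g. ORG, GPE or MONEY
    print(ent.text, ent.label_, spacy.explain(ent.label_))
```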
|
||||
|
||||
### Wikipedia scheme {#ner-wikipedia-scheme}
|
||||
|
||||
Models trained on the Wikipedia corpus
|
||||
([Nothman et al., 2013](http://www.sciencedirect.com/science/article/pii/S0004370212000276))
|
||||
use a less fine-grained NER annotation scheme and recognise the following
|
||||
entities:
|
||||
|
||||
| Type | Description |
|
||||
| ------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `PER` | Named person or family. |
|
||||
| `LOC` | Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains). |
|
||||
| `ORG` | Named corporate, governmental, or other organizational entity. |
|
||||
| `MISC` | Miscellaneous entities, e.g. events, nationalities, products or works of art. |
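
Usage is the same as for the OntoNotes types above; only the label set differs.
A sketch, assuming the multi-language `xx_ent_wiki_sm` model, which uses this
scheme, is installed:

```python
### Wikipedia-scheme entity types
import spacy

nlp = spacy.load("xx_ent_wiki_sm")
doc = nlp("The Beatles played the Star-Club in Hamburg in 1962.")
for ent in doc.ents:
    # labels come from the coarser PER / LOC / ORG / MISC scheme
    print(ent.text, ent.label_)
```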
|
||||
|
||||
### IOB Scheme {#iob}

| Tag   | ID  | Description                           |
| ----- | --- | ------------------------------------- |
| `"I"` | `1` | Token is inside an entity.            |
| `"O"` | `2` | Token is outside an entity.           |
| `"B"` | `3` | Token begins an entity.               |
| `""`  | `0` | No entity tag is set (missing value). |
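
The IOB code and tag are exposed on each token as `Token.ent_iob` and
`Token.ent_iob_`. A short sketch, assuming `en_core_web_sm` is installed:

```python
### Reading IOB codes off tokens
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")
for token in doc:
    # ent_iob_ is "B", "I" or "O"; ent_iob is the corresponding ID from the table
    print(token.text, token.ent_iob_, token.ent_iob, token.ent_type_)
```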

### BILUO Scheme {#biluo}

| Tag         | Description                              |
| ----------- | ---------------------------------------- |
| **`B`**EGIN | The first token of a multi-token entity. |
| **`I`**N    | An inner token of a multi-token entity.  |
| **`L`**AST  | The final token of a multi-token entity. |
| **`U`**NIT  | A single-token entity.                   |
| **`O`**UT   | A non-entity token.                      |
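
To see the scheme applied to concrete data, you can convert character offsets
to BILUO tags. This is only a sketch and assumes the v2.x import path
`spacy.gold`:

```python
### Converting entity offsets to BILUO tags
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp("I like London and Berlin.")
entities = [(7, 13, "LOC"), (18, 24, "LOC")]
# Single-token entities become U-LOC, everything else is O
print(biluo_tags_from_offsets(doc, entities))
# ['O', 'O', 'U-LOC', 'O', 'U-LOC', 'O']
```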

> #### Why BILUO, not IOB?
>
> There are several coding schemes for encoding entity annotations as token
> tags. These coding schemes are equally expressive, but not necessarily equally
> learnable. [Ratinov and Roth](http://www.aclweb.org/anthology/W09-1119) showed
> that the minimal **Begin**, **In**, **Out** scheme was more difficult to learn
> than the **BILUO** scheme that we use, which explicitly marks boundary tokens.

spaCy translates the character offsets into this scheme, in order to decide the
cost of each action given the current state of the entity recognizer. The costs
are then used to calculate the gradient of the loss, to train the model. The
exact algorithm is a pastiche of well-known methods, and is not currently
described in any single publication. The model is a greedy transition-based
parser guided by a linear model whose weights are learned using the averaged
perceptron loss, via the
[dynamic oracle](http://www.aclweb.org/anthology/C12-1059) imitation learning
strategy. The transition system is equivalent to the BILUO tagging scheme.

## Models and training data {#training}

### JSON input format for training {#json-input}

spaCy takes training data in JSON format. The built-in
[`convert`](/api/cli#convert) command helps you convert the `.conllu` format
used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies) to
spaCy's training format. To convert one or more existing `Doc` objects to
spaCy's JSON format, you can use the
[`gold.docs_to_json`](/api/goldparse#docs_to_json) helper.

> #### Annotating entities
>
> Named entities are provided in the [BILUO](#biluo) notation. Tokens outside an
> entity are set to `"O"` and tokens that are part of an entity are set to the
> entity label, prefixed by the BILUO marker. For example `"B-ORG"` describes
> the first token of a multi-token `ORG` entity and `"U-PERSON"` a single token
> representing a `PERSON` entity. The
> [`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) function
> can help you convert entity offsets to the right format.

```python
### Example structure
[{
    "id": int,                      # ID of the document within the corpus
    "paragraphs": [{                # list of paragraphs in the corpus
        "raw": string,              # raw text of the paragraph
        "sentences": [{             # list of sentences in the paragraph
            "tokens": [{            # list of tokens in the sentence
                "id": int,          # index of the token in the document
                "dep": string,      # dependency label
                "head": int,        # offset of token head relative to token index
                "tag": string,      # part-of-speech tag
                "orth": string,     # verbatim text of the token
                "ner": string       # BILUO label, e.g. "O" or "B-ORG"
            }],
            "brackets": [{          # phrase structure (NOT USED by current models)
                "first": int,       # index of first token
                "last": int,        # index of last token
                "label": string     # phrase label
            }]
        }],
        "cats": [{                  # new in v2.2: categories for text classifier
            "label": string,        # text category label
            "value": float / bool   # label applies (1.0/true) or not (0.0/false)
        }]
    }]
}]
```
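
If you already have annotated `Doc` objects, the conversion could look roughly
like this. It's a sketch only: it assumes the `gold.docs_to_json` helper
described above, the `en_core_web_sm` model, and a made-up output path.

```python
### Converting Doc objects to training JSON
import spacy
import srsly
from spacy.gold import docs_to_json

nlp = spacy.load("en_core_web_sm")
docs = nlp.pipe(["Apple is looking at buying U.K. startup for $1 billion"])
# docs_to_json groups the docs into one document with the structure shown above
json_data = [docs_to_json(list(docs), id=0)]
srsly.write_json("./training-data.json", json_data)  # hypothetical output path
```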

Here's an example of dependencies, part-of-speech tags and named entities, taken
from the English Wall Street Journal portion of the Penn Treebank:

```json
https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json
```

### Lexical data for vocabulary {#vocab-jsonl new="2"}

To populate a model's vocabulary, you can use the
[`spacy init-model`](/api/cli#init-model) command and load in a
[newline-delimited JSON](http://jsonlines.org/) (JSONL) file containing one
lexical entry per line via the `--jsonl-loc` option. The first line defines the
language and vocabulary settings. All other lines are expected to be JSON
objects describing an individual lexeme. The lexical attributes will then be set
as attributes on spaCy's [`Lexeme`](/api/lexeme#attributes) object. The
`init-model` command outputs a ready-to-use spaCy model with a `Vocab`
containing the lexical data.
|
||||
|
||||
```python
|
||||
### First line
|
||||
{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}
|
||||
```
|
||||
|
||||
```python
|
||||
### Entry structure
|
||||
{
|
||||
"orth": string, # the word text
|
||||
"id": int, # can correspond to row in vectors table
|
||||
"lower": string,
|
||||
"norm": string,
|
||||
"shape": string
|
||||
"prefix": string,
|
||||
"suffix": string,
|
||||
"length": int,
|
||||
"cluster": string,
|
||||
"prob": float,
|
||||
"is_alpha": bool,
|
||||
"is_ascii": bool,
|
||||
"is_digit": bool,
|
||||
"is_lower": bool,
|
||||
"is_punct": bool,
|
||||
"is_space": bool,
|
||||
"is_title": bool,
|
||||
"is_upper": bool,
|
||||
"like_url": bool,
|
||||
"like_num": bool,
|
||||
"like_email": bool,
|
||||
"is_stop": bool,
|
||||
"is_oov": bool,
|
||||
"is_quote": bool,
|
||||
"is_left_punct": bool,
|
||||
"is_right_punct": bool
|
||||
}
|
||||
```
|
||||
|
||||
Here's an example of the 20 most frequent lexemes in the English training data:
|
||||
|
||||
```json
|
||||
https://github.com/explosion/spaCy/tree/master/examples/training/vocab-data.jsonl
|
||||
```
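
One way to produce such a JSONL file is with the
[`srsly`](https://github.com/explosion/srsly) library. The snippet below is a
sketch: the file name and attribute values are made up and only a few of the
attributes listed above are included.

```python
### Writing a JSONL vocab file
import srsly

lines = [
    # first line: language and vocabulary settings
    {"lang": "en", "settings": {"oov_prob": -20.502029418945312}},
    # one JSON object per lexeme
    {"orth": "the", "id": 0, "lower": "the", "norm": "the", "prob": -3.5, "is_alpha": True, "is_stop": True},
    {"orth": ",", "id": 1, "lower": ",", "norm": ",", "prob": -3.6, "is_alpha": False, "is_punct": True},
]
srsly.write_jsonl("./vocab_data.jsonl", lines)
# hypothetical follow-up: python -m spacy init-model en ./model --jsonl-loc ./vocab_data.jsonl
```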
|
7 website/docs/api/architectures.md Normal file
@ -0,0 +1,7 @@
---
|
||||
title: Model Architectures
|
||||
teaser: Pre-defined model architectures included with the core library
|
||||
source: spacy/ml/models
|
||||
---
|
||||
|
||||
TODO: write
|
|
@ -4,7 +4,6 @@ teaser: Download, train and package models, and debug spaCy
|
|||
source: spacy/cli
|
||||
menu:
|
||||
- ['Download', 'download']
|
||||
- ['Link', 'link']
|
||||
- ['Info', 'info']
|
||||
- ['Validate', 'validate']
|
||||
- ['Convert', 'convert']
|
||||
|
@ -14,20 +13,19 @@ menu:
|
|||
- ['Init Model', 'init-model']
|
||||
- ['Evaluate', 'evaluate']
|
||||
- ['Package', 'package']
|
||||
- ['Project', 'project']
|
||||
---
|
||||
|
||||
As of v1.7.0, spaCy comes with new command line helpers to download and link
|
||||
models and show useful debugging information. For a list of available commands,
|
||||
type `spacy --help`.
|
||||
For a list of available commands, type `spacy --help`.
|
||||
|
||||
<!-- TODO: add notes on autocompletion etc. -->
|
||||
|
||||
## Download {#download}
|
||||
|
||||
Download [models](/usage/models) for spaCy. The downloader finds the
|
||||
best-matching compatible version, uses `pip install` to download the model as a
|
||||
package and creates a [shortcut link](/usage/models#usage) if the model was
|
||||
downloaded via a shortcut. Direct downloads don't perform any compatibility
|
||||
checks and require the model name to be specified with its version (e.g.
|
||||
`en_core_web_sm-2.2.0`).
|
||||
best-matching compatible version and uses `pip install` to download the model as
|
||||
a package. Direct downloads don't perform any compatibility checks and require
|
||||
the model name to be specified with its version (e.g. `en_core_web_sm-2.2.0`).
|
||||
|
||||
> #### Downloading best practices
|
||||
>
|
||||
|
@ -43,42 +41,13 @@ checks and require the model name to be specified with its version (e.g.
|
|||
$ python -m spacy download [model] [--direct] [pip args]
|
||||
```
|
||||
|
||||
| Argument | Type | Description |
|
||||
| ------------------------------------- | ------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `model` | positional | Model name or shortcut (`en`, `de`, `en_core_web_sm`). |
|
||||
| `--direct`, `-d` | flag | Force direct download of exact model version. |
|
||||
| pip args <Tag variant="new">2.1</Tag> | - | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory or `--no-deps` to not install model dependencies. |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **CREATES** | directory, symlink | The installed model package in your `site-packages` directory and a shortcut link as a symlink in `spacy/data` if installed via shortcut. |
|
||||
|
||||
## Link {#link}
|
||||
|
||||
Create a [shortcut link](/usage/models#usage) for a model, either a Python
|
||||
package or a local directory. This will let you load models from any location
|
||||
using a custom name via [`spacy.load()`](/api/top-level#spacy.load).
|
||||
|
||||
<Infobox title="Important note" variant="warning">
|
||||
|
||||
In spaCy v1.x, you had to use the model data directory to set up a shortcut link
|
||||
for a local path. As of v2.0, spaCy expects all shortcut links to be **loadable
|
||||
model packages**. If you want to load a data directory, call
|
||||
[`spacy.load()`](/api/top-level#spacy.load) or
|
||||
[`Language.from_disk()`](/api/language#from_disk) with the path, or use the
|
||||
[`package`](/api/cli#package) command to create a model package.
|
||||
|
||||
</Infobox>
|
||||
|
||||
```bash
|
||||
$ python -m spacy link [origin] [link_name] [--force]
|
||||
```
|
||||
|
||||
| Argument | Type | Description |
|
||||
| --------------- | ---------- | --------------------------------------------------------------- |
|
||||
| `origin` | positional | Model name if package, or path to local directory. |
|
||||
| `link_name` | positional | Name of the shortcut link to create. |
|
||||
| `--force`, `-f` | flag | Force overwriting of existing link. |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **CREATES** | symlink | A shortcut link of the given name as a symlink in `spacy/data`. |
|
||||
| Argument | Type | Description |
|
||||
| ------------------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `model` | positional | Model name, e.g. `en_core_web_sm`. |
|
||||
| `--direct`, `-d` | flag | Force direct download of exact model version. |
|
||||
| pip args <Tag variant="new">2.1</Tag> | - | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory or `--no-deps` to not install model dependencies. |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **CREATES** | directory | The installed model package in your `site-packages` directory. |
|
||||
|
||||
## Info {#info}
|
||||
|
||||
|
@ -94,30 +63,28 @@ $ python -m spacy info [--markdown] [--silent]
|
|||
$ python -m spacy info [model] [--markdown] [--silent]
|
||||
```
|
||||
|
||||
| Argument | Type | Description |
|
||||
| ------------------------------------------------ | ---------- | ------------------------------------------------------------- |
|
||||
| `model` | positional | A model, i.e. shortcut link, package name or path (optional). |
|
||||
| `--markdown`, `-md` | flag | Print information as Markdown. |
|
||||
| `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | flag | Don't print anything, just return the values. |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **PRINTS** | `stdout` | Information about your spaCy installation. |
|
||||
| Argument | Type | Description |
|
||||
| ------------------------------------------------ | ---------- | ---------------------------------------------- |
|
||||
| `model` | positional | A model, i.e. package name or path (optional). |
|
||||
| `--markdown`, `-md` | flag | Print information as Markdown. |
|
||||
| `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | flag | Don't print anything, just return the values. |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **PRINTS** | `stdout` | Information about your spaCy installation. |
|
||||
|
||||
## Validate {#validate new="2"}
|
||||
|
||||
Find all models installed in the current environment (both packages and shortcut
|
||||
links) and check whether they are compatible with the currently installed
|
||||
version of spaCy. Should be run after upgrading spaCy via `pip install -U spacy`
|
||||
to ensure that all installed models can be used with the new version. The
|
||||
command is also useful to detect out-of-sync model links resulting from links
|
||||
created in different virtual environments. It will show a list of models and
|
||||
their installed versions. If any model is out of date, the latest compatible
|
||||
versions and command for updating are shown.
|
||||
Find all models installed in the current environment and check whether they are
|
||||
compatible with the currently installed version of spaCy. Should be run after
|
||||
upgrading spaCy via `pip install -U spacy` to ensure that all installed models
|
||||
can be used with the new version. It will show a list of models and their
|
||||
installed versions. If any model is out of date, the latest compatible versions
|
||||
and command for updating are shown.
|
||||
|
||||
> #### Automated validation
|
||||
>
|
||||
> You can also use the `validate` command as part of your build process or test
|
||||
> suite, to ensure all models are up to date before proceeding. If incompatible
|
||||
> models or shortcut links are found, it will return `1`.
|
||||
> models are found, it will return `1`.
|
||||
|
||||
```bash
|
||||
$ python -m spacy validate
|
||||
|
@ -129,50 +96,42 @@ $ python -m spacy validate
|
|||
|
||||
## Convert {#convert}
|
||||
|
||||
Convert files into spaCy's [JSON format](/api/annotation#json-input) for use
|
||||
with the `train` command and other experiment management functions. The
|
||||
converter can be specified on the command line, or chosen based on the file
|
||||
extension of the input file.
|
||||
Convert files into spaCy's
|
||||
[binary training data format](/api/data-formats#binary-training), a serialized
|
||||
[`DocBin`](/api/docbin), for use with the `train` command and other experiment
|
||||
management functions. The converter can be specified on the command line, or
|
||||
chosen based on the file extension of the input file.
|
||||
|
||||
```bash
|
||||
$ python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
|
||||
[--n-sents] [--morphology] [--lang]
|
||||
$ python -m spacy convert [input_file] [output_dir] [--converter]
|
||||
[--file-type] [--n-sents] [--seg-sents] [--model] [--morphology]
|
||||
[--merge-subtokens] [--ner-map] [--lang]
|
||||
```
|
||||
|
||||
| Argument | Type | Description |
|
||||
| ------------------------------------------------ | ---------- | ------------------------------------------------------------------------------------------------- |
|
||||
| `input_file` | positional | Input file. |
|
||||
| `output_dir` | positional | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. |
|
||||
| `--file-type`, `-t` <Tag variant="new">2.1</Tag> | option | Type of file to create (see below). |
|
||||
| `--converter`, `-c` <Tag variant="new">2</Tag> | option | Name of converter to use (see below). |
|
||||
| `--n-sents`, `-n` | option | Number of sentences per document. |
|
||||
| `--seg-sents`, `-s` <Tag variant="new">2.2</Tag> | flag | Segment sentences (for `-c ner`) |
|
||||
| `--model`, `-b` <Tag variant="new">2.2</Tag> | option | Model for parser-based sentence segmentation (for `-s`) |
|
||||
| `--morphology`, `-m` | option | Enable appending morphology to tags. |
|
||||
| `--lang`, `-l` <Tag variant="new">2.1</Tag> | option | Language code (if tokenizer required). |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **CREATES** | JSON | Data in spaCy's [JSON format](/api/annotation#json-input). |
|
||||
| Argument | Type | Description |
|
||||
| ------------------------------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `input_file` | positional | Input file. |
|
||||
| `output_dir` | positional | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. |
|
||||
| `--converter`, `-c` <Tag variant="new">2</Tag> | option | Name of converter to use (see below). |
|
||||
| `--file-type`, `-t` <Tag variant="new">2.1</Tag> | option | Type of file to create. Either `spacy` (default) for binary [`DocBin`](/api/docbin) data or `json` for v2.x JSON format. |
|
||||
| `--n-sents`, `-n` | option | Number of sentences per document. |
|
||||
| `--seg-sents`, `-s` <Tag variant="new">2.2</Tag> | flag | Segment sentences (for `-c ner`) |
|
||||
| `--model`, `-b` <Tag variant="new">2.2</Tag> | option | Model for parser-based sentence segmentation (for `-s`) |
|
||||
| `--morphology`, `-m` | option | Enable appending morphology to tags. |
|
||||
| `--ner-map`, `-nm` | option | NER tag mapping (as JSON-encoded dict of entity types). |
|
||||
| `--lang`, `-l` <Tag variant="new">2.1</Tag> | option | Language code (if tokenizer required). |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **CREATES** | binary | Binary [`DocBin`](/api/docbin) training data that can be used with [`spacy train`](/api/cli#train). |
|
||||
|
||||
### Output file types {new="2.1"}
|
||||
### Converters
|
||||
|
||||
All output files generated by this command are compatible with
|
||||
[`spacy train`](/api/cli#train).
|
||||
|
||||
| ID | Description |
|
||||
| ------- | -------------------------- |
|
||||
| `json` | Regular JSON (default). |
|
||||
| `jsonl` | Newline-delimited JSON. |
|
||||
| `msg` | Binary MessagePack format. |
|
||||
|
||||
### Converter options
|
||||
|
||||
| ID | Description |
|
||||
| ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `auto` | Automatically pick converter based on file extension and file content (default). |
|
||||
| `conll`, `conllu`, `conllubio` | Universal Dependencies `.conllu` or `.conll` format. |
|
||||
| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
|
||||
| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
|
||||
| `jsonl` | NER data formatted as JSONL with one dict per line and a `"text"` and `"spans"` key. This is also the format exported by the [Prodigy](https://prodi.gy) annotation tool. See [sample data](https://raw.githubusercontent.com/explosion/projects/master/ner-fashion-brands/fashion_brands_training.jsonl). |
|
||||
| ID | Description |
|
||||
| ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `auto` | Automatically pick converter based on file extension and file content (default). |
|
||||
| `json` | JSON-formatted training data used in spaCy v2.x and produced by [`docs2json`](/api/top-level#docs_to_json). |
|
||||
| `conll` | Universal Dependencies `.conllu` or `.conll` format. |
|
||||
| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
|
||||
| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
|
||||
|
||||
## Debug data {#debug-data new="2.2"}
|
||||
|
||||
|
@ -181,20 +140,21 @@ stats, and find problems like invalid entity annotations, cyclic dependencies,
|
|||
low data labels and more.
|
||||
|
||||
```bash
|
||||
$ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pipeline] [--ignore-warnings] [--verbose] [--no-format]
|
||||
$ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model]
|
||||
[--pipeline] [--tag-map-path] [--ignore-warnings] [--verbose] [--no-format]
|
||||
```
|
||||
|
||||
| Argument | Type | Description |
|
||||
| ------------------------------------------------------ | ---------- | -------------------------------------------------------------------------------------------------- |
|
||||
| `lang` | positional | Model language. |
|
||||
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
|
||||
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
|
||||
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.4</Tag> | option | Location of JSON-formatted tag map. |
|
||||
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
|
||||
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
|
||||
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
|
||||
| `--verbose`, `-V` | flag | Print additional information and explanations. |
|
||||
| --no-format, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
|
||||
| Argument | Type | Description |
|
||||
| ------------------------------------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `lang` | positional | Model language. |
|
||||
| `train_path` | positional | Location of [binary training data](/usage/training#data-format). Can be a file or a directory of files. |
|
||||
| `dev_path` | positional | Location of [binary development data](/usage/training#data-format) for evaluation. Can be a file or a directory of files. |
|
||||
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.4</Tag> | option | Location of JSON-formatted tag map. |
|
||||
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
|
||||
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
|
||||
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
|
||||
| `--verbose`, `-V` | flag | Print additional information and explanations. |
|
||||
| `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
|
||||
|
||||
<Accordion title="Example output">
|
||||
|
||||
|
@ -337,21 +297,14 @@ will not be available.
|
|||
|
||||
## Train {#train}
|
||||
|
||||
<!-- TODO: document new training -->
|
||||
|
||||
Train a model. Expects data in spaCy's
|
||||
[JSON format](/api/annotation#json-input). On each epoch, a model will be saved
|
||||
out to the directory. Accuracy scores and model details will be added to a
|
||||
[JSON format](/api/data-formats#json-input). On each epoch, a model will be
|
||||
saved out to the directory. Accuracy scores and model details will be added to a
|
||||
[`meta.json`](/usage/training#models-generating) to allow packaging the model
|
||||
using the [`package`](/api/cli#package) command.
|
||||
|
||||
<Infobox title="Changed in v2.1" variant="warning">
|
||||
|
||||
As of spaCy 2.1, the `--no-tagger`, `--no-parser` and `--no-entities` flags have
|
||||
been replaced by a `--pipeline` option, which lets you define comma-separated
|
||||
names of pipeline components to train. For example, `--pipeline tagger,parser`
|
||||
will only train the tagger and parser.
|
||||
|
||||
</Infobox>
|
||||
|
||||
```bash
|
||||
$ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
||||
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping]
|
||||
|
@ -399,47 +352,10 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
|||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **CREATES** | model, pickle | A spaCy model on each epoch. |
|
||||
|
||||
### Environment variables for hyperparameters {#train-hyperparams new="2"}
|
||||
|
||||
spaCy lets you set hyperparameters for training via environment variables. For
|
||||
example:
|
||||
|
||||
```bash
|
||||
$ token_vector_width=256 learn_rate=0.0001 spacy train [...]
|
||||
```
|
||||
|
||||
> #### Usage with alias
|
||||
>
|
||||
> Environment variables keep the command simple and allow you to
|
||||
> [create an alias](https://askubuntu.com/questions/17536/how-do-i-create-a-permanent-bash-alias/17537#17537)
|
||||
> for your custom `train` command while still being able to easily tweak the
|
||||
> hyperparameters.
|
||||
>
|
||||
> ```bash
|
||||
> alias train-parser="python -m spacy train en /output /data /train /dev -n 1000"
|
||||
> token_vector_width=256 train-parser
|
||||
> ```
|
||||
|
||||
| Name | Description | Default |
|
||||
| -------------------- | --------------------------------------------------- | ------- |
|
||||
| `dropout_from` | Initial dropout rate. | `0.2` |
|
||||
| `dropout_to` | Final dropout rate. | `0.2` |
|
||||
| `dropout_decay` | Rate of dropout change. | `0.0` |
|
||||
| `batch_from` | Initial batch size. | `1` |
|
||||
| `batch_to` | Final batch size. | `64` |
|
||||
| `batch_compound` | Rate of batch size acceleration. | `1.001` |
|
||||
| `token_vector_width` | Width of embedding tables and convolutional layers. | `128` |
|
||||
| `embed_size` | Number of rows in embedding tables. | `7500` |
|
||||
| `hidden_width` | Size of the parser's and NER's hidden layers. | `128` |
|
||||
| `learn_rate` | Learning rate. | `0.001` |
|
||||
| `optimizer_B1` | Momentum for the Adam solver. | `0.9` |
|
||||
| `optimizer_B2` | Adagrad-momentum for the Adam solver. | `0.999` |
|
||||
| `optimizer_eps` | Epsilon value for the Adam solver. | `1e-08` |
|
||||
| `L2_penalty` | L2 regularization penalty. | `1e-06` |
|
||||
| `grad_norm_clip` | Gradient L2 norm constraint. | `1.0` |
|
||||
|
||||
## Pretrain {#pretrain new="2.1" tag="experimental"}
|
||||
|
||||
<!-- TODO: document new pretrain command and link to new pretraining docs -->
|
||||
|
||||
Pre-train the "token to vector" (`tok2vec`) layer of pipeline components, using
|
||||
an approximate language-modeling objective. Specifically, we load pretrained
|
||||
vectors, and train a component like a CNN, BiLSTM, etc to predict vectors which
|
||||
|
@ -522,20 +438,10 @@ tokenization can be provided.
|
|||
Create a new model directory from raw data, like word frequencies, Brown
|
||||
clusters and word vectors. This command is similar to the `spacy model` command
|
||||
in v1.x. Note that in order to populate the model's vocab, you need to pass in a
|
||||
JSONL-formatted [vocabulary file](<(/api/annotation#vocab-jsonl)>) as
|
||||
JSONL-formatted [vocabulary file](/api/data-formats#vocab-jsonl) as
|
||||
`--jsonl-loc` with optional `id` values that correspond to the vectors table.
|
||||
Just loading in vectors will not automatically populate the vocab.
|
||||
|
||||
<Infobox title="Deprecation note" variant="warning">
|
||||
|
||||
As of v2.1.0, the `--freqs-loc` and `--clusters-loc` are deprecated and have
|
||||
been replaced with the `--jsonl-loc` argument, which lets you pass in a
|
||||
[JSONL](http://jsonlines.org/) file containing one lexical entry per line. For
|
||||
more details on the format, see the
|
||||
[annotation specs](/api/annotation#vocab-jsonl).
|
||||
|
||||
</Infobox>
|
||||
|
||||
```bash
|
||||
$ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]
|
||||
[--prune-vectors]
|
||||
|
@ -545,7 +451,7 @@ $ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]
|
|||
| ----------------------------------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `lang` | positional | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. |
|
||||
| `output_dir` | positional | Model output directory. Will be created if it doesn't exist. |
|
||||
| `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted [vocabulary file](/api/annotation#vocab-jsonl) with lexical attributes. |
|
||||
| `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted [vocabulary file](/api/data-formats#vocab-jsonl) with lexical attributes. |
|
||||
| `--vectors-loc`, `-v` | option | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. |
|
||||
| `--truncate-vectors`, `-t` <Tag variant="new">2.3</Tag> | option | Number of vectors to truncate to when reading in vectors file. Defaults to `0` for no truncation. |
|
||||
| `--prune-vectors`, `-V` | option | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. |
|
||||
|
@ -555,6 +461,8 @@ $ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]
|
|||
|
||||
## Evaluate {#evaluate new="2"}
|
||||
|
||||
<!-- TODO: document new evaluate command -->
|
||||
|
||||
Evaluate a model's accuracy and speed on JSON-formatted annotated data. Will
|
||||
print the results and optionally export
|
||||
[displaCy visualizations](/usage/visualizers) of a sample set of parses to
|
||||
|
@ -569,7 +477,7 @@ $ python -m spacy evaluate [model] [data_path] [--displacy-path] [--displacy-lim
|
|||
|
||||
| Argument | Type | Description |
|
||||
| ------------------------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `model` | positional | Model to evaluate. Can be a package or shortcut link name, or a path to a model data directory. |
|
||||
| `model` | positional | Model to evaluate. Can be a package or a path to a model data directory. |
|
||||
| `data_path` | positional | Location of JSON-formatted evaluation data. |
|
||||
| `--displacy-path`, `-dp` | option | Directory to output rendered parses as HTML. If not set, no visualizations will be generated. |
|
||||
| `--displacy-limit`, `-dl` | option | Number of parses to generate per file. Defaults to `25`. Keep in mind that a significantly higher number might cause the `.html` files to render slowly. |
|
||||
|
@ -580,12 +488,20 @@ $ python -m spacy evaluate [model] [data_path] [--displacy-path] [--displacy-lim
|
|||
|
||||
## Package {#package}
|
||||
|
||||
Generate a [model Python package](/usage/training#models-generating) from an
|
||||
existing model data directory. All data files are copied over. If the path to a
|
||||
`meta.json` is supplied, or a `meta.json` is found in the input directory, this
|
||||
file is used. Otherwise, the data can be entered directly from the command line.
|
||||
After packaging, you can run `python setup.py sdist` from the newly created
|
||||
directory to turn your model into an installable archive file.
|
||||
Generate an installable
|
||||
[model Python package](/usage/training#models-generating) from an existing model
|
||||
data directory. All data files are copied over. If the path to a `meta.json` is
|
||||
supplied, or a `meta.json` is found in the input directory, this file is used.
|
||||
Otherwise, the data can be entered directly from the command line. spaCy will
|
||||
then create a `.tar.gz` archive file that you can distribute and install with
|
||||
`pip install`.
|
||||
|
||||
<Infobox title="New in v3.0" variant="warning">
|
||||
|
||||
The `spacy package` command now also builds the `.tar.gz` archive automatically,
|
||||
so you don't have to run `python setup.py sdist` separately anymore.
|
||||
|
||||
</Infobox>
|
||||
|
||||
```bash
|
||||
$ python -m spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] [--force]
|
||||
|
@ -595,7 +511,6 @@ $ python -m spacy package [input_dir] [output_dir] [--meta-path] [--create-meta]
|
|||
### Example
|
||||
python -m spacy package /input /output
|
||||
cd /output/en_model-0.0.0
|
||||
python setup.py sdist
|
||||
pip install dist/en_model-0.0.0.tar.gz
|
||||
```
|
||||
|
||||
|
@ -605,6 +520,23 @@ pip install dist/en_model-0.0.0.tar.gz
|
|||
| `output_dir` | positional | Directory to create package folder in. |
|
||||
| `--meta-path`, `-m` <Tag variant="new">2</Tag> | option | Path to `meta.json` file (optional). |
|
||||
| `--create-meta`, `-c` <Tag variant="new">2</Tag> | flag | Create a `meta.json` file on the command line, even if one already exists in the directory. If an existing file is found, its entries will be shown as the defaults in the command line prompt. |
|
||||
| `--version`, `-v` <Tag variant="new">3</Tag> | option | Package version to override in meta. Useful when training new versions, as it doesn't require editing the meta template. |
|
||||
| `--force`, `-f` | flag | Force overwriting of existing folder in output directory. |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **CREATES** | directory | A Python package containing the spaCy model. |
|
||||
|
||||
## Project {#project}
|
||||
|
||||
<!-- TODO: document project command and subcommands. We should probably wait and only finalize this once we've finalized the design -->
|
||||
|
||||
### project clone {#project-clone}
|
||||
|
||||
### project assets {#project-assets}
|
||||
|
||||
### project run-all {#project-run-all}
|
||||
|
||||
### project run {#project-run}
|
||||
|
||||
### project init {#project-init}
|
||||
|
||||
### project update-dvc {#project-update-dvc}
|
||||
|
|
37 website/docs/api/corpus.md Normal file
@ -0,0 +1,37 @@
---
|
||||
title: Corpus
|
||||
teaser: An annotated corpus
|
||||
tag: class
|
||||
source: spacy/gold/corpus.py
|
||||
new: 3
|
||||
---
|
||||
|
||||
This class manages annotated corpora and can read training and development
|
||||
datasets in the [DocBin](/api/docbin) (`.spacy`) format.
|
||||
|
||||
## Corpus.\_\_init\_\_ {#init tag="method"}
|
||||
|
||||
Create a `Corpus`. The input data can be a file or a directory of files.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------ | ---------------------------------------------------------------- |
|
||||
| `train` | str / `Path` | Training data (`.spacy` file or directory of `.spacy` files). |
|
||||
| `dev` | str / `Path` | Development data (`.spacy` file or directory of `.spacy` files). |
|
||||
| `limit` | int | Maximum number of examples returned. |
|
||||
| **RETURNS** | `Corpus` | The newly constructed object. |
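
A minimal construction sketch, assuming the class can be imported from
`spacy.gold` (the source path above) and that the `.spacy` paths point to
existing [`DocBin`](/api/docbin) files:

```python
### Creating a Corpus
from spacy.gold import Corpus

# hypothetical paths to binary training and development data
corpus = Corpus("./train.spacy", "./dev.spacy")
```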
|
||||
|
||||
<!-- TODO: document remaining methods / decide which to document -->
|
||||
|
||||
## Corpus.walk_corpus {#walk_corpus tag="staticmethod"}
|
||||
|
||||
## Corpus.make_examples {#make_examples tag="method"}
|
||||
|
||||
## Corpus.make_examples_gold_preproc {#make_examples_gold_preproc tag="method"}
|
||||
|
||||
## Corpus.read_docbin {#read_docbin tag="method"}
|
||||
|
||||
## Corpus.count_train {#count_train tag="method"}
|
||||
|
||||
## Corpus.train_dataset {#train_dataset tag="method"}
|
||||
|
||||
## Corpus.dev_dataset {#dev_dataset tag="method"}
|
130 website/docs/api/data-formats.md Normal file
@ -0,0 +1,130 @@
---
|
||||
title: Data formats
|
||||
teaser: Details on spaCy's input and output data formats
|
||||
menu:
|
||||
- ['Training data', 'training']
|
||||
- ['Vocabulary', 'vocab']
|
||||
---
|
||||
|
||||
This section documents input and output formats of data used by spaCy, including
|
||||
training data and lexical vocabulary data. For an overview of label schemes used
|
||||
by the models, see the [models directory](/models). Each model documents the
|
||||
label schemes used in its components, depending on the data it was trained on.
|
||||
|
||||
## Training data {#training}
|
||||
|
||||
### Binary training format {#binary-training new="3"}
|
||||
|
||||
<!-- TODO: document DocBin format -->
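
Until this is documented in full, here's a rough sketch of how a `.spacy` file
can be produced from `Doc` objects using [`DocBin`](/api/docbin). The exact API
may still change, so treat this as an illustration rather than a reference:

```python
### Creating a .spacy file (sketch)
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()
for text in ["This is a sentence.", "This is another one."]:
    # add each (optionally annotated) Doc to the collection
    doc_bin.add(nlp(text))
doc_bin.to_disk("./train.spacy")  # hypothetical output path
```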
|
||||
|
||||
### JSON input format for training {#json-input}
|
||||
|
||||
spaCy takes training data in JSON format. The built-in
|
||||
[`convert`](/api/cli#convert) command helps you convert the `.conllu` format
|
||||
used by the
|
||||
[Universal Dependencies corpora](https://github.com/UniversalDependencies) to
|
||||
spaCy's training format. To convert one or more existing `Doc` objects to
|
||||
spaCy's JSON format, you can use the
|
||||
[`gold.docs_to_json`](/api/top-level#docs_to_json) helper.
|
||||
|
||||
> #### Annotating entities
|
||||
>
|
||||
> Named entities are provided in the
|
||||
> [BILUO](/usage/linguistic-features#accessing-ner) notation. Tokens outside an
|
||||
> entity are set to `"O"` and tokens that are part of an entity are set to the
|
||||
> entity label, prefixed by the BILUO marker. For example `"B-ORG"` describes
|
||||
> the first token of a multi-token `ORG` entity and `"U-PERSON"` a single token
|
||||
> representing a `PERSON` entity. The
|
||||
> [`biluo_tags_from_offsets`](/api/top-level#biluo_tags_from_offsets) function
|
||||
> can help you convert entity offsets to the right format.
|
||||
|
||||
```python
|
||||
### Example structure
|
||||
[{
|
||||
"id": int, # ID of the document within the corpus
|
||||
"paragraphs": [{ # list of paragraphs in the corpus
|
||||
"raw": string, # raw text of the paragraph
|
||||
"sentences": [{ # list of sentences in the paragraph
|
||||
"tokens": [{ # list of tokens in the sentence
|
||||
"id": int, # index of the token in the document
|
||||
"dep": string, # dependency label
|
||||
"head": int, # offset of token head relative to token index
|
||||
"tag": string, # part-of-speech tag
|
||||
"orth": string, # verbatim text of the token
|
||||
"ner": string # BILUO label, e.g. "O" or "B-ORG"
|
||||
}],
|
||||
"brackets": [{ # phrase structure (NOT USED by current models)
|
||||
"first": int, # index of first token
|
||||
"last": int, # index of last token
|
||||
"label": string # phrase label
|
||||
}]
|
||||
}],
|
||||
"cats": [{ # new in v2.2: categories for text classifier
|
||||
"label": string, # text category label
|
||||
"value": float / bool # label applies (1.0/true) or not (0.0/false)
|
||||
}]
|
||||
}]
|
||||
}]
|
||||
```
|
||||
|
||||
Here's an example of dependencies, part-of-speech tags and named entities, taken
|
||||
from the English Wall Street Journal portion of the Penn Treebank:
|
||||
|
||||
```json
|
||||
https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json
|
||||
```
|
||||
|
||||
## Lexical data for vocabulary {#vocab-jsonl new="2"}
|
||||
|
||||
To populate a model's vocabulary, you can use the
|
||||
[`spacy init-model`](/api/cli#init-model) command and load in a
|
||||
[newline-delimited JSON](http://jsonlines.org/) (JSONL) file containing one
|
||||
lexical entry per line via the `--jsonl-loc` option. The first line defines the
|
||||
language and vocabulary settings. All other lines are expected to be JSON
|
||||
objects describing an individual lexeme. The lexical attributes will then be set
as attributes on spaCy's [`Lexeme`](/api/lexeme#attributes) object. The `init-model`
|
||||
command outputs a ready-to-use spaCy model with a `Vocab` containing the lexical
|
||||
data.
|
||||
|
||||
```python
|
||||
### First line
|
||||
{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}
|
||||
```
|
||||
|
||||
```python
|
||||
### Entry structure
|
||||
{
|
||||
"orth": string, # the word text
|
||||
"id": int, # can correspond to row in vectors table
|
||||
"lower": string,
|
||||
"norm": string,
|
||||
"shape": string
|
||||
"prefix": string,
|
||||
"suffix": string,
|
||||
"length": int,
|
||||
"cluster": string,
|
||||
"prob": float,
|
||||
"is_alpha": bool,
|
||||
"is_ascii": bool,
|
||||
"is_digit": bool,
|
||||
"is_lower": bool,
|
||||
"is_punct": bool,
|
||||
"is_space": bool,
|
||||
"is_title": bool,
|
||||
"is_upper": bool,
|
||||
"like_url": bool,
|
||||
"like_num": bool,
|
||||
"like_email": bool,
|
||||
"is_stop": bool,
|
||||
"is_oov": bool,
|
||||
"is_quote": bool,
|
||||
"is_left_punct": bool,
|
||||
"is_right_punct": bool
|
||||
}
|
||||
```
|
||||
|
||||
Here's an example of the 20 most frequent lexemes in the English training data:
|
||||
|
||||
```json
|
||||
https://github.com/explosion/spaCy/tree/master/examples/training/vocab-data.jsonl
|
||||
```
|
|
@ -123,7 +123,7 @@ details, see the documentation on
|
|||
|
||||
| Name | Type | Description |
|
||||
| --------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `name` | str | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `doc._.my_attr`. |
|
||||
| `name` | str | Name of the attribute to set by the extension. For example, `"my_attr"` will be available as `doc._.my_attr`. |
|
||||
| `default` | - | Optional default value of the attribute if no getter or method is defined. |
|
||||
| `method` | callable | Set a custom method on the object, for example `doc._.compare(other_doc)`. |
|
||||
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
|
||||
|
@ -140,8 +140,8 @@ Look up a previously registered extension by name. Returns a 4-tuple
|
|||
>
|
||||
> ```python
|
||||
> from spacy.tokens import Doc
|
||||
> Doc.set_extension('has_city', default=False)
|
||||
> extension = Doc.get_extension('has_city')
|
||||
> Doc.set_extension("has_city", default=False)
|
||||
> extension = Doc.get_extension("has_city")
|
||||
> assert extension == (False, None, None, None)
|
||||
> ```
|
||||
|
||||
|
@ -158,8 +158,8 @@ Check whether an extension has been registered on the `Doc` class.
|
|||
>
|
||||
> ```python
|
||||
> from spacy.tokens import Doc
|
||||
> Doc.set_extension('has_city', default=False)
|
||||
> assert Doc.has_extension('has_city')
|
||||
> Doc.set_extension("has_city", default=False)
|
||||
> assert Doc.has_extension("has_city")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -175,9 +175,9 @@ Remove a previously registered extension.
|
|||
>
|
||||
> ```python
|
||||
> from spacy.tokens import Doc
|
||||
> Doc.set_extension('has_city', default=False)
|
||||
> removed = Doc.remove_extension('has_city')
|
||||
> assert not Doc.has_extension('has_city')
|
||||
> Doc.set_extension("has_city", default=False)
|
||||
> removed = Doc.remove_extension("has_city")
|
||||
> assert not Doc.has_extension("has_city")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -202,9 +202,9 @@ the character indices don't map to a valid span.
|
|||
| ------------------------------------ | ---------------------------------------- | --------------------------------------------------------------------- |
|
||||
| `start` | int | The index of the first character of the span. |
|
||||
| `end` | int | The index of the last character after the span. |
|
||||
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
|
||||
| `kb_id` <Tag variant="new">2.2</Tag> | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
|
||||
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
|
||||
| `label` | uint64 / str | A label to attach to the span, e.g. for named entities. |
|
||||
| `kb_id` <Tag variant="new">2.2</Tag> | uint64 / str | An ID from a knowledge base to capture the meaning of a named entity. |
|
||||
| `vector` | `numpy.ndarray[ndim=1, dtype="float32"]` | A meaning representation of the span. |
|
||||
| **RETURNS** | `Span` | The newly constructed object or `None`. |
|
||||
|
||||
## Doc.similarity {#similarity tag="method" model="vectors"}
|
||||
|
@ -264,7 +264,7 @@ ancestor is found, e.g. if span excludes a necessary ancestor.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | -------------------------------------- | ----------------------------------------------- |
|
||||
| **RETURNS** | `numpy.ndarray[ndim=2, dtype='int32']` | The lowest common ancestor matrix of the `Doc`. |
|
||||
| **RETURNS** | `numpy.ndarray[ndim=2, dtype="int32"]` | The lowest common ancestor matrix of the `Doc`. |
|
||||
|
||||
## Doc.to_json {#to_json tag="method" new="2.1"}
|
||||
|
||||
|
@ -297,22 +297,13 @@ They'll be added to an `"_"` key in the data, e.g. `"_": {"foo": "bar"}`.
|
|||
| `underscore` | list | Optional list of string names of custom JSON-serializable `doc._.` attributes. |
|
||||
| **RETURNS** | dict | The JSON-formatted data. |
|
||||
|
||||
<Infobox title="Deprecation note" variant="warning">
|
||||
|
||||
spaCy previously implemented a `Doc.print_tree` method that returned a similar
|
||||
JSON-formatted representation of a `Doc`. As of v2.1, this method is deprecated
|
||||
in favor of `Doc.to_json`. If you need more complex nested representations, you
|
||||
might want to write your own function to extract the data.
|
||||
|
||||
</Infobox>
|
||||
|
||||
## Doc.to_array {#to_array tag="method"}
|
||||
|
||||
Export given token attributes to a numpy `ndarray`. If `attr_ids` is a sequence
|
||||
of `M` attributes, the output array will be of shape `(N, M)`, where `N` is the
|
||||
length of the `Doc` (in tokens). If `attr_ids` is a single attribute, the output
|
||||
shape will be `(N,)`. You can specify attributes by integer ID (e.g.
|
||||
`spacy.attrs.LEMMA`) or string name (e.g. 'LEMMA' or 'lemma'). The values will
|
||||
`spacy.attrs.LEMMA`) or string name (e.g. "LEMMA" or "lemma"). The values will
|
||||
be 64-bit integers.
|
||||
|
||||
Returns a 2D array with one row per token and one column per attribute (when
|
||||
|
@ -332,7 +323,7 @@ Returns a 2D array with one row per token and one column per attribute (when
|
|||
| Name | Type | Description |
|
||||
| ----------- | ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
|
||||
| `attr_ids` | list or int or string | A list of attributes (int IDs or string names) or a single attribute (int ID or string name) |
|
||||
| **RETURNS** | `numpy.ndarray[ndim=2, dtype='uint64']` or `numpy.ndarray[ndim=1, dtype='uint64']` | The exported attributes as a numpy array. |
|
||||
| **RETURNS** | `numpy.ndarray[ndim=2, dtype="uint64"]` or `numpy.ndarray[ndim=1, dtype="uint64"]` | The exported attributes as a numpy array. |

## Doc.from_array {#from_array tag="method"}

@@ -354,10 +345,37 @@ array of attributes.

| Name | Type | Description |
| ----------- | -------------------------------------- | ------------------------------------------------------------------------- |
| `attrs` | list | A list of attribute ID ints. |
| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. |
| `array` | `numpy.ndarray[ndim=2, dtype="int32"]` | The attribute values to load. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Doc` | Itself. |
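
A round-trip sketch along the lines of the usage this table describes: export attributes from one `Doc` and load them into a fresh one (assumes a loaded `nlp` pipeline):

```python
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc

doc = nlp("I like London")
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
doc2 = Doc(doc.vocab, words=[t.text for t in doc])
doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)  # returns doc2 itself
assert doc[0].pos_ == doc2[0].pos_
```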

## Doc.from_docs {#from_docs tag="staticmethod"}

Concatenate multiple `Doc` objects to form a new one. Raises an error if the
`Doc` objects do not all share the same `Vocab`.

> #### Example
>
> ```python
> from spacy.tokens import Doc
> texts = ["London is the capital of the United Kingdom.",
>          "The River Thames flows through London.",
>          "The famous Tower Bridge crosses the River Thames."]
> docs = list(nlp.pipe(texts))
> c_doc = Doc.from_docs(docs)
> assert str(c_doc) == " ".join(texts)
> assert len(list(c_doc.sents)) == len(docs)
> assert [str(ent) for ent in c_doc.ents] == \
>        [str(ent) for doc in docs for ent in doc.ents]
> ```

| Name | Type | Description |
| ------------------- | ----- | -------------------------------------------------------------------------------------------- |
| `docs` | list | A list of `Doc` objects. |
| `ensure_whitespace` | bool | Insert a space between two adjacent docs whenever the first doc does not end in whitespace. |
| `attrs` | list | Optional list of attribute ID ints or attribute name strings. |
| **RETURNS** | `Doc` | The new `Doc` object containing the other docs, or `None` if `docs` is empty or `None`. |

## Doc.to_disk {#to_disk tag="method" new="2"}

Save the current state to a directory.
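
A minimal save/load sketch, with a placeholder path and a loaded `nlp` pipeline assumed:

```python
from spacy.tokens import Doc

doc = nlp("I like London")
doc.to_disk("/path/to/doc")                       # placeholder path
restored = Doc(nlp.vocab).from_disk("/path/to/doc")
assert restored.text == doc.text
```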

@@ -507,14 +525,6 @@ underlying lexeme (if they're context-independent lexical attributes like

## Doc.merge {#merge tag="method"}

<Infobox title="Deprecation note" variant="danger">

As of v2.1.0, `Doc.merge` still works but is considered deprecated. You should
use the new and less error-prone [`Doc.retokenize`](/api/doc#retokenize)
instead.

</Infobox>

Retokenize the document, such that the span at `doc.text[start_idx : end_idx]`
is merged into a single token. If `start_idx` and `end_idx` do not mark start
and end token boundaries, the document remains unchanged.
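
The recommended replacement looks roughly like this (a sketch, assuming a loaded `nlp` pipeline):

```python
doc = nlp("I like New York")
with doc.retokenize() as retokenizer:
    # Merge the span "New York" into a single token and set its lemma
    retokenizer.merge(doc[2:4], attrs={"LEMMA": "New York"})
assert len(doc) == 3
assert doc[2].text == "New York"
```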

@@ -624,7 +634,7 @@ vectors.

| Name | Type | Description |
| ----------- | ---------------------------------------- | ------------------------------------------------------- |
| **RETURNS** | `numpy.ndarray[ndim=1, dtype='float32']` | A 1D numpy array representing the document's semantics. |
| **RETURNS** | `numpy.ndarray[ndim=1, dtype="float32"]` | A 1D numpy array representing the document's semantics. |

## Doc.vector_norm {#vector_norm tag="property" model="vectors"}

@@ -646,26 +656,26 @@ The L2 norm of the document's vector representation.

## Attributes {#attributes}

| Name | Type | Description |
| --------------------------------------- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `text` | str | A unicode representation of the document text. |
| `text_with_ws` | str | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
| `vocab` | `Vocab` | The store of lexical types. |
| `tensor` <Tag variant="new">2</Tag> | `ndarray` | Container for dense vector representations. |
| `cats` <Tag variant="new">2</Tag> | dict | Maps a label to a score for categories applied to the document. The label is a string and the score should be a float. |
| `user_data` | - | A generic storage area, for user custom data. |
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
| `lang_` <Tag variant="new">2.1</Tag> | str | Language of the document's vocabulary. |
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. |
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. |
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. |
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if the `Doc` is empty, or if _any_ of the tokens has an entity tag set, even if the others are unknown. |
| `sentiment` | float | The document's positivity/negativity score, if available. |
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |

| Name | Type | Description |
| --------------------------------------- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `text` | str | A string representation of the document text. |
| `text_with_ws` | str | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
| `vocab` | `Vocab` | The store of lexical types. |
| `tensor` <Tag variant="new">2</Tag> | `ndarray` | Container for dense vector representations. |
| `cats` <Tag variant="new">2</Tag> | dict | Maps a label to a score for categories applied to the document. The label is a string and the score should be a float. |
| `user_data` | - | A generic storage area, for user custom data. |
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
| `lang_` <Tag variant="new">2.1</Tag> | str | Language of the document's vocabulary. |
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. |
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. |
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. |
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if the `Doc` is empty, or if _any_ of the tokens has an entity tag set, even if the others are unknown. |
| `sentiment` | float | The document's positivity/negativity score, if available. |
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |

## Serialization fields {#serialization-fields}

@@ -16,13 +16,14 @@ document from the `DocBin`. The serialization format is gzipped msgpack, where
the msgpack object has the following structure:

```python
### msgpack object strcutrue
### msgpack object structure
{
    "version": str,           # DocBin version number
    "attrs": List[uint64],    # e.g. [TAG, HEAD, ENT_IOB, ENT_TYPE]
    "tokens": bytes,          # Serialized numpy uint64 array with the token data
    "spaces": bytes,          # Serialized numpy boolean array with spaces data
    "lengths": bytes,         # Serialized numpy int32 array with the doc lengths
    "strings": List[unicode]  # List of unique strings in the token data
    "strings": List[str]      # List of unique strings in the token data
}
```
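
A sketch of the round trip this format backs (assuming a loaded `nlp` pipeline):

```python
from spacy.tokens import DocBin

doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
for doc in nlp.pipe(["I like London", "The Thames flows through London"]):
    doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()            # gzipped msgpack with the structure above

# ...later, possibly in another process
restored = DocBin().from_bytes(bytes_data)
docs = list(restored.get_docs(nlp.vocab))
```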

@@ -45,7 +46,7 @@ Create a `DocBin` object to hold serialized annotations.

| Argument | Type | Description |
| ----------------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `attrs` | list | List of attributes to serialize. `orth` (hash of token text) and `spacy` (whether the token is followed by whitespace) are always serialized, so they're not required. Defaults to `None`. |
| `attrs` | list | List of attributes to serialize. `ORTH` (hash of token text) and `SPACY` (whether the token is followed by whitespace) are always serialized, so they're not required. Defaults to `("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "LEMMA", "MORPH", "POS")`. |
| `store_user_data` | bool | Whether to include the `Doc.user_data` and the values of custom extension attributes. Defaults to `False`. |
| **RETURNS** | `DocBin` | The newly constructed object. |

@@ -36,7 +36,7 @@ be a token pattern (list) or a phrase pattern (string). For example:

| --------------------- | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `nlp` | `Language` | The shared nlp object, used to pass the vocab to the matchers and to process phrase patterns. |
| `patterns` | iterable | Optional patterns to load in. |
| `phrase_matcher_attr` | int / unicode | Optional attr to pass to the internal [`PhraseMatcher`](/api/phrasematcher). Defaults to `None`. |
| `phrase_matcher_attr` | int / str | Optional attr to pass to the internal [`PhraseMatcher`](/api/phrasematcher). Defaults to `None`. |
| `validate` | bool | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. |
| `overwrite_ents` | bool | If existing entities are present, e.g. entities added by the model, overwrite them with matches if necessary. Defaults to `False`. |
| `**cfg` | - | Other config parameters. If pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to `spacy.load`. |
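
A short usage sketch for these parameters, following the v2-style constructor documented here (assumes a loaded `nlp` pipeline):

```python
from spacy.pipeline import EntityRuler

ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER", overwrite_ents=True)
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},                                     # phrase pattern
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},  # token pattern
])
nlp.add_pipe(ruler)
doc = nlp("Apple is opening an office in San Francisco")
print([(ent.text, ent.label_) for ent in doc.ents])
```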

10 website/docs/api/example.md Normal file

@@ -0,0 +1,10 @@
---
title: Example
teaser: A training example
tag: class
source: spacy/gold/example.pyx
---

<!-- TODO: -->

## Example.\_\_init\_\_ {#init tag="method"}

@@ -1,24 +0,0 @@
---
title: GoldCorpus
teaser: An annotated corpus, using the JSON file format
tag: class
source: spacy/gold.pyx
new: 2
---

This class manages annotations for tagging, dependency parsing and NER.

## GoldCorpus.\_\_init\_\_ {#init tag="method"}

Create a `GoldCorpus`. If the input data is an iterable, each item should be a
`(text, paragraphs)` tuple, where each paragraph is a tuple
`(sentences, brackets)`, and each sentence is a tuple
`(ids, words, tags, heads, ner)`. See the implementation of
[`gold.read_json_file`](https://github.com/explosion/spaCy/tree/master/spacy/gold.pyx)
for further details.

| Name | Type | Description |
| ----------- | ----------------------- | ------------------------------------------------------------ |
| `train` | str / `Path` / iterable | Training data, as a path (file or directory) or iterable. |
| `dev` | str / `Path` / iterable | Development data, as a path (file or directory) or iterable. |
| **RETURNS** | `GoldCorpus` | The newly constructed object. |
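
Since this page is removed by the diff, only as a record of the legacy v2-style usage; a minimal sketch with placeholder file paths:

```python
import spacy
from spacy.gold import GoldCorpus

nlp = spacy.load("en_core_web_sm")             # assumes the model is installed
corpus = GoldCorpus("train.json", "dev.json")  # placeholder paths to JSON training data
train_data = corpus.train_docs(nlp)            # yields (Doc, GoldParse) tuples
```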

@@ -1,207 +0,0 @@
---
title: GoldParse
teaser: A collection for training annotations
tag: class
source: spacy/gold.pyx
---

## GoldParse.\_\_init\_\_ {#init tag="method"}

Create a `GoldParse`. The [`TextCategorizer`](/api/textcategorizer) component
expects true examples of a label to have the value `1.0`, and negative examples
of a label to have the value `0.0`. Labels not in the dictionary are treated as
missing – the gradient for those labels will be zero.

| Name | Type | Description |
| ----------------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc` | `Doc` | The document the annotations refer to. |
| `words` | iterable | A sequence of unicode word strings. |
| `tags` | iterable | A sequence of strings, representing tag annotations. |
| `heads` | iterable | A sequence of integers, representing syntactic head offsets. |
| `deps` | iterable | A sequence of strings, representing the syntactic relation types. |
| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). |
| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). |
| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False`. |
| **RETURNS** | `GoldParse` | The newly constructed object. |
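
Likewise removed by this diff; a sketch of the old v2-style construction, assuming a loaded `nlp` pipeline:

```python
from spacy.gold import GoldParse

doc = nlp("I like London")
gold = GoldParse(doc, entities=[(7, 13, "LOC")], cats={"TRAVEL": 1.0, "BAKING": 0.0})
assert len(gold) == len(doc)          # one gold token per doc token
assert gold.ner == ["O", "O", "U-LOC"]
```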

## GoldParse.\_\_len\_\_ {#len tag="method"}

Get the number of gold-standard tokens.

| Name | Type | Description |
| ----------- | ---- | ----------------------------------- |
| **RETURNS** | int | The number of gold-standard tokens. |

## GoldParse.is_projective {#is_projective tag="property"}

Whether the provided syntactic annotations form a projective dependency tree.

| Name | Type | Description |
| ----------- | ---- | ----------------------------------------------- |
| **RETURNS** | bool | Whether the annotations form a projective tree. |

## Attributes {#attributes}

| Name | Type | Description |
| ------------------------------------ | ---- | -------------------------------------------------------------------------------------------------------------------------- |
| `words` | list | The words. |
| `tags` | list | The part-of-speech tag annotations. |
| `heads` | list | The syntactic head annotations. |
| `labels` | list | The syntactic relation-type annotations. |
| `ner` | list | The named entity annotations as BILUO tags. |
| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. |
| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. |
| `cats` <Tag variant="new">2</Tag> | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`. |
| `links` <Tag variant="new">2.2</Tag> | dict | Keys in the dictionary are `(start_char, end_char)` tuples, and the values are dictionaries with `kb_id:value` entries. |

## Utilities {#util}

### gold.docs_to_json {#docs_to_json tag="function"}

Convert a list of Doc objects into the
[JSON-serializable format](/api/annotation#json-input) used by the
[`spacy train`](/api/cli#train) command. Each input doc will be treated as a
'paragraph' in the output doc.

> #### Example
>
> ```python
> from spacy.gold import docs_to_json
>
> doc = nlp("I like London")
> json_data = docs_to_json([doc])
> ```

| Name | Type | Description |
| ----------- | ---------------- | ------------------------------------------ |
| `docs` | iterable / `Doc` | The `Doc` object(s) to convert. |
| `id` | int | ID to assign to the JSON. Defaults to `0`. |
| **RETURNS** | dict | The data in spaCy's JSON format. |

### gold.align {#align tag="function"}

Calculate alignment tables between two tokenizations, using the Levenshtein
algorithm. The alignment is case-insensitive.

<Infobox title="Important note" variant="warning">

The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.

</Infobox>

> #### Example
>
> ```python
> from spacy.gold import align
>
> bert_tokens = ["obama", "'", "s", "podcast"]
> spacy_tokens = ["obama", "'s", "podcast"]
> alignment = align(bert_tokens, spacy_tokens)
> cost, a2b, b2a, a2b_multi, b2a_multi = alignment
> ```

| Name | Type | Description |
| ----------- | ----- | -------------------------------------------------------------------------- |
| `tokens_a` | list | String values of candidate tokens to align. |
| `tokens_b` | list | String values of reference tokens to align. |
| **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment. |

The returned tuple contains the following alignment information:

> #### Example
>
> ```python
> a2b = array([0, -1, -1, 2])
> b2a = array([0, 2, 3])
> a2b_multi = {1: 1, 2: 1}
> b2a_multi = {}
> ```
>
> If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If
> there's no one-to-one alignment for a token, it has the value `-1`.

| Name | Type | Description |
| ----------- | -------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| `cost` | int | The number of misaligned tokens. |
| `a2b` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`. |
| `b2a` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`. |
| `a2b_multi` | dict | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`. |
| `b2a_multi` | dict | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`. |

### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}

Encode labelled spans into per-token tags, using the
[BILUO scheme](/api/annotation#biluo) (Begin, In, Last, Unit, Out). Returns a
list of unicode strings, describing the tags. Each tag string will be of the
form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of
`"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets
don't align with the tokenization in the `Doc` object. The training algorithm
will view these as missing values. `O` denotes a non-entity token. `B` denotes
the beginning of a multi-token entity, `I` the inside of an entity of three or
more tokens, and `L` the end of an entity of two or more tokens. `U` denotes a
single-token entity.

> #### Example
>
> ```python
> from spacy.gold import biluo_tags_from_offsets
>
> doc = nlp("I like London.")
> entities = [(7, 13, "LOC")]
> tags = biluo_tags_from_offsets(doc, entities)
> assert tags == ["O", "O", "U-LOC", "O"]
> ```

| Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doc` | `Doc` | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. |
| `entities` | iterable | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. |
| **RETURNS** | list | A list of strings, describing the [BILUO](/api/annotation#biluo) tags. |

### gold.offsets_from_biluo_tags {#offsets_from_biluo_tags tag="function"}

Encode per-token tags following the [BILUO scheme](/api/annotation#biluo) into
entity offsets.

> #### Example
>
> ```python
> from spacy.gold import offsets_from_biluo_tags
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> entities = offsets_from_biluo_tags(doc, tags)
> assert entities == [(7, 13, "LOC")]
> ```

| Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doc` | `Doc` | The document that the BILUO tags refer to. |
| `entities` | iterable | A sequence of [BILUO](/api/annotation#biluo) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. |
| **RETURNS** | list | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string. |

### gold.spans_from_biluo_tags {#spans_from_biluo_tags tag="function" new="2.1"}

Encode per-token tags following the [BILUO scheme](/api/annotation#biluo) into
[`Span`](/api/span) objects. This can be used to create entity spans from
token-based tags, e.g. to overwrite the `doc.ents`.

> #### Example
>
> ```python
> from spacy.gold import spans_from_biluo_tags
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> doc.ents = spans_from_biluo_tags(doc, tags)
> ```

| Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doc` | `Doc` | The document that the BILUO tags refer to. |
| `entities` | iterable | A sequence of [BILUO](/api/annotation#biluo) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. |
| **RETURNS** | list | A sequence of `Span` objects with added entity labels. |

@@ -1,10 +1,8 @@
---
title: Architecture
next: /api/annotation
title: Library Architecture
next: /api/architectures
---

## Library architecture {#architecture}

import Architecture101 from 'usage/101/\_architecture.md'

<Architecture101 />

Some files were not shown because too many files have changed in this diff.