mirror of
https://github.com/explosion/spaCy.git
synced 2025-02-21 22:10:34 +03:00
Merge remote-tracking branch 'origin/develop' into rliaw-develop
This commit is contained in:
commit
3141ea0931
106  .github/contributors/tiangolo.md  (vendored, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Sebastián Ramírez    |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2020-07-01           |
| GitHub username                | tiangolo             |
| Website (optional)             |                      |
@@ -2,7 +2,6 @@ recursive-include include *.h
 recursive-include spacy *.pyx *.pxd *.txt *.cfg
 include LICENSE
 include README.md
-include bin/spacy
 include pyproject.toml
 recursive-exclude spacy/lang *.json
 recursive-include spacy/lang *.json.gz
@@ -194,7 +194,7 @@ pip install https://github.com/explosion/spacy-models/releases/download/en_core_
 ### Loading and using models
 
-To load a model, use `spacy.load()` with the model name, a shortcut link or a
+To load a model, use `spacy.load()` with the model name or a
 path to the model data directory.
 
 ```python
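For reference, a minimal sketch of the loading pattern the updated sentence describes; the package name `en_core_web_sm` is only an illustrative example of an installed model:

```python
import spacy

# Load by installed package name (assumes a model package such as
# en_core_web_sm has been installed)...
nlp = spacy.load("en_core_web_sm")
# ...or load from a path to a model data directory:
# nlp = spacy.load("/path/to/model")

doc = nlp("This is a sentence.")
print([(token.text, token.pos_) for token in doc])
```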
@@ -54,6 +54,10 @@ seed = ${training:seed}
 use_pytorch_for_gpu_memory = ${training:use_pytorch_for_gpu_memory}
 tok2vec_model = "nlp.pipeline.tok2vec.model"
 
+[pretraining.objective]
+type = "characters"
+n_characters = 4
+
 [pretraining.optimizer]
 @optimizers = "Adam.v1"
 beta1 = 0.9
@@ -65,10 +69,6 @@ use_averages = true
 eps = 1e-8
 learn_rate = 0.001
 
-[pretraining.loss_func]
-@losses = "CosineDistance.v1"
-normalize = true
-
 [nlp]
 lang = "en"
 vectors = null
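The `[pretraining.objective]` block added here replaces the removed `[pretraining.loss_func]` section. A hedged sketch of the two shapes this block can take, with key names taken from how `create_objective()` reads them later in this diff:

```python
# Character-based objective, as selected by the config above: predict the
# first/last n_characters UTF-8 bytes of each token.
character_objective = {"type": "characters", "n_characters": 4}

# Vector-based objective: predict the pretrained vectors under a chosen loss.
vector_objective = {"type": "vectors", "loss": "cosine"}  # "L2" is also accepted
```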
@@ -33,7 +33,7 @@ def read_raw_data(nlp, jsonl_loc):
     for json_obj in srsly.read_jsonl(jsonl_loc):
         if json_obj["text"].strip():
             doc = nlp.make_doc(json_obj["text"])
-            yield doc
+            yield Example.from_dict(doc, {})
 
 
 def read_gold_data(nlp, gold_loc):
@@ -52,7 +52,7 @@ def main(model_name, unlabelled_loc):
     batch_size = 4
     nlp = spacy.load(model_name)
     nlp.get_pipe("ner").add_label(LABEL)
-    raw_docs = list(read_raw_data(nlp, unlabelled_loc))
+    raw_examples = list(read_raw_data(nlp, unlabelled_loc))
     optimizer = nlp.resume_training()
     # Avoid use of Adam when resuming training. I don't understand this well
     # yet, but I'm getting weird results from Adam. Try commenting out the
@@ -61,20 +61,24 @@ def main(model_name, unlabelled_loc):
     optimizer.learn_rate = 0.1
     optimizer.b1 = 0.0
     optimizer.b2 = 0.0
 
     sizes = compounding(1.0, 4.0, 1.001)
 
+    train_examples = []
+    for text, annotations in TRAIN_DATA:
+        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
+
     with nlp.select_pipes(enable="ner") and warnings.catch_warnings():
         # show warnings for misaligned entity spans once
         warnings.filterwarnings("once", category=UserWarning, module="spacy")
 
         for itn in range(n_iter):
-            random.shuffle(TRAIN_DATA)
+            random.shuffle(train_examples)
-            random.shuffle(raw_docs)
+            random.shuffle(raw_examples)
             losses = {}
             r_losses = {}
             # batch up the examples using spaCy's minibatch
-            raw_batches = minibatch(raw_docs, size=4)
+            raw_batches = minibatch(raw_examples, size=4)
-            for batch in minibatch(TRAIN_DATA, size=sizes):
+            for batch in minibatch(train_examples, size=sizes):
                 nlp.update(batch, sgd=optimizer, drop=dropout, losses=losses)
                 raw_batch = list(next(raw_batches))
                 nlp.rehearse(raw_batch, sgd=optimizer, losses=r_losses)
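The change that repeats across these example scripts is that `nlp.update` is now fed `Example` objects instead of `(text, annotations)` tuples. A minimal sketch of the new pattern, using illustrative toy data rather than any script's real `TRAIN_DATA`:

```python
import random
import spacy
from spacy.gold import Example  # import location used throughout this diff
from spacy.util import minibatch, compounding

# Illustrative toy data in the usual (text, annotations) format
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London.", {"entities": [(7, 13, "LOC")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annotations in TRAIN_DATA:
    for ent in annotations["entities"]:
        ner.add_label(ent[2])

# Convert the tuples into Example objects up front...
train_examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

optimizer = nlp.begin_training()
for itn in range(10):
    # ...then shuffle and batch the Examples instead of TRAIN_DATA
    random.shuffle(train_examples)
    losses = {}
    for batch in minibatch(train_examples, size=compounding(4.0, 32.0, 1.001)):
        nlp.update(batch, sgd=optimizer, losses=losses)
    print("Losses", losses)
```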
@@ -20,6 +20,8 @@ from pathlib import Path
 from spacy.vocab import Vocab
 import spacy
 from spacy.kb import KnowledgeBase
+
+from spacy.gold import Example
 from spacy.pipeline import EntityRuler
 from spacy.util import minibatch, compounding
 
@@ -94,7 +96,7 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
     # Convert the texts to docs to make sure we have doc.ents set for the training examples.
     # Also ensure that the annotated examples correspond to known identifiers in the knowledge base.
     kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
-    TRAIN_DOCS = []
+    train_examples = []
     for text, annotation in TRAIN_DATA:
         with nlp.select_pipes(disable="entity_linker"):
             doc = nlp(text)
@@ -109,17 +111,17 @@ def main(kb_path, vocab_path=None, output_dir=None, n_iter=50):
                         "Removed", kb_id, "from training because it is not in the KB."
                     )
             annotation_clean["links"][offset] = new_dict
-        TRAIN_DOCS.append((doc, annotation_clean))
+        train_examples.append(Example.from_dict(doc, annotation_clean))
 
     with nlp.select_pipes(enable="entity_linker"):  # only train entity linker
         # reset and initialize the weights randomly
         optimizer = nlp.begin_training()
 
         for itn in range(n_iter):
-            random.shuffle(TRAIN_DOCS)
+            random.shuffle(train_examples)
             losses = {}
             # batch up the examples using spaCy's minibatch
-            batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))
+            batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
             for batch in batches:
                 nlp.update(
                     batch,
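For context, a hedged sketch of the annotation dictionary this script ends up passing to `Example.from_dict`: the `links` entry maps the character offsets of a mention to candidate KB identifiers with probabilities. The text, offsets and QIDs below are purely illustrative, not the script's real training data:

```python
text = "Russ Cochran was a publisher."
annotation = {
    # (start_char, end_char) of the mention -> {kb_id: probability}
    "links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}},
}
# doc = nlp(text)
# example = Example.from_dict(doc, annotation)
```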
@@ -23,6 +23,7 @@ import plac
 import random
 from pathlib import Path
 import spacy
+from spacy.gold import Example
 from spacy.util import minibatch, compounding
 
 
@@ -120,17 +121,19 @@ def main(model=None, output_dir=None, n_iter=15):
     parser = nlp.create_pipe("parser")
     nlp.add_pipe(parser, first=True)
 
+    train_examples = []
     for text, annotations in TRAIN_DATA:
+        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
         for dep in annotations.get("deps", []):
             parser.add_label(dep)
 
     with nlp.select_pipes(enable="parser"):  # only train parser
         optimizer = nlp.begin_training()
         for itn in range(n_iter):
-            random.shuffle(TRAIN_DATA)
+            random.shuffle(train_examples)
             losses = {}
             # batch up the examples using spaCy's minibatch
-            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
+            batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
             for batch in batches:
                 nlp.update(batch, sgd=optimizer, losses=losses)
             print("Losses", losses)
@@ -14,6 +14,7 @@ import plac
 import random
 from pathlib import Path
 import spacy
+from spacy.gold import Example
 from spacy.util import minibatch, compounding
 from spacy.morphology import Morphology
 
@@ -84,8 +85,10 @@ def main(lang="en", output_dir=None, n_iter=25):
     morphologizer = nlp.create_pipe("morphologizer")
     nlp.add_pipe(morphologizer)
 
-    # add labels
-    for _, annotations in TRAIN_DATA:
+    # add labels and create the Example instances
+    train_examples = []
+    for text, annotations in TRAIN_DATA:
+        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
         morph_labels = annotations.get("morphs")
         pos_labels = annotations.get("pos", [""] * len(annotations.get("morphs")))
         assert len(morph_labels) == len(pos_labels)
@@ -98,10 +101,10 @@ def main(lang="en", output_dir=None, n_iter=25):
 
     optimizer = nlp.begin_training()
     for i in range(n_iter):
-        random.shuffle(TRAIN_DATA)
+        random.shuffle(train_examples)
         losses = {}
         # batch up the examples using spaCy's minibatch
-        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
+        batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
         for batch in batches:
             nlp.update(batch, sgd=optimizer, losses=losses)
         print("Losses", losses)
@@ -17,6 +17,7 @@ import random
 import warnings
 from pathlib import Path
 import spacy
+from spacy.gold import Example
 from spacy.util import minibatch, compounding
 
 
@@ -50,8 +51,10 @@ def main(model=None, output_dir=None, n_iter=100):
     else:
         ner = nlp.get_pipe("simple_ner")
 
-    # add labels
-    for _, annotations in TRAIN_DATA:
+    # add labels and create Example objects
+    train_examples = []
+    for text, annotations in TRAIN_DATA:
+        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
         for ent in annotations.get("entities"):
             print("Add label", ent[2])
             ner.add_label(ent[2])
@@ -68,10 +71,10 @@ def main(model=None, output_dir=None, n_iter=100):
             "Transitions", list(enumerate(nlp.get_pipe("simple_ner").get_tag_names()))
         )
         for itn in range(n_iter):
-            random.shuffle(TRAIN_DATA)
+            random.shuffle(train_examples)
             losses = {}
             # batch up the examples using spaCy's minibatch
-            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
+            batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
             for batch in batches:
                 nlp.update(
                     batch,
@@ -80,6 +80,10 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
         print("Created blank 'en' model")
     # Add entity recognizer to model if it's not in the pipeline
     # nlp.create_pipe works for built-ins that are registered with spaCy
+    train_examples = []
+    for text, annotation in TRAIN_DATA:
+        train_examples.append(Example.from_dict(nlp(text), annotation))
+
     if "ner" not in nlp.pipe_names:
         ner = nlp.create_pipe("ner")
         nlp.add_pipe(ner)
@@ -102,8 +106,8 @@ def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
         sizes = compounding(1.0, 4.0, 1.001)
         # batch up the examples using spaCy's minibatch
         for itn in range(n_iter):
-            random.shuffle(TRAIN_DATA)
+            random.shuffle(train_examples)
-            batches = minibatch(TRAIN_DATA, size=sizes)
+            batches = minibatch(train_examples, size=sizes)
             losses = {}
             for batch in batches:
                 nlp.update(batch, sgd=optimizer, drop=0.35, losses=losses)
@@ -14,6 +14,7 @@ import plac
 import random
 from pathlib import Path
 import spacy
+from spacy.gold import Example
 from spacy.util import minibatch, compounding
 
 
@@ -59,18 +60,20 @@ def main(model=None, output_dir=None, n_iter=15):
     else:
         parser = nlp.get_pipe("parser")
 
-    # add labels to the parser
-    for _, annotations in TRAIN_DATA:
+    # add labels to the parser and create the Example objects
+    train_examples = []
+    for text, annotations in TRAIN_DATA:
+        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
         for dep in annotations.get("deps", []):
             parser.add_label(dep)
 
     with nlp.select_pipes(enable="parser"):  # only train parser
         optimizer = nlp.begin_training()
         for itn in range(n_iter):
-            random.shuffle(TRAIN_DATA)
+            random.shuffle(train_examples)
             losses = {}
             # batch up the examples using spaCy's minibatch
-            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
+            batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
             for batch in batches:
                 nlp.update(batch, sgd=optimizer, losses=losses)
             print("Losses", losses)
@@ -17,6 +17,7 @@ import plac
 import random
 from pathlib import Path
 import spacy
+from spacy.gold import Example
 from spacy.util import minibatch, compounding
 
 
@@ -58,12 +59,16 @@ def main(lang="en", output_dir=None, n_iter=25):
         tagger.add_label(tag, values)
     nlp.add_pipe(tagger)
 
+    train_examples = []
+    for text, annotations in TRAIN_DATA:
+        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
+
     optimizer = nlp.begin_training()
     for i in range(n_iter):
-        random.shuffle(TRAIN_DATA)
+        random.shuffle(train_examples)
         losses = {}
         # batch up the examples using spaCy's minibatch
-        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
+        batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
         for batch in batches:
             nlp.update(batch, sgd=optimizer, losses=losses)
         print("Losses", losses)
@@ -2,7 +2,7 @@ redirects = [
     # Netlify
    {from = "https://spacy.netlify.com/*", to="https://spacy.io/:splat", force = true },
     # Subdomain for branches
-    {from = "https://nightly.spacy.io/*", to="https://spacy-io-develop.spacy.io/:splat", force = true, status = 200},
+    {from = "https://nightly.spacy.io/*", to="https://nightly-spacy-io.spacy.io/:splat", force = true, status = 200},
     # Old subdomains
     {from = "https://survey.spacy.io/*", to = "https://spacy.io", force = true},
     {from = "http://survey.spacy.io/*", to = "https://spacy.io", force = true},
@@ -38,6 +38,13 @@ redirects = [
     {from = "/docs/usage/showcase", to = "/universe", force = true},
     {from = "/tutorials/load-new-word-vectors", to = "/usage/vectors-similarity#custom", force = true},
     {from = "/tutorials", to = "/usage/examples", force = true},
+    # Old documentation pages (v2.x)
+    {from = "/usage/adding-languages", to = "/usage/linguistic-features", force = true},
+    {from = "/usage/vectors-similarity", to = "/usage/vectors-embeddings", force = true},
+    {from = "/api/goldparse", to = "/api/top-level", force = true},
+    {from = "/api/goldcorpus", to = "/api/corpus", force = true},
+    {from = "/api/annotation", to = "/api/data-formats", force = true},
+    {from = "/usage/examples", to = "/usage/projects", force = true},
     # Rewrite all other docs pages to /
     {from = "/docs/*", to = "/:splat"},
     # Updated documentation pages
|
@ -6,7 +6,8 @@ requires = [
|
||||||
"cymem>=2.0.2,<2.1.0",
|
"cymem>=2.0.2,<2.1.0",
|
||||||
"preshed>=3.0.2,<3.1.0",
|
"preshed>=3.0.2,<3.1.0",
|
||||||
"murmurhash>=0.28.0,<1.1.0",
|
"murmurhash>=0.28.0,<1.1.0",
|
||||||
"thinc==8.0.0a11",
|
"thinc>=8.0.0a12,<8.0.0a20",
|
||||||
"blis>=0.4.0,<0.5.0"
|
"blis>=0.4.0,<0.5.0",
|
||||||
|
"pytokenizations"
|
||||||
]
|
]
|
||||||
build-backend = "setuptools.build_meta"
|
build-backend = "setuptools.build_meta"
|
||||||
|
|
|
@ -1,19 +1,20 @@
|
||||||
# Our libraries
|
# Our libraries
|
||||||
cymem>=2.0.2,<2.1.0
|
cymem>=2.0.2,<2.1.0
|
||||||
preshed>=3.0.2,<3.1.0
|
preshed>=3.0.2,<3.1.0
|
||||||
thinc==8.0.0a11
|
thinc>=8.0.0a12,<8.0.0a20
|
||||||
blis>=0.4.0,<0.5.0
|
blis>=0.4.0,<0.5.0
|
||||||
ml_datasets>=0.1.1
|
ml_datasets>=0.1.1
|
||||||
murmurhash>=0.28.0,<1.1.0
|
murmurhash>=0.28.0,<1.1.0
|
||||||
wasabi>=0.7.0,<1.1.0
|
wasabi>=0.7.0,<1.1.0
|
||||||
srsly>=2.1.0,<3.0.0
|
srsly>=2.1.0,<3.0.0
|
||||||
catalogue>=0.0.7,<1.1.0
|
catalogue>=0.0.7,<1.1.0
|
||||||
typer>=0.3.0,<1.0.0
|
typer>=0.3.0,<0.4.0
|
||||||
# Third party dependencies
|
# Third party dependencies
|
||||||
numpy>=1.15.0
|
numpy>=1.15.0
|
||||||
requests>=2.13.0,<3.0.0
|
requests>=2.13.0,<3.0.0
|
||||||
tqdm>=4.38.0,<5.0.0
|
tqdm>=4.38.0,<5.0.0
|
||||||
pydantic>=1.3.0,<2.0.0
|
pydantic>=1.3.0,<2.0.0
|
||||||
|
pytokenizations
|
||||||
# Official Python utilities
|
# Official Python utilities
|
||||||
setuptools
|
setuptools
|
||||||
packaging
|
packaging
|
||||||
|
|
13  setup.cfg
@@ -25,8 +25,6 @@ classifiers =
 [options]
 zip_safe = false
 include_package_data = true
-scripts =
-    bin/spacy
 python_requires = >=3.6
 setup_requires =
     wheel
@@ -36,28 +34,33 @@ setup_requires =
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
     murmurhash>=0.28.0,<1.1.0
-    thinc==8.0.0a11
+    thinc>=8.0.0a12,<8.0.0a20
 install_requires =
     # Our libraries
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc==8.0.0a11
+    thinc>=8.0.0a12,<8.0.0a20
     blis>=0.4.0,<0.5.0
     wasabi>=0.7.0,<1.1.0
     srsly>=2.1.0,<3.0.0
     catalogue>=0.0.7,<1.1.0
-    typer>=0.3.0,<1.0.0
+    typer>=0.3.0,<0.4.0
     # Third-party dependencies
     tqdm>=4.38.0,<5.0.0
     numpy>=1.15.0
     requests>=2.13.0,<3.0.0
     pydantic>=1.3.0,<2.0.0
+    pytokenizations
     # Official Python utilities
     setuptools
     packaging
     importlib_metadata>=0.20; python_version < "3.8"
 
+[options.entry_points]
+console_scripts =
+    spacy = spacy.cli:app
+
 [options.extras_require]
 lookups =
     spacy_lookups_data>=0.3.2,<0.4.0
3  setup.py
@@ -1,11 +1,11 @@
 #!/usr/bin/env python
+from setuptools import Extension, setup, find_packages
 import sys
 import platform
 from distutils.command.build_ext import build_ext
 from distutils.sysconfig import get_python_inc
 import distutils.util
 from distutils import ccompiler, msvccompiler
-from setuptools import Extension, setup, find_packages
 import numpy
 from pathlib import Path
 import shutil
@@ -23,7 +23,6 @@ Options.docstrings = True
 
 PACKAGES = find_packages()
 MOD_NAMES = [
-    "spacy.gold.align",
     "spacy.gold.example",
     "spacy.parts_of_speech",
     "spacy.strings",
@@ -25,9 +25,6 @@ config = registry
 
 
 def load(name, **overrides):
-    depr_path = overrides.get("path")
-    if depr_path not in (True, False, None):
-        warnings.warn(Warnings.W001.format(path=depr_path), DeprecationWarning)
     return util.load_model(name, **overrides)
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy-nightly"
-__version__ = "3.0.0a0"
+__version__ = "3.0.0a2"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
@@ -9,7 +9,7 @@ import sys
 from ._app import app, Arg, Opt
 from ..gold import docs_to_json
 from ..tokens import DocBin
-from ..gold.converters import iob2docs, conll_ner2docs, json2docs
+from ..gold.converters import iob2docs, conll_ner2docs, json2docs, conllu2docs
 
 
 # Converters are matched by file extension except for ner/iob, which are
@@ -18,9 +18,9 @@ from ..gold.converters import iob2docs, conll_ner2docs, json2docs
 # imported from /converters.
 
 CONVERTERS = {
-    # "conllubio": conllu2docs, TODO
-    # "conllu": conllu2docs, TODO
-    # "conll": conllu2docs, TODO
+    "conllubio": conllu2docs,
+    "conllu": conllu2docs,
+    "conll": conllu2docs,
     "ner": conll_ner2docs,
     "iob": iob2docs,
     "json": json2docs,
@@ -28,7 +28,7 @@ CONVERTERS = {
 
 
 # File types that can be written to stdout
-FILE_TYPES_STDOUT = ("json")
+FILE_TYPES_STDOUT = ("json",)
 
 
 class FileTypes(str, Enum):
@@ -48,7 +48,7 @@ def convert_cli(
     morphology: bool = Opt(False, "--morphology", "-m", help="Enable appending morphology to tags"),
     merge_subtokens: bool = Opt(False, "--merge-subtokens", "-T", help="Merge CoNLL-U subtokens"),
     converter: str = Opt("auto", "--converter", "-c", help=f"Converter: {tuple(CONVERTERS.keys())}"),
-    ner_map: Optional[Path] = Opt(None, "--ner-map", "-N", help="NER tag mapping (as JSON-encoded dict of entity types)", exists=True),
+    ner_map: Optional[Path] = Opt(None, "--ner-map", "-nm", help="NER tag mapping (as JSON-encoded dict of entity types)", exists=True),
     lang: Optional[str] = Opt(None, "--lang", "-l", help="Language (if tokenizer required)"),
     # fmt: on
 ):
@@ -86,20 +86,20 @@ def convert_cli(
 
 
 def convert(
     input_path: Path,
     output_dir: Path,
     *,
     file_type: str = "json",
     n_sents: int = 1,
     seg_sents: bool = False,
     model: Optional[str] = None,
     morphology: bool = False,
     merge_subtokens: bool = False,
     converter: str = "auto",
     ner_map: Optional[Path] = None,
     lang: Optional[str] = None,
     silent: bool = True,
     msg: Optional[Path] = None,
 ) -> None:
     if not msg:
         msg = Printer(no_print=silent)
@@ -135,21 +135,21 @@ def convert(
 
 
 def _print_docs_to_stdout(docs, output_type):
     if output_type == "json":
-        srsly.write_json("-", docs_to_json(docs))
+        srsly.write_json("-", [docs_to_json(docs)])
     else:
-        sys.stdout.buffer.write(DocBin(docs=docs).to_bytes())
+        sys.stdout.buffer.write(DocBin(docs=docs, store_user_data=True).to_bytes())
 
 
 def _write_docs_to_file(docs, output_file, output_type):
     if not output_file.parent.exists():
         output_file.parent.mkdir(parents=True)
     if output_type == "json":
-        srsly.write_json(output_file, docs_to_json(docs))
+        srsly.write_json(output_file, [docs_to_json(docs)])
     else:
-        data = DocBin(docs=docs).to_bytes()
+        data = DocBin(docs=docs, store_user_data=True).to_bytes()
         with output_file.open("wb") as file_:
             file_.write(data)
 
 
 def autodetect_ner_format(input_data: str) -> str:
     # guess format from the first 20 lines
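A hedged sketch of driving the converter from Python using only parameters visible in the `convert()` signature above; the paths are illustrative and the module path is assumed from this diff's file layout:

```python
from pathlib import Path

from spacy.cli.convert import convert  # module path assumed

# Convert a CoNLL-U corpus into spaCy's JSON training format, grouping ten
# sentences per output document, via the newly enabled "conllu" converter.
convert(
    Path("corpus/train.conllu"),
    Path("corpus/converted"),
    file_type="json",
    n_sents=10,
    converter="conllu",
    silent=False,
)
```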
@@ -118,7 +118,9 @@ def debug_data(
 
     # Create all gold data here to avoid iterating over the train_dataset constantly
     gold_train_data = _compile_gold(train_dataset, pipeline, nlp, make_proj=True)
-    gold_train_unpreprocessed_data = _compile_gold(train_dataset, pipeline, nlp, make_proj=False)
+    gold_train_unpreprocessed_data = _compile_gold(
+        train_dataset, pipeline, nlp, make_proj=False
+    )
     gold_dev_data = _compile_gold(dev_dataset, pipeline, nlp, make_proj=True)
 
     train_texts = gold_train_data["texts"]
@@ -16,7 +16,7 @@ from ..util import is_package, get_base_version, run_command
 def download_cli(
     # fmt: off
     ctx: typer.Context,
-    model: str = Arg(..., help="Model to download (shortcut or name)"),
+    model: str = Arg(..., help="Name of model to download"),
     direct: bool = Opt(False, "--direct", "-d", "-D", help="Force direct download of name + version"),
     # fmt: on
 ):
@@ -4,6 +4,7 @@ from wasabi import Printer
 from pathlib import Path
 import re
 import srsly
+from thinc.api import require_gpu, fix_random_seed
 
 from ..gold import Corpus
 from ..tokens import Doc
@@ -52,9 +53,9 @@ def evaluate(
     silent: bool = True,
 ) -> Scorer:
     msg = Printer(no_print=silent, pretty=not silent)
-    util.fix_random_seed()
+    fix_random_seed()
     if gpu_id >= 0:
-        util.use_gpu(gpu_id)
+        require_gpu(gpu_id)
     util.set_env_log(False)
     data_path = util.ensure_path(data_path)
     output_path = util.ensure_path(output)
@@ -37,7 +37,7 @@ def init_model_cli(
     clusters_loc: Optional[Path] = Opt(None, "--clusters-loc", "-c", help="Optional location of brown clusters data", exists=True),
     jsonl_loc: Optional[Path] = Opt(None, "--jsonl-loc", "-j", help="Location of JSONL-formatted attributes file", exists=True),
     vectors_loc: Optional[Path] = Opt(None, "--vectors-loc", "-v", help="Optional vectors file in Word2Vec format", exists=True),
-    prune_vectors: int = Opt(-1 , "--prune-vectors", "-V", help="Optional number of vectors to prune to"),
+    prune_vectors: int = Opt(-1, "--prune-vectors", "-V", help="Optional number of vectors to prune to"),
     truncate_vectors: int = Opt(0, "--truncate-vectors", "-t", help="Optional number of vectors to truncate to when reading in vectors file"),
     vectors_name: Optional[str] = Opt(None, "--vectors-name", "-vn", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"),
     model_name: Optional[str] = Opt(None, "--model-name", "-mn", help="Optional name for the model meta"),
@@ -56,6 +56,7 @@ def init_model_cli(
         freqs_loc=freqs_loc,
         clusters_loc=clusters_loc,
         jsonl_loc=jsonl_loc,
+        vectors_loc=vectors_loc,
         prune_vectors=prune_vectors,
         truncate_vectors=truncate_vectors,
         vectors_name=vectors_name,
@@ -228,7 +229,9 @@ def add_vectors(
     else:
         if vectors_loc:
             with msg.loading(f"Reading vectors from {vectors_loc}"):
-                vectors_data, vector_keys = read_vectors(msg, vectors_loc)
+                vectors_data, vector_keys = read_vectors(
+                    msg, vectors_loc, truncate_vectors
+                )
             msg.good(f"Loaded vectors from {vectors_loc}")
         else:
             vectors_data, vector_keys = (None, None)
@@ -247,7 +250,7 @@ def add_vectors(
     nlp.vocab.prune_vectors(prune_vectors)
 
 
-def read_vectors(msg: Printer, vectors_loc: Path, truncate_vectors: int = 0):
+def read_vectors(msg: Printer, vectors_loc: Path, truncate_vectors: int):
     f = open_file(vectors_loc)
     shape = tuple(int(size) for size in next(f).split())
     if truncate_vectors >= 1:
@ -5,24 +5,26 @@ import time
|
||||||
import re
|
import re
|
||||||
from collections import Counter
|
from collections import Counter
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from thinc.api import Linear, Maxout, chain, list2array, use_pytorch_for_gpu_memory
|
from thinc.api import use_pytorch_for_gpu_memory, require_gpu
|
||||||
|
from thinc.api import set_dropout_rate, to_categorical, fix_random_seed
|
||||||
|
from thinc.api import CosineDistance, L2Distance
|
||||||
from wasabi import msg
|
from wasabi import msg
|
||||||
import srsly
|
import srsly
|
||||||
|
from functools import partial
|
||||||
|
|
||||||
from ._app import app, Arg, Opt
|
from ._app import app, Arg, Opt
|
||||||
from ..errors import Errors
|
from ..errors import Errors
|
||||||
from ..ml.models.multi_task import build_masked_language_model
|
from ..ml.models.multi_task import build_cloze_multi_task_model
|
||||||
|
from ..ml.models.multi_task import build_cloze_characters_multi_task_model
|
||||||
from ..tokens import Doc
|
from ..tokens import Doc
|
||||||
from ..attrs import ID, HEAD
|
from ..attrs import ID, HEAD
|
||||||
from .. import util
|
from .. import util
|
||||||
from ..gold import Example
|
|
||||||
|
|
||||||
|
|
||||||
@app.command("pretrain")
|
@app.command("pretrain")
|
||||||
def pretrain_cli(
|
def pretrain_cli(
|
||||||
# fmt: off
|
# fmt: off
|
||||||
texts_loc: Path = Arg(..., help="Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the key 'tokens'", exists=True),
|
texts_loc: Path = Arg(..., help="Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the key 'tokens'", exists=True),
|
||||||
vectors_model: str = Arg(..., help="Name or path to spaCy model with vectors to learn from"),
|
|
||||||
output_dir: Path = Arg(..., help="Directory to write models to on each epoch"),
|
output_dir: Path = Arg(..., help="Directory to write models to on each epoch"),
|
||||||
config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False),
|
config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False),
|
||||||
use_gpu: int = Opt(-1, "--use-gpu", "-g", help="Use GPU"),
|
use_gpu: int = Opt(-1, "--use-gpu", "-g", help="Use GPU"),
|
||||||
|
@ -32,11 +34,15 @@ def pretrain_cli(
|
||||||
):
|
):
|
||||||
"""
|
"""
|
||||||
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
|
Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components,
|
||||||
using an approximate language-modelling objective. Specifically, we load
|
using an approximate language-modelling objective. Two objective types
|
||||||
pretrained vectors, and train a component like a CNN, BiLSTM, etc to predict
|
are available, vector-based and character-based.
|
||||||
vectors which match the pretrained ones. The weights are saved to a directory
|
|
||||||
after each epoch. You can then pass a path to one of these pretrained weights
|
In the vector-based objective, we load word vectors that have been trained
|
||||||
files to the 'spacy train' command.
|
using a word2vec-style distributional similarity algorithm, and train a
|
||||||
|
component like a CNN, BiLSTM, etc to predict vectors which match the
|
||||||
|
pretrained ones. The weights are saved to a directory after each epoch. You
|
||||||
|
can then pass a path to one of these pretrained weights files to the
|
||||||
|
'spacy train' command.
|
||||||
|
|
||||||
This technique may be especially helpful if you have little labelled data.
|
This technique may be especially helpful if you have little labelled data.
|
||||||
However, it's still quite experimental, so your mileage may vary.
|
However, it's still quite experimental, so your mileage may vary.
|
||||||
|
@ -47,7 +53,6 @@ def pretrain_cli(
|
||||||
"""
|
"""
|
||||||
pretrain(
|
pretrain(
|
||||||
texts_loc,
|
texts_loc,
|
||||||
vectors_model,
|
|
||||||
output_dir,
|
output_dir,
|
||||||
config_path,
|
config_path,
|
||||||
use_gpu=use_gpu,
|
use_gpu=use_gpu,
|
||||||
|
@ -58,101 +63,55 @@ def pretrain_cli(
|
||||||
|
|
||||||
def pretrain(
|
def pretrain(
|
||||||
texts_loc: Path,
|
texts_loc: Path,
|
||||||
vectors_model: str,
|
|
||||||
output_dir: Path,
|
output_dir: Path,
|
||||||
config_path: Path,
|
config_path: Path,
|
||||||
use_gpu: int = -1,
|
use_gpu: int = -1,
|
||||||
resume_path: Optional[Path] = None,
|
resume_path: Optional[Path] = None,
|
||||||
epoch_resume: Optional[int] = None,
|
epoch_resume: Optional[int] = None,
|
||||||
):
|
):
|
||||||
if not config_path or not config_path.exists():
|
verify_cli_args(**locals())
|
||||||
msg.fail("Config file not found", config_path, exits=1)
|
if not output_dir.exists():
|
||||||
|
output_dir.mkdir()
|
||||||
|
msg.good(f"Created output directory: {output_dir}")
|
||||||
|
|
||||||
if use_gpu >= 0:
|
if use_gpu >= 0:
|
||||||
msg.info("Using GPU")
|
msg.info("Using GPU")
|
||||||
util.use_gpu(use_gpu)
|
require_gpu(use_gpu)
|
||||||
else:
|
else:
|
||||||
msg.info("Using CPU")
|
msg.info("Using CPU")
|
||||||
|
|
||||||
msg.info(f"Loading config from: {config_path}")
|
msg.info(f"Loading config from: {config_path}")
|
||||||
config = util.load_config(config_path, create_objects=False)
|
config = util.load_config(config_path, create_objects=False)
|
||||||
util.fix_random_seed(config["pretraining"]["seed"])
|
fix_random_seed(config["pretraining"]["seed"])
|
||||||
if config["pretraining"]["use_pytorch_for_gpu_memory"]:
|
if use_gpu >= 0 and config["pretraining"]["use_pytorch_for_gpu_memory"]:
|
||||||
use_pytorch_for_gpu_memory()
|
use_pytorch_for_gpu_memory()
|
||||||
|
|
||||||
if output_dir.exists() and [p for p in output_dir.iterdir()]:
|
nlp_config = config["nlp"]
|
||||||
if resume_path:
|
|
||||||
msg.warn(
|
|
||||||
"Output directory is not empty. ",
|
|
||||||
"If you're resuming a run from a previous model in this directory, "
|
|
||||||
"the old models for the consecutive epochs will be overwritten "
|
|
||||||
"with the new ones.",
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
msg.warn(
|
|
||||||
"Output directory is not empty. ",
|
|
||||||
"It is better to use an empty directory or refer to a new output path, "
|
|
||||||
"then the new directory will be created for you.",
|
|
||||||
)
|
|
||||||
if not output_dir.exists():
|
|
||||||
output_dir.mkdir()
|
|
||||||
msg.good(f"Created output directory: {output_dir}")
|
|
||||||
srsly.write_json(output_dir / "config.json", config)
|
srsly.write_json(output_dir / "config.json", config)
|
||||||
msg.good("Saved config file in the output directory")
|
msg.good("Saved config file in the output directory")
|
||||||
|
|
||||||
config = util.load_config(config_path, create_objects=True)
|
config = util.load_config(config_path, create_objects=True)
|
||||||
|
nlp = util.load_model_from_config(nlp_config)
|
||||||
pretrain_config = config["pretraining"]
|
pretrain_config = config["pretraining"]
|
||||||
|
|
||||||
# Load texts from file or stdin
|
|
||||||
if texts_loc != "-": # reading from a file
|
if texts_loc != "-": # reading from a file
|
||||||
texts_loc = Path(texts_loc)
|
|
||||||
if not texts_loc.exists():
|
|
||||||
msg.fail("Input text file doesn't exist", texts_loc, exits=1)
|
|
||||||
with msg.loading("Loading input texts..."):
|
with msg.loading("Loading input texts..."):
|
||||||
texts = list(srsly.read_jsonl(texts_loc))
|
texts = list(srsly.read_jsonl(texts_loc))
|
||||||
if not texts:
|
|
||||||
msg.fail("Input file is empty", texts_loc, exits=1)
|
|
||||||
msg.good("Loaded input texts")
|
|
||||||
random.shuffle(texts)
|
random.shuffle(texts)
|
||||||
else: # reading from stdin
|
else: # reading from stdin
|
||||||
msg.info("Reading input text from stdin...")
|
msg.info("Reading input text from stdin...")
|
||||||
texts = srsly.read_jsonl("-")
|
texts = srsly.read_jsonl("-")
|
||||||
|
|
||||||
with msg.loading(f"Loading model '{vectors_model}'..."):
|
|
||||||
nlp = util.load_model(vectors_model)
|
|
||||||
msg.good(f"Loaded model '{vectors_model}'")
|
|
||||||
tok2vec_path = pretrain_config["tok2vec_model"]
|
tok2vec_path = pretrain_config["tok2vec_model"]
|
||||||
tok2vec = config
|
tok2vec = config
|
||||||
for subpath in tok2vec_path.split("."):
|
for subpath in tok2vec_path.split("."):
|
||||||
tok2vec = tok2vec.get(subpath)
|
tok2vec = tok2vec.get(subpath)
|
||||||
model = create_pretraining_model(nlp, tok2vec)
|
model = create_pretraining_model(nlp, tok2vec, pretrain_config)
|
||||||
optimizer = pretrain_config["optimizer"]
|
optimizer = pretrain_config["optimizer"]
|
||||||
|
|
||||||
# Load in pretrained weights to resume from
|
# Load in pretrained weights to resume from
|
||||||
if resume_path is not None:
|
if resume_path is not None:
|
||||||
msg.info(f"Resume training tok2vec from: {resume_path}")
|
_resume_model(model, resume_path, epoch_resume)
|
||||||
with resume_path.open("rb") as file_:
|
|
||||||
weights_data = file_.read()
|
|
||||||
model.get_ref("tok2vec").from_bytes(weights_data)
|
|
||||||
# Parse the epoch number from the given weight file
|
|
||||||
model_name = re.search(r"model\d+\.bin", str(resume_path))
|
|
||||||
if model_name:
|
|
||||||
# Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
|
|
||||||
epoch_resume = int(model_name.group(0)[5:][:-4]) + 1
|
|
||||||
msg.info(f"Resuming from epoch: {epoch_resume}")
|
|
||||||
else:
|
|
||||||
if not epoch_resume:
|
|
||||||
msg.fail(
|
|
||||||
"You have to use the --epoch-resume setting when using a renamed weight file for --resume-path",
|
|
||||||
exits=True,
|
|
||||||
)
|
|
||||||
elif epoch_resume < 0:
|
|
||||||
msg.fail(
|
|
||||||
f"The argument --epoch-resume has to be greater or equal to 0. {epoch_resume} is invalid",
|
|
||||||
exits=True,
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
msg.info(f"Resuming from epoch: {epoch_resume}")
|
|
||||||
else:
|
else:
|
||||||
# Without '--resume-path' the '--epoch-resume' argument is ignored
|
# Without '--resume-path' the '--epoch-resume' argument is ignored
|
||||||
epoch_resume = 0
|
epoch_resume = 0
|
||||||
|
@ -177,18 +136,18 @@ def pretrain(
|
||||||
file_.write(srsly.json_dumps(log) + "\n")
|
file_.write(srsly.json_dumps(log) + "\n")
|
||||||
|
|
||||||
skip_counter = 0
|
skip_counter = 0
|
||||||
loss_func = pretrain_config["loss_func"]
|
objective = create_objective(pretrain_config["objective"])
|
||||||
for epoch in range(epoch_resume, pretrain_config["max_epochs"]):
|
for epoch in range(epoch_resume, pretrain_config["max_epochs"]):
|
||||||
batches = util.minibatch_by_words(texts, size=pretrain_config["batch_size"])
|
batches = util.minibatch_by_words(texts, size=pretrain_config["batch_size"])
|
||||||
for batch_id, batch in enumerate(batches):
|
for batch_id, batch in enumerate(batches):
|
||||||
docs, count = make_docs(
|
docs, count = make_docs(
|
||||||
nlp,
|
nlp,
|
||||||
[ex.doc for ex in batch],
|
batch,
|
||||||
max_length=pretrain_config["max_length"],
|
max_length=pretrain_config["max_length"],
|
||||||
min_length=pretrain_config["min_length"],
|
min_length=pretrain_config["min_length"],
|
||||||
)
|
)
|
||||||
skip_counter += count
|
skip_counter += count
|
||||||
loss = make_update(model, docs, optimizer, distance=loss_func)
|
loss = make_update(model, docs, optimizer, objective)
|
||||||
progress = tracker.update(epoch, loss, docs)
|
progress = tracker.update(epoch, loss, docs)
|
||||||
if progress:
|
if progress:
|
||||||
msg.row(progress, **row_settings)
|
msg.row(progress, **row_settings)
|
||||||
@@ -208,7 +167,22 @@ def pretrain(
    msg.good("Successfully finished pretrain")


-def make_update(model, docs, optimizer, distance):
+def _resume_model(model, resume_path, epoch_resume):
+    msg.info(f"Resume training tok2vec from: {resume_path}")
+    with resume_path.open("rb") as file_:
+        weights_data = file_.read()
+        model.get_ref("tok2vec").from_bytes(weights_data)
+    # Parse the epoch number from the given weight file
+    model_name = re.search(r"model\d+\.bin", str(resume_path))
+    if model_name:
+        # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin'
+        epoch_resume = int(model_name.group(0)[5:][:-4]) + 1
+        msg.info(f"Resuming from epoch: {epoch_resume}")
+    else:
+        msg.info(f"Resuming from epoch: {epoch_resume}")
+
+
+def make_update(model, docs, optimizer, objective_func):
    """Perform an update over a single batch of documents.

    docs (iterable): A batch of `Doc` objects.
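As a minimal sketch (not part of this commit) of the epoch-number parsing used in `_resume_model` above, the slicing simply drops the leading `model` and the trailing `.bin`; the filename below is invented:

    import re

    resume_path = "pretrain_output/model12.bin"  # hypothetical default weight file
    match = re.search(r"model\d+\.bin", resume_path)
    if match:
        # "model12.bin"[5:] -> "12.bin"; [:-4] -> "12"; resume with the next epoch
        epoch_resume = int(match.group(0)[5:][:-4]) + 1  # -> 13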
@@ -216,7 +190,7 @@ def make_update(model, docs, optimizer, distance):
    RETURNS loss: A float for the loss.
    """
    predictions, backprop = model.begin_update(docs)
-    loss, gradients = get_vectors_loss(model.ops, docs, predictions, distance)
+    loss, gradients = objective_func(model.ops, docs, predictions)
    backprop(gradients)
    model.finish_update(optimizer)
    # Don't want to return a cupy object here
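The predict / loss / backprop / finish_update pattern in `make_update` is the usual Thinc update cycle. A self-contained sketch with a toy `Linear` model standing in for the pretraining network (shapes and data are made up, not spaCy's):

    import numpy
    from thinc.api import Adam, Linear

    model = Linear(nO=2, nI=4)
    model.initialize(X=numpy.zeros((3, 4), dtype="f"))
    optimizer = Adam(0.001)

    X = numpy.random.uniform(-1, 1, (3, 4)).astype("f")
    Y = numpy.zeros((3, 2), dtype="f")

    predictions, backprop = model.begin_update(X)
    gradients = (predictions - Y) / predictions.shape[0]  # stand-in objective
    loss = float(((predictions - Y) ** 2).sum())
    backprop(gradients)
    model.finish_update(optimizer)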
@@ -255,13 +229,38 @@ def make_docs(nlp, batch, min_length, max_length):
    return docs, skip_count


-def get_vectors_loss(ops, docs, prediction, distance):
-    """Compute a mean-squared error loss between the documents' vectors and
-    the prediction.
+def create_objective(config):
+    """Create the objective for pretraining.

-    Note that this is ripe for customization! We could compute the vectors
-    in some other word, e.g. with an LSTM language model, or use some other
-    type of objective.
+    We'd like to replace this with a registry function but it's tricky because
+    we're also making a model choice based on this. For now we hard-code support
+    for two types (characters, vectors). For characters you can specify
+    n_characters, for vectors you can specify the loss.
+
+    Bleh.
+    """
+    objective_type = config["type"]
+    if objective_type == "characters":
+        return partial(get_characters_loss, nr_char=config["n_characters"])
+    elif objective_type == "vectors":
+        if config["loss"] == "cosine":
+            return partial(
+                get_vectors_loss,
+                distance=CosineDistance(normalize=True, ignore_zeros=True),
+            )
+        elif config["loss"] == "L2":
+            return partial(
+                get_vectors_loss, distance=L2Distance(normalize=True, ignore_zeros=True)
+            )
+        else:
+            raise ValueError("Unexpected loss type", config["loss"])
+    else:
+        raise ValueError("Unexpected objective_type", objective_type)
+
+
+def get_vectors_loss(ops, docs, prediction, distance):
+    """Compute a loss based on a distance between the documents' vectors and
+    the prediction.
    """
    # The simplest way to implement this would be to vstack the
    # token.vector values, but that's a bit inefficient, especially on GPU.
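A hedged sketch of the two objective config shapes that `create_objective` dispatches on; the key names mirror the lookups above, the values are only examples:

    # [pretraining.objective] with type "characters" or "vectors"
    char_objective = {"type": "characters", "n_characters": 4}
    vec_objective = {"type": "vectors", "loss": "cosine"}  # or "L2"

    # create_objective(char_objective) -> partial(get_characters_loss, nr_char=4)
    # create_objective(vec_objective)  -> partial(get_vectors_loss, distance=CosineDistance(...))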
@@ -273,7 +272,19 @@ def get_vectors_loss(ops, docs, prediction, distance):
    return loss, d_target


-def create_pretraining_model(nlp, tok2vec):
+def get_characters_loss(ops, docs, prediction, nr_char):
+    """Compute a loss based on a number of characters predicted from the docs."""
+    target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs])
+    target_ids = target_ids.reshape((-1,))
+    target = ops.asarray(to_categorical(target_ids, n_classes=256), dtype="f")
+    target = target.reshape((-1, 256 * nr_char))
+    diff = prediction - target
+    loss = (diff ** 2).sum()
+    d_target = diff / float(prediction.shape[0])
+    return loss, d_target
+
+
+def create_pretraining_model(nlp, tok2vec, pretrain_config):
    """Define a network for the pretraining. We simply add an output layer onto
    the tok2vec input model. The tok2vec input model needs to be a model that
    takes a batch of Doc objects (as a list), and returns a list of arrays.
@ -281,18 +292,24 @@ def create_pretraining_model(nlp, tok2vec):
|
||||||
The actual tok2vec layer is stored as a reference, and only this bit will be
|
The actual tok2vec layer is stored as a reference, and only this bit will be
|
||||||
serialized to file and read back in when calling the 'train' command.
|
serialized to file and read back in when calling the 'train' command.
|
||||||
"""
|
"""
|
||||||
output_size = nlp.vocab.vectors.data.shape[1]
|
# TODO
|
||||||
output_layer = chain(
|
maxout_pieces = 3
|
||||||
Maxout(nO=300, nP=3, normalize=True, dropout=0.0), Linear(output_size)
|
hidden_size = 300
|
||||||
)
|
if pretrain_config["objective"]["type"] == "vectors":
|
||||||
model = chain(tok2vec, list2array())
|
model = build_cloze_multi_task_model(
|
||||||
model = chain(model, output_layer)
|
nlp.vocab, tok2vec, hidden_size=hidden_size, maxout_pieces=maxout_pieces
|
||||||
|
)
|
||||||
|
elif pretrain_config["objective"]["type"] == "characters":
|
||||||
|
model = build_cloze_characters_multi_task_model(
|
||||||
|
nlp.vocab,
|
||||||
|
tok2vec,
|
||||||
|
hidden_size=hidden_size,
|
||||||
|
maxout_pieces=maxout_pieces,
|
||||||
|
nr_char=pretrain_config["objective"]["n_characters"],
|
||||||
|
)
|
||||||
model.initialize(X=[nlp.make_doc("Give it a doc to infer shapes")])
|
model.initialize(X=[nlp.make_doc("Give it a doc to infer shapes")])
|
||||||
mlm_model = build_masked_language_model(nlp.vocab, model)
|
set_dropout_rate(model, pretrain_config["dropout"])
|
||||||
mlm_model.set_ref("tok2vec", tok2vec)
|
return model
|
||||||
mlm_model.set_ref("output_layer", output_layer)
|
|
||||||
mlm_model.initialize(X=[nlp.make_doc("Give it a doc to infer shapes")])
|
|
||||||
return mlm_model
|
|
||||||
|
|
||||||
|
|
||||||
class ProgressTracker(object):
|
class ProgressTracker(object):
|
||||||
|
@ -341,3 +358,53 @@ def _smart_round(figure, width=10, max_decimal=4):
|
||||||
n_decimal = min(n_decimal, max_decimal)
|
n_decimal = min(n_decimal, max_decimal)
|
||||||
format_str = "%." + str(n_decimal) + "f"
|
format_str = "%." + str(n_decimal) + "f"
|
||||||
return format_str % figure
|
return format_str % figure
|
||||||
|
|
||||||
|
|
||||||
|
def verify_cli_args(
|
||||||
|
texts_loc, output_dir, config_path, use_gpu, resume_path, epoch_resume
|
||||||
|
):
|
||||||
|
if not config_path or not config_path.exists():
|
||||||
|
msg.fail("Config file not found", config_path, exits=1)
|
||||||
|
if output_dir.exists() and [p for p in output_dir.iterdir()]:
|
||||||
|
if resume_path:
|
||||||
|
msg.warn(
|
||||||
|
"Output directory is not empty. ",
|
||||||
|
"If you're resuming a run from a previous model in this directory, "
|
||||||
|
"the old models for the consecutive epochs will be overwritten "
|
||||||
|
"with the new ones.",
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
msg.warn(
|
||||||
|
"Output directory is not empty. ",
|
||||||
|
"It is better to use an empty directory or refer to a new output path, "
|
||||||
|
"then the new directory will be created for you.",
|
||||||
|
)
|
||||||
|
if texts_loc != "-": # reading from a file
|
||||||
|
texts_loc = Path(texts_loc)
|
||||||
|
if not texts_loc.exists():
|
||||||
|
msg.fail("Input text file doesn't exist", texts_loc, exits=1)
|
||||||
|
|
||||||
|
for text in srsly.read_jsonl(texts_loc):
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
msg.fail("Input file is empty", texts_loc, exits=1)
|
||||||
|
|
||||||
|
if resume_path is not None:
|
||||||
|
model_name = re.search(r"model\d+\.bin", str(resume_path))
|
||||||
|
if not model_name and not epoch_resume:
|
||||||
|
msg.fail(
|
||||||
|
"You have to use the --epoch-resume setting when using a renamed weight file for --resume-path",
|
||||||
|
exits=True,
|
||||||
|
)
|
||||||
|
elif not model_name and epoch_resume < 0:
|
||||||
|
msg.fail(
|
||||||
|
f"The argument --epoch-resume has to be greater or equal to 0. {epoch_resume} is invalid",
|
||||||
|
exits=True,
|
||||||
|
)
|
||||||
|
config = util.load_config(config_path, create_objects=False)
|
||||||
|
if config["pretraining"]["objective"]["type"] == "vectors":
|
||||||
|
if not config["nlp"]["vectors"]:
|
||||||
|
msg.fail(
|
||||||
|
"Must specify nlp.vectors if pretraining.objective.type is vectors",
|
||||||
|
exits=True,
|
||||||
|
)
|
||||||
|
|
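The `for ... break / else` block in `verify_cli_args` relies on Python's loop-else: the `else` branch only runs if the loop never broke, i.e. the JSONL file yielded no records. A minimal sketch, with a hypothetical path:

    import srsly

    texts_loc = "texts.jsonl"  # hypothetical input file
    for text in srsly.read_jsonl(texts_loc):
        break
    else:
        print("Input file is empty")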
@@ -31,17 +31,20 @@ def profile_cli(


def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) -> None:
-    try:
-        import ml_datasets
-    except ImportError:
-        msg.fail(
-            "This command requires the ml_datasets library to be installed:"
-            "pip install ml_datasets",
-            exits=1,
-        )
    if inputs is not None:
        inputs = _read_inputs(inputs, msg)
    if inputs is None:
+        try:
+            import ml_datasets
+        except ImportError:
+            msg.fail(
+                "This command, when run without an input file, "
+                "requires the ml_datasets library to be installed: "
+                "pip install ml_datasets",
+                exits=1,
+            )
        n_inputs = 25000
        with msg.loading("Loading IMDB dataset via Thinc..."):
            imdb_train, _ = ml_datasets.imdb()
@ -1,6 +1,5 @@
|
||||||
from typing import Optional, Dict, List, Union, Sequence
|
from typing import Optional, Dict, List, Union, Sequence
|
||||||
from timeit import default_timer as timer
|
from timeit import default_timer as timer
|
||||||
|
|
||||||
import srsly
|
import srsly
|
||||||
import tqdm
|
import tqdm
|
||||||
from pydantic import BaseModel, FilePath
|
from pydantic import BaseModel, FilePath
|
||||||
|
@ -8,11 +7,11 @@ from pathlib import Path
|
||||||
from wasabi import msg
|
from wasabi import msg
|
||||||
import thinc
|
import thinc
|
||||||
import thinc.schedules
|
import thinc.schedules
|
||||||
from thinc.api import Model, use_pytorch_for_gpu_memory
|
from thinc.api import Model, use_pytorch_for_gpu_memory, require_gpu, fix_random_seed
|
||||||
import random
|
import random
|
||||||
|
|
||||||
from ._app import app, Arg, Opt
|
from ._app import app, Arg, Opt
|
||||||
from ..gold import Corpus
|
from ..gold import Corpus, Example
|
||||||
from ..lookups import Lookups
|
from ..lookups import Lookups
|
||||||
from .. import util
|
from .. import util
|
||||||
from ..errors import Errors
|
from ..errors import Errors
|
||||||
|
@ -125,7 +124,7 @@ def train_cli(
|
||||||
train_path: Path = Arg(..., help="Location of JSON-formatted training data", exists=True),
|
train_path: Path = Arg(..., help="Location of JSON-formatted training data", exists=True),
|
||||||
dev_path: Path = Arg(..., help="Location of JSON-formatted development data", exists=True),
|
dev_path: Path = Arg(..., help="Location of JSON-formatted development data", exists=True),
|
||||||
config_path: Path = Arg(..., help="Path to config file", exists=True),
|
config_path: Path = Arg(..., help="Path to config file", exists=True),
|
||||||
output_path: Optional[Path] = Opt(None, "--output-path", "-o", help="Output directory to store model in"),
|
output_path: Optional[Path] = Opt(None, "--output", "--output-path", "-o", help="Output directory to store model in"),
|
||||||
code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
|
code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
|
||||||
init_tok2vec: Optional[Path] = Opt(None, "--init-tok2vec", "-t2v", help="Path to pretrained weights for the tok2vec components. See 'spacy pretrain'. Experimental."),
|
init_tok2vec: Optional[Path] = Opt(None, "--init-tok2vec", "-t2v", help="Path to pretrained weights for the tok2vec components. See 'spacy pretrain'. Experimental."),
|
||||||
raw_text: Optional[Path] = Opt(None, "--raw-text", "-rt", help="Path to jsonl file with unlabelled text documents."),
|
raw_text: Optional[Path] = Opt(None, "--raw-text", "-rt", help="Path to jsonl file with unlabelled text documents."),
|
||||||
|
@ -156,7 +155,6 @@ def train_cli(
|
||||||
if init_tok2vec is not None:
|
if init_tok2vec is not None:
|
||||||
with init_tok2vec.open("rb") as file_:
|
with init_tok2vec.open("rb") as file_:
|
||||||
weights_data = file_.read()
|
weights_data = file_.read()
|
||||||
|
|
||||||
train_args = dict(
|
train_args = dict(
|
||||||
config_path=config_path,
|
config_path=config_path,
|
||||||
data_paths={"train": train_path, "dev": dev_path},
|
data_paths={"train": train_path, "dev": dev_path},
|
||||||
|
@ -193,7 +191,7 @@ def train(
|
||||||
msg.info(f"Loading config from: {config_path}")
|
msg.info(f"Loading config from: {config_path}")
|
||||||
# Read the config first without creating objects, to get to the original nlp_config
|
# Read the config first without creating objects, to get to the original nlp_config
|
||||||
config = util.load_config(config_path, create_objects=False)
|
config = util.load_config(config_path, create_objects=False)
|
||||||
util.fix_random_seed(config["training"]["seed"])
|
fix_random_seed(config["training"]["seed"])
|
||||||
if config["training"].get("use_pytorch_for_gpu_memory"):
|
if config["training"].get("use_pytorch_for_gpu_memory"):
|
||||||
# It feels kind of weird to not have a default for this.
|
# It feels kind of weird to not have a default for this.
|
||||||
use_pytorch_for_gpu_memory()
|
use_pytorch_for_gpu_memory()
|
||||||
|
@ -216,11 +214,11 @@ def train(
|
||||||
nlp.resume_training()
|
nlp.resume_training()
|
||||||
else:
|
else:
|
||||||
msg.info(f"Initializing the nlp pipeline: {nlp.pipe_names}")
|
msg.info(f"Initializing the nlp pipeline: {nlp.pipe_names}")
|
||||||
train_examples = list(corpus.train_dataset(
|
train_examples = list(
|
||||||
nlp,
|
corpus.train_dataset(
|
||||||
shuffle=False,
|
nlp, shuffle=False, gold_preproc=training["gold_preproc"]
|
||||||
gold_preproc=training["gold_preproc"]
|
)
|
||||||
))
|
)
|
||||||
nlp.begin_training(lambda: train_examples)
|
nlp.begin_training(lambda: train_examples)
|
||||||
|
|
||||||
# Update tag map with provided mapping
|
# Update tag map with provided mapping
|
||||||
|
@ -307,12 +305,14 @@ def train(
|
||||||
|
|
||||||
def create_train_batches(nlp, corpus, cfg, randomization_index):
|
def create_train_batches(nlp, corpus, cfg, randomization_index):
|
||||||
max_epochs = cfg.get("max_epochs", 0)
|
max_epochs = cfg.get("max_epochs", 0)
|
||||||
train_examples = list(corpus.train_dataset(
|
train_examples = list(
|
||||||
nlp,
|
corpus.train_dataset(
|
||||||
shuffle=True,
|
nlp,
|
||||||
gold_preproc=cfg["gold_preproc"],
|
shuffle=True,
|
||||||
max_length=cfg["max_length"]
|
gold_preproc=cfg["gold_preproc"],
|
||||||
))
|
max_length=cfg["max_length"],
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
epoch = 0
|
epoch = 0
|
||||||
while True:
|
while True:
|
||||||
|
@ -440,9 +440,8 @@ def train_while_improving(
|
||||||
|
|
||||||
if raw_text:
|
if raw_text:
|
||||||
random.shuffle(raw_text)
|
random.shuffle(raw_text)
|
||||||
raw_batches = util.minibatch(
|
raw_examples = [Example.from_dict(nlp.make_doc(rt["text"]), {}) for rt in raw_text]
|
||||||
(nlp.make_doc(rt["text"]) for rt in raw_text), size=8
|
raw_batches = util.minibatch(raw_examples, size=8)
|
||||||
)
|
|
||||||
|
|
||||||
for step, (epoch, batch) in enumerate(train_data):
|
for step, (epoch, batch) in enumerate(train_data):
|
||||||
dropout = next(dropouts)
|
dropout = next(dropouts)
|
||||||
|
@ -539,7 +538,10 @@ def setup_printer(training, nlp):
|
||||||
)
|
)
|
||||||
)
|
)
|
||||||
data = (
|
data = (
|
||||||
[info["epoch"], info["step"]] + losses + scores + ["{0:.2f}".format(float(info["score"]))]
|
[info["epoch"], info["step"]]
|
||||||
|
+ losses
|
||||||
|
+ scores
|
||||||
|
+ ["{0:.2f}".format(float(info["score"]))]
|
||||||
)
|
)
|
||||||
msg.row(data, widths=table_widths, aligns=table_aligns)
|
msg.row(data, widths=table_widths, aligns=table_aligns)
|
||||||
|
|
||||||
|
|
|
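A small sketch of the raw-text change in `train_while_improving`, where each unlabelled record becomes an `Example` with an empty reference dict; the texts are invented and the import path assumes this development branch's `spacy.gold` layout:

    import spacy
    from spacy.gold import Example

    nlp = spacy.blank("en")
    raw_text = [{"text": "A raw sentence."}, {"text": "Another one."}]
    raw_examples = [Example.from_dict(nlp.make_doc(rt["text"]), {}) for rt in raw_text]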
@@ -16,16 +16,6 @@ def add_codes(err_cls):

@add_codes
class Warnings(object):
-    W001 = ("As of spaCy v2.0, the keyword argument `path=` is deprecated. "
-            "You can now call spacy.load with the path as its first argument, "
-            "and the model's meta.json will be used to determine the language "
-            "to load. For example:\nnlp = spacy.load('{path}')")
-    W002 = ("Tokenizer.from_list is now deprecated. Create a new Doc object "
-            "instead and pass in the strings as the `words` keyword argument, "
-            "for example:\nfrom spacy.tokens import Doc\n"
-            "doc = Doc(nlp.vocab, words=[...])")
-    W003 = ("Positional arguments to Doc.merge are deprecated. Instead, use "
-            "the keyword arguments, for example tag=, lemma= or ent_type=.")
    W004 = ("No text fixing enabled. Run `pip install ftfy` to enable fixing "
            "using ftfy.fix_text if necessary.")
    W005 = ("Doc object not parsed. This means displaCy won't be able to "
@@ -45,12 +35,6 @@ class Warnings(object):
            "use context-sensitive tensors. You can always add your own word "
            "vectors, or use one of the larger models instead if available.")
    W008 = ("Evaluating {obj}.similarity based on empty vectors.")
-    W009 = ("Custom factory '{name}' provided by entry points of another "
-            "package overwrites built-in factory.")
-    W010 = ("As of v2.1.0, the PhraseMatcher doesn't have a phrase length "
-            "limit anymore, so the max_length argument is now deprecated. "
-            "If you did not specify this parameter, make sure you call the "
-            "constructor with named arguments instead of positional ones.")
    W011 = ("It looks like you're calling displacy.serve from within a "
            "Jupyter notebook or a similar environment. This likely means "
            "you're already running a local web server, so there's no need to "
@@ -64,23 +48,9 @@ class Warnings(object):
            "components are applied. To only create tokenized Doc objects, "
            "try using `nlp.make_doc(text)` or process all texts as a stream "
            "using `list(nlp.tokenizer.pipe(all_texts))`.")
-    W013 = ("As of v2.1.0, {obj}.merge is deprecated. Please use the more "
-            "efficient and less error-prone Doc.retokenize context manager "
-            "instead.")
-    W014 = ("As of v2.1.0, the `disable` keyword argument on the serialization "
-            "methods is and should be replaced with `exclude`. This makes it "
-            "consistent with the other serializable objects.")
-    W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
-            "being serialized or deserialized is deprecated. Please use the "
-            "`exclude` argument instead. For example: exclude=['{arg}'].")
-    W016 = ("The keyword argument `n_threads` is now deprecated. As of v2.2.2, "
-            "the argument `n_process` controls parallel inference via "
-            "multiprocessing.")
    W017 = ("Alias '{alias}' already exists in the Knowledge Base.")
    W018 = ("Entity '{entity}' already exists in the Knowledge Base - "
            "ignoring the duplicate entry.")
-    W019 = ("Changing vectors name from {old} to {new}, to avoid clash with "
-            "previously loaded vectors. See Issue #3853.")
    W020 = ("Unnamed vectors. This won't allow multiple vectors models to be "
            "loaded. (Shape: {shape})")
    W021 = ("Unexpected hash collision in PhraseMatcher. Matches may be "
@@ -91,8 +61,6 @@ class Warnings(object):
            "or the language you're using doesn't have lemmatization data, "
            "you can ignore this warning. If this is surprising, make sure you "
            "have the spacy-lookups-data package installed.")
-    W023 = ("Multiprocessing of Language.pipe is not supported in Python 2. "
-            "'n_process' will be set to 1.")
    W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in "
            "the Knowledge Base.")
    W025 = ("'{name}' requires '{attr}' to be assigned, but none of the "
@@ -101,28 +69,11 @@ class Warnings(object):
    W027 = ("Found a large training file of {size} bytes. Note that it may "
            "be more efficient to split your training data into multiple "
            "smaller JSON files instead.")
-    W028 = ("Doc.from_array was called with a vector of type '{type}', "
-            "but is expecting one of type 'uint64' instead. This may result "
-            "in problems with the vocab further on in the pipeline.")
-    W029 = ("Unable to align tokens with entities from character offsets. "
-            "Discarding entity annotation for the text: {text}.")
    W030 = ("Some entities could not be aligned in the text \"{text}\" with "
            "entities \"{entities}\". Use "
            "`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`"
            " to check the alignment. Misaligned entities ('-') will be "
            "ignored during training.")
-    W031 = ("Model '{model}' ({model_version}) requires spaCy {version} and "
-            "is incompatible with the current spaCy version ({current}). This "
-            "may lead to unexpected results or runtime errors. To resolve "
-            "this, download a newer compatible model or retrain your custom "
-            "model with the current spaCy version. For more details and "
-            "available updates, run: python -m spacy validate")
-    W032 = ("Unable to determine model compatibility for model '{model}' "
-            "({model_version}) with the current spaCy version ({current}). "
-            "This may lead to unexpected results or runtime errors. To resolve "
-            "this, download a newer compatible model or retrain your custom "
-            "model with the current spaCy version. For more details and "
-            "available updates, run: python -m spacy validate")
    W033 = ("Training a new {model} using a model with no lexeme normalization "
            "table. This may degrade the performance of the model to some "
            "degree. If this is intentional or the language you're using "
@@ -159,6 +110,8 @@ class Warnings(object):
    W100 = ("Skipping unsupported morphological feature(s): '{feature}'. "
            "Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or "
            "string \"Field1=Value1,Value2|Field2=Value3\".")
+    W101 = ("Skipping `Doc` custom extension '{name}' while merging docs.")
+    W102 = ("Skipping unsupported user data '{key}: {value}' while merging docs.")


@add_codes
@@ -234,9 +187,6 @@ class Errors(object):
            "the HEAD attribute would potentially override the sentence "
            "boundaries set by SENT_START.")
    E033 = ("Cannot load into non-empty Doc of length {length}.")
-    E034 = ("Doc.merge received {n_args} non-keyword arguments. Expected "
-            "either 3 arguments (deprecated), or 0 (use keyword arguments).\n"
-            "Arguments supplied:\n{args}\nKeyword arguments:{kwargs}")
    E035 = ("Error creating span with start {start} and end {end} for Doc of "
            "length {length}.")
    E036 = ("Error calculating span: Can't find a token starting at character "
@@ -345,14 +295,9 @@ class Errors(object):
    E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
            "token can only be part of one entity, so make sure the entities "
            "you're setting don't overlap.")
-    E105 = ("The Doc.print_tree() method is now deprecated. Please use "
-            "Doc.to_json() instead or write your own function.")
    E106 = ("Can't find doc._.{attr} attribute specified in the underscore "
            "settings: {opts}")
    E107 = ("Value of doc._.{attr} is not JSON-serializable: {value}")
-    E108 = ("As of spaCy v2.1, the pipe name `sbd` has been deprecated "
-            "in favor of the pipe name `sentencizer`, which does the same "
-            "thing. For example, use `nlp.create_pipeline('sentencizer')`")
    E109 = ("Component '{name}' could not be run. Did you forget to "
            "call begin_training()?")
    E110 = ("Invalid displaCy render wrapper. Expected callable, got: {obj}")
@@ -392,10 +337,6 @@ class Errors(object):
    E125 = ("Unexpected value: {value}")
    E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. "
            "This is likely a bug in spaCy, so feel free to open an issue.")
-    E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword "
-            "arguments to exclude fields from being serialized or deserialized "
-            "is now deprecated. Please use the `exclude` argument instead. "
-            "For example: exclude=['{arg}'].")
    E129 = ("Cannot write the label of an existing Span object because a Span "
            "is a read-only view of the underlying Token objects stored in the "
            "Doc. Instead, create a new Span object and specify the `label` "
@@ -487,9 +428,6 @@ class Errors(object):
    E172 = ("The Lemmatizer.load classmethod is deprecated. To create a "
            "Lemmatizer, initialize the class directly. See the docs for "
            "details: https://spacy.io/api/lemmatizer")
-    E173 = ("As of v2.2, the Lemmatizer is initialized with an instance of "
-            "Lookups containing the lemmatization tables. See the docs for "
-            "details: https://spacy.io/api/lemmatizer#init")
    E175 = ("Can't remove rule for unknown match pattern ID: {key}")
    E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
    E177 = ("Ill-formed IOB input detected: {tag}")
@@ -545,19 +483,19 @@ class Errors(object):
    E972 = ("Example.__init__ got None for '{arg}'. Requires Doc.")
    E973 = ("Unexpected type for NER data")
    E974 = ("Unknown {obj} attribute: {key}")
-    E975 = ("The method Example.from_dict expects a Doc as first argument, "
+    E975 = ("The method 'Example.from_dict' expects a Doc as first argument, "
            "but got {type}")
-    E976 = ("The method Example.from_dict expects a dict as second argument, "
+    E976 = ("The method 'Example.from_dict' expects a dict as second argument, "
            "but received None.")
    E977 = ("Can not compare a MorphAnalysis with a string object. "
            "This is likely a bug in spaCy, so feel free to open an issue.")
-    E978 = ("The {method} method of component {name} takes a list of Example objects, "
+    E978 = ("The '{method}' method of {name} takes a list of Example objects, "
            "but found {types} instead.")
    E979 = ("Cannot convert {type} to an Example object.")
    E980 = ("Each link annotation should refer to a dictionary with at most one "
            "identifier mapping to 1.0, and all others to 0.0.")
-    E981 = ("The offsets of the annotations for 'links' need to refer exactly "
-            "to the offsets of the 'entities' annotations.")
+    E981 = ("The offsets of the annotations for 'links' could not be aligned "
+            "to token boundaries.")
    E982 = ("The 'ent_iob' attribute of a Token should be an integer indexing "
            "into {values}, but found {value}.")
    E983 = ("Invalid key for '{dict}': {key}. Available keys: "
@@ -593,7 +531,9 @@ class Errors(object):
    E997 = ("Tokenizer special cases are not allowed to modify the text. "
            "This would map '{chunk}' to '{orth}' given token attributes "
            "'{token_attrs}'.")
+    E999 = ("Unable to merge the `Doc` objects because they do not all share "
+            "the same `Vocab`.")


@add_codes
class TempErrors(object):
@@ -1,6 +1,6 @@
from .corpus import Corpus
from .example import Example
-from .align import align
+from .align import Alignment

from .iob_utils import iob_to_biluo, biluo_to_iob
from .iob_utils import biluo_tags_from_offsets, offsets_from_biluo_tags
@@ -1,8 +0,0 @@
-cdef class Alignment:
-    cdef public object cost
-    cdef public object i2j
-    cdef public object j2i
-    cdef public object i2j_multi
-    cdef public object j2i_multi
-    cdef public object cand_to_gold
-    cdef public object gold_to_cand
30
spacy/gold/align.py
Normal file
@@ -0,0 +1,30 @@
+from typing import List
+import numpy
+from thinc.types import Ragged
+from dataclasses import dataclass
+import tokenizations
+
+
+@dataclass
+class Alignment:
+    x2y: Ragged
+    y2x: Ragged
+
+    @classmethod
+    def from_indices(cls, x2y: List[List[int]], y2x: List[List[int]]) -> "Alignment":
+        x2y = _make_ragged(x2y)
+        y2x = _make_ragged(y2x)
+        return Alignment(x2y=x2y, y2x=y2x)
+
+    @classmethod
+    def from_strings(cls, A: List[str], B: List[str]) -> "Alignment":
+        x2y, y2x = tokenizations.get_alignments(A, B)
+        return Alignment.from_indices(x2y=x2y, y2x=y2x)
+
+
+def _make_ragged(indices):
+    lengths = numpy.array([len(x) for x in indices], dtype="i")
+    flat = []
+    for x in indices:
+        flat.extend(x)
+    return Ragged(numpy.array(flat, dtype="i"), lengths)
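A usage sketch for the new `Alignment` class; the token lists are invented and it assumes the `tokenizations` package the module imports is installed:

    from spacy.gold import Alignment

    spacy_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
    gold_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
    align = Alignment.from_strings(spacy_tokens, gold_tokens)
    print(align.x2y.lengths)  # number of gold tokens each predicted token maps to
    print(align.x2y.dataXd)   # the flat aligned indices, as used in Example.get_aligned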
@@ -1,101 +0,0 @@
-import numpy
-
-from ..errors import Errors, AlignmentError
-
-
-cdef class Alignment:
-    def __init__(self, spacy_words, gold_words):
-        # Do many-to-one alignment for misaligned tokens.
-        # If we over-segment, we'll have one gold word that covers a sequence
-        # of predicted words
-        # If we under-segment, we'll have one predicted word that covers a
-        # sequence of gold words.
-        # If we "mis-segment", we'll have a sequence of predicted words covering
-        # a sequence of gold words. That's many-to-many -- we don't do that
-        # except for NER spans where the start and end can be aligned.
-        cost, i2j, j2i, i2j_multi, j2i_multi = align(spacy_words, gold_words)
-        self.cost = cost
-        self.i2j = i2j
-        self.j2i = j2i
-        self.i2j_multi = i2j_multi
-        self.j2i_multi = j2i_multi
-        self.cand_to_gold = [(j if j >= 0 else None) for j in i2j]
-        self.gold_to_cand = [(i if i >= 0 else None) for i in j2i]
-
-
-def align(tokens_a, tokens_b):
-    """Calculate alignment tables between two tokenizations.
-
-    tokens_a (List[str]): The candidate tokenization.
-    tokens_b (List[str]): The reference tokenization.
-    RETURNS: (tuple): A 5-tuple consisting of the following information:
-      * cost (int): The number of misaligned tokens.
-      * a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`.
-        For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns
-        to `tokens_b[6]`. If there's no one-to-one alignment for a token,
-        it has the value -1.
-      * b2a (List[int]): The same as `a2b`, but mapping the other direction.
-      * a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a`
-        to indices in `tokens_b`, where multiple tokens of `tokens_a` align to
-        the same token of `tokens_b`.
-      * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
-        direction.
-    """
-    tokens_a = _normalize_for_alignment(tokens_a)
-    tokens_b = _normalize_for_alignment(tokens_b)
-    cost = 0
-    a2b = numpy.empty(len(tokens_a), dtype="i")
-    b2a = numpy.empty(len(tokens_b), dtype="i")
-    a2b.fill(-1)
-    b2a.fill(-1)
-    a2b_multi = {}
-    b2a_multi = {}
-    i = 0
-    j = 0
-    offset_a = 0
-    offset_b = 0
-    while i < len(tokens_a) and j < len(tokens_b):
-        a = tokens_a[i][offset_a:]
-        b = tokens_b[j][offset_b:]
-        if a == b:
-            if offset_a == offset_b == 0:
-                a2b[i] = j
-                b2a[j] = i
-            elif offset_a == 0:
-                cost += 2
-                a2b_multi[i] = j
-            elif offset_b == 0:
-                cost += 2
-                b2a_multi[j] = i
-            offset_a = offset_b = 0
-            i += 1
-            j += 1
-        elif a == "":
-            assert offset_a == 0
-            cost += 1
-            i += 1
-        elif b == "":
-            assert offset_b == 0
-            cost += 1
-            j += 1
-        elif b.startswith(a):
-            cost += 1
-            if offset_a == 0:
-                a2b_multi[i] = j
-            i += 1
-            offset_a = 0
-            offset_b += len(a)
-        elif a.startswith(b):
-            cost += 1
-            if offset_b == 0:
-                b2a_multi[j] = i
-            j += 1
-            offset_b = 0
-            offset_a += len(b)
-        else:
-            assert "".join(tokens_a) != "".join(tokens_b)
-            raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b))
-    return cost, a2b, b2a, a2b_multi, b2a_multi
-
-
-def _normalize_for_alignment(tokens):
-    return [w.replace(" ", "").lower() for w in tokens]
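For context on the removed `align()` (which the `Alignment` dataclass above replaces), a worked example of the tables it used to return, with invented tokenizations; the exact values are only an illustration:

    tokens_a = ["ab", "c"]
    tokens_b = ["a", "b", "c"]
    # align(tokens_a, tokens_b) would have returned roughly:
    #   a2b = [-1, 2]             # "ab" has no one-to-one match; "c" aligns to tokens_b[2]
    #   b2a = [-1, -1, 1]
    #   b2a_multi = {0: 0, 1: 0}  # "a" and "b" both fall inside tokens_a[0]
    #   cost counts the misaligned tokens involved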
@@ -1,6 +1,4 @@
from .iob2docs import iob2docs  # noqa: F401
from .conll_ner2docs import conll_ner2docs  # noqa: F401
from .json2docs import json2docs
+from .conllu2docs import conllu2docs  # noqa: F401
-# TODO: Update this one
-# from .conllu2docs import conllu2docs  # noqa: F401
@@ -4,11 +4,11 @@ from .conll_ner2docs import n_sents_info
from ...gold import Example
from ...gold import iob_to_biluo, spans_from_biluo_tags
from ...language import Language
-from ...tokens import Doc, Token
+from ...tokens import Doc, Token, Span
from wasabi import Printer


-def conllu2json(
+def conllu2docs(
    input_data,
    n_sents=10,
    append_morphology=False,
@@ -28,34 +28,22 @@ def conllu2json(
    MISC_NER_PATTERN = "^((?:name|NE)=)?([BILU])-([A-Z_]+)|O$"
    msg = Printer(no_print=no_print)
    n_sents_info(msg, n_sents)
-    docs = []
-    raw = ""
-    sentences = []
-    conll_data = read_conllx(
+    sent_docs = read_conllx(
        input_data,
        append_morphology=append_morphology,
        ner_tag_pattern=MISC_NER_PATTERN,
        ner_map=ner_map,
        merge_subtokens=merge_subtokens,
    )
-    has_ner_tags = has_ner(input_data, MISC_NER_PATTERN)
-    for i, example in enumerate(conll_data):
-        raw += example.text
-        sentences.append(
-            generate_sentence(
-                example.to_dict(), has_ner_tags, MISC_NER_PATTERN, ner_map=ner_map,
-            )
-        )
-        # Real-sized documents could be extracted using the comments on the
-        # conllu document
-        if len(sentences) % n_sents == 0:
-            doc = create_json_doc(raw, sentences, i)
-            docs.append(doc)
-            raw = ""
-            sentences = []
-    if sentences:
-        doc = create_json_doc(raw, sentences, i)
-        docs.append(doc)
+    docs = []
+    sent_docs_to_merge = []
+    for sent_doc in sent_docs:
+        sent_docs_to_merge.append(sent_doc)
+        if len(sent_docs_to_merge) % n_sents == 0:
+            docs.append(Doc.from_docs(sent_docs_to_merge))
+            sent_docs_to_merge = []
+    if sent_docs_to_merge:
+        docs.append(Doc.from_docs(sent_docs_to_merge))
    return docs


@@ -84,14 +72,14 @@ def read_conllx(
    ner_tag_pattern="",
    ner_map=None,
):
-    """ Yield examples, one for each sentence """
+    """ Yield docs, one for each sentence """
    vocab = Language.Defaults.create_vocab()  # need vocab to make a minimal Doc
    for sent in input_data.strip().split("\n\n"):
        lines = sent.strip().split("\n")
        if lines:
            while lines[0].startswith("#"):
                lines.pop(0)
-            example = example_from_conllu_sentence(
+            doc = doc_from_conllu_sentence(
                vocab,
                lines,
                ner_tag_pattern,
@@ -99,7 +87,7 @@ def read_conllx(
                append_morphology=append_morphology,
                ner_map=ner_map,
            )
-            yield example
+            yield doc


def get_entities(lines, tag_pattern, ner_map=None):
@@ -141,39 +129,7 @@ def get_entities(lines, tag_pattern, ner_map=None):
    return iob_to_biluo(iob)


-def generate_sentence(example_dict, has_ner_tags, tag_pattern, ner_map=None):
-    sentence = {}
-    tokens = []
-    token_annotation = example_dict["token_annotation"]
-    for i, id_ in enumerate(token_annotation["ids"]):
-        token = {}
-        token["id"] = id_
-        token["orth"] = token_annotation["words"][i]
-        token["tag"] = token_annotation["tags"][i]
-        token["pos"] = token_annotation["pos"][i]
-        token["lemma"] = token_annotation["lemmas"][i]
-        token["morph"] = token_annotation["morphs"][i]
-        token["head"] = token_annotation["heads"][i] - i
-        token["dep"] = token_annotation["deps"][i]
-        if has_ner_tags:
-            token["ner"] = example_dict["doc_annotation"]["entities"][i]
-        tokens.append(token)
-    sentence["tokens"] = tokens
-    return sentence
-
-
-def create_json_doc(raw, sentences, id_):
-    doc = {}
-    paragraph = {}
-    doc["id"] = id_
-    doc["paragraphs"] = []
-    paragraph["raw"] = raw.strip()
-    paragraph["sentences"] = sentences
-    doc["paragraphs"].append(paragraph)
-    return doc
-
-
-def example_from_conllu_sentence(
+def doc_from_conllu_sentence(
    vocab,
    lines,
    ner_tag_pattern,
@@ -263,8 +219,9 @@ def example_from_conllu_sentence(
    if merge_subtokens:
        doc = merge_conllu_subtokens(lines, doc)

-    # create Example from custom Doc annotation
-    words, spaces, tags, morphs, lemmas = [], [], [], [], []
+    # create final Doc from custom Doc annotation
+    words, spaces, tags, morphs, lemmas, poses = [], [], [], [], [], []
+    heads, deps = [], []
    for i, t in enumerate(doc):
        words.append(t._.merged_orth)
        lemmas.append(t._.merged_lemma)
@@ -274,16 +231,23 @@ def example_from_conllu_sentence(
            tags.append(t.tag_ + "__" + t._.merged_morph)
        else:
            tags.append(t.tag_)
+        poses.append(t.pos_)
+        heads.append(t.head.i)
+        deps.append(t.dep_)

    doc_x = Doc(vocab, words=words, spaces=spaces)
-    ref_dict = Example(doc_x, reference=doc).to_dict()
-    ref_dict["words"] = words
-    ref_dict["lemmas"] = lemmas
-    ref_dict["spaces"] = spaces
-    ref_dict["tags"] = tags
-    ref_dict["morphs"] = morphs
-    example = Example.from_dict(doc_x, ref_dict)
-    return example
+    for i in range(len(doc)):
+        doc_x[i].tag_ = tags[i]
+        doc_x[i].morph_ = morphs[i]
+        doc_x[i].lemma_ = lemmas[i]
+        doc_x[i].pos_ = poses[i]
+        doc_x[i].dep_ = deps[i]
+        doc_x[i].head = doc_x[heads[i]]
+    doc_x.ents = [Span(doc_x, ent.start, ent.end, label=ent.label) for ent in doc.ents]
+    doc_x.is_parsed = True
+    doc_x.is_tagged = True
+
+    return doc_x


def merge_conllu_subtokens(lines, doc):
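A sketch of the grouping logic in `conllu2docs`, merging sentence-sized `Doc`s into documents of `n_sents` sentences via `Doc.from_docs` as used above; the texts are invented:

    import spacy
    from spacy.tokens import Doc

    nlp = spacy.blank("en")
    sent_docs = [nlp("One sentence here."), nlp("Another sentence."), nlp("A third.")]
    n_sents = 2

    docs, to_merge = [], []
    for sent_doc in sent_docs:
        to_merge.append(sent_doc)
        if len(to_merge) % n_sents == 0:
            docs.append(Doc.from_docs(to_merge))
            to_merge = []
    if to_merge:
        docs.append(Doc.from_docs(to_merge))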
@@ -59,6 +59,6 @@ def read_iob(raw_sents, vocab, n_sents):
        doc[i].is_sent_start = sent_start
    biluo = iob_to_biluo(iob)
    entities = tags_to_entities(biluo)
-    doc.ents = [Span(doc, start=s, end=e+1, label=L) for (L, s, e) in entities]
+    doc.ents = [Span(doc, start=s, end=e + 1, label=L) for (L, s, e) in entities]
    docs.append(doc)
    return docs
@@ -17,8 +17,6 @@ def json2docs(input_data, model=None, **kwargs):
        for json_para in json_to_annotations(json_doc):
            example_dict = _fix_legacy_dict_data(json_para)
            tok_dict, doc_dict = _parse_example_dict_data(example_dict)
-            if json_para.get("raw"):
-                assert tok_dict.get("SPACY")
            doc = annotations2doc(nlp.vocab, tok_dict, doc_dict)
            docs.append(doc)
    return docs
@@ -8,7 +8,7 @@ class Corpus:
    """An annotated corpus, reading train and dev datasets from
    the DocBin (.spacy) format.

-    DOCS: https://spacy.io/api/goldcorpus
+    DOCS: https://spacy.io/api/corpus
    """

    def __init__(self, train_loc, dev_loc, limit=0):
@@ -43,25 +43,32 @@ class Corpus:
            locs.append(path)
        return locs

+    def _make_example(self, nlp, reference, gold_preproc):
+        if gold_preproc or reference.has_unknown_spaces:
+            return Example(
+                Doc(
+                    nlp.vocab,
+                    words=[word.text for word in reference],
+                    spaces=[bool(word.whitespace_) for word in reference],
+                ),
+                reference,
+            )
+        else:
+            return Example(nlp.make_doc(reference.text), reference)
+
    def make_examples(self, nlp, reference_docs, max_length=0):
        for reference in reference_docs:
            if len(reference) == 0:
                continue
            elif max_length == 0 or len(reference) < max_length:
-                yield Example(
-                    nlp.make_doc(reference.text),
-                    reference
-                )
+                yield self._make_example(nlp, reference, False)
            elif reference.is_sentenced:
                for ref_sent in reference.sents:
                    if len(ref_sent) == 0:
                        continue
                    elif max_length == 0 or len(ref_sent) < max_length:
-                        yield Example(
-                            nlp.make_doc(ref_sent.text),
-                            ref_sent.as_doc()
-                        )
+                        yield self._make_example(nlp, ref_sent.as_doc(), False)

    def make_examples_gold_preproc(self, nlp, reference_docs):
        for reference in reference_docs:
            if reference.is_sentenced:
@@ -69,14 +76,7 @@ class Corpus:
            else:
                ref_sents = [reference]
            for ref_sent in ref_sents:
-                eg = Example(
-                    Doc(
-                        nlp.vocab,
-                        words=[w.text for w in ref_sent],
-                        spaces=[bool(w.whitespace_) for w in ref_sent]
-                    ),
-                    ref_sent
-                )
+                eg = self._make_example(nlp, ref_sent, True)
                if len(eg.x):
                    yield eg

@@ -107,8 +107,9 @@ class Corpus:
            i += 1
        return n

-    def train_dataset(self, nlp, *, shuffle=True, gold_preproc=False,
-                      max_length=0, **kwargs):
+    def train_dataset(
+        self, nlp, *, shuffle=True, gold_preproc=False, max_length=0, **kwargs
+    ):
        ref_docs = self.read_docbin(nlp.vocab, self.walk_corpus(self.train_loc))
        if gold_preproc:
            examples = self.make_examples_gold_preproc(nlp, ref_docs)
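A usage sketch for the `Corpus` reader above; the `.spacy` paths are placeholders and are assumed to contain serialized `DocBin` data:

    import spacy
    from spacy.gold import Corpus

    nlp = spacy.blank("en")
    corpus = Corpus("corpus/train.spacy", "corpus/dev.spacy")
    train_examples = list(
        corpus.train_dataset(nlp, shuffle=True, gold_preproc=False, max_length=0)
    )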
@@ -1,8 +1,7 @@
from ..tokens.doc cimport Doc
-from .align cimport Alignment


cdef class Example:
    cdef readonly Doc x
    cdef readonly Doc y
-    cdef readonly Alignment _alignment
+    cdef readonly object _alignment
@@ -6,16 +6,15 @@ from ..tokens.doc cimport Doc
from ..tokens.span cimport Span
from ..tokens.span import Span
from ..attrs import IDS
-from .align cimport Alignment
+from .align import Alignment
from .iob_utils import biluo_to_iob, biluo_tags_from_offsets, biluo_tags_from_doc
from .iob_utils import spans_from_biluo_tags
-from .align import Alignment
from ..errors import Errors, Warnings
from ..syntax import nonproj


cpdef Doc annotations2doc(vocab, tok_annot, doc_annot):
-    """ Create a Doc from dictionaries with token and doc annotations. Assumes ORTH & SPACY are set. """
+    """ Create a Doc from dictionaries with token and doc annotations. """
    attrs, array = _annot2array(vocab, tok_annot, doc_annot)
    output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"])
    if "entities" in doc_annot:
@@ -28,7 +27,7 @@ cpdef Doc annotations2doc(vocab, tok_annot, doc_annot):


cdef class Example:
-    def __init__(self, Doc predicted, Doc reference, *, Alignment alignment=None):
+    def __init__(self, Doc predicted, Doc reference, *, alignment=None):
        """ Doc can either be text, or an actual Doc """
        if predicted is None:
            raise TypeError(Errors.E972.format(arg="predicted"))
@@ -83,34 +82,38 @@ cdef class Example:
            gold_words = [token.orth_ for token in self.reference]
            if gold_words == []:
                gold_words = spacy_words
-            self._alignment = Alignment(spacy_words, gold_words)
+            self._alignment = Alignment.from_strings(spacy_words, gold_words)
        return self._alignment

    def get_aligned(self, field, as_string=False):
        """Return an aligned array for a token attribute."""
-        i2j_multi = self.alignment.i2j_multi
-        cand_to_gold = self.alignment.cand_to_gold
+        align = self.alignment.x2y

        vocab = self.reference.vocab
        gold_values = self.reference.to_array([field])
        output = [None] * len(self.predicted)
-        for i, gold_i in enumerate(cand_to_gold):
-            if self.predicted[i].text.isspace():
-                output[i] = None
-            if gold_i is None:
-                if i in i2j_multi:
-                    output[i] = gold_values[i2j_multi[i]]
-                else:
-                    output[i] = None
+        for token in self.predicted:
+            if token.is_space:
+                output[token.i] = None
            else:
-                output[i] = gold_values[gold_i]
+                values = gold_values[align[token.i].dataXd]
+                values = values.ravel()
+                if len(values) == 0:
+                    output[token.i] = None
+                elif len(values) == 1:
+                    output[token.i] = values[0]
+                elif len(set(list(values))) == 1:
+                    # If all aligned tokens have the same value, use it.
+                    output[token.i] = values[0]
+                else:
+                    output[token.i] = None
        if as_string and field not in ["ENT_IOB", "SENT_START"]:
            output = [vocab.strings[o] if o is not None else o for o in output]
        return output

    def get_aligned_parse(self, projectivize=True):
-        cand_to_gold = self.alignment.cand_to_gold
-        gold_to_cand = self.alignment.gold_to_cand
+        cand_to_gold = self.alignment.x2y
+        gold_to_cand = self.alignment.y2x
        aligned_heads = [None] * self.x.length
        aligned_deps = [None] * self.x.length
        heads = [token.head.i for token in self.y]
@@ -118,52 +121,51 @@ cdef class Example:
        if projectivize:
            heads, deps = nonproj.projectivize(heads, deps)
        for cand_i in range(self.x.length):
-            gold_i = cand_to_gold[cand_i]
-            if gold_i is not None:  # Alignment found
-                gold_head = gold_to_cand[heads[gold_i]]
-                if gold_head is not None:
-                    aligned_heads[cand_i] = gold_head
+            if cand_to_gold.lengths[cand_i] == 1:
+                gold_i = cand_to_gold[cand_i].dataXd[0, 0]
+                if gold_to_cand.lengths[heads[gold_i]] == 1:
+                    aligned_heads[cand_i] = int(gold_to_cand[heads[gold_i]].dataXd[0, 0])
                    aligned_deps[cand_i] = deps[gold_i]
        return aligned_heads, aligned_deps

+    def get_aligned_spans_x2y(self, x_spans):
+        return self._get_aligned_spans(self.y, x_spans, self.alignment.x2y)
+
+    def get_aligned_spans_y2x(self, y_spans):
+        return self._get_aligned_spans(self.x, y_spans, self.alignment.y2x)
+
+    def _get_aligned_spans(self, doc, spans, align):
+        seen = set()
+        output = []
+        for span in spans:
+            indices = align[span.start : span.end].data.ravel()
+            indices = [idx for idx in indices if idx not in seen]
+            if len(indices) >= 1:
+                aligned_span = Span(doc, indices[0], indices[-1] + 1, label=span.label)
+                target_text = span.text.lower().strip().replace(" ", "")
+                our_text = aligned_span.text.lower().strip().replace(" ", "")
+                if our_text == target_text:
+                    output.append(aligned_span)
+                    seen.update(indices)
+        return output
+
    def get_aligned_ner(self):
        if not self.y.is_nered:
            return [None] * len(self.x)  # should this be 'missing' instead of 'None' ?
-        x_text = self.x.text
-        # Get a list of entities, and make spans for non-entity tokens.
-        # We then work through the spans in order, trying to find them in
-        # the text and using that to get the offset. Any token that doesn't
-        # get a tag set this way is tagged None.
-        # This could maybe be improved? It at least feels easy to reason about.
-        y_spans = list(self.y.ents)
-        y_spans.sort()
-        x_text_offset = 0
-        x_spans = []
-        for y_span in y_spans:
-            if x_text.count(y_span.text) >= 1:
-                start_char = x_text.index(y_span.text) + x_text_offset
-                end_char = start_char + len(y_span.text)
-                x_span = self.x.char_span(start_char, end_char, label=y_span.label)
+        x_ents = self.get_aligned_spans_y2x(self.y.ents)
+        # Default to 'None' for missing values
|
|
||||||
if x_span is not None:
|
|
||||||
x_spans.append(x_span)
|
|
||||||
x_text = self.x.text[end_char:]
|
|
||||||
x_text_offset = end_char
|
|
||||||
x_tags = biluo_tags_from_offsets(
|
x_tags = biluo_tags_from_offsets(
|
||||||
self.x,
|
self.x,
|
||||||
[(e.start_char, e.end_char, e.label_) for e in x_spans],
|
[(e.start_char, e.end_char, e.label_) for e in x_ents],
|
||||||
missing=None
|
missing=None
|
||||||
)
|
)
|
||||||
gold_to_cand = self.alignment.gold_to_cand
|
# Now fill the tokens we can align to O.
|
||||||
for token in self.y:
|
O = 2 # I=1, O=2, B=3
|
||||||
if token.ent_iob_ == "O":
|
for i, ent_iob in enumerate(self.get_aligned("ENT_IOB")):
|
||||||
cand_i = gold_to_cand[token.i]
|
if x_tags[i] is None:
|
||||||
if cand_i is not None and x_tags[cand_i] is None:
|
if ent_iob == O:
|
||||||
x_tags[cand_i] = "O"
|
x_tags[i] = "O"
|
||||||
i2j_multi = self.alignment.i2j_multi
|
elif self.x[i].is_space:
|
||||||
for i, tag in enumerate(x_tags):
|
|
||||||
if tag is None and i in i2j_multi:
|
|
||||||
gold_i = i2j_multi[i]
|
|
||||||
if gold_i is not None and self.y[gold_i].ent_iob_ == "O":
|
|
||||||
x_tags[i] = "O"
|
x_tags[i] = "O"
|
||||||
return x_tags
|
return x_tags
|
||||||
|
|
||||||
|
@ -194,25 +196,22 @@ cdef class Example:
|
||||||
links[(ent.start_char, ent.end_char)] = {ent.kb_id_: 1.0}
|
links[(ent.start_char, ent.end_char)] = {ent.kb_id_: 1.0}
|
||||||
return links
|
return links
|
||||||
|
|
||||||
|
|
||||||
def split_sents(self):
|
def split_sents(self):
|
||||||
""" Split the token annotations into multiple Examples based on
|
""" Split the token annotations into multiple Examples based on
|
||||||
sent_starts and return a list of the new Examples"""
|
sent_starts and return a list of the new Examples"""
|
||||||
if not self.reference.is_sentenced:
|
if not self.reference.is_sentenced:
|
||||||
return [self]
|
return [self]
|
||||||
|
|
||||||
sent_starts = self.get_aligned("SENT_START")
|
align = self.alignment.y2x
|
||||||
sent_starts.append(1) # appending virtual start of a next sentence to facilitate search
|
seen_indices = set()
|
||||||
|
|
||||||
output = []
|
output = []
|
||||||
pred_start = 0
|
for y_sent in self.reference.sents:
|
||||||
for sent in self.reference.sents:
|
indices = align[y_sent.start : y_sent.end].data.ravel()
|
||||||
new_ref = sent.as_doc()
|
indices = [idx for idx in indices if idx not in seen_indices]
|
||||||
pred_end = sent_starts.index(1, pred_start+1) # find where the next sentence starts
|
if indices:
|
||||||
new_pred = self.predicted[pred_start : pred_end].as_doc()
|
x_sent = self.predicted[indices[0] : indices[-1] + 1]
|
||||||
output.append(Example(new_pred, new_ref))
|
output.append(Example(x_sent.as_doc(), y_sent.as_doc()))
|
||||||
pred_start = pred_end
|
seen_indices.update(indices)
|
||||||
|
|
||||||
return output
|
return output
|
||||||
|
|
||||||
property text:
|
property text:
|
||||||
|
@ -235,10 +234,7 @@ def _annot2array(vocab, tok_annot, doc_annot):
|
||||||
if key == "entities":
|
if key == "entities":
|
||||||
pass
|
pass
|
||||||
elif key == "links":
|
elif key == "links":
|
||||||
entities = doc_annot.get("entities", {})
|
ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], tok_annot["SPACY"], value)
|
||||||
if not entities:
|
|
||||||
raise ValueError(Errors.E981)
|
|
||||||
ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], value, entities)
|
|
||||||
tok_annot["ENT_KB_ID"] = ent_kb_ids
|
tok_annot["ENT_KB_ID"] = ent_kb_ids
|
||||||
elif key == "cats":
|
elif key == "cats":
|
||||||
pass
|
pass
|
||||||
|
@ -381,18 +377,11 @@ def _parse_ner_tags(biluo_or_offsets, vocab, words, spaces):
|
||||||
ent_types.append("")
|
ent_types.append("")
|
||||||
return ent_iobs, ent_types
|
return ent_iobs, ent_types
|
||||||
|
|
||||||
def _parse_links(vocab, words, links, entities):
|
def _parse_links(vocab, words, spaces, links):
|
||||||
reference = Doc(vocab, words=words)
|
reference = Doc(vocab, words=words, spaces=spaces)
|
||||||
starts = {token.idx: token.i for token in reference}
|
starts = {token.idx: token.i for token in reference}
|
||||||
ends = {token.idx + len(token): token.i for token in reference}
|
ends = {token.idx + len(token): token.i for token in reference}
|
||||||
ent_kb_ids = ["" for _ in reference]
|
ent_kb_ids = ["" for _ in reference]
|
||||||
entity_map = [(ent[0], ent[1]) for ent in entities]
|
|
||||||
|
|
||||||
# links annotations need to refer 1-1 to entity annotations - throw error otherwise
|
|
||||||
for index, annot_dict in links.items():
|
|
||||||
start_char, end_char = index
|
|
||||||
if (start_char, end_char) not in entity_map:
|
|
||||||
raise ValueError(Errors.E981)
|
|
||||||
|
|
||||||
for index, annot_dict in links.items():
|
for index, annot_dict in links.items():
|
||||||
true_kb_ids = []
|
true_kb_ids = []
|
||||||
|
@ -406,6 +395,8 @@ def _parse_links(vocab, words, links, entities):
|
||||||
start_char, end_char = index
|
start_char, end_char = index
|
||||||
start_token = starts.get(start_char)
|
start_token = starts.get(start_char)
|
||||||
end_token = ends.get(end_char)
|
end_token = ends.get(end_char)
|
||||||
|
if start_token is None or end_token is None:
|
||||||
|
raise ValueError(Errors.E981)
|
||||||
for i in range(start_token, end_token+1):
|
for i in range(start_token, end_token+1):
|
||||||
ent_kb_ids[i] = true_kb_ids[0]
|
ent_kb_ids[i] = true_kb_ids[0]
|
||||||
|
|
||||||
|
@ -414,7 +405,7 @@ def _parse_links(vocab, words, links, entities):
|
||||||
|
|
||||||
def _guess_spaces(text, words):
|
def _guess_spaces(text, words):
|
||||||
if text is None:
|
if text is None:
|
||||||
return [True] * len(words)
|
return None
|
||||||
spaces = []
|
spaces = []
|
||||||
text_pos = 0
|
text_pos = 0
|
||||||
# align words with text
|
# align words with text
|
||||||
|
|
|
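As a rough illustration of the new alignment API used above (a sketch only: the `spacy.gold.align` import path and the exact ragged indexing behaviour are assumptions inferred from this diff, not verified against a release):

    # Hypothetical sketch of the Alignment.from_strings / x2y API shown in the hunks above.
    from spacy.gold.align import Alignment  # assumed module path at this commit

    spacy_words = ["I", "flew", "to", "San", "Francisco", "Valley"]
    gold_words = ["I", "flew", "to", "San Francisco Valley"]
    align = Alignment.from_strings(spacy_words, gold_words)
    # x2y is a ragged mapping: predicted token i maps onto the gold indices in
    # align.x2y[i].dataXd, and align.x2y.lengths[i] says how many there are.
    for i, word in enumerate(spacy_words):
        print(word, align.x2y.lengths[i], align.x2y[i].dataXd.ravel())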
@@ -24,14 +24,15 @@ def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
for cat, val in doc.cats.items():
json_cat = {"label": cat, "value": val}
json_para["cats"].append(json_cat)
+# warning: entities information is currently duplicated as
+# doc-level "entities" and token-level "ner"
for ent in doc.ents:
ent_tuple = (ent.start_char, ent.end_char, ent.label_)
json_para["entities"].append(ent_tuple)
if ent.kb_id_:
link_dict = {(ent.start_char, ent.end_char): {ent.kb_id_: 1.0}}
json_para["links"].append(link_dict)
-ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
+biluo_tags = biluo_tags_from_offsets(doc, json_para["entities"], missing=ner_missing_tag)
-biluo_tags = biluo_tags_from_offsets(doc, ent_offsets, missing=ner_missing_tag)
for j, sent in enumerate(doc.sents):
json_sent = {"tokens": [], "brackets": []}
for token in sent:
@@ -44,6 +45,7 @@ def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
if doc.is_parsed:
json_token["head"] = token.head.i-token.i
json_token["dep"] = token.dep_
+json_token["ner"] = biluo_tags[token.i]
json_sent["tokens"].append(json_token)
json_para["sentences"].append(json_sent)
json_doc["paragraphs"].append(json_para)

@@ -92,7 +92,7 @@ def biluo_tags_from_offsets(doc, entities, missing="O"):
# Handle entity cases
for start_char, end_char, label in entities:
if not label:
for s in starts: # account for many-to-one
if s >= start_char and s < end_char:
biluo[starts[s]] = "O"
else:
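The `docs_to_json` hunk above now derives the token-level "ner" field from the doc-level entity offsets via `biluo_tags_from_offsets`. A minimal usage sketch (the `spacy.gold` import location is assumed for this development version):

    import spacy
    from spacy.gold import biluo_tags_from_offsets  # assumed import path

    nlp = spacy.blank("en")
    doc = nlp("I like London.")
    # Offsets are (start_char, end_char, label); unannotated tokens get `missing`.
    tags = biluo_tags_from_offsets(doc, [(7, 13, "LOC")], missing="O")
    # Expected: ["O", "O", "U-LOC", "O"]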
@@ -2,12 +2,13 @@ import random
import itertools
import weakref
import functools
+from collections import Iterable
from contextlib import contextmanager
from copy import copy, deepcopy
from pathlib import Path
import warnings

-from thinc.api import get_current_ops, Config
+from thinc.api import get_current_ops, Config, require_gpu
import srsly
import multiprocessing as mp
from itertools import chain, cycle
@@ -232,32 +233,6 @@ class Language(object):
def config(self):
return self._config

-# Conveniences to access pipeline components
-# Shouldn't be used anymore!
-@property
-def tagger(self):
-return self.get_pipe("tagger")

-@property
-def parser(self):
-return self.get_pipe("parser")

-@property
-def entity(self):
-return self.get_pipe("ner")

-@property
-def linker(self):
-return self.get_pipe("entity_linker")

-@property
-def senter(self):
-return self.get_pipe("senter")

-@property
-def matcher(self):
-return self.get_pipe("matcher")

@property
def pipe_names(self):
"""Get names of available pipeline components.
@@ -313,10 +288,7 @@ class Language(object):
DOCS: https://spacy.io/api/language#create_pipe
"""
if name not in self.factories:
-if name == "sbd":
+raise KeyError(Errors.E002.format(name=name))
-raise KeyError(Errors.E108.format(name=name))
-else:
-raise KeyError(Errors.E002.format(name=name))
factory = self.factories[name]

# transform the model's config to an actual Model
@@ -529,22 +501,6 @@ class Language(object):
def make_doc(self, text):
return self.tokenizer(text)

-def _convert_examples(self, examples):
-converted_examples = []
-if isinstance(examples, tuple):
-examples = [examples]
-for eg in examples:
-if isinstance(eg, Example):
-converted_examples.append(eg.copy())
-elif isinstance(eg, tuple):
-doc, annot = eg
-if isinstance(doc, str):
-doc = self.make_doc(doc)
-converted_examples.append(Example.from_dict(doc, annot))
-else:
-raise ValueError(Errors.E979.format(type=type(eg)))
-return converted_examples

def update(
self,
examples,
@@ -557,7 +513,7 @@ class Language(object):
):
"""Update the models in the pipeline.

-examples (iterable): A batch of `Example` or `Doc` objects.
+examples (iterable): A batch of `Example` objects.
dummy: Should not be set - serves to catch backwards-incompatible scripts.
drop (float): The dropout rate.
sgd (callable): An optimizer.
@@ -569,10 +525,13 @@ class Language(object):
"""
if dummy is not None:
raise ValueError(Errors.E989)

if len(examples) == 0:
return
-examples = self._convert_examples(examples)
+if not isinstance(examples, Iterable):
+raise TypeError(Errors.E978.format(name="language", method="update", types=type(examples)))
+wrong_types = set([type(eg) for eg in examples if not isinstance(eg, Example)])
+if wrong_types:
+raise TypeError(Errors.E978.format(name="language", method="update", types=wrong_types))

if sgd is None:
if self._optimizer is None:
@@ -605,22 +564,26 @@ class Language(object):
initial ones. This is useful for keeping a pretrained model on-track,
even if you're updating it with a smaller set of examples.

-examples (iterable): A batch of `Doc` objects.
+examples (iterable): A batch of `Example` objects.
drop (float): The dropout rate.
sgd (callable): An optimizer.
RETURNS (dict): Results from the update.

EXAMPLE:
>>> raw_text_batches = minibatch(raw_texts)
->>> for labelled_batch in minibatch(zip(train_docs, train_golds)):
+>>> for labelled_batch in minibatch(examples):
>>> nlp.update(labelled_batch)
->>> raw_batch = [nlp.make_doc(text) for text in next(raw_text_batches)]
+>>> raw_batch = [Example.from_dict(nlp.make_doc(text), {}) for text in next(raw_text_batches)]
>>> nlp.rehearse(raw_batch)
"""
# TODO: document
if len(examples) == 0:
return
-examples = self._convert_examples(examples)
+if not isinstance(examples, Iterable):
+raise TypeError(Errors.E978.format(name="language", method="rehearse", types=type(examples)))
+wrong_types = set([type(eg) for eg in examples if not isinstance(eg, Example)])
+if wrong_types:
+raise TypeError(Errors.E978.format(name="language", method="rehearse", types=wrong_types))
if sgd is None:
if self._optimizer is None:
self._optimizer = create_default_optimizer()
@@ -669,7 +632,7 @@ class Language(object):
_ = self.vocab[word] # noqa: F841

if cfg.get("device", -1) >= 0:
-util.use_gpu(cfg["device"])
+require_gpu(cfg["device"])
if self.vocab.vectors.data.shape[1] >= 1:
ops = get_current_ops()
self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data)
@@ -696,10 +659,10 @@ class Language(object):
component that has a .rehearse() method. Rehearsal is used to prevent
models from "forgetting" their initialised "knowledge". To perform
rehearsal, collect samples of text you want the models to retain performance
-on, and call nlp.rehearse() with a batch of Doc objects.
+on, and call nlp.rehearse() with a batch of Example objects.
"""
if cfg.get("device", -1) >= 0:
-util.use_gpu(cfg["device"])
+require_gpu(cfg["device"])
ops = get_current_ops()
if self.vocab.vectors.data.shape[1] >= 1:
self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data)
@@ -728,7 +691,11 @@ class Language(object):

DOCS: https://spacy.io/api/language#evaluate
"""
-examples = self._convert_examples(examples)
+if not isinstance(examples, Iterable):
+raise TypeError(Errors.E978.format(name="language", method="evaluate", types=type(examples)))
+wrong_types = set([type(eg) for eg in examples if not isinstance(eg, Example)])
+if wrong_types:
+raise TypeError(Errors.E978.format(name="language", method="evaluate", types=wrong_types))
if scorer is None:
scorer = Scorer(pipeline=self.pipeline)
if component_cfg is None:
@@ -786,7 +753,6 @@ class Language(object):
self,
texts,
as_tuples=False,
-n_threads=-1,
batch_size=1000,
disable=[],
cleanup=False,
@@ -811,8 +777,6 @@ class Language(object):

DOCS: https://spacy.io/api/language#pipe
"""
-if n_threads != -1:
-warnings.warn(Warnings.W016, DeprecationWarning)
if n_process == -1:
n_process = mp.cpu_count()
if as_tuples:
@@ -939,7 +903,7 @@ class Language(object):
if hasattr(proc2, "model"):
proc1.find_listeners(proc2.model)

-def to_disk(self, path, exclude=tuple(), disable=None):
+def to_disk(self, path, exclude=tuple()):
"""Save the current state to a directory. If a model is loaded, this
will include the model.

@@ -949,9 +913,6 @@ class Language(object):

DOCS: https://spacy.io/api/language#to_disk
"""
-if disable is not None:
-warnings.warn(Warnings.W014, DeprecationWarning)
-exclude = disable
path = util.ensure_path(path)
serializers = {}
serializers["tokenizer"] = lambda p: self.tokenizer.to_disk(
@@ -970,7 +931,7 @@ class Language(object):
serializers["vocab"] = lambda p: self.vocab.to_disk(p)
util.to_disk(path, serializers, exclude)

-def from_disk(self, path, exclude=tuple(), disable=None):
+def from_disk(self, path, exclude=tuple()):
"""Loads state from a directory. Modifies the object in place and
returns it. If the saved `Language` object contains a model, the
model will be loaded.
@@ -995,9 +956,6 @@ class Language(object):
self.vocab.from_disk(path)
_fix_pretrained_vectors_name(self)

-if disable is not None:
-warnings.warn(Warnings.W014, DeprecationWarning)
-exclude = disable
path = util.ensure_path(path)

deserializers = {}
@@ -1024,7 +982,7 @@ class Language(object):
self._link_components()
return self

-def to_bytes(self, exclude=tuple(), disable=None, **kwargs):
+def to_bytes(self, exclude=tuple()):
"""Serialize the current state to a binary string.

exclude (list): Names of components or serialization fields to exclude.
@@ -1032,9 +990,6 @@ class Language(object):

DOCS: https://spacy.io/api/language#to_bytes
"""
-if disable is not None:
-warnings.warn(Warnings.W014, DeprecationWarning)
-exclude = disable
serializers = {}
serializers["vocab"] = lambda: self.vocab.to_bytes()
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
@@ -1046,10 +1001,9 @@ class Language(object):
if not hasattr(proc, "to_bytes"):
continue
serializers[name] = lambda proc=proc: proc.to_bytes(exclude=["vocab"])
-exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
return util.to_bytes(serializers, exclude)

-def from_bytes(self, bytes_data, exclude=tuple(), disable=None, **kwargs):
+def from_bytes(self, bytes_data, exclude=tuple()):
"""Load state from a binary string.

bytes_data (bytes): The data to load from.
@@ -1070,9 +1024,6 @@ class Language(object):
self.vocab.from_bytes(b)
_fix_pretrained_vectors_name(self)

-if disable is not None:
-warnings.warn(Warnings.W014, DeprecationWarning)
-exclude = disable
deserializers = {}
deserializers["config.cfg"] = lambda b: self.config.from_bytes(b)
deserializers["meta.json"] = deserialize_meta
@@ -1088,7 +1039,6 @@ class Language(object):
deserializers[name] = lambda b, proc=proc: proc.from_bytes(
b, exclude=["vocab"]
)
-exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
util.from_bytes(bytes_data, deserializers, exclude)
self._link_components()
return self
@@ -1210,7 +1160,7 @@ class DisabledPipes(list):
def _pipe(examples, proc, kwargs):
# We added some args for pipe that __call__ doesn't expect.
kwargs = dict(kwargs)
-for arg in ["n_threads", "batch_size"]:
+for arg in ["batch_size"]:
if arg in kwargs:
kwargs.pop(arg)
for eg in examples:
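`Language.update`, `rehearse` and `evaluate` now require ready-made `Example` objects instead of converting `(doc, annotations)` tuples on the fly, and raise `E978` otherwise. A hedged sketch of the new calling convention (the `Example` import path and `begin_training` usage are assumptions for this development branch):

    import spacy
    from spacy.gold import Example  # assumed import path at this commit

    nlp = spacy.blank("en")
    train_data = [("I like London.", {"entities": [(7, 13, "LOC")]})]

    examples = []
    for text, annots in train_data:
        # Build the Example explicitly; passing raw (text, dict) tuples now raises E978.
        examples.append(Example.from_dict(nlp.make_doc(text), annots))

    optimizer = nlp.begin_training()
    nlp.update(examples, sgd=optimizer)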
@@ -1,5 +1,4 @@
from .errors import Errors
-from .lookups import Lookups
from .parts_of_speech import NAMES as UPOS_NAMES


@@ -15,15 +14,13 @@ class Lemmatizer(object):
def load(cls, *args, **kwargs):
raise NotImplementedError(Errors.E172)

-def __init__(self, lookups, *args, **kwargs):
+def __init__(self, lookups):
"""Initialize a Lemmatizer.

lookups (Lookups): The lookups object containing the (optional) tables
"lemma_rules", "lemma_index", "lemma_exc" and "lemma_lookup".
RETURNS (Lemmatizer): The newly constructed object.
"""
-if args or kwargs or not isinstance(lookups, Lookups):
-raise ValueError(Errors.E173)
self.lookups = lookups

def __call__(self, string, univ_pos, morphology=None):
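The `Lemmatizer` constructor above drops the back-compat `*args`/`**kwargs` path and the strict `Lookups` isinstance check. A construction sketch (the lookup-table fallback behaviour is an assumption, not asserted by this diff):

    from spacy.lemmatizer import Lemmatizer  # assumed import path
    from spacy.lookups import Lookups

    lookups = Lookups()
    lookups.add_table("lemma_lookup", {"went": "go", "going": "go"})
    lemmatizer = Lemmatizer(lookups)
    # With only "lemma_lookup" present, lemmatization should fall back to the table.
    print(lemmatizer("went", "VERB"))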
@@ -174,8 +174,7 @@ cdef class Matcher:
return default
return (self._callbacks[key], self._patterns[key])

-def pipe(self, docs, batch_size=1000, n_threads=-1, return_matches=False,
+def pipe(self, docs, batch_size=1000, return_matches=False, as_tuples=False):
-as_tuples=False):
"""Match a stream of documents, yielding them in turn.

docs (iterable): A stream of documents.
@@ -188,9 +187,6 @@ cdef class Matcher:
be a sequence of ((doc, matches), context) tuples.
YIELDS (Doc): Documents, in order.
"""
-if n_threads != -1:
-warnings.warn(Warnings.W016, DeprecationWarning)

if as_tuples:
for doc, context in docs:
matches = self(doc)
@@ -26,7 +26,7 @@ cdef class PhraseMatcher:
Copyright (c) 2017 Vikash Singh (vikash.duliajan@gmail.com)
"""

-def __init__(self, Vocab vocab, max_length=0, attr="ORTH", validate=False):
+def __init__(self, Vocab vocab, attr="ORTH", validate=False):
"""Initialize the PhraseMatcher.

vocab (Vocab): The shared vocabulary.
@@ -36,8 +36,6 @@ cdef class PhraseMatcher:

DOCS: https://spacy.io/api/phrasematcher#init
"""
-if max_length != 0:
-warnings.warn(Warnings.W010, DeprecationWarning)
self.vocab = vocab
self._callbacks = {}
self._docs = {}
@@ -287,8 +285,7 @@ cdef class PhraseMatcher:
current_node = self.c_map
idx += 1

-def pipe(self, stream, batch_size=1000, n_threads=-1, return_matches=False,
+def pipe(self, stream, batch_size=1000, return_matches=False, as_tuples=False):
-as_tuples=False):
"""Match a stream of documents, yielding them in turn.

docs (iterable): A stream of documents.
@@ -303,8 +300,6 @@ cdef class PhraseMatcher:

DOCS: https://spacy.io/api/phrasematcher#pipe
"""
-if n_threads != -1:
-warnings.warn(Warnings.W016, DeprecationWarning)
if as_tuples:
for doc, context in stream:
matches = self(doc)
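Both `Matcher.pipe` and `PhraseMatcher` lose their deprecated `n_threads`/`max_length` arguments in the hunks above. A hedged sketch of the surviving call shape (the list-based `add` signature is assumed for this development version):

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.blank("en")
    matcher = PhraseMatcher(nlp.vocab, attr="ORTH")  # no max_length argument any more
    matcher.add("CITY", [nlp.make_doc("San Francisco"), nlp.make_doc("Berlin")])
    doc = nlp.make_doc("She moved from Berlin to San Francisco.")
    matches = matcher(doc)  # list of (match_id, start, end) token offsets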
@@ -1,6 +1,7 @@
import numpy

from thinc.api import chain, Maxout, LayerNorm, Softmax, Linear, zero_init, Model
+from thinc.api import MultiSoftmax, list2array


def build_multi_task_model(tok2vec, maxout_pieces, token_vector_width, nO=None):
@@ -21,9 +22,10 @@ def build_multi_task_model(tok2vec, maxout_pieces, token_vector_width, nO=None):
return model


-def build_cloze_multi_task_model(vocab, tok2vec, maxout_pieces, nO=None):
+def build_cloze_multi_task_model(vocab, tok2vec, maxout_pieces, hidden_size, nO=None):
# nO = vocab.vectors.data.shape[1]
output_layer = chain(
+list2array(),
Maxout(
nO=nO,
nI=tok2vec.get_dim("nO"),
@@ -40,6 +42,22 @@ def build_cloze_multi_task_model(vocab, tok2vec, maxout_pieces, nO=None):
return model


+def build_cloze_characters_multi_task_model(
+vocab, tok2vec, maxout_pieces, hidden_size, nr_char
+):
+output_layer = chain(
+list2array(),
+Maxout(hidden_size, nP=maxout_pieces),
+LayerNorm(nI=hidden_size),
+MultiSoftmax([256] * nr_char, nI=hidden_size),
+)

+model = build_masked_language_model(vocab, chain(tok2vec, output_layer))
+model.set_ref("tok2vec", tok2vec)
+model.set_ref("output_layer", output_layer)
+return model


def build_masked_language_model(vocab, wrapped_model, mask_prob=0.15):
"""Convert a model into a BERT-style masked language model"""

@@ -48,7 +66,7 @@ def build_masked_language_model(vocab, wrapped_model, mask_prob=0.15):
def mlm_forward(model, docs, is_train):
mask, docs = _apply_mask(docs, random_words, mask_prob=mask_prob)
mask = model.ops.asarray(mask).reshape((mask.shape[0], 1))
-output, backprop = model.get_ref("wrapped-model").begin_update(docs)
+output, backprop = model.layers[0](docs, is_train)

def mlm_backward(d_output):
d_output *= 1 - mask
@@ -56,8 +74,22 @@ def build_masked_language_model(vocab, wrapped_model, mask_prob=0.15):

return output, mlm_backward

-mlm_model = Model("masked-language-model", mlm_forward, layers=[wrapped_model])
+def mlm_initialize(model, X=None, Y=None):
-mlm_model.set_ref("wrapped-model", wrapped_model)
+wrapped = model.layers[0]
+wrapped.initialize(X=X, Y=Y)
+for dim in wrapped.dim_names:
+if wrapped.has_dim(dim):
+model.set_dim(dim, wrapped.get_dim(dim))

+mlm_model = Model(
+"masked-language-model",
+mlm_forward,
+layers=[wrapped_model],
+init=mlm_initialize,
+refs={"wrapped": wrapped_model},
+dims={dim: None for dim in wrapped_model.dim_names},
+)
+mlm_model.set_ref("wrapped", wrapped_model)

return mlm_model

@@ -17,11 +17,7 @@ def build_tb_parser_model(
nO=None,
):
t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None
-tok2vec = chain(
+tok2vec = chain(tok2vec, list2array(), Linear(hidden_width, t2v_width),)
-tok2vec,
-list2array(),
-Linear(hidden_width, t2v_width),
-)
tok2vec.set_dim("nO", hidden_width)

lower = PrecomputableAffine(
@@ -263,17 +263,21 @@ def build_Tok2Vec_model(
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
with Model.define_operators({">>": chain, "|": concatenate, "**": clone}):
norm = HashEmbed(
-nO=width, nV=embed_size, column=cols.index(NORM), dropout=dropout
+nO=width, nV=embed_size, column=cols.index(NORM), dropout=dropout,
+seed=0
)
if subword_features:
prefix = HashEmbed(
-nO=width, nV=embed_size // 2, column=cols.index(PREFIX), dropout=dropout
+nO=width, nV=embed_size // 2, column=cols.index(PREFIX), dropout=dropout,
+seed=1
)
suffix = HashEmbed(
-nO=width, nV=embed_size // 2, column=cols.index(SUFFIX), dropout=dropout
+nO=width, nV=embed_size // 2, column=cols.index(SUFFIX), dropout=dropout,
+seed=2
)
shape = HashEmbed(
-nO=width, nV=embed_size // 2, column=cols.index(SHAPE), dropout=dropout
+nO=width, nV=embed_size // 2, column=cols.index(SHAPE), dropout=dropout,
+seed=3
)
else:
prefix, suffix, shape = (None, None, None)
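Each `HashEmbed` in the tok2vec hunk above now receives an explicit `seed`, so the NORM/PREFIX/SUFFIX/SHAPE tables no longer share identical initialization. A small sketch of the pattern (keyword names copied from the hunk; not re-checked against the thinc release):

    from thinc.api import HashEmbed, concatenate

    width, embed_size = 96, 2000
    # Distinct seeds keep the two hash-embedding tables from being copies of each other.
    norm = HashEmbed(nO=width, nV=embed_size, column=0, dropout=0.0, seed=0)
    prefix = HashEmbed(nO=width, nV=embed_size // 2, column=1, dropout=0.0, seed=1)
    embed = concatenate(norm, prefix)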
@@ -120,15 +120,14 @@ class Morphologizer(Tagger):
d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs])
return float(loss), d_scores

-def to_bytes(self, exclude=tuple(), **kwargs):
+def to_bytes(self, exclude=tuple()):
serialize = {}
serialize["model"] = self.model.to_bytes
serialize["vocab"] = self.vocab.to_bytes
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
-exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude)

-def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
+def from_bytes(self, bytes_data, exclude=tuple()):
def load_model(b):
try:
self.model.from_bytes(b)
@@ -140,20 +139,18 @@ class Morphologizer(Tagger):
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
"model": lambda b: load_model(b),
}
-exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_bytes(bytes_data, deserialize, exclude)
return self

-def to_disk(self, path, exclude=tuple(), **kwargs):
+def to_disk(self, path, exclude=tuple()):
serialize = {
"vocab": lambda p: self.vocab.to_disk(p),
"model": lambda p: p.open("wb").write(self.model.to_bytes()),
"cfg": lambda p: srsly.write_json(p, self.cfg),
}
-exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude)

-def from_disk(self, path, exclude=tuple(), **kwargs):
+def from_disk(self, path, exclude=tuple()):
def load_model(p):
with p.open("rb") as file_:
try:
@@ -166,6 +163,5 @@ class Morphologizer(Tagger):
"cfg": lambda p: self.cfg.update(_load_cfg(p)),
"model": load_model,
}
-exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_disk(path, deserialize, exclude)
return self
@@ -66,7 +66,7 @@ class Pipe(object):
self.set_annotations([doc], predictions)
return doc

-def pipe(self, stream, batch_size=128, n_threads=-1):
+def pipe(self, stream, batch_size=128):
"""Apply the pipe to a stream of documents.

Both __call__ and pipe should delegate to the `predict()`
@@ -151,7 +151,7 @@ class Pipe(object):
with self.model.use_params(params):
yield

-def to_bytes(self, exclude=tuple(), **kwargs):
+def to_bytes(self, exclude=tuple()):
"""Serialize the pipe to a bytestring.

exclude (list): String names of serialization fields to exclude.
@@ -162,10 +162,9 @@ class Pipe(object):
serialize["model"] = self.model.to_bytes
if hasattr(self, "vocab"):
serialize["vocab"] = self.vocab.to_bytes
-exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude)

-def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
+def from_bytes(self, bytes_data, exclude=tuple()):
"""Load the pipe from a bytestring."""

def load_model(b):
@@ -179,20 +178,18 @@ class Pipe(object):
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
deserialize["model"] = load_model
-exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_bytes(bytes_data, deserialize, exclude)
return self

-def to_disk(self, path, exclude=tuple(), **kwargs):
+def to_disk(self, path, exclude=tuple()):
"""Serialize the pipe to disk."""
serialize = {}
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
serialize["model"] = lambda p: self.model.to_disk(p)
-exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude)

-def from_disk(self, path, exclude=tuple(), **kwargs):
+def from_disk(self, path, exclude=tuple()):
"""Load the pipe from disk."""

def load_model(p):
@@ -205,7 +202,6 @@ class Pipe(object):
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
deserialize["model"] = load_model
-exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_disk(path, deserialize, exclude)
return self

@@ -232,7 +228,7 @@ class Tagger(Pipe):
self.set_annotations([doc], tags)
return doc

-def pipe(self, stream, batch_size=128, n_threads=-1):
+def pipe(self, stream, batch_size=128):
for docs in util.minibatch(stream, size=batch_size):
tag_ids = self.predict(docs)
self.set_annotations(docs, tag_ids)
@@ -295,7 +291,7 @@ class Tagger(Pipe):
return
except AttributeError:
types = set([type(eg) for eg in examples])
-raise ValueError(Errors.E978.format(name="Tagger", method="update", types=types))
+raise TypeError(Errors.E978.format(name="Tagger", method="update", types=types))
set_dropout_rate(self.model, drop)
tag_scores, bp_tag_scores = self.model.begin_update(
[eg.predicted for eg in examples])
@@ -321,7 +317,7 @@ class Tagger(Pipe):
docs = [eg.predicted for eg in examples]
except AttributeError:
types = set([type(eg) for eg in examples])
-raise ValueError(Errors.E978.format(name="Tagger", method="rehearse", types=types))
+raise TypeError(Errors.E978.format(name="Tagger", method="rehearse", types=types))
if self._rehearsal_model is None:
return
if not any(len(doc) for doc in docs):
@@ -358,7 +354,7 @@ class Tagger(Pipe):
try:
y = example.y
except AttributeError:
-raise ValueError(Errors.E978.format(name="Tagger", method="begin_training", types=type(example)))
+raise TypeError(Errors.E978.format(name="Tagger", method="begin_training", types=type(example)))
for token in y:
tag = token.tag_
if tag in orig_tag_map:
@@ -421,17 +417,16 @@ class Tagger(Pipe):
with self.model.use_params(params):
yield

-def to_bytes(self, exclude=tuple(), **kwargs):
+def to_bytes(self, exclude=tuple()):
serialize = {}
serialize["model"] = self.model.to_bytes
serialize["vocab"] = self.vocab.to_bytes
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
tag_map = dict(sorted(self.vocab.morphology.tag_map.items()))
serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map)
-exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude)

-def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
+def from_bytes(self, bytes_data, exclude=tuple()):
def load_model(b):
try:
self.model.from_bytes(b)
@@ -451,11 +446,10 @@ class Tagger(Pipe):
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
"model": lambda b: load_model(b),
}
-exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_bytes(bytes_data, deserialize, exclude)
return self

-def to_disk(self, path, exclude=tuple(), **kwargs):
+def to_disk(self, path, exclude=tuple()):
tag_map = dict(sorted(self.vocab.morphology.tag_map.items()))
serialize = {
"vocab": lambda p: self.vocab.to_disk(p),
@@ -463,10 +457,9 @@ class Tagger(Pipe):
"model": lambda p: self.model.to_disk(p),
"cfg": lambda p: srsly.write_json(p, self.cfg),
}
-exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude)

-def from_disk(self, path, exclude=tuple(), **kwargs):
+def from_disk(self, path, exclude=tuple()):
def load_model(p):
with p.open("rb") as file_:
try:
@@ -487,7 +480,6 @@ class Tagger(Pipe):
"tag_map": load_tag_map,
"model": load_model,
}
-exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_disk(path, deserialize, exclude)
return self

@@ -566,15 +558,14 @@ class SentenceRecognizer(Tagger):
def add_label(self, label, values=None):
raise NotImplementedError

-def to_bytes(self, exclude=tuple(), **kwargs):
+def to_bytes(self, exclude=tuple()):
serialize = {}
serialize["model"] = self.model.to_bytes
serialize["vocab"] = self.vocab.to_bytes
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
-exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude)

-def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
+def from_bytes(self, bytes_data, exclude=tuple()):
def load_model(b):
try:
self.model.from_bytes(b)
@@ -586,20 +577,18 @@ class SentenceRecognizer(Tagger):
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
"model": lambda b: load_model(b),
}
-exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_bytes(bytes_data, deserialize, exclude)
return self

-def to_disk(self, path, exclude=tuple(), **kwargs):
+def to_disk(self, path, exclude=tuple()):
serialize = {
"vocab": lambda p: self.vocab.to_disk(p),
"model": lambda p: p.open("wb").write(self.model.to_bytes()),
"cfg": lambda p: srsly.write_json(p, self.cfg),
}
-exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude)

-def from_disk(self, path, exclude=tuple(), **kwargs):
+def from_disk(self, path, exclude=tuple()):
def load_model(p):
with p.open("rb") as file_:
try:
@@ -612,7 +601,6 @@ class SentenceRecognizer(Tagger):
"cfg": lambda p: self.cfg.update(_load_cfg(p)),
"model": load_model,
}
-exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_disk(path, deserialize, exclude)
return self

@@ -790,7 +778,7 @@ class ClozeMultitask(Pipe):
predictions, bp_predictions = self.model.begin_update([eg.predicted for eg in examples])
except AttributeError:
types = set([type(eg) for eg in examples])
-raise ValueError(Errors.E978.format(name="ClozeMultitask", method="rehearse", types=types))
+raise TypeError(Errors.E978.format(name="ClozeMultitask", method="rehearse", types=types))
loss, d_predictions = self.get_loss(examples, self.vocab.vectors.data, predictions)
bp_predictions(d_predictions)
if sgd is not None:
@@ -825,7 +813,7 @@ class TextCategorizer(Pipe):
def labels(self, value):
self.cfg["labels"] = tuple(value)

-def pipe(self, stream, batch_size=128, n_threads=-1):
+def pipe(self, stream, batch_size=128):
for docs in util.minibatch(stream, size=batch_size):
scores, tensors = self.predict(docs)
self.set_annotations(docs, scores, tensors=tensors)
@@ -856,7 +844,7 @@ class TextCategorizer(Pipe):
return
except AttributeError:
types = set([type(eg) for eg in examples])
-raise ValueError(Errors.E978.format(name="TextCategorizer", method="update", types=types))
+raise TypeError(Errors.E978.format(name="TextCategorizer", method="update", types=types))
set_dropout_rate(self.model, drop)
scores, bp_scores = self.model.begin_update(
[eg.predicted for eg in examples]
@@ -879,7 +867,7 @@ class TextCategorizer(Pipe):
docs = [eg.predicted for eg in examples]
except AttributeError:
types = set([type(eg) for eg in examples])
-raise ValueError(Errors.E978.format(name="TextCategorizer", method="rehearse", types=types))
+raise TypeError(Errors.E978.format(name="TextCategorizer", method="rehearse", types=types))
if not any(len(doc) for doc in docs):
# Handle cases where there are no tokens in any docs.
return
@@ -940,7 +928,7 @@ class TextCategorizer(Pipe):
try:
y = example.y
except AttributeError:
-raise ValueError(Errors.E978.format(name="TextCategorizer", method="update", types=type(example)))
+raise TypeError(Errors.E978.format(name="TextCategorizer", method="update", types=type(example)))
for cat in y.cats:
self.add_label(cat)
self.require_labels()
@@ -1105,7 +1093,7 @@ class EntityLinker(Pipe):
docs = [eg.predicted for eg in examples]
except AttributeError:
types = set([type(eg) for eg in examples])
-raise ValueError(Errors.E978.format(name="EntityLinker", method="update", types=types))
+raise TypeError(Errors.E978.format(name="EntityLinker", method="update", types=types))
if set_annotations:
# This seems simpler than other ways to get that exact output -- but
# it does run the model twice :(
@@ -1198,7 +1186,7 @@ class EntityLinker(Pipe):
self.set_annotations([doc], kb_ids, tensors=tensors)
return doc

-def pipe(self, stream, batch_size=128, n_threads=-1):
+def pipe(self, stream, batch_size=128):
for docs in util.minibatch(stream, size=batch_size):
kb_ids, tensors = self.predict(docs)
self.set_annotations(docs, kb_ids, tensors=tensors)
@@ -1309,17 +1297,16 @@ class EntityLinker(Pipe):
for token in ent:
token.ent_kb_id_ = kb_id

-def to_disk(self, path, exclude=tuple(), **kwargs):
+def to_disk(self, path, exclude=tuple()):
serialize = {}
self.cfg["entity_width"] = self.kb.entity_vector_length
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
serialize["kb"] = lambda p: self.kb.dump(p)
serialize["model"] = lambda p: self.model.to_disk(p)
-exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude)
|
util.to_disk(path, serialize, exclude)
|
||||||
|
|
||||||
def from_disk(self, path, exclude=tuple(), **kwargs):
|
def from_disk(self, path, exclude=tuple()):
|
||||||
def load_model(p):
|
def load_model(p):
|
||||||
try:
|
try:
|
||||||
self.model.from_bytes(p.open("rb").read())
|
self.model.from_bytes(p.open("rb").read())
|
||||||
|
@ -1335,7 +1322,6 @@ class EntityLinker(Pipe):
|
||||||
deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
|
deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
|
||||||
deserialize["kb"] = load_kb
|
deserialize["kb"] = load_kb
|
||||||
deserialize["model"] = load_model
|
deserialize["model"] = load_model
|
||||||
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
|
||||||
util.from_disk(path, deserialize, exclude)
|
util.from_disk(path, deserialize, exclude)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
|
@ -1411,7 +1397,7 @@ class Sentencizer(Pipe):
|
||||||
doc[start].is_sent_start = True
|
doc[start].is_sent_start = True
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
def pipe(self, stream, batch_size=128, n_threads=-1):
|
def pipe(self, stream, batch_size=128):
|
||||||
for docs in util.minibatch(stream, size=batch_size):
|
for docs in util.minibatch(stream, size=batch_size):
|
||||||
predictions = self.predict(docs)
|
predictions = self.predict(docs)
|
||||||
if isinstance(predictions, tuple) and len(tuple) == 2:
|
if isinstance(predictions, tuple) and len(tuple) == 2:
|
||||||
|
|
|
@@ -51,11 +51,10 @@ class Tok2Vec(Pipe):
 self.set_annotations([doc], tokvecses)
 return doc

-def pipe(self, stream, batch_size=128, n_threads=-1):
+def pipe(self, stream, batch_size=128):
 """Process `Doc` objects as a stream.
 stream (iterator): A sequence of `Doc` objects to process.
 batch_size (int): Number of `Doc` objects to group.
-n_threads (int): Number of threads.
 YIELDS (iterator): A sequence of `Doc` objects, in order of input.
 """
 for docs in minibatch(stream, batch_size):

@@ -326,10 +326,11 @@ class Scorer(object):
 for token in doc:
 if token.orth_.isspace():
 continue
-gold_i = align.cand_to_gold[token.i]
-if gold_i is None:
+if align.x2y.lengths[token.i] != 1:
 self.tokens.fp += 1
+gold_i = None
 else:
+gold_i = align.x2y[token.i].dataXd[0, 0]
 self.tokens.tp += 1
 cand_tags.add((gold_i, token.tag_))
 cand_pos.add((gold_i, token.pos_))
@@ -345,7 +346,10 @@ class Scorer(object):
 if token.is_sent_start:
 cand_sent_starts.add(gold_i)
 if token.dep_.lower() not in punct_labels and token.orth_.strip():
-gold_head = align.cand_to_gold[token.head.i]
+if align.x2y.lengths[token.head.i] == 1:
+gold_head = align.x2y[token.head.i].dataXd[0, 0]
+else:
+gold_head = None
 # None is indistinct, so we can't just add it to the set
 # Multiple (None, None) deps are possible
 if gold_i is None or gold_head is None:
@@ -381,15 +385,9 @@ class Scorer(object):
 gold_ents.add(gold_ent)
 gold_per_ents[ent.label_].add((ent.label_, ent.start, ent.end - 1))
 cand_per_ents = {ent_label: set() for ent_label in ent_labels}
-for ent in doc.ents:
-first = align.cand_to_gold[ent.start]
-last = align.cand_to_gold[ent.end - 1]
-if first is None or last is None:
-self.ner.fp += 1
-self.ner_per_ents[ent.label_].fp += 1
-else:
-cand_ents.add((ent.label_, first, last))
-cand_per_ents[ent.label_].add((ent.label_, first, last))
+for ent in example.get_aligned_spans_x2y(doc.ents):
+cand_ents.add((ent.label_, ent.start, ent.end - 1))
+cand_per_ents[ent.label_].add((ent.label_, ent.start, ent.end - 1))
 # Scores per ent
 for k, v in self.ner_per_ents.items():
 if k in cand_per_ents:

@@ -157,7 +157,7 @@ cdef class Parser:
 self.set_annotations([doc], states, tensors=None)
 return doc

-def pipe(self, docs, int batch_size=256, int n_threads=-1):
+def pipe(self, docs, int batch_size=256):
 """Process a stream of documents.

 stream: The sequence of documents to process.
@@ -461,24 +461,22 @@ cdef class Parser:
 link_vectors_to_models(self.vocab)
 return sgd

-def to_disk(self, path, exclude=tuple(), **kwargs):
+def to_disk(self, path, exclude=tuple()):
 serializers = {
 'model': lambda p: (self.model.to_disk(p) if self.model is not True else True),
 'vocab': lambda p: self.vocab.to_disk(p),
 'moves': lambda p: self.moves.to_disk(p, exclude=["strings"]),
 'cfg': lambda p: srsly.write_json(p, self.cfg)
 }
-exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
 util.to_disk(path, serializers, exclude)

-def from_disk(self, path, exclude=tuple(), **kwargs):
+def from_disk(self, path, exclude=tuple()):
 deserializers = {
 'vocab': lambda p: self.vocab.from_disk(p),
 'moves': lambda p: self.moves.from_disk(p, exclude=["strings"]),
 'cfg': lambda p: self.cfg.update(srsly.read_json(p)),
 'model': lambda p: None,
 }
-exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
 util.from_disk(path, deserializers, exclude)
 if 'model' not in exclude:
 path = util.ensure_path(path)
@@ -491,24 +489,22 @@ cdef class Parser:
 raise ValueError(Errors.E149)
 return self

-def to_bytes(self, exclude=tuple(), **kwargs):
+def to_bytes(self, exclude=tuple()):
 serializers = {
 "model": lambda: (self.model.to_bytes()),
 "vocab": lambda: self.vocab.to_bytes(),
 "moves": lambda: self.moves.to_bytes(exclude=["strings"]),
 "cfg": lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True)
 }
-exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
 return util.to_bytes(serializers, exclude)

-def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
+def from_bytes(self, bytes_data, exclude=tuple()):
 deserializers = {
 "vocab": lambda b: self.vocab.from_bytes(b),
 "moves": lambda b: self.moves.from_bytes(b, exclude=["strings"]),
 "cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
 "model": lambda b: None,
 }
-exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
 msg = util.from_bytes(bytes_data, deserializers, exclude)
 if 'model' not in exclude:
 if 'model' in msg:

@@ -60,7 +60,7 @@ cdef class TransitionSystem:
 states.append(state)
 offset += len(doc)
 return states

 def get_oracle_sequence(self, Example example, _debug=False):
 states, golds, _ = self.init_gold_batch([example])
 if not states:
@@ -227,22 +227,20 @@ cdef class TransitionSystem:
 self.from_bytes(byte_data, **kwargs)
 return self

-def to_bytes(self, exclude=tuple(), **kwargs):
+def to_bytes(self, exclude=tuple()):
 transitions = []
 serializers = {
 'moves': lambda: srsly.json_dumps(self.labels),
 'strings': lambda: self.strings.to_bytes()
 }
-exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
 return util.to_bytes(serializers, exclude)

-def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
+def from_bytes(self, bytes_data, exclude=tuple()):
 labels = {}
 deserializers = {
 'moves': lambda b: labels.update(srsly.json_loads(b)),
 'strings': lambda b: self.strings.from_bytes(b)
 }
-exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
 msg = util.from_bytes(bytes_data, deserializers, exclude)
 self.initialize_actions(labels)
 return self

@@ -179,22 +179,9 @@ def test_doc_api_right_edge(en_tokenizer):
 doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
 assert doc[6].text == "for"
 subtree = [w.text for w in doc[6].subtree]
-assert subtree == [
-"for",
-"the",
-"sake",
-"of",
-"such",
-"as",
-"live",
-"under",
-"the",
-"government",
-"of",
-"the",
-"Romans",
-",",
-]
+# fmt: off
+assert subtree == ["for", "the", "sake", "of", "such", "as", "live", "under", "the", "government", "of", "the", "Romans", ","]
+# fmt: on
 assert doc[6].right_edge.text == ","


@@ -303,6 +290,68 @@ def test_doc_from_array_sent_starts(en_vocab):
 assert new_doc.is_parsed


+def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
+en_texts = ["Merging the docs is fun.", "They don't think alike."]
+de_text = "Wie war die Frage?"
+en_docs = [en_tokenizer(text) for text in en_texts]
+docs_idx = en_texts[0].index("docs")
+de_doc = de_tokenizer(de_text)
+en_docs[0].user_data[("._.", "is_ambiguous", docs_idx, None)] = (
+True,
+None,
+None,
+None,
+)
+
+assert Doc.from_docs([]) is None
+
+assert de_doc is not Doc.from_docs([de_doc])
+assert str(de_doc) == str(Doc.from_docs([de_doc]))
+
+with pytest.raises(ValueError):
+Doc.from_docs(en_docs + [de_doc])
+
+m_doc = Doc.from_docs(en_docs)
+assert len(en_docs) == len(list(m_doc.sents))
+assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1])
+assert str(m_doc) == " ".join(en_texts)
+p_token = m_doc[len(en_docs[0]) - 1]
+assert p_token.text == "." and bool(p_token.whitespace_)
+en_docs_tokens = [t for doc in en_docs for t in doc]
+assert len(m_doc) == len(en_docs_tokens)
+think_idx = len(en_texts[0]) + 1 + en_texts[1].index("think")
+assert m_doc[9].idx == think_idx
+with pytest.raises(AttributeError):
+# not callable, because it was not set via set_extension
+m_doc[2]._.is_ambiguous
+assert len(m_doc.user_data) == len(en_docs[0].user_data) # but it's there
+
+m_doc = Doc.from_docs(en_docs, ensure_whitespace=False)
+assert len(en_docs) == len(list(m_doc.sents))
+assert len(str(m_doc)) == len(en_texts[0]) + len(en_texts[1])
+assert str(m_doc) == "".join(en_texts)
+p_token = m_doc[len(en_docs[0]) - 1]
+assert p_token.text == "." and not bool(p_token.whitespace_)
+en_docs_tokens = [t for doc in en_docs for t in doc]
+assert len(m_doc) == len(en_docs_tokens)
+think_idx = len(en_texts[0]) + 0 + en_texts[1].index("think")
+assert m_doc[9].idx == think_idx
+
+m_doc = Doc.from_docs(en_docs, attrs=["lemma", "length", "pos"])
+with pytest.raises(ValueError):
+# important attributes from sentenziser or parser are missing
+assert list(m_doc.sents)
+assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1])
+# space delimiter considered, although spacy attribute was missing
+assert str(m_doc) == " ".join(en_texts)
+p_token = m_doc[len(en_docs[0]) - 1]
+assert p_token.text == "." and bool(p_token.whitespace_)
+en_docs_tokens = [t for doc in en_docs for t in doc]
+assert len(m_doc) == len(en_docs_tokens)
+think_idx = len(en_texts[0]) + 1 + en_texts[1].index("think")
+assert m_doc[9].idx == think_idx
+
+
 def test_doc_lang(en_vocab):
 doc = Doc(en_vocab, words=["Hello", "world"])
 assert doc.lang_ == "en"

@@ -66,8 +66,6 @@ def test_spans_string_fn(doc):
 span = doc[0:4]
 assert len(span) == 4
 assert span.text == "This is a sentence"
-assert span.upper_ == "THIS IS A SENTENCE"
-assert span.lower_ == "this is a sentence"


 def test_spans_root2(en_tokenizer):

@@ -1,13 +1,11 @@
 import pytest
-from thinc.api import Adam
+from thinc.api import Adam, fix_random_seed
 from spacy.attrs import NORM
 from spacy.vocab import Vocab

 from spacy.gold import Example
 from spacy.pipeline.defaults import default_parser, default_ner
 from spacy.tokens import Doc
 from spacy.pipeline import DependencyParser, EntityRecognizer
-from spacy.util import fix_random_seed


 @pytest.fixture

@@ -118,6 +118,7 @@ def test_oracle_moves_missing_B(en_vocab):
 moves.add_action(move_types.index("U"), label)
 moves.get_oracle_sequence(example)

+
 # We can't easily represent this on a Doc object. Not sure what the best solution
 # would be, but I don't think it's an important use case?
 @pytest.mark.xfail(reason="No longer supported")
@@ -208,6 +209,10 @@ def test_train_empty():
 ]

 nlp = English()
+train_examples = []
+for t in train_data:
+train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
+
 ner = nlp.create_pipe("ner")
 ner.add_label("PERSON")
 nlp.add_pipe(ner, last=True)
@@ -215,10 +220,9 @@ def test_train_empty():
 nlp.begin_training()
 for itn in range(2):
 losses = {}
-batches = util.minibatch(train_data)
+batches = util.minibatch(train_examples)
 for batch in batches:
-texts, annotations = zip(*batch)
-nlp.update(train_data, losses=losses)
+nlp.update(batch, losses=losses)


 def test_overwrite_token():
@@ -327,7 +331,9 @@ def test_overfitting_IO():
 # Simple test to try and quickly overfit the NER component - ensuring the ML models work correctly
 nlp = English()
 ner = nlp.create_pipe("ner")
-for _, annotations in TRAIN_DATA:
+train_examples = []
+for text, annotations in TRAIN_DATA:
+train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
 for ent in annotations.get("entities"):
 ner.add_label(ent[2])
 nlp.add_pipe(ner)
@@ -335,7 +341,7 @@ def test_overfitting_IO():

 for i in range(50):
 losses = {}
-nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
+nlp.update(train_examples, sgd=optimizer, losses=losses)
 assert losses["ner"] < 0.00001

 # test the trained model

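The test changes above, and in the files that follow, converge on one training-data pattern: raw (text, annotations) pairs are wrapped in Example objects before being passed to nlp.update. A minimal sketch of that pattern, using a made-up one-example dataset rather than the TRAIN_DATA from the test files:

    from spacy.gold import Example
    from spacy.lang.en import English

    # illustrative data, not taken from the tests
    TRAIN_DATA = [("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]})]

    nlp = English()
    ner = nlp.create_pipe("ner")
    ner.add_label("PERSON")
    nlp.add_pipe(ner)

    # wrap each (text, annotations) pair in an Example, as the updated tests do
    train_examples = []
    for text, annotations in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))

    optimizer = nlp.begin_training()
    losses = {}
    nlp.update(train_examples, sgd=optimizer, losses=losses)
    print(losses)
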
@@ -3,6 +3,7 @@ import pytest
 from spacy.lang.en import English
 from ..util import get_doc, apply_transition_sequence, make_tempdir
 from ... import util
+from ...gold import Example

 TRAIN_DATA = [
 (
@@ -91,6 +92,7 @@ def test_parser_merge_pp(en_tokenizer):
 assert doc[2].text == "another phrase"
 assert doc[3].text == "occurs"

+
 # We removed the step_through API a while ago. we should bring it back though
 @pytest.mark.xfail(reason="Unsupported")
 def test_parser_arc_eager_finalize_state(en_tokenizer, en_parser):
@@ -188,7 +190,9 @@ def test_overfitting_IO():
 # Simple test to try and quickly overfit the dependency parser - ensuring the ML models work correctly
 nlp = English()
 parser = nlp.create_pipe("parser")
-for _, annotations in TRAIN_DATA:
+train_examples = []
+for text, annotations in TRAIN_DATA:
+train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
 for dep in annotations.get("deps", []):
 parser.add_label(dep)
 nlp.add_pipe(parser)
@@ -196,7 +200,7 @@ def test_overfitting_IO():

 for i in range(50):
 losses = {}
-nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
+nlp.update(train_examples, sgd=optimizer, losses=losses)
 assert losses["parser"] < 0.00001

 # test the trained model

@@ -3,6 +3,7 @@ import pytest
 from spacy.kb import KnowledgeBase

 from spacy import util
+from spacy.gold import Example
 from spacy.lang.en import English
 from spacy.pipeline import EntityRuler
 from spacy.tests.util import make_tempdir
@@ -283,11 +284,10 @@ def test_overfitting_IO():
 nlp.add_pipe(ruler)

 # Convert the texts to docs to make sure we have doc.ents set for the training examples
-TRAIN_DOCS = []
+train_examples = []
 for text, annotation in TRAIN_DATA:
 doc = nlp(text)
-annotation_clean = annotation
-TRAIN_DOCS.append((doc, annotation_clean))
+train_examples.append(Example.from_dict(doc, annotation))

 # create artificial KB - assign same prior weight to the two russ cochran's
 # Q2146908 (Russ Cochran): American golfer
@@ -309,7 +309,7 @@ def test_overfitting_IO():
 optimizer = nlp.begin_training()
 for i in range(50):
 losses = {}
-nlp.update(TRAIN_DOCS, sgd=optimizer, losses=losses)
+nlp.update(train_examples, sgd=optimizer, losses=losses)
 assert losses["entity_linker"] < 0.001

 # test the trained model

@@ -1,6 +1,7 @@
 import pytest

 from spacy import util
+from spacy.gold import Example
 from spacy.lang.en import English
 from spacy.language import Language
 from spacy.tests.util import make_tempdir
@@ -33,7 +34,9 @@ def test_overfitting_IO():
 # Simple test to try and quickly overfit the morphologizer - ensuring the ML models work correctly
 nlp = English()
 morphologizer = nlp.create_pipe("morphologizer")
+train_examples = []
 for inst in TRAIN_DATA:
+train_examples.append(Example.from_dict(nlp.make_doc(inst[0]), inst[1]))
 for morph, pos in zip(inst[1]["morphs"], inst[1]["pos"]):
 morphologizer.add_label(morph + "|POS=" + pos)
 nlp.add_pipe(morphologizer)
@@ -41,7 +44,7 @@ def test_overfitting_IO():

 for i in range(50):
 losses = {}
-nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
+nlp.update(train_examples, sgd=optimizer, losses=losses)
 assert losses["morphologizer"] < 0.00001

 # test the trained model

@@ -1,6 +1,7 @@
 import pytest

 from spacy import util
+from spacy.gold import Example
 from spacy.lang.en import English
 from spacy.language import Language
 from spacy.tests.util import make_tempdir
@@ -34,12 +35,15 @@ def test_overfitting_IO():
 # Simple test to try and quickly overfit the senter - ensuring the ML models work correctly
 nlp = English()
 senter = nlp.create_pipe("senter")
+train_examples = []
+for t in TRAIN_DATA:
+train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
 nlp.add_pipe(senter)
 optimizer = nlp.begin_training()

 for i in range(200):
 losses = {}
-nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
+nlp.update(train_examples, sgd=optimizer, losses=losses)
 assert losses["senter"] < 0.001

 # test the trained model

@@ -1,6 +1,7 @@
 import pytest

 from spacy import util
+from spacy.gold import Example
 from spacy.lang.en import English
 from spacy.language import Language
 from spacy.tests.util import make_tempdir
@@ -28,12 +29,15 @@ def test_overfitting_IO():
 tagger = nlp.create_pipe("tagger")
 for tag, values in TAG_MAP.items():
 tagger.add_label(tag, values)
+train_examples = []
+for t in TRAIN_DATA:
+train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
 nlp.add_pipe(tagger)
 optimizer = nlp.begin_training()

 for i in range(50):
 losses = {}
-nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
+nlp.update(train_examples, sgd=optimizer, losses=losses)
 assert losses["tagger"] < 0.00001

 # test the trained model

@@ -1,18 +1,18 @@
 import pytest
 import random
 import numpy.random
+from thinc.api import fix_random_seed
 from spacy import util
 from spacy.lang.en import English
 from spacy.language import Language
 from spacy.pipeline import TextCategorizer
 from spacy.tokens import Doc
-from spacy.util import fix_random_seed
+from spacy.pipeline.defaults import default_tok2vec

 from ..util import make_tempdir
-from spacy.pipeline.defaults import default_tok2vec
 from ...gold import Example


 TRAIN_DATA = [
 ("I'm so happy.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
 ("I'm so angry", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
@@ -85,7 +85,9 @@ def test_overfitting_IO():
 fix_random_seed(0)
 nlp = English()
 textcat = nlp.create_pipe("textcat")
-for _, annotations in TRAIN_DATA:
+train_examples = []
+for text, annotations in TRAIN_DATA:
+train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
 for label, value in annotations.get("cats").items():
 textcat.add_label(label)
 nlp.add_pipe(textcat)
@@ -93,7 +95,7 @@ def test_overfitting_IO():

 for i in range(50):
 losses = {}
-nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
+nlp.update(train_examples, sgd=optimizer, losses=losses)
 assert losses["textcat"] < 0.01

 # test the trained model
@@ -134,11 +136,13 @@ def test_textcat_configs(textcat_config):
 pipe_config = {"model": textcat_config}
 nlp = English()
 textcat = nlp.create_pipe("textcat", pipe_config)
-for _, annotations in TRAIN_DATA:
+train_examples = []
+for text, annotations in TRAIN_DATA:
+train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
 for label, value in annotations.get("cats").items():
 textcat.add_label(label)
 nlp.add_pipe(textcat)
 optimizer = nlp.begin_training()
 for i in range(5):
 losses = {}
-nlp.update(TRAIN_DATA, sgd=optimizer, losses=losses)
+nlp.update(train_examples, sgd=optimizer, losses=losses)

@@ -1,5 +1,6 @@
 import pytest
 from spacy import displacy
+from spacy.gold import Example
 from spacy.lang.en import English
 from spacy.lang.ja import Japanese
 from spacy.lang.xx import MultiLanguage
@@ -141,10 +142,10 @@ def test_issue2800():
 """Test issue that arises when too many labels are added to NER model.
 Used to cause segfault.
 """
-train_data = []
-train_data.extend([("One sentence", {"entities": []})])
-entity_types = [str(i) for i in range(1000)]
 nlp = English()
+train_data = []
+train_data.extend([Example.from_dict(nlp.make_doc("One sentence"), {"entities": []})])
+entity_types = [str(i) for i in range(1000)]
 ner = nlp.create_pipe("ner")
 nlp.add_pipe(ner)
 for entity_type in list(entity_types):
@@ -153,8 +154,8 @@ def test_issue2800():
 for i in range(20):
 losses = {}
 random.shuffle(train_data)
-for statement, entities in train_data:
-nlp.update((statement, entities), sgd=optimizer, losses=losses, drop=0.5)
+for example in train_data:
+nlp.update([example], sgd=optimizer, losses=losses, drop=0.5)


 def test_issue2822(it_tokenizer):

@@ -9,7 +9,6 @@ from spacy.vocab import Vocab
 from spacy.attrs import ENT_IOB, ENT_TYPE
 from spacy.compat import pickle
 from spacy import displacy
-from spacy.util import decaying
 import numpy

 from spacy.vectors import Vectors
@@ -216,21 +215,6 @@ def test_issue3345():
 assert ner.moves.is_valid(state, "B-GPE")


-def test_issue3410():
-texts = ["Hello world", "This is a test"]
-nlp = English()
-matcher = Matcher(nlp.vocab)
-phrasematcher = PhraseMatcher(nlp.vocab)
-with pytest.deprecated_call():
-docs = list(nlp.pipe(texts, n_threads=4))
-with pytest.deprecated_call():
-docs = list(nlp.tokenizer.pipe(texts, n_threads=4))
-with pytest.deprecated_call():
-list(matcher.pipe(docs, n_threads=4))
-with pytest.deprecated_call():
-list(phrasematcher.pipe(docs, n_threads=4))
-
-
 def test_issue3412():
 data = numpy.asarray([[0, 0, 0], [1, 2, 3], [9, 8, 7]], dtype="f")
 vectors = Vectors(data=data, keys=["A", "B", "C"])
@@ -240,16 +224,6 @@ def test_issue3412():
 assert best_rows[0] == 2


-def test_issue3447():
-sizes = decaying(10.0, 1.0, 0.5)
-size = next(sizes)
-assert size == 10.0
-size = next(sizes)
-assert size == 10.0 - 0.5
-size = next(sizes)
-assert size == 10.0 - 0.5 - 0.5
-
-
 @pytest.mark.xfail(reason="default suffix rules avoid one upper-case letter before dot")
 def test_issue3449():
 nlp = English()

@@ -1,5 +1,7 @@
 import spacy
-from spacy.util import minibatch, compounding
+from spacy.util import minibatch
+from thinc.api import compounding
+from spacy.gold import Example


 def test_issue3611():
@@ -12,15 +14,15 @@ def test_issue3611():
 ]
 y_train = ["offensive", "offensive", "inoffensive"]

-# preparing the data
-pos_cats = list()
-for train_instance in y_train:
-pos_cats.append({label: label == train_instance for label in unique_classes})
-train_data = list(zip(x_train, [{"cats": cats} for cats in pos_cats]))
-
-# set up the spacy model with a text categorizer component
 nlp = spacy.blank("en")

+# preparing the data
+train_data = []
+for text, train_instance in zip(x_train, y_train):
+cat_dict = {label: label == train_instance for label in unique_classes}
+train_data.append(Example.from_dict(nlp.make_doc(text), {"cats": cat_dict}))
+
+# add a text categorizer component
 textcat = nlp.create_pipe(
 "textcat",
 config={"exclusive_classes": True, "architecture": "bow", "ngram_size": 2},

@@ -1,5 +1,7 @@
 import spacy
-from spacy.util import minibatch, compounding
+from spacy.util import minibatch
+from thinc.api import compounding
+from spacy.gold import Example


 def test_issue4030():
@@ -12,15 +14,15 @@ def test_issue4030():
 ]
 y_train = ["offensive", "offensive", "inoffensive"]

-# preparing the data
-pos_cats = list()
-for train_instance in y_train:
-pos_cats.append({label: label == train_instance for label in unique_classes})
-train_data = list(zip(x_train, [{"cats": cats} for cats in pos_cats]))
-
-# set up the spacy model with a text categorizer component
 nlp = spacy.blank("en")

+# preparing the data
+train_data = []
+for text, train_instance in zip(x_train, y_train):
+cat_dict = {label: label == train_instance for label in unique_classes}
+train_data.append(Example.from_dict(nlp.make_doc(text), {"cats": cat_dict}))
+
+# add a text categorizer component
 textcat = nlp.create_pipe(
 "textcat",
 config={"exclusive_classes": True, "architecture": "bow", "ngram_size": 2},

@@ -1,5 +1,7 @@
+from spacy.gold import Example
 from spacy.lang.en import English
-from spacy.util import minibatch, compounding
+from spacy.util import minibatch
+from thinc.api import compounding
 import pytest


@@ -7,9 +9,10 @@ import pytest
 def test_issue4348():
 """Test that training the tagger with empty data, doesn't throw errors"""

-TRAIN_DATA = [("", {"tags": []}), ("", {"tags": []})]
-
 nlp = English()
+example = Example.from_dict(nlp.make_doc(""), {"tags": []})
+TRAIN_DATA = [example, example]
+
 tagger = nlp.create_pipe("tagger")
 nlp.add_pipe(tagger)

@@ -8,10 +8,11 @@ from ...tokens import DocBin

 def test_issue4402():
 nlp = English()
+attrs = ["ORTH", "SENT_START", "ENT_IOB", "ENT_TYPE"]
 with make_tempdir() as tmpdir:
 output_file = tmpdir / "test4402.spacy"
 docs = json2docs([json_data])
-data = DocBin(docs=docs, attrs =["ORTH", "SENT_START", "ENT_IOB", "ENT_TYPE"]).to_bytes()
+data = DocBin(docs=docs, attrs=attrs).to_bytes()
 with output_file.open("wb") as file_:
 file_.write(data)
 corpus = Corpus(train_loc=str(output_file), dev_loc=str(output_file))
@@ -25,74 +26,73 @@ def test_issue4402():
 assert len(split_train_data) == 4


-json_data =\
-{
+json_data = {
 "id": 0,
 "paragraphs": [
 {
 "raw": "How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven.",
 "sentences": [
 {
 "tokens": [
 {"id": 0, "orth": "How", "ner": "O"},
 {"id": 1, "orth": "should", "ner": "O"},
 {"id": 2, "orth": "I", "ner": "O"},
 {"id": 3, "orth": "cook", "ner": "O"},
 {"id": 4, "orth": "bacon", "ner": "O"},
 {"id": 5, "orth": "in", "ner": "O"},
 {"id": 6, "orth": "an", "ner": "O"},
 {"id": 7, "orth": "oven", "ner": "O"},
 {"id": 8, "orth": "?", "ner": "O"},
 ],
 "brackets": [],
 },
 {
 "tokens": [
 {"id": 9, "orth": "\n", "ner": "O"},
 {"id": 10, "orth": "I", "ner": "O"},
 {"id": 11, "orth": "'ve", "ner": "O"},
 {"id": 12, "orth": "heard", "ner": "O"},
 {"id": 13, "orth": "of", "ner": "O"},
 {"id": 14, "orth": "people", "ner": "O"},
 {"id": 15, "orth": "cooking", "ner": "O"},
 {"id": 16, "orth": "bacon", "ner": "O"},
 {"id": 17, "orth": "in", "ner": "O"},
 {"id": 18, "orth": "an", "ner": "O"},
 {"id": 19, "orth": "oven", "ner": "O"},
 {"id": 20, "orth": ".", "ner": "O"},
 ],
 "brackets": [],
 },
 ],
 "cats": [
 {"label": "baking", "value": 1.0},
 {"label": "not_baking", "value": 0.0},
 ],
 },
 {
 "raw": "What is the difference between white and brown eggs?\n",
 "sentences": [
 {
 "tokens": [
 {"id": 0, "orth": "What", "ner": "O"},
 {"id": 1, "orth": "is", "ner": "O"},
 {"id": 2, "orth": "the", "ner": "O"},
 {"id": 3, "orth": "difference", "ner": "O"},
 {"id": 4, "orth": "between", "ner": "O"},
 {"id": 5, "orth": "white", "ner": "O"},
 {"id": 6, "orth": "and", "ner": "O"},
 {"id": 7, "orth": "brown", "ner": "O"},
 {"id": 8, "orth": "eggs", "ner": "O"},
 {"id": 9, "orth": "?", "ner": "O"},
 ],
 "brackets": [],
 },
 {"tokens": [{"id": 10, "orth": "\n", "ner": "O"}], "brackets": []},
 ],
 "cats": [
 {"label": "baking", "value": 0.0},
 {"label": "not_baking", "value": 1.0},
 ],
 },
 ],
 }

@@ -1,7 +1,8 @@
+from spacy.gold import Example
 from spacy.language import Language


 def test_issue4924():
 nlp = Language()
-docs_golds = [("", {})]
-nlp.evaluate(docs_golds)
+example = Example.from_dict(nlp.make_doc(""), {})
+nlp.evaluate([example])

@@ -52,10 +52,6 @@ def test_serialize_doc_exclude(en_vocab):
 assert not new_doc.user_data
 new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(exclude=["user_data"]))
 assert not new_doc.user_data
-with pytest.raises(ValueError):
-doc.to_bytes(user_data=False)
-with pytest.raises(ValueError):
-Doc(en_vocab).from_bytes(doc.to_bytes(), tensor=False)


 def test_serialize_doc_bin():
@@ -75,3 +71,19 @@ def test_serialize_doc_bin():
 for i, doc in enumerate(reloaded_docs):
 assert doc.text == texts[i]
 assert doc.cats == cats
+
+
+def test_serialize_doc_bin_unknown_spaces(en_vocab):
+doc1 = Doc(en_vocab, words=["that", "'s"])
+assert doc1.has_unknown_spaces
+assert doc1.text == "that 's "
+doc2 = Doc(en_vocab, words=["that", "'s"], spaces=[False, False])
+assert not doc2.has_unknown_spaces
+assert doc2.text == "that's"
+
+doc_bin = DocBin().from_bytes(DocBin(docs=[doc1, doc2]).to_bytes())
+re_doc1, re_doc2 = doc_bin.get_docs(en_vocab)
+assert re_doc1.has_unknown_spaces
+assert re_doc1.text == "that 's "
+assert not re_doc2.has_unknown_spaces
+assert re_doc2.text == "that's"

@@ -62,7 +62,3 @@ def test_serialize_language_exclude(meta_data):
 assert not new_nlp.meta["name"] == name
 new_nlp = Language().from_bytes(nlp.to_bytes(exclude=["meta"]))
 assert not new_nlp.meta["name"] == name
-with pytest.raises(ValueError):
-nlp.to_bytes(meta=False)
-with pytest.raises(ValueError):
-Language().from_bytes(nlp.to_bytes(), meta=False)

@@ -127,10 +127,6 @@ def test_serialize_pipe_exclude(en_vocab, Parser):
 parser.to_bytes(exclude=["cfg"]), exclude=["vocab"]
 )
 assert "foo" not in new_parser.cfg
-with pytest.raises(ValueError):
-parser.to_bytes(cfg=False, exclude=["vocab"])
-with pytest.raises(ValueError):
-get_new_parser().from_bytes(parser.to_bytes(exclude=["vocab"]), cfg=False)


 def test_serialize_sentencerecognizer(en_vocab):

@@ -1,14 +1,10 @@
 import pytest

-from spacy.gold import docs_to_json
-from spacy.gold.converters import iob2docs, conll_ner2docs
-from spacy.gold.converters.conllu2json import conllu2json
+from spacy.gold import docs_to_json, biluo_tags_from_offsets
+from spacy.gold.converters import iob2docs, conll_ner2docs, conllu2docs
 from spacy.lang.en import English
 from spacy.cli.pretrain import make_docs

-# TODO
-# from spacy.gold.converters import conllu2docs
-

 def test_cli_converters_conllu2json():
 # from NorNE: https://github.com/ltgoslo/norne/blob/3d23274965f513f23aa48455b28b1878dad23c05/ud/nob/no_bokmaal-ud-dev.conllu
@@ -19,8 +15,9 @@ def test_cli_converters_conllu2json():
 "4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tO",
 ]
 input_data = "\n".join(lines)
-converted = conllu2json(input_data, n_sents=1)
-assert len(converted) == 1
+converted_docs = conllu2docs(input_data, n_sents=1)
+assert len(converted_docs) == 1
+converted = [docs_to_json(converted_docs)]
 assert converted[0]["id"] == 0
 assert len(converted[0]["paragraphs"]) == 1
 assert len(converted[0]["paragraphs"][0]["sentences"]) == 1
@@ -31,7 +28,11 @@ def test_cli_converters_conllu2json():
 assert [t["tag"] for t in tokens] == ["NOUN", "PROPN", "PROPN", "VERB"]
 assert [t["head"] for t in tokens] == [1, 2, -1, 0]
 assert [t["dep"] for t in tokens] == ["appos", "nsubj", "name", "ROOT"]
-assert [t["ner"] for t in tokens] == ["O", "B-PER", "L-PER", "O"]
+ent_offsets = [
+(e[0], e[1], e[2]) for e in converted[0]["paragraphs"][0]["entities"]
+]
+biluo_tags = biluo_tags_from_offsets(converted_docs[0], ent_offsets, missing="O")
+assert biluo_tags == ["O", "B-PER", "L-PER", "O"]


 @pytest.mark.parametrize(
@@ -55,11 +56,14 @@ def test_cli_converters_conllu2json():
 )
 def test_cli_converters_conllu2json_name_ner_map(lines):
 input_data = "\n".join(lines)
-converted = conllu2json(input_data, n_sents=1, ner_map={"PER": "PERSON", "BAD": ""})
-assert len(converted) == 1
+converted_docs = conllu2docs(
+input_data, n_sents=1, ner_map={"PER": "PERSON", "BAD": ""}
+)
+assert len(converted_docs) == 1
+converted = [docs_to_json(converted_docs)]
 assert converted[0]["id"] == 0
 assert len(converted[0]["paragraphs"]) == 1
-assert converted[0]["paragraphs"][0]["raw"] == "Dommer FinnEilertsen avstår."
+assert converted[0]["paragraphs"][0]["raw"] == "Dommer FinnEilertsen avstår. "
 assert len(converted[0]["paragraphs"][0]["sentences"]) == 1
 sent = converted[0]["paragraphs"][0]["sentences"][0]
 assert len(sent["tokens"]) == 5
@@ -68,7 +72,11 @@ def test_cli_converters_conllu2json_name_ner_map(lines):
 assert [t["tag"] for t in tokens] == ["NOUN", "PROPN", "PROPN", "VERB", "PUNCT"]
 assert [t["head"] for t in tokens] == [1, 2, -1, 0, -1]
 assert [t["dep"] for t in tokens] == ["appos", "nsubj", "name", "ROOT", "punct"]
-assert [t["ner"] for t in tokens] == ["O", "B-PERSON", "L-PERSON", "O", "O"]
+ent_offsets = [
+(e[0], e[1], e[2]) for e in converted[0]["paragraphs"][0]["entities"]
+]
+biluo_tags = biluo_tags_from_offsets(converted_docs[0], ent_offsets, missing="O")
+assert biluo_tags == ["O", "B-PERSON", "L-PERSON", "O", "O"]


 def test_cli_converters_conllu2json_subtokens():
@@ -82,13 +90,15 @@ def test_cli_converters_conllu2json_subtokens():
 "5\t.\t$.\tPUNCT\t_\t_\t4\tpunct\t_\tname=O",
 ]
 input_data = "\n".join(lines)
-converted = conllu2json(
+converted_docs = conllu2docs(
 input_data, n_sents=1, merge_subtokens=True, append_morphology=True
 )
-assert len(converted) == 1
+assert len(converted_docs) == 1
+converted = [docs_to_json(converted_docs)]
+
 assert converted[0]["id"] == 0
 assert len(converted[0]["paragraphs"]) == 1
-assert converted[0]["paragraphs"][0]["raw"] == "Dommer FE avstår."
+assert converted[0]["paragraphs"][0]["raw"] == "Dommer FE avstår. "
 assert len(converted[0]["paragraphs"][0]["sentences"]) == 1
 sent = converted[0]["paragraphs"][0]["sentences"][0]
 assert len(sent["tokens"]) == 4
@@ -111,7 +121,11 @@ def test_cli_converters_conllu2json_subtokens():
 assert [t["lemma"] for t in tokens] == ["dommer", "Finn Eilertsen", "avstå", "$."]
 assert [t["head"] for t in tokens] == [1, 1, 0, -1]
 assert [t["dep"] for t in tokens] == ["appos", "nsubj", "ROOT", "punct"]
-assert [t["ner"] for t in tokens] == ["O", "U-PER", "O", "O"]
+ent_offsets = [
+(e[0], e[1], e[2]) for e in converted[0]["paragraphs"][0]["entities"]
+]
+biluo_tags = biluo_tags_from_offsets(converted_docs[0], ent_offsets, missing="O")
+assert biluo_tags == ["O", "U-PER", "O", "O"]


 def test_cli_converters_iob2json(en_vocab):
@@ -132,11 +146,11 @@ def test_cli_converters_iob2json(en_vocab):
 sent = converted["paragraphs"][0]["sentences"][i]
 assert len(sent["tokens"]) == 8
|
||||||
tokens = sent["tokens"]
|
tokens = sent["tokens"]
|
||||||
# fmt: off
|
expected = ["I", "like", "London", "and", "New", "York", "City", "."]
|
||||||
assert [t["orth"] for t in tokens] == ["I", "like", "London", "and", "New", "York", "City", "."]
|
assert [t["orth"] for t in tokens] == expected
|
||||||
assert len(converted_docs[0].ents) == 8
|
assert len(converted_docs[0].ents) == 8
|
||||||
for ent in converted_docs[0].ents:
|
for ent in converted_docs[0].ents:
|
||||||
assert(ent.text in ["New York City", "London"])
|
assert ent.text in ["New York City", "London"]
|
||||||
|
|
||||||
|
|
||||||
def test_cli_converters_conll_ner2json():
|
def test_cli_converters_conll_ner2json():
|
||||||
|
@ -204,7 +218,7 @@ def test_cli_converters_conll_ner2json():
|
||||||
# fmt: on
|
# fmt: on
|
||||||
assert len(converted_docs[0].ents) == 10
|
assert len(converted_docs[0].ents) == 10
|
||||||
for ent in converted_docs[0].ents:
|
for ent in converted_docs[0].ents:
|
||||||
assert (ent.text in ["New York City", "London"])
|
assert ent.text in ["New York City", "London"]
|
||||||
|
|
||||||
|
|
||||||
def test_pretrain_make_docs():
|
def test_pretrain_make_docs():
|
||||||
|
|
|
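The hunks above replace the old all-in-one conllu2json helper with a two-step flow: conllu2docs produces Doc objects, docs_to_json builds the legacy JSON only where assertions still need it, and entities are checked by recovering BILUO tags from character offsets. A minimal sketch of that flow, assuming the same spacy.gold import paths used in these tests; the one-line CoNLL-U sample is illustrative only:

from spacy.gold import docs_to_json, biluo_tags_from_offsets
from spacy.gold.converters import conllu2docs

# A toy CoNLL-U row; real input would come from a .conllu file like the NorNE sample above
lines = ["1\tDommer\tdommer\tNOUN\t_\t_\t0\troot\t_\tO"]
docs = conllu2docs("\n".join(lines), n_sents=1)   # Doc objects, not JSON
converted = [docs_to_json(docs)]                  # legacy JSON, only where still needed
ent_offsets = [
    (e[0], e[1], e[2]) for e in converted[0]["paragraphs"][0]["entities"]
]
# Re-derive BILUO tags against the converted Doc
print(biluo_tags_from_offsets(docs[0], ent_offsets, missing="O"))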
@@ -1,13 +1,13 @@
 from spacy.errors import AlignmentError
 from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags
-from spacy.gold import spans_from_biluo_tags, iob_to_biluo, align
+from spacy.gold import spans_from_biluo_tags, iob_to_biluo
 from spacy.gold import Corpus, docs_to_json
 from spacy.gold.example import Example
 from spacy.gold.converters import json2docs
 from spacy.lang.en import English
-from spacy.syntax.nonproj import is_nonproj_tree
 from spacy.tokens import Doc, DocBin
-from spacy.util import get_words_and_spaces, compounding, minibatch
+from spacy.util import get_words_and_spaces, minibatch
+from thinc.api import compounding
 import pytest
 import srsly

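With compounding no longer exported from spacy.util, batch-size schedules come straight from Thinc while minibatch stays in spaCy. A short sketch of the pattern the updated tests rely on; the values are illustrative:

from spacy.util import minibatch
from thinc.api import compounding

items = list(range(10))
# Batch size grows from 4 towards 32, multiplied by 1.001 on every draw
sizes = compounding(4.0, 32.0, 1.001)
for batch in minibatch(items, size=sizes):
    print(len(batch))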
@@ -161,65 +161,54 @@ def test_example_from_dict_no_ner(en_vocab):
     ner_tags = example.get_aligned_ner()
     assert ner_tags == [None, None, None, None]


 def test_example_from_dict_some_ner(en_vocab):
     words = ["a", "b", "c", "d"]
     spaces = [True, True, False, True]
     predicted = Doc(en_vocab, words=words, spaces=spaces)
     example = Example.from_dict(
-        predicted,
-        {
-            "words": words,
-            "entities": ["U-LOC", None, None, None]
-        }
+        predicted, {"words": words, "entities": ["U-LOC", None, None, None]}
     )
     ner_tags = example.get_aligned_ner()
     assert ner_tags == ["U-LOC", None, None, None]

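The test above is representative of the v3 Example API: reference annotations are passed as a plain dict against a predicted Doc, and alignment-aware views such as get_aligned_ner() replace the old GoldParse attributes. A minimal sketch, assuming a blank English vocab:

from spacy.lang.en import English
from spacy.gold.example import Example
from spacy.tokens import Doc

nlp = English()
words = ["a", "b", "c", "d"]
predicted = Doc(nlp.vocab, words=words, spaces=[True, True, False, True])
example = Example.from_dict(
    predicted, {"words": words, "entities": ["U-LOC", None, None, None]}
)
# NER labels are re-aligned onto the predicted tokenization
print(example.get_aligned_ner())  # expected: ["U-LOC", None, None, None]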
def test_json2docs_no_ner(en_vocab):
|
def test_json2docs_no_ner(en_vocab):
|
||||||
data = [{
|
data = [
|
||||||
"id":1,
|
{
|
||||||
"paragraphs":[
|
"id": 1,
|
||||||
{
|
"paragraphs": [
|
||||||
"sentences":[
|
{
|
||||||
{
|
"sentences": [
|
||||||
"tokens":[
|
{
|
||||||
{
|
"tokens": [
|
||||||
"dep":"nn",
|
{"dep": "nn", "head": 1, "tag": "NNP", "orth": "Ms."},
|
||||||
"head":1,
|
{
|
||||||
"tag":"NNP",
|
"dep": "nsubj",
|
||||||
"orth":"Ms."
|
"head": 1,
|
||||||
},
|
"tag": "NNP",
|
||||||
{
|
"orth": "Haag",
|
||||||
"dep":"nsubj",
|
},
|
||||||
"head":1,
|
{
|
||||||
"tag":"NNP",
|
"dep": "ROOT",
|
||||||
"orth":"Haag"
|
"head": 0,
|
||||||
},
|
"tag": "VBZ",
|
||||||
{
|
"orth": "plays",
|
||||||
"dep":"ROOT",
|
},
|
||||||
"head":0,
|
{
|
||||||
"tag":"VBZ",
|
"dep": "dobj",
|
||||||
"orth":"plays"
|
"head": -1,
|
||||||
},
|
"tag": "NNP",
|
||||||
{
|
"orth": "Elianti",
|
||||||
"dep":"dobj",
|
},
|
||||||
"head":-1,
|
{"dep": "punct", "head": -2, "tag": ".", "orth": "."},
|
||||||
"tag":"NNP",
|
]
|
||||||
"orth":"Elianti"
|
}
|
||||||
},
|
|
||||||
{
|
|
||||||
"dep":"punct",
|
|
||||||
"head":-2,
|
|
||||||
"tag":".",
|
|
||||||
"orth":"."
|
|
||||||
}
|
|
||||||
]
|
]
|
||||||
}
|
}
|
||||||
]
|
],
|
||||||
}
|
}
|
||||||
]
|
]
|
||||||
}]
|
|
||||||
docs = json2docs(data)
|
docs = json2docs(data)
|
||||||
assert len(docs) == 1
|
assert len(docs) == 1
|
||||||
for doc in docs:
|
for doc in docs:
|
||||||
|
@ -282,75 +271,76 @@ def test_split_sentences(en_vocab):
|
||||||
assert split_examples[1].text == "had loads of fun "
|
assert split_examples[1].text == "had loads of fun "
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail(reason="Alignment should be fixed after example refactor")
|
|
||||||
def test_gold_biluo_one_to_many(en_vocab, en_tokenizer):
|
def test_gold_biluo_one_to_many(en_vocab, en_tokenizer):
|
||||||
words = ["I", "flew to", "San Francisco Valley", "."]
|
words = ["Mr. and ", "Mrs. Smith", "flew to", "San Francisco Valley", "."]
|
||||||
spaces = [True, True, False, False]
|
spaces = [True, True, True, False, False]
|
||||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||||
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")]
|
prefix = "Mr. and Mrs. Smith flew to "
|
||||||
gold_words = ["I", "flew", "to", "San", "Francisco", "Valley", "."]
|
entities = [(len(prefix), len(prefix + "San Francisco Valley"), "LOC")]
|
||||||
|
gold_words = ["Mr. and Mrs. Smith", "flew", "to", "San", "Francisco", "Valley", "."]
|
||||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||||
ner_tags = example.get_aligned_ner()
|
ner_tags = example.get_aligned_ner()
|
||||||
assert ner_tags == ["O", "O", "U-LOC", "O"]
|
assert ner_tags == ["O", "O", "O", "U-LOC", "O"]
|
||||||
|
|
||||||
entities = [
|
entities = [
|
||||||
(len("I "), len("I flew to"), "ORG"),
|
(len("Mr. and "), len("Mr. and Mrs. Smith"), "PERSON"), # "Mrs. Smith" is a PERSON
|
||||||
(len("I flew to "), len("I flew to San Francisco Valley"), "LOC"),
|
(len(prefix), len(prefix + "San Francisco Valley"), "LOC"),
|
||||||
]
|
]
|
||||||
gold_words = ["I", "flew", "to", "San", "Francisco", "Valley", "."]
|
gold_words = ["Mr. and", "Mrs.", "Smith", "flew", "to", "San", "Francisco", "Valley", "."]
|
||||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||||
ner_tags = example.get_aligned_ner()
|
ner_tags = example.get_aligned_ner()
|
||||||
assert ner_tags == ["O", "U-ORG", "U-LOC", "O"]
|
assert ner_tags == ["O", "U-PERSON", "O", "U-LOC", "O"]
|
||||||
|
|
||||||
entities = [
|
entities = [
|
||||||
(len("I "), len("I flew"), "ORG"),
|
(len("Mr. and "), len("Mr. and Mrs."), "PERSON"), # "Mrs." is a Person
|
||||||
(len("I flew to "), len("I flew to San Francisco Valley"), "LOC"),
|
(len(prefix), len(prefix + "San Francisco Valley"), "LOC"),
|
||||||
]
|
]
|
||||||
gold_words = ["I", "flew", "to", "San", "Francisco", "Valley", "."]
|
gold_words = ["Mr. and", "Mrs.", "Smith", "flew", "to", "San", "Francisco", "Valley", "."]
|
||||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||||
ner_tags = example.get_aligned_ner()
|
ner_tags = example.get_aligned_ner()
|
||||||
assert ner_tags == ["O", None, "U-LOC", "O"]
|
assert ner_tags == ["O", None, "O", "U-LOC", "O"]
|
||||||
|
|
||||||
|
|
||||||
def test_gold_biluo_many_to_one(en_vocab, en_tokenizer):
|
def test_gold_biluo_many_to_one(en_vocab, en_tokenizer):
|
||||||
words = ["I", "flew", "to", "San", "Francisco", "Valley", "."]
|
words = ["Mr. and", "Mrs.", "Smith", "flew", "to", "San", "Francisco", "Valley", "."]
|
||||||
|
spaces = [True, True, True, True, True, True, True, False, False]
|
||||||
|
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||||
|
prefix = "Mr. and Mrs. Smith flew to "
|
||||||
|
entities = [(len(prefix), len(prefix + "San Francisco Valley"), "LOC")]
|
||||||
|
gold_words = ["Mr. and Mrs. Smith", "flew to", "San Francisco Valley", "."]
|
||||||
|
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||||
|
ner_tags = example.get_aligned_ner()
|
||||||
|
assert ner_tags == ["O", "O", "O", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"]
|
||||||
|
|
||||||
|
entities = [
|
||||||
|
(len("Mr. and "), len("Mr. and Mrs. Smith"), "PERSON"), # "Mrs. Smith" is a PERSON
|
||||||
|
(len(prefix), len(prefix + "San Francisco Valley"), "LOC"),
|
||||||
|
]
|
||||||
|
gold_words = ["Mr. and", "Mrs. Smith", "flew to", "San Francisco Valley", "."]
|
||||||
|
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||||
|
ner_tags = example.get_aligned_ner()
|
||||||
|
assert ner_tags == ["O", "B-PERSON", "L-PERSON", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_gold_biluo_misaligned(en_vocab, en_tokenizer):
|
||||||
|
words = ["Mr. and Mrs.", "Smith", "flew", "to", "San Francisco", "Valley", "."]
|
||||||
spaces = [True, True, True, True, True, False, False]
|
spaces = [True, True, True, True, True, False, False]
|
||||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||||
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")]
|
prefix = "Mr. and Mrs. Smith flew to "
|
||||||
gold_words = ["I", "flew to", "San Francisco Valley", "."]
|
entities = [(len(prefix), len(prefix + "San Francisco Valley"), "LOC")]
|
||||||
|
gold_words = ["Mr.", "and Mrs. Smith", "flew to", "San", "Francisco Valley", "."]
|
||||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||||
ner_tags = example.get_aligned_ner()
|
ner_tags = example.get_aligned_ner()
|
||||||
assert ner_tags == ["O", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"]
|
assert ner_tags == ["O", "O", "O", "O", "B-LOC", "L-LOC", "O"]
|
||||||
|
|
||||||
entities = [
|
entities = [
|
||||||
(len("I "), len("I flew to"), "ORG"),
|
(len("Mr. and "), len("Mr. and Mrs. Smith"), "PERSON"), # "Mrs. Smith" is a PERSON
|
||||||
(len("I flew to "), len("I flew to San Francisco Valley"), "LOC"),
|
(len(prefix), len(prefix + "San Francisco Valley"), "LOC"),
|
||||||
]
|
]
|
||||||
gold_words = ["I", "flew to", "San Francisco Valley", "."]
|
gold_words = ["Mr. and", "Mrs. Smith", "flew to", "San", "Francisco Valley", "."]
|
||||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
||||||
ner_tags = example.get_aligned_ner()
|
ner_tags = example.get_aligned_ner()
|
||||||
assert ner_tags == ["O", "B-ORG", "L-ORG", "B-LOC", "I-LOC", "L-LOC", "O"]
|
assert ner_tags == [None, None, "O", "O", "B-LOC", "L-LOC", "O"]
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail(reason="Alignment should be fixed after example refactor")
|
|
||||||
def test_gold_biluo_misaligned(en_vocab, en_tokenizer):
|
|
||||||
words = ["I flew", "to", "San Francisco", "Valley", "."]
|
|
||||||
spaces = [True, True, True, False, False]
|
|
||||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
|
||||||
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")]
|
|
||||||
gold_words = ["I", "flew to", "San", "Francisco Valley", "."]
|
|
||||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
|
||||||
ner_tags = example.get_aligned_ner()
|
|
||||||
assert ner_tags == ["O", "O", "B-LOC", "L-LOC", "O"]
|
|
||||||
|
|
||||||
entities = [
|
|
||||||
(len("I "), len("I flew to"), "ORG"),
|
|
||||||
(len("I flew to "), len("I flew to San Francisco Valley"), "LOC"),
|
|
||||||
]
|
|
||||||
gold_words = ["I", "flew to", "San", "Francisco Valley", "."]
|
|
||||||
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
|
|
||||||
ner_tags = example.get_aligned_ner()
|
|
||||||
assert ner_tags == [None, None, "B-LOC", "L-LOC", "O"]
|
|
||||||
|
|
||||||
|
|
||||||
def test_gold_biluo_additional_whitespace(en_vocab, en_tokenizer):
|
def test_gold_biluo_additional_whitespace(en_vocab, en_tokenizer):
|
||||||
|
@ -360,7 +350,8 @@ def test_gold_biluo_additional_whitespace(en_vocab, en_tokenizer):
|
||||||
"I flew to San Francisco Valley.",
|
"I flew to San Francisco Valley.",
|
||||||
)
|
)
|
||||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||||
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")]
|
prefix = "I flew to "
|
||||||
|
entities = [(len(prefix), len(prefix + "San Francisco Valley"), "LOC")]
|
||||||
gold_words = ["I", "flew", " ", "to", "San Francisco Valley", "."]
|
gold_words = ["I", "flew", " ", "to", "San Francisco Valley", "."]
|
||||||
gold_spaces = [True, True, False, True, False, False]
|
gold_spaces = [True, True, False, True, False, False]
|
||||||
example = Example.from_dict(
|
example = Example.from_dict(
|
||||||
|
@ -522,11 +513,10 @@ def test_make_orth_variants(doc):
|
||||||
|
|
||||||
# due to randomness, test only that this runs with no errors for now
|
# due to randomness, test only that this runs with no errors for now
|
||||||
train_example = next(goldcorpus.train_dataset(nlp))
|
train_example = next(goldcorpus.train_dataset(nlp))
|
||||||
variant_example = make_orth_variants_example(
|
make_orth_variants_example(nlp, train_example, orth_variant_level=0.2)
|
||||||
nlp, train_example, orth_variant_level=0.2
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip("Outdated")
|
||||||
@pytest.mark.parametrize(
|
@pytest.mark.parametrize(
|
||||||
"tokens_a,tokens_b,expected",
|
"tokens_a,tokens_b,expected",
|
||||||
[
|
[
|
||||||
|
@ -550,12 +540,12 @@ def test_make_orth_variants(doc):
|
||||||
([" ", "a"], ["a"], (1, [-1, 0], [1], {}, {})),
|
([" ", "a"], ["a"], (1, [-1, 0], [1], {}, {})),
|
||||||
],
|
],
|
||||||
)
|
)
|
||||||
def test_align(tokens_a, tokens_b, expected):
|
def test_align(tokens_a, tokens_b, expected): # noqa
|
||||||
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_a, tokens_b)
|
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_a, tokens_b) # noqa
|
||||||
assert (cost, list(a2b), list(b2a), a2b_multi, b2a_multi) == expected
|
assert (cost, list(a2b), list(b2a), a2b_multi, b2a_multi) == expected # noqa
|
||||||
# check symmetry
|
# check symmetry
|
||||||
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_b, tokens_a)
|
cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_b, tokens_a) # noqa
|
||||||
assert (cost, list(b2a), list(a2b), b2a_multi, a2b_multi) == expected
|
assert (cost, list(b2a), list(a2b), b2a_multi, a2b_multi) == expected # noqa
|
||||||
|
|
||||||
|
|
||||||
def test_goldparse_startswith_space(en_tokenizer):
|
def test_goldparse_startswith_space(en_tokenizer):
|
||||||
|
@ -569,7 +559,7 @@ def test_goldparse_startswith_space(en_tokenizer):
|
||||||
doc, {"words": gold_words, "entities": entities, "deps": deps, "heads": heads}
|
doc, {"words": gold_words, "entities": entities, "deps": deps, "heads": heads}
|
||||||
)
|
)
|
||||||
ner_tags = example.get_aligned_ner()
|
ner_tags = example.get_aligned_ner()
|
||||||
assert ner_tags == [None, "U-DATE"]
|
assert ner_tags == ["O", "U-DATE"]
|
||||||
assert example.get_aligned("DEP", as_string=True) == [None, "ROOT"]
|
assert example.get_aligned("DEP", as_string=True) == [None, "ROOT"]
|
||||||
|
|
||||||
|
|
||||||
|
@ -600,7 +590,7 @@ def test_tuple_format_implicit():
|
||||||
("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
|
("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
|
||||||
]
|
]
|
||||||
|
|
||||||
_train(train_data)
|
_train_tuples(train_data)
|
||||||
|
|
||||||
|
|
||||||
def test_tuple_format_implicit_invalid():
|
def test_tuple_format_implicit_invalid():
|
||||||
|
@ -616,20 +606,24 @@ def test_tuple_format_implicit_invalid():
|
||||||
]
|
]
|
||||||
|
|
||||||
with pytest.raises(KeyError):
|
with pytest.raises(KeyError):
|
||||||
_train(train_data)
|
_train_tuples(train_data)
|
||||||
|
|
||||||
|
|
||||||
def _train(train_data):
|
def _train_tuples(train_data):
|
||||||
nlp = English()
|
nlp = English()
|
||||||
ner = nlp.create_pipe("ner")
|
ner = nlp.create_pipe("ner")
|
||||||
ner.add_label("ORG")
|
ner.add_label("ORG")
|
||||||
ner.add_label("LOC")
|
ner.add_label("LOC")
|
||||||
nlp.add_pipe(ner)
|
nlp.add_pipe(ner)
|
||||||
|
|
||||||
|
train_examples = []
|
||||||
|
for t in train_data:
|
||||||
|
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||||
|
|
||||||
optimizer = nlp.begin_training()
|
optimizer = nlp.begin_training()
|
||||||
for i in range(5):
|
for i in range(5):
|
||||||
losses = {}
|
losses = {}
|
||||||
batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
|
batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
|
||||||
for batch in batches:
|
for batch in batches:
|
||||||
nlp.update(batch, sgd=optimizer, losses=losses)
|
nlp.update(batch, sgd=optimizer, losses=losses)
|
||||||
|
|
||||||
@@ -5,6 +5,7 @@ from spacy.tokens import Doc, Span
 from spacy.vocab import Vocab

 from .util import add_vecs_to_vocab, assert_docs_equal
+from ..gold import Example


 @pytest.fixture
@@ -23,26 +24,45 @@ def test_language_update(nlp):
     annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
     wrongkeyannots = {"LABEL": True}
     doc = Doc(nlp.vocab, words=text.split(" "))
-    # Update with text and dict
-    nlp.update((text, annots))
+    example = Example.from_dict(doc, annots)
+    nlp.update([example])
+
+    # Not allowed to call with just one Example
+    with pytest.raises(TypeError):
+        nlp.update(example)
+
+    # Update with text and dict: not supported anymore since v.3
+    with pytest.raises(TypeError):
+        nlp.update((text, annots))
     # Update with doc object and dict
-    nlp.update((doc, annots))
-    # Update badly
+    with pytest.raises(TypeError):
+        nlp.update((doc, annots))
+
+    # Create examples badly
     with pytest.raises(ValueError):
-        nlp.update((doc, None))
+        example = Example.from_dict(doc, None)
     with pytest.raises(KeyError):
-        nlp.update((text, wrongkeyannots))
+        example = Example.from_dict(doc, wrongkeyannots)


 def test_language_evaluate(nlp):
     text = "hello world"
     annots = {"doc_annotation": {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}}
     doc = Doc(nlp.vocab, words=text.split(" "))
-    # Evaluate with text and dict
-    nlp.evaluate([(text, annots)])
+    example = Example.from_dict(doc, annots)
+    nlp.evaluate([example])
+
+    # Not allowed to call with just one Example
+    with pytest.raises(TypeError):
+        nlp.evaluate(example)
+
+    # Evaluate with text and dict: not supported anymore since v.3
+    with pytest.raises(TypeError):
+        nlp.evaluate([(text, annots)])
     # Evaluate with doc object and dict
-    nlp.evaluate([(doc, annots)])
-    with pytest.raises(Exception):
+    with pytest.raises(TypeError):
+        nlp.evaluate([(doc, annots)])
+    with pytest.raises(TypeError):
         nlp.evaluate([text, annots])


@@ -56,8 +76,9 @@ def test_evaluate_no_pipe(nlp):
     text = "hello world"
     annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
     nlp = Language(Vocab())
+    doc = nlp(text)
     nlp.add_pipe(pipe)
-    nlp.evaluate([(text, annots)])
+    nlp.evaluate([Example.from_dict(doc, annots)])


 def vector_modification_pipe(doc):
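As the updated tests show, nlp.update and nlp.evaluate now take a batch of Example objects only; (text, annots) tuples, bare dicts and single Examples raise TypeError. A sketch of the v3-style call, assuming a blank English pipeline and a cats annotation like the one in the test:

from spacy.lang.en import English
from spacy.gold.example import Example
from spacy.tokens import Doc

nlp = English()
text = "hello world"
annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
doc = Doc(nlp.vocab, words=text.split(" "))
example = Example.from_dict(doc, annots)
nlp.update([example])  # a list of Examples is required
# Passing a bare Example or a (text, annots) tuple now raises TypeError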
|
|
|
@ -55,7 +55,7 @@ def test_aligned_tags():
|
||||||
predicted = Doc(vocab, words=pred_words)
|
predicted = Doc(vocab, words=pred_words)
|
||||||
example = Example.from_dict(predicted, annots)
|
example = Example.from_dict(predicted, annots)
|
||||||
aligned_tags = example.get_aligned("tag", as_string=True)
|
aligned_tags = example.get_aligned("tag", as_string=True)
|
||||||
assert aligned_tags == ["VERB", "DET", None, "SCONJ", "PRON", "VERB", "VERB"]
|
assert aligned_tags == ["VERB", "DET", "NOUN", "SCONJ", "PRON", "VERB", "VERB"]
|
||||||
|
|
||||||
|
|
||||||
def test_aligned_tags_multi():
|
def test_aligned_tags_multi():
|
||||||
|
@ -230,8 +230,7 @@ def test_Example_from_dict_with_links(annots):
|
||||||
[
|
[
|
||||||
{
|
{
|
||||||
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
|
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
|
||||||
"entities": [(7, 15, "LOC"), (20, 26, "LOC")],
|
"links": {(7, 14): {"Q7381115": 1.0, "Q2146908": 0.0}},
|
||||||
"links": {(0, 1): {"Q7381115": 1.0, "Q2146908": 0.0}},
|
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
)
|
)
|
||||||
@@ -26,8 +26,6 @@ cdef class Tokenizer:
     cdef int _property_init_count
     cdef int _property_init_max

-    cpdef Doc tokens_from_list(self, list strings)
-
     cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases)
     cdef int _apply_special_cases(self, Doc doc) except -1
     cdef void _filter_special_spans(self, vector[SpanC] &original,

@@ -140,10 +140,6 @@ cdef class Tokenizer:
                 self.url_match)
         return (self.__class__, args, None, None)

-    cpdef Doc tokens_from_list(self, list strings):
-        warnings.warn(Warnings.W002, DeprecationWarning)
-        return Doc(self.vocab, words=strings)
-
     def __call__(self, unicode string):
         """Tokenize a string.

@@ -218,7 +214,7 @@ cdef class Tokenizer:
         doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws
         return doc

-    def pipe(self, texts, batch_size=1000, n_threads=-1):
+    def pipe(self, texts, batch_size=1000):
         """Tokenize a stream of texts.

         texts: A sequence of unicode texts.
@@ -228,8 +224,6 @@ cdef class Tokenizer:

         DOCS: https://spacy.io/api/tokenizer#pipe
         """
-        if n_threads != -1:
-            warnings.warn(Warnings.W016, DeprecationWarning)
         for text in texts:
             yield self(text)

@@ -746,7 +740,7 @@ cdef class Tokenizer:
         self.from_bytes(bytes_data, **kwargs)
         return self

-    def to_bytes(self, exclude=tuple(), **kwargs):
+    def to_bytes(self, exclude=tuple()):
         """Serialize the current state to a binary string.

         exclude (list): String names of serialization fields to exclude.
@@ -763,10 +757,9 @@ cdef class Tokenizer:
             "url_match": lambda: _get_regex_pattern(self.url_match),
             "exceptions": lambda: dict(sorted(self._rules.items()))
         }
-        exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
         return util.to_bytes(serializers, exclude)

-    def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
+    def from_bytes(self, bytes_data, exclude=tuple()):
         """Load state from a binary string.

         bytes_data (bytes): The data to load from.
@@ -785,7 +778,6 @@ cdef class Tokenizer:
             "url_match": lambda b: data.setdefault("url_match", b),
             "exceptions": lambda b: data.setdefault("rules", b)
         }
-        exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
         msg = util.from_bytes(bytes_data, deserializers, exclude)
         if "prefix_search" in data and isinstance(data["prefix_search"], str):
             self.prefix_search = re.compile(data["prefix_search"]).search
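The deprecated tokens_from_list and n_threads paths are dropped, and Tokenizer.to_bytes/from_bytes no longer accept **kwargs; field names go through exclude only, like the other serializable objects. A round-trip sketch:

from spacy.lang.en import English

nlp = English()
data = nlp.tokenizer.to_bytes(exclude=["vocab"])      # serialize rules and patterns only
nlp2 = English()
nlp2.tokenizer.from_bytes(data, exclude=["vocab"])    # restore them on a fresh tokenizer
print([t.text for t in nlp2.tokenizer("I'm here!")])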
|
|
|
@ -8,8 +8,9 @@ from ..tokens import Doc
|
||||||
from ..attrs import SPACY, ORTH, intify_attr
|
from ..attrs import SPACY, ORTH, intify_attr
|
||||||
from ..errors import Errors
|
from ..errors import Errors
|
||||||
|
|
||||||
|
# fmt: off
|
||||||
ALL_ATTRS = ("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "LEMMA", "MORPH")
|
ALL_ATTRS = ("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "LEMMA", "MORPH", "POS")
|
||||||
|
# fmt: on
|
||||||
|
|
||||||
|
|
||||||
class DocBin(object):
|
class DocBin(object):
|
||||||
|
@ -31,6 +32,7 @@ class DocBin(object):
|
||||||
"spaces": bytes, # Serialized numpy boolean array with spaces data
|
"spaces": bytes, # Serialized numpy boolean array with spaces data
|
||||||
"lengths": bytes, # Serialized numpy int32 array with the doc lengths
|
"lengths": bytes, # Serialized numpy int32 array with the doc lengths
|
||||||
"strings": List[unicode] # List of unique strings in the token data
|
"strings": List[unicode] # List of unique strings in the token data
|
||||||
|
"version": str, # DocBin version number
|
||||||
}
|
}
|
||||||
|
|
||||||
Strings for the words, tags, labels etc are represented by 64-bit hashes in
|
Strings for the words, tags, labels etc are represented by 64-bit hashes in
|
||||||
|
@ -53,12 +55,14 @@ class DocBin(object):
|
||||||
DOCS: https://spacy.io/api/docbin#init
|
DOCS: https://spacy.io/api/docbin#init
|
||||||
"""
|
"""
|
||||||
attrs = sorted([intify_attr(attr) for attr in attrs])
|
attrs = sorted([intify_attr(attr) for attr in attrs])
|
||||||
|
self.version = "0.1"
|
||||||
self.attrs = [attr for attr in attrs if attr != ORTH and attr != SPACY]
|
self.attrs = [attr for attr in attrs if attr != ORTH and attr != SPACY]
|
||||||
self.attrs.insert(0, ORTH) # Ensure ORTH is always attrs[0]
|
self.attrs.insert(0, ORTH) # Ensure ORTH is always attrs[0]
|
||||||
self.tokens = []
|
self.tokens = []
|
||||||
self.spaces = []
|
self.spaces = []
|
||||||
self.cats = []
|
self.cats = []
|
||||||
self.user_data = []
|
self.user_data = []
|
||||||
|
self.flags = []
|
||||||
self.strings = set()
|
self.strings = set()
|
||||||
self.store_user_data = store_user_data
|
self.store_user_data = store_user_data
|
||||||
for doc in docs:
|
for doc in docs:
|
||||||
|
@ -83,12 +87,15 @@ class DocBin(object):
|
||||||
assert array.shape[0] == spaces.shape[0] # this should never happen
|
assert array.shape[0] == spaces.shape[0] # this should never happen
|
||||||
spaces = spaces.reshape((spaces.shape[0], 1))
|
spaces = spaces.reshape((spaces.shape[0], 1))
|
||||||
self.spaces.append(numpy.asarray(spaces, dtype=bool))
|
self.spaces.append(numpy.asarray(spaces, dtype=bool))
|
||||||
|
self.flags.append({"has_unknown_spaces": doc.has_unknown_spaces})
|
||||||
for token in doc:
|
for token in doc:
|
||||||
self.strings.add(token.text)
|
self.strings.add(token.text)
|
||||||
self.strings.add(token.tag_)
|
self.strings.add(token.tag_)
|
||||||
self.strings.add(token.lemma_)
|
self.strings.add(token.lemma_)
|
||||||
|
self.strings.add(token.morph_)
|
||||||
self.strings.add(token.dep_)
|
self.strings.add(token.dep_)
|
||||||
self.strings.add(token.ent_type_)
|
self.strings.add(token.ent_type_)
|
||||||
|
self.strings.add(token.ent_kb_id_)
|
||||||
self.cats.append(doc.cats)
|
self.cats.append(doc.cats)
|
||||||
if self.store_user_data:
|
if self.store_user_data:
|
||||||
self.user_data.append(srsly.msgpack_dumps(doc.user_data))
|
self.user_data.append(srsly.msgpack_dumps(doc.user_data))
|
||||||
|
@ -105,8 +112,11 @@ class DocBin(object):
|
||||||
vocab[string]
|
vocab[string]
|
||||||
orth_col = self.attrs.index(ORTH)
|
orth_col = self.attrs.index(ORTH)
|
||||||
for i in range(len(self.tokens)):
|
for i in range(len(self.tokens)):
|
||||||
|
flags = self.flags[i]
|
||||||
tokens = self.tokens[i]
|
tokens = self.tokens[i]
|
||||||
spaces = self.spaces[i]
|
spaces = self.spaces[i]
|
||||||
|
if flags.get("has_unknown_spaces"):
|
||||||
|
spaces = None
|
||||||
doc = Doc(vocab, words=tokens[:, orth_col], spaces=spaces)
|
doc = Doc(vocab, words=tokens[:, orth_col], spaces=spaces)
|
||||||
doc = doc.from_array(self.attrs, tokens)
|
doc = doc.from_array(self.attrs, tokens)
|
||||||
doc.cats = self.cats[i]
|
doc.cats = self.cats[i]
|
||||||
|
@ -130,6 +140,7 @@ class DocBin(object):
|
||||||
self.spaces.extend(other.spaces)
|
self.spaces.extend(other.spaces)
|
||||||
self.strings.update(other.strings)
|
self.strings.update(other.strings)
|
||||||
self.cats.extend(other.cats)
|
self.cats.extend(other.cats)
|
||||||
|
self.flags.extend(other.flags)
|
||||||
if self.store_user_data:
|
if self.store_user_data:
|
||||||
self.user_data.extend(other.user_data)
|
self.user_data.extend(other.user_data)
|
||||||
|
|
||||||
|
@ -147,12 +158,14 @@ class DocBin(object):
|
||||||
spaces = numpy.vstack(self.spaces) if self.spaces else numpy.asarray([])
|
spaces = numpy.vstack(self.spaces) if self.spaces else numpy.asarray([])
|
||||||
|
|
||||||
msg = {
|
msg = {
|
||||||
|
"version": self.version,
|
||||||
"attrs": self.attrs,
|
"attrs": self.attrs,
|
||||||
"tokens": tokens.tobytes("C"),
|
"tokens": tokens.tobytes("C"),
|
||||||
"spaces": spaces.tobytes("C"),
|
"spaces": spaces.tobytes("C"),
|
||||||
"lengths": numpy.asarray(lengths, dtype="int32").tobytes("C"),
|
"lengths": numpy.asarray(lengths, dtype="int32").tobytes("C"),
|
||||||
"strings": list(self.strings),
|
"strings": list(self.strings),
|
||||||
"cats": self.cats,
|
"cats": self.cats,
|
||||||
|
"flags": self.flags,
|
||||||
}
|
}
|
||||||
if self.store_user_data:
|
if self.store_user_data:
|
||||||
msg["user_data"] = self.user_data
|
msg["user_data"] = self.user_data
|
||||||
|
@ -178,6 +191,7 @@ class DocBin(object):
|
||||||
self.tokens = NumpyOps().unflatten(flat_tokens, lengths)
|
self.tokens = NumpyOps().unflatten(flat_tokens, lengths)
|
||||||
self.spaces = NumpyOps().unflatten(flat_spaces, lengths)
|
self.spaces = NumpyOps().unflatten(flat_spaces, lengths)
|
||||||
self.cats = msg["cats"]
|
self.cats = msg["cats"]
|
||||||
|
self.flags = msg.get("flags", [{} for _ in lengths])
|
||||||
if self.store_user_data and "user_data" in msg:
|
if self.store_user_data and "user_data" in msg:
|
||||||
self.user_data = list(msg["user_data"])
|
self.user_data = list(msg["user_data"])
|
||||||
for tokens in self.tokens:
|
for tokens in self.tokens:
|
||||||
|
|
|
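The DocBin changes above add a version field, a per-doc flags list (currently carrying has_unknown_spaces), a POS attribute and extra interned strings (token.morph_, token.ent_kb_id_), so docs survive a serialization round trip more faithfully. Typical usage is unchanged; a short sketch:

from spacy.lang.en import English
from spacy.tokens import DocBin

nlp = English()
doc_bin = DocBin(attrs=["ORTH", "TAG", "LEMMA"], store_user_data=True)
doc_bin.add(nlp("Give it back! He pleaded."))
data = doc_bin.to_bytes()        # msgpack payload, now including version and flags
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
print([t.text for t in restored[0]])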
@ -59,11 +59,14 @@ cdef class Doc:
|
||||||
cdef public dict user_token_hooks
|
cdef public dict user_token_hooks
|
||||||
cdef public dict user_span_hooks
|
cdef public dict user_span_hooks
|
||||||
|
|
||||||
|
cdef public bint has_unknown_spaces
|
||||||
|
|
||||||
cdef public list _py_tokens
|
cdef public list _py_tokens
|
||||||
|
|
||||||
cdef int length
|
cdef int length
|
||||||
cdef int max_length
|
cdef int max_length
|
||||||
|
|
||||||
|
|
||||||
cdef public object noun_chunks_iterator
|
cdef public object noun_chunks_iterator
|
||||||
|
|
||||||
cdef object __weakref__
|
cdef object __weakref__
|
||||||
|
|
|
@ -5,6 +5,7 @@ from libc.string cimport memcpy, memset
|
||||||
from libc.math cimport sqrt
|
from libc.math cimport sqrt
|
||||||
from libc.stdint cimport int32_t, uint64_t
|
from libc.stdint cimport int32_t, uint64_t
|
||||||
|
|
||||||
|
import copy
|
||||||
from collections import Counter
|
from collections import Counter
|
||||||
import numpy
|
import numpy
|
||||||
import numpy.linalg
|
import numpy.linalg
|
||||||
|
@ -24,7 +25,7 @@ from ..attrs cimport LENGTH, POS, LEMMA, TAG, MORPH, DEP, HEAD, SPACY, ENT_IOB
|
||||||
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
|
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
|
||||||
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
||||||
|
|
||||||
from ..attrs import intify_attrs, IDS
|
from ..attrs import intify_attr, intify_attrs, IDS
|
||||||
from ..util import normalize_slice
|
from ..util import normalize_slice
|
||||||
from ..compat import copy_reg, pickle
|
from ..compat import copy_reg, pickle
|
||||||
from ..errors import Errors, Warnings
|
from ..errors import Errors, Warnings
|
||||||
|
@ -171,8 +172,7 @@ cdef class Doc:
|
||||||
raise ValueError(Errors.E046.format(name=name))
|
raise ValueError(Errors.E046.format(name=name))
|
||||||
return Underscore.doc_extensions.pop(name)
|
return Underscore.doc_extensions.pop(name)
|
||||||
|
|
||||||
def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None,
|
def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None):
|
||||||
orths_and_spaces=None):
|
|
||||||
"""Create a Doc object.
|
"""Create a Doc object.
|
||||||
|
|
||||||
vocab (Vocab): A vocabulary object, which must match any models you
|
vocab (Vocab): A vocabulary object, which must match any models you
|
||||||
|
@ -214,28 +214,25 @@ cdef class Doc:
|
||||||
self._vector = None
|
self._vector = None
|
||||||
self.noun_chunks_iterator = _get_chunker(self.vocab.lang)
|
self.noun_chunks_iterator = _get_chunker(self.vocab.lang)
|
||||||
cdef bint has_space
|
cdef bint has_space
|
||||||
if orths_and_spaces is None and words is not None:
|
if words is None and spaces is not None:
|
||||||
if spaces is None:
|
raise ValueError("words must be set if spaces is set")
|
||||||
spaces = [True] * len(words)
|
elif spaces is None and words is not None:
|
||||||
elif len(spaces) != len(words):
|
self.has_unknown_spaces = True
|
||||||
raise ValueError(Errors.E027)
|
else:
|
||||||
orths_and_spaces = zip(words, spaces)
|
self.has_unknown_spaces = False
|
||||||
|
words = words if words is not None else []
|
||||||
|
spaces = spaces if spaces is not None else ([True] * len(words))
|
||||||
|
if len(spaces) != len(words):
|
||||||
|
raise ValueError(Errors.E027)
|
||||||
cdef const LexemeC* lexeme
|
cdef const LexemeC* lexeme
|
||||||
if orths_and_spaces is not None:
|
for word, has_space in zip(words, spaces):
|
||||||
orths_and_spaces = list(orths_and_spaces)
|
if isinstance(word, unicode):
|
||||||
for orth_space in orths_and_spaces:
|
lexeme = self.vocab.get(self.mem, word)
|
||||||
if isinstance(orth_space, unicode):
|
elif isinstance(word, bytes):
|
||||||
lexeme = self.vocab.get(self.mem, orth_space)
|
raise ValueError(Errors.E028.format(value=word))
|
||||||
has_space = True
|
else:
|
||||||
elif isinstance(orth_space, bytes):
|
lexeme = self.vocab.get_by_orth(self.mem, word)
|
||||||
raise ValueError(Errors.E028.format(value=orth_space))
|
self.push_back(lexeme, has_space)
|
||||||
elif isinstance(orth_space[0], unicode):
|
|
||||||
lexeme = self.vocab.get(self.mem, orth_space[0])
|
|
||||||
has_space = orth_space[1]
|
|
||||||
else:
|
|
||||||
lexeme = self.vocab.get_by_orth(self.mem, orth_space[0])
|
|
||||||
has_space = orth_space[1]
|
|
||||||
self.push_back(lexeme, has_space)
|
|
||||||
# Tough to decide on policy for this. Is an empty doc tagged and parsed?
|
# Tough to decide on policy for this. Is an empty doc tagged and parsed?
|
||||||
# There's no information we'd like to add to it, so I guess so?
|
# There's no information we'd like to add to it, so I guess so?
|
||||||
if self.length == 0:
|
if self.length == 0:
|
||||||
|
@ -806,7 +803,7 @@ cdef class Doc:
|
||||||
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
|
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
|
||||||
for id_ in attrs]
|
for id_ in attrs]
|
||||||
if array.dtype != numpy.uint64:
|
if array.dtype != numpy.uint64:
|
||||||
warnings.warn(Warnings.W028.format(type=array.dtype))
|
warnings.warn(Warnings.W101.format(type=array.dtype))
|
||||||
|
|
||||||
if SENT_START in attrs and HEAD in attrs:
|
if SENT_START in attrs and HEAD in attrs:
|
||||||
raise ValueError(Errors.E032)
|
raise ValueError(Errors.E032)
|
||||||
|
@ -882,6 +879,87 @@ cdef class Doc:
|
||||||
set_children_from_heads(self.c, length)
|
set_children_from_heads(self.c, length)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def from_docs(docs, ensure_whitespace=True, attrs=None):
|
||||||
|
"""Concatenate multiple Doc objects to form a new one. Raises an error if the `Doc` objects do not all share
|
||||||
|
the same `Vocab`.
|
||||||
|
|
||||||
|
docs (list): A list of Doc objects.
|
||||||
|
ensure_whitespace (bool): Insert a space between two adjacent docs whenever the first doc does not end in whitespace.
|
||||||
|
attrs (list): Optional list of attribute ID ints or attribute name strings.
|
||||||
|
RETURNS (Doc): A doc that contains the concatenated docs, or None if no docs were given.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/doc#from_docs
|
||||||
|
"""
|
||||||
|
if not docs:
|
||||||
|
return None
|
||||||
|
|
||||||
|
vocab = {doc.vocab for doc in docs}
|
||||||
|
if len(vocab) > 1:
|
||||||
|
raise ValueError(Errors.E999)
|
||||||
|
(vocab,) = vocab
|
||||||
|
|
||||||
|
if attrs is None:
|
||||||
|
attrs = [LEMMA, NORM]
|
||||||
|
if all(doc.is_nered for doc in docs):
|
||||||
|
attrs.extend([ENT_IOB, ENT_KB_ID, ENT_TYPE])
|
||||||
|
# TODO: separate for is_morphed?
|
||||||
|
if all(doc.is_tagged for doc in docs):
|
||||||
|
attrs.extend([TAG, POS, MORPH])
|
||||||
|
if all(doc.is_parsed for doc in docs):
|
||||||
|
attrs.extend([HEAD, DEP])
|
||||||
|
else:
|
||||||
|
attrs.append(SENT_START)
|
||||||
|
else:
|
||||||
|
if any(isinstance(attr, str) for attr in attrs): # resolve attribute names
|
||||||
|
attrs = [intify_attr(attr) for attr in attrs] # intify_attr returns None for invalid attrs
|
||||||
|
attrs = list(attr for attr in set(attrs) if attr) # filter duplicates, remove None if present
|
||||||
|
if SPACY not in attrs:
|
||||||
|
attrs.append(SPACY)
|
||||||
|
|
||||||
|
concat_words = []
|
||||||
|
concat_spaces = []
|
||||||
|
concat_user_data = {}
|
||||||
|
char_offset = 0
|
||||||
|
for doc in docs:
|
||||||
|
concat_words.extend(t.text for t in doc)
|
||||||
|
concat_spaces.extend(bool(t.whitespace_) for t in doc)
|
||||||
|
|
||||||
|
for key, value in doc.user_data.items():
|
||||||
|
if isinstance(key, tuple) and len(key) == 4:
|
||||||
|
data_type, name, start, end = key
|
||||||
|
if start is not None or end is not None:
|
||||||
|
start += char_offset
|
||||||
|
if end is not None:
|
||||||
|
end += char_offset
|
||||||
|
concat_user_data[(data_type, name, start, end)] = copy.copy(value)
|
||||||
|
else:
|
||||||
|
warnings.warn(Warnings.W101.format(name=name))
|
||||||
|
else:
|
||||||
|
warnings.warn(Warnings.W102.format(key=key, value=value))
|
||||||
|
char_offset += len(doc.text) if not ensure_whitespace or doc[-1].is_space else len(doc.text) + 1
|
||||||
|
|
||||||
|
arrays = [doc.to_array(attrs) for doc in docs]
|
||||||
|
|
||||||
|
if ensure_whitespace:
|
||||||
|
spacy_index = attrs.index(SPACY)
|
||||||
|
for i, array in enumerate(arrays[:-1]):
|
||||||
|
if len(array) > 0 and not docs[i][-1].is_space:
|
||||||
|
array[-1][spacy_index] = 1
|
||||||
|
token_offset = -1
|
||||||
|
for doc in docs[:-1]:
|
||||||
|
token_offset += len(doc)
|
||||||
|
if not doc[-1].is_space:
|
||||||
|
concat_spaces[token_offset] = True
|
||||||
|
|
||||||
|
concat_array = numpy.concatenate(arrays)
|
||||||
|
|
||||||
|
concat_doc = Doc(vocab, words=concat_words, spaces=concat_spaces, user_data=concat_user_data)
|
||||||
|
|
||||||
|
concat_doc.from_array(attrs, concat_array)
|
||||||
|
|
||||||
|
return concat_doc
|
||||||
|
|
||||||
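The new Doc.from_docs staticmethod added above concatenates docs that share a vocab, optionally inserting a space between adjacent docs that do not end in whitespace and shifting char-offset user data by the running offset. A usage sketch with a blank English pipeline:

from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()
doc1 = nlp("This is one document.")
doc2 = nlp("And here is another one")
merged = Doc.from_docs([doc1, doc2], ensure_whitespace=True)
# Expected roughly: "This is one document. And here is another one"
print(merged.text)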
def get_lca_matrix(self):
|
def get_lca_matrix(self):
|
||||||
"""Calculates a matrix of Lowest Common Ancestors (LCA) for a given
|
"""Calculates a matrix of Lowest Common Ancestors (LCA) for a given
|
||||||
`Doc`, where LCA[i, j] is the index of the lowest common ancestor among
|
`Doc`, where LCA[i, j] is the index of the lowest common ancestor among
|
||||||
|
@ -905,6 +983,7 @@ cdef class Doc:
|
||||||
other.is_parsed = self.is_parsed
|
other.is_parsed = self.is_parsed
|
||||||
other.is_morphed = self.is_morphed
|
other.is_morphed = self.is_morphed
|
||||||
other.sentiment = self.sentiment
|
other.sentiment = self.sentiment
|
||||||
|
other.has_unknown_spaces = self.has_unknown_spaces
|
||||||
other.user_hooks = dict(self.user_hooks)
|
other.user_hooks = dict(self.user_hooks)
|
||||||
other.user_token_hooks = dict(self.user_token_hooks)
|
other.user_token_hooks = dict(self.user_token_hooks)
|
||||||
other.user_span_hooks = dict(self.user_span_hooks)
|
other.user_span_hooks = dict(self.user_span_hooks)
|
||||||
|
@ -1000,10 +1079,8 @@ cdef class Doc:
|
||||||
"sentiment": lambda: self.sentiment,
|
"sentiment": lambda: self.sentiment,
|
||||||
"tensor": lambda: self.tensor,
|
"tensor": lambda: self.tensor,
|
||||||
"cats": lambda: self.cats,
|
"cats": lambda: self.cats,
|
||||||
|
"has_unknown_spaces": lambda: self.has_unknown_spaces
|
||||||
}
|
}
|
||||||
for key in kwargs:
|
|
||||||
if key in serializers or key in ("user_data", "user_data_keys", "user_data_values"):
|
|
||||||
raise ValueError(Errors.E128.format(arg=key))
|
|
||||||
if "user_data" not in exclude and self.user_data:
|
if "user_data" not in exclude and self.user_data:
|
||||||
user_data_keys, user_data_values = list(zip(*self.user_data.items()))
|
user_data_keys, user_data_values = list(zip(*self.user_data.items()))
|
||||||
if "user_data_keys" not in exclude:
|
if "user_data_keys" not in exclude:
|
||||||
|
@ -1032,10 +1109,8 @@ cdef class Doc:
|
||||||
"cats": lambda b: None,
|
"cats": lambda b: None,
|
||||||
"user_data_keys": lambda b: None,
|
"user_data_keys": lambda b: None,
|
||||||
"user_data_values": lambda b: None,
|
"user_data_values": lambda b: None,
|
||||||
|
"has_unknown_spaces": lambda b: None
|
||||||
}
|
}
|
||||||
for key in kwargs:
|
|
||||||
if key in deserializers or key in ("user_data",):
|
|
||||||
raise ValueError(Errors.E128.format(arg=key))
|
|
||||||
# Msgpack doesn't distinguish between lists and tuples, which is
|
# Msgpack doesn't distinguish between lists and tuples, which is
|
||||||
# vexing for user data. As a best guess, we *know* that within
|
# vexing for user data. As a best guess, we *know* that within
|
||||||
# keys, we must have tuples. In values we just have to hope
|
# keys, we must have tuples. In values we just have to hope
|
||||||
|
@ -1052,6 +1127,8 @@ cdef class Doc:
|
||||||
self.tensor = msg["tensor"]
|
self.tensor = msg["tensor"]
|
||||||
if "cats" not in exclude and "cats" in msg:
|
if "cats" not in exclude and "cats" in msg:
|
||||||
self.cats = msg["cats"]
|
self.cats = msg["cats"]
|
||||||
|
if "has_unknown_spaces" not in exclude and "has_unknown_spaces" in msg:
|
||||||
|
self.has_unknown_spaces = msg["has_unknown_spaces"]
|
||||||
start = 0
|
start = 0
|
||||||
cdef const LexemeC* lex
|
cdef const LexemeC* lex
|
||||||
cdef unicode orth_
|
cdef unicode orth_
|
||||||
|
@ -1123,50 +1200,6 @@ cdef class Doc:
|
||||||
remove_label_if_necessary(attributes[i])
|
remove_label_if_necessary(attributes[i])
|
||||||
retokenizer.merge(span, attributes[i])
|
retokenizer.merge(span, attributes[i])
|
||||||
|
|
||||||
def merge(self, int start_idx, int end_idx, *args, **attributes):
|
|
||||||
"""Retokenize the document, such that the span at
|
|
||||||
`doc.text[start_idx : end_idx]` is merged into a single token. If
|
|
||||||
`start_idx` and `end_idx `do not mark start and end token boundaries,
|
|
||||||
the document remains unchanged.
|
|
||||||
|
|
||||||
start_idx (int): Character index of the start of the slice to merge.
|
|
||||||
end_idx (int): Character index after the end of the slice to merge.
|
|
||||||
**attributes: Attributes to assign to the merged token. By default,
|
|
||||||
attributes are inherited from the syntactic root of the span.
|
|
||||||
RETURNS (Token): The newly merged token, or `None` if the start and end
|
|
||||||
indices did not fall at token boundaries.
|
|
||||||
"""
|
|
||||||
cdef unicode tag, lemma, ent_type
|
|
||||||
warnings.warn(Warnings.W013.format(obj="Doc"), DeprecationWarning)
|
|
||||||
# TODO: ENT_KB_ID ?
|
|
||||||
if len(args) == 3:
|
|
||||||
warnings.warn(Warnings.W003, DeprecationWarning)
|
|
||||||
tag, lemma, ent_type = args
|
|
||||||
attributes[TAG] = tag
|
|
||||||
attributes[LEMMA] = lemma
|
|
||||||
attributes[ENT_TYPE] = ent_type
|
|
||||||
elif not args:
|
|
||||||
fix_attributes(self, attributes)
|
|
||||||
elif args:
|
|
||||||
raise ValueError(Errors.E034.format(n_args=len(args), args=repr(args),
|
|
||||||
kwargs=repr(attributes)))
|
|
||||||
remove_label_if_necessary(attributes)
|
|
||||||
attributes = intify_attrs(attributes, strings_map=self.vocab.strings)
|
|
||||||
cdef int start = token_by_start(self.c, self.length, start_idx)
|
|
||||||
if start == -1:
|
|
||||||
return None
|
|
||||||
cdef int end = token_by_end(self.c, self.length, end_idx)
|
|
||||||
if end == -1:
|
|
||||||
return None
|
|
||||||
# Currently we have the token index, we want the range-end index
|
|
||||||
end += 1
|
|
||||||
with self.retokenize() as retokenizer:
|
|
||||||
retokenizer.merge(self[start:end], attrs=attributes)
|
|
||||||
return self[start]
|
|
||||||
|
|
||||||
def print_tree(self, light=False, flat=False):
|
|
||||||
raise ValueError(Errors.E105)
|
|
||||||
|
|
||||||
def to_json(self, underscore=None):
|
def to_json(self, underscore=None):
|
||||||
"""Convert a Doc to JSON. The format it produces will be the new format
|
"""Convert a Doc to JSON. The format it produces will be the new format
|
||||||
for the `spacy train` command (not implemented yet).
|
for the `spacy train` command (not implemented yet).
|
||||||
|
|
|
@ -280,18 +280,6 @@ cdef class Span:
|
||||||
|
|
||||||
return array
|
return array
|
||||||
|
|
||||||
def merge(self, *args, **attributes):
|
|
||||||
"""Retokenize the document, such that the span is merged into a single
|
|
||||||
token.
|
|
||||||
|
|
||||||
**attributes: Attributes to assign to the merged token. By default,
|
|
||||||
attributes are inherited from the syntactic root token of the span.
|
|
||||||
RETURNS (Token): The newly merged token.
|
|
||||||
"""
|
|
||||||
warnings.warn(Warnings.W013.format(obj="Span"), DeprecationWarning)
|
|
||||||
return self.doc.merge(self.start_char, self.end_char, *args,
|
|
||||||
**attributes)
|
|
||||||
|
|
||||||
def get_lca_matrix(self):
|
def get_lca_matrix(self):
|
||||||
"""Calculates a matrix of Lowest Common Ancestors (LCA) for a given
|
"""Calculates a matrix of Lowest Common Ancestors (LCA) for a given
|
||||||
`Span`, where LCA[i, j] is the index of the lowest common ancestor among
|
`Span`, where LCA[i, j] is the index of the lowest common ancestor among
|
||||||
|
@ -698,21 +686,6 @@ cdef class Span:
|
||||||
"""RETURNS (str): The span's lemma."""
|
"""RETURNS (str): The span's lemma."""
|
||||||
return " ".join([t.lemma_ for t in self]).strip()
|
return " ".join([t.lemma_ for t in self]).strip()
|
||||||
|
|
||||||
@property
|
|
||||||
def upper_(self):
|
|
||||||
"""Deprecated. Use `Span.text.upper()` instead."""
|
|
||||||
return "".join([t.text_with_ws.upper() for t in self]).strip()
|
|
||||||
|
|
||||||
@property
|
|
||||||
def lower_(self):
|
|
||||||
"""Deprecated. Use `Span.text.lower()` instead."""
|
|
||||||
return "".join([t.text_with_ws.lower() for t in self]).strip()
|
|
||||||
|
|
||||||
@property
|
|
||||||
def string(self):
|
|
||||||
"""Deprecated: Use `Span.text_with_ws` instead."""
|
|
||||||
return "".join([t.text_with_ws for t in self])
|
|
||||||
|
|
||||||
property label_:
|
property label_:
|
||||||
"""RETURNS (str): The span's label."""
|
"""RETURNS (str): The span's label."""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
|
|
@ -237,11 +237,6 @@ cdef class Token:
|
||||||
index into tables, e.g. for word vectors."""
|
index into tables, e.g. for word vectors."""
|
||||||
return self.c.lex.id
|
return self.c.lex.id
|
||||||
|
|
||||||
@property
|
|
||||||
def string(self):
|
|
||||||
"""Deprecated: Use Token.text_with_ws instead."""
|
|
||||||
return self.text_with_ws
|
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def text(self):
|
def text(self):
|
||||||
"""RETURNS (str): The original verbatim text of the token."""
|
"""RETURNS (str): The original verbatim text of the token."""
|
||||||
|
|
106
spacy/util.py
@@ -4,9 +4,8 @@ import importlib
 import importlib.util
 import re
 from pathlib import Path
-import random
 import thinc
-from thinc.api import NumpyOps, get_current_ops, Adam, require_gpu, Config
+from thinc.api import NumpyOps, get_current_ops, Adam, Config
 import functools
 import itertools
 import numpy.random
@@ -34,6 +33,13 @@ try: # Python 3.8
 except ImportError:
     import importlib_metadata

+# These are functions that were previously (v2.x) available from spacy.util
+# and have since moved to Thinc. We're importing them here so people's code
+# doesn't break, but they should always be imported from Thinc from now on,
+# not from spacy.util.
+from thinc.api import fix_random_seed, compounding, decaying  # noqa: F401
+
+
 from .symbols import ORTH
 from .compat import cupy, CudaStream, is_windows
 from .errors import Errors, Warnings
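The comment block added above keeps fix_random_seed, compounding and decaying importable from spacy.util for backwards compatibility, but new code should take them from Thinc directly. A small sketch; the schedule values are illustrative:

from thinc.api import fix_random_seed, decaying

fix_random_seed(0)                  # make runs deterministic
dropout = decaying(0.6, 0.2, 1e-4)  # linearly decaying schedule
print(next(dropout), next(dropout))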
@@ -595,15 +601,8 @@ def compile_prefix_regex(entries):
     entries (tuple): The prefix rules, e.g. spacy.lang.punctuation.TOKENIZER_PREFIXES.
     RETURNS (regex object): The regex object. to be used for Tokenizer.prefix_search.
     """
-    if "(" in entries:
-        # Handle deprecated data
-        expression = "|".join(
-            ["^" + re.escape(piece) for piece in entries if piece.strip()]
-        )
-        return re.compile(expression)
-    else:
-        expression = "|".join(["^" + piece for piece in entries if piece.strip()])
-        return re.compile(expression)
+    expression = "|".join(["^" + piece for piece in entries if piece.strip()])
+    return re.compile(expression)


 def compile_suffix_regex(entries):
|
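To make the simplified helper concrete, a minimal usage sketch, feeding it the prefix rules the docstring above names (`spacy.lang.punctuation.TOKENIZER_PREFIXES`):

```python
from spacy.lang.punctuation import TOKENIZER_PREFIXES
from spacy.util import compile_prefix_regex

prefix_regex = compile_prefix_regex(TOKENIZER_PREFIXES)
# the compiled pattern is what Tokenizer.prefix_search uses
match = prefix_regex.search('"Hello')
print(match.group() if match else None)  # the leading quote matches as a prefix
```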
@@ -723,59 +722,6 @@ def minibatch(items, size=8):
         yield list(batch)
 
 
-def compounding(start, stop, compound):
-    """Yield an infinite series of compounding values. Each time the
-    generator is called, a value is produced by multiplying the previous
-    value by the compound rate.
-
-    EXAMPLE:
-      >>> sizes = compounding(1., 10., 1.5)
-      >>> assert next(sizes) == 1.
-      >>> assert next(sizes) == 1 * 1.5
-      >>> assert next(sizes) == 1.5 * 1.5
-    """
-
-    def clip(value):
-        return max(value, stop) if (start > stop) else min(value, stop)
-
-    curr = float(start)
-    while True:
-        yield clip(curr)
-        curr *= compound
-
-
-def stepping(start, stop, steps):
-    """Yield an infinite series of values that step from a start value to a
-    final value over some number of steps. Each step is (stop-start)/steps.
-
-    After the final value is reached, the generator continues yielding that
-    value.
-
-    EXAMPLE:
-      >>> sizes = stepping(1., 200., 100)
-      >>> assert next(sizes) == 1.
-      >>> assert next(sizes) == 1 * (200.-1.) / 100
-      >>> assert next(sizes) == 1 + (200.-1.) / 100 + (200.-1.) / 100
-    """
-
-    def clip(value):
-        return max(value, stop) if (start > stop) else min(value, stop)
-
-    curr = float(start)
-    while True:
-        yield clip(curr)
-        curr += (stop - start) / steps
-
-
-def decaying(start, stop, decay):
-    """Yield an infinite series of linearly decaying values."""
-
-    curr = float(start)
-    while True:
-        yield max(curr, stop)
-        curr -= decay
-
-
 def minibatch_by_words(docs, size, tolerance=0.2, discard_oversize=False):
     """Create minibatches of roughly a given number of words. If any examples
     are longer than the specified batch length, they will appear in a batch by
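The removed `compounding` schedule (now re-exported from Thinc, see the import hunk above) is an infinite generator, as its docstring example shows. A quick sketch of how such a schedule is typically consumed together with `minibatch` from the same module (the example strings are placeholders):

```python
from spacy.util import minibatch
from thinc.api import compounding

sizes = compounding(1.0, 10.0, 1.5)   # 1, 1.5, 2.25, ... capped at 10
assert next(sizes) == 1.0
assert next(sizes) == 1.0 * 1.5

# a schedule is usually passed straight to minibatch() as the batch size
examples = ["example %d" % i for i in range(100)]
for batch in minibatch(examples, size=compounding(4.0, 32.0, 1.001)):
    pass  # train on the batch here
```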
@@ -854,35 +800,6 @@ def minibatch_by_words(docs, size, tolerance=0.2, discard_oversize=False):
         yield batch
 
 
-def itershuffle(iterable, bufsize=1000):
-    """Shuffle an iterator. This works by holding `bufsize` items back
-    and yielding them sometime later. Obviously, this is not unbiased –
-    but should be good enough for batching. Larger bufsize means less bias.
-    From https://gist.github.com/andres-erbsen/1307752
-
-    iterable (iterable): Iterator to shuffle.
-    bufsize (int): Items to hold back.
-    YIELDS (iterable): The shuffled iterator.
-    """
-    iterable = iter(iterable)
-    buf = []
-    try:
-        while True:
-            for i in range(random.randint(1, bufsize - len(buf))):
-                buf.append(next(iterable))
-            random.shuffle(buf)
-            for i in range(random.randint(1, bufsize)):
-                if buf:
-                    yield buf.pop()
-                else:
-                    break
-    except StopIteration:
-        random.shuffle(buf)
-        while buf:
-            yield buf.pop()
-        raise StopIteration
-
-
 def filter_spans(spans):
     """Filter a sequence of spans and remove duplicates or overlaps. Useful for
     creating named entities (where one token can only be part of one entity) or
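`filter_spans` only appears as context here, but since its docstring is quoted above, a small usage sketch (the example spans are illustrative):

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("The New York City subway")
overlapping = [doc[1:4], doc[1:3], doc[4:5]]  # "New York City" overlaps "New York"
print(filter_spans(overlapping))              # keeps longer, non-overlapping spans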
@@ -989,6 +906,7 @@ def escape_html(text):
     return text
 
 
+<<<<<<< HEAD
 def use_gpu(gpu_id):
     return require_gpu(gpu_id)
 

@@ -1025,6 +943,8 @@ def get_serialization_exclude(serializers, exclude, kwargs):
     return exclude
 
 
+=======
+>>>>>>> 19d42f42de30ba57e17427798ea2562cdab2c9f8
 def get_words_and_spaces(words, text):
     if "".join("".join(words).split()) != "".join(text.split()):
         raise ValueError(Errors.E194.format(text=text, words=words))
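The context lines above show `get_words_and_spaces` validating that the given words can be aligned with the raw text (raising `E194` otherwise). A hedged sketch of a call; the return value is not shown in this hunk, so the unpacking below is an assumption:

```python
from spacy.util import get_words_and_spaces

text = "Hello world"
words = ["Hello", "world"]
# assumption: the helper returns the aligned words plus a parallel list of
# boolean "followed by a space?" flags that reconstruct the original text
out_words, out_spaces = get_words_and_spaces(words, text)
print(out_words, out_spaces)
```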
spacy/vocab.pyx

@@ -426,7 +426,7 @@ cdef class Vocab:
         orth = self.strings.add(orth)
         return orth in self.vectors
 
-    def to_disk(self, path, exclude=tuple(), **kwargs):
+    def to_disk(self, path, exclude=tuple()):
         """Save the current state to a directory.
 
         path (unicode or Path): A path to a directory, which will be created if

@@ -439,7 +439,6 @@ cdef class Vocab:
         if not path.exists():
             path.mkdir()
         setters = ["strings", "vectors"]
-        exclude = util.get_serialization_exclude(setters, exclude, kwargs)
         if "strings" not in exclude:
             self.strings.to_disk(path / "strings.json")
         if "vectors" not in "exclude" and self.vectors is not None:

@@ -449,7 +448,7 @@ cdef class Vocab:
         if "lookups_extra" not in "exclude" and self.lookups_extra is not None:
             self.lookups_extra.to_disk(path, filename="lookups_extra.bin")
 
-    def from_disk(self, path, exclude=tuple(), **kwargs):
+    def from_disk(self, path, exclude=tuple()):
         """Loads state from a directory. Modifies the object in place and
         returns it.
 

@@ -461,7 +460,6 @@ cdef class Vocab:
         """
         path = util.ensure_path(path)
         getters = ["strings", "vectors"]
-        exclude = util.get_serialization_exclude(getters, exclude, kwargs)
         if "strings" not in exclude:
             self.strings.from_disk(path / "strings.json")  # TODO: add exclude?
         if "vectors" not in exclude:

@@ -481,7 +479,7 @@ cdef class Vocab:
         self._by_orth = PreshMap()
         return self
 
-    def to_bytes(self, exclude=tuple(), **kwargs):
+    def to_bytes(self, exclude=tuple()):
         """Serialize the current state to a binary string.
 
         exclude (list): String names of serialization fields to exclude.

@@ -501,10 +499,9 @@ cdef class Vocab:
             "lookups": lambda: self.lookups.to_bytes(),
             "lookups_extra": lambda: self.lookups_extra.to_bytes()
         }
-        exclude = util.get_serialization_exclude(getters, exclude, kwargs)
         return util.to_bytes(getters, exclude)
 
-    def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
+    def from_bytes(self, bytes_data, exclude=tuple()):
         """Load state from a binary string.
 
         bytes_data (bytes): The data to load from.

@@ -526,7 +523,6 @@ cdef class Vocab:
             "lookups": lambda b: self.lookups.from_bytes(b),
             "lookups_extra": lambda b: self.lookups_extra.from_bytes(b)
         }
-        exclude = util.get_serialization_exclude(setters, exclude, kwargs)
         util.from_bytes(bytes_data, setters, exclude)
         if "lexeme_norm" in self.lookups:
             self.lex_attr_getters[NORM] = util.add_lookups(
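The Vocab hunks drop the `**kwargs`-based `get_serialization_exclude` indirection, so `exclude` is now just a plain list of field names. A minimal sketch of the serialization API as these signatures show it (the output path is illustrative only):

```python
import spacy
from spacy.vocab import Vocab

nlp = spacy.blank("en")
vocab = nlp.vocab

data = vocab.to_bytes(exclude=["vectors"])   # skip the vectors table
vocab2 = Vocab().from_bytes(data)

vocab.to_disk("/tmp/vocab", exclude=["lookups"])
vocab2.from_disk("/tmp/vocab")
```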
621
website/docs/api/annotation.md

@@ -1,621 +0,0 @@
---
title: Annotation Specifications
teaser: Schemes used for labels, tags and training data
menu:
  - ['Text Processing', 'text-processing']
  - ['POS Tagging', 'pos-tagging']
  - ['Dependencies', 'dependency-parsing']
  - ['Named Entities', 'named-entities']
  - ['Models & Training', 'training']
---

## Text processing {#text-processing}

> #### Example
>
> ```python
> from spacy.lang.en import English
> nlp = English()
> tokens = nlp("Some\\nspaces and\\ttab characters")
> tokens_text = [t.text for t in tokens]
> assert tokens_text == ["Some", "\\n", "spaces", " ", "and", "\\t", "tab", "characters"]
> ```

Tokenization standards are based on the
[OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) corpus. The tokenizer
differs from most by including **tokens for significant whitespace**. Any
sequence of whitespace characters beyond a single space (`' '`) is included as a
token. The whitespace tokens are useful for much the same reason punctuation is
– it's often an important delimiter in the text. By preserving it in the token
output, we are able to maintain a simple alignment between the tokens and the
original string, and we ensure that **no information is lost** during
processing.

### Lemmatization {#lemmatization}

> #### Examples
>
> In English, this means:
>
> - **Adjectives**: happier, happiest → happy
> - **Adverbs**: worse, worst → badly
> - **Nouns**: dogs, children → dog, child
> - **Verbs**: writes, writing, wrote, written → write

As of v2.2, lemmatization data is stored in a separate package,
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) that can
be installed if needed via `pip install spacy[lookups]`. Some languages provide
full lemmatization rules and exceptions, while other languages currently only
rely on simple lookup tables.

<Infobox title="About spaCy's custom pronoun lemma for English" variant="warning">

spaCy adds a **special case for English pronouns**: all English pronouns are
lemmatized to the special token `-PRON-`. Unlike verbs and common nouns,
there's no clear base form of a personal pronoun. Should the lemma of "me" be
"I", or should we normalize person as well, giving "it" — or maybe "he"?
spaCy's solution is to introduce a novel symbol, `-PRON-`, which is used as the
lemma for all personal pronouns.

</Infobox>

### Sentence boundary detection {#sentence-boundary}

Sentence boundaries are calculated from the syntactic parse tree, so features
such as punctuation and capitalization play an important but non-decisive role
in determining the sentence boundaries. Usually this means that the sentence
boundaries will at least coincide with clause boundaries, even given poorly
punctuated text.

## Part-of-speech tagging {#pos-tagging}

> #### Tip: Understanding tags
>
> You can also use `spacy.explain` to get the description for the string
> representation of a tag. For example, `spacy.explain("RB")` will return
> "adverb".

This section lists the fine-grained and coarse-grained part-of-speech tags
assigned by spaCy's [models](/models). The individual mapping is specific to the
training corpus and can be defined in the respective language data's
[`tag_map.py`](/usage/adding-languages#tag-map).

<Accordion title="Universal Part-of-speech Tags" id="pos-universal">

spaCy maps all language-specific part-of-speech tags to a small, fixed set of
word type tags following the
[Universal Dependencies scheme](http://universaldependencies.org/u/pos/). The
universal tags don't code for any morphological features and only cover the word
type. They're available as the [`Token.pos`](/api/token#attributes) and
[`Token.pos_`](/api/token#attributes) attributes.

| POS | Description | Examples |
| --- | --- | --- |
| `ADJ` | adjective | big, old, green, incomprehensible, first |
| `ADP` | adposition | in, to, during |
| `ADV` | adverb | very, tomorrow, down, where, there |
| `AUX` | auxiliary | is, has (done), will (do), should (do) |
| `CONJ` | conjunction | and, or, but |
| `CCONJ` | coordinating conjunction | and, or, but |
| `DET` | determiner | a, an, the |
| `INTJ` | interjection | psst, ouch, bravo, hello |
| `NOUN` | noun | girl, cat, tree, air, beauty |
| `NUM` | numeral | 1, 2017, one, seventy-seven, IV, MMXIV |
| `PART` | particle | 's, not, |
| `PRON` | pronoun | I, you, he, she, myself, themselves, somebody |
| `PROPN` | proper noun | Mary, John, London, NATO, HBO |
| `PUNCT` | punctuation | ., (, ), ? |
| `SCONJ` | subordinating conjunction | if, while, that |
| `SYM` | symbol | \$, %, §, ©, +, −, ×, ÷, =, :), 😝 |
| `VERB` | verb | run, runs, running, eat, ate, eating |
| `X` | other | sfpksdpsxmsa |
| `SPACE` | space |

</Accordion>

<Accordion title="English" id="pos-en">

The English part-of-speech tagger uses the
[OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) version of the Penn
Treebank tag set. We also map the tags to the simpler Universal Dependencies v2
POS tag set.

| Tag | POS | Morphology | Description |
| --- | --- | --- | --- |
| `$` | `SYM` | | symbol, currency |
| <InlineCode>``</InlineCode> | `PUNCT` | `PunctType=quot PunctSide=ini` | opening quotation mark |
| `''` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark |
| `,` | `PUNCT` | `PunctType=comm` | punctuation mark, comma |
| `-LRB-` | `PUNCT` | `PunctType=brck PunctSide=ini` | left round bracket |
| `-RRB-` | `PUNCT` | `PunctType=brck PunctSide=fin` | right round bracket |
| `.` | `PUNCT` | `PunctType=peri` | punctuation mark, sentence closer |
| `:` | `PUNCT` | | punctuation mark, colon or ellipsis |
| `ADD` | `X` | | email |
| `AFX` | `ADJ` | `Hyph=yes` | affix |
| `CC` | `CCONJ` | `ConjType=comp` | conjunction, coordinating |
| `CD` | `NUM` | `NumType=card` | cardinal number |
| `DT` | `DET` | | determiner |
| `EX` | `PRON` | `AdvType=ex` | existential there |
| `FW` | `X` | `Foreign=yes` | foreign word |
| `GW` | `X` | | additional word in multi-word expression |
| `HYPH` | `PUNCT` | `PunctType=dash` | punctuation mark, hyphen |
| `IN` | `ADP` | | conjunction, subordinating or preposition |
| `JJ` | `ADJ` | `Degree=pos` | adjective |
| `JJR` | `ADJ` | `Degree=comp` | adjective, comparative |
| `JJS` | `ADJ` | `Degree=sup` | adjective, superlative |
| `LS` | `X` | `NumType=ord` | list item marker |
| `MD` | `VERB` | `VerbType=mod` | verb, modal auxiliary |
| `NFP` | `PUNCT` | | superfluous punctuation |
| `NIL` | `X` | | missing tag |
| `NN` | `NOUN` | `Number=sing` | noun, singular or mass |
| `NNP` | `PROPN` | `NounType=prop Number=sing` | noun, proper singular |
| `NNPS` | `PROPN` | `NounType=prop Number=plur` | noun, proper plural |
| `NNS` | `NOUN` | `Number=plur` | noun, plural |
| `PDT` | `DET` | | predeterminer |
| `POS` | `PART` | `Poss=yes` | possessive ending |
| `PRP` | `PRON` | `PronType=prs` | pronoun, personal |
| `PRP$` | `DET` | `PronType=prs Poss=yes` | pronoun, possessive |
| `RB` | `ADV` | `Degree=pos` | adverb |
| `RBR` | `ADV` | `Degree=comp` | adverb, comparative |
| `RBS` | `ADV` | `Degree=sup` | adverb, superlative |
| `RP` | `ADP` | | adverb, particle |
| `SP` | `SPACE` | | space |
| `SYM` | `SYM` | | symbol |
| `TO` | `PART` | `PartType=inf VerbForm=inf` | infinitival "to" |
| `UH` | `INTJ` | | interjection |
| `VB` | `VERB` | `VerbForm=inf` | verb, base form |
| `VBD` | `VERB` | `VerbForm=fin Tense=past` | verb, past tense |
| `VBG` | `VERB` | `VerbForm=part Tense=pres Aspect=prog` | verb, gerund or present participle |
| `VBN` | `VERB` | `VerbForm=part Tense=past Aspect=perf` | verb, past participle |
| `VBP` | `VERB` | `VerbForm=fin Tense=pres` | verb, non-3rd person singular present |
| `VBZ` | `VERB` | `VerbForm=fin Tense=pres Number=sing Person=three` | verb, 3rd person singular present |
| `WDT` | `DET` | | wh-determiner |
| `WP` | `PRON` | | wh-pronoun, personal |
| `WP$` | `DET` | `Poss=yes` | wh-pronoun, possessive |
| `WRB` | `ADV` | | wh-adverb |
| `XX` | `X` | | unknown |
| `_SP` | `SPACE` | | |

</Accordion>

<Accordion title="German" id="pos-de">

The German part-of-speech tagger uses the
[TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html)
annotation scheme. We also map the tags to the simpler Universal Dependencies
v2 POS tag set.

| Tag | POS | Morphology | Description |
| --- | --- | --- | --- |
| `$(` | `PUNCT` | `PunctType=brck` | other sentence-internal punctuation mark |
| `$,` | `PUNCT` | `PunctType=comm` | comma |
| `$.` | `PUNCT` | `PunctType=peri` | sentence-final punctuation mark |
| `ADJA` | `ADJ` | | adjective, attributive |
| `ADJD` | `ADJ` | | adjective, adverbial or predicative |
| `ADV` | `ADV` | | adverb |
| `APPO` | `ADP` | `AdpType=post` | postposition |
| `APPR` | `ADP` | `AdpType=prep` | preposition; circumposition left |
| `APPRART` | `ADP` | `AdpType=prep PronType=art` | preposition with article |
| `APZR` | `ADP` | `AdpType=circ` | circumposition right |
| `ART` | `DET` | `PronType=art` | definite or indefinite article |
| `CARD` | `NUM` | `NumType=card` | cardinal number |
| `FM` | `X` | `Foreign=yes` | foreign language material |
| `ITJ` | `INTJ` | | interjection |
| `KOKOM` | `CCONJ` | `ConjType=comp` | comparative conjunction |
| `KON` | `CCONJ` | | coordinate conjunction |
| `KOUI` | `SCONJ` | | subordinate conjunction with "zu" and infinitive |
| `KOUS` | `SCONJ` | | subordinate conjunction with sentence |
| `NE` | `PROPN` | | proper noun |
| `NN` | `NOUN` | | noun, singular or mass |
| `NNE` | `PROPN` | | proper noun |
| `PDAT` | `DET` | `PronType=dem` | attributive demonstrative pronoun |
| `PDS` | `PRON` | `PronType=dem` | substituting demonstrative pronoun |
| `PIAT` | `DET` | `PronType=ind|neg|tot` | attributive indefinite pronoun without determiner |
| `PIS` | `PRON` | `PronType=ind|neg|tot` | substituting indefinite pronoun |
| `PPER` | `PRON` | `PronType=prs` | non-reflexive personal pronoun |
| `PPOSAT` | `DET` | `Poss=yes PronType=prs` | attributive possessive pronoun |
| `PPOSS` | `PRON` | `Poss=yes PronType=prs` | substituting possessive pronoun |
| `PRELAT` | `DET` | `PronType=rel` | attributive relative pronoun |
| `PRELS` | `PRON` | `PronType=rel` | substituting relative pronoun |
| `PRF` | `PRON` | `PronType=prs Reflex=yes` | reflexive personal pronoun |
| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb |
| `PTKA` | `PART` | | particle with adjective or adverb |
| `PTKANT` | `PART` | `PartType=res` | answer particle |
| `PTKNEG` | `PART` | `Polarity=neg` | negative particle |
| `PTKVZ` | `ADP` | `PartType=vbp` | separable verbal particle |
| `PTKZU` | `PART` | `PartType=inf` | "zu" before infinitive |
| `PWAT` | `DET` | `PronType=int` | attributive interrogative pronoun |
| `PWAV` | `ADV` | `PronType=int` | adverbial interrogative or relative pronoun |
| `PWS` | `PRON` | `PronType=int` | substituting interrogative pronoun |
| `TRUNC` | `X` | `Hyph=yes` | word remnant |
| `VAFIN` | `AUX` | `Mood=ind VerbForm=fin` | finite verb, auxiliary |
| `VAIMP` | `AUX` | `Mood=imp VerbForm=fin` | imperative, auxiliary |
| `VAINF` | `AUX` | `VerbForm=inf` | infinitive, auxiliary |
| `VAPP` | `AUX` | `Aspect=perf VerbForm=part` | perfect participle, auxiliary |
| `VMFIN` | `VERB` | `Mood=ind VerbForm=fin VerbType=mod` | finite verb, modal |
| `VMINF` | `VERB` | `VerbForm=inf VerbType=mod` | infinitive, modal |
| `VMPP` | `VERB` | `Aspect=perf VerbForm=part VerbType=mod` | perfect participle, modal |
| `VVFIN` | `VERB` | `Mood=ind VerbForm=fin` | finite verb, full |
| `VVIMP` | `VERB` | `Mood=imp VerbForm=fin` | imperative, full |
| `VVINF` | `VERB` | `VerbForm=inf` | infinitive, full |
| `VVIZU` | `VERB` | `VerbForm=inf` | infinitive with "zu", full |
| `VVPP` | `VERB` | `Aspect=perf VerbForm=part` | perfect participle, full |
| `XY` | `X` | | non-word containing non-letter |
| `_SP` | `SPACE` | | |

</Accordion>

---

<Infobox title="Annotation schemes for other models">

For the label schemes used by the other models, see the respective `tag_map.py`
in [`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang).

</Infobox>

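The tip above points to `spacy.explain` for tag descriptions, and the dependency and entity sections below use it the same way. A quick illustration using values named in this file:

```python
import spacy

print(spacy.explain("RB"))   # "adverb"
print(spacy.explain("prt"))  # "particle"
print(spacy.explain("GPE"))  # "Countries, cities, states"
```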
## Syntactic Dependency Parsing {#dependency-parsing}

> #### Tip: Understanding labels
>
> You can also use `spacy.explain` to get the description for the string
> representation of a label. For example, `spacy.explain("prt")` will return
> "particle".

This section lists the syntactic dependency labels assigned by spaCy's
[models](/models). The individual labels are language-specific and depend on the
training corpus.

<Accordion title="Universal Dependency Labels" id="dependency-parsing-universal">

The [Universal Dependencies scheme](http://universaldependencies.org/u/dep/) is
used in all languages trained on Universal Dependency Corpora.

| Label | Description |
| --- | --- |
| `acl` | clausal modifier of noun (adjectival clause) |
| `advcl` | adverbial clause modifier |
| `advmod` | adverbial modifier |
| `amod` | adjectival modifier |
| `appos` | appositional modifier |
| `aux` | auxiliary |
| `case` | case marking |
| `cc` | coordinating conjunction |
| `ccomp` | clausal complement |
| `clf` | classifier |
| `compound` | compound |
| `conj` | conjunct |
| `cop` | copula |
| `csubj` | clausal subject |
| `dep` | unspecified dependency |
| `det` | determiner |
| `discourse` | discourse element |
| `dislocated` | dislocated elements |
| `expl` | expletive |
| `fixed` | fixed multiword expression |
| `flat` | flat multiword expression |
| `goeswith` | goes with |
| `iobj` | indirect object |
| `list` | list |
| `mark` | marker |
| `nmod` | nominal modifier |
| `nsubj` | nominal subject |
| `nummod` | numeric modifier |
| `obj` | object |
| `obl` | oblique nominal |
| `orphan` | orphan |
| `parataxis` | parataxis |
| `punct` | punctuation |
| `reparandum` | overridden disfluency |
| `root` | root |
| `vocative` | vocative |
| `xcomp` | open clausal complement |

</Accordion>

<Accordion title="English" id="dependency-parsing-english">

The English dependency labels use the
[CLEAR Style](https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md)
by [ClearNLP](http://www.clearnlp.com).

| Label | Description |
| --- | --- |
| `acl` | clausal modifier of noun (adjectival clause) |
| `acomp` | adjectival complement |
| `advcl` | adverbial clause modifier |
| `advmod` | adverbial modifier |
| `agent` | agent |
| `amod` | adjectival modifier |
| `appos` | appositional modifier |
| `attr` | attribute |
| `aux` | auxiliary |
| `auxpass` | auxiliary (passive) |
| `case` | case marking |
| `cc` | coordinating conjunction |
| `ccomp` | clausal complement |
| `compound` | compound |
| `conj` | conjunct |
| `cop` | copula |
| `csubj` | clausal subject |
| `csubjpass` | clausal subject (passive) |
| `dative` | dative |
| `dep` | unclassified dependent |
| `det` | determiner |
| `dobj` | direct object |
| `expl` | expletive |
| `intj` | interjection |
| `mark` | marker |
| `meta` | meta modifier |
| `neg` | negation modifier |
| `nn` | noun compound modifier |
| `nounmod` | modifier of nominal |
| `npmod` | noun phrase as adverbial modifier |
| `nsubj` | nominal subject |
| `nsubjpass` | nominal subject (passive) |
| `nummod` | numeric modifier |
| `oprd` | object predicate |
| `obj` | object |
| `obl` | oblique nominal |
| `parataxis` | parataxis |
| `pcomp` | complement of preposition |
| `pobj` | object of preposition |
| `poss` | possession modifier |
| `preconj` | pre-correlative conjunction |
| `prep` | prepositional modifier |
| `prt` | particle |
| `punct` | punctuation |
| `quantmod` | modifier of quantifier |
| `relcl` | relative clause modifier |
| `root` | root |
| `xcomp` | open clausal complement |

</Accordion>

<Accordion title="German" id="dependency-parsing-german">

The German dependency labels use the
[TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html)
annotation scheme.

| Label | Description |
| --- | --- |
| `ac` | adpositional case marker |
| `adc` | adjective component |
| `ag` | genitive attribute |
| `ams` | measure argument of adjective |
| `app` | apposition |
| `avc` | adverbial phrase component |
| `cc` | comparative complement |
| `cd` | coordinating conjunction |
| `cj` | conjunct |
| `cm` | comparative conjunction |
| `cp` | complementizer |
| `cvc` | collocational verb construction |
| `da` | dative |
| `dm` | discourse marker |
| `ep` | expletive es |
| `ju` | junctor |
| `mnr` | postnominal modifier |
| `mo` | modifier |
| `ng` | negation |
| `nk` | noun kernel element |
| `nmc` | numerical component |
| `oa` | accusative object |
| `oa2` | second accusative object |
| `oc` | clausal object |
| `og` | genitive object |
| `op` | prepositional object |
| `par` | parenthetical element |
| `pd` | predicate |
| `pg` | phrasal genitive |
| `ph` | placeholder |
| `pm` | morphological particle |
| `pnc` | proper noun component |
| `punct` | punctuation |
| `rc` | relative clause |
| `re` | repeated element |
| `rs` | reported speech |
| `sb` | subject |
| `sbp` | passivized subject (PP) |
| `sp` | subject or predicate |
| `svp` | separable verb prefix |
| `uc` | unit component |
| `vo` | vocative |
| `ROOT` | root |

</Accordion>

## Named Entity Recognition {#named-entities}

> #### Tip: Understanding entity types
>
> You can also use `spacy.explain` to get the description for the string
> representation of an entity label. For example, `spacy.explain("LANGUAGE")`
> will return "any named language".

Models trained on the [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19)
corpus support the following entity types:

| Type | Description |
| --- | --- |
| `PERSON` | People, including fictional. |
| `NORP` | Nationalities or religious or political groups. |
| `FAC` | Buildings, airports, highways, bridges, etc. |
| `ORG` | Companies, agencies, institutions, etc. |
| `GPE` | Countries, cities, states. |
| `LOC` | Non-GPE locations, mountain ranges, bodies of water. |
| `PRODUCT` | Objects, vehicles, foods, etc. (Not services.) |
| `EVENT` | Named hurricanes, battles, wars, sports events, etc. |
| `WORK_OF_ART` | Titles of books, songs, etc. |
| `LAW` | Named documents made into laws. |
| `LANGUAGE` | Any named language. |
| `DATE` | Absolute or relative dates or periods. |
| `TIME` | Times smaller than a day. |
| `PERCENT` | Percentage, including "%". |
| `MONEY` | Monetary values, including unit. |
| `QUANTITY` | Measurements, as of weight or distance. |
| `ORDINAL` | "first", "second", etc. |
| `CARDINAL` | Numerals that do not fall under another type. |

### Wikipedia scheme {#ner-wikipedia-scheme}

Models trained on Wikipedia corpus
([Nothman et al., 2013](http://www.sciencedirect.com/science/article/pii/S0004370212000276))
use a less fine-grained NER annotation scheme and recognise the following
entities:

| Type | Description |
| --- | --- |
| `PER` | Named person or family. |
| `LOC` | Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains). |
| `ORG` | Named corporate, governmental, or other organizational entity. |
| `MISC` | Miscellaneous entities, e.g. events, nationalities, products or works of art. |

### IOB Scheme {#iob}

| Tag | ID | Description |
| --- | --- | --- |
| `"I"` | `1` | Token is inside an entity. |
| `"O"` | `2` | Token is outside an entity. |
| `"B"` | `3` | Token begins an entity. |
| `""` | `0` | No entity tag is set (missing value). |

### BILUO Scheme {#biluo}

| Tag | Description |
| --- | --- |
| **`B`**EGIN | The first token of a multi-token entity. |
| **`I`**N | An inner token of a multi-token entity. |
| **`L`**AST | The final token of a multi-token entity. |
| **`U`**NIT | A single-token entity. |
| **`O`**UT | A non-entity token. |

> #### Why BILUO, not IOB?
>
> There are several coding schemes for encoding entity annotations as token
> tags. These coding schemes are equally expressive, but not necessarily equally
> learnable. [Ratinov and Roth](http://www.aclweb.org/anthology/W09-1119) showed
> that the minimal **Begin**, **In**, **Out** scheme was more difficult to learn
> than the **BILUO** scheme that we use, which explicitly marks boundary tokens.

spaCy translates the character offsets into this scheme, in order to decide the
cost of each action given the current state of the entity recognizer. The costs
are then used to calculate the gradient of the loss, to train the model. The
exact algorithm is a pastiche of well-known methods, and is not currently
described in any single publication. The model is a greedy transition-based
parser guided by a linear model whose weights are learned using the averaged
perceptron loss, via the
[dynamic oracle](http://www.aclweb.org/anthology/C12-1059) imitation learning
strategy. The transition system is equivalent to the BILUO tagging scheme.

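To make the BILUO encoding concrete, a short sketch using the `biluo_tags_from_offsets` helper that the training-format section below refers to (v2.x `spacy.gold` API assumed; the sentence and offsets are illustrative):

```python
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp("I like London and Berlin")
entities = [(7, 13, "LOC"), (18, 24, "LOC")]   # character offsets into the text
tags = biluo_tags_from_offsets(doc, entities)
print(tags)  # ["O", "O", "U-LOC", "O", "U-LOC"]
```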
## Models and training data {#training}

### JSON input format for training {#json-input}

spaCy takes training data in JSON format. The built-in
[`convert`](/api/cli#convert) command helps you convert the `.conllu` format
used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies) to
spaCy's training format. To convert one or more existing `Doc` objects to
spaCy's JSON format, you can use the
[`gold.docs_to_json`](/api/goldparse#docs_to_json) helper.

> #### Annotating entities
>
> Named entities are provided in the [BILUO](#biluo) notation. Tokens outside an
> entity are set to `"O"` and tokens that are part of an entity are set to the
> entity label, prefixed by the BILUO marker. For example `"B-ORG"` describes
> the first token of a multi-token `ORG` entity and `"U-PERSON"` a single token
> representing a `PERSON` entity. The
> [`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) function
> can help you convert entity offsets to the right format.

```python
### Example structure
[{
    "id": int,                      # ID of the document within the corpus
    "paragraphs": [{                # list of paragraphs in the corpus
        "raw": string,              # raw text of the paragraph
        "sentences": [{             # list of sentences in the paragraph
            "tokens": [{            # list of tokens in the sentence
                "id": int,          # index of the token in the document
                "dep": string,      # dependency label
                "head": int,        # offset of token head relative to token index
                "tag": string,      # part-of-speech tag
                "orth": string,     # verbatim text of the token
                "ner": string       # BILUO label, e.g. "O" or "B-ORG"
            }],
            "brackets": [{          # phrase structure (NOT USED by current models)
                "first": int,       # index of first token
                "last": int,        # index of last token
                "label": string     # phrase label
            }]
        }],
        "cats": [{                  # new in v2.2: categories for text classifier
            "label": string,        # text category label
            "value": float / bool   # label applies (1.0/true) or not (0.0/false)
        }]
    }]
}]
```

Here's an example of dependencies, part-of-speech tags and named entities, taken
from the English Wall Street Journal portion of the Penn Treebank:

```json
https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json
```

### Lexical data for vocabulary {#vocab-jsonl new="2"}

To populate a model's vocabulary, you can use the
[`spacy init-model`](/api/cli#init-model) command and load in a
[newline-delimited JSON](http://jsonlines.org/) (JSONL) file containing one
lexical entry per line via the `--jsonl-loc` option. The first line defines the
language and vocabulary settings. All other lines are expected to be JSON
objects describing an individual lexeme. The lexical attributes will be then set
as attributes on spaCy's [`Lexeme`](/api/lexeme#attributes) object. The `vocab`
command outputs a ready-to-use spaCy model with a `Vocab` containing the lexical
data.

```python
### First line
{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}
```

```python
### Entry structure
{
    "orth": string,     # the word text
    "id": int,          # can correspond to row in vectors table
    "lower": string,
    "norm": string,
    "shape": string,
    "prefix": string,
    "suffix": string,
    "length": int,
    "cluster": string,
    "prob": float,
    "is_alpha": bool,
    "is_ascii": bool,
    "is_digit": bool,
    "is_lower": bool,
    "is_punct": bool,
    "is_space": bool,
    "is_title": bool,
    "is_upper": bool,
    "like_url": bool,
    "like_num": bool,
    "like_email": bool,
    "is_stop": bool,
    "is_oov": bool,
    "is_quote": bool,
    "is_left_punct": bool,
    "is_right_punct": bool
}
```

Here's an example of the 20 most frequent lexemes in the English training data:

```json
https://github.com/explosion/spaCy/tree/master/examples/training/vocab-data.jsonl
```
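As a companion to the JSON training format described above, a hedged sketch of producing it from existing `Doc` objects with the `docs_to_json` helper the text mentions (v2.x `spacy.gold` API assumed; model name and output path are illustrative only):

```python
import spacy
import srsly
from spacy.gold import docs_to_json

nlp = spacy.load("en_core_web_sm")     # any model with a tagger and parser
docs = [nlp("Hello world."), nlp("This is a sentence.")]
corpus_entry = docs_to_json(docs, id=0)            # one document in the format above
srsly.write_json("./training-data.json", [corpus_entry])
```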
7
website/docs/api/architectures.md
Normal file

@@ -0,0 +1,7 @@
---
title: Model Architectures
teaser: Pre-defined model architectures included with the core library
source: spacy/ml/models
---

TODO: write
website/docs/api/cli.md

@@ -4,7 +4,6 @@ teaser: Download, train and package models, and debug spaCy
 source: spacy/cli
 menu:
   - ['Download', 'download']
-  - ['Link', 'link']
   - ['Info', 'info']
   - ['Validate', 'validate']
   - ['Convert', 'convert']

@@ -14,20 +13,19 @@ menu:
   - ['Init Model', 'init-model']
   - ['Evaluate', 'evaluate']
   - ['Package', 'package']
+  - ['Project', 'project']
 ---
 
-As of v1.7.0, spaCy comes with new command line helpers to download and link
-models and show useful debugging information. For a list of available commands,
-type `spacy --help`.
+For a list of available commands, type `spacy --help`.
+
+<!-- TODO: add notes on autocompletion etc. -->
 
 ## Download {#download}
 
 Download [models](/usage/models) for spaCy. The downloader finds the
-best-matching compatible version, uses `pip install` to download the model as a
-package and creates a [shortcut link](/usage/models#usage) if the model was
-downloaded via a shortcut. Direct downloads don't perform any compatibility
-checks and require the model name to be specified with its version (e.g.
-`en_core_web_sm-2.2.0`).
+best-matching compatible version and uses `pip install` to download the model as
+a package. Direct downloads don't perform any compatibility checks and require
+the model name to be specified with its version (e.g. `en_core_web_sm-2.2.0`).
 
 > #### Downloading best practices
 >
@@ -43,42 +41,13 @@ checks and require the model name to be specified with its version (e.g.
 $ python -m spacy download [model] [--direct] [pip args]
 ```
 
 | Argument | Type | Description |
 | --- | --- | --- |
-| `model` | positional | Model name or shortcut (`en`, `de`, `en_core_web_sm`). |
+| `model` | positional | Model name, e.g. `en_core_web_sm`. |
 | `--direct`, `-d` | flag | Force direct download of exact model version. |
 | pip args <Tag variant="new">2.1</Tag> | - | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory or `--no-deps` to not install model dependencies. |
 | `--help`, `-h` | flag | Show help message and available arguments. |
-| **CREATES** | directory, symlink | The installed model package in your `site-packages` directory and a shortcut link as a symlink in `spacy/data` if installed via shortcut. |
-
-## Link {#link}
-
-Create a [shortcut link](/usage/models#usage) for a model, either a Python
-package or a local directory. This will let you load models from any location
-using a custom name via [`spacy.load()`](/api/top-level#spacy.load).
-
-<Infobox title="Important note" variant="warning">
-
-In spaCy v1.x, you had to use the model data directory to set up a shortcut link
-for a local path. As of v2.0, spaCy expects all shortcut links to be **loadable
-model packages**. If you want to load a data directory, call
-[`spacy.load()`](/api/top-level#spacy.load) or
-[`Language.from_disk()`](/api/language#from_disk) with the path, or use the
-[`package`](/api/cli#package) command to create a model package.
-
-</Infobox>
-
-```bash
-$ python -m spacy link [origin] [link_name] [--force]
-```
-
-| Argument | Type | Description |
-| --- | --- | --- |
-| `origin` | positional | Model name if package, or path to local directory. |
-| `link_name` | positional | Name of the shortcut link to create. |
-| `--force`, `-f` | flag | Force overwriting of existing link. |
-| `--help`, `-h` | flag | Show help message and available arguments. |
-| **CREATES** | symlink | A shortcut link of the given name as a symlink in `spacy/data`. |
+| **CREATES** | directory | The installed model package in your `site-packages` directory. |
 
 ## Info {#info}
 
@@ -94,30 +63,28 @@ $ python -m spacy info [--markdown] [--silent]
 $ python -m spacy info [model] [--markdown] [--silent]
 ```
 
 | Argument | Type | Description |
 | --- | --- | --- |
-| `model` | positional | A model, i.e. shortcut link, package name or path (optional). |
+| `model` | positional | A model, i.e. package name or path (optional). |
 | `--markdown`, `-md` | flag | Print information as Markdown. |
 | `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | flag | Don't print anything, just return the values. |
 | `--help`, `-h` | flag | Show help message and available arguments. |
 | **PRINTS** | `stdout` | Information about your spaCy installation. |
 
 ## Validate {#validate new="2"}
 
-Find all models installed in the current environment (both packages and shortcut
-links) and check whether they are compatible with the currently installed
-version of spaCy. Should be run after upgrading spaCy via `pip install -U spacy`
-to ensure that all installed models are can be used with the new version. The
-command is also useful to detect out-of-sync model links resulting from links
-created in different virtual environments. It will show a list of models and
-their installed versions. If any model is out of date, the latest compatible
-versions and command for updating are shown.
+Find all models installed in the current environment and check whether they are
+compatible with the currently installed version of spaCy. Should be run after
+upgrading spaCy via `pip install -U spacy` to ensure that all installed models
+can be used with the new version. It will show a list of models and their
+installed versions. If any model is out of date, the latest compatible versions
+and command for updating are shown.
 
 > #### Automated validation
 >
 > You can also use the `validate` command as part of your build process or test
 > suite, to ensure all models are up to date before proceeding. If incompatible
-> models or shortcut links are found, it will return `1`.
+> models are found, it will return `1`.
 
 ```bash
 $ python -m spacy validate
@ -129,50 +96,42 @@ $ python -m spacy validate
|
||||||
|
|
||||||
## Convert {#convert}
|
## Convert {#convert}
|
||||||
|
|
||||||
Convert files into spaCy's [JSON format](/api/annotation#json-input) for use
|
Convert files into spaCy's
|
||||||
with the `train` command and other experiment management functions. The
|
[binary training data format](/api/data-formats#binary-training), a serialized
|
||||||
converter can be specified on the command line, or chosen based on the file
|
[`DocBin`](/api/docbin), for use with the `train` command and other experiment
|
||||||
extension of the input file.
|
management functions. The converter can be specified on the command line, or
|
||||||
|
chosen based on the file extension of the input file.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
$ python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
|
$ python -m spacy convert [input_file] [output_dir] [--converter]
|
||||||
[--n-sents] [--morphology] [--lang]
|
[--file-type] [--n-sents] [--seg-sents] [--model] [--morphology]
|
||||||
|
[--merge-subtokens] [--ner-map] [--lang]
|
||||||
```
|
```
|
||||||
|
|
||||||
| Argument | Type | Description |
|
| Argument | Type | Description |
|
||||||
| ------------------------------------------------ | ---------- | ------------------------------------------------------------------------------------------------- |
|
| ------------------------------------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `input_file` | positional | Input file. |
|
| `input_file` | positional | Input file. |
|
||||||
| `output_dir` | positional | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. |
|
| `output_dir` | positional | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. |
|
||||||
| `--file-type`, `-t` <Tag variant="new">2.1</Tag> | option | Type of file to create (see below). |
|
| `--converter`, `-c` <Tag variant="new">2</Tag> | option | Name of converter to use (see below). |
|
||||||
| `--converter`, `-c` <Tag variant="new">2</Tag> | option | Name of converter to use (see below). |
|
| `--file-type`, `-t` <Tag variant="new">2.1</Tag> | option | Type of file to create. Either `spacy` (default) for binary [`DocBin`](/api/docbin) data or `json` for v2.x JSON format. |
|
||||||
| `--n-sents`, `-n` | option | Number of sentences per document. |
|
| `--n-sents`, `-n` | option | Number of sentences per document. |
|
||||||
| `--seg-sents`, `-s` <Tag variant="new">2.2</Tag> | flag | Segment sentences (for `-c ner`) |
|
| `--seg-sents`, `-s` <Tag variant="new">2.2</Tag> | flag | Segment sentences (for `-c ner`) |
|
||||||
| `--model`, `-b` <Tag variant="new">2.2</Tag> | option | Model for parser-based sentence segmentation (for `-s`) |
|
| `--model`, `-b` <Tag variant="new">2.2</Tag> | option | Model for parser-based sentence segmentation (for `-s`) |
|
||||||
| `--morphology`, `-m` | option | Enable appending morphology to tags. |
|
| `--morphology`, `-m` | option | Enable appending morphology to tags. |
|
||||||
| `--lang`, `-l` <Tag variant="new">2.1</Tag> | option | Language code (if tokenizer required). |
|
| `--ner-map`, `-nm` | option | NER tag mapping (as JSON-encoded dict of entity types). |
|
||||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
| `--lang`, `-l` <Tag variant="new">2.1</Tag> | option | Language code (if tokenizer required). |
|
||||||
| **CREATES** | JSON | Data in spaCy's [JSON format](/api/annotation#json-input). |
|
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||||
|
| **CREATES** | binary | Binary [`DocBin`](/api/docbin) training data that can be used with [`spacy train`](/api/cli#train). |
|
||||||
|
|
||||||
### Converters

| ID | Description |
| ------- | ----------- |
| `auto` | Automatically pick converter based on file extension and file content (default). |
| `json` | JSON-formatted training data used in spaCy v2.x and produced by [`docs_to_json`](/api/top-level#docs_to_json). |
| `conll` | Universal Dependencies `.conllu` or `.conll` format. |
| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |

## Debug data {#debug-data new="2.2"}

@@ -181,20 +140,21 @@ stats, and find problems like invalid entity annotations, cyclic dependencies,
low data labels and more.

```bash
$ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model]
[--pipeline] [--tag-map-path] [--ignore-warnings] [--verbose] [--no-format]
```

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `lang` | positional | Model language. |
| `train_path` | positional | Location of [binary training data](/usage/training#data-format). Can be a file or a directory of files. |
| `dev_path` | positional | Location of [binary development data](/usage/training#data-format) for evaluation. Can be a file or a directory of files. |
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.4</Tag> | option | Location of JSON-formatted tag map. |
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
| `--verbose`, `-V` | flag | Print additional information and explanations. |
| `--no-format`, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |

<Accordion title="Example output">

@@ -337,21 +297,14 @@ will not be available.

## Train {#train}

<!-- TODO: document new training -->

Train a model. Expects data in spaCy's
[JSON format](/api/data-formats#json-input). On each epoch, a model will be
saved out to the directory. Accuracy scores and model details will be added to a
[`meta.json`](/usage/training#models-generating) to allow packaging the model
using the [`package`](/api/cli#package) command.

<Infobox title="Changed in v2.1" variant="warning">

As of spaCy 2.1, the `--no-tagger`, `--no-parser` and `--no-entities` flags have
been replaced by a `--pipeline` option, which lets you define comma-separated
names of pipeline components to train. For example, `--pipeline tagger,parser`
will only train the tagger and parser.

</Infobox>

```bash
$ python -m spacy train [lang] [output_path] [train_path] [dev_path]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping]
```

@@ -399,47 +352,10 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `--help`, `-h` | flag | Show help message and available arguments. |
| **CREATES** | model, pickle | A spaCy model on each epoch. |

### Environment variables for hyperparameters {#train-hyperparams new="2"}

spaCy lets you set hyperparameters for training via environment variables. For
example:

```bash
$ token_vector_width=256 learn_rate=0.0001 spacy train [...]
```

> #### Usage with alias
>
> Environment variables keep the command simple and allow you to
> [create an alias](https://askubuntu.com/questions/17536/how-do-i-create-a-permanent-bash-alias/17537#17537)
> for your custom `train` command while still being able to easily tweak the
> hyperparameters.
>
> ```bash
> alias train-parser="python -m spacy train en /output /data /train /dev -n 1000"
> token_vector_width=256 train-parser
> ```

| Name | Description | Default |
| -------------------- | --------------------------------------------------- | ------- |
| `dropout_from` | Initial dropout rate. | `0.2` |
| `dropout_to` | Final dropout rate. | `0.2` |
| `dropout_decay` | Rate of dropout change. | `0.0` |
| `batch_from` | Initial batch size. | `1` |
| `batch_to` | Final batch size. | `64` |
| `batch_compound` | Rate of batch size acceleration. | `1.001` |
| `token_vector_width` | Width of embedding tables and convolutional layers. | `128` |
| `embed_size` | Number of rows in embedding tables. | `7500` |
| `hidden_width` | Size of the parser's and NER's hidden layers. | `128` |
| `learn_rate` | Learning rate. | `0.001` |
| `optimizer_B1` | Momentum for the Adam solver. | `0.9` |
| `optimizer_B2` | Adagrad-momentum for the Adam solver. | `0.999` |
| `optimizer_eps` | Epsilon value for the Adam solver. | `1e-08` |
| `L2_penalty` | L2 regularization penalty. | `1e-06` |
| `grad_norm_clip` | Gradient L2 norm constraint. | `1.0` |

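The `batch_from`, `batch_to` and `batch_compound` settings describe a geometric batch-size schedule. The sketch below shows how such a schedule behaves, using the `compounding` helper from `spacy.util` as it exists in spaCy v2.x; the printed values are only illustrative.

```python
from itertools import islice
from spacy.util import compounding

# Geometric schedule: start at 1, multiply by 1.001 each step, cap at 64
batch_sizes = compounding(1.0, 64.0, 1.001)
print(list(islice(batch_sizes, 4)))  # approximately [1.0, 1.001, 1.002, 1.003]
```

In training loops this kind of schedule is typically passed to `spacy.util.minibatch` via its `size` argument.
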
## Pretrain {#pretrain new="2.1" tag="experimental"}

<!-- TODO: document new pretrain command and link to new pretraining docs -->

Pre-train the "token to vector" (`tok2vec`) layer of pipeline components, using
an approximate language-modeling objective. Specifically, we load pretrained
vectors, and train a component like a CNN, BiLSTM, etc. to predict vectors which

@@ -522,20 +438,10 @@ tokenization can be provided.

## Init Model {#init-model new="2"}

Create a new model directory from raw data, like word frequencies, Brown
clusters and word vectors. This command is similar to the `spacy model` command
in v1.x. Note that in order to populate the model's vocab, you need to pass in a
JSONL-formatted [vocabulary file](/api/data-formats#vocab-jsonl) as
`--jsonl-loc` with optional `id` values that correspond to the vectors table.
Just loading in vectors will not automatically populate the vocab.

<Infobox title="Deprecation note" variant="warning">

As of v2.1.0, the `--freqs-loc` and `--clusters-loc` options are deprecated and
have been replaced with the `--jsonl-loc` argument, which lets you pass in a
[JSONL](http://jsonlines.org/) file containing one lexical entry per line. For
more details on the format, see the
[vocabulary data specs](/api/data-formats#vocab-jsonl).

</Infobox>

```bash
$ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]
[--prune-vectors]
```

@@ -545,7 +451,7 @@ $ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `lang` | positional | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. |
| `output_dir` | positional | Model output directory. Will be created if it doesn't exist. |
| `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted [vocabulary file](/api/data-formats#vocab-jsonl) with lexical attributes. |
| `--vectors-loc`, `-v` | option | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. |
| `--truncate-vectors`, `-t` <Tag variant="new">2.3</Tag> | option | Number of vectors to truncate to when reading in vectors file. Defaults to `0` for no truncation. |
| `--prune-vectors`, `-V` | option | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. |

@@ -555,6 +461,8 @@ $ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]

## Evaluate {#evaluate new="2"}

<!-- TODO: document new evaluate command -->

Evaluate a model's accuracy and speed on JSON-formatted annotated data. Will
print the results and optionally export
[displaCy visualizations](/usage/visualizers) of a sample set of parses to

@@ -569,7 +477,7 @@ $ python -m spacy evaluate [model] [data_path] [--displacy-path] [--displacy-lim

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `model` | positional | Model to evaluate. Can be a package or a path to a model data directory. |
| `data_path` | positional | Location of JSON-formatted evaluation data. |
| `--displacy-path`, `-dp` | option | Directory to output rendered parses as HTML. If not set, no visualizations will be generated. |
| `--displacy-limit`, `-dl` | option | Number of parses to generate per file. Defaults to `25`. Keep in mind that a significantly higher number might cause the `.html` files to render slowly. |

@@ -580,12 +488,20 @@

## Package {#package}

Generate an installable
[model Python package](/usage/training#models-generating) from an existing model
data directory. All data files are copied over. If the path to a `meta.json` is
supplied, or a `meta.json` is found in the input directory, this file is used.
Otherwise, the data can be entered directly from the command line. spaCy will
then create a `.tar.gz` archive file that you can distribute and install with
`pip install`.

<Infobox title="New in v3.0" variant="warning">

The `spacy package` command now also builds the `.tar.gz` archive automatically,
so you don't have to run `python setup.py sdist` separately anymore.

</Infobox>

```bash
$ python -m spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] [--force]
```

@@ -595,7 +511,6 @@ $ python -m spacy package [input_dir] [output_dir] [--meta-path] [--create-meta]

```bash
### Example
python -m spacy package /input /output
cd /output/en_model-0.0.0
pip install dist/en_model-0.0.0.tar.gz
```

@@ -605,6 +520,23 @@ pip install dist/en_model-0.0.0.tar.gz

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `output_dir` | positional | Directory to create package folder in. |
| `--meta-path`, `-m` <Tag variant="new">2</Tag> | option | Path to `meta.json` file (optional). |
| `--create-meta`, `-c` <Tag variant="new">2</Tag> | flag | Create a `meta.json` file on the command line, even if one already exists in the directory. If an existing file is found, its entries will be shown as the defaults in the command line prompt. |
| `--version`, `-v` <Tag variant="new">3</Tag> | option | Package version to override in meta. Useful when training new versions, as it doesn't require editing the meta template. |
| `--force`, `-f` | flag | Force overwriting of existing folder in output directory. |
| `--help`, `-h` | flag | Show help message and available arguments. |
| **CREATES** | directory | A Python package containing the spaCy model. |

## Project {#project}

<!-- TODO: document project command and subcommands. We should probably wait and only finalize this once we've finalized the design -->

### project clone {#project-clone}

### project assets {#project-assets}

### project run-all {#project-run-all}

### project run {#project-run}

### project init {#project-init}

### project update-dvc {#project-update-dvc}

37 website/docs/api/corpus.md Normal file
@@ -0,0 +1,37 @@

---
title: Corpus
teaser: An annotated corpus
tag: class
source: spacy/gold/corpus.py
new: 3
---

This class manages annotated corpora and can read training and development
datasets in the [DocBin](/api/docbin) (`.spacy`) format.

## Corpus.\_\_init\_\_ {#init tag="method"}

Create a `Corpus`. The input data can be a file or a directory of files.

| Name | Type | Description |
| ----------- | ------------ | ----------- |
| `train` | str / `Path` | Training data (`.spacy` file or directory of `.spacy` files). |
| `dev` | str / `Path` | Development data (`.spacy` file or directory of `.spacy` files). |
| `limit` | int | Maximum number of examples returned. |
| **RETURNS** | `Corpus` | The newly constructed object. |

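As a rough sketch of how the constructor above might be used: the import path is inferred from the `source` field, and the `.spacy` file paths are invented. The v3 API shown here is experimental and may still change.

```python
from spacy.gold import Corpus  # assumed import path, based on spacy/gold/corpus.py

# Hypothetical .spacy files (a directory of .spacy files would also work)
corpus = Corpus("./corpus/train.spacy", "./corpus/dev.spacy")
```
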
<!-- TODO: document remaining methods / decide which to document -->

## Corpus.walk_corpus {#walk_corpus tag="staticmethod"}

## Corpus.make_examples {#make_examples tag="method"}

## Corpus.make_examples_gold_preproc {#make_examples_gold_preproc tag="method"}

## Corpus.read_docbin {#read_docbin tag="method"}

## Corpus.count_train {#count_train tag="method"}

## Corpus.train_dataset {#train_dataset tag="method"}

## Corpus.dev_dataset {#dev_dataset tag="method"}

130 website/docs/api/data-formats.md Normal file
@@ -0,0 +1,130 @@

---
title: Data formats
teaser: Details on spaCy's input and output data formats
menu:
  - ['Training data', 'training']
  - ['Vocabulary', 'vocab']
---

This section documents input and output formats of data used by spaCy, including
training data and lexical vocabulary data. For an overview of label schemes used
by the models, see the [models directory](/models). Each model documents the
label schemes used in its components, depending on the data it was trained on.

## Training data {#training}

### Binary training format {#binary-training new="3"}

<!-- TODO: document DocBin format -->

### JSON input format for training {#json-input}

spaCy takes training data in JSON format. The built-in
[`convert`](/api/cli#convert) command helps you convert the `.conllu` format
used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies) to
spaCy's training format. To convert one or more existing `Doc` objects to
spaCy's JSON format, you can use the
[`gold.docs_to_json`](/api/top-level#docs_to_json) helper.

> #### Annotating entities
>
> Named entities are provided in the
> [BILUO](/usage/linguistic-features#accessing-ner) notation. Tokens outside an
> entity are set to `"O"` and tokens that are part of an entity are set to the
> entity label, prefixed by the BILUO marker. For example `"B-ORG"` describes
> the first token of a multi-token `ORG` entity and `"U-PERSON"` a single token
> representing a `PERSON` entity. The
> [`biluo_tags_from_offsets`](/api/top-level#biluo_tags_from_offsets) function
> can help you convert entity offsets to the right format.

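To illustrate the conversion described in the note above, here is a minimal sketch using `biluo_tags_from_offsets` as it exists in spaCy v2.x; the sentence and entity offsets are made up.

```python
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp("I flew to San Francisco Bay")
# Entities as (start_char, end_char, label) offsets
entities = [(10, 27, "LOC")]
tags = biluo_tags_from_offsets(doc, entities)
# Tokens outside the entity are "O", entity tokens get B-/I-/L- prefixes
assert tags == ["O", "O", "O", "B-LOC", "I-LOC", "L-LOC"]
```
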
```python
### Example structure
[{
    "id": int,                      # ID of the document within the corpus
    "paragraphs": [{                # list of paragraphs in the corpus
        "raw": string,              # raw text of the paragraph
        "sentences": [{             # list of sentences in the paragraph
            "tokens": [{            # list of tokens in the sentence
                "id": int,          # index of the token in the document
                "dep": string,      # dependency label
                "head": int,        # offset of token head relative to token index
                "tag": string,      # part-of-speech tag
                "orth": string,     # verbatim text of the token
                "ner": string       # BILUO label, e.g. "O" or "B-ORG"
            }],
            "brackets": [{          # phrase structure (NOT USED by current models)
                "first": int,       # index of first token
                "last": int,        # index of last token
                "label": string     # phrase label
            }]
        }],
        "cats": [{                  # new in v2.2: categories for text classifier
            "label": string,        # text category label
            "value": float / bool   # label applies (1.0/true) or not (0.0/false)
        }]
    }]
}]
```

Here's an example of dependencies, part-of-speech tags and named entities, taken
from the English Wall Street Journal portion of the Penn Treebank:

```json
https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json
```

## Lexical data for vocabulary {#vocab-jsonl new="2"}

To populate a model's vocabulary, you can use the
[`spacy init-model`](/api/cli#init-model) command and load in a
[newline-delimited JSON](http://jsonlines.org/) (JSONL) file containing one
lexical entry per line via the `--jsonl-loc` option. The first line defines the
language and vocabulary settings. All other lines are expected to be JSON
objects describing an individual lexeme. The lexical attributes will then be set
as attributes on spaCy's [`Lexeme`](/api/lexeme#attributes) object. The
`init-model` command outputs a ready-to-use spaCy model with a `Vocab`
containing the lexical data.

```python
### First line
{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}
```

```python
### Entry structure
{
    "orth": string,     # the word text
    "id": int,          # can correspond to row in vectors table
    "lower": string,
    "norm": string,
    "shape": string,
    "prefix": string,
    "suffix": string,
    "length": int,
    "cluster": string,
    "prob": float,
    "is_alpha": bool,
    "is_ascii": bool,
    "is_digit": bool,
    "is_lower": bool,
    "is_punct": bool,
    "is_space": bool,
    "is_title": bool,
    "is_upper": bool,
    "like_url": bool,
    "like_num": bool,
    "like_email": bool,
    "is_stop": bool,
    "is_oov": bool,
    "is_quote": bool,
    "is_left_punct": bool,
    "is_right_punct": bool
}
```

Here's an example of the 20 most frequent lexemes in the English training data:

```json
https://github.com/explosion/spaCy/tree/master/examples/training/vocab-data.jsonl
```

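The following self-contained sketch writes a file in this layout with nothing but the standard library; the file name and the two lexemes are invented for illustration. The result can then be passed to `spacy init-model` via `--jsonl-loc`.

```python
import json

settings = {"lang": "en", "settings": {"oov_prob": -20.5}}  # first line
lexemes = [  # one JSON object per lexeme; only a few attributes shown
    {"orth": "the", "id": 1, "lower": "the", "prob": -3.5, "is_alpha": True, "is_stop": True},
    {"orth": ",", "id": 2, "lower": ",", "prob": -3.7, "is_alpha": False, "is_punct": True},
]

with open("vocab_data.jsonl", "w", encoding="utf8") as f:
    f.write(json.dumps(settings) + "\n")
    for lex in lexemes:
        f.write(json.dumps(lex) + "\n")
```
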
@@ -123,7 +123,7 @@ details, see the documentation on

| Name | Type | Description |
| --------- | -------- | ----------- |
| `name` | str | Name of the attribute to set by the extension. For example, `"my_attr"` will be available as `doc._.my_attr`. |
| `default` | - | Optional default value of the attribute if no getter or method is defined. |
| `method` | callable | Set a custom method on the object, for example `doc._.compare(other_doc)`. |
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |

@@ -140,8 +140,8 @@ Look up a previously registered extension by name. Returns a 4-tuple

> ```python
> from spacy.tokens import Doc
> Doc.set_extension("has_city", default=False)
> extension = Doc.get_extension("has_city")
> assert extension == (False, None, None, None)
> ```

@@ -158,8 +158,8 @@ Check whether an extension has been registered on the `Doc` class.

> ```python
> from spacy.tokens import Doc
> Doc.set_extension("has_city", default=False)
> assert Doc.has_extension("has_city")
> ```

| Name | Type | Description |

@@ -175,9 +175,9 @@ Remove a previously registered extension.

> ```python
> from spacy.tokens import Doc
> Doc.set_extension("has_city", default=False)
> removed = Doc.remove_extension("has_city")
> assert not Doc.has_extension("has_city")
> ```

| Name | Type | Description |

@@ -202,9 +202,9 @@ the character indices don't map to a valid span.

| Name | Type | Description |
| ------------------------------------ | ---------------------------------------- | ----------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / str | A label to attach to the span, e.g. for named entities. |
| `kb_id` <Tag variant="new">2.2</Tag> | uint64 / str | An ID from a knowledge base to capture the meaning of a named entity. |
| `vector` | `numpy.ndarray[ndim=1, dtype="float32"]` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |

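A short usage sketch for `char_span` (the text is invented; as noted above, `None` is returned when the character indices don't map to token boundaries):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like New York in Autumn")
span = doc.char_span(7, 15, label="GPE")
assert span.text == "New York"
# Offsets that cut through a token don't produce a span
assert doc.char_span(8, 15) is None
```
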
## Doc.similarity {#similarity tag="method" model="vectors"}

@@ -264,7 +264,7 @@ ancestor is found, e.g. if span excludes a necessary ancestor.

| Name | Type | Description |
| ----------- | -------------------------------------- | ----------- |
| **RETURNS** | `numpy.ndarray[ndim=2, dtype="int32"]` | The lowest common ancestor matrix of the `Doc`. |

## Doc.to_json {#to_json tag="method" new="2.1"}

@@ -297,22 +297,13 @@ They'll be added to an `"_"` key in the data, e.g. `"_": {"foo": "bar"}`.

| `underscore` | list | Optional list of string names of custom JSON-serializable `doc._.` attributes. |
| **RETURNS** | dict | The JSON-formatted data. |

<Infobox title="Deprecation note" variant="warning">

spaCy previously implemented a `Doc.print_tree` method that returned a similar
JSON-formatted representation of a `Doc`. As of v2.1, this method is deprecated
in favor of `Doc.to_json`. If you need more complex nested representations, you
might want to write your own function to extract the data.

</Infobox>

## Doc.to_array {#to_array tag="method"}

Export given token attributes to a numpy `ndarray`. If `attr_ids` is a sequence
of `M` attributes, the output array will be of shape `(N, M)`, where `N` is the
length of the `Doc` (in tokens). If `attr_ids` is a single attribute, the output
shape will be `(N,)`. You can specify attributes by integer ID (e.g.
`spacy.attrs.LEMMA`) or string name (e.g. "LEMMA" or "lemma"). The values will
be 64-bit integers.

Returns a 2D array with one row per token and one column per attribute (when

@@ -332,7 +323,7 @@ Returns a 2D array with one row per token and one column per attribute (when

| Name | Type | Description |
| ----------- | --------------------- | ----------- |
| `attr_ids` | list or int or string | A list of attributes (int IDs or string names) or a single attribute (int ID or string name) |
| **RETURNS** | `numpy.ndarray[ndim=2, dtype="uint64"]` or `numpy.ndarray[ndim=1, dtype="uint64"]` | The exported attributes as a numpy array. |

## Doc.from_array {#from_array tag="method"}

@@ -354,10 +345,37 @@ array of attributes.

| Name | Type | Description |
| ----------- | -------------------------------------- | ----------- |
| `attrs` | list | A list of attribute ID ints. |
| `array` | `numpy.ndarray[ndim=2, dtype="int32"]` | The attribute values to load. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Doc` | Itself. |

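Because `to_array` and `from_array` are designed to round-trip, here is a compact sketch of exporting attributes from one `Doc` and loading them into a new one, kept to lexical attributes so it runs on a blank pipeline:

```python
import spacy
from spacy.attrs import LOWER, IS_ALPHA
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("Hello world")
array = doc.to_array([LOWER, IS_ALPHA])  # shape (2, 2), dtype uint64
# Create a new Doc with the same words and load the attributes back in
doc2 = Doc(doc.vocab, words=[t.text for t in doc])
doc2.from_array([LOWER, IS_ALPHA], array)
assert doc2[0].lower_ == "hello"
```
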
## Doc.from_docs {#from_docs tag="staticmethod"}

Concatenate multiple `Doc` objects to form a new one. Raises an error if the
`Doc` objects do not all share the same `Vocab`.

> #### Example
>
> ```python
> from spacy.tokens import Doc
> texts = ["London is the capital of the United Kingdom.",
>          "The River Thames flows through London.",
>          "The famous Tower Bridge crosses the River Thames."]
> docs = list(nlp.pipe(texts))
> c_doc = Doc.from_docs(docs)
> assert str(c_doc) == " ".join(texts)
> assert len(list(c_doc.sents)) == len(docs)
> assert [str(ent) for ent in c_doc.ents] == \
>        [str(ent) for doc in docs for ent in doc.ents]
> ```

| Name | Type | Description |
| ------------------- | ----- | ----------- |
| `docs` | list | A list of `Doc` objects. |
| `ensure_whitespace` | bool | Insert a space between two adjacent docs whenever the first doc does not end in whitespace. |
| `attrs` | list | Optional list of attribute ID ints or attribute name strings. |
| **RETURNS** | `Doc` | The new `Doc` object containing the other docs, or `None` if `docs` is empty or `None`. |

## Doc.to_disk {#to_disk tag="method" new="2"}

Save the current state to a directory.

@@ -507,14 +525,6 @@ underlying lexeme (if they're context-independent lexical attributes like

## Doc.merge {#merge tag="method"}

<Infobox title="Deprecation note" variant="danger">

As of v2.1.0, `Doc.merge` still works but is considered deprecated. You should
use the new and less error-prone [`Doc.retokenize`](/api/doc#retokenize)
instead.

</Infobox>

Retokenize the document, such that the span at `doc.text[start_idx : end_idx]`
is merged into a single token. If `start_idx` and `end_idx` do not mark start
and end token boundaries, the document remains unchanged.

@@ -624,7 +634,7 @@ vectors.

| Name | Type | Description |
| ----------- | ---------------------------------------- | ----------- |
| **RETURNS** | `numpy.ndarray[ndim=1, dtype="float32"]` | A 1D numpy array representing the document's semantics. |

## Doc.vector_norm {#vector_norm tag="property" model="vectors"}

@@ -646,26 +656,26 @@ The L2 norm of the document's vector representation.

## Attributes {#attributes}

| Name | Type | Description |
| --------------------------------------- | ------------ | ----------- |
| `text` | str | A string representation of the document text. |
| `text_with_ws` | str | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
| `vocab` | `Vocab` | The store of lexical types. |
| `tensor` <Tag variant="new">2</Tag> | `ndarray` | Container for dense vector representations. |
| `cats` <Tag variant="new">2</Tag> | dict | Maps a label to a score for categories applied to the document. The label is a string and the score should be a float. |
| `user_data` | - | A generic storage area, for user custom data. |
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
| `lang_` <Tag variant="new">2.1</Tag> | str | Language of the document's vocabulary. |
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. |
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. |
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. |
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if the `Doc` is empty, or if _any_ of the tokens has an entity tag set, even if the others are unknown. |
| `sentiment` | float | The document's positivity/negativity score, if available. |
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |

## Serialization fields {#serialization-fields}

@@ -16,13 +16,14 @@ document from the `DocBin`. The serialization format is gzipped msgpack, where
the msgpack object has the following structure:

```python
### msgpack object structure
{
    "version": str,          # DocBin version number
    "attrs": List[uint64],   # e.g. [TAG, HEAD, ENT_IOB, ENT_TYPE]
    "tokens": bytes,         # Serialized numpy uint64 array with the token data
    "spaces": bytes,         # Serialized numpy boolean array with spaces data
    "lengths": bytes,        # Serialized numpy int32 array with the doc lengths
    "strings": List[str]     # List of unique strings in the token data
}
```

@@ -45,7 +46,7 @@ Create a `DocBin` object to hold serialized annotations.

| Argument | Type | Description |
| ----------------- | -------- | ----------- |
| `attrs` | list | List of attributes to serialize. `ORTH` (hash of token text) and `SPACY` (whether the token is followed by whitespace) are always serialized, so they're not required. Defaults to `("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "LEMMA", "MORPH", "POS")`. |
| `store_user_data` | bool | Whether to include the `Doc.user_data` and the values of custom extension attributes. Defaults to `False`. |
| **RETURNS** | `DocBin` | The newly constructed object. |

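A brief end-to-end sketch of the `DocBin` workflow described above: pack some `Doc`s, serialize to bytes, and restore them. A blank English pipeline keeps the example self-contained, and the attribute list is only an example.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin(attrs=["ORTH", "TAG"], store_user_data=False)
for doc in nlp.pipe(["This is a text.", "And another one."]):
    doc_bin.add(doc)

data = doc_bin.to_bytes()            # gzipped msgpack, as described above
restored = DocBin().from_bytes(data)
docs = list(restored.get_docs(nlp.vocab))
assert len(docs) == 2
```
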
|
@ -36,7 +36,7 @@ be a token pattern (list) or a phrase pattern (string). For example:
|
||||||
| --------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| --------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `nlp` | `Language` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. |
|
| `nlp` | `Language` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. |
|
||||||
| `patterns` | iterable | Optional patterns to load in. |
|
| `patterns` | iterable | Optional patterns to load in. |
|
||||||
| `phrase_matcher_attr` | int / unicode | Optional attr to pass to the internal [`PhraseMatcher`](/api/phrasematcher). defaults to `None` |
|
| `phrase_matcher_attr` | int / str | Optional attr to pass to the internal [`PhraseMatcher`](/api/phrasematcher). defaults to `None` |
|
||||||
| `validate` | bool | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. |
|
| `validate` | bool | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. |
|
||||||
| `overwrite_ents` | bool | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. |
|
| `overwrite_ents` | bool | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. |
|
||||||
| `**cfg` | - | Other config parameters. If pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to `spacy.load`. |
|
| `**cfg` | - | Other config parameters. If pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to `spacy.load`. |
|
||||||
|
|
10
website/docs/api/example.md
Normal file
10
website/docs/api/example.md
Normal file
|
@ -0,0 +1,10 @@
|
||||||
|
---
|
||||||
|
title: Example
|
||||||
|
teaser: A training example
|
||||||
|
tag: class
|
||||||
|
source: spacy/gold/example.pyx
|
||||||
|
---
|
||||||
|
|
||||||
|
<!-- TODO: -->
|
||||||
|
|
||||||
|
## Example.\_\_init\_\_ {#init tag="method"}
|
|
@ -1,24 +0,0 @@
|
||||||
---
|
|
||||||
title: GoldCorpus
|
|
||||||
teaser: An annotated corpus, using the JSON file format
|
|
||||||
tag: class
|
|
||||||
source: spacy/gold.pyx
|
|
||||||
new: 2
|
|
||||||
---
|
|
||||||
|
|
||||||
This class manages annotations for tagging, dependency parsing and NER.
|
|
||||||
|
|
||||||
## GoldCorpus.\_\_init\_\_ {#init tag="method"}
|
|
||||||
|
|
||||||
Create a `GoldCorpus`. IF the input data is an iterable, each item should be a
|
|
||||||
`(text, paragraphs)` tuple, where each paragraph is a tuple
|
|
||||||
`(sentences, brackets)`, and each sentence is a tuple
|
|
||||||
`(ids, words, tags, heads, ner)`. See the implementation of
|
|
||||||
[`gold.read_json_file`](https://github.com/explosion/spaCy/tree/master/spacy/gold.pyx)
|
|
||||||
for further details.
|
|
||||||
|
|
||||||
| Name | Type | Description |
|
|
||||||
| ----------- | ----------------------- | ------------------------------------------------------------ |
|
|
||||||
| `train` | str / `Path` / iterable | Training data, as a path (file or directory) or iterable. |
|
|
||||||
| `dev` | str / `Path` / iterable | Development data, as a path (file or directory) or iterable. |
|
|
||||||
| **RETURNS** | `GoldCorpus` | The newly constructed object. |
|
|
|
@ -1,207 +0,0 @@
|
||||||
---
|
|
||||||
title: GoldParse
|
|
||||||
teaser: A collection for training annotations
|
|
||||||
tag: class
|
|
||||||
source: spacy/gold.pyx
|
|
||||||
---
|
|
||||||
|
|
||||||
## GoldParse.\_\_init\_\_ {#init tag="method"}
|
|
||||||
|
|
||||||
Create a `GoldParse`. The [`TextCategorizer`](/api/textcategorizer) component
|
|
||||||
expects true examples of a label to have the value `1.0`, and negative examples
|
|
||||||
of a label to have the value `0.0`. Labels not in the dictionary are treated as
|
|
||||||
missing – the gradient for those labels will be zero.
|
|
||||||
|
|
||||||
| Name | Type | Description |
|
|
||||||
| ----------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
||||||
| `doc` | `Doc` | The document the annotations refer to. |
|
|
||||||
| `words` | iterable | A sequence of unicode word strings. |
|
|
||||||
| `tags` | iterable | A sequence of strings, representing tag annotations. |
|
|
||||||
| `heads` | iterable | A sequence of integers, representing syntactic head offsets. |
|
|
||||||
| `deps` | iterable | A sequence of strings, representing the syntactic relation types. |
|
|
||||||
| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
|
|
||||||
| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). |
|
|
||||||
| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). |
|
|
||||||
| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False`. |
|
|
||||||
| **RETURNS** | `GoldParse` | The newly constructed object. |
|
|
||||||
|
|
||||||
## GoldParse.\_\_len\_\_ {#len tag="method"}
|
|
||||||
|
|
||||||
Get the number of gold-standard tokens.
|
|
||||||
|
|
||||||
| Name | Type | Description |
|
|
||||||
| ----------- | ---- | ----------------------------------- |
|
|
||||||
| **RETURNS** | int | The number of gold-standard tokens. |
|
|
||||||
|
|
||||||
## GoldParse.is_projective {#is_projective tag="property"}
|
|
||||||
|
|
||||||
Whether the provided syntactic annotations form a projective dependency tree.
|
|
||||||
|
|
||||||
| Name | Type | Description |
|
|
||||||
| ----------- | ---- | ----------------------------------------- |
|
|
||||||
| **RETURNS** | bool | Whether annotations form projective tree. |
|
|
||||||
|
|
||||||

## Attributes {#attributes}

| Name                                 | Type | Description                                                                                                             |
| ------------------------------------ | ---- | ----------------------------------------------------------------------------------------------------------------------- |
| `words`                              | list | The words.                                                                                                              |
| `tags`                               | list | The part-of-speech tag annotations.                                                                                     |
| `heads`                              | list | The syntactic head annotations.                                                                                         |
| `labels`                             | list | The syntactic relation-type annotations.                                                                                |
| `ner`                                | list | The named entity annotations as BILUO tags.                                                                             |
| `cand_to_gold`                       | list | The alignment from candidate tokenization to gold tokenization.                                                         |
| `gold_to_cand`                       | list | The alignment from gold tokenization to candidate tokenization.                                                         |
| `cats` <Tag variant="new">2</Tag>    | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`.                                           |
| `links` <Tag variant="new">2.2</Tag> | dict | Keys in the dictionary are `(start_char, end_char)` tuples, and the values are dictionaries with `kb_id:value` entries. |
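
A short sketch of reading some of these attributes back, assuming `nlp` is
loaded as in the earlier example:

```python
from spacy.gold import GoldParse

doc = nlp("I like London")
gold = GoldParse(doc, entities=[(7, 13, "LOC")])
print(gold.words)  # the gold-standard tokens
print(gold.ner)    # entity offsets converted to BILUO tags, e.g. ["O", "O", "U-LOC"]
```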

## Utilities {#util}

### gold.docs_to_json {#docs_to_json tag="function"}

Convert a list of Doc objects into the
[JSON-serializable format](/api/annotation#json-input) used by the
[`spacy train`](/api/cli#train) command. Each input doc will be treated as a
'paragraph' in the output doc.

> #### Example
>
> ```python
> from spacy.gold import docs_to_json
>
> doc = nlp("I like London")
> json_data = docs_to_json([doc])
> ```

| Name        | Type             | Description                                |
| ----------- | ---------------- | ------------------------------------------ |
| `docs`      | iterable / `Doc` | The `Doc` object(s) to convert.            |
| `id`        | int              | ID to assign to the JSON. Defaults to `0`. |
| **RETURNS** | dict             | The data in spaCy's JSON format.           |
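
As a follow-up sketch, the returned dict can be written to disk for use with
[`spacy train`](/api/cli#train). The file name and the top-level list wrapper
are assumptions based on spaCy's JSON training format, not part of the
`docs_to_json` API itself:

```python
import json

import spacy
from spacy.gold import docs_to_json

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
docs = [nlp("I like London"), nlp("I like Berlin")]

# The training file is assumed to be a list of documents, each with paragraphs
with open("train.json", "w", encoding="utf8") as f:
    json.dump([docs_to_json(docs)], f, indent=2)
```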

### gold.align {#align tag="function"}

Calculate alignment tables between two tokenizations, using the Levenshtein
algorithm. The alignment is case-insensitive.

<Infobox title="Important note" variant="warning">

The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.

</Infobox>

> #### Example
>
> ```python
> from spacy.gold import align
>
> bert_tokens = ["obama", "'", "s", "podcast"]
> spacy_tokens = ["obama", "'s", "podcast"]
> alignment = align(bert_tokens, spacy_tokens)
> cost, a2b, b2a, a2b_multi, b2a_multi = alignment
> ```

| Name        | Type  | Description                                                                 |
| ----------- | ----- | --------------------------------------------------------------------------- |
| `tokens_a`  | list  | String values of candidate tokens to align.                                 |
| `tokens_b`  | list  | String values of reference tokens to align.                                 |
| **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment.  |

The returned tuple contains the following alignment information:

> #### Example
>
> ```python
> a2b = array([0, -1, -1, 2])
> b2a = array([0, 2, 3])
> a2b_multi = {1: 1, 2: 1}
> b2a_multi = {}
> ```
>
> If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If
> there's no one-to-one alignment for a token, it has the value `-1`.

| Name        | Type                                   | Description                                                                                                                                       |
| ----------- | -------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `cost`      | int                                    | The number of misaligned tokens.                                                                                                                  |
| `a2b`       | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`.                                                                            |
| `b2a`       | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`.                                                                            |
| `a2b_multi` | dict                                   | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`.  |
| `b2a_multi` | dict                                   | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`.  |

### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}

Encode labelled spans into per-token tags, using the
[BILUO scheme](/api/annotation#biluo) (Begin, In, Last, Unit, Out). Returns a
list of unicode strings, describing the tags. Each tag string will be of the
form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of
`"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets
don't align with the tokenization in the `Doc` object. The training algorithm
will view these as missing values. `O` denotes a non-entity token. `B` denotes
the beginning of a multi-token entity, `I` the inside of an entity of three or
more tokens, and `L` the end of an entity of two or more tokens. `U` denotes a
single-token entity.

> #### Example
>
> ```python
> from spacy.gold import biluo_tags_from_offsets
>
> doc = nlp("I like London.")
> entities = [(7, 13, "LOC")]
> tags = biluo_tags_from_offsets(doc, entities)
> assert tags == ["O", "O", "U-LOC", "O"]
> ```

| Name        | Type     | Description                                                                                                                                       |
| ----------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc`       | `Doc`    | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document.                           |
| `entities`  | iterable | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string.  |
| **RETURNS** | list     | The per-token [BILUO](/api/annotation#biluo) tags as a list of strings.                                                                           |
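
As a follow-up sketch of the misalignment behavior described above: if an
entity's character offsets don't fall on token boundaries, the overlapping
token is tagged `"-"` (a missing value) rather than receiving a BILUO tag. The
offsets below are deliberately off by one character:

```python
from spacy.gold import biluo_tags_from_offsets

doc = nlp("I like London.")
# (7, 12) stops one character short of the token "London"
misaligned = [(7, 12, "LOC")]
tags = biluo_tags_from_offsets(doc, misaligned)
print(tags)  # expected: ["O", "O", "-", "O"]
```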

### gold.offsets_from_biluo_tags {#offsets_from_biluo_tags tag="function"}

Encode per-token tags following the [BILUO scheme](/api/annotation#biluo) into
entity offsets.

> #### Example
>
> ```python
> from spacy.gold import offsets_from_biluo_tags
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> entities = offsets_from_biluo_tags(doc, tags)
> assert entities == [(7, 13, "LOC")]
> ```

| Name        | Type     | Description                                                                                                                                                                                                                    |
| ----------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc`       | `Doc`    | The document that the BILUO tags refer to.                                                                                                                                                                                     |
| `tags`      | iterable | A sequence of [BILUO](/api/annotation#biluo) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`.    |
| **RETURNS** | list     | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string.                                                                                  |

### gold.spans_from_biluo_tags {#spans_from_biluo_tags tag="function" new="2.1"}

Encode per-token tags following the [BILUO scheme](/api/annotation#biluo) into
[`Span`](/api/span) objects. This can be used to create entity spans from
token-based tags, e.g. to overwrite the `doc.ents`.

> #### Example
>
> ```python
> from spacy.gold import spans_from_biluo_tags
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> doc.ents = spans_from_biluo_tags(doc, tags)
> ```

| Name        | Type     | Description                                                                                                                                                                                                                    |
| ----------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc`       | `Doc`    | The document that the BILUO tags refer to.                                                                                                                                                                                     |
| `tags`      | iterable | A sequence of [BILUO](/api/annotation#biluo) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`.    |
| **RETURNS** | list     | A sequence of `Span` objects with added entity labels.                                                                                                                                                                         |

@@ -1,10 +1,8 @@
 ---
-title: Architecture
-next: /api/annotation
+title: Library Architecture
+next: /api/architectures
 ---

-## Library architecture {#architecture}
-
 import Architecture101 from 'usage/101/\_architecture.md'

 <Architecture101 />