Mirror of https://github.com/explosion/spaCy.git, synced 2025-07-19 12:42:20 +03:00

Merge remote-tracking branch 'upstream/master' into feature/coref

Commit ba2e491cc4

.github/contributors/juliensalinas.md (vendored, new file, +106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" on one of the applicable statements below. Please do NOT
mark both statements:

    * [X] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry               |
| ------------------------------ | ------------------- |
| Name                           | Julien Salinas      |
| Company name (if applicable)   | NLP Cloud           |
| Title or role (if applicable)  | Founder and CTO     |
| Date                           | May 14th, 2021      |
| GitHub username                | juliensalinas       |
| Website (optional)             | https://nlpcloud.io |
@@ -82,18 +82,18 @@ jobs:
          python_version: '$(python.version)'
          architecture: 'x64'

-  - job: "TestGPU"
-    dependsOn: "Validate"
-    strategy:
-      matrix:
-        Python38LinuxX64_GPU:
-          python.version: '3.8'
-    pool:
-      name: "LinuxX64_GPU"
-    steps:
-      - template: .github/azure-steps.yml
-        parameters:
-          python_version: '$(python.version)'
-          architecture: 'x64'
-          gpu: true
-          num_build_jobs: 24
+  # - job: "TestGPU"
+  #   dependsOn: "Validate"
+  #   strategy:
+  #     matrix:
+  #       Python38LinuxX64_GPU:
+  #         python.version: '3.8'
+  #   pool:
+  #     name: "LinuxX64_GPU"
+  #   steps:
+  #     - template: .github/azure-steps.yml
+  #       parameters:
+  #         python_version: '$(python.version)'
+  #         architecture: 'x64'
+  #         gpu: true
+  #         num_build_jobs: 24
@@ -15,7 +15,7 @@ pathy>=0.3.5
 numpy>=1.15.0
 requests>=2.13.0,<3.0.0
 tqdm>=4.38.0,<5.0.0
-pydantic>=1.7.1,<1.8.0
+pydantic>=1.7.4,!=1.8,!=1.8.1,<1.9.0
 jinja2
 # Official Python utilities
 setuptools
@@ -52,7 +52,7 @@ install_requires =
     tqdm>=4.38.0,<5.0.0
     numpy>=1.15.0
     requests>=2.13.0,<3.0.0
-    pydantic>=1.7.1,<1.8.0
+    pydantic>=1.7.4,!=1.8,!=1.8.1,<1.9.0
     jinja2
     # Official Python utilities
     setuptools
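Note (illustrative, not part of the commit): the pydantic pin is widened in both requirements.txt and setup.cfg above. The sketch below only demonstrates what the new specifier accepts, using the `packaging` library.

```python
# Check a few pydantic versions against the new constraint from the diff above.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

spec = SpecifierSet(">=1.7.4,!=1.8,!=1.8.1,<1.9.0")
for candidate in ["1.7.3", "1.7.4", "1.8", "1.8.1", "1.8.2", "1.9.0"]:
    print(candidate, Version(candidate) in spec)
# Only 1.7.4 and 1.8.2 fall inside the range; 1.8 and 1.8.1 are explicitly excluded.
```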
@@ -174,7 +174,8 @@ def debug_data(
         n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values())
         msg.warn(
             "{} words in training data without vectors ({:.0f}%)".format(
-                n_missing_vectors, 100 * (n_missing_vectors / gold_train_data["n_words"])
+                n_missing_vectors,
+                100 * (n_missing_vectors / gold_train_data["n_words"]),
             ),
         )
         msg.text(
@@ -282,42 +283,7 @@ def debug_data(
         labels = _get_labels_from_model(nlp, "textcat")
         msg.info(f"Text Classification: {len(labels)} label(s)")
         msg.text(f"Labels: {_format_labels(labels)}", show=verbose)
-        labels_with_counts = _format_labels(
-            gold_train_data["cats"].most_common(), counts=True
-        )
-        msg.text(f"Labels in train data: {labels_with_counts}", show=verbose)
-        missing_labels = labels - set(gold_train_data["cats"].keys())
-        if missing_labels:
-            msg.warn(
-                "Some model labels are not present in the train data. The "
-                "model performance may be degraded for these labels after "
-                f"training: {_format_labels(missing_labels)}."
-            )
-        if gold_train_data["n_cats_multilabel"] > 0:
-            # Note: you should never get here because you run into E895 on
-            # initialization first.
-            msg.warn(
-                "The train data contains instances without "
-                "mutually-exclusive classes. Use the component "
-                "'textcat_multilabel' instead of 'textcat'."
-            )
-        if gold_dev_data["n_cats_multilabel"] > 0:
-            msg.fail(
-                "Train/dev mismatch: the dev data contains instances "
-                "without mutually-exclusive classes while the train data "
-                "contains only instances with mutually-exclusive classes."
-            )
-
-    if "textcat_multilabel" in factory_names:
-        msg.divider("Text Classification (Multilabel)")
-        labels = _get_labels_from_model(nlp, "textcat_multilabel")
-        msg.info(f"Text Classification: {len(labels)} label(s)")
-        msg.text(f"Labels: {_format_labels(labels)}", show=verbose)
-        labels_with_counts = _format_labels(
-            gold_train_data["cats"].most_common(), counts=True
-        )
-        msg.text(f"Labels in train data: {labels_with_counts}", show=verbose)
-        missing_labels = labels - set(gold_train_data["cats"].keys())
+        missing_labels = labels - set(gold_train_data["cats"])
         if missing_labels:
             msg.warn(
                 "Some model labels are not present in the train data. The "
@@ -325,17 +291,76 @@ def debug_data(
                 f"training: {_format_labels(missing_labels)}."
             )
         if set(gold_train_data["cats"]) != set(gold_dev_data["cats"]):
-            msg.fail(
-                f"The train and dev labels are not the same. "
+            msg.warn(
+                "Potential train/dev mismatch: the train and dev labels are "
+                "not the same. "
                 f"Train labels: {_format_labels(gold_train_data['cats'])}. "
                 f"Dev labels: {_format_labels(gold_dev_data['cats'])}."
             )
+        if len(labels) < 2:
+            msg.fail(
+                "The model does not have enough labels. 'textcat' requires at "
+                "least two labels due to mutually-exclusive classes, e.g. "
+                "LABEL/NOT_LABEL or POSITIVE/NEGATIVE for a binary "
+                "classification task."
+            )
+        if (
+            gold_train_data["n_cats_bad_values"] > 0
+            or gold_dev_data["n_cats_bad_values"] > 0
+        ):
+            msg.fail(
+                "Unsupported values for cats: the supported values are "
+                "1.0/True and 0.0/False."
+            )
+        if gold_train_data["n_cats_multilabel"] > 0:
+            # Note: you should never get here because you run into E895 on
+            # initialization first.
+            msg.fail(
+                "The train data contains instances without mutually-exclusive "
+                "classes. Use the component 'textcat_multilabel' instead of "
+                "'textcat'."
+            )
+        if gold_dev_data["n_cats_multilabel"] > 0:
+            msg.fail(
+                "The dev data contains instances without mutually-exclusive "
+                "classes. Use the component 'textcat_multilabel' instead of "
+                "'textcat'."
+            )
+
+    if "textcat_multilabel" in factory_names:
+        msg.divider("Text Classification (Multilabel)")
+        labels = _get_labels_from_model(nlp, "textcat_multilabel")
+        msg.info(f"Text Classification: {len(labels)} label(s)")
+        msg.text(f"Labels: {_format_labels(labels)}", show=verbose)
+        missing_labels = labels - set(gold_train_data["cats"])
+        if missing_labels:
+            msg.warn(
+                "Some model labels are not present in the train data. The "
+                "model performance may be degraded for these labels after "
+                f"training: {_format_labels(missing_labels)}."
+            )
+        if set(gold_train_data["cats"]) != set(gold_dev_data["cats"]):
+            msg.warn(
+                "Potential train/dev mismatch: the train and dev labels are "
+                "not the same. "
+                f"Train labels: {_format_labels(gold_train_data['cats'])}. "
+                f"Dev labels: {_format_labels(gold_dev_data['cats'])}."
+            )
+        if (
+            gold_train_data["n_cats_bad_values"] > 0
+            or gold_dev_data["n_cats_bad_values"] > 0
+        ):
+            msg.fail(
+                "Unsupported values for cats: the supported values are "
+                "1.0/True and 0.0/False."
+            )
         if gold_train_data["n_cats_multilabel"] > 0:
             if gold_dev_data["n_cats_multilabel"] == 0:
                 msg.warn(
                     "Potential train/dev mismatch: the train data contains "
                     "instances without mutually-exclusive classes while the "
-                    "dev data does not."
+                    "dev data contains only instances with mutually-exclusive "
+                    "classes."
                 )
             else:
                 msg.warn(
@@ -556,6 +581,7 @@ def _compile_gold(
         "n_nonproj": 0,
         "n_cycles": 0,
         "n_cats_multilabel": 0,
+        "n_cats_bad_values": 0,
         "texts": set(),
     }
     for eg in examples:
@@ -599,7 +625,9 @@ def _compile_gold(
                 data["ner"]["-"] += 1
         if "textcat" in factory_names or "textcat_multilabel" in factory_names:
             data["cats"].update(gold.cats)
-            if list(gold.cats.values()).count(1.0) != 1:
+            if any(val not in (0, 1) for val in gold.cats.values()):
+                data["n_cats_bad_values"] += 1
+            if list(gold.cats.values()).count(1) != 1:
                 data["n_cats_multilabel"] += 1
         if "tagger" in factory_names:
             tags = eg.get_aligned("TAG", as_string=True)
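Note (illustrative, not part of the commit): the new `n_cats_bad_values` counter flags category values other than 0/1. The dictionaries below are made up and simply re-run the two checks from the hunk above.

```python
# Hypothetical `cats` annotations, mirroring the checks added to _compile_gold.
examples = {
    "exclusive": {"POSITIVE": 1.0, "NEGATIVE": 0.0},  # exactly one 1.0 -> fine for "textcat"
    "multilabel": {"SPORTS": 1.0, "POLITICS": 1.0},   # counted via n_cats_multilabel
    "bad_values": {"POSITIVE": 0.7},                  # neither 0 nor 1 -> counted via n_cats_bad_values
}
for name, cats in examples.items():
    values = list(cats.values())
    bad = any(val not in (0, 1) for val in values)
    multilabel = values.count(1) != 1
    print(f"{name}: bad_values={bad} multilabel={multilabel}")
```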
@@ -375,21 +375,10 @@ class Errors:
     E125 = ("Unexpected value: {value}")
     E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. "
             "This is likely a bug in spaCy, so feel free to open an issue.")
-    E129 = ("Cannot write the label of an existing Span object because a Span "
-            "is a read-only view of the underlying Token objects stored in the "
-            "Doc. Instead, create a new Span object and specify the `label` "
-            "keyword argument, for example:\nfrom spacy.tokens import Span\n"
-            "span = Span(doc, start={start}, end={end}, label='{label}')")
     E130 = ("You are running a narrow unicode build, which is incompatible "
             "with spacy >= 2.1.0. To fix this, reinstall Python and use a wide "
             "unicode build instead. You can also rebuild Python and set the "
             "`--enable-unicode=ucs4 flag`.")
-    E131 = ("Cannot write the kb_id of an existing Span object because a Span "
-            "is a read-only view of the underlying Token objects stored in "
-            "the Doc. Instead, create a new Span object and specify the "
-            "`kb_id` keyword argument, for example:\nfrom spacy.tokens "
-            "import Span\nspan = Span(doc, start={start}, end={end}, "
-            "label='{label}', kb_id='{kb_id}')")
     E132 = ("The vectors for entities and probabilities for alias '{alias}' "
             "should have equal length, but found {entities_length} and "
             "{probabilities_length} respectively.")
@@ -501,6 +490,12 @@ class Errors:
     E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.")

     # New errors added in v3.x
+    E870 = ("Could not serialize the DocBin because it is too large. Consider "
+            "splitting up your documents into several doc bins and serializing "
+            "each separately. spacy.Corpus.v1 will search recursively for all "
+            "*.spacy files if you provide a directory instead of a filename as "
+            "the 'path'.")
+    E871 = ("Error encountered in nlp.pipe with multiprocessing:\n\n{error}")
     E872 = ("Unable to copy tokenizer from base model due to different "
             'tokenizer settings: current tokenizer config "{curr_config}" '
             'vs. base model "{base_config}"')
spacy/kb.pyx (115 changed lines)

@@ -93,6 +93,15 @@ cdef class KnowledgeBase:
         self.vocab = vocab
         self._create_empty_vectors(dummy_hash=self.vocab.strings[""])

+    def initialize_entities(self, int64_t nr_entities):
+        self._entry_index = PreshMap(nr_entities + 1)
+        self._entries = entry_vec(nr_entities + 1)
+        self._vectors_table = float_matrix(nr_entities + 1)
+
+    def initialize_aliases(self, int64_t nr_aliases):
+        self._alias_index = PreshMap(nr_aliases + 1)
+        self._aliases_table = alias_vec(nr_aliases + 1)
+
     @property
     def entity_vector_length(self):
         """RETURNS (uint64): length of the entity vectors"""
@@ -144,8 +153,7 @@ cdef class KnowledgeBase:
            raise ValueError(Errors.E140)

        nr_entities = len(set(entity_list))
-       self._entry_index = PreshMap(nr_entities+1)
-       self._entries = entry_vec(nr_entities+1)
+       self.initialize_entities(nr_entities)

        i = 0
        cdef KBEntryC entry
@@ -325,6 +333,102 @@ cdef class KnowledgeBase:

         return 0.0

+    def to_bytes(self, **kwargs):
+        """Serialize the current state to a binary string.
+        """
+        def serialize_header():
+            header = (self.get_size_entities(), self.get_size_aliases(), self.entity_vector_length)
+            return srsly.json_dumps(header)
+
+        def serialize_entries():
+            i = 1
+            tuples = []
+            for entry_hash, entry_index in sorted(self._entry_index.items(), key=lambda x: x[1]):
+                entry = self._entries[entry_index]
+                assert entry.entity_hash == entry_hash
+                assert entry_index == i
+                tuples.append((entry.entity_hash, entry.freq, entry.vector_index))
+                i = i + 1
+            return srsly.json_dumps(tuples)
+
+        def serialize_aliases():
+            i = 1
+            headers = []
+            indices_lists = []
+            probs_lists = []
+            for alias_hash, alias_index in sorted(self._alias_index.items(), key=lambda x: x[1]):
+                alias = self._aliases_table[alias_index]
+                assert alias_index == i
+                candidate_length = len(alias.entry_indices)
+                headers.append((alias_hash, candidate_length))
+                indices_lists.append(alias.entry_indices)
+                probs_lists.append(alias.probs)
+                i = i + 1
+            headers_dump = srsly.json_dumps(headers)
+            indices_dump = srsly.json_dumps(indices_lists)
+            probs_dump = srsly.json_dumps(probs_lists)
+            return srsly.json_dumps((headers_dump, indices_dump, probs_dump))
+
+        serializers = {
+            "header": serialize_header,
+            "entity_vectors": lambda: srsly.json_dumps(self._vectors_table),
+            "entries": serialize_entries,
+            "aliases": serialize_aliases,
+        }
+        return util.to_bytes(serializers, [])
+
+    def from_bytes(self, bytes_data, *, exclude=tuple()):
+        """Load state from a binary string.
+        """
+        def deserialize_header(b):
+            header = srsly.json_loads(b)
+            nr_entities = header[0]
+            nr_aliases = header[1]
+            entity_vector_length = header[2]
+            self.initialize_entities(nr_entities)
+            self.initialize_aliases(nr_aliases)
+            self.entity_vector_length = entity_vector_length
+
+        def deserialize_vectors(b):
+            self._vectors_table = srsly.json_loads(b)
+
+        def deserialize_entries(b):
+            cdef KBEntryC entry
+            tuples = srsly.json_loads(b)
+            i = 1
+            for (entity_hash, freq, vector_index) in tuples:
+                entry.entity_hash = entity_hash
+                entry.freq = freq
+                entry.vector_index = vector_index
+                entry.feats_row = -1  # Features table currently not implemented
+                self._entries[i] = entry
+                self._entry_index[entity_hash] = i
+                i += 1
+
+        def deserialize_aliases(b):
+            cdef AliasC alias
+            i = 1
+            all_data = srsly.json_loads(b)
+            headers = srsly.json_loads(all_data[0])
+            indices = srsly.json_loads(all_data[1])
+            probs = srsly.json_loads(all_data[2])
+            for header, indices, probs in zip(headers, indices, probs):
+                alias_hash, candidate_length = header
+                alias.entry_indices = indices
+                alias.probs = probs
+                self._aliases_table[i] = alias
+                self._alias_index[alias_hash] = i
+                i += 1
+
+        setters = {
+            "header": deserialize_header,
+            "entity_vectors": deserialize_vectors,
+            "entries": deserialize_entries,
+            "aliases": deserialize_aliases,
+        }
+        util.from_bytes(bytes_data, setters, exclude)
+        return self
+
     def to_disk(self, path, exclude: Iterable[str] = SimpleFrozenList()):
         path = ensure_path(path)
         if not path.exists():
@@ -404,10 +508,8 @@ cdef class KnowledgeBase:
         cdef int64_t entity_vector_length
         reader.read_header(&nr_entities, &entity_vector_length)

+        self.initialize_entities(nr_entities)
         self.entity_vector_length = entity_vector_length
-        self._entry_index = PreshMap(nr_entities+1)
-        self._entries = entry_vec(nr_entities+1)
-        self._vectors_table = float_matrix(nr_entities+1)

         # STEP 1: load entity vectors
         cdef int i = 0
@@ -445,8 +547,7 @@ cdef class KnowledgeBase:
         # STEP 3: load aliases
         cdef int64_t nr_aliases
         reader.read_alias_length(&nr_aliases)
-        self._alias_index = PreshMap(nr_aliases+1)
-        self._aliases_table = alias_vec(nr_aliases+1)
+        self.initialize_aliases(nr_aliases)

         cdef int64_t nr_candidates
         cdef vector[int64_t] entry_indices
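Note (illustrative, not part of the commit): a rough usage sketch of the new byte-level KB serialization; it condenses the `test_kb_to_bytes` test added further down, and the entity/alias values are the test's own.

```python
from spacy.kb import KnowledgeBase
from spacy.lang.en import English

nlp = English()
kb_1 = KnowledgeBase(nlp.vocab, entity_vector_length=3)
kb_1.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
kb_1.add_alias(alias="Russ Cochran", entities=["Q2146908"], probabilities=[0.8])

kb_bytes = kb_1.to_bytes()                        # serialize header, entries, vectors and aliases
kb_2 = KnowledgeBase(nlp.vocab, entity_vector_length=3)
kb_2 = kb_2.from_bytes(kb_bytes)                  # from_bytes returns the populated KB
assert kb_2.contains_alias("Russ Cochran")
assert kb_2.get_vector("Q2146908") == [6, -4, 3]
```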
@@ -13,6 +13,7 @@ import srsly
 import multiprocessing as mp
 from itertools import chain, cycle
 from timeit import default_timer as timer
+import traceback

 from .tokens.underscore import Underscore
 from .vocab import Vocab, create_vocab
@@ -1538,11 +1539,15 @@ class Language:

         # Cycle channels not to break the order of docs.
         # The received object is a batch of byte-encoded docs, so flatten them with chain.from_iterable.
-        byte_docs = chain.from_iterable(recv.recv() for recv in cycle(bytedocs_recv_ch))
-        docs = (Doc(self.vocab).from_bytes(byte_doc) for byte_doc in byte_docs)
+        byte_tuples = chain.from_iterable(recv.recv() for recv in cycle(bytedocs_recv_ch))
         try:
-            for i, (_, doc) in enumerate(zip(raw_texts, docs), 1):
-                yield doc
+            for i, (_, (byte_doc, byte_error)) in enumerate(zip(raw_texts, byte_tuples), 1):
+                if byte_doc is not None:
+                    doc = Doc(self.vocab).from_bytes(byte_doc)
+                    yield doc
+                elif byte_error is not None:
+                    error = srsly.msgpack_loads(byte_error)
+                    self.default_error_handler(None, None, None, ValueError(Errors.E871.format(error=error)))
                 if i % batch_size == 0:
                     # tell `sender` that one batch was consumed.
                     sender.step()
@@ -2036,12 +2041,19 @@ def _apply_pipes(
     """
     Underscore.load_state(underscore_state)
     while True:
-        texts = receiver.get()
-        docs = (make_doc(text) for text in texts)
-        for pipe in pipes:
-            docs = pipe(docs)
-        # Connection does not accept unpickable objects, so send list.
-        sender.send([doc.to_bytes() for doc in docs])
+        try:
+            texts = receiver.get()
+            docs = (make_doc(text) for text in texts)
+            for pipe in pipes:
+                docs = pipe(docs)
+            # Connection does not accept unpickable objects, so send list.
+            byte_docs = [(doc.to_bytes(), None) for doc in docs]
+            padding = [(None, None)] * (len(texts) - len(byte_docs))
+            sender.send(byte_docs + padding)
+        except Exception:
+            error_msg = [(None, srsly.msgpack_dumps(traceback.format_exc()))]
+            padding = [(None, None)] * (len(texts) - 1)
+            sender.send(error_msg + padding)


 class _Sender:
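Note (illustrative, not part of the commit): with this change, an exception raised inside a worker process is serialized back to the parent and routed through the error handler (E871) instead of crashing or stalling `nlp.pipe`. The sketch below assumes a made-up `fragile_component`, a blank English pipeline and a fork-based multiprocessing start (as on Linux); it follows the parametrized tests added to `test_language.py` later in this commit.

```python
import spacy
from spacy.language import Language
from spacy.util import ignore_error

@Language.component("fragile_component")          # hypothetical component name
def fragile_component(doc):
    if "2" in doc.text:                           # same trigger as the tests' evil_component
        raise ValueError("no dice")
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("fragile_component")
nlp.set_error_handler(ignore_error)               # skip failing docs instead of raising
texts = ["TEXT 111", "TEXT 222", "TEXT 333"]
docs = list(nlp.pipe(texts, n_process=2))         # worker errors now travel back via the E871 path
print([doc.text for doc in docs])                 # the doc containing "2" is dropped
```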
@@ -408,6 +408,48 @@ class EntityLinker(TrainablePipe):
         validate_examples(examples, "EntityLinker.score")
         return Scorer.score_links(examples, negative_labels=[self.NIL])

+    def to_bytes(self, *, exclude=tuple()):
+        """Serialize the pipe to a bytestring.
+
+        exclude (Iterable[str]): String names of serialization fields to exclude.
+        RETURNS (bytes): The serialized object.
+
+        DOCS: https://spacy.io/api/entitylinker#to_bytes
+        """
+        self._validate_serialization_attrs()
+        serialize = {}
+        if hasattr(self, "cfg") and self.cfg is not None:
+            serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
+        serialize["vocab"] = self.vocab.to_bytes
+        serialize["kb"] = self.kb.to_bytes
+        serialize["model"] = self.model.to_bytes
+        return util.to_bytes(serialize, exclude)
+
+    def from_bytes(self, bytes_data, *, exclude=tuple()):
+        """Load the pipe from a bytestring.
+
+        exclude (Iterable[str]): String names of serialization fields to exclude.
+        RETURNS (TrainablePipe): The loaded object.
+
+        DOCS: https://spacy.io/api/entitylinker#from_bytes
+        """
+        self._validate_serialization_attrs()
+
+        def load_model(b):
+            try:
+                self.model.from_bytes(b)
+            except AttributeError:
+                raise ValueError(Errors.E149) from None
+
+        deserialize = {}
+        if hasattr(self, "cfg") and self.cfg is not None:
+            deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
+        deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
+        deserialize["kb"] = lambda b: self.kb.from_bytes(b)
+        deserialize["model"] = load_model
+        util.from_bytes(bytes_data, deserialize, exclude)
+        return self
+
     def to_disk(
         self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
     ) -> None:
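Note (illustrative, not part of the commit): at the pipeline level this makes `nlp.to_bytes()` / `nlp.from_bytes()` round-trip an `entity_linker` together with its KB; the sketch condenses the `test_nel_to_bytes` test added further down.

```python
from spacy.kb import KnowledgeBase
from spacy.lang.en import English

def create_kb(vocab):
    kb = KnowledgeBase(vocab, entity_vector_length=3)
    kb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
    kb.add_alias(alias="Russ Cochran", entities=["Q2146908"], probabilities=[0.8])
    return kb

nlp_1 = English()
nlp_1.add_pipe("ner")
nlp_1.add_pipe("entity_linker", last=True).set_kb(create_kb)
nlp_bytes = nlp_1.to_bytes()                      # now includes the EntityLinker's cfg, vocab, kb and model

nlp_2 = English()
nlp_2.add_pipe("ner")
nlp_2.add_pipe("entity_linker", last=True)
nlp_2 = nlp_2.from_bytes(nlp_bytes)
assert nlp_2.get_pipe("entity_linker").kb.contains_alias("Russ Cochran")
```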
@@ -1,4 +1,6 @@
 import pytest
+import numpy
+from numpy.testing import assert_array_equal
 from spacy.attrs import ORTH, LENGTH
 from spacy.tokens import Doc, Span, Token
 from spacy.vocab import Vocab
@@ -120,6 +122,17 @@ def test_spans_lca_matrix(en_tokenizer):
     assert lca[1, 0] == 1  # slept & dog -> slept
     assert lca[1, 1] == 1  # slept & slept -> slept

+    # example from Span API docs
+    tokens = en_tokenizer("I like New York in Autumn")
+    doc = Doc(
+        tokens.vocab,
+        words=[t.text for t in tokens],
+        heads=[1, 1, 3, 1, 3, 4],
+        deps=["dep"] * len(tokens),
+    )
+    lca = doc[1:4].get_lca_matrix()
+    assert_array_equal(lca, numpy.asarray([[0, 0, 0], [0, 1, 2], [0, 2, 2]]))
+

 def test_span_similarity_match():
     doc = Doc(Vocab(), words=["a", "b", "a", "b"])
@@ -266,16 +279,10 @@ def test_span_string_label_kb_id(doc):
     assert span.kb_id == doc.vocab.strings["Q342"]


-def test_span_label_readonly(doc):
+def test_span_attrs_writable(doc):
     span = Span(doc, 0, 1)
-    with pytest.raises(NotImplementedError):
-        span.label_ = "hello"
-
-
-def test_span_kb_id_readonly(doc):
-    span = Span(doc, 0, 1)
-    with pytest.raises(NotImplementedError):
-        span.kb_id_ = "Q342"
+    span.label_ = "label"
+    span.kb_id_ = "kb_id"


 def test_span_ents_property(doc):
@@ -2,7 +2,7 @@ from typing import Callable, Iterable
 import pytest
 from numpy.testing import assert_equal
 from spacy.attrs import ENT_KB_ID
+from spacy.compat import pickle
 from spacy.kb import KnowledgeBase, get_candidates, Candidate
 from spacy.vocab import Vocab
@@ -11,7 +11,7 @@ from spacy.ml import load_kb
 from spacy.scorer import Scorer
 from spacy.training import Example
 from spacy.lang.en import English
-from spacy.tests.util import make_tempdir
+from spacy.tests.util import make_tempdir, make_tempfile
 from spacy.tokens import Span
@@ -290,6 +290,9 @@ def test_vocab_serialization(nlp):
     assert candidates[0].alias == adam_hash
     assert candidates[0].alias_ == "adam"

+    assert kb_new_vocab.get_vector("Q2") == [2]
+    assert_almost_equal(kb_new_vocab.get_prior_prob("Q2", "douglas"), 0.4)
+

 def test_append_alias(nlp):
     """Test that we can append additional alias-entity pairs"""
@@ -546,6 +549,98 @@ def test_kb_serialization():
     assert "RandomWord" in nlp2.vocab.strings


+@pytest.mark.xfail(reason="Needs fixing")
+def test_kb_pickle():
+    # Test that the KB can be pickled
+    nlp = English()
+    kb_1 = KnowledgeBase(nlp.vocab, entity_vector_length=3)
+    kb_1.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
+    assert not kb_1.contains_alias("Russ Cochran")
+    kb_1.add_alias(alias="Russ Cochran", entities=["Q2146908"], probabilities=[0.8])
+    assert kb_1.contains_alias("Russ Cochran")
+    data = pickle.dumps(kb_1)
+    kb_2 = pickle.loads(data)
+    assert kb_2.contains_alias("Russ Cochran")
+
+
+@pytest.mark.xfail(reason="Needs fixing")
+def test_nel_pickle():
+    # Test that a pipeline with an EL component can be pickled
+    def create_kb(vocab):
+        kb = KnowledgeBase(vocab, entity_vector_length=3)
+        kb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
+        kb.add_alias(alias="Russ Cochran", entities=["Q2146908"], probabilities=[0.8])
+        return kb
+
+    nlp_1 = English()
+    nlp_1.add_pipe("ner")
+    entity_linker_1 = nlp_1.add_pipe("entity_linker", last=True)
+    entity_linker_1.set_kb(create_kb)
+    assert nlp_1.pipe_names == ["ner", "entity_linker"]
+    assert entity_linker_1.kb.contains_alias("Russ Cochran")
+
+    data = pickle.dumps(nlp_1)
+    nlp_2 = pickle.loads(data)
+    assert nlp_2.pipe_names == ["ner", "entity_linker"]
+    entity_linker_2 = nlp_2.get_pipe("entity_linker")
+    assert entity_linker_2.kb.contains_alias("Russ Cochran")
+
+
+def test_kb_to_bytes():
+    # Test that the KB's to_bytes method works correctly
+    nlp = English()
+    kb_1 = KnowledgeBase(nlp.vocab, entity_vector_length=3)
+    kb_1.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
+    kb_1.add_entity(entity="Q66", freq=9, entity_vector=[1, 2, 3])
+    kb_1.add_alias(alias="Russ Cochran", entities=["Q2146908"], probabilities=[0.8])
+    kb_1.add_alias(alias="Boeing", entities=["Q66"], probabilities=[0.5])
+    kb_1.add_alias(alias="Randomness", entities=["Q66", "Q2146908"], probabilities=[0.1, 0.2])
+    assert kb_1.contains_alias("Russ Cochran")
+    kb_bytes = kb_1.to_bytes()
+    kb_2 = KnowledgeBase(nlp.vocab, entity_vector_length=3)
+    assert not kb_2.contains_alias("Russ Cochran")
+    kb_2 = kb_2.from_bytes(kb_bytes)
+    # check that both KBs are exactly the same
+    assert kb_1.get_size_entities() == kb_2.get_size_entities()
+    assert kb_1.entity_vector_length == kb_2.entity_vector_length
+    assert kb_1.get_entity_strings() == kb_2.get_entity_strings()
+    assert kb_1.get_vector("Q2146908") == kb_2.get_vector("Q2146908")
+    assert kb_1.get_vector("Q66") == kb_2.get_vector("Q66")
+    assert kb_2.contains_alias("Russ Cochran")
+    assert kb_1.get_size_aliases() == kb_2.get_size_aliases()
+    assert kb_1.get_alias_strings() == kb_2.get_alias_strings()
+    assert len(kb_1.get_alias_candidates("Russ Cochran")) == len(kb_2.get_alias_candidates("Russ Cochran"))
+    assert len(kb_1.get_alias_candidates("Randomness")) == len(kb_2.get_alias_candidates("Randomness"))
+
+
+def test_nel_to_bytes():
+    # Test that a pipeline with an EL component can be converted to bytes
+    def create_kb(vocab):
+        kb = KnowledgeBase(vocab, entity_vector_length=3)
+        kb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
+        kb.add_alias(alias="Russ Cochran", entities=["Q2146908"], probabilities=[0.8])
+        return kb
+
+    nlp_1 = English()
+    nlp_1.add_pipe("ner")
+    entity_linker_1 = nlp_1.add_pipe("entity_linker", last=True)
+    entity_linker_1.set_kb(create_kb)
+    assert entity_linker_1.kb.contains_alias("Russ Cochran")
+    assert nlp_1.pipe_names == ["ner", "entity_linker"]
+
+    nlp_bytes = nlp_1.to_bytes()
+    nlp_2 = English()
+    nlp_2.add_pipe("ner")
+    nlp_2.add_pipe("entity_linker", last=True)
+    assert nlp_2.pipe_names == ["ner", "entity_linker"]
+    assert not nlp_2.get_pipe("entity_linker").kb.contains_alias("Russ Cochran")
+    nlp_2 = nlp_2.from_bytes(nlp_bytes)
+    kb_2 = nlp_2.get_pipe("entity_linker").kb
+    assert kb_2.contains_alias("Russ Cochran")
+    assert kb_2.get_vector("Q2146908") == [6, -4, 3]
+    assert_almost_equal(kb_2.get_prior_prob(entity="Q2146908", alias="Russ Cochran"), 0.8)
+
+
 def test_scorer_links():
     train_examples = []
     nlp = English()
@@ -64,13 +64,15 @@ def test_serialize_doc_span_groups(en_vocab):


 def test_serialize_doc_bin():
-    doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
+    doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE", "NORM", "ENT_ID"], store_user_data=True)
     texts = ["Some text", "Lots of texts...", "..."]
     cats = {"A": 0.5}
     nlp = English()
     for doc in nlp.pipe(texts):
         doc.cats = cats
         doc.spans["start"] = [doc[0:2]]
+        doc[0].norm_ = "UNUSUAL_TOKEN_NORM"
+        doc[0].ent_id_ = "UNUSUAL_TOKEN_ENT_ID"
         doc_bin.add(doc)
     bytes_data = doc_bin.to_bytes()
@@ -82,6 +84,8 @@ def test_serialize_doc_bin():
         assert doc.text == texts[i]
         assert doc.cats == cats
         assert len(doc.spans) == 1
+        assert doc[0].norm_ == "UNUSUAL_TOKEN_NORM"
+        assert doc[0].ent_id_ == "UNUSUAL_TOKEN_ENT_ID"


 def test_serialize_doc_bin_unknown_spaces(en_vocab):
@@ -8,13 +8,36 @@ from spacy.vocab import Vocab
 from spacy.training import Example
 from spacy.lang.en import English
 from spacy.lang.de import German
-from spacy.util import registry, ignore_error, raise_error
+from spacy.util import registry, ignore_error, raise_error, logger
 import spacy
 from thinc.api import NumpyOps, get_current_ops

 from .util import add_vecs_to_vocab, assert_docs_equal


+def evil_component(doc):
+    if "2" in doc.text:
+        raise ValueError("no dice")
+    return doc
+
+
+def perhaps_set_sentences(doc):
+    if not doc.text.startswith("4"):
+        doc[-1].is_sent_start = True
+    return doc
+
+
+def assert_sents_error(doc):
+    if not doc.has_annotation("SENT_START"):
+        raise ValueError("no sents")
+    return doc
+
+
+def warn_error(proc_name, proc, docs, e):
+    logger = logging.getLogger("spacy")
+    logger.warning(f"Trouble with component {proc_name}.")
+
+
 @pytest.fixture
 def nlp():
     nlp = Language(Vocab())
@@ -93,19 +116,16 @@ def test_evaluate_no_pipe(nlp):
     nlp.evaluate([Example.from_dict(doc, annots)])


-@Language.component("test_language_vector_modification_pipe")
 def vector_modification_pipe(doc):
     doc.vector += 1
     return doc


-@Language.component("test_language_userdata_pipe")
 def userdata_pipe(doc):
     doc.user_data["foo"] = "bar"
     return doc


-@Language.component("test_language_ner_pipe")
 def ner_pipe(doc):
     span = Span(doc, 0, 1, label="FIRST")
     doc.ents += (span,)
@@ -123,6 +143,9 @@ def sample_vectors():

 @pytest.fixture
 def nlp2(nlp, sample_vectors):
+    Language.component("test_language_vector_modification_pipe", func=vector_modification_pipe)
+    Language.component("test_language_userdata_pipe", func=userdata_pipe)
+    Language.component("test_language_ner_pipe", func=ner_pipe)
     add_vecs_to_vocab(nlp.vocab, sample_vectors)
     nlp.add_pipe("test_language_vector_modification_pipe")
     nlp.add_pipe("test_language_ner_pipe")
@@ -168,82 +191,115 @@ def test_language_pipe_stream(nlp2, n_process, texts):
     assert_docs_equal(doc, expected_doc)


-def test_language_pipe_error_handler():
+@pytest.mark.parametrize("n_process", [1, 2])
+def test_language_pipe_error_handler(n_process):
     """Test that the error handling of nlp.pipe works well"""
-    nlp = English()
-    nlp.add_pipe("merge_subtokens")
-    nlp.initialize()
-    texts = ["Curious to see what will happen to this text.", "And this one."]
-    # the pipeline fails because there's no parser
-    with pytest.raises(ValueError):
-        nlp(texts[0])
-    with pytest.raises(ValueError):
-        list(nlp.pipe(texts))
-    nlp.set_error_handler(raise_error)
-    with pytest.raises(ValueError):
-        list(nlp.pipe(texts))
-    # set explicitely to ignoring
-    nlp.set_error_handler(ignore_error)
-    docs = list(nlp.pipe(texts))
-    assert len(docs) == 0
-    nlp(texts[0])
+    ops = get_current_ops()
+    if isinstance(ops, NumpyOps) or n_process < 2:
+        nlp = English()
+        nlp.add_pipe("merge_subtokens")
+        nlp.initialize()
+        texts = ["Curious to see what will happen to this text.", "And this one."]
+        # the pipeline fails because there's no parser
+        with pytest.raises(ValueError):
+            nlp(texts[0])
+        with pytest.raises(ValueError):
+            list(nlp.pipe(texts, n_process=n_process))
+        nlp.set_error_handler(raise_error)
+        with pytest.raises(ValueError):
+            list(nlp.pipe(texts, n_process=n_process))
+        # set explicitely to ignoring
+        nlp.set_error_handler(ignore_error)
+        docs = list(nlp.pipe(texts, n_process=n_process))
+        assert len(docs) == 0
+        nlp(texts[0])


-def test_language_pipe_error_handler_custom(en_vocab):
+@pytest.mark.parametrize("n_process", [1, 2])
+def test_language_pipe_error_handler_custom(en_vocab, n_process):
     """Test the error handling of a custom component that has no pipe method"""
-
-    @Language.component("my_evil_component")
-    def evil_component(doc):
-        if "2" in doc.text:
-            raise ValueError("no dice")
-        return doc
-
-    def warn_error(proc_name, proc, docs, e):
-        from spacy.util import logger
-
-        logger.warning(f"Trouble with component {proc_name}.")
-
-    nlp = English()
-    nlp.add_pipe("my_evil_component")
-    nlp.initialize()
-    texts = ["TEXT 111", "TEXT 222", "TEXT 333", "TEXT 342", "TEXT 666"]
-    with pytest.raises(ValueError):
-        # the evil custom component throws an error
-        list(nlp.pipe(texts))
-
-    nlp.set_error_handler(warn_error)
-    logger = logging.getLogger("spacy")
-    with mock.patch.object(logger, "warning") as mock_warning:
-        # the errors by the evil custom component raise a warning for each bad batch
-        docs = list(nlp.pipe(texts))
-        mock_warning.assert_called()
-        assert mock_warning.call_count == 2
-        assert len(docs) + mock_warning.call_count == len(texts)
-        assert [doc.text for doc in docs] == ["TEXT 111", "TEXT 333", "TEXT 666"]
+    Language.component("my_evil_component", func=evil_component)
+    ops = get_current_ops()
+    if isinstance(ops, NumpyOps) or n_process < 2:
+        nlp = English()
+        nlp.add_pipe("my_evil_component")
+        texts = ["TEXT 111", "TEXT 222", "TEXT 333", "TEXT 342", "TEXT 666"]
+        with pytest.raises(ValueError):
+            # the evil custom component throws an error
+            list(nlp.pipe(texts))
+
+        nlp.set_error_handler(warn_error)
+        logger = logging.getLogger("spacy")
+        with mock.patch.object(logger, "warning") as mock_warning:
+            # the errors by the evil custom component raise a warning for each
+            # bad doc
+            docs = list(nlp.pipe(texts, n_process=n_process))
+            # HACK/TODO? the warnings in child processes don't seem to be
+            # detected by the mock logger
+            if n_process == 1:
+                mock_warning.assert_called()
+                assert mock_warning.call_count == 2
+                assert len(docs) + mock_warning.call_count == len(texts)
+            assert [doc.text for doc in docs] == ["TEXT 111", "TEXT 333", "TEXT 666"]


-def test_language_pipe_error_handler_pipe(en_vocab):
+@pytest.mark.parametrize("n_process", [1, 2])
+def test_language_pipe_error_handler_pipe(en_vocab, n_process):
     """Test the error handling of a component's pipe method"""
-
-    @Language.component("my_sentences")
-    def perhaps_set_sentences(doc):
-        if not doc.text.startswith("4"):
-            doc[-1].is_sent_start = True
-        return doc
-
-    texts = [f"{str(i)} is enough. Done" for i in range(100)]
-    nlp = English()
-    nlp.add_pipe("my_sentences")
-    entity_linker = nlp.add_pipe("entity_linker", config={"entity_vector_length": 3})
-    entity_linker.kb.add_entity(entity="Q1", freq=12, entity_vector=[1, 2, 3])
-    nlp.initialize()
-    with pytest.raises(ValueError):
-        # the entity linker requires sentence boundaries, will throw an error otherwise
-        docs = list(nlp.pipe(texts, batch_size=10))
-    nlp.set_error_handler(ignore_error)
-    docs = list(nlp.pipe(texts, batch_size=10))
-    # we lose/ignore the failing 0-9 and 40-49 batches
-    assert len(docs) == 80
+    Language.component("my_perhaps_sentences", func=perhaps_set_sentences)
+    Language.component("assert_sents_error", func=assert_sents_error)
+    ops = get_current_ops()
+    if isinstance(ops, NumpyOps) or n_process < 2:
+        texts = [f"{str(i)} is enough. Done" for i in range(100)]
+        nlp = English()
+        nlp.add_pipe("my_perhaps_sentences")
+        nlp.add_pipe("assert_sents_error")
+        nlp.initialize()
+        with pytest.raises(ValueError):
+            # assert_sents_error requires sentence boundaries, will throw an error otherwise
+            docs = list(nlp.pipe(texts, n_process=n_process, batch_size=10))
+        nlp.set_error_handler(ignore_error)
+        docs = list(nlp.pipe(texts, n_process=n_process, batch_size=10))
+        # we lose/ignore the failing 4,40-49 docs
+        assert len(docs) == 89
+
+
+@pytest.mark.parametrize("n_process", [1, 2])
+def test_language_pipe_error_handler_make_doc_actual(n_process):
+    """Test the error handling for make_doc"""
+    # TODO: fix so that the following test is the actual behavior
+
+    ops = get_current_ops()
+    if isinstance(ops, NumpyOps) or n_process < 2:
+        nlp = English()
+        nlp.max_length = 10
+        texts = ["12345678901234567890", "12345"] * 10
+        with pytest.raises(ValueError):
+            list(nlp.pipe(texts, n_process=n_process))
+        nlp.default_error_handler = ignore_error
+        if n_process == 1:
+            with pytest.raises(ValueError):
+                list(nlp.pipe(texts, n_process=n_process))
+        else:
+            docs = list(nlp.pipe(texts, n_process=n_process))
+            assert len(docs) == 0
+
+
+@pytest.mark.xfail
+@pytest.mark.parametrize("n_process", [1, 2])
+def test_language_pipe_error_handler_make_doc_preferred(n_process):
+    """Test the error handling for make_doc"""
+
+    ops = get_current_ops()
+    if isinstance(ops, NumpyOps) or n_process < 2:
+        nlp = English()
+        nlp.max_length = 10
+        texts = ["12345678901234567890", "12345"] * 10
+        with pytest.raises(ValueError):
+            list(nlp.pipe(texts, n_process=n_process))
+        nlp.default_error_handler = ignore_error
+        docs = list(nlp.pipe(texts, n_process=n_process))
+        assert len(docs) == 0


 def test_language_from_config_before_after_init():
@@ -103,10 +103,12 @@ class DocBin:
             self.strings.add(token.text)
             self.strings.add(token.tag_)
             self.strings.add(token.lemma_)
+            self.strings.add(token.norm_)
            self.strings.add(str(token.morph))
            self.strings.add(token.dep_)
            self.strings.add(token.ent_type_)
            self.strings.add(token.ent_kb_id_)
+            self.strings.add(token.ent_id_)
        self.cats.append(doc.cats)
        self.user_data.append(srsly.msgpack_dumps(doc.user_data))
        self.span_groups.append(doc.spans.to_bytes())
@@ -244,7 +246,10 @@ class DocBin:
         """
         path = ensure_path(path)
         with path.open("wb") as file_:
-            file_.write(self.to_bytes())
+            try:
+                file_.write(self.to_bytes())
+            except ValueError:
+                raise ValueError(Errors.E870)

     def from_disk(self, path: Union[str, Path]) -> "DocBin":
         """Load the DocBin from a file (typically called .spacy).
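Note (illustrative, not part of the commit): DocBin now interns `NORM` and `ENT_ID` strings and raises the new E870 if a bin grows too large to serialize; E870's own advice is to split the corpus across several `.spacy` files. The file name below is hypothetical.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
texts = ["Some text", "Lots of texts...", "..."]

doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE", "NORM", "ENT_ID"], store_user_data=True)
for doc in nlp.pipe(texts):
    doc[0].norm_ = "UNUSUAL_TOKEN_NORM"           # round-trips now that NORM is stored
    doc[0].ent_id_ = "UNUSUAL_TOKEN_ENT_ID"       # likewise for ENT_ID
    doc_bin.add(doc)
doc_bin.to_disk("./sample.spacy")                 # raises ValueError(E870) if the bin is too large
```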
@@ -1673,7 +1673,7 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end):
             j_idx_in_sent = start + j - sent_start
             n_missing_tokens_in_sent = len(sent) - j_idx_in_sent
             # make sure we do not go past `end`, in cases where `end` < sent.end
-            max_range = min(j + n_missing_tokens_in_sent, end)
+            max_range = min(j + n_missing_tokens_in_sent, end - start)
             for k in range(j + 1, max_range):
                 lca = _get_tokens_lca(token_j, doc[start + k])
                 # if lca is outside of span, we set it to -1
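Note (illustrative, not part of the commit): the `end - start` fix keeps `Span.get_lca_matrix()` correct for spans that do not start at the beginning of the document; this re-runs the Span API docs example added to `test_span.py` above.

```python
import numpy
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["I", "like", "New", "York", "in", "Autumn"]
doc = Doc(vocab, words=words, heads=[1, 1, 3, 1, 3, 4], deps=["dep"] * len(words))
lca = doc[1:4].get_lca_matrix()
print(numpy.asarray(lca))   # [[0 0 0] [0 1 2] [0 2 2]]
```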
@@ -740,7 +740,7 @@ cdef class Span:
         def __get__(self):
             return self.root.ent_id_

-        def __set__(self, hash_t key):
+        def __set__(self, unicode key):
             raise NotImplementedError(Errors.E200.format(attr="ent_id_"))

     @property
@@ -762,9 +762,7 @@ cdef class Span:
             return self.doc.vocab.strings[self.label]

         def __set__(self, unicode label_):
-            if not label_:
-                label_ = ''
-            raise NotImplementedError(Errors.E129.format(start=self.start, end=self.end, label=label_))
+            self.label = self.doc.vocab.strings.add(label_)

     property kb_id_:
         """RETURNS (str): The named entity's KB ID."""
@@ -772,13 +770,7 @@ cdef class Span:
             return self.doc.vocab.strings[self.kb_id]

         def __set__(self, unicode kb_id_):
-            if not kb_id_:
-                kb_id_ = ''
-            current_label = self.label_
-            if not current_label:
-                current_label = ''
-            raise NotImplementedError(Errors.E131.format(start=self.start, end=self.end,
-                                                         label=current_label, kb_id=kb_id_))
+            self.kb_id = self.doc.vocab.strings.add(kb_id_)


 cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:
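The two `Span` hunks above replace the `NotImplementedError` stubs, so `label_` and `kb_id_` can now be assigned by string. A small sketch of the resulting behavior; the example text and IDs are made up:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Sofie lives in Berlin")

span = Span(doc, 3, 4, label="GPE")   # the token "Berlin"
span.label_ = "LOC"                   # now allowed: interned via the StringStore
span.kb_id_ = "Q64"                   # likewise for the knowledge base ID
assert span.label_ == "LOC" and span.kb_id_ == "Q64"
```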
@@ -66,7 +66,7 @@ def configure_minibatch_by_words(
     """
     optionals = {"get_length": get_length} if get_length is not None else {}
     return partial(
-        minibatch_by_words, size=size, discard_oversize=discard_oversize, **optionals
+        minibatch_by_words, size=size, tolerance=tolerance, discard_oversize=discard_oversize, **optionals
     )
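Before this fix, `tolerance` was accepted by the factory but never forwarded to `minibatch_by_words`. A hedged sketch of calling the registered batcher directly; the registry name `spacy.batch_by_words.v1` and the parameter values are assumed to match spaCy v3's defaults:

```python
import spacy

nlp = spacy.blank("en")
docs = [nlp(text) for text in ["one short text", "another text", "a somewhat longer text here"]]

# Look up the registered factory and build a batcher. With the fix, the
# 20% tolerance is actually applied when sizing batches rather than
# being silently ignored.
create_batcher = spacy.registry.batchers.get("spacy.batch_by_words.v1")
batcher = create_batcher(size=3000, tolerance=0.2, discard_oversize=False)
batches = list(batcher(docs))
print(len(batches))
```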
@@ -71,6 +71,8 @@ def offsets_to_biluo_tags(
     entities (iterable): A sequence of `(start, end, label)` triples. `start`
         and `end` should be character-offset integers denoting the slice into
         the original string.
+    missing (str): The label used for missing values, e.g. if tokenization
+        doesn’t align with the entity offsets. Defaults to "O".
     RETURNS (list): A list of unicode strings, describing the tags. Each tag
         string will be of the form either "", "O" or "{action}-{label}", where
         action is one of "B", "I", "L", "U". The missing label is used where the
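A quick sketch of the documented `missing` argument; the example sentence and offsets are illustrative:

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("I like London")
entities = [(7, 13, "GPE")]

# Tokens that are not covered by any entity receive the `missing` label.
tags = offsets_to_biluo_tags(doc, entities, missing="O")
print(tags)  # expected: ['O', 'O', 'U-GPE']
```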
@@ -150,7 +152,7 @@ def biluo_tags_to_spans(doc: Doc, tags: Iterable[str]) -> List[Span]:
     to overwrite the doc.ents.

     doc (Doc): The document that the BILUO tags refer to.
-    entities (iterable): A sequence of BILUO tags with each tag describing one
+    tags (iterable): A sequence of BILUO tags with each tag describing one
         token. Each tag string will be of the form of either "", "O" or
         "{action}-{label}", where action is one of "B", "I", "L", "U".
     RETURNS (list): A sequence of Span objects. Each token with a missing IOB
@@ -170,7 +172,7 @@ def biluo_tags_to_offsets(
     """Encode per-token tags following the BILUO scheme into entity offsets.

     doc (Doc): The document that the BILUO tags refer to.
-    entities (iterable): A sequence of BILUO tags with each tag describing one
+    tags (iterable): A sequence of BILUO tags with each tag describing one
         token. Each tags string will be of the form of either "", "O" or
         "{action}-{label}", where action is one of "B", "I", "L", "U".
     RETURNS (list): A sequence of `(start, end, label)` triples. `start` and
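Both docstring fixes rename the second argument to `tags`, matching the actual signatures. A short sketch using the two helpers together; values are illustrative:

```python
import spacy
from spacy.training import biluo_tags_to_offsets, biluo_tags_to_spans

nlp = spacy.blank("en")
doc = nlp("I like London")
tags = ["O", "O", "U-GPE"]

spans = biluo_tags_to_spans(doc, tags)
offsets = biluo_tags_to_offsets(doc, tags)
print([(span.text, span.label_) for span in spans])  # [('London', 'GPE')]
print(offsets)                                       # [(7, 13, 'GPE')]
```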
@@ -213,10 +213,10 @@ if there is no prediction.
 > kb_ids = entity_linker.predict([doc1, doc2])
 > ```

 | Name        | Description |
-| ----------- | ------------------------------------------- |
+| ----------- | -------------------------------------------------------------------------- |
 | `docs`      | The documents to predict. ~~Iterable[Doc]~~ |
-| **RETURNS** | `List[str]` | The predicted KB identifiers for the entities in the `docs`. ~~List[str]~~ |
+| **RETURNS** | The predicted KB identifiers for the entities in the `docs`. ~~List[str]~~ |

 ## EntityLinker.set_annotations {#set_annotations tag="method"}
@@ -341,6 +341,42 @@ Load the pipe from disk. Modifies the object in place and returns it.
 | `exclude`   | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
 | **RETURNS** | The modified `EntityLinker` object. ~~EntityLinker~~ |

+## EntityLinker.to_bytes {#to_bytes tag="method"}
+
+> #### Example
+>
+> ```python
+> entity_linker = nlp.add_pipe("entity_linker")
+> entity_linker_bytes = entity_linker.to_bytes()
+> ```
+
+Serialize the pipe to a bytestring, including the `KnowledgeBase`.
+
+| Name           | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| _keyword-only_ | |
+| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS**    | The serialized form of the `EntityLinker` object. ~~bytes~~ |
+
+## EntityLinker.from_bytes {#from_bytes tag="method"}
+
+Load the pipe from a bytestring. Modifies the object in place and returns it.
+
+> #### Example
+>
+> ```python
+> entity_linker_bytes = entity_linker.to_bytes()
+> entity_linker = nlp.add_pipe("entity_linker")
+> entity_linker.from_bytes(entity_linker_bytes)
+> ```
+
+| Name           | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| `bytes_data`   | The data to load from. ~~bytes~~ |
+| _keyword-only_ | |
+| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS**    | The `EntityLinker` object. ~~EntityLinker~~ |
+
 ## Serialization fields {#serialization-fields}

 During serialization, spaCy will export several data fields used to restore
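Putting the two new docs sections together, a hedged round-trip sketch; in practice the `entity_linker` needs a knowledge base and upstream NER before its predictions mean anything, and whether an uninitialized pipe serializes cleanly depends on the pipeline setup:

```python
import spacy

nlp = spacy.blank("en")
entity_linker = nlp.add_pipe("entity_linker")    # assumes a configured pipeline in real use
entity_linker_bytes = entity_linker.to_bytes()   # includes the KnowledgeBase

nlp2 = spacy.blank("en")
entity_linker2 = nlp2.add_pipe("entity_linker")
entity_linker2.from_bytes(entity_linker_bytes)
```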
@@ -426,7 +426,8 @@ component, adds it to the pipeline and returns it.
 > ```python
 > @Language.component("component")
 > def component_func(doc):
->    # modify Doc and return it return doc
+>    # modify Doc and return it
+>    return doc
 >
 > nlp.add_pipe("component", before="ner")
 > component = nlp.add_pipe("component", name="custom_name", last=True)
@@ -879,7 +879,7 @@ This method was previously available as `spacy.gold.offsets_from_biluo_tags`.
 | Name        | Description |
 | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `doc`       | The document that the BILUO tags refer to. ~~Doc~~ |
-| `entities`  | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ |
+| `tags`      | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ |
 | **RETURNS** | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string. ~~List[Tuple[int, int, str]]~~ |

 ### training.biluo_tags_to_spans {#biluo_tags_to_spans tag="function" new="2.1"}
@@ -908,7 +908,7 @@ This method was previously available as `spacy.gold.spans_from_biluo_tags`.
 | Name        | Description |
 | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `doc`       | The document that the BILUO tags refer to. ~~Doc~~ |
-| `entities`  | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ |
+| `tags`      | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ |
 | **RETURNS** | A sequence of `Span` objects with added entity labels. ~~List[Span]~~ |

 ## Utility functions {#util source="spacy/util.py"}
@@ -45,6 +45,14 @@ you generate a starter config with the **recommended settings** for your
 specific use case. It's also available in spaCy as the
 [`init config`](/api/cli#init-config) command.

+<Infobox variant="warning">
+
+Upgrade to the [latest version of spaCy](/usage) to use the quickstart widget.
+For earlier releases, follow the CLI instructions to generate a compatible
+config.
+
+</Infobox>
+
 > #### Instructions: widget
 >
 > 1. Select your requirements and settings.
@@ -1,5 +1,31 @@
 {
     "resources": [
+        {
+            "id": "nlpcloud",
+            "title": "NLPCloud.io",
+            "slogan": "Production-ready API for spaCy models in production",
+            "description": "A highly-available hosted API to easily deploy and use spaCy models in production. Supports NER, POS tagging, dependency parsing, and tokenization.",
+            "github": "nlpcloud",
+            "pip": "nlpcloud",
+            "code_example": [
+                "import nlpcloud",
+                "",
+                "client = nlpcloud.Client('en_core_web_lg', '4eC39HqLyjWDarjtT1zdp7dc')",
+                "client.entities('John Doe is a Go Developer at Google')",
+                "# [{'end': 8, 'start': 0, 'text': 'John Doe', 'type': 'PERSON'}, {'end': 25, 'start': 13, 'text': 'Go Developer', 'type': 'POSITION'}, {'end': 35,'start': 30, 'text': 'Google', 'type': 'ORG'}]"
+            ],
+            "thumb": "https://avatars.githubusercontent.com/u/77671902",
+            "image": "https://nlpcloud.io/assets/images/logo.svg",
+            "code_language": "python",
+            "author": "NLPCloud.io",
+            "author_links": {
+                "github": "nlpcloud",
+                "twitter": "cloud_nlp",
+                "website": "https://nlpcloud.io"
+            },
+            "category": ["apis", "nonpython", "standalone"],
+            "tags": ["api", "deploy", "production"]
+        },
         {
             "id": "denomme",
             "title": "denomme : Multilingual Name Detector",
@@ -10,7 +10,7 @@ const DEFAULT_LANG = 'en'
 const DEFAULT_HARDWARE = 'cpu'
 const DEFAULT_OPT = 'efficiency'
 const DEFAULT_TEXTCAT_EXCLUSIVE = true
-const COMPONENTS = ['tagger', 'parser', 'ner', 'textcat']
+const COMPONENTS = ['tagger', 'morphologizer', 'parser', 'ner', 'textcat']
 const COMMENT = `# This is an auto-generated partial config. To use it with 'spacy train'
 # you can run spacy init fill-config to auto-fill all default settings:
 # python -m spacy init fill-config ./base_config.cfg ./config.cfg`