Merge branch 'master' into spacy.io

This commit is contained in:
Ines Montani 2021-03-03 23:15:25 +11:00
commit 9280e844fb
61 changed files with 843 additions and 198 deletions

.github/contributors/dardoria.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Boian Tzonev |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 18.02.2021 |
| GitHub username | dardoria |
| Website (optional) | |

View File

@ -10,7 +10,7 @@ wasabi>=0.8.1,<1.1.0
srsly>=2.4.0,<3.0.0
catalogue>=2.0.1,<2.1.0
typer>=0.3.0,<0.4.0
pathy
pathy>=0.3.5
# Third party dependencies
numpy>=1.15.0
requests>=2.13.0,<3.0.0
@ -21,11 +21,11 @@ jinja2
setuptools
packaging>=20.0
importlib_metadata>=0.20; python_version < "3.8"
typing_extensions>=3.7.4; python_version < "3.8"
typing_extensions>=3.7.4.1,<4.0.0.0; python_version < "3.8"
# Development dependencies
cython>=0.25
pytest>=5.2.0
pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0
flake8>=3.5.0,<3.6.0
hypothesis
hypothesis>=3.27.0,<7.0.0

View File

@ -47,7 +47,7 @@ install_requires =
srsly>=2.4.0,<3.0.0
catalogue>=2.0.1,<2.1.0
typer>=0.3.0,<0.4.0
pathy
pathy>=0.3.5
# Third-party dependencies
tqdm>=4.38.0,<5.0.0
numpy>=1.15.0
@ -58,7 +58,7 @@ install_requires =
setuptools
packaging>=20.0
importlib_metadata>=0.20; python_version < "3.8"
typing_extensions>=3.7.4; python_version < "3.8"
typing_extensions>=3.7.4,<4.0.0.0; python_version < "3.8"
[options.entry_points]
console_scripts =

View File

@ -204,7 +204,7 @@ def setup_package():
for name in MOD_NAMES:
mod_path = name.replace(".", "/") + ".pyx"
ext = Extension(
name, [mod_path], language="c++", extra_compile_args=["-std=c++11"]
name, [mod_path], language="c++", include_dirs=include_dirs, extra_compile_args=["-std=c++11"]
)
ext_modules.append(ext)
print("Cythonizing sources")
@ -216,7 +216,6 @@ def setup_package():
version=about["__version__"],
ext_modules=ext_modules,
cmdclass={"build_ext": build_ext_subclass},
include_dirs=include_dirs,
package_data={"": ["*.pyx", "*.pxd", "*.pxi"]},
)

View File

@ -11,6 +11,7 @@ from click.parser import split_arg_string
from typer.main import get_command
from contextlib import contextmanager
from thinc.api import Config, ConfigValidationError, require_gpu
from thinc.util import has_cupy, gpu_is_available
from configparser import InterpolationError
import os
@ -510,3 +511,5 @@ def setup_gpu(use_gpu: int) -> None:
require_gpu(use_gpu)
else:
msg.info("Using CPU")
if has_cupy and gpu_is_available():
msg.info("To switch to GPU 0, use the option: --gpu-id 0")
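
The added hint is only shown when the CPU is in use but a GPU could have been selected. A minimal sketch of that decision as a pure function follows; `gpu_messages` and its boolean parameters are hypothetical stand-ins for `has_cupy` and `gpu_is_available()`, and the wording of the GPU branch is an assumption because the hunk does not show it in full.

```python
from typing import List


def gpu_messages(use_gpu: int, has_cupy: bool, gpu_available: bool) -> List[str]:
    """Return the info messages a setup step like the one above might print."""
    if use_gpu >= 0:
        # Assumed wording: this branch is not fully visible in the hunk.
        return [f"Using GPU: {use_gpu}"]
    messages = ["Using CPU"]
    if has_cupy and gpu_available:
        # Only hint at the GPU when one could actually be used.
        messages.append("To switch to GPU 0, use the option: --gpu-id 0")
    return messages
```

Returning the messages instead of printing them keeps the branch logic easy to check in isolation.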

View File

@ -22,7 +22,7 @@ from ..training.converters import conllu_to_docs
CONVERTERS = {
"conllubio": conllu_to_docs,
"conllu": conllu_to_docs,
"conll": conllu_to_docs,
"conll": conll_ner_to_docs,
"ner": conll_ner_to_docs,
"iob": iob_to_docs,
"json": json_to_docs,

View File

@ -132,7 +132,7 @@ def evaluate(
if displacy_path:
factory_names = [nlp.get_pipe_meta(pipe).factory for pipe in nlp.pipe_names]
docs = [ex.predicted for ex in dev_dataset]
docs = list(nlp.pipe(ex.reference.text for ex in dev_dataset[:displacy_limit]))
render_deps = "parser" in factory_names
render_ents = "ner" in factory_names
render_parses(

View File

@ -16,7 +16,11 @@ gpu_allocator = null
[nlp]
lang = "{{ lang }}"
{%- if "tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "entity_linker" in components or (("textcat" in components or "textcat_multilabel" in components) and optimize == "accuracy") -%}
{%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components %}
{%- else -%}
{%- set full_pipeline = components %}
{%- endif %}
pipeline = {{ full_pipeline|pprint()|replace("'", '"')|safe }}
batch_size = {{ 128 if hardware == "gpu" else 1000 }}
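
The template condition above can be read as plain Python. The sketch below mirrors it under the assumption that `components` is a list of component names; the function name `full_pipeline` and the two helper sets are illustrative and not part of spaCy.

```python
from typing import List

# Components that need an embedding layer in the template's condition.
EMBEDDING_USERS = {"tagger", "morphologizer", "parser", "ner", "entity_linker"}
TEXTCAT_COMPONENTS = {"textcat", "textcat_multilabel"}


def full_pipeline(components: List[str], optimize: str, use_transformer: bool) -> List[str]:
    """Prepend an embedding component only when some component needs it."""
    names = set(components)
    needs_embedding = bool(EMBEDDING_USERS & names) or (
        bool(TEXTCAT_COMPONENTS & names) and optimize == "accuracy"
    )
    if not needs_embedding:
        return list(components)
    return ["transformer" if use_transformer else "tok2vec"] + list(components)
```

With this reading, a text classifier optimized for efficiency gets no `tok2vec` or `transformer` entry in the pipeline.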

View File

@ -321,7 +321,8 @@ class Errors:
"https://spacy.io/api/top-level#util.filter_spans")
E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
"token can only be part of one entity, so make sure the entities "
"you're setting don't overlap.")
"you're setting don't overlap. To work with overlapping entities, "
"consider using doc.spans instead.")
E106 = ("Can't find `doc._.{attr}` attribute specified in the underscore "
"settings: {opts}")
E107 = ("Value of `doc._.{attr}` is not JSON-serializable: {value}")
@ -486,6 +487,15 @@ class Errors:
E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.")
# New errors added in v3.x
E879 = ("Unexpected type for 'spans' data. Provide a dictionary mapping keys to "
"a list of spans, with each span represented by a tuple (start_char, end_char). "
"The tuple can be optionally extended with a label and a KB ID.")
E880 = ("The 'wandb' library could not be found - did you install it? "
"Alternatively, specify the 'ConsoleLogger' in the 'training.logger' "
"config section, instead of the 'WandbLogger'.")
E885 = ("entity_linker.set_kb received an invalid 'kb_loader' argument: expected "
"a callable function, but got: {arg_type}")
E886 = ("Can't replace {name} -> {tok2vec} listeners: path '{path}' not "
"found in config for component '{name}'.")
E887 = ("Can't replace {name} -> {tok2vec} listeners: the paths to replace "

View File

@ -1,9 +1,21 @@
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .lex_attrs import LEX_ATTRS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc
class BulgarianDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "bg"
lex_attr_getters.update(LEX_ATTRS)
stop_words = STOP_WORDS
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
class Bulgarian(Language):

View File

@ -0,0 +1,88 @@
from ...attrs import LIKE_NUM
_num_words = [
"нула",
"едно",
"един",
"една",
"две",
"три",
"четири",
"пет",
"шест",
"седем",
"осем",
"девет",
"десет",
"единадесет",
"единайсет",
"дванадесет",
"дванайсет",
"тринадесет",
"тринайсет",
"четиринадесет",
"четиринайсет",
"петнадесет",
"петнайсет",
"шестнадесет",
"шестнайсет",
"седемнадесет",
"седемнайсет",
"осемнадесет",
"осемнайсет",
"деветнадесет",
"деветнайсет",
"двадесет",
"двайсет",
"тридесет",
"трийсет",
"четиридесет",
"четиресет",
"петдесет",
"шестдесет",
"шейсет",
"седемдесет",
"осемдесет",
"деветдесет",
"сто",
"двеста",
"триста",
"четиристотин",
"петстотин",
"шестстотин",
"седемстотин",
"осемстотин",
"деветстотин",
"хиляда",
"милион",
"милиона",
"милиард",
"милиарда",
"трилион",
"трилиона",
"билион",
"билиона",
"квадрилион",
"квадрилиона",
"квинтилион",
"квинтилиона",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}
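
Python concatenates adjacent string literals, so a list entry that lacks its trailing comma merges with the next entry, and neither word matches on its own. This is worth checking in long word lists such as `_num_words`. A small illustration with three of the words above (the variable names are illustrative only):

```python
# The first entry has no trailing comma, so Python joins it with the next literal.
words_missing_comma = [
    "четиринайсет"
    "петнадесет",
    "петнайсет",
]

# With the comma in place, each word is its own entry.
words_with_commas = [
    "четиринайсет",
    "петнадесет",
    "петнайсет",
]

merged_entry = words_missing_comma[0]
```

A length check against the expected number of entries is a cheap way to catch this kind of slip in a test.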

View File

@ -0,0 +1,68 @@
from ...symbols import ORTH, NORM
_exc = {}
_abbr_exc = [
{ORTH: "м", NORM: "метър"},
{ORTH: "мм", NORM: "милиметър"},
{ORTH: "см", NORM: "сантиметър"},
{ORTH: "дм", NORM: "дециметър"},
{ORTH: "км", NORM: "километър"},
{ORTH: "кг", NORM: "килограм"},
{ORTH: "мг", NORM: "милиграм"},
{ORTH: "г", NORM: "грам"},
{ORTH: "т", NORM: "тон"},
{ORTH: "хл", NORM: "хектолитър"},
{ORTH: "дкл", NORM: "декалитър"},
{ORTH: "л", NORM: "литър"},
]
for abbr in _abbr_exc:
_exc[abbr[ORTH]] = [abbr]
_abbr_line_exc = [
{ORTH: "г-жа", NORM: "госпожа"},
{ORTH: "г-н", NORM: "господин"},
{ORTH: "г-ца", NORM: "госпожица"},
{ORTH: "д-р", NORM: "доктор"},
{ORTH: "о", NORM: "остров"},
{ORTH: "п-в", NORM: "полуостров"},
]
for abbr in _abbr_line_exc:
_exc[abbr[ORTH]] = [abbr]
_abbr_dot_exc = [
{ORTH: "акад.", NORM: "академик"},
{ORTH: "ал.", NORM: "алинея"},
{ORTH: "арх.", NORM: "архитект"},
{ORTH: "бл.", NORM: "блок"},
{ORTH: "бр.", NORM: "брой"},
{ORTH: "бул.", NORM: "булевард"},
{ORTH: "в.", NORM: "век"},
{ORTH: "г.", NORM: "година"},
{ORTH: "гр.", NORM: "град"},
{ORTH: "ж.р.", NORM: "женски род"},
{ORTH: "инж.", NORM: "инженер"},
{ORTH: "лв.", NORM: "лев"},
{ORTH: "м.р.", NORM: "мъжки род"},
{ORTH: "мат.", NORM: "математика"},
{ORTH: "мед.", NORM: "медицина"},
{ORTH: "пл.", NORM: "площад"},
{ORTH: "проф.", NORM: "професор"},
{ORTH: "с.", NORM: "село"},
{ORTH: "с.р.", NORM: "среден род"},
{ORTH: "св.", NORM: "свети"},
{ORTH: "сп.", NORM: "списание"},
{ORTH: "стр.", NORM: "страница"},
{ORTH: "ул.", NORM: "улица"},
{ORTH: "чл.", NORM: "член"},
]
for abbr in _abbr_dot_exc:
_exc[abbr[ORTH]] = [abbr]
TOKENIZER_EXCEPTIONS = _exc
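
Because each loop assigns `_exc[abbr[ORTH]] = [abbr]`, an `ORTH` string that appears in more than one list is silently overwritten by the later entry. The sketch below shows one way to detect such collisions before building the dictionary. The string keys `"orth"` and `"norm"` stand in for spaCy's `ORTH` and `NORM` symbols, and `find_duplicate_orths` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List

ORTH, NORM = "orth", "norm"  # stand-ins for spacy.symbols.ORTH / NORM


def find_duplicate_orths(*groups: List[Dict[str, str]]) -> List[str]:
    """Return every ORTH value that occurs more than once across the groups."""
    counts = Counter(entry[ORTH] for group in groups for entry in group)
    return sorted(orth for orth, n in counts.items() if n > 1)


# Illustrative data only: the same surface form with two different norms.
unit_abbreviations = [{ORTH: "г", NORM: "грам"}, {ORTH: "л", NORM: "литър"}]
title_abbreviations = [{ORTH: "г", NORM: "господин"}, {ORTH: "д-р", NORM: "доктор"}]

duplicates = find_duplicate_orths(unit_abbreviations, title_abbreviations)
```

Running a check like this in a test makes an accidental overwrite visible instead of leaving the last definition to win.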

View File

@ -23,8 +23,6 @@ class RussianLemmatizer(Lemmatizer):
mode: str = "pymorphy2",
overwrite: bool = False,
) -> None:
super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
try:
from pymorphy2 import MorphAnalyzer
except ImportError:
@ -34,6 +32,7 @@ class RussianLemmatizer(Lemmatizer):
) from None
if RussianLemmatizer._morph is None:
RussianLemmatizer._morph = MorphAnalyzer()
super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
def pymorphy2_lemmatize(self, token: Token) -> List[str]:
string = token.text

View File

@ -7,6 +7,8 @@ from ...vocab import Vocab
class UkrainianLemmatizer(RussianLemmatizer):
_morph = None
def __init__(
self,
vocab: Vocab,
@ -16,7 +18,6 @@ class UkrainianLemmatizer(RussianLemmatizer):
mode: str = "pymorphy2",
overwrite: bool = False,
) -> None:
super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
try:
from pymorphy2 import MorphAnalyzer
except ImportError:
@ -27,3 +28,4 @@ class UkrainianLemmatizer(RussianLemmatizer):
) from None
if UkrainianLemmatizer._morph is None:
UkrainianLemmatizer._morph = MorphAnalyzer(lang="uk")
super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
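
One plausible reading of why the subclass gains its own `_morph = None` is that, without it, the attribute lookup falls through to the parent class and finds the parent's analyzer already cached. The sketch below reproduces that effect with plain classes; all class names and the string "analyzers" are hypothetical stand-ins, not spaCy or pymorphy2 objects.

```python
class RussianLike:
    _analyzer = None  # class-level cache shared by all instances

    def __init__(self) -> None:
        if RussianLike._analyzer is None:
            RussianLike._analyzer = "ru-analyzer"


class UkrainianShared(RussianLike):
    # No class attribute of its own: lookups resolve to RussianLike._analyzer.
    def __init__(self) -> None:
        super().__init__()
        if UkrainianShared._analyzer is None:  # already "ru-analyzer", so skipped
            UkrainianShared._analyzer = "uk-analyzer"


class UkrainianOwn(RussianLike):
    _analyzer = None  # shadows the parent's cache

    def __init__(self) -> None:
        if UkrainianOwn._analyzer is None:
            UkrainianOwn._analyzer = "uk-analyzer"
        super().__init__()


shared = UkrainianShared()
own = UkrainianOwn()
```

In this sketch only `UkrainianOwn` ends up with its own analyzer, while the parent's cache stays untouched.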

View File

@ -684,12 +684,12 @@ class Language:
# TODO: handle errors and mismatches (vectors etc.)
if not isinstance(source, self.__class__):
raise ValueError(Errors.E945.format(name=source_name, source=type(source)))
if not source.has_pipe(source_name):
if source_name not in source.component_names:
raise KeyError(
Errors.E944.format(
name=source_name,
model=f"{source.meta['lang']}_{source.meta['name']}",
opts=", ".join(source.pipe_names),
opts=", ".join(source.component_names),
)
)
pipe = source.get_pipe(source_name)
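
The switch from `has_pipe` / `pipe_names` to `component_names` matters when the source pipeline has disabled components: those are listed among the components but not among the active pipes. The toy class below illustrates the distinction; `TinyPipeline` is a hypothetical stand-in and not spaCy's `Language` class.

```python
from typing import Iterable, List


class TinyPipeline:
    """Toy pipeline that tracks all components and which ones are disabled."""

    def __init__(self, components: Iterable[str], disabled: Iterable[str] = ()) -> None:
        self._components = list(components)
        self._disabled = set(disabled)

    @property
    def component_names(self) -> List[str]:
        # Every component, whether enabled or disabled.
        return list(self._components)

    @property
    def pipe_names(self) -> List[str]:
        # Only the components that would actually run.
        return [name for name in self._components if name not in self._disabled]


source = TinyPipeline(["tok2vec", "tagger", "ner"], disabled=["ner"])
can_source_ner = "ner" in source.component_names
```

Checking membership in `component_names` lets a disabled component still be found, and listing the same names in the error message keeps the reported options consistent with the check.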

View File

@ -8,7 +8,7 @@ from ...kb import KnowledgeBase, Candidate, get_candidates
from ...vocab import Vocab
@registry.architectures.register("spacy.EntityLinker.v1")
@registry.architectures("spacy.EntityLinker.v1")
def build_nel_encoder(tok2vec: Model, nO: Optional[int] = None) -> Model:
with Model.define_operators({">>": chain, "**": clone}):
token_width = tok2vec.get_dim("nO")
@ -25,7 +25,7 @@ def build_nel_encoder(tok2vec: Model, nO: Optional[int] = None) -> Model:
return model
@registry.misc.register("spacy.KBFromFile.v1")
@registry.misc("spacy.KBFromFile.v1")
def load_kb(kb_path: Path) -> Callable[[Vocab], KnowledgeBase]:
def kb_from_file(vocab):
kb = KnowledgeBase(vocab, entity_vector_length=1)
@ -35,7 +35,7 @@ def load_kb(kb_path: Path) -> Callable[[Vocab], KnowledgeBase]:
return kb_from_file
@registry.misc.register("spacy.EmptyKB.v1")
@registry.misc("spacy.EmptyKB.v1")
def empty_kb(entity_vector_length: int) -> Callable[[Vocab], KnowledgeBase]:
def empty_kb_factory(vocab):
return KnowledgeBase(vocab=vocab, entity_vector_length=entity_vector_length)
@ -43,6 +43,6 @@ def empty_kb(entity_vector_length: int) -> Callable[[Vocab], KnowledgeBase]:
return empty_kb_factory
@registry.misc.register("spacy.CandidateGenerator.v1")
@registry.misc("spacy.CandidateGenerator.v1")
def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]:
return get_candidates
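
The decorator change in these hunks replaces `@registry.misc.register("name")` with the shorter `@registry.misc("name")`. A toy registry that supports both spellings is sketched below to show that the two forms can be equivalent; `TinyRegistry` is hypothetical and is not how `catalogue` or spaCy implement their registries.

```python
from typing import Callable, Dict


class TinyRegistry:
    """Toy function registry usable as `@reg.register(name)` or `@reg(name)`."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable] = {}

    def register(self, name: str) -> Callable[[Callable], Callable]:
        def decorator(func: Callable) -> Callable:
            self._functions[name] = func
            return func  # return the function unchanged so it stays callable

        return decorator

    def __call__(self, name: str) -> Callable[[Callable], Callable]:
        # Calling the registry is shorthand for .register(name).
        return self.register(name)

    def get(self, name: str) -> Callable:
        return self._functions[name]


misc = TinyRegistry()


@misc.register("example.Long.v1")
def long_form() -> str:
    return "long"


@misc("example.Short.v1")
def short_form() -> str:
    return "short"
```

Both decorated functions end up retrievable by name through `misc.get`.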

View File

@ -16,7 +16,7 @@ if TYPE_CHECKING:
from ...tokens import Doc # noqa: F401
@registry.architectures.register("spacy.PretrainVectors.v1")
@registry.architectures("spacy.PretrainVectors.v1")
def create_pretrain_vectors(
maxout_pieces: int, hidden_size: int, loss: str
) -> Callable[["Vocab", Model], Model]:
@ -40,7 +40,7 @@ def create_pretrain_vectors(
return create_vectors_objective
@registry.architectures.register("spacy.PretrainCharacters.v1")
@registry.architectures("spacy.PretrainCharacters.v1")
def create_pretrain_characters(
maxout_pieces: int, hidden_size: int, n_characters: int
) -> Callable[["Vocab", Model], Model]:

View File

@ -10,7 +10,7 @@ from ..tb_framework import TransitionModel
from ...tokens import Doc
@registry.architectures.register("spacy.TransitionBasedParser.v1")
@registry.architectures("spacy.TransitionBasedParser.v1")
def transition_parser_v1(
tok2vec: Model[List[Doc], List[Floats2d]],
state_type: Literal["parser", "ner"],
@ -31,7 +31,7 @@ def transition_parser_v1(
)
@registry.architectures.register("spacy.TransitionBasedParser.v2")
@registry.architectures("spacy.TransitionBasedParser.v2")
def transition_parser_v2(
tok2vec: Model[List[Doc], List[Floats2d]],
state_type: Literal["parser", "ner"],

View File

@ -6,7 +6,7 @@ from ...util import registry
from ...tokens import Doc
@registry.architectures.register("spacy.Tagger.v1")
@registry.architectures("spacy.Tagger.v1")
def build_tagger_model(
tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None
) -> Model[List[Doc], List[Floats2d]]:

View File

@ -15,7 +15,7 @@ from ...tokens import Doc
from .tok2vec import get_tok2vec_width
@registry.architectures.register("spacy.TextCatCNN.v1")
@registry.architectures("spacy.TextCatCNN.v1")
def build_simple_cnn_text_classifier(
tok2vec: Model, exclusive_classes: bool, nO: Optional[int] = None
) -> Model[List[Doc], Floats2d]:
@ -41,7 +41,7 @@ def build_simple_cnn_text_classifier(
return model
@registry.architectures.register("spacy.TextCatBOW.v1")
@registry.architectures("spacy.TextCatBOW.v1")
def build_bow_text_classifier(
exclusive_classes: bool,
ngram_size: int,
@ -60,7 +60,7 @@ def build_bow_text_classifier(
return model
@registry.architectures.register("spacy.TextCatEnsemble.v2")
@registry.architectures("spacy.TextCatEnsemble.v2")
def build_text_classifier_v2(
tok2vec: Model[List[Doc], List[Floats2d]],
linear_model: Model[List[Doc], Floats2d],
@ -112,7 +112,7 @@ def init_ensemble_textcat(model, X, Y) -> Model:
return model
@registry.architectures.register("spacy.TextCatLowData.v1")
@registry.architectures("spacy.TextCatLowData.v1")
def build_text_classifier_lowdata(
width: int, dropout: Optional[float], nO: Optional[int] = None
) -> Model[List[Doc], Floats2d]:

View File

@ -14,7 +14,7 @@ from ...pipeline.tok2vec import Tok2VecListener
from ...attrs import intify_attr
@registry.architectures.register("spacy.Tok2VecListener.v1")
@registry.architectures("spacy.Tok2VecListener.v1")
def tok2vec_listener_v1(width: int, upstream: str = "*"):
tok2vec = Tok2VecListener(upstream_name=upstream, width=width)
return tok2vec
@ -31,7 +31,7 @@ def get_tok2vec_width(model: Model):
return nO
@registry.architectures.register("spacy.HashEmbedCNN.v1")
@registry.architectures("spacy.HashEmbedCNN.v1")
def build_hash_embed_cnn_tok2vec(
*,
width: int,
@ -87,7 +87,7 @@ def build_hash_embed_cnn_tok2vec(
)
@registry.architectures.register("spacy.Tok2Vec.v2")
@registry.architectures("spacy.Tok2Vec.v2")
def build_Tok2Vec_model(
embed: Model[List[Doc], List[Floats2d]],
encode: Model[List[Floats2d], List[Floats2d]],
@ -108,7 +108,7 @@ def build_Tok2Vec_model(
return tok2vec
@registry.architectures.register("spacy.MultiHashEmbed.v1")
@registry.architectures("spacy.MultiHashEmbed.v1")
def MultiHashEmbed(
width: int,
attrs: List[Union[str, int]],
@ -182,7 +182,7 @@ def MultiHashEmbed(
return model
@registry.architectures.register("spacy.CharacterEmbed.v1")
@registry.architectures("spacy.CharacterEmbed.v1")
def CharacterEmbed(
width: int,
rows: int,
@ -255,7 +255,7 @@ def CharacterEmbed(
return model
@registry.architectures.register("spacy.MaxoutWindowEncoder.v2")
@registry.architectures("spacy.MaxoutWindowEncoder.v2")
def MaxoutWindowEncoder(
width: int, window_size: int, maxout_pieces: int, depth: int
) -> Model[List[Floats2d], List[Floats2d]]:
@ -287,7 +287,7 @@ def MaxoutWindowEncoder(
return with_array(model, pad=receptive_field)
@registry.architectures.register("spacy.MishWindowEncoder.v2")
@registry.architectures("spacy.MishWindowEncoder.v2")
def MishWindowEncoder(
width: int, window_size: int, depth: int
) -> Model[List[Floats2d], List[Floats2d]]:
@ -310,7 +310,7 @@ def MishWindowEncoder(
return with_array(model)
@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
@registry.architectures("spacy.TorchBiLSTMEncoder.v1")
def BiLSTMEncoder(
width: int, depth: int, dropout: float
) -> Model[List[Floats2d], List[Floats2d]]:

View File

@ -45,6 +45,7 @@ DEFAULT_NEL_MODEL = Config().from_str(default_model_config)["model"]
default_config={
"model": DEFAULT_NEL_MODEL,
"labels_discard": [],
"n_sents": 0,
"incl_prior": True,
"incl_context": True,
"entity_vector_length": 64,
@ -62,6 +63,7 @@ def make_entity_linker(
model: Model,
*,
labels_discard: Iterable[str],
n_sents: int,
incl_prior: bool,
incl_context: bool,
entity_vector_length: int,
@ -73,6 +75,7 @@ def make_entity_linker(
representations. Given a batch of Doc objects, it should return a single
array, with one row per item in the batch.
labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
n_sents (int): The number of neighbouring sentences to take into account.
incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
incl_context (bool): Whether or not to include the local context in the model.
entity_vector_length (int): Size of encoding vectors in the KB.
@ -84,6 +87,7 @@ def make_entity_linker(
model,
name,
labels_discard=labels_discard,
n_sents=n_sents,
incl_prior=incl_prior,
incl_context=incl_context,
entity_vector_length=entity_vector_length,
@ -106,6 +110,7 @@ class EntityLinker(TrainablePipe):
name: str = "entity_linker",
*,
labels_discard: Iterable[str],
n_sents: int,
incl_prior: bool,
incl_context: bool,
entity_vector_length: int,
@ -118,6 +123,7 @@ class EntityLinker(TrainablePipe):
name (str): The component instance name, used to add entries to the
losses during training.
labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
n_sents (int): The number of neighbouring sentences to take into account.
incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
incl_context (bool): Whether or not to include the local context in the model.
entity_vector_length (int): Size of encoding vectors in the KB.
@ -129,25 +135,24 @@ class EntityLinker(TrainablePipe):
self.vocab = vocab
self.model = model
self.name = name
cfg = {
"labels_discard": list(labels_discard),
"incl_prior": incl_prior,
"incl_context": incl_context,
"entity_vector_length": entity_vector_length,
}
self.labels_discard = list(labels_discard)
self.n_sents = n_sents
self.incl_prior = incl_prior
self.incl_context = incl_context
self.get_candidates = get_candidates
self.cfg = dict(cfg)
self.cfg = {}
self.distance = CosineDistance(normalize=False)
# how many neightbour sentences to take into account
self.n_sents = cfg.get("n_sents", 0)
# create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
self.kb = empty_kb(entity_vector_length)(self.vocab)
def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
"""Define the KB of this pipe by providing a function that will
create it using this object's vocab."""
if not callable(kb_loader):
raise ValueError(Errors.E885.format(arg_type=type(kb_loader)))
self.kb = kb_loader(self.vocab)
self.cfg["entity_vector_length"] = self.kb.entity_vector_length
def validate_kb(self) -> None:
# Raise an error if the knowledge base is not initialized.
@ -309,14 +314,13 @@ class EntityLinker(TrainablePipe):
sent_doc = doc[start_token:end_token].as_doc()
# currently, the context is the same for each entity in a sentence (should be refined)
xp = self.model.ops.xp
if self.cfg.get("incl_context"):
if self.incl_context:
sentence_encoding = self.model.predict([sent_doc])[0]
sentence_encoding_t = sentence_encoding.T
sentence_norm = xp.linalg.norm(sentence_encoding_t)
for ent in sent.ents:
entity_count += 1
to_discard = self.cfg.get("labels_discard", [])
if to_discard and ent.label_ in to_discard:
if ent.label_ in self.labels_discard:
# ignoring this entity - setting to NIL
final_kb_ids.append(self.NIL)
else:
@ -334,13 +338,13 @@ class EntityLinker(TrainablePipe):
prior_probs = xp.asarray(
[c.prior_prob for c in candidates]
)
if not self.cfg.get("incl_prior"):
if not self.incl_prior:
prior_probs = xp.asarray(
[0.0 for _ in candidates]
)
scores = prior_probs
# add in similarity from the context
if self.cfg.get("incl_context"):
if self.incl_context:
entity_encodings = xp.asarray(
[c.entity_vector for c in candidates]
)
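
The hunks above move the linker's settings from a `cfg` dictionary to plain attributes and add a callable check in `set_kb`. A compact sketch of that shape follows; `TinyLinker`, its defaults, the error wording, and the dictionary used as a "knowledge base" are hypothetical and only illustrate the pattern.

```python
from typing import Any, Callable, Iterable


class TinyLinker:
    """Toy component that stores its settings as attributes."""

    def __init__(
        self,
        *,
        labels_discard: Iterable[str] = (),
        n_sents: int = 0,
        incl_prior: bool = True,
        incl_context: bool = True,
    ) -> None:
        self.labels_discard = list(labels_discard)
        self.n_sents = n_sents
        self.incl_prior = incl_prior
        self.incl_context = incl_context
        self.kb: Any = None

    def set_kb(self, kb_loader: Callable[[str], Any], vocab: str = "vocab") -> None:
        # Fail early with a clear message if something other than a function is passed.
        if not callable(kb_loader):
            raise ValueError(f"expected a callable function, but got: {type(kb_loader)}")
        self.kb = kb_loader(vocab)

    def is_discarded(self, label: str) -> bool:
        # An empty list needs no special case: membership is simply False.
        return label in self.labels_discard


linker = TinyLinker(labels_discard=["DATE"], n_sents=2)
linker.set_kb(lambda vocab: {"vocab": vocab})
```

Attribute access such as `linker.n_sents` is what the updated tests later in this commit rely on.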

View File

@ -66,26 +66,12 @@ class Sentencizer(Pipe):
"""
error_handler = self.get_error_handler()
try:
self._call(doc)
tags = self.predict([doc])
self.set_annotations([doc], tags)
return doc
except Exception as e:
error_handler(self.name, self, [doc], e)
def _call(self, doc):
start = 0
seen_period = False
for i, token in enumerate(doc):
is_in_punct_chars = token.text in self.punct_chars
token.is_sent_start = i == 0
if seen_period and not token.is_punct and not is_in_punct_chars:
doc[start].is_sent_start = True
start = token.i
seen_period = False
elif is_in_punct_chars:
seen_period = True
if start < len(doc):
doc[start].is_sent_start = True
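
The removed `_call` loop can be ported to plain strings to see the rule it implemented: a new sentence starts at the first non-punctuation token after a sentence-final character. This is a sketch only. `sentence_starts` is a hypothetical name, the default `punct_chars` is a small sample set, and `is_punct` is a crude stand-in for spaCy's token attribute.

```python
from typing import FrozenSet, List


def is_punct(token: str) -> bool:
    # Crude stand-in: a token with no letters or digits counts as punctuation.
    return not any(ch.isalnum() for ch in token)


def sentence_starts(
    tokens: List[str], punct_chars: FrozenSet[str] = frozenset({".", "!", "?"})
) -> List[bool]:
    """Mark which tokens begin a sentence, following the removed loop."""
    starts = [False] * len(tokens)
    start = 0
    seen_period = False
    for i, token in enumerate(tokens):
        is_in_punct_chars = token in punct_chars
        starts[i] = i == 0
        if seen_period and not is_punct(token) and not is_in_punct_chars:
            starts[start] = True
            start = i
            seen_period = False
        elif is_in_punct_chars:
            seen_period = True
    if start < len(tokens):
        starts[start] = True
    return starts


example = sentence_starts(["Hi", ".", "Bye", "!", "!"])
```

Trailing punctuation stays attached to the sentence it closes, because a new start is only set once a non-punctuation token appears.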
def predict(self, docs):
"""Apply the pipe to a batch of docs, without modifying them.

View File

@ -314,6 +314,9 @@ class Scorer:
getter (Callable[[Doc, str], Iterable[Span]]): Defaults to getattr. If
provided, getter(doc, attr) should return the spans for the
individual doc.
has_annotation (Optional[Callable[[Doc], bool]]) should return whether a `Doc`
has annotation for this `attr`. Docs without annotation are skipped for
scoring purposes.
RETURNS (Dict[str, Any]): A dictionary containing the PRF scores under
the keys attr_p/r/f and the per-type PRF scores under attr_per_type.
@ -324,7 +327,7 @@ class Scorer:
for example in examples:
pred_doc = example.predicted
gold_doc = example.reference
# Option to handle docs without sents
# Option to handle docs without annotation for this attribute
if has_annotation is not None:
if not has_annotation(gold_doc):
continue
@ -531,6 +534,7 @@ class Scorer:
gold_span = gold_ent_by_offset.get(
(pred_ent.start_char, pred_ent.end_char), None
)
if gold_span is not None:
label = gold_span.label_
if label not in f_per_type:
f_per_type[label] = PRFScore()
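
The added `if gold_span is not None` guard keeps predicted entities that have no gold counterpart from being looked up by label. A reduced sketch of that guard is shown below; `count_by_gold_label` and the tuple-based data are hypothetical simplifications, and the real scorer tracks full precision/recall scores rather than simple counts.

```python
from typing import Dict, List, Tuple

Offset = Tuple[int, int]


def count_by_gold_label(
    predicted: List[Offset], gold_label_by_offset: Dict[Offset, str]
) -> Dict[str, int]:
    """Count predicted spans per gold label, skipping spans with no gold match."""
    counts: Dict[str, int] = {}
    for offsets in predicted:
        gold_label = gold_label_by_offset.get(offsets)
        if gold_label is not None:
            # Only spans that also exist in the gold data contribute a label.
            counts[gold_label] = counts.get(gold_label, 0) + 1
    return counts


per_type = count_by_gold_label(
    predicted=[(0, 12), (35, 44)],
    gold_label_by_offset={(0, 12): "PERSON"},
)
```

Here the second predicted span has no gold entry, so it is skipped rather than causing an attribute error on `None`.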

View File

@ -39,6 +39,11 @@ def ar_tokenizer():
return get_lang_class("ar")().tokenizer
@pytest.fixture(scope="session")
def bg_tokenizer():
return get_lang_class("bg")().tokenizer
@pytest.fixture(scope="session")
def bn_tokenizer():
return get_lang_class("bn")().tokenizer

View File

@ -1,3 +1,5 @@
import weakref
import pytest
import numpy
import logging
@ -663,3 +665,10 @@ def test_span_groups(en_tokenizer):
assert doc.spans["hi"].has_overlap
del doc.spans["hi"]
assert "hi" not in doc.spans
def test_doc_spans_copy(en_tokenizer):
doc1 = en_tokenizer("Some text about Colombia and the Czech Republic")
assert weakref.ref(doc1) == doc1.spans.doc_ref
doc2 = doc1.copy()
assert weakref.ref(doc2) == doc2.spans.doc_ref
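
The new test compares a freshly created `weakref.ref` with the reference stored on `doc.spans`. That works because two weak references compare equal while their referents are alive and the referents themselves compare equal. A standalone illustration with a hypothetical `Owner` class:

```python
import weakref


class Owner:
    """Plain class; instances support weak references by default."""


first = Owner()
second = Owner()

ref_first_a = weakref.ref(first)
ref_first_b = weakref.ref(first)
ref_second = weakref.ref(second)

same_target = ref_first_a == ref_first_b
different_target = ref_first_a == ref_second
```

Calling a reference, as in `ref_first_a()`, returns the referent while it is still alive.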

View File

@ -0,0 +1,30 @@
import pytest
from spacy.lang.bg.lex_attrs import like_num
@pytest.mark.parametrize(
"word,match",
[
("10", True),
("1", True),
("10000", True),
("1.000", True),
("бројка", False),
("999,23", True),
("едно", True),
("две", True),
("цифра", False),
("единайсет", True),
("десет", True),
("сто", True),
("брой", False),
("хиляда", True),
("милион", True),
(",", False),
("милиарда", True),
("билион", True),
],
)
def test_bg_lex_attrs_like_number(bg_tokenizer, word, match):
tokens = bg_tokenizer(word)
assert len(tokens) == 1
assert tokens[0].like_num == match

View File

@ -230,7 +230,7 @@ def test_el_pipe_configuration(nlp):
def get_lowercased_candidates(kb, span):
return kb.get_alias_candidates(span.text.lower())
@registry.misc.register("spacy.LowercaseCandidateGenerator.v1")
@registry.misc("spacy.LowercaseCandidateGenerator.v1")
def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]:
return get_lowercased_candidates
@ -250,6 +250,14 @@ def test_el_pipe_configuration(nlp):
assert doc[2].ent_kb_id_ == "Q2"
def test_nel_nsents(nlp):
"""Test that n_sents can be set through the configuration"""
entity_linker = nlp.add_pipe("entity_linker", config={})
assert entity_linker.n_sents == 0
entity_linker = nlp.replace_pipe("entity_linker", "entity_linker", config={"n_sents": 2})
assert entity_linker.n_sents == 2
def test_vocab_serialization(nlp):
"""Test that string information is retained across storage"""
mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)

View File

@ -83,9 +83,9 @@ def test_replace_last_pipe(nlp):
def test_replace_pipe_config(nlp):
nlp.add_pipe("entity_linker")
nlp.add_pipe("sentencizer")
assert nlp.get_pipe("entity_linker").cfg["incl_prior"] is True
assert nlp.get_pipe("entity_linker").incl_prior is True
nlp.replace_pipe("entity_linker", "entity_linker", config={"incl_prior": False})
assert nlp.get_pipe("entity_linker").cfg["incl_prior"] is False
assert nlp.get_pipe("entity_linker").incl_prior is False
@pytest.mark.parametrize("old_name,new_name", [("old_pipe", "new_pipe")])

View File

@ -61,7 +61,6 @@ def test_issue7029():
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
texts = ["first", "second", "third", "fourth", "and", "then", "some", ""]
nlp.select_pipes(enable=["tok2vec", "tagger"])
docs1 = list(nlp.pipe(texts, batch_size=1))
docs2 = list(nlp.pipe(texts, batch_size=4))
assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]]

View File

@ -1,5 +1,3 @@
import pytest
from spacy.tokens.doc import Doc
from spacy.vocab import Vocab
from spacy.pipeline._parser_internals.arc_eager import ArcEager

View File

@ -0,0 +1,54 @@
from spacy.kb import KnowledgeBase
from spacy.training import Example
from spacy.lang.en import English
# fmt: off
TRAIN_DATA = [
("Russ Cochran his reprints include EC Comics.",
{"links": {(0, 12): {"Q2146908": 1.0}},
"entities": [(0, 12, "PERSON")],
"sent_starts": [1, -1, 0, 0, 0, 0, 0, 0]})
]
# fmt: on
def test_partial_links():
# Test that having some entities on the doc without gold links, doesn't crash
nlp = English()
vector_length = 3
train_examples = []
for text, annotation in TRAIN_DATA:
doc = nlp(text)
train_examples.append(Example.from_dict(doc, annotation))
def create_kb(vocab):
# create artificial KB
mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
mykb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
mykb.add_alias("Russ Cochran", ["Q2146908"], [0.9])
return mykb
# Create and train the Entity Linker
entity_linker = nlp.add_pipe("entity_linker", last=True)
entity_linker.set_kb(create_kb)
optimizer = nlp.initialize(get_examples=lambda: train_examples)
for i in range(2):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
# adding additional components that are required for the entity_linker
nlp.add_pipe("sentencizer", first=True)
patterns = [
{"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]},
{"label": "ORG", "pattern": [{"LOWER": "ec"}, {"LOWER": "comics"}]}
]
ruler = nlp.add_pipe("entity_ruler", before="entity_linker")
ruler.add_patterns(patterns)
# this will run the pipeline on the examples and shouldn't crash
results = nlp.evaluate(train_examples)
assert "PERSON" in results["ents_per_type"]
assert "PERSON" in results["nel_f_per_type"]
assert "ORG" in results["ents_per_type"]
assert "ORG" not in results["nel_f_per_type"]

View File

@ -0,0 +1,18 @@
from spacy.lang.en import English
def test_issue7065():
text = "Kathleen Battle sang in Mahler 's Symphony No. 8 at the Cincinnati Symphony Orchestra 's May Festival."
nlp = English()
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "THING", "pattern": [{"LOWER": "symphony"}, {"LOWER": "no"}, {"LOWER": "."}, {"LOWER": "8"}]}]
ruler.add_patterns(patterns)
doc = nlp(text)
sentences = [s for s in doc.sents]
assert len(sentences) == 2
sent0 = sentences[0]
ent = doc.ents[0]
assert ent.start < sent0.end < ent.end
assert sentences.index(ent.sent) == 0

View File

@ -160,7 +160,7 @@ subword_features = false
"""
@registry.architectures.register("my_test_parser")
@registry.architectures("my_test_parser")
def my_parser():
tok2vec = build_Tok2Vec_model(
MultiHashEmbed(

View File

@ -108,7 +108,7 @@ def test_serialize_subclassed_kb():
super().__init__(vocab, entity_vector_length)
self.custom_field = custom_field
@registry.misc.register("spacy.CustomKB.v1")
@registry.misc("spacy.CustomKB.v1")
def custom_kb(
entity_vector_length: int, custom_field: int
) -> Callable[["Vocab"], KnowledgeBase]:

View File

@ -4,12 +4,12 @@ from thinc.api import Linear
from catalogue import RegistryError
@registry.architectures.register("my_test_function")
def create_model(nr_in, nr_out):
def test_get_architecture():
@registry.architectures("my_test_function")
def create_model(nr_in, nr_out):
return Linear(nr_in, nr_out)
def test_get_architecture():
arch = registry.architectures.get("my_test_function")
assert arch is create_model
with pytest.raises(RegistryError):

View File

@ -7,7 +7,7 @@ from spacy import util
from spacy import prefer_gpu, require_gpu, require_cpu
from spacy.ml._precomputable_affine import PrecomputableAffine
from spacy.ml._precomputable_affine import _backprop_precomputable_affine_padding
from spacy.util import dot_to_object, SimpleFrozenList
from spacy.util import dot_to_object, SimpleFrozenList, import_file
from thinc.api import Config, Optimizer, ConfigValidationError
from spacy.training.batchers import minibatch_by_words
from spacy.lang.en import English
@ -17,7 +17,7 @@ from spacy.schemas import ConfigSchemaTraining
from thinc.api import get_current_ops, NumpyOps, CupyOps
from .util import get_random_doc
from .util import get_random_doc, make_tempdir
@pytest.fixture
@ -347,3 +347,35 @@ def test_resolve_dot_names():
errors = e.value.errors
assert len(errors) == 1
assert errors[0]["loc"] == ["training", "xyz"]
def test_import_code():
code_str = """
from spacy import Language
class DummyComponent:
def __init__(self, vocab, name):
pass
def initialize(self, get_examples, *, nlp, dummy_param: int):
pass
@Language.factory(
"dummy_component",
)
def make_dummy_component(
nlp: Language, name: str
):
return DummyComponent(nlp.vocab, name)
"""
with make_tempdir() as temp_dir:
code_path = os.path.join(temp_dir, "code.py")
with open(code_path, "w") as fileh:
fileh.write(code_str)
import_file("python_code", code_path)
config = {"initialize": {"components": {"dummy_component": {"dummy_param": 1}}}}
nlp = English.from_config(config)
nlp.add_pipe("dummy_component")
nlp.initialize()

View File

@ -196,6 +196,104 @@ def test_Example_from_dict_with_entities_invalid(annots):
assert len(list(example.reference.ents)) == 0
@pytest.mark.parametrize(
"annots",
[
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"entities": [
(7, 15, "LOC"),
(11, 15, "LOC"),
(20, 26, "LOC"),
], # overlapping
}
],
)
def test_Example_from_dict_with_entities_overlapping(annots):
vocab = Vocab()
predicted = Doc(vocab, words=annots["words"])
with pytest.raises(ValueError):
Example.from_dict(predicted, annots)
@pytest.mark.parametrize(
"annots",
[
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"spans": {
"cities": [(7, 15, "LOC"), (20, 26, "LOC")],
"people": [(0, 1, "PERSON")],
},
}
],
)
def test_Example_from_dict_with_spans(annots):
vocab = Vocab()
predicted = Doc(vocab, words=annots["words"])
example = Example.from_dict(predicted, annots)
assert len(list(example.reference.ents)) == 0
assert len(list(example.reference.spans["cities"])) == 2
assert len(list(example.reference.spans["people"])) == 1
for span in example.reference.spans["cities"]:
assert span.label_ == "LOC"
for span in example.reference.spans["people"]:
assert span.label_ == "PERSON"
@pytest.mark.parametrize(
"annots",
[
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"spans": {
"cities": [(7, 15, "LOC"), (11, 15, "LOC"), (20, 26, "LOC")],
"people": [(0, 1, "PERSON")],
},
}
],
)
def test_Example_from_dict_with_spans_overlapping(annots):
vocab = Vocab()
predicted = Doc(vocab, words=annots["words"])
example = Example.from_dict(predicted, annots)
assert len(list(example.reference.ents)) == 0
assert len(list(example.reference.spans["cities"])) == 3
assert len(list(example.reference.spans["people"])) == 1
for span in example.reference.spans["cities"]:
assert span.label_ == "LOC"
for span in example.reference.spans["people"]:
assert span.label_ == "PERSON"
@pytest.mark.parametrize(
"annots",
[
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"spans": [(0, 1, "PERSON")],
},
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"spans": {"cities": (7, 15, "LOC")},
},
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"spans": {"cities": [7, 11]},
},
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"spans": {"cities": [[7]]},
},
],
)
def test_Example_from_dict_with_spans_invalid(annots):
vocab = Vocab()
predicted = Doc(vocab, words=annots["words"])
with pytest.raises(ValueError):
Example.from_dict(predicted, annots)
@pytest.mark.parametrize(
"annots",
[

View File

@ -27,7 +27,7 @@ def test_readers():
factory = "textcat"
"""
@registry.readers.register("myreader.v1")
@registry.readers("myreader.v1")
def myreader() -> Dict[str, Callable[[Language, str], Iterable[Example]]]:
annots = {"cats": {"POS": 1.0, "NEG": 0.0}}

View File

@ -1,7 +1,8 @@
from .doc import Doc
from .token import Token
from .span import Span
from .span_group import SpanGroup
from ._serialize import DocBin
from .morphanalysis import MorphAnalysis
__all__ = ["Doc", "Token", "Span", "DocBin", "MorphAnalysis"]
__all__ = ["Doc", "Token", "Span", "SpanGroup", "DocBin", "MorphAnalysis"]

View File

@ -33,8 +33,10 @@ class SpanGroups(UserDict):
def _make_span_group(self, name: str, spans: Iterable["Span"]) -> SpanGroup:
return SpanGroup(self.doc_ref(), name=name, spans=spans)
def copy(self) -> "SpanGroups":
return SpanGroups(self.doc_ref()).from_bytes(self.to_bytes())
def copy(self, doc: "Doc" = None) -> "SpanGroups":
if doc is None:
doc = self.doc_ref()
return SpanGroups(doc).from_bytes(self.to_bytes())
def to_bytes(self) -> bytes:
# We don't need to serialize this as a dict, because the groups

View File

@ -1188,7 +1188,7 @@ cdef class Doc:
other.user_span_hooks = dict(self.user_span_hooks)
other.length = self.length
other.max_length = self.max_length
other.spans = self.spans.copy()
other.spans = self.spans.copy(doc=other)
buff_size = other.max_length + (PADDING*2)
assert buff_size > 0
tokens = <TokenC*>other.mem.alloc(buff_size, sizeof(TokenC))
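The `doc=other` argument matters because a span group container keeps a reference to the document that owns it. The following is a minimal pure-Python sketch of that idea, not spaCy's implementation: `Owner` and `Groups` are hypothetical stand-ins for `Doc` and `SpanGroups`, and the copy is done with plain dict copying rather than the `to_bytes`/`from_bytes` round trip used in the diff.

```python
import weakref


class Groups(dict):
    """Hypothetical stand-in for SpanGroups: a dict with a weak reference to its owner."""

    def __init__(self, owner, items=()):
        super().__init__(items)
        self.owner_ref = weakref.ref(owner)

    def copy(self, owner=None):
        # Without an explicit owner, the copy stays attached to the original owner.
        if owner is None:
            owner = self.owner_ref()
        return Groups(owner, {key: list(value) for key, value in self.items()})


class Owner:
    """Hypothetical stand-in for Doc."""

    def __init__(self, name):
        self.name = name
        self.groups = Groups(self)


original = Owner("original")
original.groups["cities"] = [(2, 4)]

clone = Owner("clone")
# Passing the new owner re-attaches the copied groups to the clone.
clone.groups = original.groups.copy(owner=clone)
```

Calling `copy()` with no argument keeps the old behaviour, so existing callers are unaffected.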

View File

@ -357,7 +357,12 @@ cdef class Span:
@property
def sent(self):
"""RETURNS (Span): The sentence span that the span is a part of."""
"""Obtain the sentence that contains this span. If the given span
crosses sentence boundaries, return only the first sentence
to which it belongs.
RETURNS (Span): The sentence span that the span is a part of.
"""
if "sent" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["sent"](self)
# Use `sent_start` token attribute to find sentence boundaries
@ -367,8 +372,8 @@ cdef class Span:
start = self.start
while self.doc.c[start].sent_start != 1 and start > 0:
start += -1
# Find end of the sentence
end = self.end
# Find end of the sentence - can be within the entity
end = self.start + 1
while end < self.doc.length and self.doc.c[end].sent_start != 1:
end += 1
n += 1
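The effect of starting the end search at `self.start + 1` can be illustrated with a simplified pure-Python model of the two loops. This is only a sketch under stated assumptions: `first_sentence_bounds` is a hypothetical helper, and `sent_starts` is a plain list of per-token flags standing in for the `sent_start` values stored on the Cython token structs.

```python
from typing import List, Tuple


def first_sentence_bounds(sent_starts: List[int], span_start: int) -> Tuple[int, int]:
    """Return (start, end) token offsets of the sentence containing span_start."""
    # Walk left until a token that opens a sentence (flag == 1) or token 0.
    start = span_start
    while sent_starts[start] != 1 and start > 0:
        start -= 1
    # Walk right from the token after the span's first token, so that a
    # sentence boundary inside the span ends the search.
    end = span_start + 1
    while end < len(sent_starts) and sent_starts[end] != 1:
        end += 1
    return start, end


# Two sentences of three tokens each. The span covers tokens 2..4 and
# therefore crosses the sentence boundary at token 3.
sent_starts = [1, 0, 0, 1, 0, 0]
span_start, span_end = 2, 5
sent_start, sent_end = first_sentence_bounds(sent_starts, span_start)

# Widen the result when the full span must be covered.
covering = (sent_start, max(sent_end, span_end))
```

Had the search started at the span's end instead, the loop would have run to the end of the second sentence and returned both sentences as one.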

View File

@ -22,6 +22,8 @@ cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot):
output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"])
if "entities" in doc_annot:
_add_entities_to_doc(output, doc_annot["entities"])
if "spans" in doc_annot:
_add_spans_to_doc(output, doc_annot["spans"])
if array.size:
output = output.from_array(attrs, array)
# links are currently added with ENT_KB_ID on the token level
@ -314,13 +316,11 @@ def _annot2array(vocab, tok_annot, doc_annot):
for key, value in doc_annot.items():
if value:
if key == "entities":
if key in ["entities", "cats", "spans"]:
pass
elif key == "links":
ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], tok_annot["SPACY"], value)
tok_annot["ENT_KB_ID"] = ent_kb_ids
elif key == "cats":
pass
else:
raise ValueError(Errors.E974.format(obj="doc", key=key))
@ -351,6 +351,29 @@ def _annot2array(vocab, tok_annot, doc_annot):
return attrs, array.T
def _add_spans_to_doc(doc, spans_data):
if not isinstance(spans_data, dict):
raise ValueError(Errors.E879)
for key, span_list in spans_data.items():
spans = []
if not isinstance(span_list, list):
raise ValueError(Errors.E879)
for span_tuple in span_list:
if not isinstance(span_tuple, (list, tuple)) or len(span_tuple) < 2:
raise ValueError(Errors.E879)
start_char = span_tuple[0]
end_char = span_tuple[1]
label = 0
kb_id = 0
if len(span_tuple) > 2:
label = span_tuple[2]
if len(span_tuple) > 3:
kb_id = span_tuple[3]
span = doc.char_span(start_char, end_char, label=label, kb_id=kb_id)
spans.append(span)
doc.spans[key] = spans
def _add_entities_to_doc(doc, ner_data):
if ner_data is None:
return
@ -397,7 +420,7 @@ def _fix_legacy_dict_data(example_dict):
pass
elif key == "ids":
pass
elif key in ("cats", "links"):
elif key in ("cats", "links", "spans"):
doc_dict[key] = value
elif key in ("ner", "entities"):
doc_dict["entities"] = value
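The shape checks in `_add_spans_to_doc` can be tried out without a `Doc` using the sketch below. It mirrors only the validation and default handling and leaves out the `doc.char_span` call. `check_spans_annotation` is a hypothetical helper, and the error messages are placeholders rather than the text of `Errors.E879`.

```python
from typing import Any, Dict, List, Tuple


def check_spans_annotation(spans_data: Any) -> Dict[str, List[Tuple]]:
    """Validate a "spans" annotation and fill in default label and kb_id values."""
    if not isinstance(spans_data, dict):
        raise ValueError("'spans' must be a dict of lists")
    normalized = {}
    for key, span_list in spans_data.items():
        if not isinstance(span_list, list):
            raise ValueError(f"spans[{key!r}] must be a list")
        spans = []
        for span_tuple in span_list:
            if not isinstance(span_tuple, (list, tuple)) or len(span_tuple) < 2:
                raise ValueError(f"spans[{key!r}] entries need at least (start, end)")
            start_char, end_char = span_tuple[0], span_tuple[1]
            # Label and KB ID are optional and default to 0, as in the diff.
            label = span_tuple[2] if len(span_tuple) > 2 else 0
            kb_id = span_tuple[3] if len(span_tuple) > 3 else 0
            spans.append((start_char, end_char, label, kb_id))
        normalized[key] = spans
    return normalized


# Overlapping spans are accepted, unlike overlapping "entities".
valid = check_spans_annotation(
    {
        "cities": [(7, 15, "LOC"), (11, 15, "LOC"), (20, 26, "LOC")],
        "people": [(0, 1)],
    }
)

# The four malformed inputs from test_Example_from_dict_with_spans_invalid.
invalid_inputs = [
    [(0, 1, "PERSON")],
    {"cities": (7, 15, "LOC")},
    {"cities": [7, 11]},
    {"cities": [[7]]},
]
rejected = []
for candidate in invalid_inputs:
    try:
        check_spans_annotation(candidate)
    except ValueError:
        rejected.append(candidate)
```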

View File

@ -103,7 +103,11 @@ def console_logger(progress_bar: bool = False):
@registry.loggers("spacy.WandbLogger.v1")
def wandb_logger(project_name: str, remove_config_values: List[str] = []):
try:
import wandb
from wandb import init, log, join # test that these are available
except ImportError:
raise ImportError(Errors.E880)
console = console_logger(progress_bar=False)

View File

@ -70,7 +70,7 @@ CONFIG_SECTION_ORDER = ["paths", "variables", "system", "nlp", "components", "co
logger = logging.getLogger("spacy")
logger_stream_handler = logging.StreamHandler()
logger_stream_handler.setFormatter(logging.Formatter("%(message)s"))
logger_stream_handler.setFormatter(logging.Formatter("[%(asctime)s] [%(levelname)s] %(message)s"))
logger.addHandler(logger_stream_handler)
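To see what the new format string produces, the standard-library sketch below attaches the same formatter to a throwaway logger. The logger name `format_demo` and the in-memory stream are assumptions made for illustration so that the real `spacy` logger is left untouched.

```python
import io
import logging

# Capture output in memory instead of writing to stderr.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(asctime)s] [%(levelname)s] %(message)s"))

demo_logger = logging.getLogger("format_demo")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the record away from the root logger
demo_logger.addHandler(handler)

demo_logger.info("Pipeline loaded")
output = stream.getvalue()
```

Each record now starts with a bracketed timestamp followed by the bracketed level name, where the old `"%(message)s"` format printed the message alone.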
@ -1454,7 +1454,8 @@ def is_cython_func(func: Callable) -> bool:
if hasattr(func, attr): # function or class instance
return True
# https://stackoverflow.com/a/55767059
if hasattr(func, "__qualname__") and hasattr(func, "__module__"): # method
if hasattr(func, "__qualname__") and hasattr(func, "__module__") \
and func.__module__ in sys.modules: # method
cls_func = vars(sys.modules[func.__module__])[func.__qualname__.split(".")[0]]
return hasattr(cls_func, attr)
return False
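The extra `func.__module__ in sys.modules` check guards against functions whose module was loaded from a file without being registered in `sys.modules`, which would otherwise raise a `KeyError` on lookup. A minimal standard-library sketch of the same guard follows. `find_owner` is a hypothetical helper, not part of spaCy, and it returns `None` instead of checking for Cython attributes.

```python
import json
import sys
from typing import Any, Callable, Optional


def find_owner(func: Callable) -> Optional[Any]:
    """Return the top-level object `func` is defined on, if its module is loaded."""
    module_name = getattr(func, "__module__", None)
    qualname = getattr(func, "__qualname__", None)
    if module_name is None or qualname is None:
        return None
    # Guard: indexing sys.modules with an unregistered name raises KeyError.
    if module_name not in sys.modules:
        return None
    return vars(sys.modules[module_name]).get(qualname.split(".")[0])


def orphan():
    pass


# Simulate a function whose module never made it into sys.modules.
orphan.__module__ = "module_that_was_never_registered"
```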

View File

@ -61,6 +61,8 @@ cdef class Vocab:
lookups (Lookups): Container for large lookup tables and dictionaries.
oov_prob (float): Default OOV probability.
vectors_name (unicode): Optional name to identify the vectors table.
get_noun_chunks (Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]):
A function that yields base noun phrases used for Doc.noun_chunks.
"""
lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {}
if lookups in (None, True, False):

View File

@ -19,7 +19,7 @@ spaCy's built-in architectures that are used for different NLP tasks. All
trainable [built-in components](/api#architecture-pipeline) expect a `model`
argument defined in the config and document their default architecture.
Custom architectures can be registered using the
[`@spacy.registry.architectures`](/api/top-level#regsitry) decorator and used as
[`@spacy.registry.architectures`](/api/top-level#registry) decorator and used as
part of the [training config](/usage/training#custom-functions). Also see the
usage documentation on
[layers and model architectures](/usage/layers-architectures).

View File

@ -219,7 +219,7 @@ alignment mode `"strict".
| `alignment_mode` | How character indices snap to token boundaries. Options: `"strict"` (no snapping), `"contract"` (span of all tokens completely within the character span), `"expand"` (span of all tokens at least partially covered by the character span). Defaults to `"strict"`. ~~str~~ |
| **RETURNS** | The newly constructed object or `None`. ~~Optional[Span]~~ |
## Doc.set_ents {#ents tag="method" new="3"}
## Doc.set_ents {#set_ents tag="method" new="3"}
Set the named entities in the document.
@ -616,8 +616,10 @@ phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be
nested within it so no NP-level coordination, no prepositional phrases, and no
relative clauses.
If the `noun_chunk` [syntax iterator](/usage/adding-languages#language-data) has
not been implemeted for the given language, a `NotImplementedError` is raised.
To customize the noun chunk iterator in a loaded pipeline, modify
[`nlp.vocab.get_noun_chunks`](/api/vocab#attributes). If the `noun_chunk`
[syntax iterator](/usage/adding-languages#language-data) has not been
implemented for the given language, a `NotImplementedError` is raised.
> #### Example
>
@ -633,12 +635,14 @@ not been implemeted for the given language, a `NotImplementedError` is raised.
| ---------- | ------------------------------------- |
| **YIELDS** | Noun chunks in the document. ~~Span~~ |
## Doc.sents {#sents tag="property" model="parser"}
## Doc.sents {#sents tag="property" model="sentences"}
Iterate over the sentences in the document. Sentence spans have no label. To
improve accuracy on informal texts, spaCy calculates sentence boundaries from
the syntactic dependency parse. If the parser is disabled, the `sents` iterator
will be unavailable.
Iterate over the sentences in the document. Sentence spans have no label.
This property is only available when
[sentence boundaries](/usage/linguistic-features#sbd) have been set on the
document by the `parser`, `senter`, `sentencizer` or some custom function. It
will raise an error otherwise.
> #### Example
>

View File

@ -31,6 +31,7 @@ architectures and their arguments and hyperparameters.
> from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL
> config = {
> "labels_discard": [],
> "n_sents": 0,
> "incl_prior": True,
> "incl_context": True,
> "model": DEFAULT_NEL_MODEL,
@ -43,6 +44,7 @@ architectures and their arguments and hyperparameters.
| Setting | Description |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ |
| `n_sents` | The number of neighbouring sentences to take into account. Defaults to 0. ~~int~~ |
| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ |
| `incl_context` | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ |
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ |
@ -89,6 +91,7 @@ custom knowledge base, you should either call
| `entity_vector_length` | Size of encoding vectors in the KB. ~~int~~ |
| `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ |
| `n_sents` | The number of neighbouring sentences to take into account. ~~int~~ |
| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ |
| `incl_context` | Whether or not to include the local context in the model. ~~bool~~ |
@ -154,7 +157,7 @@ with the current vocab.
> kb.add_alias(...)
> return kb
> entity_linker = nlp.add_pipe("entity_linker")
> entity_linker.set_kb(lambda: [], nlp=nlp, kb_loader=create_kb)
> entity_linker.set_kb(create_kb)
> ```
| Name | Description |
@ -248,7 +251,7 @@ pipe's entity linking model and context encoder. Delegates to
> ```
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |

View File

@ -152,7 +152,7 @@ Get a list of all aliases in the knowledge base.
| ----------- | -------------------------------------------------------- |
| **RETURNS** | The list of aliases in the knowledge base. ~~List[str]~~ |
## KnowledgeBase.get_candidates {#get_candidates tag="method"}
## KnowledgeBase.get_alias_candidates {#get_alias_candidates tag="method"}
Given a certain textual mention as input, retrieve a list of candidate entities
of type [`Candidate`](/api/kb/#candidate).
@ -160,13 +160,13 @@ of type [`Candidate`](/api/kb/#candidate).
> #### Example
>
> ```python
> candidates = kb.get_candidates("Douglas")
> candidates = kb.get_alias_candidates("Douglas")
> ```
| Name | Description |
| ----------- | ------------------------------------- |
| ----------- | ------------------------------------------------------------- |
| `alias` | The textual mention or alias. ~~str~~ |
| **RETURNS** | iterable | The list of relevant `Candidate` objects. ~~List[Candidate]~~ |
| **RETURNS** | The list of relevant `Candidate` objects. ~~List[Candidate]~~ |
## KnowledgeBase.get_vector {#get_vector tag="method"}
@ -246,7 +246,7 @@ certain prior probability.
Construct a `Candidate` object. Usually this constructor is not called directly,
but instead these objects are returned by the
[`get_candidates`](/api/kb#get_candidates) method of a `KnowledgeBase`.
`get_candidates` method of the [`entity_linker`](/api/entitylinker) pipe.
> #### Example
>

View File

@ -364,7 +364,7 @@ Evaluate a pipeline's components.
<Infobox variant="warning" title="Changed in v3.0">
The `Language.update` method now takes a batch of [`Example`](/api/example)
The `Language.evaluate` method now takes a batch of [`Example`](/api/example)
objects instead of tuples of `Doc` and `GoldParse` objects.
</Infobox>

View File

@ -138,12 +138,12 @@ Returns PRF scores for labeled or unlabeled spans.
> ```
| Name | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
| `attr` | The attribute to score. ~~str~~ |
| _keyword-only_ | |
| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ |
| `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~str~~ |
| `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~Optional[Callable[[Doc], bool]]~~ |
| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ |
## Scorer.score_deps {#score_deps tag="staticmethod" new="3"}

View File

@ -483,13 +483,40 @@ The L2 norm of the span's vector representation.
| ----------- | --------------------------------------------------- |
| **RETURNS** | The L2 norm of the vector representation. ~~float~~ |
## Span.sent {#sent tag="property" model="sentences"}
The sentence span that this span is a part of. This property is only available
when [sentence boundaries](/usage/linguistic-features#sbd) have been set on the
document by the `parser`, `senter`, `sentencizer` or some custom function. It
will raise an error otherwise.
If the span happens to cross sentence boundaries, only the first sentence will
be returned. If it is required that the sentence always includes the
full span, the result can be adjusted as such:
```python
sent = span.sent
sent = doc[sent.start : max(sent.end, span.end)]
```
> #### Example
>
> ```python
> doc = nlp("Give it back! He pleaded.")
> span = doc[1:3]
> assert span.sent.text == "Give it back!"
> ```
| Name | Description |
| ----------- | ------------------------------------------------------- |
| **RETURNS** | The sentence span that this span is a part of. ~~Span~~ |
## Attributes {#attributes}
| Name | Description |
| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `doc` | The parent document. ~~Doc~~ |
| `tensor` <Tag variant="new">2.1.7</Tag> | The span's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~ |
| `sent` | The sentence span that this span is a part of. ~~Span~~ |
| `start` | The token offset for the start of the span. ~~int~~ |
| `end` | The token offset for the end of the span. ~~int~~ |
| `start_char` | The character offset for the start of the span. ~~int~~ |

View File

@ -22,7 +22,7 @@ Create the vocabulary.
> ```
| Name | Description |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `lex_attr_getters` | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~ |
| `strings` | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~ |
| `lookups` | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~ |
@ -188,8 +188,8 @@ subword features by average over n-grams of `orth` (introduced in spaCy `v2.1`).
## Vocab.set_vector {#set_vector tag="method" new="2"}
Set a vector for a word in the vocabulary. Words can be referenced by string
or hash value.
Set a vector for a word in the vocabulary. Words can be referenced by string or
hash value.
> #### Example
>
@ -301,12 +301,13 @@ Load state from a binary string.
> ```
| Name | Description |
| --------------------------------------------- | ------------------------------------------------------------------------------- |
| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `strings` | A table managing the string-to-int mapping. ~~StringStore~~ |
| `vectors` <Tag variant="new">2</Tag> | A table associating word IDs to word vectors. ~~Vectors~~ |
| `vectors_length` | Number of dimensions for each word vector. ~~int~~ |
| `lookups` | The available lookup tables in this vocab. ~~Lookups~~ |
| `writing_system` <Tag variant="new">2.1</Tag> | A dict with information about the language's writing system. ~~Dict[str, Any]~~ |
| `get_noun_chunks` <Tag variant="new">3.0</Tag> | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
## Serialization fields {#serialization-fields}

View File

@ -15,7 +15,7 @@ next: /usage/projects
> ```python
> from thinc.api import Model, chain
>
> @spacy.registry.architectures.register("model.v1")
> @spacy.registry.architectures("model.v1")
> def build_model(width: int, classes: int) -> Model:
> tok2vec = build_tok2vec(width)
> output_layer = build_output_layer(width, classes)
@ -563,7 +563,7 @@ matrix** (~~Floats2d~~) of predictions:
```python
### The model architecture
@spacy.registry.architectures.register("rel_model.v1")
@spacy.registry.architectures("rel_model.v1")
def create_relation_model(...) -> Model[List[Doc], Floats2d]:
model = ... # 👈 model will go here
return model
@ -589,7 +589,7 @@ transforms the instance tensor into a final tensor holding the predictions:
```python
### The model architecture {highlight="6"}
@spacy.registry.architectures.register("rel_model.v1")
@spacy.registry.architectures("rel_model.v1")
def create_relation_model(
create_instance_tensor: Model[List[Doc], Floats2d],
classification_layer: Model[Floats2d, Floats2d],
@ -613,7 +613,7 @@ The `classification_layer` could be something like a
```python
### The classification layer
@spacy.registry.architectures.register("rel_classification_layer.v1")
@spacy.registry.architectures("rel_classification_layer.v1")
def create_classification_layer(
nO: int = None, nI: int = None
) -> Model[Floats2d, Floats2d]:
@ -650,7 +650,7 @@ that has the full implementation.
```python
### The layer that creates the instance tensor
@spacy.registry.architectures.register("rel_instance_tensor.v1")
@spacy.registry.architectures("rel_instance_tensor.v1")
def create_tensors(
tok2vec: Model[List[Doc], List[Floats2d]],
pooling: Model[Ragged, Floats2d],
@ -731,7 +731,7 @@ are within a **maximum distance** (in number of tokens) of each other:
```python
### Candidate generation
@spacy.registry.misc.register("rel_instance_generator.v1")
@spacy.registry.misc("rel_instance_generator.v1")
def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
candidates = []

View File

@ -585,7 +585,7 @@ print(ent_francisco) # ['Francisco', 'I', 'GPE']
To ensure that the sequence of token annotations remains consistent, you have to
set entity annotations **at the document level**. However, you can't write
directly to the `token.ent_iob` or `token.ent_type` attributes, so the easiest
way to set entities is to assign to the [`doc.ents`](/api/doc#ents) attribute
way to set entities is to use the [`doc.set_ents`](/api/doc#set_ents) function
and create the new entity as a [`Span`](/api/span).
```python

View File

@ -95,6 +95,14 @@ spaCy's binary `.spacy` format. You can either include the data paths in the
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```
> #### Tip: Enable your GPU
>
> Use the `--gpu-id` option to select the GPU:
>
> ```cli
> $ python -m spacy train config.cfg --gpu-id 0
> ```
<Accordion title="How are the config recommendations generated?" id="quickstart-source" spaced>
The recommended config settings generated by the quickstart widget and the

View File

@ -603,6 +603,7 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
| `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
| `KnowledgeBase.get_candidates` | [`KnowledgeBase.get_alias_candidates`](/api/kb#get_alias_candidates) |
| `Matcher.pipe`, `PhraseMatcher.pipe` | not needed |
| `gold.offsets_from_biluo_tags`, `gold.spans_from_biluo_tags`, `gold.biluo_tags_from_offsets` | [`training.biluo_tags_to_offsets`](/api/top-level#biluo_tags_to_offsets), [`training.biluo_tags_to_spans`](/api/top-level#biluo_tags_to_spans), [`training.offsets_to_biluo_tags`](/api/top-level#offsets_to_biluo_tags) |
| `spacy init-model` | [`spacy init vectors`](/api/cli#init-vectors) |

View File

@ -2762,6 +2762,33 @@
"github": "AMArostegui"
},
"category": ["nonpython"]
},
{
"id": "ruts",
"title": "ruTS",
"slogan": "A library for statistics extraction from texts in Russian",
"description": "The library allows extracting the following statistics from a text: basic statistics, readability metrics, lexical diversity metrics, morphological statistics",
"github": "SergeyShk/ruTS",
"pip": "ruts",
"code_example": [
"import spacy",
"import ruts",
"",
"nlp = spacy.load('ru_core_news_sm')",
"nlp.add_pipe('basic', last=True)",
"doc = nlp('мама мыла раму')",
"doc._.basic.get_stats()"
],
"code_language": "python",
"thumb": "https://habrastorage.org/webt/6z/le/fz/6zlefzjavzoqw_wymz7v3pwgfp4.png",
"image": "https://clipartart.com/images/free-tree-roots-clipart-black-and-white-2.png",
"author": "Sergey Shkarin",
"author_links": {
"twitter": "shk_sergey",
"github": "SergeyShk"
},
"category": ["pipeline", "standalone"],
"tags": ["Text Analytics", "Russian"]
}
],