Merge branch 'master' into spacy.io

Ines Montani 2021-03-03 23:15:25 +11:00
commit 9280e844fb
61 changed files with 843 additions and 198 deletions

.github/contributors/dardoria.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Boian Tzonev |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 18.02.2021 |
| GitHub username | dardoria |
| Website (optional) | |

@ -10,7 +10,7 @@ wasabi>=0.8.1,<1.1.0
srsly>=2.4.0,<3.0.0 srsly>=2.4.0,<3.0.0
catalogue>=2.0.1,<2.1.0 catalogue>=2.0.1,<2.1.0
typer>=0.3.0,<0.4.0 typer>=0.3.0,<0.4.0
pathy pathy>=0.3.5
# Third party dependencies # Third party dependencies
numpy>=1.15.0 numpy>=1.15.0
requests>=2.13.0,<3.0.0 requests>=2.13.0,<3.0.0
@ -21,11 +21,11 @@ jinja2
setuptools setuptools
packaging>=20.0 packaging>=20.0
importlib_metadata>=0.20; python_version < "3.8" importlib_metadata>=0.20; python_version < "3.8"
typing_extensions>=3.7.4; python_version < "3.8" typing_extensions>=3.7.4.1,<4.0.0.0; python_version < "3.8"
# Development dependencies # Development dependencies
cython>=0.25 cython>=0.25
pytest>=5.2.0 pytest>=5.2.0
pytest-timeout>=1.3.0,<2.0.0 pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0 mock>=2.0.0,<3.0.0
flake8>=3.5.0,<3.6.0 flake8>=3.5.0,<3.6.0
hypothesis hypothesis>=3.27.0,<7.0.0

@ -47,7 +47,7 @@ install_requires =
srsly>=2.4.0,<3.0.0 srsly>=2.4.0,<3.0.0
catalogue>=2.0.1,<2.1.0 catalogue>=2.0.1,<2.1.0
typer>=0.3.0,<0.4.0 typer>=0.3.0,<0.4.0
pathy pathy>=0.3.5
# Third-party dependencies # Third-party dependencies
tqdm>=4.38.0,<5.0.0 tqdm>=4.38.0,<5.0.0
numpy>=1.15.0 numpy>=1.15.0
@ -58,7 +58,7 @@ install_requires =
setuptools setuptools
packaging>=20.0 packaging>=20.0
importlib_metadata>=0.20; python_version < "3.8" importlib_metadata>=0.20; python_version < "3.8"
typing_extensions>=3.7.4; python_version < "3.8" typing_extensions>=3.7.4,<4.0.0.0; python_version < "3.8"
[options.entry_points] [options.entry_points]
console_scripts = console_scripts =

@ -204,7 +204,7 @@ def setup_package():
for name in MOD_NAMES: for name in MOD_NAMES:
mod_path = name.replace(".", "/") + ".pyx" mod_path = name.replace(".", "/") + ".pyx"
ext = Extension( ext = Extension(
name, [mod_path], language="c++", extra_compile_args=["-std=c++11"] name, [mod_path], language="c++", include_dirs=include_dirs, extra_compile_args=["-std=c++11"]
) )
ext_modules.append(ext) ext_modules.append(ext)
print("Cythonizing sources") print("Cythonizing sources")
@ -216,7 +216,6 @@ def setup_package():
version=about["__version__"], version=about["__version__"],
ext_modules=ext_modules, ext_modules=ext_modules,
cmdclass={"build_ext": build_ext_subclass}, cmdclass={"build_ext": build_ext_subclass},
include_dirs=include_dirs,
package_data={"": ["*.pyx", "*.pxd", "*.pxi"]}, package_data={"": ["*.pyx", "*.pxd", "*.pxi"]},
) )

@ -11,6 +11,7 @@ from click.parser import split_arg_string
from typer.main import get_command from typer.main import get_command
from contextlib import contextmanager from contextlib import contextmanager
from thinc.api import Config, ConfigValidationError, require_gpu from thinc.api import Config, ConfigValidationError, require_gpu
from thinc.util import has_cupy, gpu_is_available
from configparser import InterpolationError from configparser import InterpolationError
import os import os
@ -510,3 +511,5 @@ def setup_gpu(use_gpu: int) -> None:
require_gpu(use_gpu) require_gpu(use_gpu)
else: else:
msg.info("Using CPU") msg.info("Using CPU")
if has_cupy and gpu_is_available():
msg.info("To switch to GPU 0, use the option: --gpu-id 0")

@ -22,7 +22,7 @@ from ..training.converters import conllu_to_docs
CONVERTERS = { CONVERTERS = {
"conllubio": conllu_to_docs, "conllubio": conllu_to_docs,
"conllu": conllu_to_docs, "conllu": conllu_to_docs,
"conll": conllu_to_docs, "conll": conll_ner_to_docs,
"ner": conll_ner_to_docs, "ner": conll_ner_to_docs,
"iob": iob_to_docs, "iob": iob_to_docs,
"json": json_to_docs, "json": json_to_docs,

@ -132,7 +132,7 @@ def evaluate(
if displacy_path: if displacy_path:
factory_names = [nlp.get_pipe_meta(pipe).factory for pipe in nlp.pipe_names] factory_names = [nlp.get_pipe_meta(pipe).factory for pipe in nlp.pipe_names]
docs = [ex.predicted for ex in dev_dataset] docs = list(nlp.pipe(ex.reference.text for ex in dev_dataset[:displacy_limit]))
render_deps = "parser" in factory_names render_deps = "parser" in factory_names
render_ents = "ner" in factory_names render_ents = "ner" in factory_names
render_parses( render_parses(

@ -16,7 +16,11 @@ gpu_allocator = null
[nlp] [nlp]
lang = "{{ lang }}" lang = "{{ lang }}"
{%- if "tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "entity_linker" in components or (("textcat" in components or "textcat_multilabel" in components) and optimize == "accuracy") -%}
{%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components %} {%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components %}
{%- else -%}
{%- set full_pipeline = components %}
{%- endif %}
pipeline = {{ full_pipeline|pprint()|replace("'", '"')|safe }} pipeline = {{ full_pipeline|pprint()|replace("'", '"')|safe }}
batch_size = {{ 128 if hardware == "gpu" else 1000 }} batch_size = {{ 128 if hardware == "gpu" else 1000 }}
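The template condition above decides whether an embedding component is prepended to the pipeline. A minimal Python sketch of the same decision follows; the function name `build_pipeline` and the two set constants are illustrative assumptions and are not part of the template.

```python
from typing import List

# Components that the template condition treats as needing a shared embedding layer.
NEEDS_EMBEDDING = {"tagger", "morphologizer", "parser", "ner", "entity_linker"}
# Text classifiers only get the embedding layer when optimizing for accuracy.
TEXTCAT = {"textcat", "textcat_multilabel"}


def build_pipeline(components: List[str], use_transformer: bool, optimize: str) -> List[str]:
    """Mirror the Jinja condition: prepend transformer/tok2vec only when needed."""
    selected = set(components)
    needs_embedding = bool(NEEDS_EMBEDDING & selected) or (
        bool(TEXTCAT & selected) and optimize == "accuracy"
    )
    if needs_embedding:
        return ["transformer" if use_transformer else "tok2vec"] + list(components)
    return list(components)
```

Under this reading, a pipeline that contains only `textcat` and is optimized for efficiency is left without a `tok2vec` component.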

@ -22,21 +22,21 @@ ar:
bg: bg:
word_vectors: null word_vectors: null
transformer: transformer:
efficiency: efficiency:
name: iarfmoose/roberta-base-bulgarian name: iarfmoose/roberta-base-bulgarian
size_factor: 3 size_factor: 3
accuracy: accuracy:
name: iarfmoose/roberta-base-bulgarian name: iarfmoose/roberta-base-bulgarian
size_factor: 3 size_factor: 3
bn: bn:
word_vectors: null word_vectors: null
transformer: transformer:
efficiency: efficiency:
name: sagorsarker/bangla-bert-base name: sagorsarker/bangla-bert-base
size_factor: 3 size_factor: 3
accuracy: accuracy:
name: sagorsarker/bangla-bert-base name: sagorsarker/bangla-bert-base
size_factor: 3 size_factor: 3
da: da:
word_vectors: da_core_news_lg word_vectors: da_core_news_lg
transformer: transformer:

@ -321,7 +321,8 @@ class Errors:
"https://spacy.io/api/top-level#util.filter_spans") "https://spacy.io/api/top-level#util.filter_spans")
E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A " E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
"token can only be part of one entity, so make sure the entities " "token can only be part of one entity, so make sure the entities "
"you're setting don't overlap.") "you're setting don't overlap. To work with overlapping entities, "
"consider using doc.spans instead.")
E106 = ("Can't find `doc._.{attr}` attribute specified in the underscore " E106 = ("Can't find `doc._.{attr}` attribute specified in the underscore "
"settings: {opts}") "settings: {opts}")
E107 = ("Value of `doc._.{attr}` is not JSON-serializable: {value}") E107 = ("Value of `doc._.{attr}` is not JSON-serializable: {value}")
@ -486,6 +487,15 @@ class Errors:
E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.") E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.")
# New errors added in v3.x # New errors added in v3.x
E879 = ("Unexpected type for 'spans' data. Provide a dictionary mapping keys to "
"a list of spans, with each span represented by a tuple (start_char, end_char). "
"The tuple can be optionally extended with a label and a KB ID.")
E880 = ("The 'wandb' library could not be found - did you install it? "
"Alternatively, specify the 'ConsoleLogger' in the 'training.logger' "
"config section, instead of the 'WandbLogger'.")
E885 = ("entity_linker.set_kb received an invalid 'kb_loader' argument: expected "
"a callable function, but got: {arg_type}")
E886 = ("Can't replace {name} -> {tok2vec} listeners: path '{path}' not " E886 = ("Can't replace {name} -> {tok2vec} listeners: path '{path}' not "
"found in config for component '{name}'.") "found in config for component '{name}'.")
E887 = ("Can't replace {name} -> {tok2vec} listeners: the paths to replace " E887 = ("Can't replace {name} -> {tok2vec} listeners: the paths to replace "

@ -1,9 +1,21 @@
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .lex_attrs import LEX_ATTRS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language from ...language import Language
from ...attrs import LANG
from ...util import update_exc
class BulgarianDefaults(Language.Defaults): class BulgarianDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: "bg"
lex_attr_getters.update(LEX_ATTRS)
stop_words = STOP_WORDS stop_words = STOP_WORDS
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
class Bulgarian(Language): class Bulgarian(Language):

@ -0,0 +1,88 @@
from ...attrs import LIKE_NUM
_num_words = [
"нула",
"едно",
"един",
"една",
"две",
"три",
"четири",
"пет",
"шест",
"седем",
"осем",
"девет",
"десет",
"единадесет",
"единайсет",
"дванадесет",
"дванайсет",
"тринадесет",
"тринайсет",
"четиринадесет",
"четиринайсет",
"петнадесет",
"петнайсет",
"шестнадесет",
"шестнайсет",
"седемнадесет",
"седемнайсет",
"осемнадесет",
"осемнайсет",
"деветнадесет",
"деветнайсет",
"двадесет",
"двайсет",
"тридесет",
"трийсет",
"четиридесет",
"четиресет",
"петдесет",
"шестдесет",
"шейсет",
"седемдесет",
"осемдесет",
"деветдесет",
"сто",
"двеста",
"триста",
"четиристотин",
"петстотин",
"шестстотин",
"седемстотин",
"осемстотин",
"деветстотин",
"хиляда",
"милион",
"милиона",
"милиард",
"милиарда",
"трилион",
"трилиона",
"билион",
"билиона",
"квадрилион",
"квадрилиона",
"квинтилион",
"квинтилиона",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}
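As a rough illustration of how a `like_num` getter of this shape behaves, here is a self-contained sketch. The short `NUM_WORDS` list is a stand-in for the full numeral list, and the `merged` list is a hypothetical example of what happens when a comma is missing between two adjacent string literals.

```python
# Stand-in word list; the real module lists many more Bulgarian numerals.
NUM_WORDS = ["нула", "едно", "две", "три", "десет", "сто", "хиляда"]


def like_num(text: str) -> bool:
    """Return True if the token text looks like a number."""
    # Drop a single leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Spelled-out numerals, matched case-insensitively.
    return text.lower() in NUM_WORDS


# Python concatenates adjacent string literals, so a missing comma
# silently merges two list entries into one string.
merged = ["петнайсет" "шестнадесет", "шестнайсет"]
```

Because the merged entry is still a valid string, no error is raised; the affected numerals simply stop matching.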

@ -0,0 +1,68 @@
from ...symbols import ORTH, NORM
_exc = {}
_abbr_exc = [
{ORTH: "м", NORM: "метър"},
{ORTH: "мм", NORM: "милиметър"},
{ORTH: "см", NORM: "сантиметър"},
{ORTH: "дм", NORM: "дециметър"},
{ORTH: "км", NORM: "километър"},
{ORTH: "кг", NORM: "килограм"},
{ORTH: "мг", NORM: "милиграм"},
{ORTH: "г", NORM: "грам"},
{ORTH: "т", NORM: "тон"},
{ORTH: "хл", NORM: "хектолитър"},
{ORTH: "дкл", NORM: "декалитър"},
{ORTH: "л", NORM: "литър"},
]
for abbr in _abbr_exc:
_exc[abbr[ORTH]] = [abbr]
_abbr_line_exc = [
{ORTH: "г-жа", NORM: "госпожа"},
{ORTH: "г-н", NORM: "господин"},
{ORTH: "г-ца", NORM: "госпожица"},
{ORTH: "д-р", NORM: "доктор"},
{ORTH: "о-в", NORM: "остров"},
{ORTH: "п-в", NORM: "полуостров"},
]
for abbr in _abbr_line_exc:
_exc[abbr[ORTH]] = [abbr]
_abbr_dot_exc = [
{ORTH: "акад.", NORM: "академик"},
{ORTH: "ал.", NORM: "алинея"},
{ORTH: "арх.", NORM: "архитект"},
{ORTH: "бл.", NORM: "блок"},
{ORTH: "бр.", NORM: "брой"},
{ORTH: "бул.", NORM: "булевард"},
{ORTH: "в.", NORM: "век"},
{ORTH: "г.", NORM: "година"},
{ORTH: "гр.", NORM: "град"},
{ORTH: "ж.р.", NORM: "женски род"},
{ORTH: "инж.", NORM: "инженер"},
{ORTH: "лв.", NORM: "лев"},
{ORTH: "м.р.", NORM: "мъжки род"},
{ORTH: "мат.", NORM: "математика"},
{ORTH: "мед.", NORM: "медицина"},
{ORTH: "пл.", NORM: "площад"},
{ORTH: "проф.", NORM: "професор"},
{ORTH: "с.", NORM: "село"},
{ORTH: "с.р.", NORM: "среден род"},
{ORTH: "св.", NORM: "свети"},
{ORTH: "сп.", NORM: "списание"},
{ORTH: "стр.", NORM: "страница"},
{ORTH: "ул.", NORM: "улица"},
{ORTH: "чл.", NORM: "член"},
]
for abbr in _abbr_dot_exc:
_exc[abbr[ORTH]] = [abbr]
TOKENIZER_EXCEPTIONS = _exc
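The loops above all write into one shared dictionary keyed by the surface form. A minimal sketch of that pattern follows, assuming plain string keys in place of spaCy's `ORTH` and `NORM` symbols and two made-up abbreviation groups; it shows that an entry with a repeated `ORTH` would replace the earlier one.

```python
# Plain string keys stand in for spaCy's ORTH and NORM symbols (assumption).
ORTH, NORM = "orth", "norm"

# Hypothetical groups; the second deliberately reuses the key "г".
units = [{ORTH: "г", NORM: "грам"}, {ORTH: "л", NORM: "литър"}]
titles = [{ORTH: "д-р", NORM: "доктор"}, {ORTH: "г", NORM: "господин"}]

exc = {}
for group in (units, titles):
    for abbr in group:
        # Each exception maps the surface form to a list of token attributes.
        exc[abbr[ORTH]] = [abbr]

# The dictionary keeps only the last entry written for a given ORTH.
```

Keeping every `ORTH` unique across the groups avoids one normalization silently overriding another.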

@ -23,8 +23,6 @@ class RussianLemmatizer(Lemmatizer):
mode: str = "pymorphy2", mode: str = "pymorphy2",
overwrite: bool = False, overwrite: bool = False,
) -> None: ) -> None:
super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
try: try:
from pymorphy2 import MorphAnalyzer from pymorphy2 import MorphAnalyzer
except ImportError: except ImportError:
@ -34,6 +32,7 @@ class RussianLemmatizer(Lemmatizer):
) from None ) from None
if RussianLemmatizer._morph is None: if RussianLemmatizer._morph is None:
RussianLemmatizer._morph = MorphAnalyzer() RussianLemmatizer._morph = MorphAnalyzer()
super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
def pymorphy2_lemmatize(self, token: Token) -> List[str]: def pymorphy2_lemmatize(self, token: Token) -> List[str]:
string = token.text string = token.text

@ -7,6 +7,8 @@ from ...vocab import Vocab
class UkrainianLemmatizer(RussianLemmatizer): class UkrainianLemmatizer(RussianLemmatizer):
_morph = None
def __init__( def __init__(
self, self,
vocab: Vocab, vocab: Vocab,
@ -16,7 +18,6 @@ class UkrainianLemmatizer(RussianLemmatizer):
mode: str = "pymorphy2", mode: str = "pymorphy2",
overwrite: bool = False, overwrite: bool = False,
) -> None: ) -> None:
super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
try: try:
from pymorphy2 import MorphAnalyzer from pymorphy2 import MorphAnalyzer
except ImportError: except ImportError:
@ -27,3 +28,4 @@ class UkrainianLemmatizer(RussianLemmatizer):
) from None ) from None
if UkrainianLemmatizer._morph is None: if UkrainianLemmatizer._morph is None:
UkrainianLemmatizer._morph = MorphAnalyzer(lang="uk") UkrainianLemmatizer._morph = MorphAnalyzer(lang="uk")
super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)

@ -684,12 +684,12 @@ class Language:
# TODO: handle errors and mismatches (vectors etc.) # TODO: handle errors and mismatches (vectors etc.)
if not isinstance(source, self.__class__): if not isinstance(source, self.__class__):
raise ValueError(Errors.E945.format(name=source_name, source=type(source))) raise ValueError(Errors.E945.format(name=source_name, source=type(source)))
if not source.has_pipe(source_name): if not source_name in source.component_names:
raise KeyError( raise KeyError(
Errors.E944.format( Errors.E944.format(
name=source_name, name=source_name,
model=f"{source.meta['lang']}_{source.meta['name']}", model=f"{source.meta['lang']}_{source.meta['name']}",
opts=", ".join(source.pipe_names), opts=", ".join(source.component_names),
) )
) )
pipe = source.get_pipe(source_name) pipe = source.get_pipe(source_name)

@ -8,7 +8,7 @@ from ...kb import KnowledgeBase, Candidate, get_candidates
from ...vocab import Vocab from ...vocab import Vocab
@registry.architectures.register("spacy.EntityLinker.v1") @registry.architectures("spacy.EntityLinker.v1")
def build_nel_encoder(tok2vec: Model, nO: Optional[int] = None) -> Model: def build_nel_encoder(tok2vec: Model, nO: Optional[int] = None) -> Model:
with Model.define_operators({">>": chain, "**": clone}): with Model.define_operators({">>": chain, "**": clone}):
token_width = tok2vec.get_dim("nO") token_width = tok2vec.get_dim("nO")
@ -25,7 +25,7 @@ def build_nel_encoder(tok2vec: Model, nO: Optional[int] = None) -> Model:
return model return model
@registry.misc.register("spacy.KBFromFile.v1") @registry.misc("spacy.KBFromFile.v1")
def load_kb(kb_path: Path) -> Callable[[Vocab], KnowledgeBase]: def load_kb(kb_path: Path) -> Callable[[Vocab], KnowledgeBase]:
def kb_from_file(vocab): def kb_from_file(vocab):
kb = KnowledgeBase(vocab, entity_vector_length=1) kb = KnowledgeBase(vocab, entity_vector_length=1)
@ -35,7 +35,7 @@ def load_kb(kb_path: Path) -> Callable[[Vocab], KnowledgeBase]:
return kb_from_file return kb_from_file
@registry.misc.register("spacy.EmptyKB.v1") @registry.misc("spacy.EmptyKB.v1")
def empty_kb(entity_vector_length: int) -> Callable[[Vocab], KnowledgeBase]: def empty_kb(entity_vector_length: int) -> Callable[[Vocab], KnowledgeBase]:
def empty_kb_factory(vocab): def empty_kb_factory(vocab):
return KnowledgeBase(vocab=vocab, entity_vector_length=entity_vector_length) return KnowledgeBase(vocab=vocab, entity_vector_length=entity_vector_length)
@ -43,6 +43,6 @@ def empty_kb(entity_vector_length: int) -> Callable[[Vocab], KnowledgeBase]:
return empty_kb_factory return empty_kb_factory
@registry.misc.register("spacy.CandidateGenerator.v1") @registry.misc("spacy.CandidateGenerator.v1")
def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]: def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]:
return get_candidates return get_candidates
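The decorators in this file switch from `registry.<table>.register("name")` to calling the table directly. The sketch below shows how a registry can support both spellings; `MiniRegistry` and the `demo.*` names are hypothetical and are not the real `catalogue` or spaCy API.

```python
from typing import Callable, Dict, Optional


class MiniRegistry:
    """Hypothetical stand-in for a function registry."""

    def __init__(self) -> None:
        self._funcs: Dict[str, Callable] = {}

    def register(self, name: str, func: Optional[Callable] = None):
        def do_register(f: Callable) -> Callable:
            self._funcs[name] = f
            return f

        # Usable as a decorator factory or with the function passed directly.
        return do_register(func) if func is not None else do_register

    def __call__(self, name: str, func: Optional[Callable] = None):
        # Calling the registry directly is shorthand for .register().
        return self.register(name, func)

    def get(self, name: str) -> Callable:
        return self._funcs[name]


misc = MiniRegistry()


@misc("demo.EmptyStore.v1")
def empty_store(length: int) -> dict:
    return {"length": length}


@misc.register("demo.OtherStore.v1")
def other_store() -> dict:
    return {}
```

In this sketch both forms store the function under the given name and return it unchanged.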

@ -16,7 +16,7 @@ if TYPE_CHECKING:
from ...tokens import Doc # noqa: F401 from ...tokens import Doc # noqa: F401
@registry.architectures.register("spacy.PretrainVectors.v1") @registry.architectures("spacy.PretrainVectors.v1")
def create_pretrain_vectors( def create_pretrain_vectors(
maxout_pieces: int, hidden_size: int, loss: str maxout_pieces: int, hidden_size: int, loss: str
) -> Callable[["Vocab", Model], Model]: ) -> Callable[["Vocab", Model], Model]:
@ -40,7 +40,7 @@ def create_pretrain_vectors(
return create_vectors_objective return create_vectors_objective
@registry.architectures.register("spacy.PretrainCharacters.v1") @registry.architectures("spacy.PretrainCharacters.v1")
def create_pretrain_characters( def create_pretrain_characters(
maxout_pieces: int, hidden_size: int, n_characters: int maxout_pieces: int, hidden_size: int, n_characters: int
) -> Callable[["Vocab", Model], Model]: ) -> Callable[["Vocab", Model], Model]:

@ -10,7 +10,7 @@ from ..tb_framework import TransitionModel
from ...tokens import Doc from ...tokens import Doc
@registry.architectures.register("spacy.TransitionBasedParser.v1") @registry.architectures("spacy.TransitionBasedParser.v1")
def transition_parser_v1( def transition_parser_v1(
tok2vec: Model[List[Doc], List[Floats2d]], tok2vec: Model[List[Doc], List[Floats2d]],
state_type: Literal["parser", "ner"], state_type: Literal["parser", "ner"],
@ -31,7 +31,7 @@ def transition_parser_v1(
) )
@registry.architectures.register("spacy.TransitionBasedParser.v2") @registry.architectures("spacy.TransitionBasedParser.v2")
def transition_parser_v2( def transition_parser_v2(
tok2vec: Model[List[Doc], List[Floats2d]], tok2vec: Model[List[Doc], List[Floats2d]],
state_type: Literal["parser", "ner"], state_type: Literal["parser", "ner"],

@ -6,7 +6,7 @@ from ...util import registry
from ...tokens import Doc from ...tokens import Doc
@registry.architectures.register("spacy.Tagger.v1") @registry.architectures("spacy.Tagger.v1")
def build_tagger_model( def build_tagger_model(
tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None
) -> Model[List[Doc], List[Floats2d]]: ) -> Model[List[Doc], List[Floats2d]]:

@ -15,7 +15,7 @@ from ...tokens import Doc
from .tok2vec import get_tok2vec_width from .tok2vec import get_tok2vec_width
@registry.architectures.register("spacy.TextCatCNN.v1") @registry.architectures("spacy.TextCatCNN.v1")
def build_simple_cnn_text_classifier( def build_simple_cnn_text_classifier(
tok2vec: Model, exclusive_classes: bool, nO: Optional[int] = None tok2vec: Model, exclusive_classes: bool, nO: Optional[int] = None
) -> Model[List[Doc], Floats2d]: ) -> Model[List[Doc], Floats2d]:
@ -41,7 +41,7 @@ def build_simple_cnn_text_classifier(
return model return model
@registry.architectures.register("spacy.TextCatBOW.v1") @registry.architectures("spacy.TextCatBOW.v1")
def build_bow_text_classifier( def build_bow_text_classifier(
exclusive_classes: bool, exclusive_classes: bool,
ngram_size: int, ngram_size: int,
@ -60,7 +60,7 @@ def build_bow_text_classifier(
return model return model
@registry.architectures.register("spacy.TextCatEnsemble.v2") @registry.architectures("spacy.TextCatEnsemble.v2")
def build_text_classifier_v2( def build_text_classifier_v2(
tok2vec: Model[List[Doc], List[Floats2d]], tok2vec: Model[List[Doc], List[Floats2d]],
linear_model: Model[List[Doc], Floats2d], linear_model: Model[List[Doc], Floats2d],
@ -112,7 +112,7 @@ def init_ensemble_textcat(model, X, Y) -> Model:
return model return model
@registry.architectures.register("spacy.TextCatLowData.v1") @registry.architectures("spacy.TextCatLowData.v1")
def build_text_classifier_lowdata( def build_text_classifier_lowdata(
width: int, dropout: Optional[float], nO: Optional[int] = None width: int, dropout: Optional[float], nO: Optional[int] = None
) -> Model[List[Doc], Floats2d]: ) -> Model[List[Doc], Floats2d]:

@ -14,7 +14,7 @@ from ...pipeline.tok2vec import Tok2VecListener
from ...attrs import intify_attr from ...attrs import intify_attr
@registry.architectures.register("spacy.Tok2VecListener.v1") @registry.architectures("spacy.Tok2VecListener.v1")
def tok2vec_listener_v1(width: int, upstream: str = "*"): def tok2vec_listener_v1(width: int, upstream: str = "*"):
tok2vec = Tok2VecListener(upstream_name=upstream, width=width) tok2vec = Tok2VecListener(upstream_name=upstream, width=width)
return tok2vec return tok2vec
@ -31,7 +31,7 @@ def get_tok2vec_width(model: Model):
return nO return nO
@registry.architectures.register("spacy.HashEmbedCNN.v1") @registry.architectures("spacy.HashEmbedCNN.v1")
def build_hash_embed_cnn_tok2vec( def build_hash_embed_cnn_tok2vec(
*, *,
width: int, width: int,
@ -87,7 +87,7 @@ def build_hash_embed_cnn_tok2vec(
) )
@registry.architectures.register("spacy.Tok2Vec.v2") @registry.architectures("spacy.Tok2Vec.v2")
def build_Tok2Vec_model( def build_Tok2Vec_model(
embed: Model[List[Doc], List[Floats2d]], embed: Model[List[Doc], List[Floats2d]],
encode: Model[List[Floats2d], List[Floats2d]], encode: Model[List[Floats2d], List[Floats2d]],
@ -108,7 +108,7 @@ def build_Tok2Vec_model(
return tok2vec return tok2vec
@registry.architectures.register("spacy.MultiHashEmbed.v1") @registry.architectures("spacy.MultiHashEmbed.v1")
def MultiHashEmbed( def MultiHashEmbed(
width: int, width: int,
attrs: List[Union[str, int]], attrs: List[Union[str, int]],
@ -182,7 +182,7 @@ def MultiHashEmbed(
return model return model
@registry.architectures.register("spacy.CharacterEmbed.v1") @registry.architectures("spacy.CharacterEmbed.v1")
def CharacterEmbed( def CharacterEmbed(
width: int, width: int,
rows: int, rows: int,
@ -255,7 +255,7 @@ def CharacterEmbed(
return model return model
@registry.architectures.register("spacy.MaxoutWindowEncoder.v2") @registry.architectures("spacy.MaxoutWindowEncoder.v2")
def MaxoutWindowEncoder( def MaxoutWindowEncoder(
width: int, window_size: int, maxout_pieces: int, depth: int width: int, window_size: int, maxout_pieces: int, depth: int
) -> Model[List[Floats2d], List[Floats2d]]: ) -> Model[List[Floats2d], List[Floats2d]]:
@ -287,7 +287,7 @@ def MaxoutWindowEncoder(
return with_array(model, pad=receptive_field) return with_array(model, pad=receptive_field)
@registry.architectures.register("spacy.MishWindowEncoder.v2") @registry.architectures("spacy.MishWindowEncoder.v2")
def MishWindowEncoder( def MishWindowEncoder(
width: int, window_size: int, depth: int width: int, window_size: int, depth: int
) -> Model[List[Floats2d], List[Floats2d]]: ) -> Model[List[Floats2d], List[Floats2d]]:
@ -310,7 +310,7 @@ def MishWindowEncoder(
return with_array(model) return with_array(model)
@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1") @registry.architectures("spacy.TorchBiLSTMEncoder.v1")
def BiLSTMEncoder( def BiLSTMEncoder(
width: int, depth: int, dropout: float width: int, depth: int, dropout: float
) -> Model[List[Floats2d], List[Floats2d]]: ) -> Model[List[Floats2d], List[Floats2d]]:

@ -45,6 +45,7 @@ DEFAULT_NEL_MODEL = Config().from_str(default_model_config)["model"]
default_config={ default_config={
"model": DEFAULT_NEL_MODEL, "model": DEFAULT_NEL_MODEL,
"labels_discard": [], "labels_discard": [],
"n_sents": 0,
"incl_prior": True, "incl_prior": True,
"incl_context": True, "incl_context": True,
"entity_vector_length": 64, "entity_vector_length": 64,
@ -62,6 +63,7 @@ def make_entity_linker(
model: Model, model: Model,
*, *,
labels_discard: Iterable[str], labels_discard: Iterable[str],
n_sents: int,
incl_prior: bool, incl_prior: bool,
incl_context: bool, incl_context: bool,
entity_vector_length: int, entity_vector_length: int,
@ -73,6 +75,7 @@ def make_entity_linker(
representations. Given a batch of Doc objects, it should return a single representations. Given a batch of Doc objects, it should return a single
array, with one row per item in the batch. array, with one row per item in the batch.
labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction. labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
n_sents (int): The number of neighbouring sentences to take into account.
incl_prior (bool): Whether or not to include prior probabilities from the KB in the model. incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
incl_context (bool): Whether or not to include the local context in the model. incl_context (bool): Whether or not to include the local context in the model.
entity_vector_length (int): Size of encoding vectors in the KB. entity_vector_length (int): Size of encoding vectors in the KB.
@ -84,6 +87,7 @@ def make_entity_linker(
model, model,
name, name,
labels_discard=labels_discard, labels_discard=labels_discard,
n_sents=n_sents,
incl_prior=incl_prior, incl_prior=incl_prior,
incl_context=incl_context, incl_context=incl_context,
entity_vector_length=entity_vector_length, entity_vector_length=entity_vector_length,
@ -106,6 +110,7 @@ class EntityLinker(TrainablePipe):
name: str = "entity_linker", name: str = "entity_linker",
*, *,
labels_discard: Iterable[str], labels_discard: Iterable[str],
n_sents: int,
incl_prior: bool, incl_prior: bool,
incl_context: bool, incl_context: bool,
entity_vector_length: int, entity_vector_length: int,
@ -118,6 +123,7 @@ class EntityLinker(TrainablePipe):
name (str): The component instance name, used to add entries to the name (str): The component instance name, used to add entries to the
losses during training. losses during training.
labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction. labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
n_sents (int): The number of neighbouring sentences to take into account.
incl_prior (bool): Whether or not to include prior probabilities from the KB in the model. incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
incl_context (bool): Whether or not to include the local context in the model. incl_context (bool): Whether or not to include the local context in the model.
entity_vector_length (int): Size of encoding vectors in the KB. entity_vector_length (int): Size of encoding vectors in the KB.
@ -129,25 +135,24 @@ class EntityLinker(TrainablePipe):
self.vocab = vocab self.vocab = vocab
self.model = model self.model = model
self.name = name self.name = name
cfg = { self.labels_discard = list(labels_discard)
"labels_discard": list(labels_discard), self.n_sents = n_sents
"incl_prior": incl_prior, self.incl_prior = incl_prior
"incl_context": incl_context, self.incl_context = incl_context
"entity_vector_length": entity_vector_length,
}
self.get_candidates = get_candidates self.get_candidates = get_candidates
self.cfg = dict(cfg) self.cfg = {}
self.distance = CosineDistance(normalize=False) self.distance = CosineDistance(normalize=False)
# how many neightbour sentences to take into account # how many neightbour sentences to take into account
self.n_sents = cfg.get("n_sents", 0)
# create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'. # create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
self.kb = empty_kb(entity_vector_length)(self.vocab) self.kb = empty_kb(entity_vector_length)(self.vocab)
def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]): def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
"""Define the KB of this pipe by providing a function that will """Define the KB of this pipe by providing a function that will
create it using this object's vocab.""" create it using this object's vocab."""
if not callable(kb_loader):
raise ValueError(Errors.E885.format(arg_type=type(kb_loader)))
self.kb = kb_loader(self.vocab) self.kb = kb_loader(self.vocab)
self.cfg["entity_vector_length"] = self.kb.entity_vector_length
def validate_kb(self) -> None: def validate_kb(self) -> None:
# Raise an error if the knowledge base is not initialized. # Raise an error if the knowledge base is not initialized.
@ -309,14 +314,13 @@ class EntityLinker(TrainablePipe):
sent_doc = doc[start_token:end_token].as_doc() sent_doc = doc[start_token:end_token].as_doc()
# currently, the context is the same for each entity in a sentence (should be refined) # currently, the context is the same for each entity in a sentence (should be refined)
xp = self.model.ops.xp xp = self.model.ops.xp
if self.cfg.get("incl_context"): if self.incl_context:
sentence_encoding = self.model.predict([sent_doc])[0] sentence_encoding = self.model.predict([sent_doc])[0]
sentence_encoding_t = sentence_encoding.T sentence_encoding_t = sentence_encoding.T
sentence_norm = xp.linalg.norm(sentence_encoding_t) sentence_norm = xp.linalg.norm(sentence_encoding_t)
for ent in sent.ents: for ent in sent.ents:
entity_count += 1 entity_count += 1
to_discard = self.cfg.get("labels_discard", []) if ent.label_ in self.labels_discard:
if to_discard and ent.label_ in to_discard:
# ignoring this entity - setting to NIL # ignoring this entity - setting to NIL
final_kb_ids.append(self.NIL) final_kb_ids.append(self.NIL)
else: else:
@ -334,13 +338,13 @@ class EntityLinker(TrainablePipe):
prior_probs = xp.asarray( prior_probs = xp.asarray(
[c.prior_prob for c in candidates] [c.prior_prob for c in candidates]
) )
if not self.cfg.get("incl_prior"): if not self.incl_prior:
prior_probs = xp.asarray( prior_probs = xp.asarray(
[0.0 for _ in candidates] [0.0 for _ in candidates]
) )
scores = prior_probs scores = prior_probs
# add in similarity from the context # add in similarity from the context
if self.cfg.get("incl_context"): if self.incl_context:
entity_encodings = xp.asarray( entity_encodings = xp.asarray(
[c.entity_vector for c in candidates] [c.entity_vector for c in candidates]
) )


@ -66,26 +66,12 @@ class Sentencizer(Pipe):
""" """
error_handler = self.get_error_handler() error_handler = self.get_error_handler()
try: try:
self._call(doc) tags = self.predict([doc])
self.set_annotations([doc], tags)
return doc return doc
except Exception as e: except Exception as e:
error_handler(self.name, self, [doc], e) error_handler(self.name, self, [doc], e)
def _call(self, doc):
start = 0
seen_period = False
for i, token in enumerate(doc):
is_in_punct_chars = token.text in self.punct_chars
token.is_sent_start = i == 0
if seen_period and not token.is_punct and not is_in_punct_chars:
doc[start].is_sent_start = True
start = token.i
seen_period = False
elif is_in_punct_chars:
seen_period = True
if start < len(doc):
doc[start].is_sent_start = True
def predict(self, docs): def predict(self, docs):
"""Apply the pipe to a batch of docs, without modifying them. """Apply the pipe to a batch of docs, without modifying them.


@ -314,6 +314,9 @@ class Scorer:
getter (Callable[[Doc, str], Iterable[Span]]): Defaults to getattr. If getter (Callable[[Doc, str], Iterable[Span]]): Defaults to getattr. If
provided, getter(doc, attr) should return the spans for the provided, getter(doc, attr) should return the spans for the
individual doc. individual doc.
has_annotation (Optional[Callable[[Doc], bool]]) should return whether a `Doc`
has annotation for this `attr`. Docs without annotation are skipped for
scoring purposes.
RETURNS (Dict[str, Any]): A dictionary containing the PRF scores under RETURNS (Dict[str, Any]): A dictionary containing the PRF scores under
the keys attr_p/r/f and the per-type PRF scores under attr_per_type. the keys attr_p/r/f and the per-type PRF scores under attr_per_type.
@ -324,7 +327,7 @@ class Scorer:
for example in examples: for example in examples:
pred_doc = example.predicted pred_doc = example.predicted
gold_doc = example.reference gold_doc = example.reference
# Option to handle docs without sents # Option to handle docs without annotation for this attribute
if has_annotation is not None: if has_annotation is not None:
if not has_annotation(gold_doc): if not has_annotation(gold_doc):
continue continue
@ -531,27 +534,28 @@ class Scorer:
gold_span = gold_ent_by_offset.get( gold_span = gold_ent_by_offset.get(
(pred_ent.start_char, pred_ent.end_char), None (pred_ent.start_char, pred_ent.end_char), None
) )
label = gold_span.label_ if gold_span is not None:
if label not in f_per_type: label = gold_span.label_
f_per_type[label] = PRFScore() if label not in f_per_type:
gold = gold_span.kb_id_ f_per_type[label] = PRFScore()
# only evaluating entities that overlap between gold and pred, gold = gold_span.kb_id_
# to disentangle the performance of the NEL from the NER # only evaluating entities that overlap between gold and pred,
if gold is not None: # to disentangle the performance of the NEL from the NER
pred = pred_ent.kb_id_ if gold is not None:
if gold in negative_labels and pred in negative_labels: pred = pred_ent.kb_id_
# ignore true negatives if gold in negative_labels and pred in negative_labels:
pass # ignore true negatives
elif gold == pred: pass
f_per_type[label].tp += 1 elif gold == pred:
elif gold in negative_labels: f_per_type[label].tp += 1
f_per_type[label].fp += 1 elif gold in negative_labels:
elif pred in negative_labels: f_per_type[label].fp += 1
f_per_type[label].fn += 1 elif pred in negative_labels:
else: f_per_type[label].fn += 1
# a wrong prediction (e.g. Q42 != Q3) counts as both a FP as well as a FN else:
f_per_type[label].fp += 1 # a wrong prediction (e.g. Q42 != Q3) counts as both a FP as well as a FN
f_per_type[label].fn += 1 f_per_type[label].fp += 1
f_per_type[label].fn += 1
micro_prf = PRFScore() micro_prf = PRFScore()
for label_prf in f_per_type.values(): for label_prf in f_per_type.values():
micro_prf.tp += label_prf.tp micro_prf.tp += label_prf.tp
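
The new `if gold_span is not None:` guard in the hunk above can be sketched in plain Python. This is a simplified model, not the spaCy `Scorer` itself: the `PRF` dataclass, the `count_links` helper, and the tuple-based inputs are hypothetical stand-ins for `PRFScore`, `Span` objects, and the gold offset lookup.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class PRF:
    """Hypothetical stand-in for spaCy's PRFScore counters."""
    tp: int = 0
    fp: int = 0
    fn: int = 0


def count_links(
    pred_ents: Iterable[Tuple[int, int, str]],
    gold_by_offset: Dict[Tuple[int, int], Tuple[str, str]],
    negative_labels: Tuple[str, ...] = ("NIL", ""),
) -> Dict[str, PRF]:
    """Count entity-linking hits per NER label.

    pred_ents: (start_char, end_char, predicted_kb_id) triples.
    gold_by_offset: (start_char, end_char) -> (label, gold_kb_id).
    """
    per_type: Dict[str, PRF] = {}
    for start, end, pred_kb in pred_ents:
        gold = gold_by_offset.get((start, end))
        # The guard: a predicted entity with no gold span at the same
        # offsets is skipped instead of raising on a missing gold span.
        if gold is None:
            continue
        label, gold_kb = gold
        prf = per_type.setdefault(label, PRF())
        if gold_kb in negative_labels and pred_kb in negative_labels:
            pass  # true negatives are ignored
        elif gold_kb == pred_kb:
            prf.tp += 1
        elif gold_kb in negative_labels:
            prf.fp += 1
        elif pred_kb in negative_labels:
            prf.fn += 1
        else:
            # a wrong ID counts as both a false positive and a false negative
            prf.fp += 1
            prf.fn += 1
    return per_type
```

Under these assumptions, an entity that only the pipeline predicted (such as the `ORG` span in the new `test_partial_links` test) never creates a per-type entry, which matches the `"ORG" not in results["nel_f_per_type"]` assertion there.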


@ -39,6 +39,11 @@ def ar_tokenizer():
return get_lang_class("ar")().tokenizer return get_lang_class("ar")().tokenizer
@pytest.fixture(scope="session")
def bg_tokenizer():
return get_lang_class("bg")().tokenizer
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def bn_tokenizer(): def bn_tokenizer():
return get_lang_class("bn")().tokenizer return get_lang_class("bn")().tokenizer


@ -1,3 +1,5 @@
import weakref
import pytest import pytest
import numpy import numpy
import logging import logging
@ -663,3 +665,10 @@ def test_span_groups(en_tokenizer):
assert doc.spans["hi"].has_overlap assert doc.spans["hi"].has_overlap
del doc.spans["hi"] del doc.spans["hi"]
assert "hi" not in doc.spans assert "hi" not in doc.spans
def test_doc_spans_copy(en_tokenizer):
doc1 = en_tokenizer("Some text about Colombia and the Czech Republic")
assert weakref.ref(doc1) == doc1.spans.doc_ref
doc2 = doc1.copy()
assert weakref.ref(doc2) == doc2.spans.doc_ref


@ -0,0 +1,30 @@
import pytest
from spacy.lang.bg.lex_attrs import like_num
@pytest.mark.parametrize(
"word,match",
[
("10", True),
("1", True),
("10000", True),
("1.000", True),
("бројка", False),
("999,23", True),
("едно", True),
("две", True),
("цифра", False),
("единайсет", True),
("десет", True),
("сто", True),
("брой", False),
("хиляда", True),
("милион", True),
(",", False),
("милиарда", True),
("билион", True),
],
)
def test_bg_lex_attrs_like_number(bg_tokenizer, word, match):
tokens = bg_tokenizer(word)
assert len(tokens) == 1
assert tokens[0].like_num == match


@ -230,7 +230,7 @@ def test_el_pipe_configuration(nlp):
def get_lowercased_candidates(kb, span): def get_lowercased_candidates(kb, span):
return kb.get_alias_candidates(span.text.lower()) return kb.get_alias_candidates(span.text.lower())
@registry.misc.register("spacy.LowercaseCandidateGenerator.v1") @registry.misc("spacy.LowercaseCandidateGenerator.v1")
def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]: def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]:
return get_lowercased_candidates return get_lowercased_candidates
@ -250,6 +250,14 @@ def test_el_pipe_configuration(nlp):
assert doc[2].ent_kb_id_ == "Q2" assert doc[2].ent_kb_id_ == "Q2"
def test_nel_nsents(nlp):
"""Test that n_sents can be set through the configuration"""
entity_linker = nlp.add_pipe("entity_linker", config={})
assert entity_linker.n_sents == 0
entity_linker = nlp.replace_pipe("entity_linker", "entity_linker", config={"n_sents": 2})
assert entity_linker.n_sents == 2
def test_vocab_serialization(nlp): def test_vocab_serialization(nlp):
"""Test that string information is retained across storage""" """Test that string information is retained across storage"""
mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1) mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)


@ -83,9 +83,9 @@ def test_replace_last_pipe(nlp):
def test_replace_pipe_config(nlp): def test_replace_pipe_config(nlp):
nlp.add_pipe("entity_linker") nlp.add_pipe("entity_linker")
nlp.add_pipe("sentencizer") nlp.add_pipe("sentencizer")
assert nlp.get_pipe("entity_linker").cfg["incl_prior"] is True assert nlp.get_pipe("entity_linker").incl_prior is True
nlp.replace_pipe("entity_linker", "entity_linker", config={"incl_prior": False}) nlp.replace_pipe("entity_linker", "entity_linker", config={"incl_prior": False})
assert nlp.get_pipe("entity_linker").cfg["incl_prior"] is False assert nlp.get_pipe("entity_linker").incl_prior is False
@pytest.mark.parametrize("old_name,new_name", [("old_pipe", "new_pipe")]) @pytest.mark.parametrize("old_name,new_name", [("old_pipe", "new_pipe")])


@ -61,7 +61,6 @@ def test_issue7029():
losses = {} losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses) nlp.update(train_examples, sgd=optimizer, losses=losses)
texts = ["first", "second", "third", "fourth", "and", "then", "some", ""] texts = ["first", "second", "third", "fourth", "and", "then", "some", ""]
nlp.select_pipes(enable=["tok2vec", "tagger"])
docs1 = list(nlp.pipe(texts, batch_size=1)) docs1 = list(nlp.pipe(texts, batch_size=1))
docs2 = list(nlp.pipe(texts, batch_size=4)) docs2 = list(nlp.pipe(texts, batch_size=4))
assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]] assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]]


@ -1,5 +1,3 @@
import pytest
from spacy.tokens.doc import Doc from spacy.tokens.doc import Doc
from spacy.vocab import Vocab from spacy.vocab import Vocab
from spacy.pipeline._parser_internals.arc_eager import ArcEager from spacy.pipeline._parser_internals.arc_eager import ArcEager


@ -0,0 +1,54 @@
from spacy.kb import KnowledgeBase
from spacy.training import Example
from spacy.lang.en import English
# fmt: off
TRAIN_DATA = [
("Russ Cochran his reprints include EC Comics.",
{"links": {(0, 12): {"Q2146908": 1.0}},
"entities": [(0, 12, "PERSON")],
"sent_starts": [1, -1, 0, 0, 0, 0, 0, 0]})
]
# fmt: on
def test_partial_links():
# Test that having some entities on the doc without gold links, doesn't crash
nlp = English()
vector_length = 3
train_examples = []
for text, annotation in TRAIN_DATA:
doc = nlp(text)
train_examples.append(Example.from_dict(doc, annotation))
def create_kb(vocab):
# create artificial KB
mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
mykb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
mykb.add_alias("Russ Cochran", ["Q2146908"], [0.9])
return mykb
# Create and train the Entity Linker
entity_linker = nlp.add_pipe("entity_linker", last=True)
entity_linker.set_kb(create_kb)
optimizer = nlp.initialize(get_examples=lambda: train_examples)
for i in range(2):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
# adding additional components that are required for the entity_linker
nlp.add_pipe("sentencizer", first=True)
patterns = [
{"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]},
{"label": "ORG", "pattern": [{"LOWER": "ec"}, {"LOWER": "comics"}]}
]
ruler = nlp.add_pipe("entity_ruler", before="entity_linker")
ruler.add_patterns(patterns)
# this will run the pipeline on the examples and shouldn't crash
results = nlp.evaluate(train_examples)
assert "PERSON" in results["ents_per_type"]
assert "PERSON" in results["nel_f_per_type"]
assert "ORG" in results["ents_per_type"]
assert "ORG" not in results["nel_f_per_type"]


@ -0,0 +1,18 @@
from spacy.lang.en import English
def test_issue7065():
text = "Kathleen Battle sang in Mahler 's Symphony No. 8 at the Cincinnati Symphony Orchestra 's May Festival."
nlp = English()
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "THING", "pattern": [{"LOWER": "symphony"}, {"LOWER": "no"}, {"LOWER": "."}, {"LOWER": "8"}]}]
ruler.add_patterns(patterns)
doc = nlp(text)
sentences = [s for s in doc.sents]
assert len(sentences) == 2
sent0 = sentences[0]
ent = doc.ents[0]
assert ent.start < sent0.end < ent.end
assert sentences.index(ent.sent) == 0


@ -160,7 +160,7 @@ subword_features = false
""" """
@registry.architectures.register("my_test_parser") @registry.architectures("my_test_parser")
def my_parser(): def my_parser():
tok2vec = build_Tok2Vec_model( tok2vec = build_Tok2Vec_model(
MultiHashEmbed( MultiHashEmbed(


@ -108,7 +108,7 @@ def test_serialize_subclassed_kb():
super().__init__(vocab, entity_vector_length) super().__init__(vocab, entity_vector_length)
self.custom_field = custom_field self.custom_field = custom_field
@registry.misc.register("spacy.CustomKB.v1") @registry.misc("spacy.CustomKB.v1")
def custom_kb( def custom_kb(
entity_vector_length: int, custom_field: int entity_vector_length: int, custom_field: int
) -> Callable[["Vocab"], KnowledgeBase]: ) -> Callable[["Vocab"], KnowledgeBase]:


@ -4,12 +4,12 @@ from thinc.api import Linear
from catalogue import RegistryError from catalogue import RegistryError
@registry.architectures.register("my_test_function")
def create_model(nr_in, nr_out):
return Linear(nr_in, nr_out)
def test_get_architecture(): def test_get_architecture():
@registry.architectures("my_test_function")
def create_model(nr_in, nr_out):
return Linear(nr_in, nr_out)
arch = registry.architectures.get("my_test_function") arch = registry.architectures.get("my_test_function")
assert arch is create_model assert arch is create_model
with pytest.raises(RegistryError): with pytest.raises(RegistryError):


@ -7,7 +7,7 @@ from spacy import util
from spacy import prefer_gpu, require_gpu, require_cpu from spacy import prefer_gpu, require_gpu, require_cpu
from spacy.ml._precomputable_affine import PrecomputableAffine from spacy.ml._precomputable_affine import PrecomputableAffine
from spacy.ml._precomputable_affine import _backprop_precomputable_affine_padding from spacy.ml._precomputable_affine import _backprop_precomputable_affine_padding
from spacy.util import dot_to_object, SimpleFrozenList from spacy.util import dot_to_object, SimpleFrozenList, import_file
from thinc.api import Config, Optimizer, ConfigValidationError from thinc.api import Config, Optimizer, ConfigValidationError
from spacy.training.batchers import minibatch_by_words from spacy.training.batchers import minibatch_by_words
from spacy.lang.en import English from spacy.lang.en import English
@ -17,7 +17,7 @@ from spacy.schemas import ConfigSchemaTraining
from thinc.api import get_current_ops, NumpyOps, CupyOps from thinc.api import get_current_ops, NumpyOps, CupyOps
from .util import get_random_doc from .util import get_random_doc, make_tempdir
@pytest.fixture @pytest.fixture
@ -347,3 +347,35 @@ def test_resolve_dot_names():
errors = e.value.errors errors = e.value.errors
assert len(errors) == 1 assert len(errors) == 1
assert errors[0]["loc"] == ["training", "xyz"] assert errors[0]["loc"] == ["training", "xyz"]
def test_import_code():
code_str = """
from spacy import Language
class DummyComponent:
def __init__(self, vocab, name):
pass
def initialize(self, get_examples, *, nlp, dummy_param: int):
pass
@Language.factory(
"dummy_component",
)
def make_dummy_component(
nlp: Language, name: str
):
return DummyComponent(nlp.vocab, name)
"""
with make_tempdir() as temp_dir:
code_path = os.path.join(temp_dir, "code.py")
with open(code_path, "w") as fileh:
fileh.write(code_str)
import_file("python_code", code_path)
config = {"initialize": {"components": {"dummy_component": {"dummy_param": 1}}}}
nlp = English.from_config(config)
nlp.add_pipe("dummy_component")
nlp.initialize()


@ -196,6 +196,104 @@ def test_Example_from_dict_with_entities_invalid(annots):
assert len(list(example.reference.ents)) == 0 assert len(list(example.reference.ents)) == 0
@pytest.mark.parametrize(
"annots",
[
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"entities": [
(7, 15, "LOC"),
(11, 15, "LOC"),
(20, 26, "LOC"),
], # overlapping
}
],
)
def test_Example_from_dict_with_entities_overlapping(annots):
vocab = Vocab()
predicted = Doc(vocab, words=annots["words"])
with pytest.raises(ValueError):
Example.from_dict(predicted, annots)
@pytest.mark.parametrize(
"annots",
[
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"spans": {
"cities": [(7, 15, "LOC"), (20, 26, "LOC")],
"people": [(0, 1, "PERSON")],
},
}
],
)
def test_Example_from_dict_with_spans(annots):
vocab = Vocab()
predicted = Doc(vocab, words=annots["words"])
example = Example.from_dict(predicted, annots)
assert len(list(example.reference.ents)) == 0
assert len(list(example.reference.spans["cities"])) == 2
assert len(list(example.reference.spans["people"])) == 1
for span in example.reference.spans["cities"]:
assert span.label_ == "LOC"
for span in example.reference.spans["people"]:
assert span.label_ == "PERSON"
@pytest.mark.parametrize(
"annots",
[
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"spans": {
"cities": [(7, 15, "LOC"), (11, 15, "LOC"), (20, 26, "LOC")],
"people": [(0, 1, "PERSON")],
},
}
],
)
def test_Example_from_dict_with_spans_overlapping(annots):
vocab = Vocab()
predicted = Doc(vocab, words=annots["words"])
example = Example.from_dict(predicted, annots)
assert len(list(example.reference.ents)) == 0
assert len(list(example.reference.spans["cities"])) == 3
assert len(list(example.reference.spans["people"])) == 1
for span in example.reference.spans["cities"]:
assert span.label_ == "LOC"
for span in example.reference.spans["people"]:
assert span.label_ == "PERSON"
@pytest.mark.parametrize(
"annots",
[
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"spans": [(0, 1, "PERSON")],
},
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"spans": {"cities": (7, 15, "LOC")},
},
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"spans": {"cities": [7, 11]},
},
{
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
"spans": {"cities": [[7]]},
},
],
)
def test_Example_from_dict_with_spans_invalid(annots):
vocab = Vocab()
predicted = Doc(vocab, words=annots["words"])
with pytest.raises(ValueError):
Example.from_dict(predicted, annots)
@pytest.mark.parametrize( @pytest.mark.parametrize(
"annots", "annots",
[ [


@ -27,7 +27,7 @@ def test_readers():
factory = "textcat" factory = "textcat"
""" """
@registry.readers.register("myreader.v1") @registry.readers("myreader.v1")
def myreader() -> Dict[str, Callable[[Language, str], Iterable[Example]]]: def myreader() -> Dict[str, Callable[[Language, str], Iterable[Example]]]:
annots = {"cats": {"POS": 1.0, "NEG": 0.0}} annots = {"cats": {"POS": 1.0, "NEG": 0.0}}


@ -1,7 +1,8 @@
from .doc import Doc from .doc import Doc
from .token import Token from .token import Token
from .span import Span from .span import Span
from .span_group import SpanGroup
from ._serialize import DocBin from ._serialize import DocBin
from .morphanalysis import MorphAnalysis from .morphanalysis import MorphAnalysis
__all__ = ["Doc", "Token", "Span", "DocBin", "MorphAnalysis"] __all__ = ["Doc", "Token", "Span", "SpanGroup", "DocBin", "MorphAnalysis"]


@ -33,8 +33,10 @@ class SpanGroups(UserDict):
def _make_span_group(self, name: str, spans: Iterable["Span"]) -> SpanGroup: def _make_span_group(self, name: str, spans: Iterable["Span"]) -> SpanGroup:
return SpanGroup(self.doc_ref(), name=name, spans=spans) return SpanGroup(self.doc_ref(), name=name, spans=spans)
def copy(self) -> "SpanGroups": def copy(self, doc: "Doc" = None) -> "SpanGroups":
return SpanGroups(self.doc_ref()).from_bytes(self.to_bytes()) if doc is None:
doc = self.doc_ref()
return SpanGroups(doc).from_bytes(self.to_bytes())
def to_bytes(self) -> bytes: def to_bytes(self) -> bytes:
# We don't need to serialize this as a dict, because the groups # We don't need to serialize this as a dict, because the groups


@ -1188,7 +1188,7 @@ cdef class Doc:
other.user_span_hooks = dict(self.user_span_hooks) other.user_span_hooks = dict(self.user_span_hooks)
other.length = self.length other.length = self.length
other.max_length = self.max_length other.max_length = self.max_length
other.spans = self.spans.copy() other.spans = self.spans.copy(doc=other)
buff_size = other.max_length + (PADDING*2) buff_size = other.max_length + (PADDING*2)
assert buff_size > 0 assert buff_size > 0
tokens = <TokenC*>other.mem.alloc(buff_size, sizeof(TokenC)) tokens = <TokenC*>other.mem.alloc(buff_size, sizeof(TokenC))
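
The reason `SpanGroups.copy` gains a `doc` argument can be illustrated with a small standard-library model. The `Owner` and `Groups` classes below are hypothetical stand-ins for `Doc` and `SpanGroups`; the real implementation round-trips through `to_bytes`/`from_bytes`, while this sketch only makes a shallow copy of the mapping.

```python
import weakref


class Groups(dict):
    """Hypothetical stand-in for SpanGroups: a dict holding a weak owner reference."""

    def __init__(self, owner, items=()):
        super().__init__(items)
        self.owner_ref = weakref.ref(owner)

    def copy(self, owner=None):
        # Without an explicit owner the copy keeps pointing at the original,
        # which is the behaviour the commit corrects for Doc.copy().
        if owner is None:
            owner = self.owner_ref()
        return Groups(owner, self.items())


class Owner:
    """Hypothetical stand-in for Doc."""

    def __init__(self, name):
        self.name = name
        self.groups = Groups(self)

    def copy(self):
        other = Owner(self.name)
        # Pass the new owner so the copied groups reference the copy.
        other.groups = self.groups.copy(owner=other)
        return other
```

In this model, `Owner.copy()` yields groups whose weak reference resolves to the new object, which is what the new `test_doc_spans_copy` test checks for `doc2.spans.doc_ref`.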


@ -357,7 +357,12 @@ cdef class Span:
@property @property
def sent(self): def sent(self):
"""RETURNS (Span): The sentence span that the span is a part of.""" """Obtain the sentence that contains this span. If the given span
crosses sentence boundaries, return only the first sentence
to which it belongs.
RETURNS (Span): The sentence span that the span is a part of.
"""
if "sent" in self.doc.user_span_hooks: if "sent" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["sent"](self) return self.doc.user_span_hooks["sent"](self)
# Use `sent_start` token attribute to find sentence boundaries # Use `sent_start` token attribute to find sentence boundaries
@ -367,8 +372,8 @@ cdef class Span:
start = self.start start = self.start
while self.doc.c[start].sent_start != 1 and start > 0: while self.doc.c[start].sent_start != 1 and start > 0:
start += -1 start += -1
# Find end of the sentence # Find end of the sentence - can be within the entity
end = self.end end = self.start + 1
while end < self.doc.length and self.doc.c[end].sent_start != 1: while end < self.doc.length and self.doc.c[end].sent_start != 1:
end += 1 end += 1
n += 1 n += 1
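
The effect of starting the end search at `self.start + 1` instead of `self.end` can be sketched over a plain list of sentence-start flags. The two helper functions are hypothetical and model only the boundary search, not the Cython `Span.sent` property.

```python
from typing import List, Tuple


def first_sentence_bounds(sent_starts: List[bool], span_start: int) -> Tuple[int, int]:
    """Bounds of the sentence containing the span's first token (new behaviour)."""
    start = span_start
    while start > 0 and not sent_starts[start]:
        start -= 1
    # Search from the token after the span start, so a sentence boundary
    # that falls inside the span still ends the sentence.
    end = span_start + 1
    while end < len(sent_starts) and not sent_starts[end]:
        end += 1
    return start, end


def old_sentence_bounds(sent_starts: List[bool], span_start: int, span_end: int) -> Tuple[int, int]:
    """Previous behaviour: the end search begins at the span's end."""
    start = span_start
    while start > 0 and not sent_starts[start]:
        start -= 1
    end = span_end
    while end < len(sent_starts) and not sent_starts[end]:
        end += 1
    return start, end
```

For a ten-token document with sentences starting at tokens 0 and 6 and a span covering tokens 4 to 7, the new search stops at the boundary inside the span, while the old one runs past it to the end of the second sentence.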


@ -22,6 +22,8 @@ cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot):
output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"]) output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"])
if "entities" in doc_annot: if "entities" in doc_annot:
_add_entities_to_doc(output, doc_annot["entities"]) _add_entities_to_doc(output, doc_annot["entities"])
if "spans" in doc_annot:
_add_spans_to_doc(output, doc_annot["spans"])
if array.size: if array.size:
output = output.from_array(attrs, array) output = output.from_array(attrs, array)
# links are currently added with ENT_KB_ID on the token level # links are currently added with ENT_KB_ID on the token level
@ -314,13 +316,11 @@ def _annot2array(vocab, tok_annot, doc_annot):
for key, value in doc_annot.items(): for key, value in doc_annot.items():
if value: if value:
if key == "entities": if key in ["entities", "cats", "spans"]:
pass pass
elif key == "links": elif key == "links":
ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], tok_annot["SPACY"], value) ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], tok_annot["SPACY"], value)
tok_annot["ENT_KB_ID"] = ent_kb_ids tok_annot["ENT_KB_ID"] = ent_kb_ids
elif key == "cats":
pass
else: else:
raise ValueError(Errors.E974.format(obj="doc", key=key)) raise ValueError(Errors.E974.format(obj="doc", key=key))
@ -351,6 +351,29 @@ def _annot2array(vocab, tok_annot, doc_annot):
return attrs, array.T return attrs, array.T
def _add_spans_to_doc(doc, spans_data):
if not isinstance(spans_data, dict):
raise ValueError(Errors.E879)
for key, span_list in spans_data.items():
spans = []
if not isinstance(span_list, list):
raise ValueError(Errors.E879)
for span_tuple in span_list:
if not isinstance(span_tuple, (list, tuple)) or len(span_tuple) < 2:
raise ValueError(Errors.E879)
start_char = span_tuple[0]
end_char = span_tuple[1]
label = 0
kb_id = 0
if len(span_tuple) > 2:
label = span_tuple[2]
if len(span_tuple) > 3:
kb_id = span_tuple[3]
span = doc.char_span(start_char, end_char, label=label, kb_id=kb_id)
spans.append(span)
doc.spans[key] = spans
def _add_entities_to_doc(doc, ner_data): def _add_entities_to_doc(doc, ner_data):
if ner_data is None: if ner_data is None:
return return
@ -397,7 +420,7 @@ def _fix_legacy_dict_data(example_dict):
pass pass
elif key == "ids": elif key == "ids":
pass pass
elif key in ("cats", "links"): elif key in ("cats", "links", "spans"):
doc_dict[key] = value doc_dict[key] = value
elif key in ("ner", "entities"): elif key in ("ner", "entities"):
doc_dict["entities"] = value doc_dict["entities"] = value
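
The shape checks that `_add_spans_to_doc` applies to the `"spans"` annotation can be sketched without spaCy. The `normalize_spans` helper and its error messages are hypothetical; the real function raises `Errors.E879` and builds `Span` objects with `doc.char_span`.

```python
from typing import Any, Dict, List, Tuple


def normalize_spans(spans_data: Any) -> Dict[str, List[Tuple[int, int, Any, Any]]]:
    """Validate {"group": [(start_char, end_char[, label[, kb_id]]), ...]}."""
    if not isinstance(spans_data, dict):
        raise ValueError("spans must be a dict of group name -> list of spans")
    result: Dict[str, List[Tuple[int, int, Any, Any]]] = {}
    for key, span_list in spans_data.items():
        if not isinstance(span_list, list):
            raise ValueError(f"spans[{key!r}] must be a list")
        normalized = []
        for span_tuple in span_list:
            if not isinstance(span_tuple, (list, tuple)) or len(span_tuple) < 2:
                raise ValueError(f"each span in {key!r} needs at least (start, end)")
            start_char, end_char = span_tuple[0], span_tuple[1]
            # Label and KB ID are optional and default to 0, as in the diff.
            label = span_tuple[2] if len(span_tuple) > 2 else 0
            kb_id = span_tuple[3] if len(span_tuple) > 3 else 0
            normalized.append((start_char, end_char, label, kb_id))
        result[key] = normalized
    return result
```

Unlike `"entities"`, overlapping spans within a group pass these checks, which is consistent with `test_Example_from_dict_with_spans_overlapping` expecting three `cities` spans.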


@ -103,7 +103,11 @@ def console_logger(progress_bar: bool = False):
@registry.loggers("spacy.WandbLogger.v1") @registry.loggers("spacy.WandbLogger.v1")
def wandb_logger(project_name: str, remove_config_values: List[str] = []): def wandb_logger(project_name: str, remove_config_values: List[str] = []):
import wandb try:
import wandb
from wandb import init, log, join # test that these are available
except ImportError:
raise ImportError(Errors.E880)
console = console_logger(progress_bar=False) console = console_logger(progress_bar=False)


@ -70,7 +70,7 @@ CONFIG_SECTION_ORDER = ["paths", "variables", "system", "nlp", "components", "co
logger = logging.getLogger("spacy") logger = logging.getLogger("spacy")
logger_stream_handler = logging.StreamHandler() logger_stream_handler = logging.StreamHandler()
logger_stream_handler.setFormatter(logging.Formatter("%(message)s")) logger_stream_handler.setFormatter(logging.Formatter("[%(asctime)s] [%(levelname)s] %(message)s"))
logger.addHandler(logger_stream_handler) logger.addHandler(logger_stream_handler)
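
A quick way to see what the new formatter string produces is to format a record by hand with the standard `logging` module. The record contents below (`"Loading pipeline"`, `example.py`) are made up for illustration.

```python
import logging

# Same format string as the new handler in the hunk above.
formatter = logging.Formatter("[%(asctime)s] [%(levelname)s] %(message)s")

# Build a record directly instead of going through a logger, so the
# example has no side effects on global logging configuration.
record = logging.LogRecord(
    name="spacy",
    level=logging.INFO,
    pathname="example.py",  # hypothetical source file
    lineno=1,
    msg="Loading pipeline",
    args=None,
    exc_info=None,
)
formatted = formatter.format(record)
print(formatted)
```

Each message is now prefixed with a bracketed timestamp and a bracketed level name, where the previous `"%(message)s"` format printed the bare message.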
@ -1454,9 +1454,10 @@ def is_cython_func(func: Callable) -> bool:
if hasattr(func, attr): # function or class instance if hasattr(func, attr): # function or class instance
return True return True
# https://stackoverflow.com/a/55767059 # https://stackoverflow.com/a/55767059
if hasattr(func, "__qualname__") and hasattr(func, "__module__"): # method if hasattr(func, "__qualname__") and hasattr(func, "__module__") \
cls_func = vars(sys.modules[func.__module__])[func.__qualname__.split(".")[0]] and func.__module__ in sys.modules: # method
return hasattr(cls_func, attr) cls_func = vars(sys.modules[func.__module__])[func.__qualname__.split(".")[0]]
return hasattr(cls_func, attr)
return False return False


@ -61,6 +61,8 @@ cdef class Vocab:
lookups (Lookups): Container for large lookup tables and dictionaries. lookups (Lookups): Container for large lookup tables and dictionaries.
oov_prob (float): Default OOV probability. oov_prob (float): Default OOV probability.
vectors_name (unicode): Optional name to identify the vectors table. vectors_name (unicode): Optional name to identify the vectors table.
get_noun_chunks (Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]):
A function that yields base noun phrases used for Doc.noun_chunks.
""" """
lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {} lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {}
if lookups in (None, True, False): if lookups in (None, True, False):


@ -19,7 +19,7 @@ spaCy's built-in architectures that are used for different NLP tasks. All
trainable [built-in components](/api#architecture-pipeline) expect a `model` trainable [built-in components](/api#architecture-pipeline) expect a `model`
argument defined in the config and document their the default architecture. argument defined in the config and document their the default architecture.
Custom architectures can be registered using the Custom architectures can be registered using the
[`@spacy.registry.architectures`](/api/top-level#regsitry) decorator and used as [`@spacy.registry.architectures`](/api/top-level#registry) decorator and used as
part of the [training config](/usage/training#custom-functions). Also see the part of the [training config](/usage/training#custom-functions). Also see the
usage documentation on usage documentation on
[layers and model architectures](/usage/layers-architectures). [layers and model architectures](/usage/layers-architectures).


@ -219,7 +219,7 @@ alignment mode `"strict".
| `alignment_mode` | How character indices snap to token boundaries. Options: `"strict"` (no snapping), `"contract"` (span of all tokens completely within the character span), `"expand"` (span of all tokens at least partially covered by the character span). Defaults to `"strict"`. ~~str~~ | | `alignment_mode` | How character indices snap to token boundaries. Options: `"strict"` (no snapping), `"contract"` (span of all tokens completely within the character span), `"expand"` (span of all tokens at least partially covered by the character span). Defaults to `"strict"`. ~~str~~ |
| **RETURNS** | The newly constructed object or `None`. ~~Optional[Span]~~ | | **RETURNS** | The newly constructed object or `None`. ~~Optional[Span]~~ |
## Doc.set_ents {#ents tag="method" new="3"} ## Doc.set_ents {#set_ents tag="method" new="3"}
Set the named entities in the document. Set the named entities in the document.
@ -616,8 +616,10 @@ phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be
nested within it so no NP-level coordination, no prepositional phrases, and no nested within it so no NP-level coordination, no prepositional phrases, and no
relative clauses. relative clauses.
If the `noun_chunk` [syntax iterator](/usage/adding-languages#language-data) has To customize the noun chunk iterator in a loaded pipeline, modify
not been implemeted for the given language, a `NotImplementedError` is raised. [`nlp.vocab.get_noun_chunks`](/api/vocab#attributes). If the `noun_chunk`
[syntax iterator](/usage/adding-languages#language-data) has not been
implemented for the given language, a `NotImplementedError` is raised.
> #### Example > #### Example
> >
@ -633,12 +635,14 @@ not been implemeted for the given language, a `NotImplementedError` is raised.
| ---------- | ------------------------------------- | | ---------- | ------------------------------------- |
| **YIELDS** | Noun chunks in the document. ~~Span~~ | | **YIELDS** | Noun chunks in the document. ~~Span~~ |
## Doc.sents {#sents tag="property" model="parser"} ## Doc.sents {#sents tag="property" model="sentences"}
Iterate over the sentences in the document. Sentence spans have no label. To Iterate over the sentences in the document. Sentence spans have no label.
improve accuracy on informal texts, spaCy calculates sentence boundaries from
the syntactic dependency parse. If the parser is disabled, the `sents` iterator This property is only available when
will be unavailable. [sentence boundaries](/usage/linguistic-features#sbd) have been set on the
document by the `parser`, `senter`, `sentencizer` or some custom function. It
will raise an error otherwise.
> #### Example > #### Example
> >

@ -31,6 +31,7 @@ architectures and their arguments and hyperparameters.
> from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL > from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL
> config = { > config = {
> "labels_discard": [], > "labels_discard": [],
> "n_sents": 0,
> "incl_prior": True, > "incl_prior": True,
> "incl_context": True, > "incl_context": True,
> "model": DEFAULT_NEL_MODEL, > "model": DEFAULT_NEL_MODEL,
@ -43,6 +44,7 @@ architectures and their arguments and hyperparameters.
| Setting | Description | | Setting | Description |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ | | `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ |
| `n_sents` | The number of neighbouring sentences to take into account. Defaults to 0. ~~int~~ |
| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ | | `incl_prior` | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ |
| `incl_context` | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ | | `incl_context` | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ |
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ | | `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ |
@ -89,6 +91,7 @@ custom knowledge base, you should either call
| `entity_vector_length` | Size of encoding vectors in the KB. ~~int~~ | | `entity_vector_length` | Size of encoding vectors in the KB. ~~int~~ |
| `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ | | `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ | | `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ |
| `n_sents` | The number of neighbouring sentences to take into account. ~~int~~ |
| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ | | `incl_prior` | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ |
| `incl_context` | Whether or not to include the local context in the model. ~~bool~~ | | `incl_context` | Whether or not to include the local context in the model. ~~bool~~ |
@ -154,7 +157,7 @@ with the current vocab.
> kb.add_alias(...) > kb.add_alias(...)
> return kb > return kb
> entity_linker = nlp.add_pipe("entity_linker") > entity_linker = nlp.add_pipe("entity_linker")
> entity_linker.set_kb(lambda: [], nlp=nlp, kb_loader=create_kb) > entity_linker.set_kb(create_kb)
> ``` > ```
| Name | Description | | Name | Description |
@ -247,14 +250,14 @@ pipe's entity linking model and context encoder. Delegates to
> losses = entity_linker.update(examples, sgd=optimizer) > losses = entity_linker.update(examples, sgd=optimizer)
> ``` > ```
| Name | Description | | Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | | -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | | _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ | | `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | | `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | | **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
## EntityLinker.score {#score tag="method" new="3"} ## EntityLinker.score {#score tag="method" new="3"}

@ -152,7 +152,7 @@ Get a list of all aliases in the knowledge base.
| ----------- | -------------------------------------------------------- | | ----------- | -------------------------------------------------------- |
| **RETURNS** | The list of aliases in the knowledge base. ~~List[str]~~ | | **RETURNS** | The list of aliases in the knowledge base. ~~List[str]~~ |
## KnowledgeBase.get_candidates {#get_candidates tag="method"} ## KnowledgeBase.get_alias_candidates {#get_alias_candidates tag="method"}
Given a certain textual mention as input, retrieve a list of candidate entities Given a certain textual mention as input, retrieve a list of candidate entities
of type [`Candidate`](/api/kb/#candidate). of type [`Candidate`](/api/kb/#candidate).
@ -160,13 +160,13 @@ of type [`Candidate`](/api/kb/#candidate).
> #### Example > #### Example
> >
> ```python > ```python
> candidates = kb.get_candidates("Douglas") > candidates = kb.get_alias_candidates("Douglas")
> ``` > ```
| Name | Description | | Name | Description |
| ----------- | ------------------------------------- | | ----------- | ------------------------------------------------------------- |
| `alias` | The textual mention or alias. ~~str~~ | | `alias` | The textual mention or alias. ~~str~~ |
| **RETURNS** | iterable | The list of relevant `Candidate` objects. ~~List[Candidate]~~ | | **RETURNS** | The list of relevant `Candidate` objects. ~~List[Candidate]~~ |
## KnowledgeBase.get_vector {#get_vector tag="method"} ## KnowledgeBase.get_vector {#get_vector tag="method"}
@ -246,7 +246,7 @@ certain prior probability.
Construct a `Candidate` object. Usually this constructor is not called directly, Construct a `Candidate` object. Usually this constructor is not called directly,
but instead these objects are returned by the but instead these objects are returned by the
[`get_candidates`](/api/kb#get_candidates) method of a `KnowledgeBase`. `get_candidates` method of the [`entity_linker`](/api/entitylinker) pipe.
> #### Example > #### Example
> >

@ -364,7 +364,7 @@ Evaluate a pipeline's components.
<Infobox variant="warning" title="Changed in v3.0"> <Infobox variant="warning" title="Changed in v3.0">
The `Language.update` method now takes a batch of [`Example`](/api/example) The `Language.evaluate` method now takes a batch of [`Example`](/api/example)
objects instead of tuples of `Doc` and `GoldParse` objects. objects instead of tuples of `Doc` and `GoldParse` objects.
</Infobox> </Infobox>
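As a hedged sketch of the `Example`-based call, the snippet below evaluates a blank English pipeline on a single hand-written example. The text and the entity offsets are made up for illustration, and with no trained components only the tokenization is actually scored.

```python
import spacy
from spacy.training import Example

# Assumption: a blank pipeline without components, so only tokenization is scored.
nlp = spacy.blank("en")
text = "Douglas Adams wrote a book"
gold = {"entities": [(0, 13, "PERSON")]}  # character offsets of "Douglas Adams"

# Each Example pairs an unannotated doc with the reference annotations.
examples = [Example.from_dict(nlp.make_doc(text), gold)]
scores = nlp.evaluate(examples)
print(scores["token_acc"])
```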

@ -137,14 +137,14 @@ Returns PRF scores for labeled or unlabeled spans.
> print(scores["ents_f"]) > print(scores["ents_f"])
> ``` > ```
| Name | Description | | Name | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | | `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
| `attr` | The attribute to score. ~~str~~ | | `attr` | The attribute to score. ~~str~~ |
| _keyword-only_ | | | _keyword-only_ | |
| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ | | `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ |
| `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~str~~ | | `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~Optional[Callable[[Doc], bool]]~~ |
| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ | | **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ |
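The `has_annotation` callback can be sketched as follows. The example assumes a blank English pipeline and identical, hand-set entities on the predicted and reference docs, so it only illustrates the call shape rather than a realistic evaluation.

```python
import spacy
from spacy.scorer import Scorer
from spacy.tokens import Span
from spacy.training import Example

nlp = spacy.blank("en")
text = "Douglas Adams wrote a book"

# Assumption: prediction and gold standard carry the same entity annotation.
predicted = nlp(text)
predicted.set_ents([Span(predicted, 0, 2, "PERSON")])
reference = nlp(text)
reference.set_ents([Span(reference, 0, 2, "PERSON")])
example = Example(predicted, reference)

# Docs for which the callback returns False are skipped during scoring.
scores = Scorer.score_spans(
    [example],
    "ents",
    has_annotation=lambda doc: doc.has_annotation("ENT_IOB"),
)
print(scores["ents_f"])
```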
## Scorer.score_deps {#score_deps tag="staticmethod" new="3"} ## Scorer.score_deps {#score_deps tag="staticmethod" new="3"}

@ -483,13 +483,40 @@ The L2 norm of the span's vector representation.
| ----------- | --------------------------------------------------- | | ----------- | --------------------------------------------------- |
| **RETURNS** | The L2 norm of the vector representation. ~~float~~ | | **RETURNS** | The L2 norm of the vector representation. ~~float~~ |
## Span.sent {#sent tag="property" model="sentences"}
The sentence span that this span is a part of. This property is only available
when [sentence boundaries](/usage/linguistic-features#sbd) have been set on the
document by the `parser`, `senter`, `sentencizer` or some custom function. It
will raise an error otherwise.
If the span happens to cross sentence boundaries, only the first sentence will
be returned. If it is required that the sentence always includes the
full span, the result can be adjusted as such:
```python
sent = span.sent
sent = doc[sent.start : max(sent.end, span.end)]
```
> #### Example
>
> ```python
> doc = nlp("Give it back! He pleaded.")
> span = doc[1:3]
> assert span.sent.text == "Give it back!"
> ```
| Name | Description |
| ----------- | ------------------------------------------------------- |
| **RETURNS** | The sentence span that this span is a part of. ~~Span~~ |
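The adjustment for spans that cross a sentence boundary can be tried end to end as in the sketch below. It assumes a blank English pipeline with the rule-based `sentencizer`, and a span chosen here so that it starts in the first sentence and ends in the second.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
doc = nlp("Give it back! He pleaded.")

span = doc[2:5]  # "back! He", which crosses the sentence boundary
sent = span.sent

# Extend the sentence span, if needed, so that it always covers the full span.
full = doc[sent.start : max(sent.end, span.end)]
print(full.text)
```

Taking the maximum of the two end offsets keeps the result unchanged whenever the sentence already contains the whole span.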
## Attributes {#attributes} ## Attributes {#attributes}
| Name | Description | | Name | Description |
| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | | --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `doc` | The parent document. ~~Doc~~ | | `doc` | The parent document. ~~Doc~~ |
| `tensor` <Tag variant="new">2.1.7</Tag> | The span's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~ | | `tensor` <Tag variant="new">2.1.7</Tag> | The span's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~ |
| `sent` | The sentence span that this span is a part of. ~~Span~~ |
| `start` | The token offset for the start of the span. ~~int~~ | | `start` | The token offset for the start of the span. ~~int~~ |
| `end` | The token offset for the end of the span. ~~int~~ | | `end` | The token offset for the end of the span. ~~int~~ |
| `start_char` | The character offset for the start of the span. ~~int~~ | | `start_char` | The character offset for the start of the span. ~~int~~ |

@ -21,14 +21,14 @@ Create the vocabulary.
> vocab = Vocab(strings=["hello", "world"]) > vocab = Vocab(strings=["hello", "world"])
> ``` > ```
| Name | Description | | Name | Description |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `lex_attr_getters` | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~ | | `lex_attr_getters` | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~ |
| `strings` | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~ | | `strings` | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~ |
| `lookups` | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~ | | `lookups` | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~ |
| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ | | `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ |
| `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~ | | `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~ |
| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ | | `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ |
| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span]], Iterator[Span]]]~~ | | `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span]], Iterator[Span]]]~~ |
## Vocab.\_\_len\_\_ {#len tag="method"} ## Vocab.\_\_len\_\_ {#len tag="method"}
@ -182,14 +182,14 @@ subword features by average over n-grams of `orth` (introduced in spaCy `v2.1`).
| Name | Description | | Name | Description |
| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------- | | ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ | | `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ |
| `minn` <Tag variant="new">2.1</Tag> | Minimum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ | | `minn` <Tag variant="new">2.1</Tag> | Minimum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
| `maxn` <Tag variant="new">2.1</Tag> | Maximum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ | | `maxn` <Tag variant="new">2.1</Tag> | Maximum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
| **RETURNS** | A word vector. Size and shape are determined by the `Vocab.vectors` instance. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | | **RETURNS** | A word vector. Size and shape are determined by the `Vocab.vectors` instance. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
## Vocab.set_vector {#set_vector tag="method" new="2"} ## Vocab.set_vector {#set_vector tag="method" new="2"}
Set a vector for a word in the vocabulary. Words can be referenced by string Set a vector for a word in the vocabulary. Words can be referenced by string or
or hash value. hash value.
> #### Example > #### Example
> >
@ -300,13 +300,14 @@ Load state from a binary string.
> assert type(PERSON) == int > assert type(PERSON) == int
> ``` > ```
| Name | Description | | Name | Description |
| --------------------------------------------- | ------------------------------------------------------------------------------- | | ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `strings` | A table managing the string-to-int mapping. ~~StringStore~~ | | `strings` | A table managing the string-to-int mapping. ~~StringStore~~ |
| `vectors` <Tag variant="new">2</Tag> | A table associating word IDs to word vectors. ~~Vectors~~ | | `vectors` <Tag variant="new">2</Tag> | A table associating word IDs to word vectors. ~~Vectors~~ |
| `vectors_length` | Number of dimensions for each word vector. ~~int~~ | | `vectors_length` | Number of dimensions for each word vector. ~~int~~ |
| `lookups` | The available lookup tables in this vocab. ~~Lookups~~ | | `lookups` | The available lookup tables in this vocab. ~~Lookups~~ |
| `writing_system` <Tag variant="new">2.1</Tag> | A dict with information about the language's writing system. ~~Dict[str, Any]~~ | | `writing_system` <Tag variant="new">2.1</Tag> | A dict with information about the language's writing system. ~~Dict[str, Any]~~ |
| `get_noun_chunks` <Tag variant="new">3.0</Tag> | A function that yields base noun phrases used for [`Doc.noun_chunks`](/api/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span]], Iterator[Span]]]~~ |
## Serialization fields {#serialization-fields} ## Serialization fields {#serialization-fields}

@ -15,7 +15,7 @@ next: /usage/projects
> ```python > ```python
> from thinc.api import Model, chain > from thinc.api import Model, chain
> >
> @spacy.registry.architectures.register("model.v1") > @spacy.registry.architectures("model.v1")
> def build_model(width: int, classes: int) -> Model: > def build_model(width: int, classes: int) -> Model:
> tok2vec = build_tok2vec(width) > tok2vec = build_tok2vec(width)
> output_layer = build_output_layer(width, classes) > output_layer = build_output_layer(width, classes)
@ -563,7 +563,7 @@ matrix** (~~Floats2d~~) of predictions:
```python ```python
### The model architecture ### The model architecture
@spacy.registry.architectures.register("rel_model.v1") @spacy.registry.architectures("rel_model.v1")
def create_relation_model(...) -> Model[List[Doc], Floats2d]: def create_relation_model(...) -> Model[List[Doc], Floats2d]:
model = ... # 👈 model will go here model = ... # 👈 model will go here
return model return model
@ -589,7 +589,7 @@ transforms the instance tensor into a final tensor holding the predictions:
```python ```python
### The model architecture {highlight="6"} ### The model architecture {highlight="6"}
@spacy.registry.architectures.register("rel_model.v1") @spacy.registry.architectures("rel_model.v1")
def create_relation_model( def create_relation_model(
create_instance_tensor: Model[List[Doc], Floats2d], create_instance_tensor: Model[List[Doc], Floats2d],
classification_layer: Model[Floats2d, Floats2d], classification_layer: Model[Floats2d, Floats2d],
@ -613,7 +613,7 @@ The `classification_layer` could be something like a
```python ```python
### The classification layer ### The classification layer
@spacy.registry.architectures.register("rel_classification_layer.v1") @spacy.registry.architectures("rel_classification_layer.v1")
def create_classification_layer( def create_classification_layer(
nO: int = None, nI: int = None nO: int = None, nI: int = None
) -> Model[Floats2d, Floats2d]: ) -> Model[Floats2d, Floats2d]:
@ -650,7 +650,7 @@ that has the full implementation.
```python ```python
### The layer that creates the instance tensor ### The layer that creates the instance tensor
@spacy.registry.architectures.register("rel_instance_tensor.v1") @spacy.registry.architectures("rel_instance_tensor.v1")
def create_tensors( def create_tensors(
tok2vec: Model[List[Doc], List[Floats2d]], tok2vec: Model[List[Doc], List[Floats2d]],
pooling: Model[Ragged, Floats2d], pooling: Model[Ragged, Floats2d],
@ -731,7 +731,7 @@ are within a **maximum distance** (in number of tokens) of each other:
```python ```python
### Candidate generation ### Candidate generation
@spacy.registry.misc.register("rel_instance_generator.v1") @spacy.registry.misc("rel_instance_generator.v1")
def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]: def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]: def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
candidates = [] candidates = []

@ -585,7 +585,7 @@ print(ent_francisco) # ['Francisco', 'I', 'GPE']
To ensure that the sequence of token annotations remains consistent, you have to To ensure that the sequence of token annotations remains consistent, you have to
set entity annotations **at the document level**. However, you can't write set entity annotations **at the document level**. However, you can't write
directly to the `token.ent_iob` or `token.ent_type` attributes, so the easiest directly to the `token.ent_iob` or `token.ent_type` attributes, so the easiest
way to set entities is to assign to the [`doc.ents`](/api/doc#ents) attribute way to set entities is to use the [`doc.set_ents`](/api/doc#set_ents) function
and create the new entity as a [`Span`](/api/span). and create the new entity as a [`Span`](/api/span).
```python ```python

@ -95,6 +95,14 @@ spaCy's binary `.spacy` format. You can either include the data paths in the
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy $ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
``` ```
> #### Tip: Enable your GPU
>
> Use the `--gpu-id` option to select the GPU:
>
> ```cli
> $ python -m spacy train config.cfg --gpu-id 0
> ```
<Accordion title="How are the config recommendations generated?" id="quickstart-source" spaced> <Accordion title="How are the config recommendations generated?" id="quickstart-source" spaced>
The recommended config settings generated by the quickstart widget and the The recommended config settings generated by the quickstart widget and the

@ -603,6 +603,7 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
| `GoldParse` | [`Example`](/api/example) | | `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) | | `GoldCorpus` | [`Corpus`](/api/corpus) |
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) | | `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
| `KnowledgeBase.get_candidates` | [`KnowledgeBase.get_alias_candidates`](/api/kb#get_alias_candidates) |
| `Matcher.pipe`, `PhraseMatcher.pipe` | not needed | | `Matcher.pipe`, `PhraseMatcher.pipe` | not needed |
| `gold.offsets_from_biluo_tags`, `gold.spans_from_biluo_tags`, `gold.biluo_tags_from_offsets` | [`training.biluo_tags_to_offsets`](/api/top-level#biluo_tags_to_offsets), [`training.biluo_tags_to_spans`](/api/top-level#biluo_tags_to_spans), [`training.offsets_to_biluo_tags`](/api/top-level#offsets_to_biluo_tags) | | `gold.offsets_from_biluo_tags`, `gold.spans_from_biluo_tags`, `gold.biluo_tags_from_offsets` | [`training.biluo_tags_to_offsets`](/api/top-level#biluo_tags_to_offsets), [`training.biluo_tags_to_spans`](/api/top-level#biluo_tags_to_spans), [`training.offsets_to_biluo_tags`](/api/top-level#offsets_to_biluo_tags) |
| `spacy init-model` | [`spacy init vectors`](/api/cli#init-vectors) | | `spacy init-model` | [`spacy init vectors`](/api/cli#init-vectors) |

@ -58,7 +58,7 @@
}, },
"category": ["pipeline"], "category": ["pipeline"],
"tags": ["sentiment", "textblob"] "tags": ["sentiment", "textblob"]
}, },
{ {
"id": "spacy-ray", "id": "spacy-ray",
"title": "spacy-ray", "title": "spacy-ray",
@ -2647,14 +2647,14 @@
"github": "medspacy" "github": "medspacy"
} }
}, },
{ {
"id": "rita-dsl", "id": "rita-dsl",
"title": "RITA DSL", "title": "RITA DSL",
"slogan": "Domain Specific Language for creating language rules", "slogan": "Domain Specific Language for creating language rules",
"github": "zaibacu/rita-dsl", "github": "zaibacu/rita-dsl",
"description": "A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format", "description": "A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format",
"pip": "rita-dsl", "pip": "rita-dsl",
"thumb": "https://raw.githubusercontent.com/zaibacu/rita-dsl/master/docs/assets/logo-100px.png", "thumb": "https://raw.githubusercontent.com/zaibacu/rita-dsl/master/docs/assets/logo-100px.png",
"code_language": "python", "code_language": "python",
"code_example": [ "code_example": [
"import spacy", "import spacy",
@ -2754,14 +2754,41 @@
"{", "{",
" var lexeme = doc.Vocab[word.Text];", " var lexeme = doc.Vocab[word.Text];",
" Console.WriteLine($@\"{lexeme.Text} {lexeme.Orth} {lexeme.Shape} {lexeme.Prefix} {lexeme.Suffix} {lexeme.IsAlpha} {lexeme.IsDigit} {lexeme.IsTitle} {lexeme.Lang}\");", " Console.WriteLine($@\"{lexeme.Text} {lexeme.Orth} {lexeme.Shape} {lexeme.Prefix} {lexeme.Suffix} {lexeme.IsAlpha} {lexeme.IsDigit} {lexeme.IsTitle} {lexeme.Lang}\");",
"}" "}"
], ],
"code_language": "csharp", "code_language": "csharp",
"author": "Antonio Miras", "author": "Antonio Miras",
"author_links": { "author_links": {
"github": "AMArostegui" "github": "AMArostegui"
}, },
"category": ["nonpython"] "category": ["nonpython"]
},
{
"id": "ruts",
"title": "ruTS",
"slogan": "A library for statistics extraction from texts in Russian",
"description": "The library allows extracting the following statistics from a text: basic statistics, readability metrics, lexical diversity metrics, morphological statistics",
"github": "SergeyShk/ruTS",
"pip": "ruts",
"code_example": [
"import spacy",
"import ruts",
"",
"nlp = spacy.load('ru_core_news_sm')",
"nlp.add_pipe('basic', last=True)",
"doc = nlp('мама мыла раму')",
"doc._.basic.get_stats()"
],
"code_language": "python",
"thumb": "https://habrastorage.org/webt/6z/le/fz/6zlefzjavzoqw_wymz7v3pwgfp4.png",
"image": "https://clipartart.com/images/free-tree-roots-clipart-black-and-white-2.png",
"author": "Sergey Shkarin",
"author_links": {
"twitter": "shk_sergey",
"github": "SergeyShk"
},
"category": ["pipeline", "standalone"],
"tags": ["Text Analytics", "Russian"]
} }
], ],