Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-03 22:06:37 +03:00)
Merge branch 'master' into spacy.io
commit 9280e844fb
106
.github/contributors/dardoria.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Boian Tzonev         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 18.02.2021           |
+| GitHub username                | dardoria             |
+| Website (optional)             |                      |
@@ -10,7 +10,7 @@ wasabi>=0.8.1,<1.1.0
 srsly>=2.4.0,<3.0.0
 catalogue>=2.0.1,<2.1.0
 typer>=0.3.0,<0.4.0
-pathy
+pathy>=0.3.5
 # Third party dependencies
 numpy>=1.15.0
 requests>=2.13.0,<3.0.0
@@ -21,11 +21,11 @@ jinja2
 setuptools
 packaging>=20.0
 importlib_metadata>=0.20; python_version < "3.8"
-typing_extensions>=3.7.4; python_version < "3.8"
+typing_extensions>=3.7.4.1,<4.0.0.0; python_version < "3.8"
 # Development dependencies
 cython>=0.25
 pytest>=5.2.0
 pytest-timeout>=1.3.0,<2.0.0
 mock>=2.0.0,<3.0.0
 flake8>=3.5.0,<3.6.0
-hypothesis
+hypothesis>=3.27.0,<7.0.0
@@ -47,7 +47,7 @@ install_requires =
     srsly>=2.4.0,<3.0.0
     catalogue>=2.0.1,<2.1.0
     typer>=0.3.0,<0.4.0
-    pathy
+    pathy>=0.3.5
     # Third-party dependencies
     tqdm>=4.38.0,<5.0.0
     numpy>=1.15.0
@@ -58,7 +58,7 @@ install_requires =
     setuptools
     packaging>=20.0
     importlib_metadata>=0.20; python_version < "3.8"
-    typing_extensions>=3.7.4; python_version < "3.8"
+    typing_extensions>=3.7.4,<4.0.0.0; python_version < "3.8"
 
 [options.entry_points]
 console_scripts =
3
setup.py
|
@@ -204,7 +204,7 @@ def setup_package():
         for name in MOD_NAMES:
             mod_path = name.replace(".", "/") + ".pyx"
             ext = Extension(
-                name, [mod_path], language="c++", extra_compile_args=["-std=c++11"]
+                name, [mod_path], language="c++", include_dirs=include_dirs, extra_compile_args=["-std=c++11"]
             )
             ext_modules.append(ext)
         print("Cythonizing sources")
@@ -216,7 +216,6 @@ def setup_package():
             version=about["__version__"],
             ext_modules=ext_modules,
             cmdclass={"build_ext": build_ext_subclass},
-            include_dirs=include_dirs,
             package_data={"": ["*.pyx", "*.pxd", "*.pxi"]},
         )
 
@@ -11,6 +11,7 @@ from click.parser import split_arg_string
 from typer.main import get_command
 from contextlib import contextmanager
 from thinc.api import Config, ConfigValidationError, require_gpu
+from thinc.util import has_cupy, gpu_is_available
 from configparser import InterpolationError
 import os
 
@@ -510,3 +511,5 @@ def setup_gpu(use_gpu: int) -> None:
         require_gpu(use_gpu)
     else:
         msg.info("Using CPU")
+        if has_cupy and gpu_is_available():
+            msg.info("To switch to GPU 0, use the option: --gpu-id 0")
@@ -22,7 +22,7 @@ from ..training.converters import conllu_to_docs
 CONVERTERS = {
     "conllubio": conllu_to_docs,
     "conllu": conllu_to_docs,
-    "conll": conllu_to_docs,
+    "conll": conll_ner_to_docs,
     "ner": conll_ner_to_docs,
     "iob": iob_to_docs,
     "json": json_to_docs,
@@ -132,7 +132,7 @@ def evaluate(
 
     if displacy_path:
         factory_names = [nlp.get_pipe_meta(pipe).factory for pipe in nlp.pipe_names]
-        docs = [ex.predicted for ex in dev_dataset]
+        docs = list(nlp.pipe(ex.reference.text for ex in dev_dataset[:displacy_limit]))
         render_deps = "parser" in factory_names
         render_ents = "ner" in factory_names
         render_parses(
@@ -16,7 +16,11 @@ gpu_allocator = null
 
 [nlp]
 lang = "{{ lang }}"
+{%- if "tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "entity_linker" in components or (("textcat" in components or "textcat_multilabel" in components) and optimize == "accuracy") -%}
 {%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components %}
+{%- else -%}
+{%- set full_pipeline = components %}
+{%- endif %}
 pipeline = {{ full_pipeline|pprint()|replace("'", '"')|safe }}
 batch_size = {{ 128 if hardware == "gpu" else 1000 }}
 
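The template condition in this hunk can be read as an ordinary function. The sketch below restates it in Python rather than Jinja; the names `build_full_pipeline`, `NEEDS_EMBEDDING` and `TEXTCAT` are illustrative and are not part of spaCy.

```python
from typing import List

# Components that, per the template condition, always get an embedding layer.
NEEDS_EMBEDDING = {"tagger", "morphologizer", "parser", "ner", "entity_linker"}
# Text classifiers only get one when optimizing for accuracy.
TEXTCAT = {"textcat", "textcat_multilabel"}


def build_full_pipeline(
    components: List[str], use_transformer: bool, optimize: str
) -> List[str]:
    """Prepend "transformer" or "tok2vec" only when the condition holds."""
    selected = set(components)
    wants_embedding = bool(NEEDS_EMBEDDING & selected) or (
        bool(TEXTCAT & selected) and optimize == "accuracy"
    )
    if wants_embedding:
        return ["transformer" if use_transformer else "tok2vec"] + list(components)
    return list(components)
```

Under this reading, a pipeline containing only `textcat` and optimized for efficiency no longer receives a `tok2vec` component.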
@ -22,21 +22,21 @@ ar:
|
||||||
bg:
|
bg:
|
||||||
word_vectors: null
|
word_vectors: null
|
||||||
transformer:
|
transformer:
|
||||||
efficiency:
|
efficiency:
|
||||||
name: iarfmoose/roberta-base-bulgarian
|
name: iarfmoose/roberta-base-bulgarian
|
||||||
size_factor: 3
|
size_factor: 3
|
||||||
accuracy:
|
accuracy:
|
||||||
name: iarfmoose/roberta-base-bulgarian
|
name: iarfmoose/roberta-base-bulgarian
|
||||||
size_factor: 3
|
size_factor: 3
|
||||||
bn:
|
bn:
|
||||||
word_vectors: null
|
word_vectors: null
|
||||||
transformer:
|
transformer:
|
||||||
efficiency:
|
efficiency:
|
||||||
name: sagorsarker/bangla-bert-base
|
name: sagorsarker/bangla-bert-base
|
||||||
size_factor: 3
|
size_factor: 3
|
||||||
accuracy:
|
accuracy:
|
||||||
name: sagorsarker/bangla-bert-base
|
name: sagorsarker/bangla-bert-base
|
||||||
size_factor: 3
|
size_factor: 3
|
||||||
da:
|
da:
|
||||||
word_vectors: da_core_news_lg
|
word_vectors: da_core_news_lg
|
||||||
transformer:
|
transformer:
|
||||||
|
|
|
@@ -321,7 +321,8 @@ class Errors:
             "https://spacy.io/api/top-level#util.filter_spans")
     E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A "
             "token can only be part of one entity, so make sure the entities "
-            "you're setting don't overlap.")
+            "you're setting don't overlap. To work with overlapping entities, "
+            "consider using doc.spans instead.")
     E106 = ("Can't find `doc._.{attr}` attribute specified in the underscore "
             "settings: {opts}")
     E107 = ("Value of `doc._.{attr}` is not JSON-serializable: {value}")
@@ -486,6 +487,15 @@ class Errors:
     E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.")
 
     # New errors added in v3.x
+
+    E879 = ("Unexpected type for 'spans' data. Provide a dictionary mapping keys to "
+            "a list of spans, with each span represented by a tuple (start_char, end_char). "
+            "The tuple can be optionally extended with a label and a KB ID.")
+    E880 = ("The 'wandb' library could not be found - did you install it? "
+            "Alternatively, specify the 'ConsoleLogger' in the 'training.logger' "
+            "config section, instead of the 'WandbLogger'.")
+    E885 = ("entity_linker.set_kb received an invalid 'kb_loader' argument: expected "
+            "a callable function, but got: {arg_type}")
     E886 = ("Can't replace {name} -> {tok2vec} listeners: path '{path}' not "
             "found in config for component '{name}'.")
     E887 = ("Can't replace {name} -> {tok2vec} listeners: the paths to replace "
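E103 fires when two entities share a token. As a rough illustration of that constraint, the sketch below checks a list of token-offset spans for overlap; `find_overlap` and the `(start, end)` tuple representation (end exclusive) are assumptions made for this example and are not how spaCy performs the check internally.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start_token, end_token), end exclusive


def find_overlap(spans: List[Span]) -> Optional[Tuple[Span, Span]]:
    """Return the first pair of spans that share a token, or None."""
    ordered = sorted(spans)
    # After sorting by start, any overlap shows up between neighbours.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[0] < prev[1]:  # current span starts before the previous one ends
            return prev, cur
    return None
```

Spans that merely touch, such as `(0, 2)` and `(2, 4)`, do not count as overlapping here because the end offset is exclusive.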
@@ -1,9 +1,21 @@
 from .stop_words import STOP_WORDS
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+from .lex_attrs import LEX_ATTRS
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+
 from ...language import Language
+from ...attrs import LANG
+from ...util import update_exc
 
 
 class BulgarianDefaults(Language.Defaults):
+    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters[LANG] = lambda text: "bg"
+
+    lex_attr_getters.update(LEX_ATTRS)
+
     stop_words = STOP_WORDS
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
 
 
 class Bulgarian(Language):
88
spacy/lang/bg/lex_attrs.py
Normal file
@@ -0,0 +1,88 @@
+from ...attrs import LIKE_NUM
+
+
+_num_words = [
+    "нула",
+    "едно",
+    "един",
+    "една",
+    "две",
+    "три",
+    "четири",
+    "пет",
+    "шест",
+    "седем",
+    "осем",
+    "девет",
+    "десет",
+    "единадесет",
+    "единайсет",
+    "дванадесет",
+    "дванайсет",
+    "тринадесет",
+    "тринайсет",
+    "четиринадесет",
+    "четиринайсет",
+    "петнадесет",
+    "петнайсет",
+    "шестнадесет",
+    "шестнайсет",
+    "седемнадесет",
+    "седемнайсет",
+    "осемнадесет",
+    "осемнайсет",
+    "деветнадесет",
+    "деветнайсет",
+    "двадесет",
+    "двайсет",
+    "тридесет",
+    "трийсет",
+    "четиридесет",
+    "четиресет",
+    "петдесет",
+    "шестдесет",
+    "шейсет",
+    "седемдесет",
+    "осемдесет",
+    "деветдесет",
+    "сто",
+    "двеста",
+    "триста",
+    "четиристотин",
+    "петстотин",
+    "шестстотин",
+    "седемстотин",
+    "осемстотин",
+    "деветстотин",
+    "хиляда",
+    "милион",
+    "милиона",
+    "милиард",
+    "милиарда",
+    "трилион",
+    "трилиона",
+    "билион",
+    "билиона",
+    "квадрилион",
+    "квадрилиона",
+    "квинтилион",
+    "квинтилиона",
+]
+
+
+def like_num(text):
+    if text.startswith(("+", "-", "±", "~")):
+        text = text[1:]
+    text = text.replace(",", "").replace(".", "")
+    if text.isdigit():
+        return True
+    if text.count("/") == 1:
+        num, denom = text.split("/")
+        if num.isdigit() and denom.isdigit():
+            return True
+    if text.lower() in _num_words:
+        return True
+    return False
+
+
+LEX_ATTRS = {LIKE_NUM: like_num}
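A minimal, standalone sketch of how `like_num` behaves, assuming a shortened stand-in for the word list (only a handful of entries are reproduced here) and no spaCy imports:

```python
# Shortened stand-in for the full word list; only a few entries are shown.
_num_words = ["нула", "едно", "две", "три", "десет", "сто", "хиляда", "милион"]


def like_num(text: str) -> bool:
    # Drop one leading sign, then thousands and decimal separators.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Spelled-out numbers, matched case-insensitively.
    return text.lower() in _num_words


# Adjacent string literals are concatenated by Python, so a list entry
# without a trailing comma silently merges with the entry after it.
merged = ["четиринайсет" "петнадесет"]
```

Word lists like this one are worth checking for missing commas: `merged` above holds a single string, and neither of the two intended words would match on its own.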
68
spacy/lang/bg/tokenizer_exceptions.py
Normal file
@@ -0,0 +1,68 @@
+from ...symbols import ORTH, NORM
+
+
+_exc = {}
+
+
+_abbr_exc = [
+    {ORTH: "м", NORM: "метър"},
+    {ORTH: "мм", NORM: "милиметър"},
+    {ORTH: "см", NORM: "сантиметър"},
+    {ORTH: "дм", NORM: "дециметър"},
+    {ORTH: "км", NORM: "километър"},
+    {ORTH: "кг", NORM: "килограм"},
+    {ORTH: "мг", NORM: "милиграм"},
+    {ORTH: "г", NORM: "грам"},
+    {ORTH: "т", NORM: "тон"},
+    {ORTH: "хл", NORM: "хектолитър"},
+    {ORTH: "дкл", NORM: "декалитър"},
+    {ORTH: "л", NORM: "литър"},
+]
+for abbr in _abbr_exc:
+    _exc[abbr[ORTH]] = [abbr]
+
+_abbr_line_exc = [
+    {ORTH: "г-жа", NORM: "госпожа"},
+    {ORTH: "г-н", NORM: "господин"},
+    {ORTH: "г-ца", NORM: "госпожица"},
+    {ORTH: "д-р", NORM: "доктор"},
+    {ORTH: "о-в", NORM: "остров"},
+    {ORTH: "п-в", NORM: "полуостров"},
+]
+
+for abbr in _abbr_line_exc:
+    _exc[abbr[ORTH]] = [abbr]
+
+_abbr_dot_exc = [
+    {ORTH: "акад.", NORM: "академик"},
+    {ORTH: "ал.", NORM: "алинея"},
+    {ORTH: "арх.", NORM: "архитект"},
+    {ORTH: "бл.", NORM: "блок"},
+    {ORTH: "бр.", NORM: "брой"},
+    {ORTH: "бул.", NORM: "булевард"},
+    {ORTH: "в.", NORM: "век"},
+    {ORTH: "г.", NORM: "година"},
+    {ORTH: "гр.", NORM: "град"},
+    {ORTH: "ж.р.", NORM: "женски род"},
+    {ORTH: "инж.", NORM: "инженер"},
+    {ORTH: "лв.", NORM: "лев"},
+    {ORTH: "м.р.", NORM: "мъжки род"},
+    {ORTH: "мат.", NORM: "математика"},
+    {ORTH: "мед.", NORM: "медицина"},
+    {ORTH: "пл.", NORM: "площад"},
+    {ORTH: "проф.", NORM: "професор"},
+    {ORTH: "с.", NORM: "село"},
+    {ORTH: "с.р.", NORM: "среден род"},
+    {ORTH: "св.", NORM: "свети"},
+    {ORTH: "сп.", NORM: "списание"},
+    {ORTH: "стр.", NORM: "страница"},
+    {ORTH: "ул.", NORM: "улица"},
+    {ORTH: "чл.", NORM: "член"},
+
+]
+
+for abbr in _abbr_dot_exc:
+    _exc[abbr[ORTH]] = [abbr]
+
+
+TOKENIZER_EXCEPTIONS = _exc
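The shape of the exceptions table can be sketched without spaCy. In the example below, plain strings stand in for the `ORTH` and `NORM` symbol IDs, and `norm_for` is a hypothetical helper added only for illustration; the real tokenizer consumes the table differently.

```python
# Plain strings stand in for the ORTH and NORM symbol IDs from spacy.symbols.
ORTH, NORM = "ORTH", "NORM"

_abbr = [
    {ORTH: "д-р", NORM: "доктор"},
    {ORTH: "ул.", NORM: "улица"},
    {ORTH: "км", NORM: "километър"},
]

# Each exception maps the exact surface string to a list of token attribute
# dicts; a single-item list keeps the string together as one token.
exceptions = {entry[ORTH]: [entry] for entry in _abbr}


def norm_for(text: str) -> str:
    """Look up the normalized form, falling back to the text itself."""
    tokens = exceptions.get(text)
    return tokens[0][NORM] if tokens else text
```

Keying the table by the surface string means an abbreviation such as `ул.` keeps its trailing period instead of having it split off as punctuation.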
@@ -23,8 +23,6 @@ class RussianLemmatizer(Lemmatizer):
         mode: str = "pymorphy2",
         overwrite: bool = False,
     ) -> None:
-        super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
-
         try:
             from pymorphy2 import MorphAnalyzer
         except ImportError:
@@ -34,6 +32,7 @@ class RussianLemmatizer(Lemmatizer):
             ) from None
         if RussianLemmatizer._morph is None:
             RussianLemmatizer._morph = MorphAnalyzer()
+        super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
 
     def pymorphy2_lemmatize(self, token: Token) -> List[str]:
         string = token.text
@@ -7,6 +7,8 @@ from ...vocab import Vocab
 
 
 class UkrainianLemmatizer(RussianLemmatizer):
+    _morph = None
+
     def __init__(
         self,
         vocab: Vocab,
@@ -16,7 +18,6 @@ class UkrainianLemmatizer(RussianLemmatizer):
         mode: str = "pymorphy2",
         overwrite: bool = False,
     ) -> None:
-        super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
         try:
             from pymorphy2 import MorphAnalyzer
         except ImportError:
@@ -27,3 +28,4 @@ class UkrainianLemmatizer(RussianLemmatizer):
             ) from None
         if UkrainianLemmatizer._morph is None:
             UkrainianLemmatizer._morph = MorphAnalyzer(lang="uk")
+        super().__init__(vocab, model, name, mode=mode, overwrite=overwrite)
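One plausible reading of why the subclass gains its own `_morph = None` is Python's class attribute lookup: without its own slot, a subclass sees the parent's cached value. The sketch below demonstrates that behaviour with stand-in classes and string placeholders instead of `MorphAnalyzer` objects; none of these names come from spaCy.

```python
class BaseLemmatizer:
    _morph = None  # cache shared at class level

    def __init__(self):
        if BaseLemmatizer._morph is None:
            BaseLemmatizer._morph = "analyzer(ru)"  # stand-in for MorphAnalyzer()


class SharedCache(BaseLemmatizer):
    # No own _morph: lookup falls through to BaseLemmatizer._morph.
    def __init__(self):
        super().__init__()
        if SharedCache._morph is None:  # False once the base cache is filled
            SharedCache._morph = "analyzer(uk)"


class OwnCache(BaseLemmatizer):
    _morph = None  # own slot, independent of the base class cache

    def __init__(self):
        super().__init__()
        if OwnCache._morph is None:
            OwnCache._morph = "analyzer(uk)"


SharedCache()
OwnCache()
```

After both constructors run, `SharedCache._morph` still resolves to the base class value, while `OwnCache._morph` holds its own analyzer.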
@@ -684,12 +684,12 @@ class Language:
         # TODO: handle errors and mismatches (vectors etc.)
         if not isinstance(source, self.__class__):
             raise ValueError(Errors.E945.format(name=source_name, source=type(source)))
-        if not source.has_pipe(source_name):
+        if not source_name in source.component_names:
             raise KeyError(
                 Errors.E944.format(
                     name=source_name,
                     model=f"{source.meta['lang']}_{source.meta['name']}",
-                    opts=", ".join(source.pipe_names),
+                    opts=", ".join(source.component_names),
                 )
             )
         pipe = source.get_pipe(source_name)
@@ -8,7 +8,7 @@ from ...kb import KnowledgeBase, Candidate, get_candidates
 from ...vocab import Vocab
 
 
-@registry.architectures.register("spacy.EntityLinker.v1")
+@registry.architectures("spacy.EntityLinker.v1")
 def build_nel_encoder(tok2vec: Model, nO: Optional[int] = None) -> Model:
     with Model.define_operators({">>": chain, "**": clone}):
         token_width = tok2vec.get_dim("nO")
@@ -25,7 +25,7 @@ def build_nel_encoder(tok2vec: Model, nO: Optional[int] = None) -> Model:
     return model
 
 
-@registry.misc.register("spacy.KBFromFile.v1")
+@registry.misc("spacy.KBFromFile.v1")
 def load_kb(kb_path: Path) -> Callable[[Vocab], KnowledgeBase]:
     def kb_from_file(vocab):
         kb = KnowledgeBase(vocab, entity_vector_length=1)
@@ -35,7 +35,7 @@ def load_kb(kb_path: Path) -> Callable[[Vocab], KnowledgeBase]:
     return kb_from_file
 
 
-@registry.misc.register("spacy.EmptyKB.v1")
+@registry.misc("spacy.EmptyKB.v1")
 def empty_kb(entity_vector_length: int) -> Callable[[Vocab], KnowledgeBase]:
     def empty_kb_factory(vocab):
         return KnowledgeBase(vocab=vocab, entity_vector_length=entity_vector_length)
@@ -43,6 +43,6 @@ def empty_kb(entity_vector_length: int) -> Callable[[Vocab], KnowledgeBase]:
     return empty_kb_factory
 
 
-@registry.misc.register("spacy.CandidateGenerator.v1")
+@registry.misc("spacy.CandidateGenerator.v1")
 def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]:
     return get_candidates
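These hunks swap `@registry.x.register("name")` for `@registry.x("name")`. Assuming the two spellings are equivalent, as the change suggests, the relationship can be sketched with a toy registry. `MiniRegistry` and the `demo.*` names are hypothetical and only illustrate the pattern; they are not the registry implementation spaCy uses.

```python
from typing import Callable, Dict


class MiniRegistry:
    """Toy registry in which calling the object is shorthand for .register()."""

    def __init__(self) -> None:
        self._funcs: Dict[str, Callable] = {}

    def register(self, name: str) -> Callable[[Callable], Callable]:
        def decorator(func: Callable) -> Callable:
            self._funcs[name] = func
            return func  # hand the function back unchanged

        return decorator

    def __call__(self, name: str) -> Callable[[Callable], Callable]:
        return self.register(name)

    def get(self, name: str) -> Callable:
        return self._funcs[name]


misc = MiniRegistry()


@misc.register("demo.Long.v1")
def long_form() -> str:
    return "long"


@misc("demo.Short.v1")
def short_form() -> str:
    return "short"
```

Both decorators leave the function registered under its string name, so the shorter form changes only how the registration is written.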
@@ -16,7 +16,7 @@ if TYPE_CHECKING:
     from ...tokens import Doc  # noqa: F401
 
 
-@registry.architectures.register("spacy.PretrainVectors.v1")
+@registry.architectures("spacy.PretrainVectors.v1")
 def create_pretrain_vectors(
     maxout_pieces: int, hidden_size: int, loss: str
 ) -> Callable[["Vocab", Model], Model]:
@@ -40,7 +40,7 @@ def create_pretrain_vectors(
     return create_vectors_objective
 
 
-@registry.architectures.register("spacy.PretrainCharacters.v1")
+@registry.architectures("spacy.PretrainCharacters.v1")
 def create_pretrain_characters(
     maxout_pieces: int, hidden_size: int, n_characters: int
 ) -> Callable[["Vocab", Model], Model]:
@@ -10,7 +10,7 @@ from ..tb_framework import TransitionModel
 from ...tokens import Doc
 
 
-@registry.architectures.register("spacy.TransitionBasedParser.v1")
+@registry.architectures("spacy.TransitionBasedParser.v1")
 def transition_parser_v1(
     tok2vec: Model[List[Doc], List[Floats2d]],
     state_type: Literal["parser", "ner"],
@@ -31,7 +31,7 @@ def transition_parser_v1(
     )
 
 
-@registry.architectures.register("spacy.TransitionBasedParser.v2")
+@registry.architectures("spacy.TransitionBasedParser.v2")
 def transition_parser_v2(
     tok2vec: Model[List[Doc], List[Floats2d]],
     state_type: Literal["parser", "ner"],
@@ -6,7 +6,7 @@ from ...util import registry
 from ...tokens import Doc
 
 
-@registry.architectures.register("spacy.Tagger.v1")
+@registry.architectures("spacy.Tagger.v1")
 def build_tagger_model(
     tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None
 ) -> Model[List[Doc], List[Floats2d]]:
@@ -15,7 +15,7 @@ from ...tokens import Doc
 from .tok2vec import get_tok2vec_width
 
 
-@registry.architectures.register("spacy.TextCatCNN.v1")
+@registry.architectures("spacy.TextCatCNN.v1")
 def build_simple_cnn_text_classifier(
     tok2vec: Model, exclusive_classes: bool, nO: Optional[int] = None
 ) -> Model[List[Doc], Floats2d]:
@@ -41,7 +41,7 @@ def build_simple_cnn_text_classifier(
     return model
 
 
-@registry.architectures.register("spacy.TextCatBOW.v1")
+@registry.architectures("spacy.TextCatBOW.v1")
 def build_bow_text_classifier(
     exclusive_classes: bool,
     ngram_size: int,
@@ -60,7 +60,7 @@ def build_bow_text_classifier(
     return model
 
 
-@registry.architectures.register("spacy.TextCatEnsemble.v2")
+@registry.architectures("spacy.TextCatEnsemble.v2")
 def build_text_classifier_v2(
     tok2vec: Model[List[Doc], List[Floats2d]],
     linear_model: Model[List[Doc], Floats2d],
@@ -112,7 +112,7 @@ def init_ensemble_textcat(model, X, Y) -> Model:
     return model
 
 
-@registry.architectures.register("spacy.TextCatLowData.v1")
+@registry.architectures("spacy.TextCatLowData.v1")
 def build_text_classifier_lowdata(
     width: int, dropout: Optional[float], nO: Optional[int] = None
 ) -> Model[List[Doc], Floats2d]:
@@ -14,7 +14,7 @@ from ...pipeline.tok2vec import Tok2VecListener
 from ...attrs import intify_attr
 
 
-@registry.architectures.register("spacy.Tok2VecListener.v1")
+@registry.architectures("spacy.Tok2VecListener.v1")
 def tok2vec_listener_v1(width: int, upstream: str = "*"):
     tok2vec = Tok2VecListener(upstream_name=upstream, width=width)
     return tok2vec
@@ -31,7 +31,7 @@ def get_tok2vec_width(model: Model):
     return nO
 
 
-@registry.architectures.register("spacy.HashEmbedCNN.v1")
+@registry.architectures("spacy.HashEmbedCNN.v1")
 def build_hash_embed_cnn_tok2vec(
     *,
     width: int,
@@ -87,7 +87,7 @@ def build_hash_embed_cnn_tok2vec(
     )
 
 
-@registry.architectures.register("spacy.Tok2Vec.v2")
+@registry.architectures("spacy.Tok2Vec.v2")
 def build_Tok2Vec_model(
     embed: Model[List[Doc], List[Floats2d]],
     encode: Model[List[Floats2d], List[Floats2d]],
@@ -108,7 +108,7 @@ def build_Tok2Vec_model(
     return tok2vec
 
 
-@registry.architectures.register("spacy.MultiHashEmbed.v1")
+@registry.architectures("spacy.MultiHashEmbed.v1")
 def MultiHashEmbed(
     width: int,
     attrs: List[Union[str, int]],
@@ -182,7 +182,7 @@ def MultiHashEmbed(
     return model
 
 
-@registry.architectures.register("spacy.CharacterEmbed.v1")
+@registry.architectures("spacy.CharacterEmbed.v1")
 def CharacterEmbed(
     width: int,
||||||
rows: int,
|
rows: int,
|
||||||
|
@ -255,7 +255,7 @@ def CharacterEmbed(
|
||||||
return model
|
return model
|
||||||
|
|
||||||
|
|
||||||
@registry.architectures.register("spacy.MaxoutWindowEncoder.v2")
|
@registry.architectures("spacy.MaxoutWindowEncoder.v2")
|
||||||
def MaxoutWindowEncoder(
|
def MaxoutWindowEncoder(
|
||||||
width: int, window_size: int, maxout_pieces: int, depth: int
|
width: int, window_size: int, maxout_pieces: int, depth: int
|
||||||
) -> Model[List[Floats2d], List[Floats2d]]:
|
) -> Model[List[Floats2d], List[Floats2d]]:
|
||||||
|
@ -287,7 +287,7 @@ def MaxoutWindowEncoder(
|
||||||
return with_array(model, pad=receptive_field)
|
return with_array(model, pad=receptive_field)
|
||||||
|
|
||||||
|
|
||||||
@registry.architectures.register("spacy.MishWindowEncoder.v2")
|
@registry.architectures("spacy.MishWindowEncoder.v2")
|
||||||
def MishWindowEncoder(
|
def MishWindowEncoder(
|
||||||
width: int, window_size: int, depth: int
|
width: int, window_size: int, depth: int
|
||||||
) -> Model[List[Floats2d], List[Floats2d]]:
|
) -> Model[List[Floats2d], List[Floats2d]]:
|
||||||
|
@ -310,7 +310,7 @@ def MishWindowEncoder(
|
||||||
return with_array(model)
|
return with_array(model)
|
||||||
|
|
||||||
|
|
||||||
@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
|
@registry.architectures("spacy.TorchBiLSTMEncoder.v1")
|
||||||
def BiLSTMEncoder(
|
def BiLSTMEncoder(
|
||||||
width: int, depth: int, dropout: float
|
width: int, depth: int, dropout: float
|
||||||
) -> Model[List[Floats2d], List[Floats2d]]:
|
) -> Model[List[Floats2d], List[Floats2d]]:
|
||||||
|
|
|
@ -45,6 +45,7 @@ DEFAULT_NEL_MODEL = Config().from_str(default_model_config)["model"]
|
||||||
default_config={
|
default_config={
|
||||||
"model": DEFAULT_NEL_MODEL,
|
"model": DEFAULT_NEL_MODEL,
|
||||||
"labels_discard": [],
|
"labels_discard": [],
|
||||||
|
"n_sents": 0,
|
||||||
"incl_prior": True,
|
"incl_prior": True,
|
||||||
"incl_context": True,
|
"incl_context": True,
|
||||||
"entity_vector_length": 64,
|
"entity_vector_length": 64,
|
||||||
|
@ -62,6 +63,7 @@ def make_entity_linker(
|
||||||
model: Model,
|
model: Model,
|
||||||
*,
|
*,
|
||||||
labels_discard: Iterable[str],
|
labels_discard: Iterable[str],
|
||||||
|
n_sents: int,
|
||||||
incl_prior: bool,
|
incl_prior: bool,
|
||||||
incl_context: bool,
|
incl_context: bool,
|
||||||
entity_vector_length: int,
|
entity_vector_length: int,
|
||||||
|
@ -73,6 +75,7 @@ def make_entity_linker(
|
||||||
representations. Given a batch of Doc objects, it should return a single
|
representations. Given a batch of Doc objects, it should return a single
|
||||||
array, with one row per item in the batch.
|
array, with one row per item in the batch.
|
||||||
labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
|
labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
|
||||||
|
n_sents (int): The number of neighbouring sentences to take into account.
|
||||||
incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
|
incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
|
||||||
incl_context (bool): Whether or not to include the local context in the model.
|
incl_context (bool): Whether or not to include the local context in the model.
|
||||||
entity_vector_length (int): Size of encoding vectors in the KB.
|
entity_vector_length (int): Size of encoding vectors in the KB.
|
||||||
|
@ -84,6 +87,7 @@ def make_entity_linker(
|
||||||
model,
|
model,
|
||||||
name,
|
name,
|
||||||
labels_discard=labels_discard,
|
labels_discard=labels_discard,
|
||||||
|
n_sents=n_sents,
|
||||||
incl_prior=incl_prior,
|
incl_prior=incl_prior,
|
||||||
incl_context=incl_context,
|
incl_context=incl_context,
|
||||||
entity_vector_length=entity_vector_length,
|
entity_vector_length=entity_vector_length,
|
||||||
|
@ -106,6 +110,7 @@ class EntityLinker(TrainablePipe):
|
||||||
name: str = "entity_linker",
|
name: str = "entity_linker",
|
||||||
*,
|
*,
|
||||||
labels_discard: Iterable[str],
|
labels_discard: Iterable[str],
|
||||||
|
n_sents: int,
|
||||||
incl_prior: bool,
|
incl_prior: bool,
|
||||||
incl_context: bool,
|
incl_context: bool,
|
||||||
entity_vector_length: int,
|
entity_vector_length: int,
|
||||||
|
@ -118,6 +123,7 @@ class EntityLinker(TrainablePipe):
|
||||||
name (str): The component instance name, used to add entries to the
|
name (str): The component instance name, used to add entries to the
|
||||||
losses during training.
|
losses during training.
|
||||||
labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
|
labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
|
||||||
|
n_sents (int): The number of neighbouring sentences to take into account.
|
||||||
incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
|
incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
|
||||||
incl_context (bool): Whether or not to include the local context in the model.
|
incl_context (bool): Whether or not to include the local context in the model.
|
||||||
entity_vector_length (int): Size of encoding vectors in the KB.
|
entity_vector_length (int): Size of encoding vectors in the KB.
|
||||||
|
@ -129,25 +135,24 @@ class EntityLinker(TrainablePipe):
|
||||||
self.vocab = vocab
|
self.vocab = vocab
|
||||||
self.model = model
|
self.model = model
|
||||||
self.name = name
|
self.name = name
|
||||||
cfg = {
|
self.labels_discard = list(labels_discard)
|
||||||
"labels_discard": list(labels_discard),
|
self.n_sents = n_sents
|
||||||
"incl_prior": incl_prior,
|
self.incl_prior = incl_prior
|
||||||
"incl_context": incl_context,
|
self.incl_context = incl_context
|
||||||
"entity_vector_length": entity_vector_length,
|
|
||||||
}
|
|
||||||
self.get_candidates = get_candidates
|
self.get_candidates = get_candidates
|
||||||
self.cfg = dict(cfg)
|
self.cfg = {}
|
||||||
self.distance = CosineDistance(normalize=False)
|
self.distance = CosineDistance(normalize=False)
|
||||||
# how many neighbour sentences to take into account
|
# how many neighbour sentences to take into account
|
||||||
self.n_sents = cfg.get("n_sents", 0)
|
|
||||||
# create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
|
# create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
|
||||||
self.kb = empty_kb(entity_vector_length)(self.vocab)
|
self.kb = empty_kb(entity_vector_length)(self.vocab)
|
||||||
|
|
||||||
def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
|
def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
|
||||||
"""Define the KB of this pipe by providing a function that will
|
"""Define the KB of this pipe by providing a function that will
|
||||||
create it using this object's vocab."""
|
create it using this object's vocab."""
|
||||||
|
if not callable(kb_loader):
|
||||||
|
raise ValueError(Errors.E885.format(arg_type=type(kb_loader)))
|
||||||
|
|
||||||
self.kb = kb_loader(self.vocab)
|
self.kb = kb_loader(self.vocab)
|
||||||
self.cfg["entity_vector_length"] = self.kb.entity_vector_length
|
|
||||||
|
|
||||||
def validate_kb(self) -> None:
|
def validate_kb(self) -> None:
|
||||||
# Raise an error if the knowledge base is not initialized.
|
# Raise an error if the knowledge base is not initialized.
|
||||||
|
@ -309,14 +314,13 @@ class EntityLinker(TrainablePipe):
|
||||||
sent_doc = doc[start_token:end_token].as_doc()
|
sent_doc = doc[start_token:end_token].as_doc()
|
||||||
# currently, the context is the same for each entity in a sentence (should be refined)
|
# currently, the context is the same for each entity in a sentence (should be refined)
|
||||||
xp = self.model.ops.xp
|
xp = self.model.ops.xp
|
||||||
if self.cfg.get("incl_context"):
|
if self.incl_context:
|
||||||
sentence_encoding = self.model.predict([sent_doc])[0]
|
sentence_encoding = self.model.predict([sent_doc])[0]
|
||||||
sentence_encoding_t = sentence_encoding.T
|
sentence_encoding_t = sentence_encoding.T
|
||||||
sentence_norm = xp.linalg.norm(sentence_encoding_t)
|
sentence_norm = xp.linalg.norm(sentence_encoding_t)
|
||||||
for ent in sent.ents:
|
for ent in sent.ents:
|
||||||
entity_count += 1
|
entity_count += 1
|
||||||
to_discard = self.cfg.get("labels_discard", [])
|
if ent.label_ in self.labels_discard:
|
||||||
if to_discard and ent.label_ in to_discard:
|
|
||||||
# ignoring this entity - setting to NIL
|
# ignoring this entity - setting to NIL
|
||||||
final_kb_ids.append(self.NIL)
|
final_kb_ids.append(self.NIL)
|
||||||
else:
|
else:
|
||||||
|
@ -334,13 +338,13 @@ class EntityLinker(TrainablePipe):
|
||||||
prior_probs = xp.asarray(
|
prior_probs = xp.asarray(
|
||||||
[c.prior_prob for c in candidates]
|
[c.prior_prob for c in candidates]
|
||||||
)
|
)
|
||||||
if not self.cfg.get("incl_prior"):
|
if not self.incl_prior:
|
||||||
prior_probs = xp.asarray(
|
prior_probs = xp.asarray(
|
||||||
[0.0 for _ in candidates]
|
[0.0 for _ in candidates]
|
||||||
)
|
)
|
||||||
scores = prior_probs
|
scores = prior_probs
|
||||||
# add in similarity from the context
|
# add in similarity from the context
|
||||||
if self.cfg.get("incl_context"):
|
if self.incl_context:
|
||||||
entity_encodings = xp.asarray(
|
entity_encodings = xp.asarray(
|
||||||
[c.entity_vector for c in candidates]
|
[c.entity_vector for c in candidates]
|
||||||
)
|
)
|
||||||
|
|
|
@ -66,26 +66,12 @@ class Sentencizer(Pipe):
|
||||||
"""
|
"""
|
||||||
error_handler = self.get_error_handler()
|
error_handler = self.get_error_handler()
|
||||||
try:
|
try:
|
||||||
self._call(doc)
|
tags = self.predict([doc])
|
||||||
|
self.set_annotations([doc], tags)
|
||||||
return doc
|
return doc
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
error_handler(self.name, self, [doc], e)
|
error_handler(self.name, self, [doc], e)
|
||||||
|
|
||||||
def _call(self, doc):
|
|
||||||
start = 0
|
|
||||||
seen_period = False
|
|
||||||
for i, token in enumerate(doc):
|
|
||||||
is_in_punct_chars = token.text in self.punct_chars
|
|
||||||
token.is_sent_start = i == 0
|
|
||||||
if seen_period and not token.is_punct and not is_in_punct_chars:
|
|
||||||
doc[start].is_sent_start = True
|
|
||||||
start = token.i
|
|
||||||
seen_period = False
|
|
||||||
elif is_in_punct_chars:
|
|
||||||
seen_period = True
|
|
||||||
if start < len(doc):
|
|
||||||
doc[start].is_sent_start = True
|
|
||||||
|
|
||||||
def predict(self, docs):
|
def predict(self, docs):
|
||||||
"""Apply the pipe to a batch of docs, without modifying them.
|
"""Apply the pipe to a batch of docs, without modifying them.
|
||||||
|
|
||||||
|
|
|
@ -314,6 +314,9 @@ class Scorer:
|
||||||
getter (Callable[[Doc, str], Iterable[Span]]): Defaults to getattr. If
|
getter (Callable[[Doc, str], Iterable[Span]]): Defaults to getattr. If
|
||||||
provided, getter(doc, attr) should return the spans for the
|
provided, getter(doc, attr) should return the spans for the
|
||||||
individual doc.
|
individual doc.
|
||||||
|
has_annotation (Optional[Callable[[Doc], bool]]) should return whether a `Doc`
|
||||||
|
has annotation for this `attr`. Docs without annotation are skipped for
|
||||||
|
scoring purposes.
|
||||||
RETURNS (Dict[str, Any]): A dictionary containing the PRF scores under
|
RETURNS (Dict[str, Any]): A dictionary containing the PRF scores under
|
||||||
the keys attr_p/r/f and the per-type PRF scores under attr_per_type.
|
the keys attr_p/r/f and the per-type PRF scores under attr_per_type.
|
||||||
|
|
||||||
|
@ -324,7 +327,7 @@ class Scorer:
|
||||||
for example in examples:
|
for example in examples:
|
||||||
pred_doc = example.predicted
|
pred_doc = example.predicted
|
||||||
gold_doc = example.reference
|
gold_doc = example.reference
|
||||||
# Option to handle docs without sents
|
# Option to handle docs without annotation for this attribute
|
||||||
if has_annotation is not None:
|
if has_annotation is not None:
|
||||||
if not has_annotation(gold_doc):
|
if not has_annotation(gold_doc):
|
||||||
continue
|
continue
|
||||||
|
@ -531,27 +534,28 @@ class Scorer:
|
||||||
gold_span = gold_ent_by_offset.get(
|
gold_span = gold_ent_by_offset.get(
|
||||||
(pred_ent.start_char, pred_ent.end_char), None
|
(pred_ent.start_char, pred_ent.end_char), None
|
||||||
)
|
)
|
||||||
label = gold_span.label_
|
if gold_span is not None:
|
||||||
if label not in f_per_type:
|
label = gold_span.label_
|
||||||
f_per_type[label] = PRFScore()
|
if label not in f_per_type:
|
||||||
gold = gold_span.kb_id_
|
f_per_type[label] = PRFScore()
|
||||||
# only evaluating entities that overlap between gold and pred,
|
gold = gold_span.kb_id_
|
||||||
# to disentangle the performance of the NEL from the NER
|
# only evaluating entities that overlap between gold and pred,
|
||||||
if gold is not None:
|
# to disentangle the performance of the NEL from the NER
|
||||||
pred = pred_ent.kb_id_
|
if gold is not None:
|
||||||
if gold in negative_labels and pred in negative_labels:
|
pred = pred_ent.kb_id_
|
||||||
# ignore true negatives
|
if gold in negative_labels and pred in negative_labels:
|
||||||
pass
|
# ignore true negatives
|
||||||
elif gold == pred:
|
pass
|
||||||
f_per_type[label].tp += 1
|
elif gold == pred:
|
||||||
elif gold in negative_labels:
|
f_per_type[label].tp += 1
|
||||||
f_per_type[label].fp += 1
|
elif gold in negative_labels:
|
||||||
elif pred in negative_labels:
|
f_per_type[label].fp += 1
|
||||||
f_per_type[label].fn += 1
|
elif pred in negative_labels:
|
||||||
else:
|
f_per_type[label].fn += 1
|
||||||
# a wrong prediction (e.g. Q42 != Q3) counts as both a FP as well as a FN
|
else:
|
||||||
f_per_type[label].fp += 1
|
# a wrong prediction (e.g. Q42 != Q3) counts as both a FP as well as a FN
|
||||||
f_per_type[label].fn += 1
|
f_per_type[label].fp += 1
|
||||||
|
f_per_type[label].fn += 1
|
||||||
micro_prf = PRFScore()
|
micro_prf = PRFScore()
|
||||||
for label_prf in f_per_type.values():
|
for label_prf in f_per_type.values():
|
||||||
micro_prf.tp += label_prf.tp
|
micro_prf.tp += label_prf.tp
|
||||||
|
|
|
@ -39,6 +39,11 @@ def ar_tokenizer():
|
||||||
return get_lang_class("ar")().tokenizer
|
return get_lang_class("ar")().tokenizer
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(scope="session")
|
||||||
|
def bg_tokenizer():
|
||||||
|
return get_lang_class("bg")().tokenizer
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture(scope="session")
|
@pytest.fixture(scope="session")
|
||||||
def bn_tokenizer():
|
def bn_tokenizer():
|
||||||
return get_lang_class("bn")().tokenizer
|
return get_lang_class("bn")().tokenizer
|
||||||
|
|
|
@ -1,3 +1,5 @@
|
||||||
|
import weakref
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
import numpy
|
import numpy
|
||||||
import logging
|
import logging
|
||||||
|
@ -663,3 +665,10 @@ def test_span_groups(en_tokenizer):
|
||||||
assert doc.spans["hi"].has_overlap
|
assert doc.spans["hi"].has_overlap
|
||||||
del doc.spans["hi"]
|
del doc.spans["hi"]
|
||||||
assert "hi" not in doc.spans
|
assert "hi" not in doc.spans
|
||||||
|
|
||||||
|
|
||||||
|
def test_doc_spans_copy(en_tokenizer):
|
||||||
|
doc1 = en_tokenizer("Some text about Colombia and the Czech Republic")
|
||||||
|
assert weakref.ref(doc1) == doc1.spans.doc_ref
|
||||||
|
doc2 = doc1.copy()
|
||||||
|
assert weakref.ref(doc2) == doc2.spans.doc_ref
|
||||||
|
|
30
spacy/tests/lang/bg/test_text.py
Normal file
|
@ -0,0 +1,30 @@
|
||||||
|
import pytest
|
||||||
|
from spacy.lang.bg.lex_attrs import like_num
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"word,match",
|
||||||
|
[
|
||||||
|
("10", True),
|
||||||
|
("1", True),
|
||||||
|
("10000", True),
|
||||||
|
("1.000", True),
|
||||||
|
("бројка", False),
|
||||||
|
("999,23", True),
|
||||||
|
("едно", True),
|
||||||
|
("две", True),
|
||||||
|
("цифра", False),
|
||||||
|
("единайсет", True),
|
||||||
|
("десет", True),
|
||||||
|
("сто", True),
|
||||||
|
("брой", False),
|
||||||
|
("хиляда", True),
|
||||||
|
("милион", True),
|
||||||
|
(",", False),
|
||||||
|
("милиарда", True),
|
||||||
|
("билион", True),
|
||||||
|
],
|
||||||
|
)
|
||||||
|
def test_bg_lex_attrs_like_number(bg_tokenizer, word, match):
|
||||||
|
tokens = bg_tokenizer(word)
|
||||||
|
assert len(tokens) == 1
|
||||||
|
assert tokens[0].like_num == match
|
|
@ -230,7 +230,7 @@ def test_el_pipe_configuration(nlp):
|
||||||
def get_lowercased_candidates(kb, span):
|
def get_lowercased_candidates(kb, span):
|
||||||
return kb.get_alias_candidates(span.text.lower())
|
return kb.get_alias_candidates(span.text.lower())
|
||||||
|
|
||||||
@registry.misc.register("spacy.LowercaseCandidateGenerator.v1")
|
@registry.misc("spacy.LowercaseCandidateGenerator.v1")
|
||||||
def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]:
|
def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]:
|
||||||
return get_lowercased_candidates
|
return get_lowercased_candidates
|
||||||
|
|
||||||
|
@ -250,6 +250,14 @@ def test_el_pipe_configuration(nlp):
|
||||||
assert doc[2].ent_kb_id_ == "Q2"
|
assert doc[2].ent_kb_id_ == "Q2"
|
||||||
|
|
||||||
|
|
||||||
|
def test_nel_nsents(nlp):
|
||||||
|
"""Test that n_sents can be set through the configuration"""
|
||||||
|
entity_linker = nlp.add_pipe("entity_linker", config={})
|
||||||
|
assert entity_linker.n_sents == 0
|
||||||
|
entity_linker = nlp.replace_pipe("entity_linker", "entity_linker", config={"n_sents": 2})
|
||||||
|
assert entity_linker.n_sents == 2
|
||||||
|
|
||||||
|
|
||||||
def test_vocab_serialization(nlp):
|
def test_vocab_serialization(nlp):
|
||||||
"""Test that string information is retained across storage"""
|
"""Test that string information is retained across storage"""
|
||||||
mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
|
mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1)
|
||||||
|
|
|
@ -83,9 +83,9 @@ def test_replace_last_pipe(nlp):
|
||||||
def test_replace_pipe_config(nlp):
|
def test_replace_pipe_config(nlp):
|
||||||
nlp.add_pipe("entity_linker")
|
nlp.add_pipe("entity_linker")
|
||||||
nlp.add_pipe("sentencizer")
|
nlp.add_pipe("sentencizer")
|
||||||
assert nlp.get_pipe("entity_linker").cfg["incl_prior"] is True
|
assert nlp.get_pipe("entity_linker").incl_prior is True
|
||||||
nlp.replace_pipe("entity_linker", "entity_linker", config={"incl_prior": False})
|
nlp.replace_pipe("entity_linker", "entity_linker", config={"incl_prior": False})
|
||||||
assert nlp.get_pipe("entity_linker").cfg["incl_prior"] is False
|
assert nlp.get_pipe("entity_linker").incl_prior is False
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize("old_name,new_name", [("old_pipe", "new_pipe")])
|
@pytest.mark.parametrize("old_name,new_name", [("old_pipe", "new_pipe")])
|
||||||
|
|
|
@ -61,7 +61,6 @@ def test_issue7029():
|
||||||
losses = {}
|
losses = {}
|
||||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||||
texts = ["first", "second", "third", "fourth", "and", "then", "some", ""]
|
texts = ["first", "second", "third", "fourth", "and", "then", "some", ""]
|
||||||
nlp.select_pipes(enable=["tok2vec", "tagger"])
|
|
||||||
docs1 = list(nlp.pipe(texts, batch_size=1))
|
docs1 = list(nlp.pipe(texts, batch_size=1))
|
||||||
docs2 = list(nlp.pipe(texts, batch_size=4))
|
docs2 = list(nlp.pipe(texts, batch_size=4))
|
||||||
assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]]
|
assert [doc[0].tag_ for doc in docs1[:-1]] == [doc[0].tag_ for doc in docs2[:-1]]
|
||||||
|
|
|
@ -1,5 +1,3 @@
|
||||||
import pytest
|
|
||||||
|
|
||||||
from spacy.tokens.doc import Doc
|
from spacy.tokens.doc import Doc
|
||||||
from spacy.vocab import Vocab
|
from spacy.vocab import Vocab
|
||||||
from spacy.pipeline._parser_internals.arc_eager import ArcEager
|
from spacy.pipeline._parser_internals.arc_eager import ArcEager
|
||||||
|
|
54
spacy/tests/regression/test_issue7062.py
Normal file
|
@ -0,0 +1,54 @@
|
||||||
|
from spacy.kb import KnowledgeBase
|
||||||
|
from spacy.training import Example
|
||||||
|
from spacy.lang.en import English
|
||||||
|
|
||||||
|
|
||||||
|
# fmt: off
|
||||||
|
TRAIN_DATA = [
|
||||||
|
("Russ Cochran his reprints include EC Comics.",
|
||||||
|
{"links": {(0, 12): {"Q2146908": 1.0}},
|
||||||
|
"entities": [(0, 12, "PERSON")],
|
||||||
|
"sent_starts": [1, -1, 0, 0, 0, 0, 0, 0]})
|
||||||
|
]
|
||||||
|
# fmt: on
|
||||||
|
|
||||||
|
|
||||||
|
def test_partial_links():
|
||||||
|
# Test that having some entities on the doc without gold links, doesn't crash
|
||||||
|
nlp = English()
|
||||||
|
vector_length = 3
|
||||||
|
train_examples = []
|
||||||
|
for text, annotation in TRAIN_DATA:
|
||||||
|
doc = nlp(text)
|
||||||
|
train_examples.append(Example.from_dict(doc, annotation))
|
||||||
|
|
||||||
|
def create_kb(vocab):
|
||||||
|
# create artificial KB
|
||||||
|
mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
|
||||||
|
mykb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
|
||||||
|
mykb.add_alias("Russ Cochran", ["Q2146908"], [0.9])
|
||||||
|
return mykb
|
||||||
|
|
||||||
|
# Create and train the Entity Linker
|
||||||
|
entity_linker = nlp.add_pipe("entity_linker", last=True)
|
||||||
|
entity_linker.set_kb(create_kb)
|
||||||
|
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||||
|
for i in range(2):
|
||||||
|
losses = {}
|
||||||
|
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||||
|
|
||||||
|
# adding additional components that are required for the entity_linker
|
||||||
|
nlp.add_pipe("sentencizer", first=True)
|
||||||
|
patterns = [
|
||||||
|
{"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]},
|
||||||
|
{"label": "ORG", "pattern": [{"LOWER": "ec"}, {"LOWER": "comics"}]}
|
||||||
|
]
|
||||||
|
ruler = nlp.add_pipe("entity_ruler", before="entity_linker")
|
||||||
|
ruler.add_patterns(patterns)
|
||||||
|
|
||||||
|
# this will run the pipeline on the examples and shouldn't crash
|
||||||
|
results = nlp.evaluate(train_examples)
|
||||||
|
assert "PERSON" in results["ents_per_type"]
|
||||||
|
assert "PERSON" in results["nel_f_per_type"]
|
||||||
|
assert "ORG" in results["ents_per_type"]
|
||||||
|
assert "ORG" not in results["nel_f_per_type"]
|
18
spacy/tests/regression/test_issue7065.py
Normal file
|
@ -0,0 +1,18 @@
|
||||||
|
from spacy.lang.en import English
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue7065():
|
||||||
|
text = "Kathleen Battle sang in Mahler 's Symphony No. 8 at the Cincinnati Symphony Orchestra 's May Festival."
|
||||||
|
nlp = English()
|
||||||
|
nlp.add_pipe("sentencizer")
|
||||||
|
ruler = nlp.add_pipe("entity_ruler")
|
||||||
|
patterns = [{"label": "THING", "pattern": [{"LOWER": "symphony"}, {"LOWER": "no"}, {"LOWER": "."}, {"LOWER": "8"}]}]
|
||||||
|
ruler.add_patterns(patterns)
|
||||||
|
|
||||||
|
doc = nlp(text)
|
||||||
|
sentences = [s for s in doc.sents]
|
||||||
|
assert len(sentences) == 2
|
||||||
|
sent0 = sentences[0]
|
||||||
|
ent = doc.ents[0]
|
||||||
|
assert ent.start < sent0.end < ent.end
|
||||||
|
assert sentences.index(ent.sent) == 0
|
|
@ -160,7 +160,7 @@ subword_features = false
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
||||||
@registry.architectures.register("my_test_parser")
|
@registry.architectures("my_test_parser")
|
||||||
def my_parser():
|
def my_parser():
|
||||||
tok2vec = build_Tok2Vec_model(
|
tok2vec = build_Tok2Vec_model(
|
||||||
MultiHashEmbed(
|
MultiHashEmbed(
|
||||||
|
|
|
@ -108,7 +108,7 @@ def test_serialize_subclassed_kb():
|
||||||
super().__init__(vocab, entity_vector_length)
|
super().__init__(vocab, entity_vector_length)
|
||||||
self.custom_field = custom_field
|
self.custom_field = custom_field
|
||||||
|
|
||||||
@registry.misc.register("spacy.CustomKB.v1")
|
@registry.misc("spacy.CustomKB.v1")
|
||||||
def custom_kb(
|
def custom_kb(
|
||||||
entity_vector_length: int, custom_field: int
|
entity_vector_length: int, custom_field: int
|
||||||
) -> Callable[["Vocab"], KnowledgeBase]:
|
) -> Callable[["Vocab"], KnowledgeBase]:
|
||||||
|
|
|
@ -4,12 +4,12 @@ from thinc.api import Linear
|
||||||
from catalogue import RegistryError
|
from catalogue import RegistryError
|
||||||
|
|
||||||
|
|
||||||
@registry.architectures.register("my_test_function")
|
|
||||||
def create_model(nr_in, nr_out):
|
|
||||||
return Linear(nr_in, nr_out)
|
|
||||||
|
|
||||||
|
|
||||||
def test_get_architecture():
|
def test_get_architecture():
|
||||||
|
|
||||||
|
@registry.architectures("my_test_function")
|
||||||
|
def create_model(nr_in, nr_out):
|
||||||
|
return Linear(nr_in, nr_out)
|
||||||
|
|
||||||
arch = registry.architectures.get("my_test_function")
|
arch = registry.architectures.get("my_test_function")
|
||||||
assert arch is create_model
|
assert arch is create_model
|
||||||
with pytest.raises(RegistryError):
|
with pytest.raises(RegistryError):
|
||||||
|
|
|
@ -7,7 +7,7 @@ from spacy import util
|
||||||
from spacy import prefer_gpu, require_gpu, require_cpu
|
from spacy import prefer_gpu, require_gpu, require_cpu
|
||||||
from spacy.ml._precomputable_affine import PrecomputableAffine
|
from spacy.ml._precomputable_affine import PrecomputableAffine
|
||||||
from spacy.ml._precomputable_affine import _backprop_precomputable_affine_padding
|
from spacy.ml._precomputable_affine import _backprop_precomputable_affine_padding
|
||||||
from spacy.util import dot_to_object, SimpleFrozenList
|
from spacy.util import dot_to_object, SimpleFrozenList, import_file
|
||||||
from thinc.api import Config, Optimizer, ConfigValidationError
|
from thinc.api import Config, Optimizer, ConfigValidationError
|
||||||
from spacy.training.batchers import minibatch_by_words
|
from spacy.training.batchers import minibatch_by_words
|
||||||
from spacy.lang.en import English
|
from spacy.lang.en import English
|
||||||
|
@ -17,7 +17,7 @@ from spacy.schemas import ConfigSchemaTraining
|
||||||
|
|
||||||
from thinc.api import get_current_ops, NumpyOps, CupyOps
|
from thinc.api import get_current_ops, NumpyOps, CupyOps
|
||||||
|
|
||||||
from .util import get_random_doc
|
from .util import get_random_doc, make_tempdir
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
|
@ -347,3 +347,35 @@ def test_resolve_dot_names():
|
||||||
errors = e.value.errors
|
errors = e.value.errors
|
||||||
assert len(errors) == 1
|
assert len(errors) == 1
|
||||||
assert errors[0]["loc"] == ["training", "xyz"]
|
assert errors[0]["loc"] == ["training", "xyz"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_import_code():
|
||||||
|
code_str = """
|
||||||
|
from spacy import Language
|
||||||
|
|
||||||
|
class DummyComponent:
|
||||||
|
def __init__(self, vocab, name):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def initialize(self, get_examples, *, nlp, dummy_param: int):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@Language.factory(
|
||||||
|
"dummy_component",
|
||||||
|
)
|
||||||
|
def make_dummy_component(
|
||||||
|
nlp: Language, name: str
|
||||||
|
):
|
||||||
|
return DummyComponent(nlp.vocab, name)
|
||||||
|
"""
|
||||||
|
|
||||||
|
with make_tempdir() as temp_dir:
|
||||||
|
code_path = os.path.join(temp_dir, "code.py")
|
||||||
|
with open(code_path, "w") as fileh:
|
||||||
|
fileh.write(code_str)
|
||||||
|
|
||||||
|
import_file("python_code", code_path)
|
||||||
|
config = {"initialize": {"components": {"dummy_component": {"dummy_param": 1}}}}
|
||||||
|
nlp = English.from_config(config)
|
||||||
|
nlp.add_pipe("dummy_component")
|
||||||
|
nlp.initialize()
|
||||||
|
|
|
@ -196,6 +196,104 @@ def test_Example_from_dict_with_entities_invalid(annots):
|
||||||
assert len(list(example.reference.ents)) == 0
|
assert len(list(example.reference.ents)) == 0
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"annots",
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
|
||||||
|
"entities": [
|
||||||
|
(7, 15, "LOC"),
|
||||||
|
(11, 15, "LOC"),
|
||||||
|
(20, 26, "LOC"),
|
||||||
|
], # overlapping
|
||||||
|
}
|
||||||
|
],
|
||||||
|
)
|
||||||
|
def test_Example_from_dict_with_entities_overlapping(annots):
|
||||||
|
vocab = Vocab()
|
||||||
|
predicted = Doc(vocab, words=annots["words"])
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
Example.from_dict(predicted, annots)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"annots",
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
|
||||||
|
"spans": {
|
||||||
|
"cities": [(7, 15, "LOC"), (20, 26, "LOC")],
|
||||||
|
"people": [(0, 1, "PERSON")],
|
||||||
|
},
|
||||||
|
}
|
||||||
|
],
|
||||||
|
)
|
||||||
|
def test_Example_from_dict_with_spans(annots):
|
||||||
|
vocab = Vocab()
|
||||||
|
predicted = Doc(vocab, words=annots["words"])
|
||||||
|
example = Example.from_dict(predicted, annots)
|
||||||
|
assert len(list(example.reference.ents)) == 0
|
||||||
|
assert len(list(example.reference.spans["cities"])) == 2
|
||||||
|
assert len(list(example.reference.spans["people"])) == 1
|
||||||
|
for span in example.reference.spans["cities"]:
|
||||||
|
assert span.label_ == "LOC"
|
||||||
|
for span in example.reference.spans["people"]:
|
||||||
|
assert span.label_ == "PERSON"
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"annots",
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
|
||||||
|
"spans": {
|
||||||
|
"cities": [(7, 15, "LOC"), (11, 15, "LOC"), (20, 26, "LOC")],
|
||||||
|
"people": [(0, 1, "PERSON")],
|
||||||
|
},
|
||||||
|
}
|
||||||
|
],
|
||||||
|
)
|
||||||
|
def test_Example_from_dict_with_spans_overlapping(annots):
|
||||||
|
vocab = Vocab()
|
||||||
|
predicted = Doc(vocab, words=annots["words"])
|
||||||
|
example = Example.from_dict(predicted, annots)
|
||||||
|
assert len(list(example.reference.ents)) == 0
|
||||||
|
assert len(list(example.reference.spans["cities"])) == 3
|
||||||
|
assert len(list(example.reference.spans["people"])) == 1
|
||||||
|
for span in example.reference.spans["cities"]:
|
||||||
|
assert span.label_ == "LOC"
|
||||||
|
for span in example.reference.spans["people"]:
|
||||||
|
assert span.label_ == "PERSON"
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"annots",
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
|
||||||
|
"spans": [(0, 1, "PERSON")],
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
|
||||||
|
"spans": {"cities": (7, 15, "LOC")},
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
|
||||||
|
"spans": {"cities": [7, 11]},
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"words": ["I", "like", "New", "York", "and", "Berlin", "."],
|
||||||
|
"spans": {"cities": [[7]]},
|
||||||
|
},
|
||||||
|
],
|
||||||
|
)
|
||||||
|
def test_Example_from_dict_with_spans_invalid(annots):
|
||||||
|
vocab = Vocab()
|
||||||
|
predicted = Doc(vocab, words=annots["words"])
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
Example.from_dict(predicted, annots)
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize(
|
@pytest.mark.parametrize(
|
||||||
"annots",
|
"annots",
|
||||||
[
|
[
|
||||||
|
|
|
@ -27,7 +27,7 @@ def test_readers():
|
||||||
factory = "textcat"
|
factory = "textcat"
|
||||||
"""
|
"""
|
||||||
|
|
||||||
@registry.readers.register("myreader.v1")
|
@registry.readers("myreader.v1")
|
||||||
def myreader() -> Dict[str, Callable[[Language, str], Iterable[Example]]]:
|
def myreader() -> Dict[str, Callable[[Language, str], Iterable[Example]]]:
|
||||||
annots = {"cats": {"POS": 1.0, "NEG": 0.0}}
|
annots = {"cats": {"POS": 1.0, "NEG": 0.0}}
|
||||||
|
|
||||||
|
|
|
@ -1,7 +1,8 @@
|
||||||
from .doc import Doc
|
from .doc import Doc
|
||||||
from .token import Token
|
from .token import Token
|
||||||
from .span import Span
|
from .span import Span
|
||||||
|
from .span_group import SpanGroup
|
||||||
from ._serialize import DocBin
|
from ._serialize import DocBin
|
||||||
from .morphanalysis import MorphAnalysis
|
from .morphanalysis import MorphAnalysis
|
||||||
|
|
||||||
__all__ = ["Doc", "Token", "Span", "DocBin", "MorphAnalysis"]
|
__all__ = ["Doc", "Token", "Span", "SpanGroup", "DocBin", "MorphAnalysis"]
|
||||||
|
|
|
@ -33,8 +33,10 @@ class SpanGroups(UserDict):
|
||||||
def _make_span_group(self, name: str, spans: Iterable["Span"]) -> SpanGroup:
|
def _make_span_group(self, name: str, spans: Iterable["Span"]) -> SpanGroup:
|
||||||
return SpanGroup(self.doc_ref(), name=name, spans=spans)
|
return SpanGroup(self.doc_ref(), name=name, spans=spans)
|
||||||
|
|
||||||
def copy(self) -> "SpanGroups":
|
def copy(self, doc: "Doc" = None) -> "SpanGroups":
|
||||||
return SpanGroups(self.doc_ref()).from_bytes(self.to_bytes())
|
if doc is None:
|
||||||
|
doc = self.doc_ref()
|
||||||
|
return SpanGroups(doc).from_bytes(self.to_bytes())
|
||||||
|
|
||||||
def to_bytes(self) -> bytes:
|
def to_bytes(self) -> bytes:
|
||||||
# We don't need to serialize this as a dict, because the groups
|
# We don't need to serialize this as a dict, because the groups
|
||||||
|
|
|
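The `copy(doc=...)` change above can be illustrated without spaCy. The following is a minimal sketch, assuming that `doc_ref` is a weak reference to the owning document (as the `self.doc_ref()` call suggests). `Owner` and `Groups` are hypothetical stand-ins, and the real method round-trips through `to_bytes`/`from_bytes` rather than copying the dict shallowly.

```python
import weakref
from typing import Optional


class Owner:
    """Hypothetical stand-in for a document object."""


class Groups(dict):
    """Dict-like container that remembers its owner through a weak reference."""

    def __init__(self, owner: Owner, items=()):
        super().__init__(items)
        self.owner_ref = weakref.ref(owner)

    def copy(self, owner: Optional[Owner] = None) -> "Groups":
        # Fall back to the current owner, which matches the old behaviour.
        if owner is None:
            owner = self.owner_ref()
        # Shallow copy for brevity; the diff serializes and deserializes instead.
        return Groups(owner, self.items())


original = Owner()
clone = Owner()
groups = Groups(original, {"cities": [(2, 4)]})

same_owner = groups.copy()          # still points at `original`
rebound = groups.copy(owner=clone)  # now points at `clone`
```

Passing the new owner explicitly is what lets a copied document hold span groups that refer to the copy rather than to the source document.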
@ -1188,7 +1188,7 @@ cdef class Doc:
|
||||||
other.user_span_hooks = dict(self.user_span_hooks)
|
other.user_span_hooks = dict(self.user_span_hooks)
|
||||||
other.length = self.length
|
other.length = self.length
|
||||||
other.max_length = self.max_length
|
other.max_length = self.max_length
|
||||||
other.spans = self.spans.copy()
|
other.spans = self.spans.copy(doc=other)
|
||||||
buff_size = other.max_length + (PADDING*2)
|
buff_size = other.max_length + (PADDING*2)
|
||||||
assert buff_size > 0
|
assert buff_size > 0
|
||||||
tokens = <TokenC*>other.mem.alloc(buff_size, sizeof(TokenC))
|
tokens = <TokenC*>other.mem.alloc(buff_size, sizeof(TokenC))
|
||||||
|
|
|
@ -357,7 +357,12 @@ cdef class Span:
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def sent(self):
|
def sent(self):
|
||||||
"""RETURNS (Span): The sentence span that the span is a part of."""
|
"""Obtain the sentence that contains this span. If the given span
|
||||||
|
crosses sentence boundaries, return only the first sentence
|
||||||
|
to which it belongs.
|
||||||
|
|
||||||
|
RETURNS (Span): The sentence span that the span is a part of.
|
||||||
|
"""
|
||||||
if "sent" in self.doc.user_span_hooks:
|
if "sent" in self.doc.user_span_hooks:
|
||||||
return self.doc.user_span_hooks["sent"](self)
|
return self.doc.user_span_hooks["sent"](self)
|
||||||
# Use `sent_start` token attribute to find sentence boundaries
|
# Use `sent_start` token attribute to find sentence boundaries
|
||||||
|
@ -367,8 +372,8 @@ cdef class Span:
|
||||||
start = self.start
|
start = self.start
|
||||||
while self.doc.c[start].sent_start != 1 and start > 0:
|
while self.doc.c[start].sent_start != 1 and start > 0:
|
||||||
start += -1
|
start += -1
|
||||||
# Find end of the sentence
|
# Find end of the sentence - can be within the entity
|
||||||
end = self.end
|
end = self.start + 1
|
||||||
while end < self.doc.length and self.doc.c[end].sent_start != 1:
|
while end < self.doc.length and self.doc.c[end].sent_start != 1:
|
||||||
end += 1
|
end += 1
|
||||||
n += 1
|
n += 1
|
||||||
|
|
|
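The effect of starting the forward scan at `self.start + 1` instead of `self.end` can be reproduced on a plain list of sentence-start flags. This is a sketch under stated assumptions, not the Cython implementation: `sentence_bounds` is a hypothetical helper, and `sent_starts[i] == 1` is taken to mean that token `i` opens a sentence.

```python
from typing import List, Tuple


def sentence_bounds(sent_starts: List[int], start: int) -> Tuple[int, int]:
    """Return (start, end) token offsets of the sentence containing token `start`."""
    n_tokens = len(sent_starts)
    # Walk left until a sentence-initial token (or the document start) is found.
    left = start
    while left > 0 and sent_starts[left] != 1:
        left -= 1
    # Walk right from the token after the span start, not from the span end,
    # so the next boundary is found even if it lies inside the span.
    right = start + 1
    while right < n_tokens and sent_starts[right] != 1:
        right += 1
    return left, right


# Tokens: Give it back ! He pleaded .
flags = [1, 0, 0, 0, 1, 0, 0]
inside_first = sentence_bounds(flags, 1)   # span starting at "it"
crossing = sentence_bounds(flags, 2)       # e.g. a span covering "back ! He pleaded"
inside_second = sentence_bounds(flags, 5)  # span starting at "pleaded"
```

Because only the span's start offset is consulted, a span that crosses a boundary resolves to its first sentence, which is the behaviour the updated docstring describes.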
@ -22,6 +22,8 @@ cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot):
|
||||||
output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"])
|
output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"])
|
||||||
if "entities" in doc_annot:
|
if "entities" in doc_annot:
|
||||||
_add_entities_to_doc(output, doc_annot["entities"])
|
_add_entities_to_doc(output, doc_annot["entities"])
|
||||||
|
if "spans" in doc_annot:
|
||||||
|
_add_spans_to_doc(output, doc_annot["spans"])
|
||||||
if array.size:
|
if array.size:
|
||||||
output = output.from_array(attrs, array)
|
output = output.from_array(attrs, array)
|
||||||
# links are currently added with ENT_KB_ID on the token level
|
# links are currently added with ENT_KB_ID on the token level
|
||||||
|
@ -314,13 +316,11 @@ def _annot2array(vocab, tok_annot, doc_annot):
|
||||||
|
|
||||||
for key, value in doc_annot.items():
|
for key, value in doc_annot.items():
|
||||||
if value:
|
if value:
|
||||||
if key == "entities":
|
if key in ["entities", "cats", "spans"]:
|
||||||
pass
|
pass
|
||||||
elif key == "links":
|
elif key == "links":
|
||||||
ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], tok_annot["SPACY"], value)
|
ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], tok_annot["SPACY"], value)
|
||||||
tok_annot["ENT_KB_ID"] = ent_kb_ids
|
tok_annot["ENT_KB_ID"] = ent_kb_ids
|
||||||
elif key == "cats":
|
|
||||||
pass
|
|
||||||
else:
|
else:
|
||||||
raise ValueError(Errors.E974.format(obj="doc", key=key))
|
raise ValueError(Errors.E974.format(obj="doc", key=key))
|
||||||
|
|
||||||
|
@ -351,6 +351,29 @@ def _annot2array(vocab, tok_annot, doc_annot):
|
||||||
return attrs, array.T
|
return attrs, array.T
|
||||||
|
|
||||||
|
|
||||||
|
def _add_spans_to_doc(doc, spans_data):
|
||||||
|
if not isinstance(spans_data, dict):
|
||||||
|
raise ValueError(Errors.E879)
|
||||||
|
for key, span_list in spans_data.items():
|
||||||
|
spans = []
|
||||||
|
if not isinstance(span_list, list):
|
||||||
|
raise ValueError(Errors.E879)
|
||||||
|
for span_tuple in span_list:
|
||||||
|
if not isinstance(span_tuple, (list, tuple)) or len(span_tuple) < 2:
|
||||||
|
raise ValueError(Errors.E879)
|
||||||
|
start_char = span_tuple[0]
|
||||||
|
end_char = span_tuple[1]
|
||||||
|
label = 0
|
||||||
|
kb_id = 0
|
||||||
|
if len(span_tuple) > 2:
|
||||||
|
label = span_tuple[2]
|
||||||
|
if len(span_tuple) > 3:
|
||||||
|
kb_id = span_tuple[3]
|
||||||
|
span = doc.char_span(start_char, end_char, label=label, kb_id=kb_id)
|
||||||
|
spans.append(span)
|
||||||
|
doc.spans[key] = spans
|
||||||
|
|
||||||
|
|
||||||
def _add_entities_to_doc(doc, ner_data):
|
def _add_entities_to_doc(doc, ner_data):
|
||||||
if ner_data is None:
|
if ner_data is None:
|
||||||
return
|
return
|
||||||
|
@ -397,7 +420,7 @@ def _fix_legacy_dict_data(example_dict):
|
||||||
pass
|
pass
|
||||||
elif key == "ids":
|
elif key == "ids":
|
||||||
pass
|
pass
|
||||||
elif key in ("cats", "links"):
|
elif key in ("cats", "links", "spans"):
|
||||||
doc_dict[key] = value
|
doc_dict[key] = value
|
||||||
elif key in ("ner", "entities"):
|
elif key in ("ner", "entities"):
|
||||||
doc_dict["entities"] = value
|
doc_dict["entities"] = value
|
||||||
|
|
|
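The shape checks in `_add_spans_to_doc` can be tried in isolation. The sketch below mirrors only the validation and defaulting logic from the hunk above. `parse_spans` is a hypothetical helper, its error messages are illustrative rather than the text of `Errors.E879`, and it returns plain tuples instead of calling `doc.char_span`.

```python
from typing import Any, Dict, List, Tuple

SpanRecord = Tuple[int, int, Any, Any]  # (start_char, end_char, label, kb_id)


def parse_spans(spans_data: Any) -> Dict[str, List[SpanRecord]]:
    """Validate a `spans` annotation and normalise every entry to four fields."""
    if not isinstance(spans_data, dict):
        raise ValueError("spans must be a dict mapping group names to lists")
    parsed: Dict[str, List[SpanRecord]] = {}
    for key, span_list in spans_data.items():
        if not isinstance(span_list, list):
            raise ValueError(f"spans[{key!r}] must be a list")
        records = []
        for span_tuple in span_list:
            if not isinstance(span_tuple, (list, tuple)) or len(span_tuple) < 2:
                raise ValueError(f"invalid span in group {key!r}: {span_tuple!r}")
            start_char, end_char = span_tuple[0], span_tuple[1]
            # Label and KB ID are optional; 0 mirrors the defaults in the diff.
            label = span_tuple[2] if len(span_tuple) > 2 else 0
            kb_id = span_tuple[3] if len(span_tuple) > 3 else 0
            records.append((start_char, end_char, label, kb_id))
        parsed[key] = records
    return parsed


good = parse_spans(
    {"cities": [(7, 15, "LOC"), (20, 26, "LOC")], "people": [(0, 1, "PERSON")]}
)
```

The four malformed inputs from `test_Example_from_dict_with_spans_invalid` (a bare list, a tuple instead of a list, a list of integers, and a one-element span) each fail one of the three checks.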
@ -103,7 +103,11 @@ def console_logger(progress_bar: bool = False):
|
||||||
|
|
||||||
@registry.loggers("spacy.WandbLogger.v1")
|
@registry.loggers("spacy.WandbLogger.v1")
|
||||||
def wandb_logger(project_name: str, remove_config_values: List[str] = []):
|
def wandb_logger(project_name: str, remove_config_values: List[str] = []):
|
||||||
import wandb
|
try:
|
||||||
|
import wandb
|
||||||
|
from wandb import init, log, join # test that these are available
|
||||||
|
except ImportError:
|
||||||
|
raise ImportError(Errors.E880)
|
||||||
|
|
||||||
console = console_logger(progress_bar=False)
|
console = console_logger(progress_bar=False)
|
||||||
|
|
||||||
|
|
|
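The guarded import in `wandb_logger` follows a common pattern for optional dependencies: catch the `ImportError` and re-raise it with a message that tells the user what to install. A generic sketch of that pattern, where `require` and its hint text are hypothetical and not part of spaCy:

```python
import importlib
from types import ModuleType


def require(module_name: str, hint: str) -> ModuleType:
    """Import `module_name`, replacing the default error with an actionable hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Re-raise the same exception type so existing handlers keep working.
        raise ImportError(f"Can't import '{module_name}'. {hint}") from None


json_module = require("json", "It ships with Python, so this should not fail.")
```

The diff additionally imports `init`, `log` and `join` inside the `try` block, so an installed but incompatible package is reported through the same error.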
@ -70,7 +70,7 @@ CONFIG_SECTION_ORDER = ["paths", "variables", "system", "nlp", "components", "co
|
||||||
|
|
||||||
logger = logging.getLogger("spacy")
|
logger = logging.getLogger("spacy")
|
||||||
logger_stream_handler = logging.StreamHandler()
|
logger_stream_handler = logging.StreamHandler()
|
||||||
logger_stream_handler.setFormatter(logging.Formatter("%(message)s"))
|
logger_stream_handler.setFormatter(logging.Formatter("[%(asctime)s] [%(levelname)s] %(message)s"))
|
||||||
logger.addHandler(logger_stream_handler)
|
logger.addHandler(logger_stream_handler)
|
||||||
|
|
||||||
|
|
||||||
|
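To see what the new format string produces, the same `logging.Formatter` can be attached to a throwaway logger. This sketch uses the hypothetical logger name `format_demo` and an in-memory stream so that it does not touch the real `spacy` logger.

```python
import io
import logging

# A separate logger so the sketch does not reconfigure the "spacy" logger.
demo_logger = logging.getLogger("format_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(asctime)s] [%(levelname)s] %(message)s"))
demo_logger.addHandler(handler)

demo_logger.info("Pipeline initialized")
formatted = stream.getvalue()  # "[<timestamp>] [INFO] Pipeline initialized\n"
```

Each record now carries a timestamp and a level name in front of the message, where the previous format printed the message alone.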
@ -1454,9 +1454,10 @@ def is_cython_func(func: Callable) -> bool:
|
||||||
if hasattr(func, attr): # function or class instance
|
if hasattr(func, attr): # function or class instance
|
||||||
return True
|
return True
|
||||||
# https://stackoverflow.com/a/55767059
|
# https://stackoverflow.com/a/55767059
|
||||||
if hasattr(func, "__qualname__") and hasattr(func, "__module__"): # method
|
if hasattr(func, "__qualname__") and hasattr(func, "__module__") \
|
||||||
cls_func = vars(sys.modules[func.__module__])[func.__qualname__.split(".")[0]]
|
and func.__module__ in sys.modules: # method
|
||||||
return hasattr(cls_func, attr)
|
cls_func = vars(sys.modules[func.__module__])[func.__qualname__.split(".")[0]]
|
||||||
|
return hasattr(cls_func, attr)
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
|
|
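The extra `func.__module__ in sys.modules` condition prevents a `KeyError` when a callable reports a module that is not loaded under that name. The sketch below shows the guarded lookup in isolation. `owner_has_attr` is a hypothetical helper, and unlike the code in the diff it uses `.get()`, so a missing top-level name also returns `False` instead of raising.

```python
import json
import sys
from typing import Callable


def owner_has_attr(func: Callable, attr: str) -> bool:
    """Check whether the top-level object that defines `func` has `attr`."""
    module_name = getattr(func, "__module__", None)
    qualname = getattr(func, "__qualname__", None)
    if module_name is None or qualname is None:
        return False
    # Without this check, sys.modules[module_name] raises KeyError for
    # callables whose reported module is not currently loaded.
    if module_name not in sys.modules:
        return False
    owner = vars(sys.modules[module_name]).get(qualname.split(".")[0])
    if owner is None:
        return False
    return hasattr(owner, attr)


# `JSONEncoder.encode` reports module "json.encoder" and qualname
# "JSONEncoder.encode", so the owner resolves to the JSONEncoder class.
found = owner_has_attr(json.JSONEncoder.encode, "item_separator")
missing = owner_has_attr(json.JSONEncoder.encode, "no_such_attribute")
```

A function whose `__module__` names something absent from `sys.modules` now falls through to `False` rather than aborting the check.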
@ -61,6 +61,8 @@ cdef class Vocab:
|
||||||
lookups (Lookups): Container for large lookup tables and dictionaries.
|
lookups (Lookups): Container for large lookup tables and dictionaries.
|
||||||
oov_prob (float): Default OOV probability.
|
oov_prob (float): Default OOV probability.
|
||||||
vectors_name (unicode): Optional name to identify the vectors table.
|
vectors_name (unicode): Optional name to identify the vectors table.
|
||||||
|
get_noun_chunks (Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]):
|
||||||
|
A function that yields base noun phrases used for Doc.noun_chunks.
|
||||||
"""
|
"""
|
||||||
lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {}
|
lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {}
|
||||||
if lookups in (None, True, False):
|
if lookups in (None, True, False):
|
||||||
|
|
|
@ -19,7 +19,7 @@ spaCy's built-in architectures that are used for different NLP tasks. All
|
||||||
trainable [built-in components](/api#architecture-pipeline) expect a `model`
|
trainable [built-in components](/api#architecture-pipeline) expect a `model`
|
||||||
argument defined in the config and document their default architecture.
|
argument defined in the config and document their default architecture.
|
||||||
Custom architectures can be registered using the
|
Custom architectures can be registered using the
|
||||||
[`@spacy.registry.architectures`](/api/top-level#regsitry) decorator and used as
|
[`@spacy.registry.architectures`](/api/top-level#registry) decorator and used as
|
||||||
part of the [training config](/usage/training#custom-functions). Also see the
|
part of the [training config](/usage/training#custom-functions). Also see the
|
||||||
usage documentation on
|
usage documentation on
|
||||||
[layers and model architectures](/usage/layers-architectures).
|
[layers and model architectures](/usage/layers-architectures).
|
||||||
|
|
|
@ -219,7 +219,7 @@ alignment mode `"strict".
|
||||||
| `alignment_mode` | How character indices snap to token boundaries. Options: `"strict"` (no snapping), `"contract"` (span of all tokens completely within the character span), `"expand"` (span of all tokens at least partially covered by the character span). Defaults to `"strict"`. ~~str~~ |
|
| `alignment_mode` | How character indices snap to token boundaries. Options: `"strict"` (no snapping), `"contract"` (span of all tokens completely within the character span), `"expand"` (span of all tokens at least partially covered by the character span). Defaults to `"strict"`. ~~str~~ |
|
||||||
| **RETURNS** | The newly constructed object or `None`. ~~Optional[Span]~~ |
|
| **RETURNS** | The newly constructed object or `None`. ~~Optional[Span]~~ |
|
||||||
|
|
||||||
## Doc.set_ents {#ents tag="method" new="3"}
|
## Doc.set_ents {#set_ents tag="method" new="3"}
|
||||||
|
|
||||||
Set the named entities in the document.
|
Set the named entities in the document.
|
||||||
|
|
||||||
|
@ -616,8 +616,10 @@ phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be
|
||||||
nested within it – so no NP-level coordination, no prepositional phrases, and no
|
nested within it – so no NP-level coordination, no prepositional phrases, and no
|
||||||
relative clauses.
|
relative clauses.
|
||||||
|
|
||||||
If the `noun_chunk` [syntax iterator](/usage/adding-languages#language-data) has
|
To customize the noun chunk iterator in a loaded pipeline, modify
|
||||||
not been implemeted for the given language, a `NotImplementedError` is raised.
|
[`nlp.vocab.get_noun_chunks`](/api/vocab#attributes). If the `noun_chunk`
|
||||||
|
[syntax iterator](/usage/adding-languages#language-data) has not been
|
||||||
|
implemented for the given language, a `NotImplementedError` is raised.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -633,12 +635,14 @@ not been implemeted for the given language, a `NotImplementedError` is raised.
|
||||||
| ---------- | ------------------------------------- |
|
| ---------- | ------------------------------------- |
|
||||||
| **YIELDS** | Noun chunks in the document. ~~Span~~ |
|
| **YIELDS** | Noun chunks in the document. ~~Span~~ |
|
||||||
|
|
||||||
## Doc.sents {#sents tag="property" model="parser"}
|
## Doc.sents {#sents tag="property" model="sentences"}
|
||||||
|
|
||||||
Iterate over the sentences in the document. Sentence spans have no label. To
|
Iterate over the sentences in the document. Sentence spans have no label.
|
||||||
improve accuracy on informal texts, spaCy calculates sentence boundaries from
|
|
||||||
the syntactic dependency parse. If the parser is disabled, the `sents` iterator
|
This property is only available when
|
||||||
will be unavailable.
|
[sentence boundaries](/usage/linguistic-features#sbd) have been set on the
|
||||||
|
document by the `parser`, `senter`, `sentencizer` or some custom function. It
|
||||||
|
will raise an error otherwise.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
|
|
@ -31,6 +31,7 @@ architectures and their arguments and hyperparameters.
|
||||||
> from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL
|
> from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL
|
||||||
> config = {
|
> config = {
|
||||||
> "labels_discard": [],
|
> "labels_discard": [],
|
||||||
|
> "n_sents": 0,
|
||||||
> "incl_prior": True,
|
> "incl_prior": True,
|
||||||
> "incl_context": True,
|
> "incl_context": True,
|
||||||
> "model": DEFAULT_NEL_MODEL,
|
> "model": DEFAULT_NEL_MODEL,
|
||||||
|
@ -43,6 +44,7 @@ architectures and their arguments and hyperparameters.
|
||||||
| Setting | Description |
|
| Setting | Description |
|
||||||
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ |
|
| `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ |
|
||||||
|
| `n_sents` | The number of neighbouring sentences to take into account. Defaults to 0. ~~int~~ |
|
||||||
| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ |
|
| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ |
|
||||||
| `incl_context` | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ |
|
| `incl_context` | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ |
|
||||||
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ |
|
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ |
|
||||||
|
@ -89,6 +91,7 @@ custom knowledge base, you should either call
|
||||||
| `entity_vector_length` | Size of encoding vectors in the KB. ~~int~~ |
|
| `entity_vector_length` | Size of encoding vectors in the KB. ~~int~~ |
|
||||||
| `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
|
| `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
|
||||||
| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ |
|
| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ |
|
||||||
|
| `n_sents` | The number of neighbouring sentences to take into account. ~~int~~ |
|
||||||
| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ |
|
| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ |
|
||||||
| `incl_context` | Whether or not to include the local context in the model. ~~bool~~ |
|
| `incl_context` | Whether or not to include the local context in the model. ~~bool~~ |
|
||||||
|
|
||||||
|
@ -154,7 +157,7 @@ with the current vocab.
|
||||||
> kb.add_alias(...)
|
> kb.add_alias(...)
|
||||||
> return kb
|
> return kb
|
||||||
> entity_linker = nlp.add_pipe("entity_linker")
|
> entity_linker = nlp.add_pipe("entity_linker")
|
||||||
> entity_linker.set_kb(lambda: [], nlp=nlp, kb_loader=create_kb)
|
> entity_linker.set_kb(create_kb)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
|
@ -247,14 +250,14 @@ pipe's entity linking model and context encoder. Delegates to
|
||||||
> losses = entity_linker.update(examples, sgd=optimizer)
|
> losses = entity_linker.update(examples, sgd=optimizer)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
|
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
|
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
|
||||||
| _keyword-only_ | |
|
| _keyword-only_ | |
|
||||||
| `drop` | The dropout rate. ~~float~~ |
|
| `drop` | The dropout rate. ~~float~~ |
|
||||||
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
|
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
|
||||||
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
|
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
|
||||||
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
|
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
|
||||||
|
|
||||||
## EntityLinker.score {#score tag="method" new="3"}
|
## EntityLinker.score {#score tag="method" new="3"}
|
||||||
|
|
||||||
|
|
|
@ -152,7 +152,7 @@ Get a list of all aliases in the knowledge base.
|
||||||
| ----------- | -------------------------------------------------------- |
|
| ----------- | -------------------------------------------------------- |
|
||||||
| **RETURNS** | The list of aliases in the knowledge base. ~~List[str]~~ |
|
| **RETURNS** | The list of aliases in the knowledge base. ~~List[str]~~ |
|
||||||
|
|
||||||
## KnowledgeBase.get_candidates {#get_candidates tag="method"}
|
## KnowledgeBase.get_alias_candidates {#get_alias_candidates tag="method"}
|
||||||
|
|
||||||
Given a certain textual mention as input, retrieve a list of candidate entities
|
Given a certain textual mention as input, retrieve a list of candidate entities
|
||||||
of type [`Candidate`](/api/kb/#candidate).
|
of type [`Candidate`](/api/kb/#candidate).
|
||||||
|
@ -160,13 +160,13 @@ of type [`Candidate`](/api/kb/#candidate).
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> candidates = kb.get_candidates("Douglas")
|
> candidates = kb.get_alias_candidates("Douglas")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| ----------- | ------------------------------------- |
|
| ----------- | ------------------------------------------------------------- |
|
||||||
| `alias` | The textual mention or alias. ~~str~~ |
|
| `alias` | The textual mention or alias. ~~str~~ |
|
||||||
| **RETURNS** | iterable | The list of relevant `Candidate` objects. ~~List[Candidate]~~ |
|
| **RETURNS** | The list of relevant `Candidate` objects. ~~List[Candidate]~~ |
|
||||||
|
|
||||||
## KnowledgeBase.get_vector {#get_vector tag="method"}
|
## KnowledgeBase.get_vector {#get_vector tag="method"}
|
||||||
|
|
||||||
|
@ -246,7 +246,7 @@ certain prior probability.
|
||||||
|
|
||||||
Construct a `Candidate` object. Usually this constructor is not called directly,
|
Construct a `Candidate` object. Usually this constructor is not called directly,
|
||||||
but instead these objects are returned by the
|
but instead these objects are returned by the
|
||||||
[`get_candidates`](/api/kb#get_candidates) method of a `KnowledgeBase`.
|
`get_candidates` method of the [`entity_linker`](/api/entitylinker) pipe.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
|
|
@ -364,7 +364,7 @@ Evaluate a pipeline's components.
|
||||||
|
|
||||||
<Infobox variant="warning" title="Changed in v3.0">
|
<Infobox variant="warning" title="Changed in v3.0">
|
||||||
|
|
||||||
The `Language.update` method now takes a batch of [`Example`](/api/example)
|
The `Language.evaluate` method now takes a batch of [`Example`](/api/example)
|
||||||
objects instead of tuples of `Doc` and `GoldParse` objects.
|
objects instead of tuples of `Doc` and `GoldParse` objects.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
|
@ -137,14 +137,14 @@ Returns PRF scores for labeled or unlabeled spans.
|
||||||
> print(scores["ents_f"])
|
> print(scores["ents_f"])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
|
| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ |
|
||||||
| `attr` | The attribute to score. ~~str~~ |
|
| `attr` | The attribute to score. ~~str~~ |
|
||||||
| _keyword-only_ | |
|
| _keyword-only_ | |
|
||||||
| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ |
|
| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. ~~Callable[[Doc, str], Iterable[Span]]~~ |
|
||||||
| `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~str~~ |
|
| `has_annotation` | Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. ~~Optional[Callable[[Doc], bool]]~~ |
|
||||||
| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ |
|
| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ |
|
||||||
|
|
||||||
## Scorer.score_deps {#score_deps tag="staticmethod" new="3"}
|
## Scorer.score_deps {#score_deps tag="staticmethod" new="3"}
|
||||||
|
|
||||||
|
|
|
@ -483,13 +483,40 @@ The L2 norm of the span's vector representation.
|
||||||
| ----------- | --------------------------------------------------- |
|
| ----------- | --------------------------------------------------- |
|
||||||
| **RETURNS** | The L2 norm of the vector representation. ~~float~~ |
|
| **RETURNS** | The L2 norm of the vector representation. ~~float~~ |
|
||||||
|
|
||||||
|
## Span.sent {#sent tag="property" model="sentences"}
|
||||||
|
|
||||||
|
The sentence span that this span is a part of. This property is only available
|
||||||
|
when [sentence boundaries](/usage/linguistic-features#sbd) have been set on the
|
||||||
|
document by the `parser`, `senter`, `sentencizer` or some custom function. It
|
||||||
|
will raise an error otherwise.
|
||||||
|
|
||||||
|
If the span happens to cross sentence boundaries, only the first sentence will
|
||||||
|
be returned. If it is required that the sentence always includes the
|
||||||
|
full span, the result can be adjusted as such:
|
||||||
|
|
||||||
|
```python
|
||||||
|
sent = span.sent
|
||||||
|
sent = doc[sent.start : max(sent.end, span.end)]
|
||||||
|
```
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> doc = nlp("Give it back! He pleaded.")
|
||||||
|
> span = doc[1:3]
|
||||||
|
> assert span.sent.text == "Give it back!"
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | ------------------------------------------------------- |
|
||||||
|
| **RETURNS** | The sentence span that this span is a part of. ~~Span~~ |
|
||||||
|
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
|
| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `doc` | The parent document. ~~Doc~~ |
|
| `doc` | The parent document. ~~Doc~~ |
|
||||||
| `tensor` <Tag variant="new">2.1.7</Tag> | The span's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~ |
|
| `tensor` <Tag variant="new">2.1.7</Tag> | The span's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~ |
|
||||||
| `sent` | The sentence span that this span is a part of. ~~Span~~ |
|
|
||||||
| `start` | The token offset for the start of the span. ~~int~~ |
|
| `start` | The token offset for the start of the span. ~~int~~ |
|
||||||
| `end` | The token offset for the end of the span. ~~int~~ |
|
| `end` | The token offset for the end of the span. ~~int~~ |
|
||||||
| `start_char` | The character offset for the start of the span. ~~int~~ |
|
| `start_char` | The character offset for the start of the span. ~~int~~ |
|
||||||
|
|
|
@ -21,14 +21,14 @@ Create the vocabulary.
|
||||||
> vocab = Vocab(strings=["hello", "world"])
|
> vocab = Vocab(strings=["hello", "world"])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `lex_attr_getters` | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~ |
|
| `lex_attr_getters` | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~ |
|
||||||
| `strings` | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~ |
|
| `strings` | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~ |
|
||||||
| `lookups` | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~ |
|
| `lookups` | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~ |
|
||||||
| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ |
|
| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ |
|
||||||
| `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~ |
|
| `vectors_name` <Tag variant="new">2.2</Tag> | A name to identify the vectors table. ~~str~~ |
|
||||||
| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ |
|
| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ |
|
||||||
| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
|
| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
|
||||||
|
|
||||||
## Vocab.\_\_len\_\_ {#len tag="method"}
|
## Vocab.\_\_len\_\_ {#len tag="method"}
|
||||||
|
@ -182,14 +182,14 @@ subword features by average over n-grams of `orth` (introduced in spaCy `v2.1`).
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
|
| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ |
|
| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ |
|
||||||
| `minn` <Tag variant="new">2.1</Tag> | Minimum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
|
| `minn` <Tag variant="new">2.1</Tag> | Minimum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
|
||||||
| `maxn` <Tag variant="new">2.1</Tag> | Maximum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
|
| `maxn` <Tag variant="new">2.1</Tag> | Maximum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
|
||||||
| **RETURNS** | A word vector. Size and shape are determined by the `Vocab.vectors` instance. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
|
| **RETURNS** | A word vector. Size and shape are determined by the `Vocab.vectors` instance. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
|
||||||
|
|
||||||
## Vocab.set_vector {#set_vector tag="method" new="2"}
|
## Vocab.set_vector {#set_vector tag="method" new="2"}
|
||||||
|
|
||||||
Set a vector for a word in the vocabulary. Words can be referenced by string
|
Set a vector for a word in the vocabulary. Words can be referenced by string or
|
||||||
or hash value.
|
hash value.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
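The `minn` and `maxn` rows in the hunk above describe averaging over character n-grams of `orth`. A minimal sketch of that idea follows, assuming a plain dictionary as the vector table and plain lists as vectors. The helper names `char_ngrams` and `ngram_average` are hypothetical and are not part of spaCy; the real `Vocab.get_vector` follows FastText's n-gram computation and differs in detail.

```python
from typing import Dict, List, Optional


def char_ngrams(word: str, minn: int, maxn: int) -> List[str]:
    """Return every character n-gram of `word` with minn <= n <= maxn."""
    return [
        word[i : i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(word) - n + 1)
    ]


def ngram_average(
    word: str,
    table: Dict[str, List[float]],  # hypothetical stand-in for a vectors table
    width: int,
    minn: Optional[int] = None,
    maxn: Optional[int] = None,
) -> List[float]:
    """Average the vectors of all n-grams of `word` that exist in `table`."""
    # As in the documented defaults, both limits fall back to len(word).
    minn = len(word) if minn is None else minn
    maxn = len(word) if maxn is None else maxn
    hits = [table[g] for g in char_ngrams(word, minn, maxn) if g in table]
    if not hits:
        # No known n-gram: return a zero vector of the table's width.
        return [0.0] * width
    return [sum(column) / len(hits) for column in zip(*hits)]


table = {"ca": [1.0, 0.0], "at": [3.0, 2.0]}
vector = ngram_average("cat", table, width=2, minn=2, maxn=3)
```

With the default limits only the full word is looked up, which is why a word with no entry of its own yields a zero vector in this sketch.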
@ -300,13 +300,14 @@ Load state from a binary string.
|
||||||
> assert type(PERSON) == int
|
> assert type(PERSON) == int
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| --------------------------------------------- | ------------------------------------------------------------------------------- |
|
| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `strings` | A table managing the string-to-int mapping. ~~StringStore~~ |
|
| `strings` | A table managing the string-to-int mapping. ~~StringStore~~ |
|
||||||
| `vectors` <Tag variant="new">2</Tag> | A table associating word IDs to word vectors. ~~Vectors~~ |
|
| `vectors` <Tag variant="new">2</Tag> | A table associating word IDs to word vectors. ~~Vectors~~ |
|
||||||
| `vectors_length` | Number of dimensions for each word vector. ~~int~~ |
|
| `vectors_length` | Number of dimensions for each word vector. ~~int~~ |
|
||||||
| `lookups` | The available lookup tables in this vocab. ~~Lookups~~ |
|
| `lookups` | The available lookup tables in this vocab. ~~Lookups~~ |
|
||||||
| `writing_system` <Tag variant="new">2.1</Tag> | A dict with information about the language's writing system. ~~Dict[str, Any]~~ |
|
| `writing_system` <Tag variant="new">2.1</Tag> | A dict with information about the language's writing system. ~~Dict[str, Any]~~ |
|
||||||
|
| `get_noun_chunks` <Tag variant="new">3.0</Tag> | A function that yields base noun phrases used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ |
|
||||||
|
|
||||||
## Serialization fields {#serialization-fields}
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
|
|
@ -15,7 +15,7 @@ next: /usage/projects
|
||||||
> ```python
|
> ```python
|
||||||
> from thinc.api import Model, chain
|
> from thinc.api import Model, chain
|
||||||
>
|
>
|
||||||
> @spacy.registry.architectures.register("model.v1")
|
> @spacy.registry.architectures("model.v1")
|
||||||
> def build_model(width: int, classes: int) -> Model:
|
> def build_model(width: int, classes: int) -> Model:
|
||||||
> tok2vec = build_tok2vec(width)
|
> tok2vec = build_tok2vec(width)
|
||||||
> output_layer = build_output_layer(width, classes)
|
> output_layer = build_output_layer(width, classes)
|
||||||
|
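The decorator change in this hunk, from `registry.architectures.register(...)` to `registry.architectures(...)`, can be illustrated with a small stand-alone sketch. The `Registry` class below is hypothetical and is not spaCy's implementation; it only shows how a registry object can be made callable so that both spellings go through the same registration path.

```python
from typing import Any, Callable, Dict


class Registry:
    """Hypothetical function registry, for illustration only."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._funcs: Dict[str, Callable[..., Any]] = {}

    def register(self, key: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        """Return a decorator that stores the function under `key`."""
        def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
            self._funcs[key] = func
            return func  # hand the function back unchanged
        return decorator

    def __call__(self, key: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        # Calling the registry directly is shorthand for .register(key).
        return self.register(key)

    def get(self, key: str) -> Callable[..., Any]:
        return self._funcs[key]


architectures = Registry("architectures")


@architectures("model.v1")  # short form, as on the new side of the diff
def build_model(width: int, classes: int) -> Dict[str, int]:
    return {"width": width, "classes": classes}


@architectures.register("model.v2")  # long form, as on the old side
def build_model_v2(width: int, classes: int) -> Dict[str, int]:
    return {"width": width * 2, "classes": classes}
```

In this sketch a config value such as `"model.v1"` would be resolved with `architectures.get("model.v1")`, regardless of which spelling registered it.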
@ -563,7 +563,7 @@ matrix** (~~Floats2d~~) of predictions:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
### The model architecture
|
### The model architecture
|
||||||
@spacy.registry.architectures.register("rel_model.v1")
|
@spacy.registry.architectures("rel_model.v1")
|
||||||
def create_relation_model(...) -> Model[List[Doc], Floats2d]:
|
def create_relation_model(...) -> Model[List[Doc], Floats2d]:
|
||||||
model = ... # 👈 model will go here
|
model = ... # 👈 model will go here
|
||||||
return model
|
return model
|
||||||
|
@ -589,7 +589,7 @@ transforms the instance tensor into a final tensor holding the predictions:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
### The model architecture {highlight="6"}
|
### The model architecture {highlight="6"}
|
||||||
@spacy.registry.architectures.register("rel_model.v1")
|
@spacy.registry.architectures("rel_model.v1")
|
||||||
def create_relation_model(
|
def create_relation_model(
|
||||||
create_instance_tensor: Model[List[Doc], Floats2d],
|
create_instance_tensor: Model[List[Doc], Floats2d],
|
||||||
classification_layer: Model[Floats2d, Floats2d],
|
classification_layer: Model[Floats2d, Floats2d],
|
||||||
|
@ -613,7 +613,7 @@ The `classification_layer` could be something like a
|
||||||
|
|
||||||
```python
|
```python
|
||||||
### The classification layer
|
### The classification layer
|
||||||
@spacy.registry.architectures.register("rel_classification_layer.v1")
|
@spacy.registry.architectures("rel_classification_layer.v1")
|
||||||
def create_classification_layer(
|
def create_classification_layer(
|
||||||
nO: int = None, nI: int = None
|
nO: int = None, nI: int = None
|
||||||
) -> Model[Floats2d, Floats2d]:
|
) -> Model[Floats2d, Floats2d]:
|
||||||
|
@ -650,7 +650,7 @@ that has the full implementation.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
### The layer that creates the instance tensor
|
### The layer that creates the instance tensor
|
||||||
@spacy.registry.architectures.register("rel_instance_tensor.v1")
|
@spacy.registry.architectures("rel_instance_tensor.v1")
|
||||||
def create_tensors(
|
def create_tensors(
|
||||||
tok2vec: Model[List[Doc], List[Floats2d]],
|
tok2vec: Model[List[Doc], List[Floats2d]],
|
||||||
pooling: Model[Ragged, Floats2d],
|
pooling: Model[Ragged, Floats2d],
|
||||||
|
@ -731,7 +731,7 @@ are within a **maximum distance** (in number of tokens) of each other:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
### Candidate generation
|
### Candidate generation
|
||||||
@spacy.registry.misc.register("rel_instance_generator.v1")
|
@spacy.registry.misc("rel_instance_generator.v1")
|
||||||
def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
|
def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
|
||||||
def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
|
def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
|
||||||
candidates = []
|
candidates = []
|
||||||
|
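The candidate-generation hunk is cut off after `candidates = []`. A minimal sketch of how such a function could continue is given below, using plain `(start, end)` token offsets instead of `Span` objects so that it runs without spaCy. Measuring the distance between start offsets is an assumption of this sketch, as is the `SpanOffsets` alias; the full implementation referenced by the documentation may differ.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a Span: (start token offset, end token offset).
SpanOffsets = Tuple[int, int]
Pair = Tuple[SpanOffsets, SpanOffsets]


def create_instances(max_length: int) -> Callable[[List[SpanOffsets]], List[Pair]]:
    """Build a function that pairs entities lying close to each other."""

    def get_candidates(ents: List[SpanOffsets]) -> List[Pair]:
        candidates = []
        for ent1 in ents:
            for ent2 in ents:
                # Skip self-pairs and keep only pairs whose start offsets
                # are at most `max_length` tokens apart (assumed metric).
                if ent1 != ent2 and abs(ent2[0] - ent1[0]) <= max_length:
                    candidates.append((ent1, ent2))
        return candidates

    return get_candidates


get_candidates = create_instances(max_length=5)
pairs = get_candidates([(0, 1), (3, 5), (20, 21)])
```

Both orderings of a pair are kept here, which suits relations that are directional; an undirected task could keep only pairs where `ent1` precedes `ent2`.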
|
|
@ -585,7 +585,7 @@ print(ent_francisco) # ['Francisco', 'I', 'GPE']
|
||||||
To ensure that the sequence of token annotations remains consistent, you have to
|
To ensure that the sequence of token annotations remains consistent, you have to
|
||||||
set entity annotations **at the document level**. However, you can't write
|
set entity annotations **at the document level**. However, you can't write
|
||||||
directly to the `token.ent_iob` or `token.ent_type` attributes, so the easiest
|
directly to the `token.ent_iob` or `token.ent_type` attributes, so the easiest
|
||||||
way to set entities is to assign to the [`doc.ents`](/api/doc#ents) attribute
|
way to set entities is to use the [`doc.set_ents`](/api/doc#set_ents) function
|
||||||
and create the new entity as a [`Span`](/api/span).
|
and create the new entity as a [`Span`](/api/span).
|
||||||
|
|
||||||
```python
|
```python
|
||||||
|
|
|
@ -95,6 +95,14 @@ spaCy's binary `.spacy` format. You can either include the data paths in the
|
||||||
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
|
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
|
||||||
```
|
```
|
||||||
|
|
||||||
|
> #### Tip: Enable your GPU
|
||||||
|
>
|
||||||
|
> Use the `--gpu-id` option to select the GPU:
|
||||||
|
>
|
||||||
|
> ```cli
|
||||||
|
> $ python -m spacy train config.cfg --gpu-id 0
|
||||||
|
> ```
|
||||||
|
|
||||||
<Accordion title="How are the config recommendations generated?" id="quickstart-source" spaced>
|
<Accordion title="How are the config recommendations generated?" id="quickstart-source" spaced>
|
||||||
|
|
||||||
The recommended config settings generated by the quickstart widget and the
|
The recommended config settings generated by the quickstart widget and the
|
||||||
|
|
|
@ -603,6 +603,7 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
|
||||||
| `GoldParse` | [`Example`](/api/example) |
|
| `GoldParse` | [`Example`](/api/example) |
|
||||||
| `GoldCorpus` | [`Corpus`](/api/corpus) |
|
| `GoldCorpus` | [`Corpus`](/api/corpus) |
|
||||||
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
|
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
|
||||||
|
| `KnowledgeBase.get_candidates` | [`KnowledgeBase.get_alias_candidates`](/api/kb#get_alias_candidates) |
|
||||||
| `Matcher.pipe`, `PhraseMatcher.pipe` | not needed |
|
| `Matcher.pipe`, `PhraseMatcher.pipe` | not needed |
|
||||||
| `gold.offsets_from_biluo_tags`, `gold.spans_from_biluo_tags`, `gold.biluo_tags_from_offsets` | [`training.biluo_tags_to_offsets`](/api/top-level#biluo_tags_to_offsets), [`training.biluo_tags_to_spans`](/api/top-level#biluo_tags_to_spans), [`training.offsets_to_biluo_tags`](/api/top-level#offsets_to_biluo_tags) |
|
| `gold.offsets_from_biluo_tags`, `gold.spans_from_biluo_tags`, `gold.biluo_tags_from_offsets` | [`training.biluo_tags_to_offsets`](/api/top-level#biluo_tags_to_offsets), [`training.biluo_tags_to_spans`](/api/top-level#biluo_tags_to_spans), [`training.offsets_to_biluo_tags`](/api/top-level#offsets_to_biluo_tags) |
|
||||||
| `spacy init-model` | [`spacy init vectors`](/api/cli#init-vectors) |
|
| `spacy init-model` | [`spacy init vectors`](/api/cli#init-vectors) |
|
||||||
|
|
|
@ -58,7 +58,7 @@
|
||||||
},
|
},
|
||||||
"category": ["pipeline"],
|
"category": ["pipeline"],
|
||||||
"tags": ["sentiment", "textblob"]
|
"tags": ["sentiment", "textblob"]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"id": "spacy-ray",
|
"id": "spacy-ray",
|
||||||
"title": "spacy-ray",
|
"title": "spacy-ray",
|
||||||
|
@ -2647,14 +2647,14 @@
|
||||||
"github": "medspacy"
|
"github": "medspacy"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"id": "rita-dsl",
|
"id": "rita-dsl",
|
||||||
"title": "RITA DSL",
|
"title": "RITA DSL",
|
||||||
"slogan": "Domain Specific Language for creating language rules",
|
"slogan": "Domain Specific Language for creating language rules",
|
||||||
"github": "zaibacu/rita-dsl",
|
"github": "zaibacu/rita-dsl",
|
||||||
"description": "A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format",
|
"description": "A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format",
|
||||||
"pip": "rita-dsl",
|
"pip": "rita-dsl",
|
||||||
"thumb": "https://raw.githubusercontent.com/zaibacu/rita-dsl/master/docs/assets/logo-100px.png",
|
"thumb": "https://raw.githubusercontent.com/zaibacu/rita-dsl/master/docs/assets/logo-100px.png",
|
||||||
"code_language": "python",
|
"code_language": "python",
|
||||||
"code_example": [
|
"code_example": [
|
||||||
"import spacy",
|
"import spacy",
|
||||||
|
@ -2754,14 +2754,41 @@
|
||||||
"{",
|
"{",
|
||||||
" var lexeme = doc.Vocab[word.Text];",
|
" var lexeme = doc.Vocab[word.Text];",
|
||||||
" Console.WriteLine($@\"{lexeme.Text} {lexeme.Orth} {lexeme.Shape} {lexeme.Prefix} {lexeme.Suffix} {lexeme.IsAlpha} {lexeme.IsDigit} {lexeme.IsTitle} {lexeme.Lang}\");",
|
" Console.WriteLine($@\"{lexeme.Text} {lexeme.Orth} {lexeme.Shape} {lexeme.Prefix} {lexeme.Suffix} {lexeme.IsAlpha} {lexeme.IsDigit} {lexeme.IsTitle} {lexeme.Lang}\");",
|
||||||
"}"
|
"}"
|
||||||
],
|
],
|
||||||
"code_language": "csharp",
|
"code_language": "csharp",
|
||||||
"author": "Antonio Miras",
|
"author": "Antonio Miras",
|
||||||
"author_links": {
|
"author_links": {
|
||||||
"github": "AMArostegui"
|
"github": "AMArostegui"
|
||||||
},
|
},
|
||||||
"category": ["nonpython"]
|
"category": ["nonpython"]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"id": "ruts",
|
||||||
|
"title": "ruTS",
|
||||||
|
"slogan": "A library for statistics extraction from texts in Russian",
|
||||||
|
"description": "The library allows extracting the following statistics from a text: basic statistics, readability metrics, lexical diversity metrics, morphological statistics",
|
||||||
|
"github": "SergeyShk/ruTS",
|
||||||
|
"pip": "ruts",
|
||||||
|
"code_example": [
|
||||||
|
"import spacy",
|
||||||
|
"import ruts",
|
||||||
|
"",
|
||||||
|
"nlp = spacy.load('ru_core_news_sm')",
|
||||||
|
"nlp.add_pipe('basic', last=True)",
|
||||||
|
"doc = nlp('мама мыла раму')",
|
||||||
|
"doc._.basic.get_stats()"
|
||||||
|
],
|
||||||
|
"code_language": "python",
|
||||||
|
"thumb": "https://habrastorage.org/webt/6z/le/fz/6zlefzjavzoqw_wymz7v3pwgfp4.png",
|
||||||
|
"image": "https://clipartart.com/images/free-tree-roots-clipart-black-and-white-2.png",
|
||||||
|
"author": "Sergey Shkarin",
|
||||||
|
"author_links": {
|
||||||
|
"twitter": "shk_sergey",
|
||||||
|
"github": "SergeyShk"
|
||||||
|
},
|
||||||
|
"category": ["pipeline", "standalone"],
|
||||||
|
"tags": ["Text Analytics", "Russian"]
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
|
|
||||||
|
|