Merge branch 'develop' of https://github.com/explosion/spaCy into develop

Ines Montani 2021-01-13 12:03:02 +11:00
commit 97d5a7ba99
82 changed files with 2710 additions and 4817 deletions

.github/contributors/bratao.md (vendored, new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Bruno Souza Cabral |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 24/12/2020 |
| GitHub username | bratao |
| Website (optional) | |

.gitignore (vendored)

@@ -51,6 +51,7 @@ env3.*/
 .pypyenv
 .pytest_cache/
 .mypy_cache/
+.hypothesis/

 # Distribution / packaging
 env/

spacy/cli/download.py

@@ -35,7 +35,10 @@ def download_cli(
 def download(model: str, direct: bool = False, *pip_args) -> None:
-    if not (is_package("spacy") or is_package("spacy-nightly")) and "--no-deps" not in pip_args:
+    if (
+        not (is_package("spacy") or is_package("spacy-nightly"))
+        and "--no-deps" not in pip_args
+    ):
         msg.warn(
             "Skipping pipeline package dependencies and setting `--no-deps`. "
             "You don't seem to have the spaCy package itself installed "

spacy/cli/evaluate.py

@@ -172,7 +172,9 @@ def render_parses(
         file_.write(html)

-def print_prf_per_type(msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str) -> None:
+def print_prf_per_type(
+    msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str
+) -> None:
     data = [
         (k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}")
         for k, v in scores.items()

spacy/cli/info.py

@@ -1,10 +1,10 @@
-from typing import Optional, Dict, Any, Union
+from typing import Optional, Dict, Any, Union, List
 import platform
 from pathlib import Path
 from wasabi import Printer, MarkdownRenderer
 import srsly

-from ._util import app, Arg, Opt
+from ._util import app, Arg, Opt, string_to_list
 from .. import util
 from .. import about

@@ -15,20 +15,22 @@ def info_cli(
     model: Optional[str] = Arg(None, help="Optional loadable spaCy pipeline"),
     markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues"),
     silent: bool = Opt(False, "--silent", "-s", "-S", help="Don't print anything (just return)"),
+    exclude: Optional[str] = Opt("labels", "--exclude", "-e", help="Comma-separated keys to exclude from the print-out"),
     # fmt: on
 ):
     """
-    Print info about spaCy installation. If a pipeline is speficied as an argument,
+    Print info about spaCy installation. If a pipeline is specified as an argument,
     print its meta information. Flag --markdown prints details in Markdown for easy
     copy-pasting to GitHub issues.

     DOCS: https://nightly.spacy.io/api/cli#info
     """
-    info(model, markdown=markdown, silent=silent)
+    exclude = string_to_list(exclude)
+    info(model, markdown=markdown, silent=silent, exclude=exclude)

 def info(
-    model: Optional[str] = None, *, markdown: bool = False, silent: bool = True
+    model: Optional[str] = None, *, markdown: bool = False, silent: bool = True, exclude: List[str]
 ) -> Union[str, dict]:
     msg = Printer(no_print=silent, pretty=not silent)
     if model:

@@ -42,13 +44,13 @@ def info(
         data["Pipelines"] = ", ".join(
             f"{n} ({v})" for n, v in data["Pipelines"].items()
         )
-    markdown_data = get_markdown(data, title=title)
+    markdown_data = get_markdown(data, title=title, exclude=exclude)
     if markdown:
         if not silent:
             print(markdown_data)
         return markdown_data
     if not silent:
-        table_data = dict(data)
+        table_data = {k: v for k, v in data.items() if k not in exclude}
         msg.table(table_data, title=title)
     return raw_data

@@ -82,7 +84,7 @@ def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
     if util.is_package(model):
         model_path = util.get_package_path(model)
     else:
-        model_path = model
+        model_path = Path(model)
     meta_path = model_path / "meta.json"
     if not meta_path.is_file():
         msg.fail("Can't find pipeline meta.json", meta_path, exits=1)

@@ -96,7 +98,7 @@ def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
     }

-def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str:
+def get_markdown(data: Dict[str, Any], title: Optional[str] = None, exclude: List[str] = None) -> str:
     """Get data in GitHub-flavoured Markdown format for issues etc.

     data (dict or list of tuples): Label/value pairs.

@@ -108,8 +110,16 @@ def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str:
         md.add(md.title(2, title))
     items = []
     for key, value in data.items():
-        if isinstance(value, str) and Path(value).exists():
+        if exclude and key in exclude:
             continue
+        if isinstance(value, str):
+            try:
+                existing_path = Path(value).exists()
+            except:
+                # invalid Path, like a URL string
+                existing_path = False
+            if existing_path:
+                continue
         items.append(f"{md.bold(f'{key}:')} {value}")
     md.add(md.list(items))
     return f"\n{md.text}\n"
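
A quick sketch of how the new exclude parameter is meant to be used from Python (the pipeline name here is just an illustration; on the command line the equivalent is python -m spacy info --exclude labels). Note that exclude is now keyword-only with no default, so direct callers must pass it:

    from spacy.cli.info import info

    # "labels" can be noisy in bug reports, so it is excluded by default in the CLI;
    # when calling info() directly, the list must be supplied explicitly
    markdown = info("en_core_web_sm", markdown=True, silent=True, exclude=["labels"])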

spacy/cli/init_config.py

@@ -32,6 +32,7 @@ def init_config_cli(
     optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
     gpu: bool = Opt(False, "--gpu", "-G", help="Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters."),
     pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
+    force_overwrite: bool = Opt(False, "--force", "-F", help="Force overwriting the output file"),
     # fmt: on
 ):
     """

@@ -46,6 +47,12 @@ def init_config_cli(
     optimize = optimize.value
     pipeline = string_to_list(pipeline)
     is_stdout = str(output_file) == "-"
+    if not is_stdout and output_file.exists() and not force_overwrite:
+        msg = Printer()
+        msg.fail(
+            "The provided output file already exists. To force overwriting the config file, set the --force or -F flag.",
+            exits=1,
+        )
     config = init_config(
         lang=lang,
         pipeline=pipeline,

@@ -162,7 +169,7 @@ def init_config(
         "Hardware": variables["hardware"].upper(),
         "Transformer": template_vars.transformer.get("name", False),
     }
-    msg.info("Generated template specific for your use case")
+    msg.info("Generated config template specific for your use case")
     for label, value in use_case.items():
         msg.text(f"- {label}: {value}")
     with show_validation_error(hint_fill=False):
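
On the command line this guard surfaces as python -m spacy init config config.cfg --force. A standalone sketch of the same pattern, with the flag value hard-coded for illustration:

    from pathlib import Path
    from wasabi import Printer

    # sketch of the new guard: refuse to overwrite an existing config file
    # unless the (here hard-coded) force_overwrite flag is set; "-" means stdout
    output_file = Path("config.cfg")
    force_overwrite = False
    if str(output_file) != "-" and output_file.exists() and not force_overwrite:
        Printer().fail("The provided output file already exists.", exits=1)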

spacy/cli/templates/quickstart_training.jinja

@@ -149,13 +149,44 @@ grad_factor = 1.0
 [components.textcat.model.linear_model]
 @architectures = "spacy.TextCatBOW.v1"
-exclusive_classes = false
+exclusive_classes = true
 ngram_size = 1
 no_output_layer = false

 {% else -%}
 [components.textcat.model]
 @architectures = "spacy.TextCatBOW.v1"
+exclusive_classes = true
+ngram_size = 1
+no_output_layer = false
+{%- endif %}
+{%- endif %}
+
+{% if "textcat_multilabel" in components %}
+[components.textcat_multilabel]
+factory = "textcat_multilabel"
+
+{% if optimize == "accuracy" %}
+[components.textcat_multilabel.model]
+@architectures = "spacy.TextCatEnsemble.v2"
+nO = null
+
+[components.textcat_multilabel.model.tok2vec]
+@architectures = "spacy-transformers.TransformerListener.v1"
+grad_factor = 1.0
+
+[components.textcat_multilabel.model.tok2vec.pooling]
+@layers = "reduce_mean.v1"
+
+[components.textcat_multilabel.model.linear_model]
+@architectures = "spacy.TextCatBOW.v1"
+exclusive_classes = false
+ngram_size = 1
+no_output_layer = false
+
+{% else -%}
+[components.textcat_multilabel.model]
+@architectures = "spacy.TextCatBOW.v1"
 exclusive_classes = false
 ngram_size = 1
 no_output_layer = false

@@ -174,7 +205,7 @@ no_output_layer = false
 factory = "tok2vec"

 [components.tok2vec.model]
-@architectures = "spacy.Tok2Vec.v1"
+@architectures = "spacy.Tok2Vec.v2"

 [components.tok2vec.model.embed]
 @architectures = "spacy.MultiHashEmbed.v1"

@@ -189,7 +220,7 @@ rows = [5000, 2500]
 include_static_vectors = {{ "true" if optimize == "accuracy" else "false" }}

 [components.tok2vec.model.encode]
-@architectures = "spacy.MaxoutWindowEncoder.v1"
+@architectures = "spacy.MaxoutWindowEncoder.v2"
 width = {{ 96 if optimize == "efficiency" else 256 }}
 depth = {{ 4 if optimize == "efficiency" else 8 }}
 window_size = 1

@@ -288,13 +319,41 @@ width = ${components.tok2vec.model.encode.width}
 [components.textcat.model.linear_model]
 @architectures = "spacy.TextCatBOW.v1"
-exclusive_classes = false
+exclusive_classes = true
 ngram_size = 1
 no_output_layer = false

 {% else -%}
 [components.textcat.model]
 @architectures = "spacy.TextCatBOW.v1"
+exclusive_classes = true
+ngram_size = 1
+no_output_layer = false
+{%- endif %}
+{%- endif %}
+
+{% if "textcat_multilabel" in components %}
+[components.textcat_multilabel]
+factory = "textcat_multilabel"
+
+{% if optimize == "accuracy" %}
+[components.textcat_multilabel.model]
+@architectures = "spacy.TextCatEnsemble.v2"
+nO = null
+
+[components.textcat_multilabel.model.tok2vec]
+@architectures = "spacy.Tok2VecListener.v1"
+width = ${components.tok2vec.model.encode.width}
+
+[components.textcat_multilabel.model.linear_model]
+@architectures = "spacy.TextCatBOW.v1"
+exclusive_classes = false
+ngram_size = 1
+no_output_layer = false
+
+{% else -%}
+[components.textcat_multilabel.model]
+@architectures = "spacy.TextCatBOW.v1"
 exclusive_classes = false
 ngram_size = 1
 no_output_layer = false

@@ -303,7 +362,7 @@ no_output_layer = false
 {% endif %}

 {% for pipe in components %}
-{% if pipe not in ["tagger", "morphologizer", "parser", "ner", "textcat", "entity_linker"] %}
+{% if pipe not in ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker"] %}
 {# Other components defined by the user: we just assume they're factories #}
 [components.{{ pipe }}]
 factory = "{{ pipe }}"

spacy/errors.py

@@ -463,6 +463,10 @@ class Errors:
              "issue tracker: http://github.com/explosion/spaCy/issues")

     # TODO: fix numbering after merging develop into master
+    E895 = ("The 'textcat' component received gold-standard annotations with "
+            "multiple labels per document. In spaCy 3 you should use the "
+            "'textcat_multilabel' component for this instead. "
+            "Example of an offending annotation: {value}")
     E896 = ("There was an error using the static vectors. Ensure that the vectors "
             "of the vocab are properly initialized, or set 'include_static_vectors' "
             "to False.")

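E895 is raised by the new single-label validation in the textcat component (see _validate_categories further down in this commit). A minimal sketch of an annotation that would now be rejected; the label names are made up:

    import spacy
    from spacy.training import Example

    nlp = spacy.blank("en")
    doc = nlp.make_doc("A text about sports and politics.")
    # two gold labels set to 1.0 on one document: fine for textcat_multilabel,
    # but the single-label textcat component raises E895 on this annotation
    example = Example.from_dict(doc, {"cats": {"SPORTS": 1.0, "POLITICS": 1.0}})
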
spacy/lang/char_classes.py

@@ -214,8 +214,22 @@ _macedonian_lower = r"ѓѕјљњќѐѝ"
 _macedonian_upper = r"ЃЅЈЉЊЌЀЍ"
 _macedonian = r"ѓѕјљњќѐѝЃЅЈЉЊЌЀЍ"

-_upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper + _macedonian_upper
-_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower + _macedonian_lower
+_upper = (
+    LATIN_UPPER
+    + _russian_upper
+    + _tatar_upper
+    + _greek_upper
+    + _ukrainian_upper
+    + _macedonian_upper
+)
+_lower = (
+    LATIN_LOWER
+    + _russian_lower
+    + _tatar_lower
+    + _greek_lower
+    + _ukrainian_lower
+    + _macedonian_lower
+)

 _uncased = (
     _bengali

@@ -230,7 +244,9 @@ _uncased = (
     + _cjk
 )

-ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _macedonian + _uncased)
+ALPHA = group_chars(
+    LATIN + _russian + _tatar + _greek + _ukrainian + _macedonian + _uncased
+)
 ALPHA_LOWER = group_chars(_lower + _uncased)
 ALPHA_UPPER = group_chars(_upper + _uncased)

spacy/lang/cs/__init__.py

@@ -1,18 +1,11 @@
 from .stop_words import STOP_WORDS
-from .tag_map import TAG_MAP
-from ...language import Language
-from ...attrs import LANG
 from .lex_attrs import LEX_ATTRS
 from ...language import Language

 class CzechDefaults(Language.Defaults):
-    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
-    lex_attr_getters.update(LEX_ATTRS)
-    lex_attr_getters[LANG] = lambda text: "cs"
-    tag_map = TAG_MAP
-    stop_words = STOP_WORDS
     lex_attr_getters = LEX_ATTRS
+    stop_words = STOP_WORDS

 class Czech(Language):

(File diff suppressed because it is too large)

spacy/lang/mk/lemmatizer.py

@@ -14,7 +14,7 @@ class MacedonianLemmatizer(Lemmatizer):
         if univ_pos in ("", "eol", "space"):
             return [string.lower()]

-        if string[-3:] == 'јќи':
+        if string[-3:] == "јќи":
             string = string[:-3]
             univ_pos = "verb"

@@ -23,7 +23,13 @@ class MacedonianLemmatizer(Lemmatizer):
         index_table = self.lookups.get_table("lemma_index", {})
         exc_table = self.lookups.get_table("lemma_exc", {})
         rules_table = self.lookups.get_table("lemma_rules", {})
-        if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))):
+        if not any(
+            (
+                index_table.get(univ_pos),
+                exc_table.get(univ_pos),
+                rules_table.get(univ_pos),
+            )
+        ):
             if univ_pos == "propn":
                 return [string]
             else:

spacy/lang/mk/lex_attrs.py

@@ -1,21 +1,104 @@
 from ...attrs import LIKE_NUM

 _num_words = [
-    "нула", "еден", "една", "едно", "два", "две", "три", "четири", "пет", "шест", "седум", "осум", "девет", "десет",
-    "единаесет", "дванаесет", "тринаесет", "четиринаесет", "петнаесет", "шеснаесет", "седумнаесет", "осумнаесет",
-    "деветнаесет", "дваесет", "триесет", "четириесет", "педесет", "шеесет", "седумдесет", "осумдесет", "деведесет",
-    "сто", "двесте", "триста", "четиристотини", "петстотини", "шестотини", "седумстотини", "осумстотини",
-    "деветстотини", "илјада", "илјади", 'милион', 'милиони', 'милијарда', 'милијарди', 'билион', 'билиони',
-    "двајца", "тројца", "четворица", "петмина", "шестмина", "седуммина", "осуммина", "деветмина", "обата", "обајцата",
-    "прв", "втор", "трет", "четврт", "седм", "осм", "двестоти",
-    "два-три", "два-триесет", "два-триесетмина", "два-тринаесет", "два-тројца", "две-три", "две-тристотини",
-    "пет-шеесет", "пет-шеесетмина", "пет-шеснаесетмина", "пет-шест", "пет-шестмина", "пет-шестотини", "петина",
-    "осмина", "седум-осум", "седум-осумдесет", "седум-осуммина", "седум-осумнаесет", "седум-осумнаесетмина",
-    "три-четириесет", "три-четиринаесет", "шеесет", "шеесетина", "шеесетмина", "шеснаесет", "шеснаесетмина",
-    "шест-седум", "шест-седумдесет", "шест-седумнаесет", "шест-седумстотини", "шестоти", "шестотини"
+    "нула",
+    "еден",
+    "една",
+    "едно",
+    "два",
+    "две",
+    "три",
+    "четири",
+    "пет",
+    "шест",
+    "седум",
+    "осум",
+    "девет",
+    "десет",
+    "единаесет",
+    "дванаесет",
+    "тринаесет",
+    "четиринаесет",
+    "петнаесет",
+    "шеснаесет",
+    "седумнаесет",
+    "осумнаесет",
+    "деветнаесет",
+    "дваесет",
+    "триесет",
+    "четириесет",
+    "педесет",
+    "шеесет",
+    "седумдесет",
+    "осумдесет",
+    "деведесет",
+    "сто",
+    "двесте",
+    "триста",
+    "четиристотини",
+    "петстотини",
+    "шестотини",
+    "седумстотини",
+    "осумстотини",
+    "деветстотини",
+    "илјада",
+    "илјади",
+    "милион",
+    "милиони",
+    "милијарда",
+    "милијарди",
+    "билион",
+    "билиони",
+    "двајца",
+    "тројца",
+    "четворица",
+    "петмина",
+    "шестмина",
+    "седуммина",
+    "осуммина",
+    "деветмина",
+    "обата",
+    "обајцата",
+    "прв",
+    "втор",
+    "трет",
+    "четврт",
+    "седм",
+    "осм",
+    "двестоти",
+    "два-три",
+    "два-триесет",
+    "два-триесетмина",
+    "два-тринаесет",
+    "два-тројца",
+    "две-три",
+    "две-тристотини",
+    "пет-шеесет",
+    "пет-шеесетмина",
+    "пет-шеснаесетмина",
+    "пет-шест",
+    "пет-шестмина",
+    "пет-шестотини",
+    "петина",
+    "осмина",
+    "седум-осум",
+    "седум-осумдесет",
+    "седум-осуммина",
+    "седум-осумнаесет",
+    "седум-осумнаесетмина",
+    "три-четириесет",
+    "три-четиринаесет",
+    "шеесет",
+    "шеесетина",
+    "шеесетмина",
+    "шеснаесет",
+    "шеснаесетмина",
+    "шест-седум",
+    "шест-седумдесет",
+    "шест-седумнаесет",
+    "шест-седумстотини",
+    "шестоти",
+    "шестотини",
 ]

spacy/lang/mk/tokenizer_exceptions.py

@@ -21,8 +21,7 @@ _abbr_exc = [
     {ORTH: "хл", NORM: "хектолитар"},
     {ORTH: "дкл", NORM: "декалитар"},
     {ORTH: "л", NORM: "литар"},
-    {ORTH: "дл", NORM: "децилитар"}
+    {ORTH: "дл", NORM: "децилитар"},
 ]
 for abbr in _abbr_exc:
     _exc[abbr[ORTH]] = [abbr]

@@ -33,7 +32,6 @@ _abbr_line_exc = [
     {ORTH: "г-ѓа", NORM: "госпоѓа"},
     {ORTH: "г-ца", NORM: "госпоѓица"},
     {ORTH: "г-дин", NORM: "господин"},
-
 ]

 for abbr in _abbr_line_exc:

@@ -54,7 +52,6 @@ _abbr_dot_exc = [
     {ORTH: "т.", NORM: "точка"},
     {ORTH: "т.е.", NORM: "то ест"},
     {ORTH: "т.н.", NORM: "таканаречен"},
-
     {ORTH: "бр.", NORM: "број"},
     {ORTH: "гр.", NORM: "град"},
     {ORTH: "др.", NORM: "другар"},

@@ -68,7 +65,6 @@ _abbr_dot_exc = [
     {ORTH: "с.", NORM: "страница"},
     {ORTH: "стр.", NORM: "страница"},
     {ORTH: "чл.", NORM: "член"},
-
     {ORTH: "арх.", NORM: "архитект"},
     {ORTH: "бел.", NORM: "белешка"},
     {ORTH: "гимн.", NORM: "гимназија"},

@@ -89,8 +85,6 @@ _abbr_dot_exc = [
     {ORTH: "истор.", NORM: "историја"},
     {ORTH: "геогр.", NORM: "географија"},
     {ORTH: "литер.", NORM: "литература"},
-
 ]

 for abbr in _abbr_dot_exc:

spacy/lang/tr/tokenizer_exceptions.py

@@ -45,7 +45,7 @@ _abbr_period_exc = [
     {ORTH: "Doç.", NORM: "doçent"},
     {ORTH: "doğ."},
     {ORTH: "Dr.", NORM: "doktor"},
-    {ORTH: "dr.", NORM:"doktor"},
+    {ORTH: "dr.", NORM: "doktor"},
     {ORTH: "drl.", NORM: "derleyen"},
     {ORTH: "Dz.", NORM: "deniz"},
     {ORTH: "Dz.K.K.lığı"},

@@ -118,7 +118,7 @@ _abbr_period_exc = [
     {ORTH: "Uzm.", NORM: "uzman"},
     {ORTH: "Üçvş.", NORM: "üstçavuş"},
     {ORTH: "Üni.", NORM: "üniversitesi"},
-    {ORTH: "Ütğm.", NORM:"üsteğmen"},
+    {ORTH: "Ütğm.", NORM: "üsteğmen"},
     {ORTH: "vb."},
     {ORTH: "vs.", NORM: "vesaire"},
     {ORTH: "Yard.", NORM: "yardımcı"},

@@ -163,19 +163,29 @@ for abbr in _abbr_exc:
     _exc[abbr[ORTH]] = [abbr]

 _num = r"[+-]?\d+([,.]\d+)*"
 _ord_num = r"(\d+\.)"
 _date = r"(((\d{1,2}[./-]){2})?(\d{4})|(\d{1,2}[./]\d{1,2}(\.)?))"
 _dash_num = r"(([{al}\d]+/\d+)|(\d+/[{al}]))".format(al=ALPHA)
 _roman_num = "M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"
 _roman_ord = r"({rn})\.".format(rn=_roman_num)
 _time_exp = r"\d+(:\d+)*"

 _inflections = r"'[{al}]+".format(al=ALPHA_LOWER)
 _abbrev_inflected = r"[{a}]+\.'[{al}]+".format(a=ALPHA, al=ALPHA_LOWER)

-_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(d=_date, dn=_dash_num, te=_time_exp, on=_ord_num, n=_num, ro=_roman_ord, rn=_roman_num, inf=_inflections)
+_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(
+    d=_date,
+    dn=_dash_num,
+    te=_time_exp,
+    on=_ord_num,
+    n=_num,
+    ro=_roman_ord,
+    rn=_roman_num,
+    inf=_inflections,
+)

 TOKENIZER_EXCEPTIONS = _exc

-TOKEN_MATCH = re.compile(r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)).match
+TOKEN_MATCH = re.compile(
+    r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)
+).match

spacy/ml/extract_ngrams.py

@@ -1,4 +1,3 @@
-import numpy
 from thinc.api import Model

 from ..attrs import LOWER

spacy/ml/models/parser.py

@@ -21,14 +21,14 @@ def transition_parser_v1(
@@ -42,14 +42,15 @@ def transition_parser_v2(
(whitespace-only: the argument lists of both build_tb_parser_model(...) calls
were re-indented; the arguments themselves are unchanged)
     nO: Optional[int] = None,
 ) -> Model:
     return build_tb_parser_model(
         tok2vec,
         state_type,
         extra_state_tokens,
         hidden_width,
         maxout_pieces,
         use_upper,
         nO,
     )

 def build_tb_parser_model(
     tok2vec: Model[List[Doc], List[Floats2d]],

@@ -162,8 +163,8 @@ def _resize_upper(model, new_nO):
     # just adding rows here.
     if smaller.has_dim("nO"):
         old_nO = smaller.get_dim("nO")
-        larger_W[: old_nO] = smaller_W
-        larger_b[: old_nO] = smaller_b
+        larger_W[:old_nO] = smaller_W
+        larger_b[:old_nO] = smaller_b
         for i in range(old_nO, new_nO):
             model.attrs["unseen_classes"].add(i)

spacy/ml/models/textcat.py

@@ -6,6 +6,7 @@ from thinc.api import chain, concatenate, clone, Dropout, ParametricAttention
 from thinc.api import SparseLinear, Softmax, softmax_activation, Maxout, reduce_sum
 from thinc.api import HashEmbed, with_array, with_cpu, uniqued
 from thinc.api import Relu, residual, expand_window
+from thinc.layers.chain import init as init_chain

 from ...attrs import ID, ORTH, PREFIX, SUFFIX, SHAPE, LOWER
 from ...util import registry

@@ -13,6 +14,7 @@ from ..extract_ngrams import extract_ngrams
 from ..staticvectors import StaticVectors
 from ..featureextractor import FeatureExtractor
 from ...tokens import Doc
+from .tok2vec import get_tok2vec_width

 @registry.architectures.register("spacy.TextCatCNN.v1")

@@ -69,13 +71,16 @@ def build_text_classifier_v2(
     exclusive_classes = not linear_model.attrs["multi_label"]
     with Model.define_operators({">>": chain, "|": concatenate}):
         width = tok2vec.maybe_get_dim("nO")
+        attention_layer = ParametricAttention(width)  # TODO: benchmark performance difference of this layer
+        maxout_layer = Maxout(nO=width, nI=width)
+        linear_layer = Linear(nO=nO, nI=width)
         cnn_model = (
             tok2vec
             >> list2ragged()
-            >> ParametricAttention(width)  # TODO: benchmark performance difference of this layer
+            >> attention_layer
             >> reduce_sum()
-            >> residual(Maxout(nO=width, nI=width))
-            >> Linear(nO=nO, nI=width)
+            >> residual(maxout_layer)
+            >> linear_layer
             >> Dropout(0.0)
         )

@@ -89,9 +94,25 @@ def build_text_classifier_v2(
     if model.has_dim("nO") is not False:
         model.set_dim("nO", nO)
     model.set_ref("output_layer", linear_model.get_ref("output_layer"))
+    model.set_ref("attention_layer", attention_layer)
+    model.set_ref("maxout_layer", maxout_layer)
+    model.set_ref("linear_layer", linear_layer)
     model.attrs["multi_label"] = not exclusive_classes
+    model.init = init_ensemble_textcat
     return model

+def init_ensemble_textcat(model, X, Y) -> Model:
+    tok2vec_width = get_tok2vec_width(model)
+    model.get_ref("attention_layer").set_dim("nO", tok2vec_width)
+    model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
+    model.get_ref("maxout_layer").set_dim("nI", tok2vec_width)
+    model.get_ref("linear_layer").set_dim("nI", tok2vec_width)
+    init_chain(model, X, Y)
+    return model

 # TODO: move to legacy
 @registry.architectures.register("spacy.TextCatEnsemble.v1")
 def build_text_classifier_v1(

spacy/ml/models/tok2vec.py

@@ -20,6 +20,17 @@ def tok2vec_listener_v1(width: int, upstream: str = "*"):
     return tok2vec

+def get_tok2vec_width(model: Model):
+    nO = None
+    if model.has_ref("tok2vec"):
+        tok2vec = model.get_ref("tok2vec")
+        if tok2vec.has_dim("nO"):
+            nO = tok2vec.get_dim("nO")
+        elif tok2vec.has_ref("listener"):
+            nO = tok2vec.get_ref("listener").get_dim("nO")
+    return nO

 @registry.architectures.register("spacy.HashEmbedCNN.v1")
 def build_hash_embed_cnn_tok2vec(
     *,

@@ -76,6 +87,7 @@ def build_hash_embed_cnn_tok2vec(
     )

+# TODO: archive
 @registry.architectures.register("spacy.Tok2Vec.v1")
 def build_Tok2Vec_model(
     embed: Model[List[Doc], List[Floats2d]],

@@ -97,6 +109,28 @@ def build_Tok2Vec_model(
     return tok2vec

+@registry.architectures.register("spacy.Tok2Vec.v2")
+def build_Tok2Vec_model(
+    embed: Model[List[Doc], List[Floats2d]],
+    encode: Model[List[Floats2d], List[Floats2d]],
+) -> Model[List[Doc], List[Floats2d]]:
+    """Construct a tok2vec model out of embedding and encoding subnetworks.
+    See https://explosion.ai/blog/deep-learning-formula-nlp
+
+    embed (Model[List[Doc], List[Floats2d]]): Embed tokens into context-independent
+        word vector representations.
+    encode (Model[List[Floats2d], List[Floats2d]]): Encode context into the
+        embeddings, using an architecture such as a CNN, BiLSTM or transformer.
+    """
+    tok2vec = chain(embed, encode)
+    tok2vec.set_dim("nO", encode.get_dim("nO"))
+    tok2vec.set_ref("embed", embed)
+    tok2vec.set_ref("encode", encode)
+    return tok2vec

 @registry.architectures.register("spacy.MultiHashEmbed.v1")
 def MultiHashEmbed(
     width: int,

@@ -244,6 +278,7 @@ def CharacterEmbed(
     return model

+# TODO: archive
 @registry.architectures.register("spacy.MaxoutWindowEncoder.v1")
 def MaxoutWindowEncoder(
     width: int, window_size: int, maxout_pieces: int, depth: int

@@ -275,7 +310,39 @@ def MaxoutWindowEncoder(
     model.attrs["receptive_field"] = window_size * depth
     return model

+@registry.architectures.register("spacy.MaxoutWindowEncoder.v2")
+def MaxoutWindowEncoder(
+    width: int, window_size: int, maxout_pieces: int, depth: int
+) -> Model[List[Floats2d], List[Floats2d]]:
+    """Encode context using convolutions with maxout activation, layer
+    normalization and residual connections.
+
+    width (int): The input and output width. These are required to be the same,
+        to allow residual connections. This value will be determined by the
+        width of the inputs. Recommended values are between 64 and 300.
+    window_size (int): The number of words to concatenate around each token
+        to construct the convolution. Recommended value is 1.
+    maxout_pieces (int): The number of maxout pieces to use. Recommended
+        values are 2 or 3.
+    depth (int): The number of convolutional layers. Recommended value is 4.
+    """
+    cnn = chain(
+        expand_window(window_size=window_size),
+        Maxout(
+            nO=width,
+            nI=width * ((window_size * 2) + 1),
+            nP=maxout_pieces,
+            dropout=0.0,
+            normalize=True,
+        ),
+    )
+    model = clone(residual(cnn), depth)
+    model.set_dim("nO", width)
+    receptive_field = window_size * depth
+    return with_array(model, pad=receptive_field)

+# TODO: archive
 @registry.architectures.register("spacy.MishWindowEncoder.v1")
 def MishWindowEncoder(
     width: int, window_size: int, depth: int

@@ -299,6 +366,29 @@ def MishWindowEncoder(
     return model

+@registry.architectures.register("spacy.MishWindowEncoder.v2")
+def MishWindowEncoder(
+    width: int, window_size: int, depth: int
+) -> Model[List[Floats2d], List[Floats2d]]:
+    """Encode context using convolutions with mish activation, layer
+    normalization and residual connections.
+
+    width (int): The input and output width. These are required to be the same,
+        to allow residual connections. This value will be determined by the
+        width of the inputs. Recommended values are between 64 and 300.
+    window_size (int): The number of words to concatenate around each token
+        to construct the convolution. Recommended value is 1.
+    depth (int): The number of convolutional layers. Recommended value is 4.
+    """
+    cnn = chain(
+        expand_window(window_size=window_size),
+        Mish(nO=width, nI=width * ((window_size * 2) + 1), dropout=0.0, normalize=True),
+    )
+    model = clone(residual(cnn), depth)
+    model.set_dim("nO", width)
+    return with_array(model)

 @registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
 def BiLSTMEncoder(
     width: int, depth: int, dropout: float

@@ -308,9 +398,9 @@ def BiLSTMEncoder(
     width (int): The input and output width. These are required to be the same,
         to allow residual connections. This value will be determined by the
         width of the inputs. Recommended values are between 64 and 300.
-    window_size (int): The number of words to concatenate around each token
-        to construct the convolution. Recommended value is 1.
-    depth (int): The number of convolutional layers. Recommended value is 4.
+    depth (int): The number of recurrent layers.
+    dropout (float): Creates a Dropout layer on the outputs of each LSTM layer
+        except the last layer. Set to 0 to disable this functionality.
     """
     if depth == 0:
         return noop()
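
The practical difference between spacy.Tok2Vec.v1 and v2 is that the composed model now exposes the encoder's output width as its own "nO" dimension, which is what the deferred initialization in the textcat ensemble relies on. A sketch resolving the new architectures from the registry; the widths, attrs and rows below are arbitrary illustration values, not defaults:

    from spacy import registry

    make_embed = registry.architectures.get("spacy.MultiHashEmbed.v1")
    make_encode = registry.architectures.get("spacy.MaxoutWindowEncoder.v2")
    make_tok2vec = registry.architectures.get("spacy.Tok2Vec.v2")

    embed = make_embed(
        width=96,
        attrs=["ORTH", "SHAPE"],
        rows=[5000, 2500],
        include_static_vectors=False,
    )
    encode = make_encode(width=96, window_size=1, maxout_pieces=3, depth=4)
    tok2vec = make_tok2vec(embed, encode)
    assert tok2vec.get_dim("nO") == 96  # v1 did not set this dimension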

spacy/ml/staticvectors.py

@@ -47,8 +47,7 @@ def forward(
     except ValueError:
         raise RuntimeError(Errors.E896)
     output = Ragged(
-        vectors_data,
-        model.ops.asarray([len(doc) for doc in docs], dtype="i")
+        vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i")
     )
     mask = None
     if is_train:

spacy/ml/tb_framework.py

@@ -1,8 +1,10 @@
-from thinc.api import Model, noop, use_ops, Linear
+from thinc.api import Model, noop
 from .parser_model import ParserStepModel

-def TransitionModel(tok2vec, lower, upper, resize_output, dropout=0.2, unseen_classes=set()):
+def TransitionModel(
+    tok2vec, lower, upper, resize_output, dropout=0.2, unseen_classes=set()
+):
     """Set up a stepwise transition-based model"""
     if upper is None:
         has_upper = False

@@ -44,4 +46,3 @@ def init(model, X=None, Y=None):
     if model.attrs["has_upper"]:
         statevecs = model.ops.alloc2f(2, lower.get_dim("nO"))
         model.get_ref("upper").initialize(X=statevecs)
-

spacy/morphology.pyx

@@ -133,8 +133,9 @@ cdef class Morphology:
         """
        cdef MorphAnalysisC tag
        tag.length = len(field_feature_pairs)
-       tag.fields = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
-       tag.features = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
+       if tag.length > 0:
+           tag.fields = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
+           tag.features = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
        for i, (field, feature) in enumerate(field_feature_pairs):
            tag.fields[i] = field
            tag.features[i] = feature

spacy/pipeline/__init__.py

@@ -11,6 +11,7 @@ from .senter import SentenceRecognizer
 from .sentencizer import Sentencizer
 from .tagger import Tagger
 from .textcat import TextCategorizer
+from .textcat_multilabel import MultiLabel_TextCategorizer
 from .tok2vec import Tok2Vec
 from .functions import merge_entities, merge_noun_chunks, merge_subtokens

@@ -22,13 +23,14 @@ __all__ = [
     "EntityRuler",
     "Morphologizer",
     "Lemmatizer",
-    "TrainablePipe",
+    "MultiLabel_TextCategorizer",
     "Pipe",
     "SentenceRecognizer",
     "Sentencizer",
     "Tagger",
     "TextCategorizer",
     "Tok2Vec",
+    "TrainablePipe",
     "merge_entities",
     "merge_noun_chunks",
     "merge_subtokens",

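The new component class is re-exported at the package level (and __all__ is re-sorted alphabetically in the process), so user code can import it directly:

    from spacy.pipeline import MultiLabel_TextCategorizer
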
spacy/pipeline/_parser_internals/_beam_utils.pyx

@@ -255,7 +255,7 @@ def get_gradient(nr_class, beam_maps, histories, losses):
     for a beam state -- so we have "the gradient of loss for taking
     action i given history H."

-    Histories: Each hitory is a list of actions
+    Histories: Each history is a list of actions
     Each candidate has a history
     Each beam has multiple candidates
     Each batch has multiple beams

spacy/pipeline/_parser_internals/arc_eager.pxd

@@ -4,4 +4,4 @@ from .transition_system cimport Transition, TransitionSystem

 cdef class ArcEager(TransitionSystem):
-    pass
+    cdef get_arcs(self, StateC* state)

spacy/pipeline/_parser_internals/arc_eager.pyx

@@ -1,6 +1,7 @@
 # cython: profile=True, cdivision=True, infer_types=True
 from cymem.cymem cimport Pool, Address
 from libc.stdint cimport int32_t
+from libcpp.vector cimport vector

 from collections import defaultdict, Counter

@@ -10,9 +11,9 @@ from ...structs cimport TokenC
 from ...tokens.doc cimport Doc, set_children_from_heads
 from ...training.example cimport Example
 from .stateclass cimport StateClass
-from ._state cimport StateC
+from ._state cimport StateC, ArcC
 from ...errors import Errors
+from thinc.extra.search cimport Beam

 cdef weight_t MIN_SCORE = -90000
 cdef attr_t SUBTOK_LABEL = hash_string(u'subtok')

@@ -65,6 +66,7 @@ cdef GoldParseStateC create_gold_state(Pool mem, const StateC* state,
     cdef GoldParseStateC gs
     gs.length = len(heads)
     gs.stride = 1
+    assert gs.length > 0
     gs.labels = <attr_t*>mem.alloc(gs.length, sizeof(gs.labels[0]))
     gs.heads = <int32_t*>mem.alloc(gs.length, sizeof(gs.heads[0]))
     gs.n_kids = <int32_t*>mem.alloc(gs.length, sizeof(gs.n_kids[0]))

@@ -126,6 +128,7 @@ cdef GoldParseStateC create_gold_state(Pool mem, const StateC* state,
         1
     )
     # Make an array of pointers, pointing into the gs_kids_flat array.
+    assert gs.length > 0
     gs.kids = <int32_t**>mem.alloc(gs.length, sizeof(int32_t*))
     for i in range(gs.length):
         if gs.n_kids[i] != 0:

@@ -609,7 +612,7 @@ cdef class ArcEager(TransitionSystem):
         return gold

     def init_gold_batch(self, examples):
-        # TODO: Projectivitity?
+        # TODO: Projectivity?
         all_states = self.init_batch([eg.predicted for eg in examples])
         golds = []
         states = []

@@ -705,6 +708,28 @@ cdef class ArcEager(TransitionSystem):
                 doc.c[i].dep = self.root_label
         set_children_from_heads(doc.c, 0, doc.length)

+    def get_beam_parses(self, Beam beam):
+        parses = []
+        probs = beam.probs
+        for i in range(beam.size):
+            state = <StateC*>beam.at(i)
+            if state.is_final():
+                prob = probs[i]
+                parse = []
+                arcs = self.get_arcs(state)
+                if arcs:
+                    for arc in arcs:
+                        dep = arc["label"]
+                        label = self.strings[dep]
+                        parse.append((arc["head"], arc["child"], label))
+                parses.append((prob, parse))
+        return parses
+
+    cdef get_arcs(self, StateC* state):
+        cdef vector[ArcC] arcs
+        state.get_arcs(&arcs)
+        return list(arcs)
+
     def has_gold(self, Example eg, start=0, end=None):
         for word in eg.y[start:end]:
             if word.dep != 0:

spacy/pipeline/_parser_internals/ner.pyx

@@ -2,6 +2,7 @@ from libc.stdint cimport int32_t
 from cymem.cymem cimport Pool

 from collections import Counter
+from thinc.extra.search cimport Beam

 from ...tokens.doc cimport Doc
 from ...tokens.span import Span

@@ -63,6 +64,7 @@ cdef GoldNERStateC create_gold_state(
     Example example
 ) except *:
     cdef GoldNERStateC gs
+    assert example.x.length > 0
     gs.ner = <Transition*>mem.alloc(example.x.length, sizeof(Transition))
     ner_tags = example.get_aligned_ner()
     for i, ner_tag in enumerate(ner_tags):

@@ -245,6 +247,21 @@ cdef class BiluoPushDown(TransitionSystem):
             if doc.c[i].ent_iob == 0:
                 doc.c[i].ent_iob = 2

+    def get_beam_parses(self, Beam beam):
+        parses = []
+        probs = beam.probs
+        for i in range(beam.size):
+            state = <StateC*>beam.at(i)
+            if state.is_final():
+                prob = probs[i]
+                parse = []
+                for j in range(state._ents.size()):
+                    ent = state._ents.at(j)
+                    if ent.start != -1 and ent.end != -1:
+                        parse.append((ent.start, ent.end, self.strings[ent.label]))
+                parses.append((prob, parse))
+        return parses
+
     def init_gold(self, StateClass state, Example example):
         return BiluoGold(self, state, example)

spacy/pipeline/attributeruler.py

@@ -226,6 +226,7 @@ class AttributeRuler(Pipe):

         DOCS: https://nightly.spacy.io/api/tagger#score
         """
+
         def morph_key_getter(token, attr):
             return getattr(token, attr).key

@@ -240,8 +241,16 @@ class AttributeRuler(Pipe):
         elif attr == POS:
             results.update(Scorer.score_token_attr(examples, "pos", **kwargs))
         elif attr == MORPH:
-            results.update(Scorer.score_token_attr(examples, "morph", getter=morph_key_getter, **kwargs))
-            results.update(Scorer.score_token_attr_per_feat(examples, "morph", getter=morph_key_getter, **kwargs))
+            results.update(
+                Scorer.score_token_attr(
+                    examples, "morph", getter=morph_key_getter, **kwargs
+                )
+            )
+            results.update(
+                Scorer.score_token_attr_per_feat(
+                    examples, "morph", getter=morph_key_getter, **kwargs
+                )
+            )
         elif attr == LEMMA:
             results.update(Scorer.score_token_attr(examples, "lemma", **kwargs))
         return results

spacy/pipeline/dep_parser.pyx

@@ -1,4 +1,5 @@
 # cython: infer_types=True, profile=True, binding=True
+from collections import defaultdict
 from typing import Optional, Iterable
 from thinc.api import Model, Config

@@ -258,3 +259,20 @@ cdef class DependencyParser(Parser):
         results.update(Scorer.score_deps(examples, "dep", **kwargs))
         del results["sents_per_type"]
         return results
+
+    def scored_parses(self, beams):
+        """Return two dictionaries with scores for each beam/doc that was processed:
+        one containing (i, head) keys, and another containing (i, label) keys.
+        """
+        head_scores = []
+        label_scores = []
+        for beam in beams:
+            score_head_dict = defaultdict(float)
+            score_label_dict = defaultdict(float)
+            for score, parses in self.moves.get_beam_parses(beam):
+                for head, i, label in parses:
+                    score_head_dict[(i, head)] += score
+                    score_label_dict[(i, label)] += score
+            head_scores.append(score_head_dict)
+            label_scores.append(score_label_dict)
+        return head_scores, label_scores
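
How scored_parses is meant to be consumed, as a sketch only: the beams would come from a beam-decoding pass over some docs, which this diff does not show, so the parser and beams variables below are hypothetical:

    # hypothetical: `parser` is the pipeline's DependencyParser and `beams`
    # were produced by a beam-decoding pass over a batch of docs
    head_scores, label_scores = parser.scored_parses(beams)
    for i, heads in enumerate(head_scores):
        # heads maps (token_i, head_i) -> accumulated probability of that arc
        best_arc = max(heads, key=heads.get)
        print(f"doc {i}: most confident arc {best_arc} ({heads[best_arc]:.3f})")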

spacy/pipeline/morphologizer.pyx

@@ -24,7 +24,7 @@ default_model_config = """
 @architectures = "spacy.Tagger.v1"

 [model.tok2vec]
-@architectures = "spacy.Tok2Vec.v1"
+@architectures = "spacy.Tok2Vec.v2"

 [model.tok2vec.embed]
 @architectures = "spacy.CharacterEmbed.v1"

@@ -35,7 +35,7 @@ nC = 8
 include_static_vectors = false

 [model.tok2vec.encode]
-@architectures = "spacy.MaxoutWindowEncoder.v1"
+@architectures = "spacy.MaxoutWindowEncoder.v2"
 width = 128
 depth = 4
 window_size = 1

spacy/pipeline/ner.pyx

@@ -1,4 +1,5 @@
 # cython: infer_types=True, profile=True, binding=True
+from collections import defaultdict
 from typing import Optional, Iterable
 from thinc.api import Model, Config

@@ -197,3 +198,16 @@ cdef class EntityRecognizer(Parser):
         """
         validate_examples(examples, "EntityRecognizer.score")
         return get_ner_prf(examples)
+
+    def scored_ents(self, beams):
+        """Return a dictionary of (start, end, label) tuples with corresponding scores
+        for each beam/doc that was processed.
+        """
+        entity_scores = []
+        for beam in beams:
+            score_dict = defaultdict(float)
+            for score, ents in self.moves.get_beam_parses(beam):
+                for start, end, label in ents:
+                    score_dict[(start, end, label)] += score
+            entity_scores.append(score_dict)
+        return entity_scores

spacy/pipeline/tagger.pyx

@@ -256,8 +256,14 @@ class Tagger(TrainablePipe):
         DOCS: https://nightly.spacy.io/api/tagger#get_loss
         """
         validate_examples(examples, "Tagger.get_loss")
-        loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False, missing_value="")
-        truths = [eg.get_aligned("TAG", as_string=True) for eg in examples]
+        loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False)
+        # Convert empty tag "" to missing value None so that both misaligned
+        # tokens and tokens with missing annotation have the default missing
+        # value None.
+        truths = []
+        for eg in examples:
+            eg_truths = [tag if tag != "" else None for tag in eg.get_aligned("TAG", as_string=True)]
+            truths.append(eg_truths)
         d_scores, loss = loss_func(scores, truths)
         if self.model.ops.xp.isnan(loss):
             raise ValueError(Errors.E910.format(name=self.name))
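
The effect of the change, in isolation: previously the loss was told that "" is the missing value; now empty tags are mapped to None, the default missing value of SequenceCategoricalCrossentropy, so misaligned tokens and genuinely unannotated tokens are handled the same way:

    # sketch: empty aligned tags become None before the loss sees them
    aligned_tags = ["NN", "", "VBZ"]  # "" = no gold tag for this token
    truths = [tag if tag != "" else None for tag in aligned_tags]
    assert truths == ["NN", None, "VBZ"]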

View File

@ -14,12 +14,12 @@ from ..tokens import Doc
from ..vocab import Vocab from ..vocab import Vocab
default_model_config = """ single_label_default_config = """
[model] [model]
@architectures = "spacy.TextCatEnsemble.v2" @architectures = "spacy.TextCatEnsemble.v2"
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.Tok2Vec.v1" @architectures = "spacy.Tok2Vec.v2"
[model.tok2vec.embed] [model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1" @architectures = "spacy.MultiHashEmbed.v1"
@ -29,7 +29,7 @@ attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false include_static_vectors = false
[model.tok2vec.encode] [model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1" @architectures = "spacy.MaxoutWindowEncoder.v2"
width = ${model.tok2vec.embed.width} width = ${model.tok2vec.embed.width}
window_size = 1 window_size = 1
maxout_pieces = 3 maxout_pieces = 3
@ -37,24 +37,24 @@ depth = 2
[model.linear_model] [model.linear_model]
@architectures = "spacy.TextCatBOW.v1" @architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false exclusive_classes = true
ngram_size = 1 ngram_size = 1
no_output_layer = false no_output_layer = false
""" """
DEFAULT_TEXTCAT_MODEL = Config().from_str(default_model_config)["model"] DEFAULT_SINGLE_TEXTCAT_MODEL = Config().from_str(single_label_default_config)["model"]
bow_model_config = """ single_label_bow_config = """
[model] [model]
@architectures = "spacy.TextCatBOW.v1" @architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false exclusive_classes = true
ngram_size = 1 ngram_size = 1
no_output_layer = false no_output_layer = false
""" """
cnn_model_config = """ single_label_cnn_config = """
[model] [model]
@architectures = "spacy.TextCatCNN.v1" @architectures = "spacy.TextCatCNN.v1"
exclusive_classes = false exclusive_classes = true
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1" @architectures = "spacy.HashEmbedCNN.v1"
@ -71,7 +71,7 @@ subword_features = true
@Language.factory( @Language.factory(
"textcat", "textcat",
assigns=["doc.cats"], assigns=["doc.cats"],
default_config={"threshold": 0.5, "model": DEFAULT_TEXTCAT_MODEL}, default_config={"threshold": 0.5, "model": DEFAULT_SINGLE_TEXTCAT_MODEL},
default_score_weights={ default_score_weights={
"cats_score": 1.0, "cats_score": 1.0,
"cats_score_desc": None, "cats_score_desc": None,
@ -103,7 +103,7 @@ def make_textcat(
class TextCategorizer(TrainablePipe): class TextCategorizer(TrainablePipe):
"""Pipeline component for text classification. """Pipeline component for single-label text classification.
DOCS: https://nightly.spacy.io/api/textcategorizer DOCS: https://nightly.spacy.io/api/textcategorizer
""" """
@ -111,7 +111,7 @@ class TextCategorizer(TrainablePipe):
def __init__( def __init__(
self, vocab: Vocab, model: Model, name: str = "textcat", *, threshold: float self, vocab: Vocab, model: Model, name: str = "textcat", *, threshold: float
) -> None: ) -> None:
"""Initialize a text categorizer. """Initialize a text categorizer for single-label classification.
vocab (Vocab): The shared vocabulary. vocab (Vocab): The shared vocabulary.
model (thinc.api.Model): The Thinc Model powering the pipeline component. model (thinc.api.Model): The Thinc Model powering the pipeline component.
@ -214,6 +214,7 @@ class TextCategorizer(TrainablePipe):
losses = {} losses = {}
losses.setdefault(self.name, 0.0) losses.setdefault(self.name, 0.0)
validate_examples(examples, "TextCategorizer.update") validate_examples(examples, "TextCategorizer.update")
self._validate_categories(examples)
if not any(len(eg.predicted) if eg.predicted else 0 for eg in examples): if not any(len(eg.predicted) if eg.predicted else 0 for eg in examples):
# Handle cases where there are no tokens in any docs. # Handle cases where there are no tokens in any docs.
return losses return losses
@ -256,6 +257,7 @@ class TextCategorizer(TrainablePipe):
if self._rehearsal_model is None: if self._rehearsal_model is None:
return losses return losses
validate_examples(examples, "TextCategorizer.rehearse") validate_examples(examples, "TextCategorizer.rehearse")
self._validate_categories(examples)
docs = [eg.predicted for eg in examples] docs = [eg.predicted for eg in examples]
if not any(len(doc) for doc in docs): if not any(len(doc) for doc in docs):
# Handle cases where there are no tokens in any docs. # Handle cases where there are no tokens in any docs.
@ -296,6 +298,7 @@ class TextCategorizer(TrainablePipe):
DOCS: https://nightly.spacy.io/api/textcategorizer#get_loss DOCS: https://nightly.spacy.io/api/textcategorizer#get_loss
""" """
validate_examples(examples, "TextCategorizer.get_loss") validate_examples(examples, "TextCategorizer.get_loss")
self._validate_categories(examples)
truths, not_missing = self._examples_to_truth(examples) truths, not_missing = self._examples_to_truth(examples)
not_missing = self.model.ops.asarray(not_missing) not_missing = self.model.ops.asarray(not_missing)
d_scores = (scores - truths) / scores.shape[0] d_scores = (scores - truths) / scores.shape[0]
@ -341,6 +344,7 @@ class TextCategorizer(TrainablePipe):
DOCS: https://nightly.spacy.io/api/textcategorizer#initialize DOCS: https://nightly.spacy.io/api/textcategorizer#initialize
""" """
validate_get_examples(get_examples, "TextCategorizer.initialize") validate_get_examples(get_examples, "TextCategorizer.initialize")
self._validate_categories(get_examples())
if labels is None: if labels is None:
for example in get_examples(): for example in get_examples():
for cat in example.y.cats: for cat in example.y.cats:
@ -373,12 +377,20 @@ class TextCategorizer(TrainablePipe):
DOCS: https://nightly.spacy.io/api/textcategorizer#score DOCS: https://nightly.spacy.io/api/textcategorizer#score
""" """
validate_examples(examples, "TextCategorizer.score") validate_examples(examples, "TextCategorizer.score")
self._validate_categories(examples)
return Scorer.score_cats( return Scorer.score_cats(
examples, examples,
"cats", "cats",
labels=self.labels, labels=self.labels,
multi_label=self.model.attrs["multi_label"], multi_label=False,
positive_label=self.cfg["positive_label"], positive_label=self.cfg["positive_label"],
threshold=self.cfg["threshold"], threshold=self.cfg["threshold"],
**kwargs, **kwargs,
) )
def _validate_categories(self, examples: List[Example]):
"""Check whether the provided examples all have single-label cats annotations."""
for ex in examples:
if list(ex.reference.cats.values()).count(1.0) > 1:
raise ValueError(Errors.E895.format(value=ex.reference.cats))
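Editor's note: a minimal sketch (not part of this commit; pipe usage and label names are illustrative) of how the single-label constraint surfaces when preparing training data:

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")  # the single-label component above
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# At most one category may be 1.0 per example; annotations where two or more
# categories are 1.0 are now rejected by _validate_categories() with E895.
doc = nlp.make_doc("I loved this film")
example = Example.from_dict(doc, {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})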

View File

@@ -0,0 +1,191 @@
from itertools import islice
from typing import Iterable, Optional, Dict, List, Callable, Any
from thinc.api import Model, Config
from thinc.types import Floats2d
from ..language import Language
from ..training import Example, validate_examples, validate_get_examples
from ..errors import Errors
from ..scorer import Scorer
from ..tokens import Doc
from ..vocab import Vocab
from .textcat import TextCategorizer
multi_label_default_config = """
[model]
@architectures = "spacy.TextCatEnsemble.v2"
[model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
[model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 64
rows = [2000, 2000, 1000, 1000, 1000, 1000]
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false
[model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = ${model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3
depth = 2
[model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
"""
DEFAULT_MULTI_TEXTCAT_MODEL = Config().from_str(multi_label_default_config)["model"]
multi_label_bow_config = """
[model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
"""
multi_label_cnn_config = """
[model]
@architectures = "spacy.TextCatCNN.v1"
exclusive_classes = false
[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
"""


@Language.factory(
    "textcat_multilabel",
    assigns=["doc.cats"],
    default_config={"threshold": 0.5, "model": DEFAULT_MULTI_TEXTCAT_MODEL},
    default_score_weights={
        "cats_score": 1.0,
        "cats_score_desc": None,
        "cats_micro_p": None,
        "cats_micro_r": None,
        "cats_micro_f": None,
        "cats_macro_p": None,
        "cats_macro_r": None,
        "cats_macro_f": None,
        "cats_macro_auc": None,
        "cats_f_per_type": None,
        "cats_macro_auc_per_type": None,
    },
)
def make_multilabel_textcat(
    nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float
) -> "TextCategorizer":
    """Create a TextCategorizer component. The text categorizer predicts categories
    over a whole document. It can learn one or more labels, and the labels can
    be mutually exclusive (i.e. one true label per doc) or non-mutually exclusive
    (i.e. zero or more labels may be true per doc). The multi-label setting is
    controlled by the model instance that's provided.

    model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
        scores for each category.
    threshold (float): Cutoff to consider a prediction "positive".
    """
    return MultiLabel_TextCategorizer(nlp.vocab, model, name, threshold=threshold)


class MultiLabel_TextCategorizer(TextCategorizer):
    """Pipeline component for multi-label text classification.

    DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer
    """

    def __init__(
        self,
        vocab: Vocab,
        model: Model,
        name: str = "textcat_multilabel",
        *,
        threshold: float,
    ) -> None:
        """Initialize a text categorizer for multi-label classification.

        vocab (Vocab): The shared vocabulary.
        model (thinc.api.Model): The Thinc Model powering the pipeline component.
        name (str): The component instance name, used to add entries to the
            losses during training.
        threshold (float): Cutoff to consider a prediction "positive".

        DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer#init
        """
        self.vocab = vocab
        self.model = model
        self.name = name
        self._rehearsal_model = None
        cfg = {"labels": [], "threshold": threshold}
        self.cfg = dict(cfg)

    def initialize(
        self,
        get_examples: Callable[[], Iterable[Example]],
        *,
        nlp: Optional[Language] = None,
        labels: Optional[Dict] = None,
    ):
        """Initialize the pipe for training, using a representative set
        of data examples.

        get_examples (Callable[[], Iterable[Example]]): Function that
            returns a representative sample of gold-standard Example objects.
        nlp (Language): The current nlp object the component is part of.
        labels: The labels to add to the component, typically generated by the
            `init labels` command. If no labels are provided, the get_examples
            callback is used to extract the labels from the data.

        DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer#initialize
        """
        validate_get_examples(get_examples, "MultiLabel_TextCategorizer.initialize")
        if labels is None:
            for example in get_examples():
                for cat in example.y.cats:
                    self.add_label(cat)
        else:
            for label in labels:
                self.add_label(label)
        subbatch = list(islice(get_examples(), 10))
        doc_sample = [eg.reference for eg in subbatch]
        label_sample, _ = self._examples_to_truth(subbatch)
        self._require_labels()
        assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
        assert len(label_sample) > 0, Errors.E923.format(name=self.name)
        self.model.initialize(X=doc_sample, Y=label_sample)

    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
        """Score a batch of examples.

        examples (Iterable[Example]): The examples to score.
        RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.

        DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer#score
        """
        validate_examples(examples, "MultiLabel_TextCategorizer.score")
        return Scorer.score_cats(
            examples,
            "cats",
            labels=self.labels,
            multi_label=True,
            threshold=self.cfg["threshold"],
            **kwargs,
        )

    def _validate_categories(self, examples: List[Example]):
        """This component allows any type of single- or multi-label annotations.
        This method overrides the stricter one from 'textcat'."""
        pass
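Editor's note: a minimal usage sketch for the new factory (label names are illustrative, not from this commit):

import spacy

nlp = spacy.blank("en")
# Registers under the "textcat_multilabel" name defined by the factory above;
# labels are non-exclusive, so a doc may receive several categories at once.
textcat = nlp.add_pipe("textcat_multilabel", config={"threshold": 0.5})
textcat.add_label("SPORTS")
textcat.add_label("POLITICS")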

View File

@@ -3,7 +3,7 @@ import numpy as np
from collections import defaultdict
from .training import Example
-from .tokens import Token, Doc, Span, MorphAnalysis
+from .tokens import Token, Doc, Span
from .errors import Errors
from .util import get_lang_class, SimpleFrozenList
from .morphology import Morphology
@@ -176,7 +176,7 @@ class Scorer:
            "token_acc": None,
            "token_p": None,
            "token_r": None,
-            "token_f": None
+            "token_f": None,
        }

    @staticmethod
@@ -276,7 +276,10 @@ class Scorer:
                if gold_i not in missing_indices:
                    value = getter(token, attr)
                    morph = gold_doc.vocab.strings[value]
-                    if value not in missing_values and morph != Morphology.EMPTY_MORPH:
+                    if (
+                        value not in missing_values
+                        and morph != Morphology.EMPTY_MORPH
+                    ):
                        for feat in morph.split(Morphology.FEATURE_SEP):
                            field, values = feat.split(Morphology.FIELD_SEP)
                            if field not in per_feat:
@@ -367,7 +370,6 @@ class Scorer:
            f"{attr}_per_type": None,
        }

    @staticmethod
    def score_cats(
        examples: Iterable[Example],
@@ -458,7 +460,7 @@ class Scorer:
                    gold_label, gold_score = max(gold_cats, key=lambda it: it[1])
                    if gold_score is not None and gold_score > 0:
                        f_per_type[gold_label].fn += 1
-                else:
+                elif pred_cats:
                    pred_label, pred_score = max(pred_cats, key=lambda it: it[1])
                    if pred_score >= threshold:
                        f_per_type[pred_label].fp += 1
@@ -473,7 +475,10 @@ class Scorer:
        macro_f = sum(prf.fscore for prf in f_per_type.values()) / n_cats
        # Limit macro_auc to those labels with gold annotations,
        # but still divide by all cats to avoid artificial boosting of datasets with missing labels
-        macro_auc = sum(auc.score if auc.is_binary() else 0.0 for auc in auc_per_type.values()) / n_cats
+        macro_auc = (
+            sum(auc.score if auc.is_binary() else 0.0 for auc in auc_per_type.values())
+            / n_cats
+        )
        results = {
            f"{attr}_score": None,
            f"{attr}_score_desc": None,
@@ -485,7 +490,9 @@ class Scorer:
            f"{attr}_macro_f": macro_f,
            f"{attr}_macro_auc": macro_auc,
            f"{attr}_f_per_type": {k: v.to_dict() for k, v in f_per_type.items()},
-            f"{attr}_auc_per_type": {k: v.score if v.is_binary() else None for k, v in auc_per_type.items()},
+            f"{attr}_auc_per_type": {
+                k: v.score if v.is_binary() else None for k, v in auc_per_type.items()
+            },
        }
        if len(labels) == 2 and not multi_label and positive_label:
            positive_label_f = results[f"{attr}_f_per_type"][positive_label]["f"]
@@ -675,8 +682,7 @@ class Scorer:
def get_ner_prf(examples: Iterable[Example]) -> Dict[str, Any]:
-    """Compute micro-PRF and per-entity PRF scores for a sequence of examples.
-    """
+    """Compute micro-PRF and per-entity PRF scores for a sequence of examples."""
    score_per_type = defaultdict(PRFScore)
    for eg in examples:
        if not eg.y.has_annotation("ENT_IOB"):
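Editor's note: with the `elif pred_cats:` guard above, score_cats no longer calls max() on an empty predicted-cats sequence. A hedged sketch of calling it directly (the `examples` list and labels here are assumed, not from this commit):

from spacy.scorer import Scorer

# `examples` is an Iterable[Example], assumed to be built elsewhere.
scores = Scorer.score_cats(
    examples,
    "cats",
    labels=["POSITIVE", "NEGATIVE"],
    multi_label=False,
    positive_label="POSITIVE",
    threshold=0.5,
)
print(scores["cats_score"], scores["cats_f_per_type"])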

View File

@@ -154,10 +154,10 @@ def test_doc_api_serialize(en_tokenizer, text):
    logger = logging.getLogger("spacy")
    with mock.patch.object(logger, "warning") as mock_warning:
-        _ = tokens.to_bytes()
+        _ = tokens.to_bytes()  # noqa: F841
        mock_warning.assert_not_called()
        tokens.user_hooks["similarity"] = inner_func
-        _ = tokens.to_bytes()
+        _ = tokens.to_bytes()  # noqa: F841
        mock_warning.assert_called_once()

View File

@@ -21,11 +21,13 @@ def test_doc_retokenize_merge(en_tokenizer):
    assert doc[4].text == "the beach boys"
    assert doc[4].text_with_ws == "the beach boys "
    assert doc[4].tag_ == "NAMED"
+    assert doc[4].lemma_ == "LEMMA"
    assert str(doc[4].morph) == "Number=Plur"
    assert doc[5].text == "all night"
    assert doc[5].text_with_ws == "all night"
    assert doc[5].tag_ == "NAMED"
    assert str(doc[5].morph) == "Number=Plur"
+    assert doc[5].lemma_ == "LEMMA"


def test_doc_retokenize_merge_children(en_tokenizer):
@@ -103,25 +105,29 @@ def test_doc_retokenize_spans_merge_tokens(en_tokenizer):
def test_doc_retokenize_spans_merge_tokens_default_attrs(en_vocab):
    words = ["The", "players", "start", "."]
+    lemmas = [t.lower() for t in words]
    heads = [1, 2, 2, 2]
    tags = ["DT", "NN", "VBZ", "."]
    pos = ["DET", "NOUN", "VERB", "PUNCT"]
-    doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads)
+    doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads, lemmas=lemmas)
    assert len(doc) == 4
    assert doc[0].text == "The"
    assert doc[0].tag_ == "DT"
    assert doc[0].pos_ == "DET"
+    assert doc[0].lemma_ == "the"
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[0:2])
    assert len(doc) == 3
    assert doc[0].text == "The players"
    assert doc[0].tag_ == "NN"
    assert doc[0].pos_ == "NOUN"
+    assert doc[0].lemma_ == "the players"
-    doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads)
+    doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads, lemmas=lemmas)
    assert len(doc) == 4
    assert doc[0].text == "The"
    assert doc[0].tag_ == "DT"
    assert doc[0].pos_ == "DET"
+    assert doc[0].lemma_ == "the"
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[0:2])
        retokenizer.merge(doc[2:4])
@@ -129,9 +135,11 @@ def test_doc_retokenize_spans_merge_tokens_default_attrs(en_vocab):
    assert doc[0].text == "The players"
    assert doc[0].tag_ == "NN"
    assert doc[0].pos_ == "NOUN"
+    assert doc[0].lemma_ == "the players"
    assert doc[1].text == "start ."
    assert doc[1].tag_ == "VBZ"
    assert doc[1].pos_ == "VERB"
+    assert doc[1].lemma_ == "start ."


def test_doc_retokenize_spans_merge_heads(en_vocab):
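Editor's note: as the assertions above show, merging now also derives a default lemma by joining the existing lemmas of the merged tokens. A small sketch (words and lemmas are illustrative):

from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["New", "York"], lemmas=["new", "york"])
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])
assert doc[0].lemma_ == "new york"  # joined from the original lemmas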

View File

@@ -39,6 +39,36 @@ def test_doc_retokenize_split(en_vocab):
    assert len(str(doc)) == 19


+def test_doc_retokenize_split_lemmas(en_vocab):
+    # If lemmas are not set, leave unset
+    words = ["LosAngeles", "start", "."]
+    heads = [1, 2, 2]
+    doc = Doc(en_vocab, words=words, heads=heads)
+    with doc.retokenize() as retokenizer:
+        retokenizer.split(
+            doc[0],
+            ["Los", "Angeles"],
+            [(doc[0], 1), doc[1]],
+        )
+    assert doc[0].lemma_ == ""
+    assert doc[1].lemma_ == ""
+    # If lemmas are set, use split orth as default lemma
+    words = ["LosAngeles", "start", "."]
+    heads = [1, 2, 2]
+    doc = Doc(en_vocab, words=words, heads=heads)
+    for t in doc:
+        t.lemma_ = "a"
+    with doc.retokenize() as retokenizer:
+        retokenizer.split(
+            doc[0],
+            ["Los", "Angeles"],
+            [(doc[0], 1), doc[1]],
+        )
+    assert doc[0].lemma_ == "Los"
+    assert doc[1].lemma_ == "Angeles"
+
+
def test_doc_retokenize_split_dependencies(en_vocab):
    doc = Doc(en_vocab, words=["LosAngeles", "start", "."])
    dep1 = doc.vocab.strings.add("amod")

View File

@@ -113,9 +113,8 @@ def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms):
    assert [token.norm_ for token in tokens] == norms


-@pytest.mark.skip
@pytest.mark.parametrize(
-    "text,norm", [("radicalised", "radicalized"), ("cuz", "because")]
+    "text,norm", [("Jan.", "January"), ("'cuz", "because")]
)
def test_en_lex_attrs_norm_exceptions(en_tokenizer, text, norm):
    tokens = en_tokenizer(text)

View File

@@ -4,21 +4,21 @@ from spacy.lang.mk.lex_attrs import like_num
def test_tokenizer_handles_long_text(mk_tokenizer):
    text = """
Во организациските работи или на нашите собранија со членството, никој од нас не зборуваше за
организацијата и идеологијата. Работна беше нашата работа, а не идеолошка. Што се однесува до социјализмот на
Делчев, неговата дејност зборува сама за себе - спротивно. Во суштина, водачите си имаа свои основни погледи и
свои разбирања за положбата и работите, коишто стоеја пред нив и ги завршуваа со голема упорност, настојчивост и
насоченост. Значи, идеологија имаше, само што нивната идеологија имаше своја оригиналност. Македонија денеска,
чиста рожба на животот и положбата во Македонија, кои му служеа како база на неговите побуди, беше дејност која
имаше потреба од ум за да си најде своја смисла. Таквата идеологија и заемното дејство на умот и срцето му
помогнаа на Делчев да не се занесе по патот на својата идеологија... Во суштина, Организацијата и нејзините
водачи имаа свои разбирања за работите и положбата во идеен поглед, но тоа беше врската, животот и положбата во
Македонија и го внесуваа во својата идеологија гласот на своето срце, и на крај, прибегнуваа до умот,
за да најдат смисла или да ѝ дадат. Тоа содејство и заемен сооднос на умот и срцето му помогнаа на Делчев да ја
држи својата идеологија во сообразност со положбата на работите... Водачите навистина направија една жртва
бидејќи на населението не му зборуваа за своите мисли и идеи. Тие се одрекоа од секаква субјективност во своите
мисли. Целта беше да не се зголемуваат целите и задачите како и преданоста во работата. Населението не можеше да
ги разбере овие идеи...
    """
    tokens = mk_tokenizer(text)
    assert len(tokens) == 297
@@ -45,7 +45,7 @@ def test_tokenizer_handles_long_text(mk_tokenizer):
        (",", False),
        ("милијарда", True),
        ("билион", True),
-    ]
+    ],
)
def test_mk_lex_attrs_like_number(mk_tokenizer, word, match):
    tokens = mk_tokenizer(word)
@@ -53,14 +53,7 @@ def test_mk_lex_attrs_like_number(mk_tokenizer, word, match):
    assert tokens[0].like_num == match


-@pytest.mark.parametrize(
-    "word",
-    [
-        "двесте",
-        "два-три",
-        "пет-шест"
-    ]
-)
+@pytest.mark.parametrize("word", ["двесте", "два-три", "пет-шест"])
def test_mk_lex_attrs_capitals(word):
    assert like_num(word)
    assert like_num(word.upper())
@@ -77,8 +70,8 @@ def test_mk_lex_attrs_capitals(word):
        "петто",
        "стоти",
        "шеесетите",
-        "седумдесетите"
-    ]
+        "седумдесетите",
+    ],
)
def test_mk_lex_attrs_like_number_for_ordinal(word):
    assert like_num(word)

View File

@@ -5,24 +5,22 @@ from spacy.lang.tr.lex_attrs import like_num
def test_tr_tokenizer_handles_long_text(tr_tokenizer):
    text = """Pamuk nasıl ipliğe dönüştürülür?
Sıkıştırılmış balyalar halindeki pamuk, iplik fabrikasına getirildiğinde hem
lifleri birbirine dolaşmıştır, hem de tarladan toplanırken araya bitkinin
parçaları karışmıştır. Üstelik balyalardaki pamuğun cinsi aynı olsa bile kalitesi
değişeceğinden, önce bütün balyaların birbirine karıştırılarak harmanlanması gerekir.
Daha sonra pamuk yığınları, liflerin açılıp temizlenmesi için tek bir birim halinde
birleştirilmiş çeşitli makinelerden geçirilir.Bunlardan biri, dönen tokmaklarıyla
pamuğu dövüp kabartarak dağınık yumaklar haline getiren ve liflerin arasındaki yabancı
maddeleri temizleyen hallaç makinesidir. Daha sonra tarak makinesine giren pamuk demetleri,
herbirinin yüzeyinde yüzbinlerce incecik iğne bulunan döner silindirlerin arasından geçerek lif lif ayrılır
ve tül inceliğinde gevşek bir örtüye dönüşür. Ama bir sonraki makine bu lifleri dağınık
ve gevşek bir biçimde birbirine yaklaştırarak 2 cm eninde bir pamuk şeridi haline getirir."""
    tokens = tr_tokenizer(text)
    assert len(tokens) == 146


@pytest.mark.parametrize(
    "word",
    [

View File

@@ -2,145 +2,692 @@ import pytest
ABBREV_TESTS = [
    ("Dr. Murat Bey ile görüştüm.", ["Dr.", "Murat", "Bey", "ile", "görüştüm", "."]),
    ("Dr.la görüştüm.", ["Dr.la", "görüştüm", "."]),
    ("Dr.'la görüştüm.", ["Dr.'la", "görüştüm", "."]),
    ("TBMM'de çalışıyormuş.", ["TBMM'de", "çalışıyormuş", "."]),
    (
        "Hem İst. hem Ank. bu konuda gayet iyi durumda.",
        ["Hem", "İst.", "hem", "Ank.", "bu", "konuda", "gayet", "iyi", "durumda", "."],
    ),
    (
        "Hem İst. hem Ank.'da yağış var.",
        ["Hem", "İst.", "hem", "Ank.'da", "yağış", "var", "."],
    ),
    ("Dr.", ["Dr."]),
    ("Yrd.Doç.", ["Yrd.Doç."]),
    ("Prof.'un", ["Prof.'un"]),
    ("Böl.'nde", ["Böl.'nde"]),
]

URL_TESTS = [
    (
        "Bizler de www.duygu.com.tr adında bir websitesi kurduk.",
        [
            "Bizler",
            "de",
            "www.duygu.com.tr",
            "adında",
            "bir",
            "websitesi",
            "kurduk",
            ".",
        ],
    ),
    (
        "Bizler de https://www.duygu.com.tr adında bir websitesi kurduk.",
        [
            "Bizler",
            "de",
            "https://www.duygu.com.tr",
            "adında",
            "bir",
            "websitesi",
            "kurduk",
            ".",
        ],
    ),
    (
        "Bizler de www.duygu.com.tr'dan satın aldık.",
        ["Bizler", "de", "www.duygu.com.tr'dan", "satın", "aldık", "."],
    ),
    (
        "Bizler de https://www.duygu.com.tr'dan satın aldık.",
        ["Bizler", "de", "https://www.duygu.com.tr'dan", "satın", "aldık", "."],
    ),
]

NUMBER_TESTS = [
    ("Rakamla 6 yazılıydı.", ["Rakamla", "6", "yazılıydı", "."]),
    ("Hava -4 dereceydi.", ["Hava", "-4", "dereceydi", "."]),
    (
        "Hava sıcaklığı -4ten +6ya yükseldi.",
        ["Hava", "sıcaklığı", "-4ten", "+6ya", "yükseldi", "."],
    ),
    (
        "Hava sıcaklığı -4'ten +6'ya yükseldi.",
        ["Hava", "sıcaklığı", "-4'ten", "+6'ya", "yükseldi", "."],
    ),
    ("Yarışta 6. oldum.", ["Yarışta", "6.", "oldum", "."]),
    ("Yarışta 438547745. oldum.", ["Yarışta", "438547745.", "oldum", "."]),
    ("Kitap IV. Murat hakkında.", ["Kitap", "IV.", "Murat", "hakkında", "."]),
    # ("Bana söylediği sayı 6.", ["Bana", "söylediği", "sayı", "6", "."]),
    ("Saat 6'da buluşalım.", ["Saat", "6'da", "buluşalım", "."]),
    ("Saat 6dan sonra buluşalım.", ["Saat", "6dan", "sonra", "buluşalım", "."]),
    ("6.dan sonra saymadım.", ["6.dan", "sonra", "saymadım", "."]),
    ("6.'dan sonra saymadım.", ["6.'dan", "sonra", "saymadım", "."]),
    ("Saat 6'ydı.", ["Saat", "6'ydı", "."]),
    ("5'te", ["5'te"]),
    ("6'da", ["6'da"]),
    ("9dan", ["9dan"]),
    ("19'da", ["19'da"]),
    ("VI'da", ["VI'da"]),
    ("5.", ["5."]),
    ("72.", ["72."]),
    ("VI.", ["VI."]),
    ("6.'dan", ["6.'dan"]),
    ("19.'dan", ["19.'dan"]),
    ("6.dan", ["6.dan"]),
    ("16.dan", ["16.dan"]),
    ("VI.'dan", ["VI.'dan"]),
    ("VI.dan", ["VI.dan"]),
    ("Hepsi 1994 yılında oldu.", ["Hepsi", "1994", "yılında", "oldu", "."]),
    ("Hepsi 1994'te oldu.", ["Hepsi", "1994'te", "oldu", "."]),
    (
        "2/3 tarihli faturayı bulamadım.",
        ["2/3", "tarihli", "faturayı", "bulamadım", "."],
    ),
    (
        "2.3 tarihli faturayı bulamadım.",
        ["2.3", "tarihli", "faturayı", "bulamadım", "."],
    ),
    (
        "2.3. tarihli faturayı bulamadım.",
        ["2.3.", "tarihli", "faturayı", "bulamadım", "."],
    ),
    (
        "2/3/2020 tarihli faturayı bulamadm.",
        ["2/3/2020", "tarihli", "faturayı", "bulamadm", "."],
    ),
    (
        "2/3/1987 tarihinden beri burda yaşıyorum.",
        ["2/3/1987", "tarihinden", "beri", "burda", "yaşıyorum", "."],
    ),
    (
        "2-3-1987 tarihinden beri burdayım.",
        ["2-3-1987", "tarihinden", "beri", "burdayım", "."],
    ),
    (
        "2.3.1987 tarihinden beri burdayım.",
        ["2.3.1987", "tarihinden", "beri", "burdayım", "."],
    ),
    (
        "Bu olay 2005-2006 tarihleri arasında oldu.",
        ["Bu", "olay", "2005", "-", "2006", "tarihleri", "arasında", "oldu", "."],
    ),
    (
        "Bu olay 4/12/2005-21/3/2006 tarihleri arasında oldu.",
        [
            "Bu",
            "olay",
            "4/12/2005",
            "-",
            "21/3/2006",
            "tarihleri",
            "arasında",
            "oldu",
            ".",
        ],
    ),
    (
        "Ek fıkra: 5/11/2003-4999/3 maddesine göre uygundur.",
        [
            "Ek",
            "fıkra",
            ":",
            "5/11/2003",
            "-",
            "4999/3",
            "maddesine",
            "göre",
            "uygundur",
            ".",
        ],
    ),
    (
        "2/A alanları: 6831 sayılı Kanunun 2nci maddesinin birinci fıkrasının (A) bendine göre",
        [
            "2/A",
            "alanları",
            ":",
            "6831",
            "sayılı",
            "Kanunun",
            "2nci",
            "maddesinin",
            "birinci",
            "fıkrasının",
            "(",
            "A",
            ")",
            "bendine",
            "göre",
        ],
    ),
    (
        "ŞEHİTTEĞMENKALMAZ Cad. No: 2/311",
        ["ŞEHİTTEĞMENKALMAZ", "Cad.", "No", ":", "2/311"],
    ),
    (
        "2-3-2025",
        [
            "2-3-2025",
        ],
    ),
    ("2/3/2025", ["2/3/2025"]),
    ("Yıllardır 0.5 uç kullanıyorum.", ["Yıllardır", "0.5", "uç", "kullanıyorum", "."]),
    (
        "Kan değerlerim 0.5-0.7 arasıydı.",
        ["Kan", "değerlerim", "0.5", "-", "0.7", "arasıydı", "."],
    ),
    ("0.5", ["0.5"]),
    ("1/2", ["1/2"]),
    ("%1", ["%", "1"]),
    ("%1lik", ["%", "1lik"]),
    ("%1'lik", ["%", "1'lik"]),
    ("%1lik dilim", ["%", "1lik", "dilim"]),
    ("%1'lik dilim", ["%", "1'lik", "dilim"]),
    ("%1.5", ["%", "1.5"]),
    # ("%1-%2 arası büyüme bekleniyor.", ["%", "1", "-", "%", "2", "arası", "büyüme", "bekleniyor", "."]),
    (
        "%1-2 arası büyüme bekliyoruz.",
        ["%", "1", "-", "2", "arası", "büyüme", "bekliyoruz", "."],
    ),
    (
        "%11-12 arası büyüme bekliyoruz.",
        ["%", "11", "-", "12", "arası", "büyüme", "bekliyoruz", "."],
    ),
    ("%1.5luk büyüme bekliyoruz.", ["%", "1.5luk", "büyüme", "bekliyoruz", "."]),
    (
        "Saat 1-2 arası gelin lütfen.",
        ["Saat", "1", "-", "2", "arası", "gelin", "lütfen", "."],
    ),
    ("Saat 15:30 gibi buluşalım.", ["Saat", "15:30", "gibi", "buluşalım", "."]),
    ("Saat 15:30'da buluşalım.", ["Saat", "15:30'da", "buluşalım", "."]),
    ("Saat 15.30'da buluşalım.", ["Saat", "15.30'da", "buluşalım", "."]),
    ("Saat 15.30da buluşalım.", ["Saat", "15.30da", "buluşalım", "."]),
    ("Saat 15 civarı buluşalım.", ["Saat", "15", "civarı", "buluşalım", "."]),
    ("9daki otobüse binsek mi?", ["9daki", "otobüse", "binsek", "mi", "?"]),
    ("Okulumuz 3-B şubesi", ["Okulumuz", "3-B", "şubesi"]),
    ("Okulumuz 3/B şubesi", ["Okulumuz", "3/B", "şubesi"]),
    ("Okulumuz 3B şubesi", ["Okulumuz", "3B", "şubesi"]),
    ("Okulumuz 3b şubesi", ["Okulumuz", "3b", "şubesi"]),
    (
        "Antonio Gaudí 20. yüzyılda, 1904-1914 yılları arasında on yıl süren bir reform süreci getirmiştir.",
        [
            "Antonio",
            "Gaudí",
            "20.",
            "yüzyılda",
            ",",
            "1904",
            "-",
            "1914",
            "yılları",
            "arasında",
            "on",
            "yıl",
            "süren",
            "bir",
            "reform",
            "süreci",
            "getirmiştir",
            ".",
        ],
    ),
    (
        "Dizel yakıtın avro bölgesi ortalaması olan 1,165 avroya kıyasla litre başına 1,335 avroya mal olduğunu gösteriyor.",
        [
            "Dizel",
            "yakıtın",
            "avro",
            "bölgesi",
            "ortalaması",
            "olan",
            "1,165",
            "avroya",
            "kıyasla",
            "litre",
            "başına",
            "1,335",
            "avroya",
            "mal",
            "olduğunu",
            "gösteriyor",
            ".",
        ],
    ),
    (
        "Marcus Antonius M.Ö. 1 Ocak 49'da, Sezar'dan Vali'nin kendisini barış dostu ilan ettiği bir bildiri yayınlamıştır.",
        [
            "Marcus",
            "Antonius",
            "M.Ö.",
            "1",
            "Ocak",
            "49'da",
            ",",
            "Sezar'dan",
            "Vali'nin",
            "kendisini",
            "barış",
            "dostu",
            "ilan",
            "ettiği",
            "bir",
            "bildiri",
            "yayınlamıştır",
            ".",
        ],
    ),
]

PUNCT_TESTS = [
    ("Gitmedim dedim ya!", ["Gitmedim", "dedim", "ya", "!"]),
    ("Gitmedim dedim ya!!", ["Gitmedim", "dedim", "ya", "!", "!"]),
    ("Gitsek mi?", ["Gitsek", "mi", "?"]),
    ("Gitsek mi??", ["Gitsek", "mi", "?", "?"]),
    ("Gitsek mi?!?", ["Gitsek", "mi", "?", "!", "?"]),
    (
        "Ankara - Antalya arası otobüs işliyor.",
        ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."],
    ),
    (
        "Ankara-Antalya arası otobüs işliyor.",
        ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."],
    ),
    ("Sen--ben, ya da onlar.", ["Sen", "--", "ben", ",", "ya", "da", "onlar", "."]),
    (
        "Senden, benden, bizden şarkısını biliyor musun?",
        ["Senden", ",", "benden", ",", "bizden", "şarkısını", "biliyor", "musun", "?"],
    ),
    (
        "Akif'le geldik, sonra da o ayrıldı.",
        ["Akif'le", "geldik", ",", "sonra", "da", "o", "ayrıldı", "."],
    ),
    ("Bu adam ne dedi şimdi???", ["Bu", "adam", "ne", "dedi", "şimdi", "?", "?", "?"]),
    (
        "Yok hasta olmuş, yok annesi hastaymış, bahaneler işte...",
        [
            "Yok",
            "hasta",
            "olmuş",
            ",",
            "yok",
            "annesi",
            "hastaymış",
            ",",
            "bahaneler",
            "işte",
            "...",
        ],
    ),
    (
        "Ankara'dan İstanbul'a ... bir aşk hikayesi.",
        ["Ankara'dan", "İstanbul'a", "...", "bir", "aşk", "hikayesi", "."],
    ),
    ("Ahmet'te", ["Ahmet'te"]),
    ("İstanbul'da", ["İstanbul'da"]),
]

GENERAL_TESTS = [
    (
        "1914'teki Endurance seferinde, Sir Ernest Shackleton'ın kaptanlığını yaptığı İngiliz Endurance gemisi yirmi sekiz kişi ile Antarktika'yı geçmek üzere yelken açtı.",
        [
            "1914'teki",
            "Endurance",
            "seferinde",
            ",",
            "Sir",
            "Ernest",
            "Shackleton'ın",
            "kaptanlığını",
            "yaptığı",
            "İngiliz",
            "Endurance",
            "gemisi",
            "yirmi",
            "sekiz",
            "kişi",
            "ile",
            "Antarktika'yı",
            "geçmek",
            "üzere",
            "yelken",
            "açtı",
            ".",
        ],
    ),
    (
        'Danışılan "%100 Cospedal" olduğunu belirtti.',
        ["Danışılan", '"', "%", "100", "Cospedal", '"', "olduğunu", "belirtti", "."],
    ),
    (
        "1976'da parkur artık kullanılmıyordu; 1990'da ise bir yangın, daha sonraları ahırlarla birlikte yıkılacak olan tahta tribünlerden geri kalanları da yok etmişti.",
        [
            "1976'da",
            "parkur",
            "artık",
            "kullanılmıyordu",
            ";",
            "1990'da",
            "ise",
            "bir",
            "yangın",
            ",",
            "daha",
            "sonraları",
            "ahırlarla",
            "birlikte",
            "yıkılacak",
            "olan",
            "tahta",
            "tribünlerden",
            "geri",
            "kalanları",
            "da",
            "yok",
            "etmişti",
            ".",
        ],
    ),
    (
        "Dahiyane bir ameliyat ve zorlu bir rehabilitasyon sürecinden sonra, tamamen iyileştim.",
        [
            "Dahiyane",
            "bir",
            "ameliyat",
            "ve",
            "zorlu",
            "bir",
            "rehabilitasyon",
            "sürecinden",
            "sonra",
            ",",
            "tamamen",
            "iyileştim",
            ".",
        ],
    ),
    (
        "Yaklaşık iki hafta süren bireysel erken oy kullanma döneminin ardından 5,7 milyondan fazla Floridalı sandık başına gitti.",
        [
            "Yaklaşık",
            "iki",
            "hafta",
            "süren",
            "bireysel",
            "erken",
            "oy",
            "kullanma",
            "döneminin",
            "ardından",
            "5,7",
            "milyondan",
            "fazla",
            "Floridalı",
            "sandık",
            "başına",
            "gitti",
            ".",
        ],
    ),
    (
        "Ancak, bu ABD Çevre Koruma Ajansı'nın dünyayı bu konularda uyarmasının ardından ortaya çıktı.",
        [
            "Ancak",
            ",",
            "bu",
            "ABD",
            "Çevre",
            "Koruma",
            "Ajansı'nın",
            "dünyayı",
            "bu",
            "konularda",
            "uyarmasının",
            "ardından",
            "ortaya",
            "çıktı",
            ".",
        ],
    ),
    (
        "Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.",
        [
            "Ortalama",
            "şansa",
            "ve",
            "10.000",
            "Sterlin",
            "değerinde",
            "tahvillere",
            "sahip",
            "bir",
            "yatırımcı",
            "yılda",
            "125",
            "Sterlin",
            "ikramiye",
            "kazanabilir",
            ".",
        ],
    ),
    (
        "Granit adaları; Seyşeller ve Tioman ile Saint Helena gibi volkanik adaları kapsar.",
        [
            "Granit",
            "adaları",
            ";",
            "Seyşeller",
            "ve",
            "Tioman",
            "ile",
            "Saint",
            "Helena",
            "gibi",
            "volkanik",
            "adaları",
            "kapsar",
            ".",
        ],
    ),
    (
        "Barış antlaşmasıyla İspanya, Amerika'ya Porto Riko, Guam ve Filipinler kolonilerini devretti.",
        [
            "Barış",
            "antlaşmasıyla",
            "İspanya",
            ",",
            "Amerika'ya",
            "Porto",
            "Riko",
            ",",
            "Guam",
            "ve",
            "Filipinler",
            "kolonilerini",
            "devretti",
            ".",
        ],
    ),
    (
        "Makedonya'nın sınır bölgelerini güvence altına alan Philip, büyük bir Makedon ordusu kurdu ve uzun bir fetih seferi için Trakya'ya doğru yürüdü.",
        [
            "Makedonya'nın",
            "sınır",
            "bölgelerini",
            "güvence",
            "altına",
            "alan",
            "Philip",
            ",",
            "büyük",
            "bir",
            "Makedon",
            "ordusu",
            "kurdu",
            "ve",
            "uzun",
            "bir",
            "fetih",
            "seferi",
            "için",
            "Trakya'ya",
            "doğru",
            "yürüdü",
            ".",
        ],
    ),
    (
        "Fransız gazetesi Le Figaro'ya göre bu hükumet planı sayesinde 42 milyon Euro kazanç sağlanabilir ve elde edilen paranın 15.5 milyonu ulusal güvenlik için kullanılabilir.",
        [
            "Fransız",
            "gazetesi",
            "Le",
            "Figaro'ya",
            "göre",
            "bu",
            "hükumet",
            "planı",
            "sayesinde",
            "42",
            "milyon",
            "Euro",
            "kazanç",
            "sağlanabilir",
            "ve",
            "elde",
            "edilen",
            "paranın",
            "15.5",
            "milyonu",
            "ulusal",
            "güvenlik",
            "için",
            "kullanılabilir",
            ".",
        ],
    ),
    (
        "Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.",
        [
            "Ortalama",
            "şansa",
            "ve",
            "10.000",
            "Sterlin",
            "değerinde",
            "tahvillere",
            "sahip",
            "bir",
            "yatırımcı",
            "yılda",
            "125",
            "Sterlin",
            "ikramiye",
            "kazanabilir",
            ".",
        ],
    ),
    (
        "3 Kasım Salı günü, Ankara Belediye Başkanı 2014'te hükümetle birlikte oluşturulan kentsel gelişim anlaşmasını askıya alma kararı verdi.",
        [
            "3",
            "Kasım",
            "Salı",
            "günü",
            ",",
            "Ankara",
            "Belediye",
            "Başkanı",
            "2014'te",
            "hükümetle",
            "birlikte",
            "oluşturulan",
            "kentsel",
            "gelişim",
            "anlaşmasını",
            "askıya",
            "alma",
            "kararı",
            "verdi",
            ".",
        ],
    ),
    (
        "Stalin, Abakumov'u Beria'nın enerji bakanlıkları üzerindeki baskınlığına karşı MGB içinde kendi ağını kurmaya teşvik etmeye başlamıştı.",
        [
            "Stalin",
            ",",
            "Abakumov'u",
            "Beria'nın",
            "enerji",
            "bakanlıkları",
            "üzerindeki",
            "baskınlığına",
            "karşı",
            "MGB",
            "içinde",
            "kendi",
            "ağını",
            "kurmaya",
            "teşvik",
            "etmeye",
            "başlamıştı",
            ".",
        ],
    ),
    (
        "Güney Avrupa'daki kazı alanlarının çoğunluğu gibi, bu bulgu M.Ö. 5. yüzyılın başlar",
        [
            "Güney",
            "Avrupa'daki",
            "kazı",
            "alanlarının",
            "çoğunluğu",
            "gibi",
            ",",
            "bu",
            "bulgu",
            "M.Ö.",
            "5.",
            "yüzyılın",
            "başlar",
        ],
    ),
    (
        "Sağlığın bozulması Hitchcock hayatının son yirmi yılında üretimini azalttı.",
        [
            "Sağlığın",
            "bozulması",
            "Hitchcock",
            "hayatının",
            "son",
            "yirmi",
            "yılında",
            "üretimini",
            "azalttı",
            ".",
        ],
    ),
]

-TESTS = ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS
+TESTS = (ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS)


@pytest.mark.parametrize("text,expected_tokens", TESTS)
@@ -149,4 +696,3 @@ def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens):
    token_list = [token.text for token in tokens if not token.is_space]
    print(token_list)
    assert expected_tokens == token_list

View File

@@ -89,7 +89,6 @@ def test_uk_tokenizer_splits_open_appostrophe(uk_tokenizer, text):
    assert tokens[0].text == "'"


-@pytest.mark.skip(reason="See Issue #3327 and PR #3329")
@pytest.mark.parametrize("text", ["Тест''"])
def test_uk_tokenizer_splits_double_end_quote(uk_tokenizer, text):
    tokens = uk_tokenizer(text)

View File

@@ -7,7 +7,6 @@ from spacy.tokens import Doc
from spacy.pipeline._parser_internals.nonproj import projectivize
from spacy.pipeline._parser_internals.arc_eager import ArcEager
from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
-from spacy.pipeline._parser_internals.stateclass import StateClass


def get_sequence_costs(M, words, heads, deps, transitions):
@@ -59,7 +58,7 @@ def test_oracle_four_words(arc_eager, vocab):
        ["S"],
        ["L-left"],
        ["S"],
-        ["D"]
+        ["D"],
    ]
    assert state.is_final()
    for i, state_costs in enumerate(cost_history):
@@ -185,9 +184,9 @@ def test_oracle_dev_sentence(vocab, arc_eager):
        "L-nn",  # Attach 'Cars' to 'Inc.'
        "L-nn",  # Attach 'Motor' to 'Inc.'
        "L-nn",  # Attach 'Rolls-Royce' to 'Inc.'
        "S",  # Shift "Inc."
        "L-nsubj",  # Attach 'Inc.' to 'said'
        "S",  # Shift 'said'
        "S",  # Shift 'it'
        "L-nsubj",  # Attach 'it.' to 'expects'
        "R-ccomp",  # Attach 'expects' to 'said'
@@ -251,7 +250,7 @@ def test_oracle_bad_tokenization(vocab, arc_eager):
    is root is
    bad comp is
    """
    gold_words = []
    gold_deps = []
    gold_heads = []
@@ -268,7 +267,9 @@ def test_oracle_bad_tokenization(vocab, arc_eager):
    arc_eager.add_action(2, dep)  # Left
    arc_eager.add_action(3, dep)  # Right
    reference = Doc(Vocab(), words=gold_words, deps=gold_deps, heads=gold_heads)
-    predicted = Doc(reference.vocab, words=["[", "catalase", "]", ":", "that", "is", "bad"])
+    predicted = Doc(
+        reference.vocab, words=["[", "catalase", "]", ":", "that", "is", "bad"]
+    )
    example = Example(predicted=predicted, reference=reference)
    ae_oracle_actions = arc_eager.get_oracle_sequence(example, _debug=False)
    ae_oracle_actions = [arc_eager.get_class_name(i) for i in ae_oracle_actions]

View File

@@ -301,11 +301,9 @@ def test_block_ner():
    assert [token.ent_type_ for token in doc] == expected_types


-@pytest.mark.parametrize(
-    "use_upper", [True, False]
-)
+@pytest.mark.parametrize("use_upper", [True, False])
def test_overfitting_IO(use_upper):
-    # Simple test to try and quickly overfit the NER component - ensuring the ML models work correctly
+    # Simple test to try and quickly overfit the NER component
    nlp = English()
    ner = nlp.add_pipe("ner", config={"model": {"use_upper": use_upper}})
    train_examples = []
@@ -361,6 +359,84 @@ def test_overfitting_IO(use_upper):
    assert_equal(batch_deps_1, no_batch_deps)
+def test_beam_ner_scores():
+    # Test that we can get confidence values out of the beam_ner pipe
+    beam_width = 16
+    beam_density = 0.0001
+    nlp = English()
+    config = {
+        "beam_width": beam_width,
+        "beam_density": beam_density,
+    }
+    ner = nlp.add_pipe("beam_ner", config=config)
+    train_examples = []
+    for text, annotations in TRAIN_DATA:
+        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
+        for ent in annotations.get("entities"):
+            ner.add_label(ent[2])
+    optimizer = nlp.initialize()
+
+    # update once
+    losses = {}
+    nlp.update(train_examples, sgd=optimizer, losses=losses)
+
+    # test the scores from the beam
+    test_text = "I like London."
+    doc = nlp.make_doc(test_text)
+    docs = [doc]
+    beams = ner.predict(docs)
+    entity_scores = ner.scored_ents(beams)[0]
+
+    for j in range(len(doc)):
+        for label in ner.labels:
+            score = entity_scores[(j, j + 1, label)]
+            eps = 0.00001
+            assert 0 - eps <= score <= 1 + eps
+
+
+def test_beam_overfitting_IO():
+    # Simple test to try and quickly overfit the Beam NER component
+    nlp = English()
+    beam_width = 16
+    beam_density = 0.0001
+    config = {
+        "beam_width": beam_width,
+        "beam_density": beam_density,
+    }
+    ner = nlp.add_pipe("beam_ner", config=config)
+    train_examples = []
+    for text, annotations in TRAIN_DATA:
+        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
+        for ent in annotations.get("entities"):
+            ner.add_label(ent[2])
+    optimizer = nlp.initialize()
+
+    # run overfitting
+    for i in range(50):
+        losses = {}
+        nlp.update(train_examples, sgd=optimizer, losses=losses)
+    assert losses["beam_ner"] < 0.0001
+
+    # test the scores from the beam
+    test_text = "I like London."
+    docs = [nlp.make_doc(test_text)]
+    beams = ner.predict(docs)
+    entity_scores = ner.scored_ents(beams)[0]
+    assert entity_scores[(2, 3, "LOC")] == 1.0
+    assert entity_scores[(2, 3, "PERSON")] == 0.0
+
+    # Also test the results are still the same after IO
+    with make_tempdir() as tmp_dir:
+        nlp.to_disk(tmp_dir)
+        nlp2 = util.load_model_from_path(tmp_dir)
+        docs2 = [nlp2.make_doc(test_text)]
+        ner2 = nlp2.get_pipe("beam_ner")
+        beams2 = ner2.predict(docs2)
+        entity_scores2 = ner2.scored_ents(beams2)[0]
+        assert entity_scores2[(2, 3, "LOC")] == 1.0
+        assert entity_scores2[(2, 3, "PERSON")] == 0.0
def test_ner_warns_no_lookups(caplog):
    nlp = English()
    assert nlp.lang in util.LEXEME_NORM_LANGS

View File

@@ -1,13 +1,9 @@
-# coding: utf8
-from __future__ import unicode_literals
import pytest
import hypothesis
import hypothesis.strategies
import numpy
from spacy.vocab import Vocab
from spacy.language import Language
-from spacy.pipeline import DependencyParser
from spacy.pipeline._parser_internals.arc_eager import ArcEager
from spacy.tokens import Doc
from spacy.pipeline._parser_internals._beam_utils import BeamBatch
@@ -44,7 +40,7 @@ def docs(vocab):
            words=["Rats", "bite", "things"],
            heads=[1, 1, 1],
            deps=["nsubj", "ROOT", "dobj"],
-            sent_starts=[True, False, False]
+            sent_starts=[True, False, False],
        )
    ]
@@ -77,10 +73,12 @@ def batch_size(docs):
def beam_width():
    return 4


@pytest.fixture(params=[0.0, 0.5, 1.0])
def beam_density(request):
    return request.param


@pytest.fixture
def vector_size():
    return 6
@@ -100,7 +98,9 @@ def scores(moves, batch_size, beam_width):
                numpy.random.uniform(-0.1, 0.1, (beam_width, moves.n_moves))
                for _ in range(batch_size)
            ]
-        ), dtype="float32")
+        ),
+        dtype="float32",
+    )


def test_create_beam(beam):
@@ -128,8 +128,6 @@ def test_beam_parse(examples, beam_width):
        parser(doc)


@hypothesis.given(hyp=hypothesis.strategies.data())
def test_beam_density(moves, examples, beam_width, hyp):
    beam_density = float(hyp.draw(hypothesis.strategies.floats(0.0, 1.0, width=32)))

View File

@@ -28,6 +28,26 @@ TRAIN_DATA = [
 ]
 
+CONFLICTING_DATA = [
+    (
+        "I like London and Berlin.",
+        {
+            "heads": [1, 1, 1, 2, 2, 1],
+            "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
+        },
+    ),
+    (
+        "I like London and Berlin.",
+        {
+            "heads": [0, 0, 0, 0, 0, 0],
+            "deps": ["ROOT", "nsubj", "nsubj", "cc", "conj", "punct"],
+        },
+    ),
+]
+
+eps = 0.01
+
 
 def test_parser_root(en_vocab):
     words = ["i", "do", "n't", "have", "other", "assistance"]
     heads = [3, 3, 3, 3, 5, 3]
@@ -185,26 +205,31 @@ def test_parser_set_sent_starts(en_vocab):
             assert token.head in sent
 
 
-def test_overfitting_IO():
-    # Simple test to try and quickly overfit the dependency parser - ensuring the ML models work correctly
+@pytest.mark.parametrize("pipe_name", ["parser", "beam_parser"])
+def test_overfitting_IO(pipe_name):
+    # Simple test to try and quickly overfit the dependency parser (normal or beam)
     nlp = English()
-    parser = nlp.add_pipe("parser")
+    parser = nlp.add_pipe(pipe_name)
     train_examples = []
     for text, annotations in TRAIN_DATA:
         train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
        for dep in annotations.get("deps", []):
             parser.add_label(dep)
     optimizer = nlp.initialize()
-    for i in range(100):
+    # run overfitting
+    for i in range(150):
         losses = {}
         nlp.update(train_examples, sgd=optimizer, losses=losses)
-    assert losses["parser"] < 0.0001
+    assert losses[pipe_name] < 0.0001
     # test the trained model
     test_text = "I like securities."
     doc = nlp(test_text)
     assert doc[0].dep_ == "nsubj"
     assert doc[2].dep_ == "dobj"
     assert doc[3].dep_ == "punct"
+    assert doc[0].head.i == 1
+    assert doc[2].head.i == 1
+    assert doc[3].head.i == 1
     # Also test the results are still the same after IO
     with make_tempdir() as tmp_dir:
         nlp.to_disk(tmp_dir)
@@ -213,6 +238,9 @@ def test_overfitting_IO():
     assert doc2[0].dep_ == "nsubj"
     assert doc2[2].dep_ == "dobj"
     assert doc2[3].dep_ == "punct"
+    assert doc2[0].head.i == 1
+    assert doc2[2].head.i == 1
+    assert doc2[3].head.i == 1
 
     # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
     texts = [
@@ -226,3 +254,123 @@ def test_overfitting_IO():
     no_batch_deps = [doc.to_array([DEP]) for doc in [nlp(text) for text in texts]]
     assert_equal(batch_deps_1, batch_deps_2)
     assert_equal(batch_deps_1, no_batch_deps)
+
+
+def test_beam_parser_scores():
+    # Test that we can get confidence values out of the beam_parser pipe
+    beam_width = 16
+    beam_density = 0.0001
+    nlp = English()
+    config = {
+        "beam_width": beam_width,
+        "beam_density": beam_density,
+    }
+    parser = nlp.add_pipe("beam_parser", config=config)
+    train_examples = []
+    for text, annotations in CONFLICTING_DATA:
+        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
+        for dep in annotations.get("deps", []):
+            parser.add_label(dep)
+    optimizer = nlp.initialize()
+    # update a bit with conflicting data
+    for i in range(10):
+        losses = {}
+        nlp.update(train_examples, sgd=optimizer, losses=losses)
+    # test the scores from the beam
+    test_text = "I like securities."
+    doc = nlp.make_doc(test_text)
+    docs = [doc]
+    beams = parser.predict(docs)
+    head_scores, label_scores = parser.scored_parses(beams)
+    for j in range(len(doc)):
+        for label in parser.labels:
+            label_score = label_scores[0][(j, label)]
+            assert 0 - eps <= label_score <= 1 + eps
+        for i in range(len(doc)):
+            head_score = head_scores[0][(j, i)]
+            assert 0 - eps <= head_score <= 1 + eps
+
+
+def test_beam_overfitting_IO():
+    # Simple test to try and quickly overfit the Beam dependency parser
+    nlp = English()
+    beam_width = 16
+    beam_density = 0.0001
+    config = {
+        "beam_width": beam_width,
+        "beam_density": beam_density,
+    }
+    parser = nlp.add_pipe("beam_parser", config=config)
+    train_examples = []
+    for text, annotations in TRAIN_DATA:
+        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
+        for dep in annotations.get("deps", []):
+            parser.add_label(dep)
+    optimizer = nlp.initialize()
+    # run overfitting
+    for i in range(150):
+        losses = {}
+        nlp.update(train_examples, sgd=optimizer, losses=losses)
+    assert losses["beam_parser"] < 0.0001
+    # test the scores from the beam
+    test_text = "I like securities."
+    docs = [nlp.make_doc(test_text)]
+    beams = parser.predict(docs)
+    head_scores, label_scores = parser.scored_parses(beams)
+    # we only processed one document
+    head_scores = head_scores[0]
+    label_scores = label_scores[0]
+    # test label annotations: 0=nsubj, 2=dobj, 3=punct
+    assert label_scores[(0, "nsubj")] == pytest.approx(1.0, eps)
+    assert label_scores[(0, "dobj")] == pytest.approx(0.0, eps)
+    assert label_scores[(0, "punct")] == pytest.approx(0.0, eps)
+    assert label_scores[(2, "nsubj")] == pytest.approx(0.0, eps)
+    assert label_scores[(2, "dobj")] == pytest.approx(1.0, eps)
+    assert label_scores[(2, "punct")] == pytest.approx(0.0, eps)
+    assert label_scores[(3, "nsubj")] == pytest.approx(0.0, eps)
+    assert label_scores[(3, "dobj")] == pytest.approx(0.0, eps)
+    assert label_scores[(3, "punct")] == pytest.approx(1.0, eps)
+    # test head annotations: the root is token at index 1
+    assert head_scores[(0, 0)] == pytest.approx(0.0, eps)
+    assert head_scores[(0, 1)] == pytest.approx(1.0, eps)
+    assert head_scores[(0, 2)] == pytest.approx(0.0, eps)
+    assert head_scores[(2, 0)] == pytest.approx(0.0, eps)
+    assert head_scores[(2, 1)] == pytest.approx(1.0, eps)
+    assert head_scores[(2, 2)] == pytest.approx(0.0, eps)
+    assert head_scores[(3, 0)] == pytest.approx(0.0, eps)
+    assert head_scores[(3, 1)] == pytest.approx(1.0, eps)
+    assert head_scores[(3, 2)] == pytest.approx(0.0, eps)
+    # Also test the results are still the same after IO
+    with make_tempdir() as tmp_dir:
+        nlp.to_disk(tmp_dir)
+        nlp2 = util.load_model_from_path(tmp_dir)
+        docs2 = [nlp2.make_doc(test_text)]
+        parser2 = nlp2.get_pipe("beam_parser")
+        beams2 = parser2.predict(docs2)
+        head_scores2, label_scores2 = parser2.scored_parses(beams2)
+        # we only processed one document
+        head_scores2 = head_scores2[0]
+        label_scores2 = label_scores2[0]
+        # check the results again
+        assert label_scores2[(0, "nsubj")] == pytest.approx(1.0, eps)
+        assert label_scores2[(0, "dobj")] == pytest.approx(0.0, eps)
+        assert label_scores2[(0, "punct")] == pytest.approx(0.0, eps)
+        assert label_scores2[(2, "nsubj")] == pytest.approx(0.0, eps)
+        assert label_scores2[(2, "dobj")] == pytest.approx(1.0, eps)
+        assert label_scores2[(2, "punct")] == pytest.approx(0.0, eps)
+        assert label_scores2[(3, "nsubj")] == pytest.approx(0.0, eps)
+        assert label_scores2[(3, "dobj")] == pytest.approx(0.0, eps)
+        assert label_scores2[(3, "punct")] == pytest.approx(1.0, eps)
+        assert head_scores2[(0, 0)] == pytest.approx(0.0, eps)
+        assert head_scores2[(0, 1)] == pytest.approx(1.0, eps)
+        assert head_scores2[(0, 2)] == pytest.approx(0.0, eps)
+        assert head_scores2[(2, 0)] == pytest.approx(0.0, eps)
+        assert head_scores2[(2, 1)] == pytest.approx(1.0, eps)
+        assert head_scores2[(2, 2)] == pytest.approx(0.0, eps)
+        assert head_scores2[(3, 0)] == pytest.approx(0.0, eps)
+        assert head_scores2[(3, 1)] == pytest.approx(1.0, eps)
+        assert head_scores2[(3, 2)] == pytest.approx(0.0, eps)
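`scored_parses` is the parser-side counterpart of `scored_ents`: for each doc it returns a dict of head probabilities keyed by `(child_index, head_index)` and a dict of label probabilities keyed by `(token_index, dep_label)`, which is what the assertions above rely on. A small sketch of consuming those dicts, assuming a pipeline trained with `beam_parser` as in the test (the helper is illustrative only):

```python
def best_heads(parser, docs):
    # Pick the most probable head index for every token, per doc
    beams = parser.predict(docs)
    head_scores, _ = parser.scored_parses(beams)
    results = []
    for heads in head_scores:
        best = {}
        for (child, head), prob in heads.items():
            if child not in best or prob > best[child][1]:
                best[child] = (head, prob)
        results.append(best)
    return results
```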

View File

@@ -4,14 +4,17 @@ from spacy.tokens.doc import Doc
 from spacy.vocab import Vocab
 from spacy.pipeline._parser_internals.stateclass import StateClass
 
 
+
 @pytest.fixture
 def vocab():
     return Vocab()
 
+
 @pytest.fixture
 def doc(vocab):
     return Doc(vocab, words=["a", "b", "c", "d"])
 
+
 def test_init_state(doc):
     state = StateClass(doc)
     assert state.stack == []
@@ -19,6 +22,7 @@ def test_init_state(doc):
     assert not state.is_final()
     assert state.buffer_length() == 4
 
+
 def test_push_pop(doc):
     state = StateClass(doc)
     state.push()
@@ -33,6 +37,7 @@ def test_push_pop(doc):
     assert state.stack == [0]
     assert 1 not in state.queue
 
+
 def test_stack_depth(doc):
     state = StateClass(doc)
     assert state.stack_depth() == 0

View File

@@ -161,7 +161,7 @@ def test_attributeruler_score(nlp, pattern_dicts):
     # "cat" is the only correct lemma
     assert scores["lemma_acc"] == pytest.approx(0.2)
     # no morphs are set
-    assert scores["morph_acc"] == None
+    assert scores["morph_acc"] is None
 
 
 def test_attributeruler_rule_order(nlp):

View File

@@ -201,13 +201,9 @@ def test_entity_ruler_overlapping_spans(nlp):
 @pytest.mark.parametrize("n_process", [1, 2])
 def test_entity_ruler_multiprocessing(nlp, n_process):
-    texts = [
-        "I enjoy eating Pizza Hut pizza."
-    ]
+    texts = ["I enjoy eating Pizza Hut pizza."]
 
-    patterns = [
-        {"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}
-    ]
+    patterns = [{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}]
 
     ruler = nlp.add_pipe("entity_ruler")
     ruler.add_patterns(patterns)

View File

@@ -159,8 +159,12 @@ def test_pipe_class_component_model():
         "model": {
             "@architectures": "spacy.TextCatEnsemble.v2",
             "tok2vec": DEFAULT_TOK2VEC_MODEL,
-            "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1,
-                             "no_output_layer": False},
+            "linear_model": {
+                "@architectures": "spacy.TextCatBOW.v1",
+                "exclusive_classes": False,
+                "ngram_size": 1,
+                "no_output_layer": False,
+            },
         },
         "value1": 10,
     }

View File

@@ -37,7 +37,16 @@ TRAIN_DATA = [
 ]
 
 PARTIAL_DATA = [
+    # partial annotation
     ("I like green eggs", {"tags": ["", "V", "J", ""]}),
+    # misaligned partial annotation
+    (
+        "He hates green eggs",
+        {
+            "words": ["He", "hate", "s", "green", "eggs"],
+            "tags": ["", "V", "S", "J", ""],
+        },
+    ),
 ]
@@ -126,6 +135,7 @@ def test_incomplete_data():
     assert doc[1].tag_ is "V"
     assert doc[2].tag_ is "J"
 
+
 def test_overfitting_IO():
     # Simple test to try and quickly overfit the tagger - ensuring the ML models work correctly
     nlp = English()

View File

@@ -15,15 +15,31 @@ from spacy.training import Example
 from ..util import make_tempdir
 
-TRAIN_DATA = [
+TRAIN_DATA_SINGLE_LABEL = [
     ("I'm so happy.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
     ("I'm so angry", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
 ]
 
+TRAIN_DATA_MULTI_LABEL = [
+    ("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
+    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
+]
+
 
-def make_get_examples(nlp):
+def make_get_examples_single_label(nlp):
     train_examples = []
-    for t in TRAIN_DATA:
+    for t in TRAIN_DATA_SINGLE_LABEL:
+        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
+
+    def get_examples():
+        return train_examples
+
+    return get_examples
+
+
+def make_get_examples_multi_label(nlp):
+    train_examples = []
+    for t in TRAIN_DATA_MULTI_LABEL:
         train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
 
     def get_examples():
@@ -85,49 +101,75 @@ def test_textcat_learns_multilabel():
     assert score > 0.5
 
 
-def test_label_types():
+@pytest.mark.parametrize("name", ["textcat", "textcat_multilabel"])
+def test_label_types(name):
     nlp = Language()
-    textcat = nlp.add_pipe("textcat")
+    textcat = nlp.add_pipe(name)
     textcat.add_label("answer")
     with pytest.raises(ValueError):
         textcat.add_label(9)
 
 
-def test_no_label():
+@pytest.mark.parametrize("name", ["textcat", "textcat_multilabel"])
+def test_no_label(name):
     nlp = Language()
-    nlp.add_pipe("textcat")
+    nlp.add_pipe(name)
     with pytest.raises(ValueError):
         nlp.initialize()
 
 
-def test_implicit_label():
+@pytest.mark.parametrize(
+    "name,get_examples",
+    [
+        ("textcat", make_get_examples_single_label),
+        ("textcat_multilabel", make_get_examples_multi_label),
+    ],
+)
+def test_implicit_label(name, get_examples):
     nlp = Language()
-    nlp.add_pipe("textcat")
-    nlp.initialize(get_examples=make_get_examples(nlp))
+    nlp.add_pipe(name)
+    nlp.initialize(get_examples=get_examples(nlp))
 
 
-def test_no_resize():
+@pytest.mark.parametrize("name", ["textcat", "textcat_multilabel"])
+def test_no_resize(name):
     nlp = Language()
-    textcat = nlp.add_pipe("textcat")
+    textcat = nlp.add_pipe(name)
     textcat.add_label("POSITIVE")
     textcat.add_label("NEGATIVE")
     nlp.initialize()
-    assert textcat.model.get_dim("nO") == 2
+    assert textcat.model.get_dim("nO") >= 2
     # this throws an error because the textcat can't be resized after initialization
     with pytest.raises(ValueError):
         textcat.add_label("NEUTRAL")
 
 
-def test_initialize_examples():
+def test_error_with_multi_labels():
     nlp = Language()
     textcat = nlp.add_pipe("textcat")
-    for text, annotations in TRAIN_DATA:
+    train_examples = []
+    for text, annotations in TRAIN_DATA_MULTI_LABEL:
+        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
+    with pytest.raises(ValueError):
+        optimizer = nlp.initialize(get_examples=lambda: train_examples)
+
+
+@pytest.mark.parametrize(
+    "name,get_examples, train_data",
+    [
+        ("textcat", make_get_examples_single_label, TRAIN_DATA_SINGLE_LABEL),
+        ("textcat_multilabel", make_get_examples_multi_label, TRAIN_DATA_MULTI_LABEL),
+    ],
+)
+def test_initialize_examples(name, get_examples, train_data):
+    nlp = Language()
+    textcat = nlp.add_pipe(name)
+    for text, annotations in train_data:
         for label, value in annotations.get("cats").items():
             textcat.add_label(label)
     # you shouldn't really call this more than once, but for testing it should be fine
     nlp.initialize()
-    get_examples = make_get_examples(nlp)
-    nlp.initialize(get_examples=get_examples)
+    nlp.initialize(get_examples=get_examples(nlp))
     with pytest.raises(TypeError):
         nlp.initialize(get_examples=lambda: None)
     with pytest.raises(TypeError):
@@ -138,12 +180,10 @@ def test_overfitting_IO():
     # Simple test to try and quickly overfit the single-label textcat component - ensuring the ML models work correctly
     fix_random_seed(0)
     nlp = English()
-    nlp.config["initialize"]["components"]["textcat"] = {"positive_label": "POSITIVE"}
-    # Set exclusive labels
-    config = {"model": {"linear_model": {"exclusive_classes": True}}}
-    textcat = nlp.add_pipe("textcat", config=config)
+    textcat = nlp.add_pipe("textcat")
     train_examples = []
-    for text, annotations in TRAIN_DATA:
+    for text, annotations in TRAIN_DATA_SINGLE_LABEL:
         train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
     optimizer = nlp.initialize(get_examples=lambda: train_examples)
     assert textcat.model.get_dim("nO") == 2
@@ -172,6 +212,8 @@ def test_overfitting_IO():
     # Test scoring
     scores = nlp.evaluate(train_examples)
     assert scores["cats_micro_f"] == 1.0
+    assert scores["cats_macro_f"] == 1.0
+    assert scores["cats_macro_auc"] == 1.0
     assert scores["cats_score"] == 1.0
     assert "cats_score_desc" in scores
@@ -192,7 +234,7 @@ def test_overfitting_IO_multi():
     config = {"model": {"linear_model": {"exclusive_classes": False}}}
     textcat = nlp.add_pipe("textcat", config=config)
     train_examples = []
-    for text, annotations in TRAIN_DATA:
+    for text, annotations in TRAIN_DATA_MULTI_LABEL:
         train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
     optimizer = nlp.initialize(get_examples=lambda: train_examples)
     assert textcat.model.get_dim("nO") == 2
@@ -231,27 +273,75 @@ def test_overfitting_IO_multi():
     assert_equal(batch_cats_1, no_batch_cats)
 
 
+def test_overfitting_IO_multi():
+    # Simple test to try and quickly overfit the multi-label textcat component - ensuring the ML models work correctly
+    fix_random_seed(0)
+    nlp = English()
+    textcat = nlp.add_pipe("textcat_multilabel")
+    train_examples = []
+    for text, annotations in TRAIN_DATA_MULTI_LABEL:
+        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
+    optimizer = nlp.initialize(get_examples=lambda: train_examples)
+    assert textcat.model.get_dim("nO") == 3
+    for i in range(100):
+        losses = {}
+        nlp.update(train_examples, sgd=optimizer, losses=losses)
+    assert losses["textcat_multilabel"] < 0.01
+    # test the trained model
+    test_text = "I am confused but happy."
+    doc = nlp(test_text)
+    cats = doc.cats
+    assert cats["HAPPY"] > 0.9
+    assert cats["CONFUSED"] > 0.9
+    # Also test the results are still the same after IO
+    with make_tempdir() as tmp_dir:
+        nlp.to_disk(tmp_dir)
+        nlp2 = util.load_model_from_path(tmp_dir)
+        doc2 = nlp2(test_text)
+        cats2 = doc2.cats
+        assert cats2["HAPPY"] > 0.9
+        assert cats2["CONFUSED"] > 0.9
+    # Test scoring
+    scores = nlp.evaluate(train_examples)
+    assert scores["cats_micro_f"] == 1.0
+    assert scores["cats_macro_f"] == 1.0
+    assert "cats_score_desc" in scores
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = ["Just a sentence.", "I like green eggs.", "I am happy.", "I eat ham."]
+    batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
+
+
 # fmt: off
 @pytest.mark.parametrize(
-    "textcat_config",
+    "name,train_data,textcat_config",
     [
-        {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False},
-        {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False},
-        {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True},
-        {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True},
-        {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}},
-        {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}},
-        {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True},
-        {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False},
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}),
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False}),
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True}),
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True}),
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}}),
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}}),
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True}),
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False}),
     ],
 )
 # fmt: on
-def test_textcat_configs(textcat_config):
+def test_textcat_configs(name, train_data, textcat_config):
     pipe_config = {"model": textcat_config}
     nlp = English()
-    textcat = nlp.add_pipe("textcat", config=pipe_config)
+    textcat = nlp.add_pipe(name, config=pipe_config)
     train_examples = []
-    for text, annotations in TRAIN_DATA:
+    for text, annotations in train_data:
         train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
         for label, value in annotations.get("cats").items():
             textcat.add_label(label)
@@ -264,15 +354,24 @@ def test_textcat_configs(textcat_config):
 def test_positive_class():
     nlp = English()
     textcat = nlp.add_pipe("textcat")
-    get_examples = make_get_examples(nlp)
+    get_examples = make_get_examples_single_label(nlp)
     textcat.initialize(get_examples, labels=["POS", "NEG"], positive_label="POS")
     assert textcat.labels == ("POS", "NEG")
+    assert textcat.cfg["positive_label"] == "POS"
+
+    textcat_multilabel = nlp.add_pipe("textcat_multilabel")
+    get_examples = make_get_examples_multi_label(nlp)
+    with pytest.raises(TypeError):
+        textcat_multilabel.initialize(get_examples, labels=["POS", "NEG"], positive_label="POS")
+    textcat_multilabel.initialize(get_examples, labels=["FICTION", "DRAMA"])
+    assert textcat_multilabel.labels == ("FICTION", "DRAMA")
+    assert "positive_label" not in textcat_multilabel.cfg
 
 
 def test_positive_class_not_present():
     nlp = English()
     textcat = nlp.add_pipe("textcat")
-    get_examples = make_get_examples(nlp)
+    get_examples = make_get_examples_single_label(nlp)
     with pytest.raises(ValueError):
         textcat.initialize(get_examples, labels=["SOME", "THING"], positive_label="POS")
@@ -280,11 +379,9 @@ def test_positive_class_not_present():
 def test_positive_class_not_binary():
     nlp = English()
     textcat = nlp.add_pipe("textcat")
-    get_examples = make_get_examples(nlp)
+    get_examples = make_get_examples_multi_label(nlp)
     with pytest.raises(ValueError):
-        textcat.initialize(
-            get_examples, labels=["SOME", "THING", "POS"], positive_label="POS"
-        )
+        textcat.initialize(get_examples, labels=["SOME", "THING", "POS"], positive_label="POS")
 
 
 def test_textcat_evaluation():
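The theme running through these test changes is the split of text classification into two components: `textcat` now assumes mutually exclusive classes (and supports `positive_label` for binary setups), while the new `textcat_multilabel` scores every label independently. A hedged setup sketch showing the two side by side; the label names are illustrative only:

```python
from spacy.lang.en import English

nlp = English()
# Single-label: exactly one category should apply per doc
textcat = nlp.add_pipe("textcat")
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
# Multi-label: any number of categories may apply at once
textcat_multi = nlp.add_pipe("textcat_multilabel")
for label in ("ANGRY", "CONFUSED", "HAPPY"):
    textcat_multi.add_label(label)
nlp.initialize()
```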

View File

@ -113,7 +113,7 @@ cfg_string = """
factory = "tok2vec" factory = "tok2vec"
[components.tok2vec.model] [components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1" @architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed] [components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1" @architectures = "spacy.MultiHashEmbed.v1"
@ -123,7 +123,7 @@ cfg_string = """
include_static_vectors = false include_static_vectors = false
[components.tok2vec.model.encode] [components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1" @architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96 width = 96
depth = 4 depth = 4
window_size = 1 window_size = 1

View File

@@ -288,35 +288,33 @@ def test_multiple_predictions():
     dummy_pipe(doc)
 
 
-@pytest.mark.skip(reason="removed Beam stuff during the Example/GoldParse refactor")
 def test_issue4313():
     """ This should not crash or exit with some strange error code """
     beam_width = 16
     beam_density = 0.0001
     nlp = English()
-    config = {}
-    ner = nlp.create_pipe("ner", config=config)
+    config = {
+        "beam_width": beam_width,
+        "beam_density": beam_density,
+    }
+    ner = nlp.add_pipe("beam_ner", config=config)
     ner.add_label("SOME_LABEL")
-    ner.initialize(lambda: [])
+    nlp.initialize()
     # add a new label to the doc
     doc = nlp("What do you think about Apple ?")
     assert len(ner.labels) == 1
     assert "SOME_LABEL" in ner.labels
-    ner.add_label("MY_ORG")
+    # TODO: not sure if we want this to be necessary...
     apple_ent = Span(doc, 5, 6, label="MY_ORG")
     doc.ents = list(doc.ents) + [apple_ent]
     # ensure the beam_parse still works with the new label
     docs = [doc]
-    beams = nlp.entity.beam_parse(
-        docs, beam_width=beam_width, beam_density=beam_density
+    ner = nlp.get_pipe("beam_ner")
+    beams = ner.beam_parse(
+        docs, drop=0.0, beam_width=beam_width, beam_density=beam_density
     )
-    for doc, beam in zip(docs, beams):
-        entity_scores = defaultdict(float)
-        for score, ents in nlp.entity.moves.get_beam_parses(beam):
-            for start, end, label in ents:
-                entity_scores[(start, end, label)] += score
 
 
 def test_issue4348():
     """Test that training the tagger with empty data, doesn't throw errors"""

View File

@@ -2,8 +2,11 @@ import pytest
 from thinc.api import Config, fix_random_seed
 
 from spacy.lang.en import English
-from spacy.pipeline.textcat import default_model_config, bow_model_config
-from spacy.pipeline.textcat import cnn_model_config
+from spacy.pipeline.textcat import single_label_default_config, single_label_bow_config
+from spacy.pipeline.textcat import single_label_cnn_config
+from spacy.pipeline.textcat_multilabel import multi_label_default_config
+from spacy.pipeline.textcat_multilabel import multi_label_bow_config
+from spacy.pipeline.textcat_multilabel import multi_label_cnn_config
 from spacy.tokens import Span
 from spacy import displacy
 from spacy.pipeline import merge_entities
@@ -11,7 +14,15 @@ from spacy.training import Example
 
 @pytest.mark.parametrize(
-    "textcat_config", [default_model_config, bow_model_config, cnn_model_config]
+    "textcat_config",
+    [
+        single_label_default_config,
+        single_label_bow_config,
+        single_label_cnn_config,
+        multi_label_default_config,
+        multi_label_bow_config,
+        multi_label_cnn_config,
+    ],
 )
 def test_issue5551(textcat_config):
     """Test that after fixing the random seed, the results of the pipeline are truly identical"""

View File

@@ -1,4 +1,3 @@
-import pydantic
 import pytest
 from pydantic import ValidationError
 from spacy.schemas import TokenPattern, TokenPatternSchema

View File

@@ -208,7 +208,7 @@ def test_create_nlp_from_pretraining_config():
     config = Config().from_str(pretrain_config_string)
     pretrain_config = load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
     filled = config.merge(pretrain_config)
-    resolved = registry.resolve(filled["pretraining"], schema=ConfigSchemaPretrain)
+    registry.resolve(filled["pretraining"], schema=ConfigSchemaPretrain)
 
 
 def test_create_nlp_from_config_multiple_instances():

View File

@@ -4,7 +4,7 @@ from spacy.pipeline import Tagger, DependencyParser, EntityRecognizer
 from spacy.pipeline import TextCategorizer, SentenceRecognizer, TrainablePipe
 from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
 from spacy.pipeline.tagger import DEFAULT_TAGGER_MODEL
-from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL
+from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
 from spacy.pipeline.senter import DEFAULT_SENTER_MODEL
 from spacy.lang.en import English
 from thinc.api import Linear
@@ -24,7 +24,7 @@ def parser(en_vocab):
         "update_with_oracle_cut_size": 100,
         "beam_width": 1,
         "beam_update_prob": 1.0,
-        "beam_density": 0.0
+        "beam_density": 0.0,
     }
     cfg = {"model": DEFAULT_PARSER_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]
@@ -41,7 +41,7 @@ def blank_parser(en_vocab):
         "update_with_oracle_cut_size": 100,
         "beam_width": 1,
         "beam_update_prob": 1.0,
-        "beam_density": 0.0
+        "beam_density": 0.0,
     }
     cfg = {"model": DEFAULT_PARSER_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]
@@ -66,7 +66,7 @@ def test_serialize_parser_roundtrip_bytes(en_vocab, Parser):
         "update_with_oracle_cut_size": 100,
         "beam_width": 1,
         "beam_update_prob": 1.0,
-        "beam_density": 0.0
+        "beam_density": 0.0,
     }
     cfg = {"model": DEFAULT_PARSER_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]
@@ -90,7 +90,7 @@ def test_serialize_parser_strings(Parser):
         "update_with_oracle_cut_size": 100,
         "beam_width": 1,
         "beam_update_prob": 1.0,
-        "beam_density": 0.0
+        "beam_density": 0.0,
     }
     cfg = {"model": DEFAULT_PARSER_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]
@@ -112,7 +112,7 @@ def test_serialize_parser_roundtrip_disk(en_vocab, Parser):
         "update_with_oracle_cut_size": 100,
         "beam_width": 1,
         "beam_update_prob": 1.0,
-        "beam_density": 0.0
+        "beam_density": 0.0,
     }
     cfg = {"model": DEFAULT_PARSER_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]
@@ -140,9 +140,6 @@ def test_to_from_bytes(parser, blank_parser):
     assert blank_parser.moves.n_moves == parser.moves.n_moves
 
 
-@pytest.mark.skip(
-    reason="This seems to be a dict ordering bug somewhere. Only failing on some platforms."
-)
 def test_serialize_tagger_roundtrip_bytes(en_vocab, taggers):
     tagger1 = taggers[0]
     tagger1_b = tagger1.to_bytes()
@@ -191,7 +188,7 @@ def test_serialize_tagger_strings(en_vocab, de_vocab, taggers):
 def test_serialize_textcat_empty(en_vocab):
     # See issue #1105
-    cfg = {"model": DEFAULT_TEXTCAT_MODEL}
+    cfg = {"model": DEFAULT_SINGLE_TEXTCAT_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]
     textcat = TextCategorizer(en_vocab, model, threshold=0.5)
     textcat.to_bytes(exclude=["vocab"])

View File

@@ -26,7 +26,6 @@ def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
     assert tokenizer_reloaded.rules == {}
 
 
-@pytest.mark.skip(reason="Currently unreliable across platforms")
 @pytest.mark.parametrize("text", ["I💜you", "theyre", "“hello”"])
 def test_serialize_tokenizer_roundtrip_bytes(en_tokenizer, text):
     tokenizer = en_tokenizer
@@ -38,7 +37,6 @@ def test_serialize_tokenizer_roundtrip_bytes(en_tokenizer, text):
     assert [token.text for token in doc1] == [token.text for token in doc2]
 
 
-@pytest.mark.skip(reason="Currently unreliable across platforms")
 def test_serialize_tokenizer_roundtrip_disk(en_tokenizer):
     tokenizer = en_tokenizer
     with make_tempdir() as d:

View File

@@ -3,7 +3,9 @@ from click import NoSuchOption
 from spacy.training import docs_to_json, offsets_to_biluo_tags
 from spacy.training.converters import iob_to_docs, conll_ner_to_docs, conllu_to_docs
 from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate
+from spacy.lang.nl import Dutch
 from spacy.util import ENV_VARS
+from spacy.cli import info
 from spacy.cli.init_config import init_config, RECOMMENDATIONS
 from spacy.cli._util import validate_project_commands, parse_config_overrides
 from spacy.cli._util import load_project_config, substitute_project_variables
@@ -15,6 +17,16 @@ import os
 from .util import make_tempdir
 
 
+def test_cli_info():
+    nlp = Dutch()
+    nlp.add_pipe("textcat")
+    with make_tempdir() as tmp_dir:
+        nlp.to_disk(tmp_dir)
+        raw_data = info(tmp_dir, exclude=[""])
+        assert raw_data["lang"] == "nl"
+        assert raw_data["components"] == ["textcat"]
+
+
 def test_cli_converters_conllu_to_docs():
     # from NorNE: https://github.com/ltgoslo/norne/blob/3d23274965f513f23aa48455b28b1878dad23c05/ud/nob/no_bokmaal-ud-dev.conllu
     lines = [

View File

@@ -83,6 +83,7 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
 def test_prefer_gpu():
     try:
         import cupy  # noqa: F401
+
         prefer_gpu()
         assert isinstance(get_current_ops(), CupyOps)
     except ImportError:
@@ -92,17 +93,20 @@ def test_prefer_gpu():
 def test_require_gpu():
     try:
         import cupy  # noqa: F401
+
         require_gpu()
         assert isinstance(get_current_ops(), CupyOps)
     except ImportError:
         with pytest.raises(ValueError):
             require_gpu()
 
 
 def test_require_cpu():
     require_cpu()
     assert isinstance(get_current_ops(), NumpyOps)
     try:
         import cupy  # noqa: F401
+
         require_gpu()
         assert isinstance(get_current_ops(), CupyOps)
     except ImportError:

View File

@@ -294,7 +294,7 @@ def test_partial_annotation(en_tokenizer):
         # cats doesn't have an unset state
         if key.startswith("cats"):
             continue
-        assert scores[key] == None
+        assert scores[key] is None
 
     # partially annotated reference, not overlapping with predicted annotation
     ref_doc = en_tokenizer("a b c d e")
@@ -306,13 +306,13 @@ def test_partial_annotation(en_tokenizer):
     example = Example(pred_doc, ref_doc)
     scorer = Scorer()
     scores = scorer.score([example])
-    assert scores["token_acc"] == None
+    assert scores["token_acc"] is None
     assert scores["tag_acc"] == 0.0
     assert scores["pos_acc"] == 0.0
     assert scores["morph_acc"] == 0.0
     assert scores["dep_uas"] == 1.0
     assert scores["dep_las"] == 0.0
-    assert scores["sents_f"] == None
+    assert scores["sents_f"] is None
 
     # partially annotated reference, overlapping with predicted annotation
     ref_doc = en_tokenizer("a b c d e")
@@ -324,13 +324,13 @@ def test_partial_annotation(en_tokenizer):
     example = Example(pred_doc, ref_doc)
     scorer = Scorer()
     scores = scorer.score([example])
-    assert scores["token_acc"] == None
+    assert scores["token_acc"] is None
     assert scores["tag_acc"] == 1.0
     assert scores["pos_acc"] == 1.0
     assert scores["morph_acc"] == 0.0
     assert scores["dep_uas"] == 1.0
     assert scores["dep_las"] == 0.0
-    assert scores["sents_f"] == None
+    assert scores["sents_f"] is None
 
 
 def test_roc_auc_score():
@@ -391,7 +391,7 @@ def test_roc_auc_score():
     score.score_set(0.25, 0)
     score.score_set(0.75, 0)
     with pytest.raises(ValueError):
-        s = score.score
+        _ = score.score  # noqa: F841
 
     y_true = [1, 1]
     y_score = [0.25, 0.75]
@@ -402,4 +402,4 @@ def test_roc_auc_score():
     score.score_set(0.25, 1)
     score.score_set(0.75, 1)
     with pytest.raises(ValueError):
-        s = score.score
+        _ = score.score  # noqa: F841

View File

@@ -180,3 +180,9 @@ def test_tokenizer_special_cases_idx(tokenizer):
     doc = tokenizer(text)
     assert doc[1].idx == 4
     assert doc[2].idx == 7
+
+
+def test_tokenizer_special_cases_spaces(tokenizer):
+    assert [t.text for t in tokenizer("a b c")] == ["a", "b", "c"]
+    tokenizer.add_special_case("a b c", [{"ORTH": "a b c"}])
+    assert [t.text for t in tokenizer("a b c")] == ["a b c"]

View File

@@ -51,7 +51,7 @@ def test_readers():
     for example in train_corpus(nlp):
         nlp.update([example], sgd=optimizer)
     scores = nlp.evaluate(list(dev_corpus(nlp)))
-    assert scores["cats_score"] == 0.0
+    assert scores["cats_macro_auc"] == 0.0
     # ensure the pipeline runs
     doc = nlp("Quick test")
     assert doc.cats
@@ -73,7 +73,7 @@ def test_cat_readers(reader, additional_config):
     nlp_config_string = """
     [training]
     seed = 0
 
     [training.score_weights]
     cats_macro_auc = 1.0

View File

@@ -71,7 +71,6 @@ def test_table_api_to_from_bytes():
     assert "def" not in new_table2
 
 
-@pytest.mark.skip(reason="This fails on Python 3.5")
 def test_lookups_to_from_bytes():
     lookups = Lookups()
     lookups.add_table("table1", {"foo": "bar", "hello": "world"})
@@ -91,7 +90,6 @@ def test_lookups_to_from_bytes():
     assert new_lookups.to_bytes() == lookups_bytes
 
 
-@pytest.mark.skip(reason="This fails on Python 3.5")
 def test_lookups_to_from_disk():
     lookups = Lookups()
     lookups.add_table("table1", {"foo": "bar", "hello": "world"})
@@ -111,7 +109,6 @@ def test_lookups_to_from_disk():
     assert table2["b"] == 2
 
 
-@pytest.mark.skip(reason="This fails on Python 3.5")
 def test_lookups_to_from_bytes_via_vocab():
     table_name = "test"
     vocab = Vocab()
@@ -128,7 +125,6 @@ def test_lookups_to_from_bytes_via_vocab():
     assert new_vocab.to_bytes() == vocab_bytes
 
 
-@pytest.mark.skip(reason="This fails on Python 3.5")
 def test_lookups_to_from_disk_via_vocab():
     table_name = "test"
     vocab = Vocab()

View File

@@ -258,6 +258,7 @@ cdef class Tokenizer:
             tokens = doc.c
         # Otherwise create a separate array to store modified tokens
         else:
+            assert max_length > 0
             tokens = <TokenC*>mem.alloc(max_length, sizeof(TokenC))
         # Modify tokenization according to filtered special cases
         offset = self._retokenize_special_spans(doc, tokens, span_data)
@@ -610,7 +611,7 @@ cdef class Tokenizer:
                 self.mem.free(stale_special)
         self._rules[string] = substrings
         self._flush_cache()
-        if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string):
+        if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string) or " " in string:
             self._special_matcher.add(string, None, self._tokenize_affixes(string, False))
 
     def _reload_special_cases(self):

View File

@@ -188,8 +188,15 @@ def _merge(Doc doc, merges):
                 and doc.c[start - 1].ent_type == token.ent_type:
             merged_iob = 1
         token.ent_iob = merged_iob
+        # Set lemma to concatenated lemmas
+        merged_lemma = ""
+        for span_token in span:
+            merged_lemma += span_token.lemma_
+            if doc.c[span_token.i].spacy:
+                merged_lemma += " "
+        merged_lemma = merged_lemma.strip()
+        token.lemma = doc.vocab.strings.add(merged_lemma)
         # Unset attributes that don't match new token
-        token.lemma = 0
         token.norm = 0
         tokens[merge_index] = token
     # Resize the doc.tensor, if it's set. Let the last row for each token stand
@@ -335,7 +342,9 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
         token = &doc.c[token_index + i]
         lex = doc.vocab.get(doc.mem, orth)
         token.lex = lex
-        token.lemma = 0  # reset lemma
+        # If lemma is currently set, set default lemma to orth
+        if token.lemma != 0:
+            token.lemma = lex.orth
         token.norm = 0  # reset norm
         if to_process_tensor:
             # setting the tensors of the split tokens to array of zeros
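With these changes, merging no longer simply wipes the lemma: the merged token's lemma becomes the lemmas of the merged span joined on their original spacing, and splitting falls back to each new token's orth whenever a lemma had been set. A sketch of the merge behaviour, with lemmas assigned by hand purely for illustration:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("New York is big.")
doc[0].lemma_ = "new"
doc[1].lemma_ = "york"
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])
# The merged token keeps the concatenated lemmas
assert doc[0].lemma_ == "new york"
```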

View File

@@ -225,6 +225,7 @@ cdef class Doc:
         # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
         # However, we need to remember the true starting places, so that we can
         # realloc.
+        assert size + (PADDING*2) > 0
         data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
         cdef int i
         for i in range(size + (PADDING*2)):
@@ -1097,7 +1098,7 @@ cdef class Doc:
             (vocab,) = vocab
         if attrs is None:
-            attrs = Doc._get_array_attrs()
+            attrs = list(Doc._get_array_attrs())
         else:
             if any(isinstance(attr, str) for attr in attrs):  # resolve attribute names
                 attrs = [intify_attr(attr) for attr in attrs]  # intify_attr returns None for invalid attrs
@@ -1177,6 +1178,7 @@ cdef class Doc:
         other.length = self.length
         other.max_length = self.max_length
         buff_size = other.max_length + (PADDING*2)
+        assert buff_size > 0
         tokens = <TokenC*>other.mem.alloc(buff_size, sizeof(TokenC))
         memcpy(tokens, self.c - PADDING, buff_size * sizeof(TokenC))
         other.c = &tokens[PADDING]

View File

@ -37,9 +37,17 @@ def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language":
T = registry.resolve(config["training"], schema=ConfigSchemaTraining) T = registry.resolve(config["training"], schema=ConfigSchemaTraining)
dot_names = [T["train_corpus"], T["dev_corpus"]] dot_names = [T["train_corpus"], T["dev_corpus"]]
if not isinstance(T["train_corpus"], str): if not isinstance(T["train_corpus"], str):
raise ConfigValidationError(desc=Errors.E897.format(field="training.train_corpus", type=type(T["train_corpus"]))) raise ConfigValidationError(
desc=Errors.E897.format(
field="training.train_corpus", type=type(T["train_corpus"])
)
)
if not isinstance(T["dev_corpus"], str): if not isinstance(T["dev_corpus"], str):
raise ConfigValidationError(desc=Errors.E897.format(field="training.dev_corpus", type=type(T["dev_corpus"]))) raise ConfigValidationError(
desc=Errors.E897.format(
field="training.dev_corpus", type=type(T["dev_corpus"])
)
)
train_corpus, dev_corpus = resolve_dot_names(config, dot_names) train_corpus, dev_corpus = resolve_dot_names(config, dot_names)
optimizer = T["optimizer"] optimizer = T["optimizer"]
# Components that shouldn't be updated during training # Components that shouldn't be updated during training

View File

@@ -59,6 +59,19 @@ def train(
     batcher = T["batcher"]
     train_logger = T["logger"]
     before_to_disk = create_before_to_disk_callback(T["before_to_disk"])
+
+    # Helper function to save checkpoints. This is a closure for convenience,
+    # to avoid passing in all the args all the time.
+    def save_checkpoint(is_best):
+        with nlp.use_params(optimizer.averages):
+            before_to_disk(nlp).to_disk(output_path / DIR_MODEL_LAST)
+        if is_best:
+            # Avoid saving twice (saving will be more expensive than
+            # the dir copy)
+            if (output_path / DIR_MODEL_BEST).exists():
+                shutil.rmtree(output_path / DIR_MODEL_BEST)
+            shutil.copytree(output_path / DIR_MODEL_LAST, output_path / DIR_MODEL_BEST)
+
     # Components that shouldn't be updated during training
     frozen_components = T["frozen_components"]
     # Create iterator, which yields out info after each optimization step.
@@ -87,36 +100,31 @@ def train(
                 if is_best_checkpoint is not None and output_path is not None:
                     with nlp.select_pipes(disable=frozen_components):
                         update_meta(T, nlp, info)
-                    with nlp.use_params(optimizer.averages):
-                        nlp = before_to_disk(nlp)
-                        nlp.to_disk(output_path / DIR_MODEL_BEST)
+                    save_checkpoint(is_best_checkpoint)
     except Exception as e:
         if output_path is not None:
+            # We don't want to swallow the traceback if we don't have a
+            # specific error, but we do want to warn that we're trying
+            # to do something here.
             stdout.write(
                 msg.warn(
                     f"Aborting and saving the final best model. "
-                    f"Encountered exception: {str(e)}"
+                    f"Encountered exception: {repr(e)}"
                 )
                 + "\n"
             )
         raise e
     finally:
         finalize_logger()
-    if optimizer.averages:
-        nlp.use_params(optimizer.averages)
-    if output_path is not None:
-        final_model_path = output_path / DIR_MODEL_LAST
-        nlp.to_disk(final_model_path)
-        # This will only run if we don't hit an error
-        stdout.write(
-            msg.good("Saved pipeline to output directory", final_model_path) + "\n"
-        )
-        return (nlp, final_model_path)
-    else:
-        return (nlp, None)
+        save_checkpoint(False)
+    # This will only run if we didn't hit an error
+    if optimizer.averages:
+        nlp.use_params(optimizer.averages)
+    if output_path is not None:
+        stdout.write(
+            msg.good("Saved pipeline to output directory", output_path / DIR_MODEL_LAST)
+            + "\n"
+        )
+        return (nlp, output_path / DIR_MODEL_LAST)
+    else:
+        return (nlp, None)
 
 
 def train_while_improving(
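A note on the design choice above: `save_checkpoint` is a closure so it can capture `nlp`, `optimizer`, `before_to_disk`, and `output_path` without threading them through every call site. The last model is now always written to `DIR_MODEL_LAST`, including from the `finally` block, so an interrupted run still leaves a usable artifact, and the best model is produced by copying that directory rather than serializing the pipeline a second time, which the inline comment notes is cheaper than another `to_disk` call.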

View File

@@ -10,7 +10,7 @@ from wasabi import Printer
 
 from .example import Example
 from ..tokens import Doc
-from ..schemas import ConfigSchemaTraining, ConfigSchemaPretrain
+from ..schemas import ConfigSchemaPretrain
 from ..util import registry, load_model_from_config, dot_to_object
@@ -30,7 +30,6 @@ def pretrain(
         set_gpu_allocator(allocator)
     nlp = load_model_from_config(config)
     _config = nlp.config.interpolate()
-    T = registry.resolve(_config["training"], schema=ConfigSchemaTraining)
     P = registry.resolve(_config["pretraining"], schema=ConfigSchemaPretrain)
     corpus = dot_to_object(_config, P["corpus"])
     corpus = registry.resolve({"corpus": corpus})["corpus"]

View File

@ -69,7 +69,7 @@ CONFIG_SECTION_ORDER = ["paths", "variables", "system", "nlp", "components", "co
logger = logging.getLogger("spacy") logger = logging.getLogger("spacy")
logger_stream_handler = logging.StreamHandler() logger_stream_handler = logging.StreamHandler()
logger_stream_handler.setFormatter(logging.Formatter('%(message)s')) logger_stream_handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(logger_stream_handler) logger.addHandler(logger_stream_handler)

View File

@@ -164,7 +164,7 @@ cdef class Vocab:
         if len(string) < 3 or self.length < 10000:
             mem = self.mem
         cdef bint is_oov = mem is not self.mem
-        lex = <LexemeC*>mem.alloc(sizeof(LexemeC), 1)
+        lex = <LexemeC*>mem.alloc(1, sizeof(LexemeC))
         lex.orth = self.strings.add(string)
         lex.length = len(string)
         if self.vectors is not None:

View File

@@ -5,6 +5,7 @@ source: spacy/ml/models
menu:
  - ['Tok2Vec', 'tok2vec-arch']
  - ['Transformers', 'transformers']
  - ['Pretraining', 'pretrain']
  - ['Parser & NER', 'parser']
  - ['Tagging', 'tagger']
  - ['Text Classification', 'textcat']
@@ -25,20 +26,20 @@ usage documentation on
## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"}

### spacy.Tok2Vec.v2 {#Tok2Vec}

> #### Example config
>
> ```ini
> [model]
> @architectures = "spacy.Tok2Vec.v2"
>
> [model.embed]
> @architectures = "spacy.CharacterEmbed.v1"
> # ...
>
> [model.encode]
> @architectures = "spacy.MaxoutWindowEncoder.v2"
> # ...
> ```
@@ -196,13 +197,13 @@ network to construct a single vector to represent the information.
| `nC`        | The number of UTF-8 bytes to embed per word. Recommended values are between `3` and `8`, although it may depend on the length of words in the language. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |

### spacy.MaxoutWindowEncoder.v2 {#MaxoutWindowEncoder}

> #### Example config
>
> ```ini
> [model]
> @architectures = "spacy.MaxoutWindowEncoder.v2"
> width = 128
> window_size = 1
> maxout_pieces = 3
@@ -220,13 +221,13 @@ and residual connections.
| `depth`     | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |

### spacy.MishWindowEncoder.v2 {#MishWindowEncoder}

> #### Example config
>
> ```ini
> [model]
> @architectures = "spacy.MishWindowEncoder.v2"
> width = 64
> window_size = 1
> depth = 4
@@ -251,19 +252,19 @@ and residual connections.
> [model]
> @architectures = "spacy.TorchBiLSTMEncoder.v1"
> width = 64
> depth = 2
> dropout = 0.0
> ```

Encode context using bidirectional LSTM layers. Requires
[PyTorch](https://pytorch.org).

| Name        | Description |
| ----------- | ----------- |
| `width`     | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. ~~int~~ |
| `depth`     | The number of recurrent layers, for instance `depth=2` results in stacking two LSTMs together. ~~int~~ |
| `dropout`   | Creates a Dropout layer on the outputs of each LSTM layer except the last layer. Set to 0.0 to disable this functionality. ~~float~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |

### spacy.StaticVectors.v1 {#StaticVectors}

@@ -426,6 +427,71 @@ one component.
| `grad_factor` | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. ~~float~~ |
| **CREATES**   | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
## Pretraining architectures {#pretrain source="spacy/ml/models/multi_task.py"}
The spaCy [`pretrain`](/api/cli#pretrain) command lets you initialize a `Tok2Vec` layer in your
pipeline with information from raw text. To this end, additional layers are
added to build a network for a temporary task that forces the `Tok2Vec` layer to
learn something about sentence structure and word cooccurrence statistics. Two
pretraining objectives are available, both of which are variants of the cloze
task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced for
BERT.
For more information, see the section on
[pretraining](/usage/embeddings-transformers#pretraining).
### spacy.PretrainVectors.v1 {#pretrain_vectors}
> #### Example config
>
> ```ini
> [pretraining]
> component = "tok2vec"
> ...
>
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```
Predict the word's vector from a static embeddings table as pretraining
objective for a Tok2Vec layer.
| Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
| `loss`          | The loss function can be either "cosine" or "L2". We typically recommend using "cosine". ~~str~~ |
| **CREATES** | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
### spacy.PretrainCharacters.v1 {#pretrain_chars}
> #### Example config
>
> ```ini
> [pretraining]
> component = "tok2vec"
> ...
>
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```
Predict some number of leading and trailing UTF-8 bytes as pretraining objective
for a Tok2Vec layer.
| Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
| `n_characters` | The window of characters - e.g. if `n_characters = 2`, the model will try to predict the first two and last two characters of the word. ~~int~~ |
| **CREATES** | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
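
Both objectives are used via the [`pretrain`](/api/cli#pretrain) command. As a
quick way to get a working setup, a config with a `[pretraining]` block can be
auto-generated and then run on raw text (a sketch; adjust the language,
pipeline and paths to your project):

```cli
$ python -m spacy init config config.cfg --lang en --pipeline tok2vec --pretraining
$ python -m spacy pretrain config.cfg ./output_pretrain --paths.raw_text ./data.jsonl
```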
## Parser & NER architectures {#parser}

### spacy.TransitionBasedParser.v2 {#TransitionBasedParser source="spacy/ml/models/parser.py"}

@@ -534,7 +600,7 @@ specific data and challenge.
> no_output_layer = false
>
> [model.tok2vec]
> @architectures = "spacy.Tok2Vec.v2"
>
> [model.tok2vec.embed]
> @architectures = "spacy.MultiHashEmbed.v1"
@@ -544,7 +610,7 @@ specific data and challenge.
> include_static_vectors = false
>
> [model.tok2vec.encode]
> @architectures = "spacy.MaxoutWindowEncoder.v2"
> width = ${model.tok2vec.embed.width}
> window_size = 1
> maxout_pieces = 3

View File

@@ -61,20 +61,27 @@ markup to copy-paste into
[GitHub issues](https://github.com/explosion/spaCy/issues).

```cli
$ python -m spacy info [--markdown] [--silent] [--exclude]
```

> #### Example
>
> ```cli
> $ python -m spacy info en_core_web_lg --markdown
> ```

```cli
$ python -m spacy info [model] [--markdown] [--silent] [--exclude]
```
| Name                                             | Description |
| ------------------------------------------------ | ----------- |
| `model`                                          | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ |
| `--markdown`, `-md`                              | Print information as Markdown. ~~bool (flag)~~ |
| `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | Don't print anything, just return the values. ~~bool (flag)~~ |
| `--exclude`, `-e`                                | Comma-separated keys to exclude from the print-out. Defaults to `"labels"`. ~~Optional[str]~~ |
| `--help`, `-h`                                   | Show help message and available arguments. ~~bool (flag)~~ |
| **PRINTS**                                       | Information about your spaCy installation. |
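
For instance, to trim the print-out further you can pass several keys at once
(a hypothetical invocation; which keys exist beyond the default `"labels"`
depends on your pipeline's meta):

```cli
$ python -m spacy info en_core_web_sm --markdown --exclude "labels,performance"
```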
## validate {#validate new="2" tag="command"}
@@ -121,7 +128,7 @@ customize those settings in your config file later.
> ```

```cli
$ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [--gpu] [--pretraining] [--force]
```
| Name | Description |
@@ -132,6 +139,7 @@ $ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [
| `--optimize`, `-o`     | `"efficiency"` or `"accuracy"`. Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters. Defaults to `"efficiency"`. ~~str (option)~~ |
| `--gpu`, `-G`          | Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters. ~~bool (flag)~~ |
| `--pretraining`, `-pt` | Include config for pretraining (with [`spacy pretrain`](/api/cli#pretrain)). Defaults to `False`. ~~bool (flag)~~ |
| `--force`, `-f`        | Force overwriting the output file if it already exists. ~~bool (flag)~~ |
| `--help`, `-h`         | Show help message and available arguments. ~~bool (flag)~~ |
| **CREATES**            | The config file for training. |
@@ -783,6 +791,12 @@ in the section `[paths]`.

</Infobox>

> #### Example
>
> ```cli
> $ python -m spacy train config.cfg --output ./output --paths.train ./train --paths.dev ./dev
> ```

```cli
$ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id] [overrides]
```
@@ -801,15 +815,16 @@ $ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id]

## pretrain {#pretrain new="2.1" tag="command,experimental"}

Pretrain the "token to vector" ([`Tok2vec`](/api/tok2vec)) layer of pipeline
components on raw text, using an approximate language-modeling objective.
Specifically, we load pretrained vectors, and train a component like a CNN,
BiLSTM, etc. to predict vectors which match the pretrained ones. The weights are
saved to a directory after each epoch. You can then include a **path to one of
these pretrained weights files** in your
[training config](/usage/training#config) as the `init_tok2vec` setting when you
train your pipeline. This technique may be especially helpful if you have little
labelled data. See the usage docs on
[pretraining](/usage/embeddings-transformers#pretraining) for more info. To read
the raw text, a [`JsonlCorpus`](/api/top-level#jsonlcorpus) is typically used.
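
For example, a training config might point `init_tok2vec` at the saved weights
like this (a sketch; the actual filename depends on which epoch you pick):

```ini
[paths]
init_tok2vec = "output_pretrain/model99.bin"

[initialize]
init_tok2vec = ${paths.init_tok2vec}
```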
<Infobox title="Changed in v3.0" variant="warning">

@@ -823,6 +838,12 @@ auto-generated by setting `--pretraining` on

</Infobox>

> #### Example
>
> ```cli
> $ python -m spacy pretrain config.cfg ./output_pretrain --paths.raw_text ./data.jsonl
> ```

```cli
$ python -m spacy pretrain [config_path] [output_dir] [--code] [--resume-path] [--epoch-resume] [--gpu-id] [overrides]
```

View File

@@ -94,7 +94,7 @@ Defines the `nlp` object, its tokenizer and
>
> [components.textcat.model]
> @architectures = "spacy.TextCatBOW.v1"
> exclusive_classes = true
> ngram_size = 1
> no_output_layer = false
> ```
@@ -148,7 +148,7 @@ This section defines a **dictionary** mapping of string keys to functions. Each
function takes an `nlp` object and yields [`Example`](/api/example) objects. By
default, the two keys `train` and `dev` are specified and each refers to a
[`Corpus`](/api/top-level#Corpus). When pretraining, an additional `pretrain`
section is added that defaults to a [`JsonlCorpus`](/api/top-level#jsonlcorpus).
You can also register custom functions that return a callable.
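
A minimal `[corpora]` block following this scheme might look like the following
(a sketch, assuming the standard `Corpus` reader and paths defined in `[paths]`):

```ini
[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
```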
| Name | Description |

View File

@ -0,0 +1,454 @@
---
title: Multi-label TextCategorizer
tag: class
source: spacy/pipeline/textcat_multilabel.py
new: 3
teaser: 'Pipeline component for multi-label text classification'
api_base_class: /api/pipe
api_string_name: textcat_multilabel
api_trainable: true
---
The text categorizer predicts **categories over a whole document**. It
learns non-mutually exclusive labels, which means that zero or more labels
may be true per document.
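
For instance, a blank pipeline can get a multi-label text categorizer added
like this (a minimal sketch; the label names are made up):

```python
import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
for label in ("POLITICS", "ECONOMY", "SPORTS"):
    textcat.add_label(label)
# After training, doc.cats holds an independent score per label,
# so any number of labels can apply to the same document.
```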
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
[`config.cfg` for training](/usage/training#config). See the
[model architectures](/api/architectures) documentation for details on the
architectures and their arguments and hyperparameters.
> #### Example
>
> ```python
> from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL
> config = {
> "threshold": 0.5,
> "model": DEFAULT_MULTI_TEXTCAT_MODEL,
> }
> nlp.add_pipe("textcat_multilabel", config=config)
> ```
| Setting | Description |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
```python
%%GITHUB_SPACY/spacy/pipeline/textcat_multilabel.py
```
## MultiLabel_TextCategorizer.\_\_init\_\_ {#init tag="method"}
> #### Example
>
> ```python
> # Construction via add_pipe with default model
> textcat = nlp.add_pipe("textcat_multilabel")
>
> # Construction via add_pipe with custom model
> config = {"model": {"@architectures": "my_textcat"}}
> parser = nlp.add_pipe("textcat_multilabel", config=config)
>
> # Construction from class
> from spacy.pipeline import MultiLabel_TextCategorizer
> textcat = MultiLabel_TextCategorizer(nlp.vocab, model, threshold=0.5)
> ```
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).
| Name | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
| _keyword-only_ | |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
## MultiLabel_TextCategorizer.\_\_call\_\_ {#call tag="method"}
Apply the pipe to one document. The document is modified in place, and returned.
This usually happens under the hood when the `nlp` object is called on a text
and all pipeline components are applied to the `Doc` in order. Both
[`__call__`](/api/multilabel_textcategorizer#call) and [`pipe`](/api/multilabel_textcategorizer#pipe)
delegate to the [`predict`](/api/multilabel_textcategorizer#predict) and
[`set_annotations`](/api/multilabel_textcategorizer#set_annotations) methods.
> #### Example
>
> ```python
> doc = nlp("This is a sentence.")
> textcat = nlp.add_pipe("textcat_multilabel")
> # This usually happens under the hood
> processed = textcat(doc)
> ```
| Name | Description |
| ----------- | -------------------------------- |
| `doc` | The document to process. ~~Doc~~ |
| **RETURNS** | The processed document. ~~Doc~~ |
## MultiLabel_TextCategorizer.pipe {#pipe tag="method"}
Apply the pipe to a stream of documents. This usually happens under the hood
when the `nlp` object is called on a text and all pipeline components are
applied to the `Doc` in order. Both [`__call__`](/api/multilabel_textcategorizer#call) and
[`pipe`](/api/multilabel_textcategorizer#pipe) delegate to the
[`predict`](/api/multilabel_textcategorizer#predict) and
[`set_annotations`](/api/multilabel_textcategorizer#set_annotations) methods.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> for doc in textcat.pipe(docs, batch_size=50):
> pass
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------- |
| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
| _keyword-only_ | |
| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
| **YIELDS** | The processed documents in order. ~~Doc~~ |
## MultiLabel_TextCategorizer.initialize {#initialize tag="method" new="3"}
Initialize the component for training. `get_examples` should be a function that
returns an iterable of [`Example`](/api/example) objects. The data examples are
used to **initialize the model** of the component and can either be the full
training data or a representative sample. Initialization includes validating the
network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data. This method is typically called
by [`Language.initialize`](/api/language#initialize) and lets you customize
arguments it receives via the
[`[initialize.components]`](/api/data-formats#config-initialize) block in the
config.
<Infobox variant="warning" title="Changed in v3.0" id="begin_training">
This method was previously called `begin_training`.
</Infobox>
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.initialize(lambda: [], nlp=nlp)
> ```
>
> ```ini
> ### config.cfg
> [initialize.components.textcat_multilabel]
>
> [initialize.components.textcat_multilabel.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/textcat.json"
> ```
| Name | Description |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
| _keyword-only_ | |
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |
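
To generate such a labels file ahead of time, you can run the
[`init labels`](/api/cli#init-labels) command on your config (sketched here
with an assumed output directory matching the example above):

```cli
$ python -m spacy init labels config.cfg corpus/labels
```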
## MultiLabel_TextCategorizer.predict {#predict tag="method"}
Apply the component's model to a batch of [`Doc`](/api/doc) objects without
modifying them.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict([doc1, doc2])
> ```
| Name | Description |
| ----------- | ------------------------------------------- |
| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
| **RETURNS** | The model's prediction for each document. |
## MultiLabel_TextCategorizer.set_annotations {#set_annotations tag="method"}
Modify a batch of [`Doc`](/api/doc) objects using pre-computed scores.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict(docs)
> textcat.set_annotations(docs, scores)
> ```
| Name | Description |
| -------- | --------------------------------------------------------- |
| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
| `scores` | The scores to set, produced by `MultiLabel_TextCategorizer.predict`. |
## MultiLabel_TextCategorizer.update {#update tag="method"}
Learn from a batch of [`Example`](/api/example) objects containing the
predictions and gold-standard annotations, and update the component's model.
Delegates to [`predict`](/api/multilabel_textcategorizer#predict) and
[`get_loss`](/api/multilabel_textcategorizer#get_loss).
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = nlp.initialize()
> losses = textcat.update(examples, sgd=optimizer)
> ```
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
## MultiLabel_TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"}
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model to try to address
the "catastrophic forgetting" problem. This feature is experimental.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = nlp.resume_training()
> losses = textcat.rehearse(examples, sgd=optimizer)
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
## MultiLabel_TextCategorizer.get_loss {#get_loss tag="method"}
Find the loss and gradient of loss for the batch of documents and their
predicted scores.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict([eg.predicted for eg in examples])
> loss, d_loss = textcat.get_loss(examples, scores)
> ```
| Name | Description |
| ----------- | --------------------------------------------------------------------------- |
| `examples` | The batch of examples. ~~Iterable[Example]~~ |
| `scores` | Scores representing the model's predictions. |
| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |
## MultiLabel_TextCategorizer.score {#score tag="method" new="3"}
Score a batch of examples.
> #### Example
>
> ```python
> scores = textcat.score(examples)
> ```
| Name | Description |
| ---------------- | -------------------------------------------------------------------------------------------------------------------- |
| `examples` | The examples to score. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |
## MultiLabel_TextCategorizer.create_optimizer {#create_optimizer tag="method"}
Create an optimizer for the pipeline component.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat")
> optimizer = textcat.create_optimizer()
> ```
| Name | Description |
| ----------- | ---------------------------- |
| **RETURNS** | The optimizer. ~~Optimizer~~ |
## MultiLabel_TextCategorizer.use_params {#use_params tag="method, contextmanager"}
Modify the pipe's model to use the given parameter values.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat")
> with textcat.use_params(optimizer.averages):
> textcat.to_disk("/best_model")
> ```
| Name | Description |
| -------- | -------------------------------------------------- |
| `params` | The parameter values to use in the model. ~~dict~~ |
## MultiLabel_TextCategorizer.add_label {#add_label tag="method"}
Add a new label to the pipe. Raises an error if the output dimension is already
set, or if the model has already been fully [initialized](#initialize). Note
that you don't have to call this method if you provide a **representative data
sample** to the [`initialize`](#initialize) method. In this case, all labels
found in the sample will be automatically added to the model, and the output
dimension will be [inferred](/usage/layers-architectures#thinc-shape-inference)
automatically.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat")
> textcat.add_label("MY_LABEL")
> ```
| Name | Description |
| ----------- | ----------------------------------------------------------- |
| `label` | The label to add. ~~str~~ |
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
## MultiLabel_TextCategorizer.to_disk {#to_disk tag="method"}
Serialize the pipe to disk.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat")
> textcat.to_disk("/path/to/textcat")
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
## MultiLabel_TextCategorizer.from_disk {#from_disk tag="method"}
Load the pipe from disk. Modifies the object in place and returns it.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat")
> textcat.from_disk("/path/to/textcat")
> ```
| Name | Description |
| -------------- | ----------------------------------------------------------------------------------------------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The modified `MultiLabel_TextCategorizer` object. ~~MultiLabel_TextCategorizer~~ |
## MultiLabel_TextCategorizer.to_bytes {#to_bytes tag="method"}
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat")
> textcat_bytes = textcat.to_bytes()
> ```
Serialize the pipe to a bytestring.
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------- |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The serialized form of the `MultiLabel_TextCategorizer` object. ~~bytes~~ |
## MultiLabel_TextCategorizer.from_bytes {#from_bytes tag="method"}
Load the pipe from a bytestring. Modifies the object in place and returns it.
> #### Example
>
> ```python
> textcat_bytes = textcat.to_bytes()
> textcat = nlp.add_pipe("textcat")
> textcat.from_bytes(textcat_bytes)
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------- |
| `bytes_data` | The data to load from. ~~bytes~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The `MultiLabel_TextCategorizer` object. ~~MultiLabel_TextCategorizer~~ |
## MultiLabel_TextCategorizer.labels {#labels tag="property"}
The labels currently added to the component.
> #### Example
>
> ```python
> textcat.add_label("MY_LABEL")
> assert "MY_LABEL" in textcat.labels
> ```
| Name | Description |
| ----------- | ------------------------------------------------------ |
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
## MultiLabel_TextCategorizer.label_data {#label_data tag="property" new="3"}
The labels currently added to the component and their internal meta information.
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
[`MultiLabel_TextCategorizer.initialize`](/api/multilabel_textcategorizer#initialize) to initialize
the model with a pre-defined label set.
> #### Example
>
> ```python
> labels = textcat.label_data
> textcat.initialize(lambda: [], nlp=nlp, labels=labels)
> ```
| Name | Description |
| ----------- | ---------------------------------------------------------- |
| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> data = textcat.to_disk("/path", exclude=["vocab"])
> ```
| Name | Description |
| ------- | -------------------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |

View File

@@ -3,17 +3,15 @@ title: TextCategorizer
tag: class
source: spacy/pipeline/textcat.py
new: 2
teaser: 'Pipeline component for single-label text classification'
api_base_class: /api/pipe
api_string_name: textcat
api_trainable: true
---

The text categorizer predicts **categories over a whole document**. It can learn
one or more labels, and the labels are mutually exclusive - there is exactly one
true label per document.
## Config and implementation {#config}

@@ -27,10 +25,10 @@ architectures and their arguments and hyperparameters.

> #### Example
>
> ```python
> from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
> config = {
>     "threshold": 0.5,
>     "model": DEFAULT_SINGLE_TEXTCAT_MODEL,
> }
> nlp.add_pipe("textcat", config=config)
> ```
@@ -280,7 +278,6 @@ Score a batch of examples.
| ---------------- | ----------- |
| `examples`       | The examples to score. ~~Iterable[Example]~~ |
| _keyword-only_   | |
| **RETURNS**      | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |

## TextCategorizer.create_optimizer {#create_optimizer tag="method"}

View File

@@ -129,13 +129,13 @@ the entity recognizer, use a
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"

[components.ner]
factory = "ner"
@@ -161,13 +161,13 @@ factory = "ner"
@architectures = "spacy.TransitionBasedParser.v1"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
```

<!-- TODO: Once rehearsal is tested, mention it here. -->
@@ -713,34 +713,39 @@ layer = "tok2vec"

#### Pretraining objectives {#pretraining-details}

> ```ini
> ### Characters objective
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```
>
> ```ini
> ### Vectors objective
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```

Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.

- [`PretrainCharacters`](/api/architectures#pretrain_chars): The `"characters"`
  objective asks the model to predict some number of leading and trailing UTF-8
  bytes for the words. For instance, with `n_characters = 2`, the model will
  try to predict the first two and last two characters of the word.
- [`PretrainVectors`](/api/architectures#pretrain_vectors): The `"vectors"`
  objective asks the model to predict the word's vector, from a static
  embeddings table. This requires a word vectors model to be trained and loaded.
  The vectors objective can optimize either a cosine or an L2 loss. We've
  generally found cosine loss to perform better.

These pretraining objectives use a trick that we term **language modelling with
approximate outputs (LMAO)**. The motivation for the trick is that predicting an

View File

@@ -134,7 +134,7 @@ labels = []
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
@@ -144,7 +144,7 @@ attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false

[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = ${components.textcat.model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3
@@ -152,7 +152,7 @@ depth = 2

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
```
@@ -170,7 +170,7 @@ labels = []

[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null
@@ -201,14 +201,14 @@ tokens, and their combination forms a typical
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
# ...

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
# ...
```
@@ -224,7 +224,7 @@ architecture:
# ...

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
# ...
```
@@ -716,7 +716,7 @@ that we want to classify as being related or not. As these candidate pairs are
typically formed within one document, this function takes a [`Doc`](/api/doc) as
input and outputs a `List` of `Span` tuples. For instance, the following
implementation takes any two entities from the same document, as long as they
are within a **maximum distance** (in number of tokens) of each other:

> #### config.cfg (excerpt)
>
@@ -742,7 +742,7 @@ def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]
    return get_candidates
```
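
Only a fragment of the function appears in the hunk above; a full candidate
generator along these lines might look as follows (a sketch; the registry
string is a hypothetical name of your choosing):

```python
from typing import Callable, List, Tuple

import spacy
from spacy.tokens import Doc, Span

@spacy.registry.misc("instance_generator.v1")  # hypothetical registry name
def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
    def get_candidates(doc: Doc) -> List[Tuple[Span, Span]]:
        candidates = []
        # Pair up any two distinct entities that are close enough together
        for ent1 in doc.ents:
            for ent2 in doc.ents:
                if ent1 != ent2 and abs(ent2.start - ent1.start) <= max_length:
                    candidates.append((ent1, ent2))
        return candidates

    return get_candidates
```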
This function is added to the [`@misc` registry](/api/top-level#registry) so we
can refer to it from the config, and easily swap it out for any other candidate
generation function.

View File

@@ -1060,7 +1060,7 @@ In this example we assume a custom function `read_custom_data` which loads or
generates texts with relevant text classification annotations. Then, small
lexical variations of the input text are created before generating the final
[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
you register the function creating the custom reader in the `readers`
[registry](/api/top-level#registry) and assign it a string name, so it can be
used in your config. All arguments on the registered function become available
as **config settings** - in this case, `source`.
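
Put together, a registered custom reader along these lines might look like the
following (a sketch; `read_custom_data`, its output format and the registry
string are assumptions following the description above):

```python
from typing import Callable, Iterable

import spacy
from spacy.language import Language
from spacy.training import Example

def read_custom_data(source: str):
    # Stand-in for the loading logic described above (assumption)
    yield "This is a text about config files.", {"POSITIVE": 1.0, "NEGATIVE": 0.0}

@spacy.registry.readers("corpus_variants.v1")  # assumed registry name
def stream_data(source: str) -> Callable[[Language], Iterable[Example]]:
    def generate_stream(nlp: Language) -> Iterable[Example]:
        for text, cats in read_custom_data(source):
            # Create a small lexical variation of the input text
            variant = text.replace("config", "configuration")
            doc = nlp.make_doc(variant)
            yield Example.from_dict(doc, {"cats": cats})

    return generate_stream
```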