Merge branch 'develop' of https://github.com/explosion/spaCy into develop

Ines Montani 2021-01-13 12:03:02 +11:00
commit 97d5a7ba99
82 changed files with 2710 additions and 4817 deletions

106
.github/contributors/bratao.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Bruno Souza Cabral |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 24/12/2020 |
| GitHub username | bratao |
| Website (optional) | |

1
.gitignore vendored
View File

@ -51,6 +51,7 @@ env3.*/
.pypyenv
.pytest_cache/
.mypy_cache/
.hypothesis/
# Distribution / packaging
env/

View File

@ -35,7 +35,10 @@ def download_cli(
def download(model: str, direct: bool = False, *pip_args) -> None:
if not (is_package("spacy") or is_package("spacy-nightly")) and "--no-deps" not in pip_args:
if (
not (is_package("spacy") or is_package("spacy-nightly"))
and "--no-deps" not in pip_args
):
msg.warn(
"Skipping pipeline package dependencies and setting `--no-deps`. "
"You don't seem to have the spaCy package itself installed "

View File

@ -172,7 +172,9 @@ def render_parses(
file_.write(html)
def print_prf_per_type(msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str) -> None:
def print_prf_per_type(
msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str
) -> None:
data = [
(k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}")
for k, v in scores.items()

View File

@ -1,10 +1,10 @@
from typing import Optional, Dict, Any, Union
from typing import Optional, Dict, Any, Union, List
import platform
from pathlib import Path
from wasabi import Printer, MarkdownRenderer
import srsly
from ._util import app, Arg, Opt
from ._util import app, Arg, Opt, string_to_list
from .. import util
from .. import about
@ -15,20 +15,22 @@ def info_cli(
model: Optional[str] = Arg(None, help="Optional loadable spaCy pipeline"),
markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues"),
silent: bool = Opt(False, "--silent", "-s", "-S", help="Don't print anything (just return)"),
exclude: Optional[str] = Opt("labels", "--exclude", "-e", help="Comma-separated keys to exclude from the print-out"),
# fmt: on
):
"""
Print info about spaCy installation. If a pipeline is speficied as an argument,
Print info about spaCy installation. If a pipeline is specified as an argument,
print its meta information. Flag --markdown prints details in Markdown for easy
copy-pasting to GitHub issues.
DOCS: https://nightly.spacy.io/api/cli#info
"""
info(model, markdown=markdown, silent=silent)
exclude = string_to_list(exclude)
info(model, markdown=markdown, silent=silent, exclude=exclude)
def info(
model: Optional[str] = None, *, markdown: bool = False, silent: bool = True
model: Optional[str] = None, *, markdown: bool = False, silent: bool = True, exclude: List[str]
) -> Union[str, dict]:
msg = Printer(no_print=silent, pretty=not silent)
if model:
@ -42,13 +44,13 @@ def info(
data["Pipelines"] = ", ".join(
f"{n} ({v})" for n, v in data["Pipelines"].items()
)
markdown_data = get_markdown(data, title=title)
markdown_data = get_markdown(data, title=title, exclude=exclude)
if markdown:
if not silent:
print(markdown_data)
return markdown_data
if not silent:
table_data = dict(data)
table_data = {k: v for k, v in data.items() if k not in exclude}
msg.table(table_data, title=title)
return raw_data
@ -82,7 +84,7 @@ def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
if util.is_package(model):
model_path = util.get_package_path(model)
else:
model_path = model
model_path = Path(model)
meta_path = model_path / "meta.json"
if not meta_path.is_file():
msg.fail("Can't find pipeline meta.json", meta_path, exits=1)
@ -96,7 +98,7 @@ def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
}
def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str:
def get_markdown(data: Dict[str, Any], title: Optional[str] = None, exclude: List[str] = None) -> str:
"""Get data in GitHub-flavoured Markdown format for issues etc.
data (dict or list of tuples): Label/value pairs.
@ -108,8 +110,16 @@ def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str:
md.add(md.title(2, title))
items = []
for key, value in data.items():
if isinstance(value, str) and Path(value).exists():
if exclude and key in exclude:
continue
if isinstance(value, str):
try:
existing_path = Path(value).exists()
except:
# invalid Path, like a URL string
existing_path = False
if existing_path:
continue
items.append(f"{md.bold(f'{key}:')} {value}")
md.add(md.list(items))
return f"\n{md.text}\n"
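
Review note: the hunks above add an `exclude` list and wrap `Path(value).exists()` in a bare `except:`. The sketch below shows the same two ideas in isolation, with a narrower exception clause. The helper names `is_existing_path` and `filter_for_markdown` are hypothetical and do not appear in the commit; which exceptions `Path.exists()` can raise depends on the Python version and platform, so the tuple caught here is an assumption.

```python
from pathlib import Path
from typing import Any, Dict, Iterable, Optional


def is_existing_path(value: Any) -> bool:
    """Return True only if value is a string that names an existing path."""
    if not isinstance(value, str):
        return False
    try:
        return Path(value).exists()
    except (OSError, ValueError):
        # Strings that are not usable as paths (over-long names, embedded
        # null bytes on some Python versions) are treated as "not a path".
        return False


def filter_for_markdown(
    data: Dict[str, Any], exclude: Optional[Iterable[str]] = None
) -> Dict[str, Any]:
    """Drop excluded keys and values that point at existing paths."""
    excluded = set(exclude or [])
    return {
        key: value
        for key, value in data.items()
        if key not in excluded and not is_existing_path(value)
    }
```

Catching specific exceptions rather than everything keeps `KeyboardInterrupt` and programming errors visible, which may or may not matter for a CLI helper like this one.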

View File

@ -32,6 +32,7 @@ def init_config_cli(
optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
gpu: bool = Opt(False, "--gpu", "-G", help="Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters."),
pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
force_overwrite: bool = Opt(False, "--force", "-F", help="Force overwriting the output file"),
# fmt: on
):
"""
@ -46,6 +47,12 @@ def init_config_cli(
optimize = optimize.value
pipeline = string_to_list(pipeline)
is_stdout = str(output_file) == "-"
if not is_stdout and output_file.exists() and not force_overwrite:
msg = Printer()
msg.fail(
"The provided output file already exists. To force overwriting the config file, set the --force or -F flag.",
exits=1,
)
config = init_config(
lang=lang,
pipeline=pipeline,
@ -162,7 +169,7 @@ def init_config(
"Hardware": variables["hardware"].upper(),
"Transformer": template_vars.transformer.get("name", False),
}
msg.info("Generated template specific for your use case")
msg.info("Generated config template specific for your use case")
for label, value in use_case.items():
msg.text(f"- {label}: {value}")
with show_validation_error(hint_fill=False):
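
Review note: the new `--force` / `-F` flag guards against silently overwriting an existing config. A minimal sketch of that decision as a pure function follows; `may_write` is a hypothetical name used only for illustration, and the real command reports the failure through `msg.fail(..., exits=1)` instead of returning a boolean.

```python
from pathlib import Path


def may_write(output_file: Path, force_overwrite: bool = False) -> bool:
    """Decide whether writing the config to output_file is allowed."""
    # "-" means "write to stdout", so there is nothing to overwrite.
    is_stdout = str(output_file) == "-"
    if is_stdout or force_overwrite:
        return True
    # Otherwise only write when the target does not exist yet.
    return not output_file.exists()
```

Keeping the check separate from the printing makes the three cases (stdout, forced, fresh file) easy to test on their own.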

View File

@ -149,13 +149,44 @@ grad_factor = 1.0
[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
exclusive_classes = true
ngram_size = 1
no_output_layer = false
{% else -%}
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
{%- endif %}
{%- endif %}
{% if "textcat_multilabel" in components %}
[components.textcat_multilabel]
factory = "textcat_multilabel"
{% if optimize == "accuracy" %}
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null
[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
[components.textcat_multilabel.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
{% else -%}
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
@ -174,7 +205,7 @@ no_output_layer = false
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
@ -189,7 +220,7 @@ rows = [5000, 2500]
include_static_vectors = {{ "true" if optimize == "accuracy" else "false" }}
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = {{ 96 if optimize == "efficiency" else 256 }}
depth = {{ 4 if optimize == "efficiency" else 8 }}
window_size = 1
@ -288,13 +319,41 @@ width = ${components.tok2vec.model.encode.width}
[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
exclusive_classes = true
ngram_size = 1
no_output_layer = false
{% else -%}
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
{%- endif %}
{%- endif %}
{% if "textcat_multilabel" in components %}
[components.textcat_multilabel]
factory = "textcat_multilabel"
{% if optimize == "accuracy" %}
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null
[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
{% else -%}
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
@ -303,7 +362,7 @@ no_output_layer = false
{% endif %}
{% for pipe in components %}
{% if pipe not in ["tagger", "morphologizer", "parser", "ner", "textcat", "entity_linker"] %}
{% if pipe not in ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker"] %}
{# Other components defined by the user: we just assume they're factories #}
[components.{{ pipe }}]
factory = "{{ pipe }}"
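
Review note: the template now sets `exclusive_classes = true` for `textcat` and `exclusive_classes = false` for the new `textcat_multilabel` blocks. The sketch below illustrates the usual meaning of that distinction at the output layer: a softmax over all labels for exclusive classes versus an independent logistic score per label for multilabel. It is written in plain Python as an illustration and is not taken from the spaCy or thinc source.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Exclusive classes: probabilities compete and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores: List[float]) -> List[float]:
    """Multilabel: each label gets its own independent probability."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


raw = [2.0, 1.0, -1.0]
print(softmax(raw))  # values sum to 1
print(sigmoid(raw))  # values are independent and may sum to more than 1
```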

View File

@ -463,6 +463,10 @@ class Errors:
"issue tracker: http://github.com/explosion/spaCy/issues")
# TODO: fix numbering after merging develop into master
E895 = ("The 'textcat' component received gold-standard annotations with "
"multiple labels per document. In spaCy 3 you should use the "
"'textcat_multilabel' component for this instead. "
"Example of an offending annotation: {value}")
E896 = ("There was an error using the static vectors. Ensure that the vectors "
"of the vocab are properly initialized, or set 'include_static_vectors' "
"to False.")
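
Review note: E895 is raised when `textcat` receives annotations with more than one positive label per document. A minimal sketch of such a check is shown below, assuming categories arrive as a dict of label to score and that a score of `1.0` marks a positive label. The function name `validate_exclusive_cats` and the shortened message are hypothetical; the component's actual validation code is not part of this diff.

```python
from typing import Dict, Iterable


def validate_exclusive_cats(all_cats: Iterable[Dict[str, float]]) -> None:
    """Raise ValueError if any document has more than one positive label."""
    for cats in all_cats:
        positives = [label for label, value in cats.items() if value == 1.0]
        if len(positives) > 1:
            raise ValueError(
                "The 'textcat' component received gold-standard annotations "
                f"with multiple labels per document: {cats}"
            )


validate_exclusive_cats([{"POSITIVE": 1.0, "NEGATIVE": 0.0}])  # passes silently
```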

View File

@ -214,8 +214,22 @@ _macedonian_lower = r"ѓѕјљњќѐѝ"
_macedonian_upper = r"ЃЅЈЉЊЌЀЍ"
_macedonian = r"ѓѕјљњќѐѝЃЅЈЉЊЌЀЍ"
_upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper + _macedonian_upper
_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower + _macedonian_lower
_upper = (
LATIN_UPPER
+ _russian_upper
+ _tatar_upper
+ _greek_upper
+ _ukrainian_upper
+ _macedonian_upper
)
_lower = (
LATIN_LOWER
+ _russian_lower
+ _tatar_lower
+ _greek_lower
+ _ukrainian_lower
+ _macedonian_lower
)
_uncased = (
_bengali
@ -230,7 +244,9 @@ _uncased = (
+ _cjk
)
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _macedonian + _uncased)
ALPHA = group_chars(
LATIN + _russian + _tatar + _greek + _ukrainian + _macedonian + _uncased
)
ALPHA_LOWER = group_chars(_lower + _uncased)
ALPHA_UPPER = group_chars(_upper + _uncased)

View File

@ -1,18 +1,11 @@
from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP
from ...language import Language
from ...attrs import LANG
from .lex_attrs import LEX_ATTRS
from ...language import Language
class CzechDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "cs"
tag_map = TAG_MAP
stop_words = STOP_WORDS
lex_attr_getters = LEX_ATTRS
stop_words = STOP_WORDS
class Czech(Language):

File diff suppressed because it is too large

View File

@ -14,7 +14,7 @@ class MacedonianLemmatizer(Lemmatizer):
if univ_pos in ("", "eol", "space"):
return [string.lower()]
if string[-3:] == 'јќи':
if string[-3:] == "јќи":
string = string[:-3]
univ_pos = "verb"
@ -23,7 +23,13 @@ class MacedonianLemmatizer(Lemmatizer):
index_table = self.lookups.get_table("lemma_index", {})
exc_table = self.lookups.get_table("lemma_exc", {})
rules_table = self.lookups.get_table("lemma_rules", {})
if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))):
if not any(
(
index_table.get(univ_pos),
exc_table.get(univ_pos),
rules_table.get(univ_pos),
)
):
if univ_pos == "propn":
return [string]
else:

View File

@ -1,21 +1,104 @@
from ...attrs import LIKE_NUM
_num_words = [
"нула", "еден", "една", "едно", "два", "две", "три", "четири", "пет", "шест", "седум", "осум", "девет", "десет",
"единаесет", "дванаесет", "тринаесет", "четиринаесет", "петнаесет", "шеснаесет", "седумнаесет", "осумнаесет",
"деветнаесет", "дваесет", "триесет", "четириесет", "педесет", "шеесет", "седумдесет", "осумдесет", "деведесет",
"сто", "двесте", "триста", "четиристотини", "петстотини", "шестотини", "седумстотини", "осумстотини",
"деветстотини", "илјада", "илјади", 'милион', 'милиони', 'милијарда', 'милијарди', 'билион', 'билиони',
"двајца", "тројца", "четворица", "петмина", "шестмина", "седуммина", "осуммина", "деветмина", "обата", "обајцата",
"прв", "втор", "трет", "четврт", "седм", "осм", "двестоти",
"два-три", "два-триесет", "два-триесетмина", "два-тринаесет", "два-тројца", "две-три", "две-тристотини",
"пет-шеесет", "пет-шеесетмина", "пет-шеснаесетмина", "пет-шест", "пет-шестмина", "пет-шестотини", "петина",
"осмина", "седум-осум", "седум-осумдесет", "седум-осуммина", "седум-осумнаесет", "седум-осумнаесетмина",
"три-четириесет", "три-четиринаесет", "шеесет", "шеесетина", "шеесетмина", "шеснаесет", "шеснаесетмина",
"шест-седум", "шест-седумдесет", "шест-седумнаесет", "шест-седумстотини", "шестоти", "шестотини"
"нула",
"еден",
"една",
"едно",
"два",
"две",
"три",
"четири",
"пет",
"шест",
"седум",
"осум",
"девет",
"десет",
"единаесет",
"дванаесет",
"тринаесет",
"четиринаесет",
"петнаесет",
"шеснаесет",
"седумнаесет",
"осумнаесет",
"деветнаесет",
"дваесет",
"триесет",
"четириесет",
"педесет",
"шеесет",
"седумдесет",
"осумдесет",
"деведесет",
"сто",
"двесте",
"триста",
"четиристотини",
"петстотини",
"шестотини",
"седумстотини",
"осумстотини",
"деветстотини",
"илјада",
"илјади",
"милион",
"милиони",
"милијарда",
"милијарди",
"билион",
"билиони",
"двајца",
"тројца",
"четворица",
"петмина",
"шестмина",
"седуммина",
"осуммина",
"деветмина",
"обата",
"обајцата",
"прв",
"втор",
"трет",
"четврт",
"седм",
"осм",
"двестоти",
"два-три",
"два-триесет",
"два-триесетмина",
"два-тринаесет",
"два-тројца",
"две-три",
"две-тристотини",
"пет-шеесет",
"пет-шеесетмина",
"пет-шеснаесетмина",
"пет-шест",
"пет-шестмина",
"пет-шестотини",
"петина",
"осмина",
"седум-осум",
"седум-осумдесет",
"седум-осуммина",
"седум-осумнаесет",
"седум-осумнаесетмина",
"три-четириесет",
"три-четиринаесет",
"шеесет",
"шеесетина",
"шеесетмина",
"шеснаесет",
"шеснаесетмина",
"шест-седум",
"шест-седумдесет",
"шест-седумнаесет",
"шест-седумстотини",
"шестоти",
"шестотини",
]
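
Review note: the diff only reformats `_num_words`; the function that consumes the list is outside the hunk. For readers unfamiliar with the pattern, here is a sketch of how a `like_num` lexical attribute is commonly written around such a list. The body is an assumption modelled on the general spaCy convention and is not copied from this file, and the word list is shortened to a few entries from the hunk above.

```python
# Shortened subset of the list above, for illustration only.
_num_words = ["нула", "еден", "два", "три", "два-три"]


def like_num(text: str) -> bool:
    """Return True if the token text looks like a number."""
    # Ignore a leading sign or approximation mark.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Treat thousands separators and decimal marks as part of the number.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "1/2".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _num_words
```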

View File

@ -21,8 +21,7 @@ _abbr_exc = [
{ORTH: "хл", NORM: "хектолитар"},
{ORTH: "дкл", NORM: "декалитар"},
{ORTH: "л", NORM: "литар"},
{ORTH: "дл", NORM: "децилитар"}
{ORTH: "дл", NORM: "децилитар"},
]
for abbr in _abbr_exc:
_exc[abbr[ORTH]] = [abbr]
@ -33,7 +32,6 @@ _abbr_line_exc = [
{ORTH: "г-ѓа", NORM: "госпоѓа"},
{ORTH: "г-ца", NORM: "госпоѓица"},
{ORTH: "г-дин", NORM: "господин"},
]
for abbr in _abbr_line_exc:
@ -54,7 +52,6 @@ _abbr_dot_exc = [
{ORTH: "т.", NORM: "точка"},
{ORTH: "т.е.", NORM: "то ест"},
{ORTH: "т.н.", NORM: "таканаречен"},
{ORTH: "бр.", NORM: "број"},
{ORTH: "гр.", NORM: "град"},
{ORTH: "др.", NORM: "другар"},
@ -68,7 +65,6 @@ _abbr_dot_exc = [
{ORTH: "с.", NORM: "страница"},
{ORTH: "стр.", NORM: "страница"},
{ORTH: "чл.", NORM: "член"},
{ORTH: "арх.", NORM: "архитект"},
{ORTH: "бел.", NORM: "белешка"},
{ORTH: "гимн.", NORM: "гимназија"},
@ -89,8 +85,6 @@ _abbr_dot_exc = [
{ORTH: "истор.", NORM: "историја"},
{ORTH: "геогр.", NORM: "географија"},
{ORTH: "литер.", NORM: "литература"},
]
for abbr in _abbr_dot_exc:

View File

@ -45,7 +45,7 @@ _abbr_period_exc = [
{ORTH: "Doç.", NORM: "doçent"},
{ORTH: "doğ."},
{ORTH: "Dr.", NORM: "doktor"},
{ORTH: "dr.", NORM:"doktor"},
{ORTH: "dr.", NORM: "doktor"},
{ORTH: "drl.", NORM: "derleyen"},
{ORTH: "Dz.", NORM: "deniz"},
{ORTH: "Dz.K.K.lığı"},
@ -118,7 +118,7 @@ _abbr_period_exc = [
{ORTH: "Uzm.", NORM: "uzman"},
{ORTH: "Üçvş.", NORM: "üstçavuş"},
{ORTH: "Üni.", NORM: "üniversitesi"},
{ORTH: "Ütğm.", NORM: "üsteğmen"},
{ORTH: "Ütğm.", NORM: "üsteğmen"},
{ORTH: "vb."},
{ORTH: "vs.", NORM: "vesaire"},
{ORTH: "Yard.", NORM: "yardımcı"},
@ -163,19 +163,29 @@ for abbr in _abbr_exc:
_exc[abbr[ORTH]] = [abbr]
_num = r"[+-]?\d+([,.]\d+)*"
_ord_num = r"(\d+\.)"
_date = r"(((\d{1,2}[./-]){2})?(\d{4})|(\d{1,2}[./]\d{1,2}(\.)?))"
_dash_num = r"(([{al}\d]+/\d+)|(\d+/[{al}]))".format(al=ALPHA)
_roman_num = "M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"
_roman_num = "M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"
_roman_ord = r"({rn})\.".format(rn=_roman_num)
_time_exp = r"\d+(:\d+)*"
_inflections = r"'[{al}]+".format(al=ALPHA_LOWER)
_abbrev_inflected = r"[{a}]+\.'[{al}]+".format(a=ALPHA, al=ALPHA_LOWER)
_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(d=_date, dn=_dash_num, te=_time_exp, on=_ord_num, n=_num, ro=_roman_ord, rn=_roman_num, inf=_inflections)
_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(
d=_date,
dn=_dash_num,
te=_time_exp,
on=_ord_num,
n=_num,
ro=_roman_ord,
rn=_roman_num,
inf=_inflections,
)
TOKENIZER_EXCEPTIONS = _exc
TOKEN_MATCH = re.compile(r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)).match
TOKEN_MATCH = re.compile(
r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)
).match

View File

@ -1,4 +1,3 @@
import numpy
from thinc.api import Model
from ..attrs import LOWER

View File

@ -21,14 +21,14 @@ def transition_parser_v1(
nO: Optional[int] = None,
) -> Model:
return build_tb_parser_model(
tok2vec,
state_type,
extra_state_tokens,
hidden_width,
maxout_pieces,
use_upper,
nO,
)
tok2vec,
state_type,
extra_state_tokens,
hidden_width,
maxout_pieces,
use_upper,
nO,
)
@registry.architectures.register("spacy.TransitionBasedParser.v2")
@ -42,14 +42,15 @@ def transition_parser_v2(
nO: Optional[int] = None,
) -> Model:
return build_tb_parser_model(
tok2vec,
state_type,
extra_state_tokens,
hidden_width,
maxout_pieces,
use_upper,
nO,
)
tok2vec,
state_type,
extra_state_tokens,
hidden_width,
maxout_pieces,
use_upper,
nO,
)
def build_tb_parser_model(
tok2vec: Model[List[Doc], List[Floats2d]],
@ -162,8 +163,8 @@ def _resize_upper(model, new_nO):
# just adding rows here.
if smaller.has_dim("nO"):
old_nO = smaller.get_dim("nO")
larger_W[: old_nO] = smaller_W
larger_b[: old_nO] = smaller_b
larger_W[:old_nO] = smaller_W
larger_b[:old_nO] = smaller_b
for i in range(old_nO, new_nO):
model.attrs["unseen_classes"].add(i)
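
Review note: the context lines in the last hunk copy the old weights into the first `old_nO` rows of a larger array and record the new row indices as unseen classes. The sketch below reproduces that idea with plain Python lists so it runs without thinc or NumPy. `resize_output` is a hypothetical helper, and the zero initialisation of the new rows is an assumption, since the allocation code is outside the hunk.

```python
from typing import List, Set, Tuple


def resize_output(
    old_W: List[List[float]], old_b: List[float], new_nO: int
) -> Tuple[List[List[float]], List[float], Set[int]]:
    """Grow an output layer from len(old_W) rows to new_nO rows."""
    old_nO = len(old_W)
    if new_nO < old_nO:
        raise ValueError("This sketch only supports adding rows")
    width = len(old_W[0]) if old_W else 0
    new_W = [[0.0] * width for _ in range(new_nO)]
    new_b = [0.0] * new_nO
    # Same slice assignment as in the hunk: keep the trained rows in place.
    new_W[:old_nO] = [row[:] for row in old_W]
    new_b[:old_nO] = old_b
    # Rows added by the resize have never been trained.
    unseen = set(range(old_nO, new_nO))
    return new_W, new_b, unseen
```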

View File

@ -6,6 +6,7 @@ from thinc.api import chain, concatenate, clone, Dropout, ParametricAttention
from thinc.api import SparseLinear, Softmax, softmax_activation, Maxout, reduce_sum
from thinc.api import HashEmbed, with_array, with_cpu, uniqued
from thinc.api import Relu, residual, expand_window
from thinc.layers.chain import init as init_chain
from ...attrs import ID, ORTH, PREFIX, SUFFIX, SHAPE, LOWER
from ...util import registry
@ -13,6 +14,7 @@ from ..extract_ngrams import extract_ngrams
from ..staticvectors import StaticVectors
from ..featureextractor import FeatureExtractor
from ...tokens import Doc
from .tok2vec import get_tok2vec_width
@registry.architectures.register("spacy.TextCatCNN.v1")
@ -69,13 +71,16 @@ def build_text_classifier_v2(
exclusive_classes = not linear_model.attrs["multi_label"]
with Model.define_operators({">>": chain, "|": concatenate}):
width = tok2vec.maybe_get_dim("nO")
attention_layer = ParametricAttention(width) # TODO: benchmark performance difference of this layer
maxout_layer = Maxout(nO=width, nI=width)
linear_layer = Linear(nO=nO, nI=width)
cnn_model = (
tok2vec
>> list2ragged()
>> ParametricAttention(width) # TODO: benchmark performance difference of this layer
>> attention_layer
>> reduce_sum()
>> residual(Maxout(nO=width, nI=width))
>> Linear(nO=nO, nI=width)
>> residual(maxout_layer)
>> linear_layer
>> Dropout(0.0)
)
@ -89,9 +94,25 @@ def build_text_classifier_v2(
if model.has_dim("nO") is not False:
model.set_dim("nO", nO)
model.set_ref("output_layer", linear_model.get_ref("output_layer"))
model.set_ref("attention_layer", attention_layer)
model.set_ref("maxout_layer", maxout_layer)
model.set_ref("linear_layer", linear_layer)
model.attrs["multi_label"] = not exclusive_classes
model.init = init_ensemble_textcat
return model
def init_ensemble_textcat(model, X, Y) -> Model:
tok2vec_width = get_tok2vec_width(model)
model.get_ref("attention_layer").set_dim("nO", tok2vec_width)
model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
model.get_ref("maxout_layer").set_dim("nI", tok2vec_width)
model.get_ref("linear_layer").set_dim("nI", tok2vec_width)
init_chain(model, X, Y)
return model
# TODO: move to legacy
@registry.architectures.register("spacy.TextCatEnsemble.v1")
def build_text_classifier_v1(

View File

@ -20,6 +20,17 @@ def tok2vec_listener_v1(width: int, upstream: str = "*"):
return tok2vec
def get_tok2vec_width(model: Model):
nO = None
if model.has_ref("tok2vec"):
tok2vec = model.get_ref("tok2vec")
if tok2vec.has_dim("nO"):
nO = tok2vec.get_dim("nO")
elif tok2vec.has_ref("listener"):
nO = tok2vec.get_ref("listener").get_dim("nO")
return nO
@registry.architectures.register("spacy.HashEmbedCNN.v1")
def build_hash_embed_cnn_tok2vec(
*,
@ -76,6 +87,7 @@ def build_hash_embed_cnn_tok2vec(
)
# TODO: archive
@registry.architectures.register("spacy.Tok2Vec.v1")
def build_Tok2Vec_model(
embed: Model[List[Doc], List[Floats2d]],
@ -97,6 +109,28 @@ def build_Tok2Vec_model(
return tok2vec
@registry.architectures.register("spacy.Tok2Vec.v2")
def build_Tok2Vec_model(
embed: Model[List[Doc], List[Floats2d]],
encode: Model[List[Floats2d], List[Floats2d]],
) -> Model[List[Doc], List[Floats2d]]:
"""Construct a tok2vec model out of embedding and encoding subnetworks.
See https://explosion.ai/blog/deep-learning-formula-nlp
embed (Model[List[Doc], List[Floats2d]]): Embed tokens into context-independent
word vector representations.
encode (Model[List[Floats2d], List[Floats2d]]): Encode context into the
embeddings, using an architecture such as a CNN, BiLSTM or transformer.
"""
tok2vec = chain(embed, encode)
tok2vec.set_dim("nO", encode.get_dim("nO"))
tok2vec.set_ref("embed", embed)
tok2vec.set_ref("encode", encode)
return tok2vec
@registry.architectures.register("spacy.MultiHashEmbed.v1")
def MultiHashEmbed(
width: int,
@ -244,6 +278,7 @@ def CharacterEmbed(
return model
# TODO: archive
@registry.architectures.register("spacy.MaxoutWindowEncoder.v1")
def MaxoutWindowEncoder(
width: int, window_size: int, maxout_pieces: int, depth: int
@ -275,7 +310,39 @@ def MaxoutWindowEncoder(
model.attrs["receptive_field"] = window_size * depth
return model
@registry.architectures.register("spacy.MaxoutWindowEncoder.v2")
def MaxoutWindowEncoder(
width: int, window_size: int, maxout_pieces: int, depth: int
) -> Model[List[Floats2d], List[Floats2d]]:
"""Encode context using convolutions with maxout activation, layer
normalization and residual connections.
width (int): The input and output width. These are required to be the same,
to allow residual connections. This value will be determined by the
width of the inputs. Recommended values are between 64 and 300.
window_size (int): The number of words to concatenate around each token
to construct the convolution. Recommended value is 1.
maxout_pieces (int): The number of maxout pieces to use. Recommended
values are 2 or 3.
depth (int): The number of convolutional layers. Recommended value is 4.
"""
cnn = chain(
expand_window(window_size=window_size),
Maxout(
nO=width,
nI=width * ((window_size * 2) + 1),
nP=maxout_pieces,
dropout=0.0,
normalize=True,
),
)
model = clone(residual(cnn), depth)
model.set_dim("nO", width)
receptive_field = window_size * depth
return with_array(model, pad=receptive_field)
# TODO: archive
@registry.architectures.register("spacy.MishWindowEncoder.v1")
def MishWindowEncoder(
width: int, window_size: int, depth: int
@ -299,6 +366,29 @@ def MishWindowEncoder(
return model
@registry.architectures.register("spacy.MishWindowEncoder.v2")
def MishWindowEncoder(
width: int, window_size: int, depth: int
) -> Model[List[Floats2d], List[Floats2d]]:
"""Encode context using convolutions with mish activation, layer
normalization and residual connections.
width (int): The input and output width. These are required to be the same,
to allow residual connections. This value will be determined by the
width of the inputs. Recommended values are between 64 and 300.
window_size (int): The number of words to concatenate around each token
to construct the convolution. Recommended value is 1.
depth (int): The number of convolutional layers. Recommended value is 4.
"""
cnn = chain(
expand_window(window_size=window_size),
Mish(nO=width, nI=width * ((window_size * 2) + 1), dropout=0.0, normalize=True),
)
model = clone(residual(cnn), depth)
model.set_dim("nO", width)
return with_array(model)
@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
def BiLSTMEncoder(
width: int, depth: int, dropout: float
@ -308,9 +398,9 @@ def BiLSTMEncoder(
width (int): The input and output width. These are required to be the same,
to allow residual connections. This value will be determined by the
width of the inputs. Recommended values are between 64 and 300.
window_size (int): The number of words to concatenate around each token
to construct the convolution. Recommended value is 1.
depth (int): The number of convolutional layers. Recommended value is 4.
depth (int): The number of recurrent layers.
dropout (float): Creates a Dropout layer on the outputs of each LSTM layer
except the last layer. Set to 0 to disable this functionality.
"""
if depth == 0:
return noop()
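
Review note: `spacy.MaxoutWindowEncoder.v2` above computes the Maxout input width as `width * ((window_size * 2) + 1)` and pads by `window_size * depth`. The worked example below plugs in the two settings from the config template in this commit (width 96 with depth 4 for efficiency, width 256 with depth 8 for accuracy, both with `window_size = 1`). `maxout_window_dims` is a hypothetical helper for the arithmetic only.

```python
from typing import Tuple


def maxout_window_dims(width: int, window_size: int, depth: int) -> Tuple[int, int]:
    """Return (Maxout input width, receptive field) for the encoder."""
    # Each token is concatenated with window_size neighbours on each side.
    n_input = width * ((window_size * 2) + 1)
    # Every residual layer widens the context by window_size tokens.
    receptive_field = window_size * depth
    return n_input, receptive_field


print(maxout_window_dims(96, 1, 4))   # efficiency settings
print(maxout_window_dims(256, 1, 8))  # accuracy settings
```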

View File

@ -47,8 +47,7 @@ def forward(
except ValueError:
raise RuntimeError(Errors.E896)
output = Ragged(
vectors_data,
model.ops.asarray([len(doc) for doc in docs], dtype="i")
vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i")
)
mask = None
if is_train:

View File

@ -1,8 +1,10 @@
from thinc.api import Model, noop, use_ops, Linear
from thinc.api import Model, noop
from .parser_model import ParserStepModel
def TransitionModel(tok2vec, lower, upper, resize_output, dropout=0.2, unseen_classes=set()):
def TransitionModel(
tok2vec, lower, upper, resize_output, dropout=0.2, unseen_classes=set()
):
"""Set up a stepwise transition-based model"""
if upper is None:
has_upper = False
@ -44,4 +46,3 @@ def init(model, X=None, Y=None):
if model.attrs["has_upper"]:
statevecs = model.ops.alloc2f(2, lower.get_dim("nO"))
model.get_ref("upper").initialize(X=statevecs)

View File

@ -133,8 +133,9 @@ cdef class Morphology:
"""
cdef MorphAnalysisC tag
tag.length = len(field_feature_pairs)
tag.fields = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
tag.features = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
if tag.length > 0:
tag.fields = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
tag.features = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
for i, (field, feature) in enumerate(field_feature_pairs):
tag.fields[i] = field
tag.features[i] = feature

View File

@ -11,6 +11,7 @@ from .senter import SentenceRecognizer
from .sentencizer import Sentencizer
from .tagger import Tagger
from .textcat import TextCategorizer
from .textcat_multilabel import MultiLabel_TextCategorizer
from .tok2vec import Tok2Vec
from .functions import merge_entities, merge_noun_chunks, merge_subtokens
@ -22,13 +23,14 @@ __all__ = [
"EntityRuler",
"Morphologizer",
"Lemmatizer",
"TrainablePipe",
"MultiLabel_TextCategorizer",
"Pipe",
"SentenceRecognizer",
"Sentencizer",
"Tagger",
"TextCategorizer",
"Tok2Vec",
"TrainablePipe",
"merge_entities",
"merge_noun_chunks",
"merge_subtokens",

View File

@ -255,7 +255,7 @@ def get_gradient(nr_class, beam_maps, histories, losses):
for a beam state -- so we have "the gradient of loss for taking
action i given history H."
Histories: Each hitory is a list of actions
Histories: Each history is a list of actions
Each candidate has a history
Each beam has multiple candidates
Each batch has multiple beams

View File

@ -4,4 +4,4 @@ from .transition_system cimport Transition, TransitionSystem
cdef class ArcEager(TransitionSystem):
pass
cdef get_arcs(self, StateC* state)

View File

@ -1,6 +1,7 @@
# cython: profile=True, cdivision=True, infer_types=True
from cymem.cymem cimport Pool, Address
from libc.stdint cimport int32_t
from libcpp.vector cimport vector
from collections import defaultdict, Counter
@ -10,9 +11,9 @@ from ...structs cimport TokenC
from ...tokens.doc cimport Doc, set_children_from_heads
from ...training.example cimport Example
from .stateclass cimport StateClass
from ._state cimport StateC
from ._state cimport StateC, ArcC
from ...errors import Errors
from thinc.extra.search cimport Beam
cdef weight_t MIN_SCORE = -90000
cdef attr_t SUBTOK_LABEL = hash_string(u'subtok')
@ -65,6 +66,7 @@ cdef GoldParseStateC create_gold_state(Pool mem, const StateC* state,
cdef GoldParseStateC gs
gs.length = len(heads)
gs.stride = 1
assert gs.length > 0
gs.labels = <attr_t*>mem.alloc(gs.length, sizeof(gs.labels[0]))
gs.heads = <int32_t*>mem.alloc(gs.length, sizeof(gs.heads[0]))
gs.n_kids = <int32_t*>mem.alloc(gs.length, sizeof(gs.n_kids[0]))
@ -126,6 +128,7 @@ cdef GoldParseStateC create_gold_state(Pool mem, const StateC* state,
1
)
# Make an array of pointers, pointing into the gs_kids_flat array.
assert gs.length > 0
gs.kids = <int32_t**>mem.alloc(gs.length, sizeof(int32_t*))
for i in range(gs.length):
if gs.n_kids[i] != 0:
@ -609,7 +612,7 @@ cdef class ArcEager(TransitionSystem):
return gold
def init_gold_batch(self, examples):
# TODO: Projectivitity?
# TODO: Projectivity?
all_states = self.init_batch([eg.predicted for eg in examples])
golds = []
states = []
@ -705,6 +708,28 @@ cdef class ArcEager(TransitionSystem):
doc.c[i].dep = self.root_label
set_children_from_heads(doc.c, 0, doc.length)
def get_beam_parses(self, Beam beam):
parses = []
probs = beam.probs
for i in range(beam.size):
state = <StateC*>beam.at(i)
if state.is_final():
prob = probs[i]
parse = []
arcs = self.get_arcs(state)
if arcs:
for arc in arcs:
dep = arc["label"]
label = self.strings[dep]
parse.append((arc["head"], arc["child"], label))
parses.append((prob, parse))
return parses
cdef get_arcs(self, StateC* state):
cdef vector[ArcC] arcs
state.get_arcs(&arcs)
return list(arcs)
def has_gold(self, Example eg, start=0, end=None):
for word in eg.y[start:end]:
if word.dep != 0:

View File

@ -2,6 +2,7 @@ from libc.stdint cimport int32_t
from cymem.cymem cimport Pool
from collections import Counter
from thinc.extra.search cimport Beam
from ...tokens.doc cimport Doc
from ...tokens.span import Span
@ -63,6 +64,7 @@ cdef GoldNERStateC create_gold_state(
Example example
) except *:
cdef GoldNERStateC gs
assert example.x.length > 0
gs.ner = <Transition*>mem.alloc(example.x.length, sizeof(Transition))
ner_tags = example.get_aligned_ner()
for i, ner_tag in enumerate(ner_tags):
@ -245,6 +247,21 @@ cdef class BiluoPushDown(TransitionSystem):
if doc.c[i].ent_iob == 0:
doc.c[i].ent_iob = 2
def get_beam_parses(self, Beam beam):
parses = []
probs = beam.probs
for i in range(beam.size):
state = <StateC*>beam.at(i)
if state.is_final():
prob = probs[i]
parse = []
for j in range(state._ents.size()):
ent = state._ents.at(j)
if ent.start != -1 and ent.end != -1:
parse.append((ent.start, ent.end, self.strings[ent.label]))
parses.append((prob, parse))
return parses
def init_gold(self, StateClass state, Example example):
return BiluoGold(self, state, example)

View File

@ -226,6 +226,7 @@ class AttributeRuler(Pipe):
DOCS: https://nightly.spacy.io/api/tagger#score
"""
def morph_key_getter(token, attr):
return getattr(token, attr).key
@ -240,8 +241,16 @@ class AttributeRuler(Pipe):
elif attr == POS:
results.update(Scorer.score_token_attr(examples, "pos", **kwargs))
elif attr == MORPH:
results.update(Scorer.score_token_attr(examples, "morph", getter=morph_key_getter, **kwargs))
results.update(Scorer.score_token_attr_per_feat(examples, "morph", getter=morph_key_getter, **kwargs))
results.update(
Scorer.score_token_attr(
examples, "morph", getter=morph_key_getter, **kwargs
)
)
results.update(
Scorer.score_token_attr_per_feat(
examples, "morph", getter=morph_key_getter, **kwargs
)
)
elif attr == LEMMA:
results.update(Scorer.score_token_attr(examples, "lemma", **kwargs))
return results

View File

@ -1,4 +1,5 @@
# cython: infer_types=True, profile=True, binding=True
from collections import defaultdict
from typing import Optional, Iterable
from thinc.api import Model, Config
@ -258,3 +259,20 @@ cdef class DependencyParser(Parser):
results.update(Scorer.score_deps(examples, "dep", **kwargs))
del results["sents_per_type"]
return results
def scored_parses(self, beams):
"""Return two lists of score dictionaries, with one dictionary per beam/doc that was processed:
the first keyed by (i, head) tuples, and the second keyed by (i, label) tuples.
"""
head_scores = []
label_scores = []
for beam in beams:
score_head_dict = defaultdict(float)
score_label_dict = defaultdict(float)
for score, parses in self.moves.get_beam_parses(beam):
for head, i, label in parses:
score_head_dict[(i, head)] += score
score_label_dict[(i, label)] += score
head_scores.append(score_head_dict)
label_scores.append(score_label_dict)
return head_scores, label_scores
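
The aggregation in `scored_parses` can be sketched in plain Python. This is a minimal illustration, not the spaCy implementation: `aggregate_parse_scores` and the `beam_parses` data are hypothetical stand-ins for what `get_beam_parses` returns for a single beam, assumed here to be `(prob, [(head, child, label), ...])` tuples.

```python
from collections import defaultdict


def aggregate_parse_scores(beam_parses):
    """Sum candidate probabilities per (child, head) and (child, label) pair.

    beam_parses is assumed to be a list of (prob, arcs) tuples, where arcs
    is a list of (head, child, label) triples (hypothetical input shape).
    """
    head_scores = defaultdict(float)
    label_scores = defaultdict(float)
    for prob, arcs in beam_parses:
        for head, child, label in arcs:
            # Every candidate that contains this arc adds its probability.
            head_scores[(child, head)] += prob
            label_scores[(child, label)] += prob
    return head_scores, label_scores


# Two made-up candidates that agree on token 0 but disagree on token 2.
beam_parses = [
    (0.75, [(1, 0, "nsubj"), (1, 2, "dobj")]),
    (0.25, [(1, 0, "nsubj"), (0, 2, "dobj")]),
]
head_scores, label_scores = aggregate_parse_scores(beam_parses)
```

An arc shared by every candidate accumulates the full probability mass, while an arc that appears in only one candidate keeps that candidate's probability.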

View File

@ -24,7 +24,7 @@ default_model_config = """
@architectures = "spacy.Tagger.v1"
[model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"
[model.tok2vec.embed]
@architectures = "spacy.CharacterEmbed.v1"
@ -35,7 +35,7 @@ nC = 8
include_static_vectors = false
[model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 1

View File

@ -1,4 +1,5 @@
# cython: infer_types=True, profile=True, binding=True
from collections import defaultdict
from typing import Optional, Iterable
from thinc.api import Model, Config
@ -197,3 +198,16 @@ cdef class EntityRecognizer(Parser):
"""
validate_examples(examples, "EntityRecognizer.score")
return get_ner_prf(examples)
def scored_ents(self, beams):
"""Return a list of dictionaries, one per beam/doc that was processed, each mapping
(start, end, label) tuples to their corresponding scores.
"""
entity_scores = []
for beam in beams:
score_dict = defaultdict(float)
for score, ents in self.moves.get_beam_parses(beam):
for start, end, label in ents:
score_dict[(start, end, label)] += score
entity_scores.append(score_dict)
return entity_scores
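
As a rough sketch of what `scored_ents` computes for one beam, the snippet below sums candidate probabilities per entity span. The helper name `aggregate_entity_scores` and the sample data are assumptions for illustration only; the input shape `(prob, [(start, end, label), ...])` mirrors the tuples built in `get_beam_parses`.

```python
from collections import defaultdict


def aggregate_entity_scores(beam_parses):
    """Map each (start, end, label) span to the summed probability of the
    candidates that contain it (hypothetical helper, not part of spaCy)."""
    scores = defaultdict(float)
    for prob, ents in beam_parses:
        for start, end, label in ents:
            scores[(start, end, label)] += prob
    return scores


# Three made-up beam candidates with their probabilities.
beam_parses = [
    (0.5, [(0, 2, "ORG")]),
    (0.25, [(0, 2, "ORG"), (3, 4, "GPE")]),
    (0.25, [(0, 1, "PERSON")]),
]
entity_scores = aggregate_entity_scores(beam_parses)
```

The resulting value for a span can be read as a confidence estimate: the share of the beam's probability mass that agrees on that exact span and label.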

View File

@ -256,8 +256,14 @@ class Tagger(TrainablePipe):
DOCS: https://nightly.spacy.io/api/tagger#get_loss
"""
validate_examples(examples, "Tagger.get_loss")
loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False, missing_value="")
truths = [eg.get_aligned("TAG", as_string=True) for eg in examples]
loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False)
# Convert empty tag "" to missing value None so that both misaligned
# tokens and tokens with missing annotation have the default missing
# value None.
truths = []
for eg in examples:
eg_truths = [tag if tag != "" else None for tag in eg.get_aligned("TAG", as_string=True)]
truths.append(eg_truths)
d_scores, loss = loss_func(scores, truths)
if self.model.ops.xp.isnan(loss):
raise ValueError(Errors.E910.format(name=self.name))
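
The conversion of empty tags to the default missing value can be sketched on its own. `tags_to_truths` is a hypothetical helper and the input list is invented; the sketch assumes aligned tags arrive as strings, with `""` for missing annotation and `None` for misaligned tokens. It compares with `!=` because identity checks against string literals depend on interning.

```python
def tags_to_truths(aligned_tags):
    """Replace empty-string tags with None so that misaligned tokens and
    tokens without annotation share one missing value (illustrative only)."""
    return [tag if tag != "" else None for tag in aligned_tags]


# "" marks a token without annotation, None a misaligned token.
truths = tags_to_truths(["NN", "", None, "VB"])
```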

View File

@ -14,12 +14,12 @@ from ..tokens import Doc
from ..vocab import Vocab
default_model_config = """
single_label_default_config = """
[model]
@architectures = "spacy.TextCatEnsemble.v2"
[model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"
[model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
@ -29,7 +29,7 @@ attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false
[model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = ${model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3
@ -37,24 +37,24 @@ depth = 2
[model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
exclusive_classes = true
ngram_size = 1
no_output_layer = false
"""
DEFAULT_TEXTCAT_MODEL = Config().from_str(default_model_config)["model"]
DEFAULT_SINGLE_TEXTCAT_MODEL = Config().from_str(single_label_default_config)["model"]
bow_model_config = """
single_label_bow_config = """
[model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
exclusive_classes = true
ngram_size = 1
no_output_layer = false
"""
cnn_model_config = """
single_label_cnn_config = """
[model]
@architectures = "spacy.TextCatCNN.v1"
exclusive_classes = false
exclusive_classes = true
[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
@ -71,7 +71,7 @@ subword_features = true
@Language.factory(
"textcat",
assigns=["doc.cats"],
default_config={"threshold": 0.5, "model": DEFAULT_TEXTCAT_MODEL},
default_config={"threshold": 0.5, "model": DEFAULT_SINGLE_TEXTCAT_MODEL},
default_score_weights={
"cats_score": 1.0,
"cats_score_desc": None,
@ -103,7 +103,7 @@ def make_textcat(
class TextCategorizer(TrainablePipe):
"""Pipeline component for text classification.
"""Pipeline component for single-label text classification.
DOCS: https://nightly.spacy.io/api/textcategorizer
"""
@ -111,7 +111,7 @@ class TextCategorizer(TrainablePipe):
def __init__(
self, vocab: Vocab, model: Model, name: str = "textcat", *, threshold: float
) -> None:
"""Initialize a text categorizer.
"""Initialize a text categorizer for single-label classification.
vocab (Vocab): The shared vocabulary.
model (thinc.api.Model): The Thinc Model powering the pipeline component.
@ -214,6 +214,7 @@ class TextCategorizer(TrainablePipe):
losses = {}
losses.setdefault(self.name, 0.0)
validate_examples(examples, "TextCategorizer.update")
self._validate_categories(examples)
if not any(len(eg.predicted) if eg.predicted else 0 for eg in examples):
# Handle cases where there are no tokens in any docs.
return losses
@ -256,6 +257,7 @@ class TextCategorizer(TrainablePipe):
if self._rehearsal_model is None:
return losses
validate_examples(examples, "TextCategorizer.rehearse")
self._validate_categories(examples)
docs = [eg.predicted for eg in examples]
if not any(len(doc) for doc in docs):
# Handle cases where there are no tokens in any docs.
@ -296,6 +298,7 @@ class TextCategorizer(TrainablePipe):
DOCS: https://nightly.spacy.io/api/textcategorizer#get_loss
"""
validate_examples(examples, "TextCategorizer.get_loss")
self._validate_categories(examples)
truths, not_missing = self._examples_to_truth(examples)
not_missing = self.model.ops.asarray(not_missing)
d_scores = (scores - truths) / scores.shape[0]
@ -341,6 +344,7 @@ class TextCategorizer(TrainablePipe):
DOCS: https://nightly.spacy.io/api/textcategorizer#initialize
"""
validate_get_examples(get_examples, "TextCategorizer.initialize")
self._validate_categories(get_examples())
if labels is None:
for example in get_examples():
for cat in example.y.cats:
@ -373,12 +377,20 @@ class TextCategorizer(TrainablePipe):
DOCS: https://nightly.spacy.io/api/textcategorizer#score
"""
validate_examples(examples, "TextCategorizer.score")
self._validate_categories(examples)
return Scorer.score_cats(
examples,
"cats",
labels=self.labels,
multi_label=self.model.attrs["multi_label"],
multi_label=False,
positive_label=self.cfg["positive_label"],
threshold=self.cfg["threshold"],
**kwargs,
)
def _validate_categories(self, examples: List[Example]):
"""Check whether the provided examples all have single-label cats annotations."""
for ex in examples:
if list(ex.reference.cats.values()).count(1.0) > 1:
raise ValueError(Errors.E895.format(value=ex.reference.cats))
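
The single-label check can be illustrated with plain dictionaries in place of `Example` objects. This is a sketch under that assumption: `validate_single_label` and its error message are hypothetical, and only the counting logic follows `_validate_categories` above.

```python
def validate_single_label(cats):
    """Raise ValueError if more than one category is marked positive (1.0).

    cats is assumed to be a dict of label -> score, like doc.cats.
    """
    positives = list(cats.values()).count(1.0)
    if positives > 1:
        raise ValueError(f"Expected at most one positive label, got: {cats}")


validate_single_label({"POSITIVE": 1.0, "NEGATIVE": 0.0})  # passes silently

try:
    validate_single_label({"SPORTS": 1.0, "POLITICS": 1.0})
    rejected = False
except ValueError:
    rejected = True
```

Annotations with zero positive labels still pass, since the check only rejects more than one positive value.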

View File

@ -0,0 +1,191 @@
from itertools import islice
from typing import Iterable, Optional, Dict, List, Callable, Any
from thinc.api import Model, Config
from thinc.types import Floats2d
from ..language import Language
from ..training import Example, validate_examples, validate_get_examples
from ..errors import Errors
from ..scorer import Scorer
from ..tokens import Doc
from ..vocab import Vocab
from .textcat import TextCategorizer
multi_label_default_config = """
[model]
@architectures = "spacy.TextCatEnsemble.v2"
[model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
[model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 64
rows = [2000, 2000, 1000, 1000, 1000, 1000]
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false
[model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = ${model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3
depth = 2
[model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
"""
DEFAULT_MULTI_TEXTCAT_MODEL = Config().from_str(multi_label_default_config)["model"]
multi_label_bow_config = """
[model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
"""
multi_label_cnn_config = """
[model]
@architectures = "spacy.TextCatCNN.v1"
exclusive_classes = false
[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
"""
@Language.factory(
"textcat_multilabel",
assigns=["doc.cats"],
default_config={"threshold": 0.5, "model": DEFAULT_MULTI_TEXTCAT_MODEL},
default_score_weights={
"cats_score": 1.0,
"cats_score_desc": None,
"cats_micro_p": None,
"cats_micro_r": None,
"cats_micro_f": None,
"cats_macro_p": None,
"cats_macro_r": None,
"cats_macro_f": None,
"cats_macro_auc": None,
"cats_f_per_type": None,
"cats_macro_auc_per_type": None,
},
)
def make_multilabel_textcat(
nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float
) -> "TextCategorizer":
"""Create a TextCategorizer component. The text categorizer predicts categories
over a whole document. It can learn one or more labels, and the labels can
be mutually exclusive (i.e. one true label per doc) or non-mutually exclusive
(i.e. zero or more labels may be true per doc). The multi-label setting is
controlled by the model instance that's provided.
model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
scores for each category.
threshold (float): Cutoff to consider a prediction "positive".
"""
return MultiLabel_TextCategorizer(nlp.vocab, model, name, threshold=threshold)
class MultiLabel_TextCategorizer(TextCategorizer):
"""Pipeline component for multi-label text classification.
DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer
"""
def __init__(
self,
vocab: Vocab,
model: Model,
name: str = "textcat_multilabel",
*,
threshold: float,
) -> None:
"""Initialize a text categorizer for multi-label classification.
vocab (Vocab): The shared vocabulary.
model (thinc.api.Model): The Thinc Model powering the pipeline component.
name (str): The component instance name, used to add entries to the
losses during training.
threshold (float): Cutoff to consider a prediction "positive".
DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer#init
"""
self.vocab = vocab
self.model = model
self.name = name
self._rehearsal_model = None
cfg = {"labels": [], "threshold": threshold}
self.cfg = dict(cfg)
def initialize(
self,
get_examples: Callable[[], Iterable[Example]],
*,
nlp: Optional[Language] = None,
labels: Optional[Dict] = None,
):
"""Initialize the pipe for training, using a representative set
of data examples.
get_examples (Callable[[], Iterable[Example]]): Function that
returns a representative sample of gold-standard Example objects.
nlp (Language): The current nlp object the component is part of.
labels: The labels to add to the component, typically generated by the
`init labels` command. If no labels are provided, the get_examples
callback is used to extract the labels from the data.
DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer#initialize
"""
validate_get_examples(get_examples, "MultiLabel_TextCategorizer.initialize")
if labels is None:
for example in get_examples():
for cat in example.y.cats:
self.add_label(cat)
else:
for label in labels:
self.add_label(label)
subbatch = list(islice(get_examples(), 10))
doc_sample = [eg.reference for eg in subbatch]
label_sample, _ = self._examples_to_truth(subbatch)
self._require_labels()
assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
assert len(label_sample) > 0, Errors.E923.format(name=self.name)
self.model.initialize(X=doc_sample, Y=label_sample)
def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
"""Score a batch of examples.
examples (Iterable[Example]): The examples to score.
RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.
DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer#score
"""
validate_examples(examples, "MultiLabel_TextCategorizer.score")
return Scorer.score_cats(
examples,
"cats",
labels=self.labels,
multi_label=True,
threshold=self.cfg["threshold"],
**kwargs,
)
def _validate_categories(self, examples: List[Example]):
"""This component allows any type of single- or multi-label annotations.
This method overrides the stricter one from 'textcat'."""
pass

View File

@ -3,7 +3,7 @@ import numpy as np
from collections import defaultdict
from .training import Example
from .tokens import Token, Doc, Span, MorphAnalysis
from .tokens import Token, Doc, Span
from .errors import Errors
from .util import get_lang_class, SimpleFrozenList
from .morphology import Morphology
@ -176,7 +176,7 @@ class Scorer:
"token_acc": None,
"token_p": None,
"token_r": None,
"token_f": None
"token_f": None,
}
@staticmethod
@ -276,7 +276,10 @@ class Scorer:
if gold_i not in missing_indices:
value = getter(token, attr)
morph = gold_doc.vocab.strings[value]
if value not in missing_values and morph != Morphology.EMPTY_MORPH:
if (
value not in missing_values
and morph != Morphology.EMPTY_MORPH
):
for feat in morph.split(Morphology.FEATURE_SEP):
field, values = feat.split(Morphology.FIELD_SEP)
if field not in per_feat:
@ -367,7 +370,6 @@ class Scorer:
f"{attr}_per_type": None,
}
@staticmethod
def score_cats(
examples: Iterable[Example],
@ -458,7 +460,7 @@ class Scorer:
gold_label, gold_score = max(gold_cats, key=lambda it: it[1])
if gold_score is not None and gold_score > 0:
f_per_type[gold_label].fn += 1
else:
elif pred_cats:
pred_label, pred_score = max(pred_cats, key=lambda it: it[1])
if pred_score >= threshold:
f_per_type[pred_label].fp += 1
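
The `elif pred_cats:` guard matters because `max()` raises `ValueError` on an empty sequence. The snippet below is a standalone sketch of that guard; `top_prediction` is a hypothetical helper and not part of `Scorer`.

```python
def top_prediction(pred_cats, threshold=0.5):
    """Return the best-scoring label if it reaches the threshold, else None.

    pred_cats is assumed to be a list of (label, score) pairs and may be empty.
    """
    if not pred_cats:
        # Without this check, max() on an empty list raises ValueError.
        return None
    label, score = max(pred_cats, key=lambda it: it[1])
    return label if score >= threshold else None


best = top_prediction([("SPORTS", 0.9), ("POLITICS", 0.2)])
none_for_empty = top_prediction([])
```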
@ -473,7 +475,10 @@ class Scorer:
macro_f = sum(prf.fscore for prf in f_per_type.values()) / n_cats
# Limit macro_auc to those labels with gold annotations,
# but still divide by all cats to avoid artificial boosting of datasets with missing labels
macro_auc = sum(auc.score if auc.is_binary() else 0.0 for auc in auc_per_type.values()) / n_cats
macro_auc = (
sum(auc.score if auc.is_binary() else 0.0 for auc in auc_per_type.values())
/ n_cats
)
results = {
f"{attr}_score": None,
f"{attr}_score_desc": None,
@ -485,7 +490,9 @@ class Scorer:
f"{attr}_macro_f": macro_f,
f"{attr}_macro_auc": macro_auc,
f"{attr}_f_per_type": {k: v.to_dict() for k, v in f_per_type.items()},
f"{attr}_auc_per_type": {k: v.score if v.is_binary() else None for k, v in auc_per_type.items()},
f"{attr}_auc_per_type": {
k: v.score if v.is_binary() else None for k, v in auc_per_type.items()
},
}
if len(labels) == 2 and not multi_label and positive_label:
positive_label_f = results[f"{attr}_f_per_type"][positive_label]["f"]
@ -675,8 +682,7 @@ class Scorer:
def get_ner_prf(examples: Iterable[Example]) -> Dict[str, Any]:
"""Compute micro-PRF and per-entity PRF scores for a sequence of examples.
"""
"""Compute micro-PRF and per-entity PRF scores for a sequence of examples."""
score_per_type = defaultdict(PRFScore)
for eg in examples:
if not eg.y.has_annotation("ENT_IOB"):

View File

@ -154,10 +154,10 @@ def test_doc_api_serialize(en_tokenizer, text):
logger = logging.getLogger("spacy")
with mock.patch.object(logger, "warning") as mock_warning:
_ = tokens.to_bytes()
_ = tokens.to_bytes() # noqa: F841
mock_warning.assert_not_called()
tokens.user_hooks["similarity"] = inner_func
_ = tokens.to_bytes()
_ = tokens.to_bytes() # noqa: F841
mock_warning.assert_called_once()

View File

@ -21,11 +21,13 @@ def test_doc_retokenize_merge(en_tokenizer):
assert doc[4].text == "the beach boys"
assert doc[4].text_with_ws == "the beach boys "
assert doc[4].tag_ == "NAMED"
assert doc[4].lemma_ == "LEMMA"
assert str(doc[4].morph) == "Number=Plur"
assert doc[5].text == "all night"
assert doc[5].text_with_ws == "all night"
assert doc[5].tag_ == "NAMED"
assert str(doc[5].morph) == "Number=Plur"
assert doc[5].lemma_ == "LEMMA"
def test_doc_retokenize_merge_children(en_tokenizer):
@ -103,25 +105,29 @@ def test_doc_retokenize_spans_merge_tokens(en_tokenizer):
def test_doc_retokenize_spans_merge_tokens_default_attrs(en_vocab):
words = ["The", "players", "start", "."]
lemmas = [t.lower() for t in words]
heads = [1, 2, 2, 2]
tags = ["DT", "NN", "VBZ", "."]
pos = ["DET", "NOUN", "VERB", "PUNCT"]
doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads)
doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads, lemmas=lemmas)
assert len(doc) == 4
assert doc[0].text == "The"
assert doc[0].tag_ == "DT"
assert doc[0].pos_ == "DET"
assert doc[0].lemma_ == "the"
with doc.retokenize() as retokenizer:
retokenizer.merge(doc[0:2])
assert len(doc) == 3
assert doc[0].text == "The players"
assert doc[0].tag_ == "NN"
assert doc[0].pos_ == "NOUN"
doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads)
assert doc[0].lemma_ == "the players"
doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads, lemmas=lemmas)
assert len(doc) == 4
assert doc[0].text == "The"
assert doc[0].tag_ == "DT"
assert doc[0].pos_ == "DET"
assert doc[0].lemma_ == "the"
with doc.retokenize() as retokenizer:
retokenizer.merge(doc[0:2])
retokenizer.merge(doc[2:4])
@ -129,9 +135,11 @@ def test_doc_retokenize_spans_merge_tokens_default_attrs(en_vocab):
assert doc[0].text == "The players"
assert doc[0].tag_ == "NN"
assert doc[0].pos_ == "NOUN"
assert doc[0].lemma_ == "the players"
assert doc[1].text == "start ."
assert doc[1].tag_ == "VBZ"
assert doc[1].pos_ == "VERB"
assert doc[1].lemma_ == "start ."
def test_doc_retokenize_spans_merge_heads(en_vocab):

View File

@ -39,6 +39,36 @@ def test_doc_retokenize_split(en_vocab):
assert len(str(doc)) == 19
def test_doc_retokenize_split_lemmas(en_vocab):
# If lemmas are not set, leave unset
words = ["LosAngeles", "start", "."]
heads = [1, 2, 2]
doc = Doc(en_vocab, words=words, heads=heads)
with doc.retokenize() as retokenizer:
retokenizer.split(
doc[0],
["Los", "Angeles"],
[(doc[0], 1), doc[1]],
)
assert doc[0].lemma_ == ""
assert doc[1].lemma_ == ""
# If lemmas are set, use split orth as default lemma
words = ["LosAngeles", "start", "."]
heads = [1, 2, 2]
doc = Doc(en_vocab, words=words, heads=heads)
for t in doc:
t.lemma_ = "a"
with doc.retokenize() as retokenizer:
retokenizer.split(
doc[0],
["Los", "Angeles"],
[(doc[0], 1), doc[1]],
)
assert doc[0].lemma_ == "Los"
assert doc[1].lemma_ == "Angeles"
def test_doc_retokenize_split_dependencies(en_vocab):
doc = Doc(en_vocab, words=["LosAngeles", "start", "."])
dep1 = doc.vocab.strings.add("amod")

View File

@ -113,9 +113,8 @@ def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms):
assert [token.norm_ for token in tokens] == norms
@pytest.mark.skip
@pytest.mark.parametrize(
"text,norm", [("radicalised", "radicalized"), ("cuz", "because")]
"text,norm", [("Jan.", "January"), ("'cuz", "because")]
)
def test_en_lex_attrs_norm_exceptions(en_tokenizer, text, norm):
tokens = en_tokenizer(text)

View File

@ -4,21 +4,21 @@ from spacy.lang.mk.lex_attrs import like_num
def test_tokenizer_handles_long_text(mk_tokenizer):
text = """
Во организациските работи или на нашите собранија со членството, никој од нас не зборуваше за
организацијата и идеологијата. Работна беше нашата работа, а не идеолошка. Што се однесува до социјализмот на
Делчев, неговата дејност зборува сама за себе - спротивно. Во суштина, водачите си имаа свои основни погледи и
свои разбирања за положбата и работите, коишто стоеја пред нив и ги завршуваа со голема упорност, настојчивост и
насоченост. Значи, идеологија имаше, само што нивната идеологија имаше своја оригиналност. Македонија денеска,
чиста рожба на животот и положбата во Македонија, кои му служеа како база на неговите побуди, беше дејност која
имаше потреба од ум за да си најде своја смисла. Таквата идеологија и заемното дејство на умот и срцето му
помогнаа на Делчев да не се занесе по патот на својата идеологија... Во суштина, Организацијата и нејзините
водачи имаа свои разбирања за работите и положбата во идеен поглед, но тоа беше врската, животот и положбата во
Македонија и го внесуваа во својата идеологија гласот на своето срце, и на крај, прибегнуваа до умот,
за да најдат смисла или да ѝ дадат. Тоа содејство и заемен сооднос на умот и срцето му помогнаа на Делчев да ја
држи својата идеологија во сообразност со положбата на работите... Водачите навистина направија една жртва
бидејќи на населението не му зборуваа за своите мисли и идеи. Тие се одрекоа од секаква субјективност во своите
мисли. Целта беше да не се зголемуваат целите и задачите како и преданоста во работата. Населението не можеше да
ги разбере овие идеи...
Во организациските работи или на нашите собранија со членството, никој од нас не зборуваше за
организацијата и идеологијата. Работна беше нашата работа, а не идеолошка. Што се однесува до социјализмот на
Делчев, неговата дејност зборува сама за себе - спротивно. Во суштина, водачите си имаа свои основни погледи и
свои разбирања за положбата и работите, коишто стоеја пред нив и ги завршуваа со голема упорност, настојчивост и
насоченост. Значи, идеологија имаше, само што нивната идеологија имаше своја оригиналност. Македонија денеска,
чиста рожба на животот и положбата во Македонија, кои му служеа како база на неговите побуди, беше дејност која
имаше потреба од ум за да си најде своја смисла. Таквата идеологија и заемното дејство на умот и срцето му
помогнаа на Делчев да не се занесе по патот на својата идеологија... Во суштина, Организацијата и нејзините
водачи имаа свои разбирања за работите и положбата во идеен поглед, но тоа беше врската, животот и положбата во
Македонија и го внесуваа во својата идеологија гласот на своето срце, и на крај, прибегнуваа до умот,
за да најдат смисла или да ѝ дадат. Тоа содејство и заемен сооднос на умот и срцето му помогнаа на Делчев да ја
држи својата идеологија во сообразност со положбата на работите... Водачите навистина направија една жртва
бидејќи на населението не му зборуваа за своите мисли и идеи. Тие се одрекоа од секаква субјективност во своите
мисли. Целта беше да не се зголемуваат целите и задачите како и преданоста во работата. Населението не можеше да
ги разбере овие идеи...
"""
tokens = mk_tokenizer(text)
assert len(tokens) == 297
@ -45,7 +45,7 @@ def test_tokenizer_handles_long_text(mk_tokenizer):
(",", False),
("милијарда", True),
("билион", True),
]
],
)
def test_mk_lex_attrs_like_number(mk_tokenizer, word, match):
tokens = mk_tokenizer(word)
@ -53,14 +53,7 @@ def test_mk_lex_attrs_like_number(mk_tokenizer, word, match):
assert tokens[0].like_num == match
@pytest.mark.parametrize(
"word",
[
"двесте",
"два-три",
"пет-шест"
]
)
@pytest.mark.parametrize("word", ["двесте", "два-три", "пет-шест"])
def test_mk_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())
@ -77,8 +70,8 @@ def test_mk_lex_attrs_capitals(word):
"петто",
"стоти",
"шеесетите",
"седумдесетите"
]
"седумдесетите",
],
)
def test_mk_lex_attrs_like_number_for_ordinal(word):
assert like_num(word)

View File

@ -5,24 +5,22 @@ from spacy.lang.tr.lex_attrs import like_num
def test_tr_tokenizer_handles_long_text(tr_tokenizer):
text = """Pamuk nasıl ipliğe dönüştürülür?
Sıkıştırılmış balyalar halindeki pamuk, iplik fabrikasına getirildiğinde hem
lifleri birbirine dolaşmıştır, hem de tarladan toplanırken araya bitkinin
parçaları karışmıştır. Üstelik balyalardaki pamuğun cinsi aynı olsa bile kalitesi
Sıkıştırılmış balyalar halindeki pamuk, iplik fabrikasına getirildiğinde hem
lifleri birbirine dolaşmıştır, hem de tarladan toplanırken araya bitkinin
parçaları karışmıştır. Üstelik balyalardaki pamuğun cinsi aynı olsa bile kalitesi
değişeceğinden, önce bütün balyaların birbirine karıştırılarak harmanlanması gerekir.
Daha sonra pamuk yığınları, liflerin ılıp temizlenmesi için tek bir birim halinde
Daha sonra pamuk yığınları, liflerin ılıp temizlenmesi için tek bir birim halinde
birleştirilmiş çeşitli makinelerden geçirilir.Bunlardan biri, dönen tokmaklarıyla
pamuğu dövüp kabartarak dağınık yumaklar haline getiren ve liflerin arasındaki yabancı
maddeleri temizleyen hallaç makinesidir. Daha sonra tarak makinesine giren pamuk demetleri,
herbirinin yüzeyinde yüzbinlerce incecik iğne bulunan döner silindirlerin arasından geçerek lif lif ayrılır
ve tül inceliğinde gevşek bir örtüye dönüşür. Ama bir sonraki makine bu lifleri dağınık
ve tül inceliğinde gevşek bir örtüye dönüşür. Ama bir sonraki makine bu lifleri dağınık
ve gevşek bir biçimde birbirine yaklaştırarak 2 cm eninde bir pamuk şeridi haline getirir."""
tokens = tr_tokenizer(text)
assert len(tokens) == 146
@pytest.mark.parametrize(
"word",
[

View File

@ -2,145 +2,692 @@ import pytest
ABBREV_TESTS = [
("Dr. Murat Bey ile görüştüm.", ["Dr.", "Murat", "Bey", "ile", "görüştüm", "."]),
("Dr.la görüştüm.", ["Dr.la", "görüştüm", "."]),
("Dr.'la görüştüm.", ["Dr.'la", "görüştüm", "."]),
("TBMM'de çalışıyormuş.", ["TBMM'de", "çalışıyormuş", "."]),
("Hem İst. hem Ank. bu konuda gayet iyi durumda.", ["Hem", "İst.", "hem", "Ank.", "bu", "konuda", "gayet", "iyi", "durumda", "."]),
("Hem İst. hem Ank.'da yağış var.", ["Hem", "İst.", "hem", "Ank.'da", "yağış", "var", "."]),
("Dr.", ["Dr."]),
("Yrd.Doç.", ["Yrd.Doç."]),
("Prof.'un", ["Prof.'un"]),
("Böl.'nde", ["Böl.'nde"]),
("Dr. Murat Bey ile görüştüm.", ["Dr.", "Murat", "Bey", "ile", "görüştüm", "."]),
("Dr.la görüştüm.", ["Dr.la", "görüştüm", "."]),
("Dr.'la görüştüm.", ["Dr.'la", "görüştüm", "."]),
("TBMM'de çalışıyormuş.", ["TBMM'de", "çalışıyormuş", "."]),
(
"Hem İst. hem Ank. bu konuda gayet iyi durumda.",
["Hem", "İst.", "hem", "Ank.", "bu", "konuda", "gayet", "iyi", "durumda", "."],
),
(
"Hem İst. hem Ank.'da yağış var.",
["Hem", "İst.", "hem", "Ank.'da", "yağış", "var", "."],
),
("Dr.", ["Dr."]),
("Yrd.Doç.", ["Yrd.Doç."]),
("Prof.'un", ["Prof.'un"]),
("Böl.'nde", ["Böl.'nde"]),
]
URL_TESTS = [
("Bizler de www.duygu.com.tr adında bir websitesi kurduk.", ["Bizler", "de", "www.duygu.com.tr", "adında", "bir", "websitesi", "kurduk", "."]),
("Bizler de https://www.duygu.com.tr adında bir websitesi kurduk.", ["Bizler", "de", "https://www.duygu.com.tr", "adında", "bir", "websitesi", "kurduk", "."]),
("Bizler de www.duygu.com.tr'dan satın aldık.", ["Bizler", "de", "www.duygu.com.tr'dan", "satın", "aldık", "."]),
("Bizler de https://www.duygu.com.tr'dan satın aldık.", ["Bizler", "de", "https://www.duygu.com.tr'dan", "satın", "aldık", "."]),
(
"Bizler de www.duygu.com.tr adında bir websitesi kurduk.",
[
"Bizler",
"de",
"www.duygu.com.tr",
"adında",
"bir",
"websitesi",
"kurduk",
".",
],
),
(
"Bizler de https://www.duygu.com.tr adında bir websitesi kurduk.",
[
"Bizler",
"de",
"https://www.duygu.com.tr",
"adında",
"bir",
"websitesi",
"kurduk",
".",
],
),
(
"Bizler de www.duygu.com.tr'dan satın aldık.",
["Bizler", "de", "www.duygu.com.tr'dan", "satın", "aldık", "."],
),
(
"Bizler de https://www.duygu.com.tr'dan satın aldık.",
["Bizler", "de", "https://www.duygu.com.tr'dan", "satın", "aldık", "."],
),
]
NUMBER_TESTS = [
("Rakamla 6 yazılıydı.", ["Rakamla", "6", "yazılıydı", "."]),
("Hava -4 dereceydi.", ["Hava", "-4", "dereceydi", "."]),
("Hava sıcaklığı -4ten +6ya yükseldi.", ["Hava", "sıcaklığı", "-4ten", "+6ya", "yükseldi", "."]),
("Hava sıcaklığı -4'ten +6'ya yükseldi.", ["Hava", "sıcaklığı", "-4'ten", "+6'ya", "yükseldi", "."]),
("Yarışta 6. oldum.", ["Yarışta", "6.", "oldum", "."]),
("Yarışta 438547745. oldum.", ["Yarışta", "438547745.", "oldum", "."]),
("Kitap IV. Murat hakkında.",["Kitap", "IV.", "Murat", "hakkında", "."]),
#("Bana söylediği sayı 6.", ["Bana", "söylediği", "sayı", "6", "."]),
("Saat 6'da buluşalım.", ["Saat", "6'da", "buluşalım", "."]),
("Saat 6dan sonra buluşalım.", ["Saat", "6dan", "sonra", "buluşalım", "."]),
("6.dan sonra saymadım.", ["6.dan", "sonra", "saymadım", "."]),
("6.'dan sonra saymadım.", ["6.'dan", "sonra", "saymadım", "."]),
("Saat 6'ydı.", ["Saat", "6'ydı", "."]),
("5'te", ["5'te"]),
("6'da", ["6'da"]),
("9dan", ["9dan"]),
("19'da", ["19'da"]),
("VI'da", ["VI'da"]),
("5.", ["5."]),
("72.", ["72."]),
("VI.", ["VI."]),
("6.'dan", ["6.'dan"]),
("19.'dan", ["19.'dan"]),
("6.dan", ["6.dan"]),
("16.dan", ["16.dan"]),
("VI.'dan", ["VI.'dan"]),
("VI.dan", ["VI.dan"]),
("Hepsi 1994 yılında oldu.", ["Hepsi", "1994", "yılında", "oldu", "."]),
("Hepsi 1994'te oldu.", ["Hepsi", "1994'te", "oldu", "."]),
("2/3 tarihli faturayı bulamadım.", ["2/3", "tarihli", "faturayı", "bulamadım", "."]),
("2.3 tarihli faturayı bulamadım.", ["2.3", "tarihli", "faturayı", "bulamadım", "."]),
("2.3. tarihli faturayı bulamadım.", ["2.3.", "tarihli", "faturayı", "bulamadım", "."]),
("2/3/2020 tarihli faturayı bulamadm.", ["2/3/2020", "tarihli", "faturayı", "bulamadm", "."]),
("2/3/1987 tarihinden beri burda yaşıyorum.", ["2/3/1987", "tarihinden", "beri", "burda", "yaşıyorum", "."]),
("2-3-1987 tarihinden beri burdayım.", ["2-3-1987", "tarihinden", "beri", "burdayım", "."]),
("2.3.1987 tarihinden beri burdayım.", ["2.3.1987", "tarihinden", "beri", "burdayım", "."]),
("Bu olay 2005-2006 tarihleri arasında oldu.", ["Bu", "olay", "2005", "-", "2006", "tarihleri", "arasında", "oldu", "."]),
("Bu olay 4/12/2005-21/3/2006 tarihleri arasında oldu.", ["Bu", "olay", "4/12/2005", "-", "21/3/2006", "tarihleri", "arasında", "oldu", ".",]),
("Ek fıkra: 5/11/2003-4999/3 maddesine göre uygundur.", ["Ek", "fıkra", ":", "5/11/2003", "-", "4999/3", "maddesine", "göre", "uygundur", "."]),
("2/A alanları: 6831 sayılı Kanunun 2nci maddesinin birinci fıkrasının (A) bendine göre", ["2/A", "alanları", ":", "6831", "sayılı", "Kanunun", "2nci", "maddesinin", "birinci", "fıkrasının", "(", "A", ")", "bendine", "göre"]),
("ŞEHİTTEĞMENKALMAZ Cad. No: 2/311", ["ŞEHİTTEĞMENKALMAZ", "Cad.", "No", ":", "2/311"]),
("2-3-2025", ["2-3-2025",]),
("2/3/2025", ["2/3/2025"]),
("Yıllardır 0.5 uç kullanıyorum.", ["Yıllardır", "0.5", "uç", "kullanıyorum", "."]),
("Kan değerlerim 0.5-0.7 arasıydı.", ["Kan", "değerlerim", "0.5", "-", "0.7", "arasıydı", "."]),
("0.5", ["0.5"]),
("1/2", ["1/2"]),
("%1", ["%", "1"]),
("%1lik", ["%", "1lik"]),
("%1'lik", ["%", "1'lik"]),
("%1lik dilim", ["%", "1lik", "dilim"]),
("%1'lik dilim", ["%", "1'lik", "dilim"]),
("%1.5", ["%", "1.5"]),
#("%1-%2 arası büyüme bekleniyor.", ["%", "1", "-", "%", "2", "arası", "büyüme", "bekleniyor", "."]),
("%1-2 arası büyüme bekliyoruz.", ["%", "1", "-", "2", "arası", "büyüme", "bekliyoruz", "."]),
("%11-12 arası büyüme bekliyoruz.", ["%", "11", "-", "12", "arası", "büyüme", "bekliyoruz", "."]),
("%1.5luk büyüme bekliyoruz.", ["%", "1.5luk", "büyüme", "bekliyoruz", "."]),
("Saat 1-2 arası gelin lütfen.", ["Saat", "1", "-", "2", "arası", "gelin", "lütfen", "."]),
("Saat 15:30 gibi buluşalım.", ["Saat", "15:30", "gibi", "buluşalım", "."]),
("Saat 15:30'da buluşalım.", ["Saat", "15:30'da", "buluşalım", "."]),
("Saat 15.30'da buluşalım.", ["Saat", "15.30'da", "buluşalım", "."]),
("Saat 15.30da buluşalım.", ["Saat", "15.30da", "buluşalım", "."]),
("Saat 15 civarı buluşalım.", ["Saat", "15", "civarı", "buluşalım", "."]),
("9daki otobüse binsek mi?", ["9daki", "otobüse", "binsek", "mi", "?"]),
("Okulumuz 3-B şubesi", ["Okulumuz", "3-B", "şubesi"]),
("Okulumuz 3/B şubesi", ["Okulumuz", "3/B", "şubesi"]),
("Okulumuz 3B şubesi", ["Okulumuz", "3B", "şubesi"]),
("Okulumuz 3b şubesi", ["Okulumuz", "3b", "şubesi"]),
("Antonio Gaudí 20. yüzyılda, 1904-1914 yılları arasında on yıl süren bir reform süreci getirmiştir.", ["Antonio", "Gaudí", "20.", "yüzyılda", ",", "1904", "-", "1914", "yılları", "arasında", "on", "yıl", "süren", "bir", "reform", "süreci", "getirmiştir", "."]),
("Dizel yakıtın avro bölgesi ortalaması olan 1,165 avroya kıyasla litre başına 1,335 avroya mal olduğunu gösteriyor.", ["Dizel", "yakıtın", "avro", "bölgesi", "ortalaması", "olan", "1,165", "avroya", "kıyasla", "litre", "başına", "1,335", "avroya", "mal", "olduğunu", "gösteriyor", "."]),
("Marcus Antonius M.Ö. 1 Ocak 49'da, Sezar'dan Vali'nin kendisini barış dostu ilan ettiği bir bildiri yayınlamıştır.", ["Marcus", "Antonius", "M.Ö.", "1", "Ocak", "49'da", ",", "Sezar'dan", "Vali'nin", "kendisini", "barış", "dostu", "ilan", "ettiği", "bir", "bildiri", "yayınlamıştır", "."])
("Rakamla 6 yazılıydı.", ["Rakamla", "6", "yazılıydı", "."]),
("Hava -4 dereceydi.", ["Hava", "-4", "dereceydi", "."]),
(
"Hava sıcaklığı -4ten +6ya yükseldi.",
["Hava", "sıcaklığı", "-4ten", "+6ya", "yükseldi", "."],
),
(
"Hava sıcaklığı -4'ten +6'ya yükseldi.",
["Hava", "sıcaklığı", "-4'ten", "+6'ya", "yükseldi", "."],
),
("Yarışta 6. oldum.", ["Yarışta", "6.", "oldum", "."]),
("Yarışta 438547745. oldum.", ["Yarışta", "438547745.", "oldum", "."]),
("Kitap IV. Murat hakkında.", ["Kitap", "IV.", "Murat", "hakkında", "."]),
# ("Bana söylediği sayı 6.", ["Bana", "söylediği", "sayı", "6", "."]),
("Saat 6'da buluşalım.", ["Saat", "6'da", "buluşalım", "."]),
("Saat 6dan sonra buluşalım.", ["Saat", "6dan", "sonra", "buluşalım", "."]),
("6.dan sonra saymadım.", ["6.dan", "sonra", "saymadım", "."]),
("6.'dan sonra saymadım.", ["6.'dan", "sonra", "saymadım", "."]),
("Saat 6'ydı.", ["Saat", "6'ydı", "."]),
("5'te", ["5'te"]),
("6'da", ["6'da"]),
("9dan", ["9dan"]),
("19'da", ["19'da"]),
("VI'da", ["VI'da"]),
("5.", ["5."]),
("72.", ["72."]),
("VI.", ["VI."]),
("6.'dan", ["6.'dan"]),
("19.'dan", ["19.'dan"]),
("6.dan", ["6.dan"]),
("16.dan", ["16.dan"]),
("VI.'dan", ["VI.'dan"]),
("VI.dan", ["VI.dan"]),
("Hepsi 1994 yılında oldu.", ["Hepsi", "1994", "yılında", "oldu", "."]),
("Hepsi 1994'te oldu.", ["Hepsi", "1994'te", "oldu", "."]),
(
"2/3 tarihli faturayı bulamadım.",
["2/3", "tarihli", "faturayı", "bulamadım", "."],
),
(
"2.3 tarihli faturayı bulamadım.",
["2.3", "tarihli", "faturayı", "bulamadım", "."],
),
(
"2.3. tarihli faturayı bulamadım.",
["2.3.", "tarihli", "faturayı", "bulamadım", "."],
),
(
"2/3/2020 tarihli faturayı bulamadm.",
["2/3/2020", "tarihli", "faturayı", "bulamadm", "."],
),
(
"2/3/1987 tarihinden beri burda yaşıyorum.",
["2/3/1987", "tarihinden", "beri", "burda", "yaşıyorum", "."],
),
(
"2-3-1987 tarihinden beri burdayım.",
["2-3-1987", "tarihinden", "beri", "burdayım", "."],
),
(
"2.3.1987 tarihinden beri burdayım.",
["2.3.1987", "tarihinden", "beri", "burdayım", "."],
),
(
"Bu olay 2005-2006 tarihleri arasında oldu.",
["Bu", "olay", "2005", "-", "2006", "tarihleri", "arasında", "oldu", "."],
),
(
"Bu olay 4/12/2005-21/3/2006 tarihleri arasında oldu.",
[
"Bu",
"olay",
"4/12/2005",
"-",
"21/3/2006",
"tarihleri",
"arasında",
"oldu",
".",
],
),
(
"Ek fıkra: 5/11/2003-4999/3 maddesine göre uygundur.",
[
"Ek",
"fıkra",
":",
"5/11/2003",
"-",
"4999/3",
"maddesine",
"göre",
"uygundur",
".",
],
),
(
"2/A alanları: 6831 sayılı Kanunun 2nci maddesinin birinci fıkrasının (A) bendine göre",
[
"2/A",
"alanları",
":",
"6831",
"sayılı",
"Kanunun",
"2nci",
"maddesinin",
"birinci",
"fıkrasının",
"(",
"A",
")",
"bendine",
"göre",
],
),
(
"ŞEHİTTEĞMENKALMAZ Cad. No: 2/311",
["ŞEHİTTEĞMENKALMAZ", "Cad.", "No", ":", "2/311"],
),
(
"2-3-2025",
[
"2-3-2025",
],
),
("2/3/2025", ["2/3/2025"]),
("Yıllardır 0.5 uç kullanıyorum.", ["Yıllardır", "0.5", "uç", "kullanıyorum", "."]),
(
"Kan değerlerim 0.5-0.7 arasıydı.",
["Kan", "değerlerim", "0.5", "-", "0.7", "arasıydı", "."],
),
("0.5", ["0.5"]),
("1/2", ["1/2"]),
("%1", ["%", "1"]),
("%1lik", ["%", "1lik"]),
("%1'lik", ["%", "1'lik"]),
("%1lik dilim", ["%", "1lik", "dilim"]),
("%1'lik dilim", ["%", "1'lik", "dilim"]),
("%1.5", ["%", "1.5"]),
# ("%1-%2 arası büyüme bekleniyor.", ["%", "1", "-", "%", "2", "arası", "büyüme", "bekleniyor", "."]),
(
"%1-2 arası büyüme bekliyoruz.",
["%", "1", "-", "2", "arası", "büyüme", "bekliyoruz", "."],
),
(
"%11-12 arası büyüme bekliyoruz.",
["%", "11", "-", "12", "arası", "büyüme", "bekliyoruz", "."],
),
("%1.5luk büyüme bekliyoruz.", ["%", "1.5luk", "büyüme", "bekliyoruz", "."]),
(
"Saat 1-2 arası gelin lütfen.",
["Saat", "1", "-", "2", "arası", "gelin", "lütfen", "."],
),
("Saat 15:30 gibi buluşalım.", ["Saat", "15:30", "gibi", "buluşalım", "."]),
("Saat 15:30'da buluşalım.", ["Saat", "15:30'da", "buluşalım", "."]),
("Saat 15.30'da buluşalım.", ["Saat", "15.30'da", "buluşalım", "."]),
("Saat 15.30da buluşalım.", ["Saat", "15.30da", "buluşalım", "."]),
("Saat 15 civarı buluşalım.", ["Saat", "15", "civarı", "buluşalım", "."]),
("9daki otobüse binsek mi?", ["9daki", "otobüse", "binsek", "mi", "?"]),
("Okulumuz 3-B şubesi", ["Okulumuz", "3-B", "şubesi"]),
("Okulumuz 3/B şubesi", ["Okulumuz", "3/B", "şubesi"]),
("Okulumuz 3B şubesi", ["Okulumuz", "3B", "şubesi"]),
("Okulumuz 3b şubesi", ["Okulumuz", "3b", "şubesi"]),
(
"Antonio Gaudí 20. yüzyılda, 1904-1914 yılları arasında on yıl süren bir reform süreci getirmiştir.",
[
"Antonio",
"Gaudí",
"20.",
"yüzyılda",
",",
"1904",
"-",
"1914",
"yılları",
"arasında",
"on",
"yıl",
"süren",
"bir",
"reform",
"süreci",
"getirmiştir",
".",
],
),
(
"Dizel yakıtın avro bölgesi ortalaması olan 1,165 avroya kıyasla litre başına 1,335 avroya mal olduğunu gösteriyor.",
[
"Dizel",
"yakıtın",
"avro",
"bölgesi",
"ortalaması",
"olan",
"1,165",
"avroya",
"kıyasla",
"litre",
"başına",
"1,335",
"avroya",
"mal",
"olduğunu",
"gösteriyor",
".",
],
),
(
"Marcus Antonius M.Ö. 1 Ocak 49'da, Sezar'dan Vali'nin kendisini barış dostu ilan ettiği bir bildiri yayınlamıştır.",
[
"Marcus",
"Antonius",
"M.Ö.",
"1",
"Ocak",
"49'da",
",",
"Sezar'dan",
"Vali'nin",
"kendisini",
"barış",
"dostu",
"ilan",
"ettiği",
"bir",
"bildiri",
"yayınlamıştır",
".",
],
),
]
PUNCT_TESTS = [
("Gitmedim dedim ya!", ["Gitmedim", "dedim", "ya", "!"]),
("Gitmedim dedim ya!!", ["Gitmedim", "dedim", "ya", "!", "!"]),
("Gitsek mi?", ["Gitsek", "mi", "?"]),
("Gitsek mi??", ["Gitsek", "mi", "?", "?"]),
("Gitsek mi?!?", ["Gitsek", "mi", "?", "!", "?"]),
("Ankara - Antalya arası otobüs işliyor.", ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."]),
("Ankara-Antalya arası otobüs işliyor.", ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."]),
("Sen--ben, ya da onlar.", ["Sen", "--", "ben", ",", "ya", "da", "onlar", "."]),
("Senden, benden, bizden şarkısını biliyor musun?", ["Senden", ",", "benden", ",", "bizden", "şarkısını", "biliyor", "musun", "?"]),
("Akif'le geldik, sonra da o ayrıldı.", ["Akif'le", "geldik", ",", "sonra", "da", "o", "ayrıldı", "."]),
("Bu adam ne dedi şimdi???", ["Bu", "adam", "ne", "dedi", "şimdi", "?", "?", "?"]),
("Yok hasta olmuş, yok annesi hastaymış, bahaneler işte...", ["Yok", "hasta", "olmuş", ",", "yok", "annesi", "hastaymış", ",", "bahaneler", "işte", "..."]),
("Ankara'dan İstanbul'a ... bir aşk hikayesi.", ["Ankara'dan", "İstanbul'a", "...", "bir", "aşk", "hikayesi", "."]),
("Ahmet'te", ["Ahmet'te"]),
("İstanbul'da", ["İstanbul'da"]),
("Gitmedim dedim ya!", ["Gitmedim", "dedim", "ya", "!"]),
("Gitmedim dedim ya!!", ["Gitmedim", "dedim", "ya", "!", "!"]),
("Gitsek mi?", ["Gitsek", "mi", "?"]),
("Gitsek mi??", ["Gitsek", "mi", "?", "?"]),
("Gitsek mi?!?", ["Gitsek", "mi", "?", "!", "?"]),
(
"Ankara - Antalya arası otobüs işliyor.",
["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."],
),
(
"Ankara-Antalya arası otobüs işliyor.",
["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."],
),
("Sen--ben, ya da onlar.", ["Sen", "--", "ben", ",", "ya", "da", "onlar", "."]),
(
"Senden, benden, bizden şarkısını biliyor musun?",
["Senden", ",", "benden", ",", "bizden", "şarkısını", "biliyor", "musun", "?"],
),
(
"Akif'le geldik, sonra da o ayrıldı.",
["Akif'le", "geldik", ",", "sonra", "da", "o", "ayrıldı", "."],
),
("Bu adam ne dedi şimdi???", ["Bu", "adam", "ne", "dedi", "şimdi", "?", "?", "?"]),
(
"Yok hasta olmuş, yok annesi hastaymış, bahaneler işte...",
[
"Yok",
"hasta",
"olmuş",
",",
"yok",
"annesi",
"hastaymış",
",",
"bahaneler",
"işte",
"...",
],
),
(
"Ankara'dan İstanbul'a ... bir aşk hikayesi.",
["Ankara'dan", "İstanbul'a", "...", "bir", "aşk", "hikayesi", "."],
),
("Ahmet'te", ["Ahmet'te"]),
("İstanbul'da", ["İstanbul'da"]),
]
GENERAL_TESTS = [
("1914'teki Endurance seferinde, Sir Ernest Shackleton'ın kaptanlığını yaptığı İngiliz Endurance gemisi yirmi sekiz kişi ile Antarktika'yı geçmek üzere yelken açtı.", ["1914'teki", "Endurance", "seferinde", ",", "Sir", "Ernest", "Shackleton'ın", "kaptanlığını", "yaptığı", "İngiliz", "Endurance", "gemisi", "yirmi", "sekiz", "kişi", "ile", "Antarktika'yı", "geçmek", "üzere", "yelken", "açtı", "."]),
("Danışılan \"%100 Cospedal\" olduğunu belirtti.", ["Danışılan", '"', "%", "100", "Cospedal", '"', "olduğunu", "belirtti", "."]),
("1976'da parkur artık kullanılmıyordu; 1990'da ise bir yangın, daha sonraları ahırlarla birlikte yıkılacak olan tahta tribünlerden geri kalanları da yok etmişti.", ["1976'da", "parkur", "artık", "kullanılmıyordu", ";", "1990'da", "ise", "bir", "yangın", ",", "daha", "sonraları", "ahırlarla", "birlikte", "yıkılacak", "olan", "tahta", "tribünlerden", "geri", "kalanları", "da", "yok", "etmişti", "."]),
("Dahiyane bir ameliyat ve zorlu bir rehabilitasyon sürecinden sonra, tamamen iyileştim.", ["Dahiyane", "bir", "ameliyat", "ve", "zorlu", "bir", "rehabilitasyon", "sürecinden", "sonra", ",", "tamamen", "iyileştim", "."]),
("Yaklaşık iki hafta süren bireysel erken oy kullanma döneminin ardından 5,7 milyondan fazla Floridalı sandık başına gitti.", ["Yaklaşık", "iki", "hafta", "süren", "bireysel", "erken", "oy", "kullanma", "döneminin", "ardından", "5,7", "milyondan", "fazla", "Floridalı", "sandık", "başına", "gitti", "."]),
("Ancak, bu ABD Çevre Koruma Ajansı'nın dünyayı bu konularda uyarmasının ardından ortaya çıktı.", ["Ancak", ",", "bu", "ABD", "Çevre", "Koruma", "Ajansı'nın", "dünyayı", "bu", "konularda", "uyarmasının", "ardından", "ortaya", "çıktı", "."]),
("Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.", ["Ortalama", "şansa", "ve", "10.000", "Sterlin", "değerinde", "tahvillere", "sahip", "bir", "yatırımcı", "yılda", "125", "Sterlin", "ikramiye", "kazanabilir", "."]),
("Granit adaları; Seyşeller ve Tioman ile Saint Helena gibi volkanik adaları kapsar." , ["Granit", "adaları", ";", "Seyşeller", "ve", "Tioman", "ile", "Saint", "Helena", "gibi", "volkanik", "adaları", "kapsar", "."]),
("Barış antlaşmasıyla İspanya, Amerika'ya Porto Riko, Guam ve Filipinler kolonilerini devretti.", ["Barış", "antlaşmasıyla", "İspanya", ",", "Amerika'ya", "Porto", "Riko", ",", "Guam", "ve", "Filipinler", "kolonilerini", "devretti", "."]),
("Makedonya\'nın sınır bölgelerini güvence altına alan Philip, büyük bir Makedon ordusu kurdu ve uzun bir fetih seferi için Trakya\'ya doğru yürüdü.", ["Makedonya\'nın", "sınır", "bölgelerini", "güvence", "altına", "alan", "Philip", ",", "büyük", "bir", "Makedon", "ordusu", "kurdu", "ve", "uzun", "bir", "fetih", "seferi", "için", "Trakya\'ya", "doğru", "yürüdü", "."]),
("Fransız gazetesi Le Figaro'ya göre bu hükumet planı sayesinde 42 milyon Euro kazanç sağlanabilir ve elde edilen paranın 15.5 milyonu ulusal güvenlik için kullanılabilir.", ["Fransız", "gazetesi", "Le", "Figaro'ya", "göre", "bu", "hükumet", "planı", "sayesinde", "42", "milyon", "Euro", "kazanç", "sağlanabilir", "ve", "elde", "edilen", "paranın", "15.5", "milyonu", "ulusal", "güvenlik", "için", "kullanılabilir", "."]),
("Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.", ["Ortalama", "şansa", "ve", "10.000", "Sterlin", "değerinde", "tahvillere", "sahip", "bir", "yatırımcı", "yılda", "125", "Sterlin", "ikramiye", "kazanabilir", "."]),
("3 Kasım Salı günü, Ankara Belediye Başkanı 2014'te hükümetle birlikte oluşturulan kentsel gelişim anlaşmasını askıya alma kararı verdi.", ["3", "Kasım", "Salı", "günü", ",", "Ankara", "Belediye", "Başkanı", "2014'te", "hükümetle", "birlikte", "oluşturulan", "kentsel", "gelişim", "anlaşmasını", "askıya", "alma", "kararı", "verdi", "."]),
("Stalin, Abakumov'u Beria'nın enerji bakanlıkları üzerindeki baskınlığına karşı MGB içinde kendi ağını kurmaya teşvik etmeye başlamıştı.", ["Stalin", ",", "Abakumov'u", "Beria'nın", "enerji", "bakanlıkları", "üzerindeki", "baskınlığına", "karşı", "MGB", "içinde", "kendi", "ağını", "kurmaya", "teşvik", "etmeye", "başlamıştı", "."]),
("Güney Avrupa'daki kazı alanlarının çoğunluğu gibi, bu bulgu M.Ö. 5. yüzyılın başlar", ["Güney", "Avrupa'daki", "kazı", "alanlarının", "çoğunluğu", "gibi", ",", "bu", "bulgu", "M.Ö.", "5.", "yüzyılın", "başlar"]),
("Sağlığın bozulması Hitchcock hayatının son yirmi yılında üretimini azalttı.", ["Sağlığın", "bozulması", "Hitchcock", "hayatının", "son", "yirmi", "yılında", "üretimini", "azalttı", "."]),
(
"1914'teki Endurance seferinde, Sir Ernest Shackleton'ın kaptanlığını yaptığı İngiliz Endurance gemisi yirmi sekiz kişi ile Antarktika'yı geçmek üzere yelken açtı.",
[
"1914'teki",
"Endurance",
"seferinde",
",",
"Sir",
"Ernest",
"Shackleton'ın",
"kaptanlığını",
"yaptığı",
"İngiliz",
"Endurance",
"gemisi",
"yirmi",
"sekiz",
"kişi",
"ile",
"Antarktika'yı",
"geçmek",
"üzere",
"yelken",
"açtı",
".",
],
),
(
'Danışılan "%100 Cospedal" olduğunu belirtti.',
["Danışılan", '"', "%", "100", "Cospedal", '"', "olduğunu", "belirtti", "."],
),
(
"1976'da parkur artık kullanılmıyordu; 1990'da ise bir yangın, daha sonraları ahırlarla birlikte yıkılacak olan tahta tribünlerden geri kalanları da yok etmişti.",
[
"1976'da",
"parkur",
"artık",
"kullanılmıyordu",
";",
"1990'da",
"ise",
"bir",
"yangın",
",",
"daha",
"sonraları",
"ahırlarla",
"birlikte",
"yıkılacak",
"olan",
"tahta",
"tribünlerden",
"geri",
"kalanları",
"da",
"yok",
"etmişti",
".",
],
),
(
"Dahiyane bir ameliyat ve zorlu bir rehabilitasyon sürecinden sonra, tamamen iyileştim.",
[
"Dahiyane",
"bir",
"ameliyat",
"ve",
"zorlu",
"bir",
"rehabilitasyon",
"sürecinden",
"sonra",
",",
"tamamen",
"iyileştim",
".",
],
),
(
"Yaklaşık iki hafta süren bireysel erken oy kullanma döneminin ardından 5,7 milyondan fazla Floridalı sandık başına gitti.",
[
"Yaklaşık",
"iki",
"hafta",
"süren",
"bireysel",
"erken",
"oy",
"kullanma",
"döneminin",
"ardından",
"5,7",
"milyondan",
"fazla",
"Floridalı",
"sandık",
"başına",
"gitti",
".",
],
),
(
"Ancak, bu ABD Çevre Koruma Ajansı'nın dünyayı bu konularda uyarmasının ardından ortaya çıktı.",
[
"Ancak",
",",
"bu",
"ABD",
"Çevre",
"Koruma",
"Ajansı'nın",
"dünyayı",
"bu",
"konularda",
"uyarmasının",
"ardından",
"ortaya",
"çıktı",
".",
],
),
(
"Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.",
[
"Ortalama",
"şansa",
"ve",
"10.000",
"Sterlin",
"değerinde",
"tahvillere",
"sahip",
"bir",
"yatırımcı",
"yılda",
"125",
"Sterlin",
"ikramiye",
"kazanabilir",
".",
],
),
(
"Granit adaları; Seyşeller ve Tioman ile Saint Helena gibi volkanik adaları kapsar.",
[
"Granit",
"adaları",
";",
"Seyşeller",
"ve",
"Tioman",
"ile",
"Saint",
"Helena",
"gibi",
"volkanik",
"adaları",
"kapsar",
".",
],
),
(
"Barış antlaşmasıyla İspanya, Amerika'ya Porto Riko, Guam ve Filipinler kolonilerini devretti.",
[
"Barış",
"antlaşmasıyla",
"İspanya",
",",
"Amerika'ya",
"Porto",
"Riko",
",",
"Guam",
"ve",
"Filipinler",
"kolonilerini",
"devretti",
".",
],
),
(
"Makedonya'nın sınır bölgelerini güvence altına alan Philip, büyük bir Makedon ordusu kurdu ve uzun bir fetih seferi için Trakya'ya doğru yürüdü.",
[
"Makedonya'nın",
"sınır",
"bölgelerini",
"güvence",
"altına",
"alan",
"Philip",
",",
"büyük",
"bir",
"Makedon",
"ordusu",
"kurdu",
"ve",
"uzun",
"bir",
"fetih",
"seferi",
"için",
"Trakya'ya",
"doğru",
"yürüdü",
".",
],
),
(
"Fransız gazetesi Le Figaro'ya göre bu hükumet planı sayesinde 42 milyon Euro kazanç sağlanabilir ve elde edilen paranın 15.5 milyonu ulusal güvenlik için kullanılabilir.",
[
"Fransız",
"gazetesi",
"Le",
"Figaro'ya",
"göre",
"bu",
"hükumet",
"planı",
"sayesinde",
"42",
"milyon",
"Euro",
"kazanç",
"sağlanabilir",
"ve",
"elde",
"edilen",
"paranın",
"15.5",
"milyonu",
"ulusal",
"güvenlik",
"için",
"kullanılabilir",
".",
],
),
(
"Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.",
[
"Ortalama",
"şansa",
"ve",
"10.000",
"Sterlin",
"değerinde",
"tahvillere",
"sahip",
"bir",
"yatırımcı",
"yılda",
"125",
"Sterlin",
"ikramiye",
"kazanabilir",
".",
],
),
(
"3 Kasım Salı günü, Ankara Belediye Başkanı 2014'te hükümetle birlikte oluşturulan kentsel gelişim anlaşmasını askıya alma kararı verdi.",
[
"3",
"Kasım",
"Salı",
"günü",
",",
"Ankara",
"Belediye",
"Başkanı",
"2014'te",
"hükümetle",
"birlikte",
"oluşturulan",
"kentsel",
"gelişim",
"anlaşmasını",
"askıya",
"alma",
"kararı",
"verdi",
".",
],
),
(
"Stalin, Abakumov'u Beria'nın enerji bakanlıkları üzerindeki baskınlığına karşı MGB içinde kendi ağını kurmaya teşvik etmeye başlamıştı.",
[
"Stalin",
",",
"Abakumov'u",
"Beria'nın",
"enerji",
"bakanlıkları",
"üzerindeki",
"baskınlığına",
"karşı",
"MGB",
"içinde",
"kendi",
"ağını",
"kurmaya",
"teşvik",
"etmeye",
"başlamıştı",
".",
],
),
(
"Güney Avrupa'daki kazı alanlarının çoğunluğu gibi, bu bulgu M.Ö. 5. yüzyılın başlar",
[
"Güney",
"Avrupa'daki",
"kazı",
"alanlarının",
"çoğunluğu",
"gibi",
",",
"bu",
"bulgu",
"M.Ö.",
"5.",
"yüzyılın",
"başlar",
],
),
(
"Sağlığın bozulması Hitchcock hayatının son yirmi yılında üretimini azalttı.",
[
"Sağlığın",
"bozulması",
"Hitchcock",
"hayatının",
"son",
"yirmi",
"yılında",
"üretimini",
"azalttı",
".",
],
),
]
TESTS = (ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS)
TESTS = ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS
@pytest.mark.parametrize("text,expected_tokens", TESTS)
@ -149,4 +696,3 @@ def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens):
token_list = [token.text for token in tokens if not token.is_space]
print(token_list)
assert expected_tokens == token_list
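
Because `TESTS` is a plain concatenation of the per-category lists, the same input can be parametrized more than once; the "Ortalama şansa ve 10.000 Sterlin ..." sentence, for instance, appears twice in `GENERAL_TESTS`. The following is a minimal sketch, not part of this commit, of how such repeats could be detected. The helper name `find_duplicate_inputs` and the two sample lists are hypothetical stand-ins shaped like the `(text, expected_tokens)` pairs above.

```python
from collections import Counter


def find_duplicate_inputs(cases):
    """Return the input texts that occur more than once in (text, expected) pairs."""
    counts = Counter(text for text, _ in cases)
    return [text for text, n in counts.items() if n > 1]


# Hypothetical stand-in lists, shaped like the per-category test lists.
ABBREV_SAMPLE = [("Dr.", ["Dr."]), ("Prof.'un", ["Prof.'un"])]
NUMBER_SAMPLE = [("0.5", ["0.5"]), ("Dr.", ["Dr."])]

duplicates = find_duplicate_inputs(ABBREV_SAMPLE + NUMBER_SAMPLE)
print(duplicates)
```

A duplicated case is harmless for correctness, but it runs the same assertion twice and makes the parametrized test IDs harder to tell apart.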


@ -89,7 +89,6 @@ def test_uk_tokenizer_splits_open_appostrophe(uk_tokenizer, text):
assert tokens[0].text == "'"
@pytest.mark.skip(reason="See Issue #3327 and PR #3329")
@pytest.mark.parametrize("text", ["Тест''"])
def test_uk_tokenizer_splits_double_end_quote(uk_tokenizer, text):
tokens = uk_tokenizer(text)


@ -7,7 +7,6 @@ from spacy.tokens import Doc
from spacy.pipeline._parser_internals.nonproj import projectivize
from spacy.pipeline._parser_internals.arc_eager import ArcEager
from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
from spacy.pipeline._parser_internals.stateclass import StateClass
def get_sequence_costs(M, words, heads, deps, transitions):
@ -59,7 +58,7 @@ def test_oracle_four_words(arc_eager, vocab):
["S"],
["L-left"],
["S"],
["D"]
["D"],
]
assert state.is_final()
for i, state_costs in enumerate(cost_history):
@ -185,9 +184,9 @@ def test_oracle_dev_sentence(vocab, arc_eager):
"L-nn", # Attach 'Cars' to 'Inc.'
"L-nn", # Attach 'Motor' to 'Inc.'
"L-nn", # Attach 'Rolls-Royce' to 'Inc.'
"S", # Shift "Inc."
"S", # Shift "Inc."
"L-nsubj", # Attach 'Inc.' to 'said'
"S", # Shift 'said'
"S", # Shift 'said'
"S", # Shift 'it'
"L-nsubj", # Attach 'it.' to 'expects'
"R-ccomp", # Attach 'expects' to 'said'
@ -251,7 +250,7 @@ def test_oracle_bad_tokenization(vocab, arc_eager):
is root is
bad comp is
"""
gold_words = []
gold_deps = []
gold_heads = []
@ -268,7 +267,9 @@ def test_oracle_bad_tokenization(vocab, arc_eager):
arc_eager.add_action(2, dep) # Left
arc_eager.add_action(3, dep) # Right
reference = Doc(Vocab(), words=gold_words, deps=gold_deps, heads=gold_heads)
predicted = Doc(reference.vocab, words=["[", "catalase", "]", ":", "that", "is", "bad"])
predicted = Doc(
reference.vocab, words=["[", "catalase", "]", ":", "that", "is", "bad"]
)
example = Example(predicted=predicted, reference=reference)
ae_oracle_actions = arc_eager.get_oracle_sequence(example, _debug=False)
ae_oracle_actions = [arc_eager.get_class_name(i) for i in ae_oracle_actions]


@ -301,11 +301,9 @@ def test_block_ner():
assert [token.ent_type_ for token in doc] == expected_types
@pytest.mark.parametrize(
"use_upper", [True, False]
)
@pytest.mark.parametrize("use_upper", [True, False])
def test_overfitting_IO(use_upper):
# Simple test to try and quickly overfit the NER component - ensuring the ML models work correctly
# Simple test to try and quickly overfit the NER component
nlp = English()
ner = nlp.add_pipe("ner", config={"model": {"use_upper": use_upper}})
train_examples = []
@ -361,6 +359,84 @@ def test_overfitting_IO(use_upper):
assert_equal(batch_deps_1, no_batch_deps)
def test_beam_ner_scores():
# Test that we can get confidence values out of the beam_ner pipe
beam_width = 16
beam_density = 0.0001
nlp = English()
config = {
"beam_width": beam_width,
"beam_density": beam_density,
}
ner = nlp.add_pipe("beam_ner", config=config)
train_examples = []
for text, annotations in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
for ent in annotations.get("entities"):
ner.add_label(ent[2])
optimizer = nlp.initialize()
# update once
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
# test the scores from the beam
test_text = "I like London."
doc = nlp.make_doc(test_text)
docs = [doc]
beams = ner.predict(docs)
entity_scores = ner.scored_ents(beams)[0]
for j in range(len(doc)):
for label in ner.labels:
score = entity_scores[(j, j+1, label)]
eps = 0.00001
assert 0 - eps <= score <= 1 + eps
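
The range check in `test_beam_ner_scores` can be illustrated in isolation. This is a sketch under assumptions: the dictionary below is hand-written sample data keyed by `(start, end, label)` in the same way the test indexes `entity_scores`, and `scores_in_unit_interval` is a hypothetical helper, not a spaCy function.

```python
def scores_in_unit_interval(entity_scores, eps=1e-5):
    """Return True if every score lies in [0, 1], allowing eps of floating-point slack."""
    return all(0 - eps <= score <= 1 + eps for score in entity_scores.values())


# Hypothetical scores keyed by (start, end, label), mirroring the lookup in the test.
sample_scores = {
    (2, 3, "LOC"): 0.999999,
    (2, 3, "PERSON"): 0.000001,
    (0, 1, "LOC"): 1.000004,  # slightly above 1.0, but within eps
}

print(scores_in_unit_interval(sample_scores))
```

The small `eps` margin is there because scores accumulated in floating point can land marginally outside the exact interval.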
def test_beam_overfitting_IO():
# Simple test to try and quickly overfit the Beam NER component
nlp = English()
beam_width = 16
beam_density = 0.0001
config = {
"beam_width": beam_width,
"beam_density": beam_density,
}
ner = nlp.add_pipe("beam_ner", config=config)
train_examples = []
for text, annotations in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
for ent in annotations.get("entities"):
ner.add_label(ent[2])
optimizer = nlp.initialize()
# run overfitting
for i in range(50):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
assert losses["beam_ner"] < 0.0001
# test the scores from the beam
test_text = "I like London."
docs = [nlp.make_doc(test_text)]
beams = ner.predict(docs)
entity_scores = ner.scored_ents(beams)[0]
assert entity_scores[(2, 3, "LOC")] == 1.0
assert entity_scores[(2, 3, "PERSON")] == 0.0
# Also test the results are still the same after IO
with make_tempdir() as tmp_dir:
nlp.to_disk(tmp_dir)
nlp2 = util.load_model_from_path(tmp_dir)
docs2 = [nlp2.make_doc(test_text)]
ner2 = nlp2.get_pipe("beam_ner")
beams2 = ner2.predict(docs2)
entity_scores2 = ner2.scored_ents(beams2)[0]
assert entity_scores2[(2, 3, "LOC")] == 1.0
assert entity_scores2[(2, 3, "PERSON")] == 0.0
def test_ner_warns_no_lookups(caplog):
nlp = English()
assert nlp.lang in util.LEXEME_NORM_LANGS


@ -1,13 +1,9 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
import hypothesis
import hypothesis.strategies
import numpy
from spacy.vocab import Vocab
from spacy.language import Language
from spacy.pipeline import DependencyParser
from spacy.pipeline._parser_internals.arc_eager import ArcEager
from spacy.tokens import Doc
from spacy.pipeline._parser_internals._beam_utils import BeamBatch
@ -44,7 +40,7 @@ def docs(vocab):
words=["Rats", "bite", "things"],
heads=[1, 1, 1],
deps=["nsubj", "ROOT", "dobj"],
sent_starts=[True, False, False]
sent_starts=[True, False, False],
)
]
@ -77,10 +73,12 @@ def batch_size(docs):
def beam_width():
return 4
@pytest.fixture(params=[0.0, 0.5, 1.0])
def beam_density(request):
return request.param
@pytest.fixture
def vector_size():
return 6
@ -100,7 +98,9 @@ def scores(moves, batch_size, beam_width):
numpy.random.uniform(-0.1, 0.1, (beam_width, moves.n_moves))
for _ in range(batch_size)
]
), dtype="float32")
),
dtype="float32",
)
def test_create_beam(beam):
@ -128,8 +128,6 @@ def test_beam_parse(examples, beam_width):
parser(doc)
@hypothesis.given(hyp=hypothesis.strategies.data())
def test_beam_density(moves, examples, beam_width, hyp):
beam_density = float(hyp.draw(hypothesis.strategies.floats(0.0, 1.0, width=32)))


@ -28,6 +28,26 @@ TRAIN_DATA = [
]
CONFLICTING_DATA = [
(
"I like London and Berlin.",
{
"heads": [1, 1, 1, 2, 2, 1],
"deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
},
),
(
"I like London and Berlin.",
{
"heads": [0, 0, 0, 0, 0, 0],
"deps": ["ROOT", "nsubj", "nsubj", "cc", "conj", "punct"],
},
),
]
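
Both `CONFLICTING_DATA` entries annotate the same sentence with incompatible parses, and each supplies one head index and one dependency label per token. The sketch below, which is not part of the commit, checks that shape. It assumes the sentence tokenizes into six tokens (the final period split off), and `annotation_is_well_formed` is a hypothetical helper name.

```python
def annotation_is_well_formed(n_tokens, annotation):
    """Check that heads and deps have one entry per token and heads index valid tokens."""
    heads = annotation["heads"]
    deps = annotation["deps"]
    return (
        len(heads) == n_tokens
        and len(deps) == n_tokens
        and all(0 <= head < n_tokens for head in heads)
    )


# Copied from the two CONFLICTING_DATA entries above.
first = {
    "heads": [1, 1, 1, 2, 2, 1],
    "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
}
second = {
    "heads": [0, 0, 0, 0, 0, 0],
    "deps": ["ROOT", "nsubj", "nsubj", "cc", "conj", "punct"],
}

# Assumption: "I like London and Berlin." yields 6 tokens.
print(annotation_is_well_formed(6, first), annotation_is_well_formed(6, second))
```

Both entries pass this structural check even though they disagree on the root and on most labels, which is what makes them useful for exercising non-trivial beam scores.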
eps = 0.01
def test_parser_root(en_vocab):
words = ["i", "do", "n't", "have", "other", "assistance"]
heads = [3, 3, 3, 3, 5, 3]
@ -185,26 +205,31 @@ def test_parser_set_sent_starts(en_vocab):
assert token.head in sent
def test_overfitting_IO():
# Simple test to try and quickly overfit the dependency parser - ensuring the ML models work correctly
@pytest.mark.parametrize("pipe_name", ["parser", "beam_parser"])
def test_overfitting_IO(pipe_name):
# Simple test to try and quickly overfit the dependency parser (normal or beam)
nlp = English()
parser = nlp.add_pipe("parser")
parser = nlp.add_pipe(pipe_name)
train_examples = []
for text, annotations in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
for dep in annotations.get("deps", []):
parser.add_label(dep)
optimizer = nlp.initialize()
for i in range(100):
# run overfitting
for i in range(150):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
assert losses["parser"] < 0.0001
assert losses[pipe_name] < 0.0001
# test the trained model
test_text = "I like securities."
doc = nlp(test_text)
assert doc[0].dep_ == "nsubj"
assert doc[2].dep_ == "dobj"
assert doc[3].dep_ == "punct"
assert doc[0].head.i == 1
assert doc[2].head.i == 1
assert doc[3].head.i == 1
# Also test the results are still the same after IO
with make_tempdir() as tmp_dir:
nlp.to_disk(tmp_dir)
@ -213,6 +238,9 @@ def test_overfitting_IO():
assert doc2[0].dep_ == "nsubj"
assert doc2[2].dep_ == "dobj"
assert doc2[3].dep_ == "punct"
assert doc2[0].head.i == 1
assert doc2[2].head.i == 1
assert doc2[3].head.i == 1
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
texts = [
@ -226,3 +254,123 @@ def test_overfitting_IO():
no_batch_deps = [doc.to_array([DEP]) for doc in [nlp(text) for text in texts]]
assert_equal(batch_deps_1, batch_deps_2)
assert_equal(batch_deps_1, no_batch_deps)
def test_beam_parser_scores():
# Test that we can get confidence values out of the beam_parser pipe
beam_width = 16
beam_density = 0.0001
nlp = English()
config = {
"beam_width": beam_width,
"beam_density": beam_density,
}
parser = nlp.add_pipe("beam_parser", config=config)
train_examples = []
for text, annotations in CONFLICTING_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
for dep in annotations.get("deps", []):
parser.add_label(dep)
optimizer = nlp.initialize()
# update a bit with conflicting data
for i in range(10):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
# test the scores from the beam
test_text = "I like securities."
doc = nlp.make_doc(test_text)
docs = [doc]
beams = parser.predict(docs)
head_scores, label_scores = parser.scored_parses(beams)
for j in range(len(doc)):
for label in parser.labels:
label_score = label_scores[0][(j, label)]
assert 0 - eps <= label_score <= 1 + eps
for i in range(len(doc)):
head_score = head_scores[0][(j, i)]
assert 0 - eps <= head_score <= 1 + eps
def test_beam_overfitting_IO():
# Simple test to try and quickly overfit the Beam dependency parser
nlp = English()
beam_width = 16
beam_density = 0.0001
config = {
"beam_width": beam_width,
"beam_density": beam_density,
}
parser = nlp.add_pipe("beam_parser", config=config)
train_examples = []
for text, annotations in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
for dep in annotations.get("deps", []):
parser.add_label(dep)
optimizer = nlp.initialize()
# run overfitting
for i in range(150):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
assert losses["beam_parser"] < 0.0001
# test the scores from the beam
test_text = "I like securities."
docs = [nlp.make_doc(test_text)]
beams = parser.predict(docs)
head_scores, label_scores = parser.scored_parses(beams)
# we only processed one document
head_scores = head_scores[0]
label_scores = label_scores[0]
# test label annotations: 0=nsubj, 2=dobj, 3=punct
assert label_scores[(0, "nsubj")] == pytest.approx(1.0, eps)
assert label_scores[(0, "dobj")] == pytest.approx(0.0, eps)
assert label_scores[(0, "punct")] == pytest.approx(0.0, eps)
assert label_scores[(2, "nsubj")] == pytest.approx(0.0, eps)
assert label_scores[(2, "dobj")] == pytest.approx(1.0, eps)
assert label_scores[(2, "punct")] == pytest.approx(0.0, eps)
assert label_scores[(3, "nsubj")] == pytest.approx(0.0, eps)
assert label_scores[(3, "dobj")] == pytest.approx(0.0, eps)
assert label_scores[(3, "punct")] == pytest.approx(1.0, eps)
# test head annotations: the root is token at index 1
assert head_scores[(0, 0)] == pytest.approx(0.0, eps)
assert head_scores[(0, 1)] == pytest.approx(1.0, eps)
assert head_scores[(0, 2)] == pytest.approx(0.0, eps)
assert head_scores[(2, 0)] == pytest.approx(0.0, eps)
assert head_scores[(2, 1)] == pytest.approx(1.0, eps)
assert head_scores[(2, 2)] == pytest.approx(0.0, eps)
assert head_scores[(3, 0)] == pytest.approx(0.0, eps)
assert head_scores[(3, 1)] == pytest.approx(1.0, eps)
assert head_scores[(3, 2)] == pytest.approx(0.0, eps)
# Also test the results are still the same after IO
with make_tempdir() as tmp_dir:
nlp.to_disk(tmp_dir)
nlp2 = util.load_model_from_path(tmp_dir)
docs2 = [nlp2.make_doc(test_text)]
parser2 = nlp2.get_pipe("beam_parser")
beams2 = parser2.predict(docs2)
head_scores2, label_scores2 = parser2.scored_parses(beams2)
# we only processed one document
head_scores2 = head_scores2[0]
label_scores2 = label_scores2[0]
# check the results again
assert label_scores2[(0, "nsubj")] == pytest.approx(1.0, eps)
assert label_scores2[(0, "dobj")] == pytest.approx(0.0, eps)
assert label_scores2[(0, "punct")] == pytest.approx(0.0, eps)
assert label_scores2[(2, "nsubj")] == pytest.approx(0.0, eps)
assert label_scores2[(2, "dobj")] == pytest.approx(1.0, eps)
assert label_scores2[(2, "punct")] == pytest.approx(0.0, eps)
assert label_scores2[(3, "nsubj")] == pytest.approx(0.0, eps)
assert label_scores2[(3, "dobj")] == pytest.approx(0.0, eps)
assert label_scores2[(3, "punct")] == pytest.approx(1.0, eps)
assert head_scores2[(0, 0)] == pytest.approx(0.0, eps)
assert head_scores2[(0, 1)] == pytest.approx(1.0, eps)
assert head_scores2[(0, 2)] == pytest.approx(0.0, eps)
assert head_scores2[(2, 0)] == pytest.approx(0.0, eps)
assert head_scores2[(2, 1)] == pytest.approx(1.0, eps)
assert head_scores2[(2, 2)] == pytest.approx(0.0, eps)
assert head_scores2[(3, 0)] == pytest.approx(0.0, eps)
assert head_scores2[(3, 1)] == pytest.approx(1.0, eps)
assert head_scores2[(3, 2)] == pytest.approx(0.0, eps)
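
The beam tests above expect `scored_parses` to return, for each document, dictionaries keyed by `(token_index, head_index)` and `(token_index, label)` with values in the range 0 to 1. The following is a minimal pure-Python sketch of how such marginals could be derived from a beam. It assumes each candidate parse carries a positive weight; the function name `marginal_scores` and the `(weight, parse)` layout are hypothetical and are not spaCy's implementation.

```python
from collections import defaultdict


def marginal_scores(beam):
    """Turn weighted candidate parses into per-arc and per-label marginals.

    `beam` is a list of (weight, parse) pairs, where `parse` is a list of
    (head_index, label) tuples, one per token. This layout is an assumption
    made purely for illustration.
    """
    total = sum(weight for weight, _ in beam)
    head_scores = defaultdict(float)
    label_scores = defaultdict(float)
    for weight, parse in beam:
        prob = weight / total  # normalise so the whole beam sums to 1
        for token_index, (head_index, label) in enumerate(parse):
            head_scores[(token_index, head_index)] += prob
            label_scores[(token_index, label)] += prob
    return dict(head_scores), dict(label_scores)


# Two candidate analyses of "I like securities ." that differ only in the
# label assigned to token 2.
beam = [
    (3.0, [(1, "nsubj"), (1, "ROOT"), (1, "dobj"), (1, "punct")]),
    (1.0, [(1, "nsubj"), (1, "ROOT"), (1, "nsubj"), (1, "punct")]),
]
head_scores, label_scores = marginal_scores(beam)
```

When every candidate agrees on an arc or label, its marginal is 1.0, which is the situation the overfitting test checks with `pytest.approx(1.0, eps)`.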


@ -4,14 +4,17 @@ from spacy.tokens.doc import Doc
from spacy.vocab import Vocab
from spacy.pipeline._parser_internals.stateclass import StateClass
@pytest.fixture
def vocab():
return Vocab()
@pytest.fixture
def doc(vocab):
return Doc(vocab, words=["a", "b", "c", "d"])
def test_init_state(doc):
state = StateClass(doc)
assert state.stack == []
@ -19,6 +22,7 @@ def test_init_state(doc):
assert not state.is_final()
assert state.buffer_length() == 4
def test_push_pop(doc):
state = StateClass(doc)
state.push()
@ -33,6 +37,7 @@ def test_push_pop(doc):
assert state.stack == [0]
assert 1 not in state.queue
def test_stack_depth(doc):
state = StateClass(doc)
assert state.stack_depth() == 0


@ -161,7 +161,7 @@ def test_attributeruler_score(nlp, pattern_dicts):
# "cat" is the only correct lemma
assert scores["lemma_acc"] == pytest.approx(0.2)
# no morphs are set
assert scores["morph_acc"] == None
assert scores["morph_acc"] is None
def test_attributeruler_rule_order(nlp):
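
The change from `== None` to `is None` above matters because `==` dispatches to `__eq__`, which a class may override, whereas `is` compares identity. A small self-contained illustration follows; the `AlwaysEqual` class is a toy invented for this example.

```python
class AlwaysEqual:
    """Toy object whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


score = AlwaysEqual()
equals_none = score == None  # noqa: E711 - True, although score is not None
is_none = score is None  # False: identity checks cannot be overridden
```

An assertion written with `== None` would therefore pass for such an object, while `is None` only passes for the `None` singleton itself.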


@ -201,13 +201,9 @@ def test_entity_ruler_overlapping_spans(nlp):
@pytest.mark.parametrize("n_process", [1, 2])
def test_entity_ruler_multiprocessing(nlp, n_process):
texts = [
"I enjoy eating Pizza Hut pizza."
]
texts = ["I enjoy eating Pizza Hut pizza."]
patterns = [
{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}
]
patterns = [{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}]
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)


@ -159,8 +159,12 @@ def test_pipe_class_component_model():
"model": {
"@architectures": "spacy.TextCatEnsemble.v2",
"tok2vec": DEFAULT_TOK2VEC_MODEL,
"linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1,
"no_output_layer": False},
"linear_model": {
"@architectures": "spacy.TextCatBOW.v1",
"exclusive_classes": False,
"ngram_size": 1,
"no_output_layer": False,
},
},
"value1": 10,
}


@ -37,7 +37,16 @@ TRAIN_DATA = [
]
PARTIAL_DATA = [
# partial annotation
("I like green eggs", {"tags": ["", "V", "J", ""]}),
# misaligned partial annotation
(
"He hates green eggs",
{
"words": ["He", "hate", "s", "green", "eggs"],
"tags": ["", "V", "S", "J", ""],
},
),
]
@ -126,6 +135,7 @@ def test_incomplete_data():
assert doc[1].tag_ == "V"
assert doc[2].tag_ == "J"
def test_overfitting_IO():
# Simple test to try and quickly overfit the tagger - ensuring the ML models work correctly
nlp = English()


@ -15,15 +15,31 @@ from spacy.training import Example
from ..util import make_tempdir
TRAIN_DATA = [
TRAIN_DATA_SINGLE_LABEL = [
("I'm so happy.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
("I'm so angry", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
TRAIN_DATA_MULTI_LABEL = [
("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]
def make_get_examples(nlp):
def make_get_examples_single_label(nlp):
train_examples = []
for t in TRAIN_DATA:
for t in TRAIN_DATA_SINGLE_LABEL:
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
def get_examples():
return train_examples
return get_examples
def make_get_examples_multi_label(nlp):
train_examples = []
for t in TRAIN_DATA_MULTI_LABEL:
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
def get_examples():
@ -85,49 +101,75 @@ def test_textcat_learns_multilabel():
assert score > 0.5
def test_label_types():
@pytest.mark.parametrize("name", ["textcat", "textcat_multilabel"])
def test_label_types(name):
nlp = Language()
textcat = nlp.add_pipe("textcat")
textcat = nlp.add_pipe(name)
textcat.add_label("answer")
with pytest.raises(ValueError):
textcat.add_label(9)
def test_no_label():
@pytest.mark.parametrize("name", ["textcat", "textcat_multilabel"])
def test_no_label(name):
nlp = Language()
nlp.add_pipe("textcat")
nlp.add_pipe(name)
with pytest.raises(ValueError):
nlp.initialize()
def test_implicit_label():
@pytest.mark.parametrize(
"name,get_examples",
[
("textcat", make_get_examples_single_label),
("textcat_multilabel", make_get_examples_multi_label),
],
)
def test_implicit_label(name, get_examples):
nlp = Language()
nlp.add_pipe("textcat")
nlp.initialize(get_examples=make_get_examples(nlp))
nlp.add_pipe(name)
nlp.initialize(get_examples=get_examples(nlp))
def test_no_resize():
@pytest.mark.parametrize("name", ["textcat", "textcat_multilabel"])
def test_no_resize(name):
nlp = Language()
textcat = nlp.add_pipe("textcat")
textcat = nlp.add_pipe(name)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
nlp.initialize()
assert textcat.model.get_dim("nO") == 2
assert textcat.model.get_dim("nO") >= 2
# this throws an error because the textcat can't be resized after initialization
with pytest.raises(ValueError):
textcat.add_label("NEUTRAL")
def test_initialize_examples():
def test_error_with_multi_labels():
nlp = Language()
textcat = nlp.add_pipe("textcat")
for text, annotations in TRAIN_DATA:
train_examples = []
for text, annotations in TRAIN_DATA_MULTI_LABEL:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
with pytest.raises(ValueError):
optimizer = nlp.initialize(get_examples=lambda: train_examples)
@pytest.mark.parametrize(
"name,get_examples, train_data",
[
("textcat", make_get_examples_single_label, TRAIN_DATA_SINGLE_LABEL),
("textcat_multilabel", make_get_examples_multi_label, TRAIN_DATA_MULTI_LABEL),
],
)
def test_initialize_examples(name, get_examples, train_data):
nlp = Language()
textcat = nlp.add_pipe(name)
for text, annotations in train_data:
for label, value in annotations.get("cats").items():
textcat.add_label(label)
# you shouldn't really call this more than once, but for testing it should be fine
nlp.initialize()
get_examples = make_get_examples(nlp)
nlp.initialize(get_examples=get_examples)
nlp.initialize(get_examples=get_examples(nlp))
with pytest.raises(TypeError):
nlp.initialize(get_examples=lambda: None)
with pytest.raises(TypeError):
@ -138,12 +180,10 @@ def test_overfitting_IO():
# Simple test to try and quickly overfit the single-label textcat component - ensuring the ML models work correctly
fix_random_seed(0)
nlp = English()
nlp.config["initialize"]["components"]["textcat"] = {"positive_label": "POSITIVE"}
# Set exclusive labels
config = {"model": {"linear_model": {"exclusive_classes": True}}}
textcat = nlp.add_pipe("textcat", config=config)
textcat = nlp.add_pipe("textcat")
train_examples = []
for text, annotations in TRAIN_DATA:
for text, annotations in TRAIN_DATA_SINGLE_LABEL:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
optimizer = nlp.initialize(get_examples=lambda: train_examples)
assert textcat.model.get_dim("nO") == 2
@ -172,6 +212,8 @@ def test_overfitting_IO():
# Test scoring
scores = nlp.evaluate(train_examples)
assert scores["cats_micro_f"] == 1.0
assert scores["cats_macro_f"] == 1.0
assert scores["cats_macro_auc"] == 1.0
assert scores["cats_score"] == 1.0
assert "cats_score_desc" in scores
@ -192,7 +234,7 @@ def test_overfitting_IO_multi():
config = {"model": {"linear_model": {"exclusive_classes": False}}}
textcat = nlp.add_pipe("textcat", config=config)
train_examples = []
for text, annotations in TRAIN_DATA:
for text, annotations in TRAIN_DATA_MULTI_LABEL:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
optimizer = nlp.initialize(get_examples=lambda: train_examples)
assert textcat.model.get_dim("nO") == 2
@ -231,27 +273,75 @@ def test_overfitting_IO_multi():
assert_equal(batch_cats_1, no_batch_cats)
def test_overfitting_IO_multi():
# Simple test to try and quickly overfit the multi-label textcat component - ensuring the ML models work correctly
fix_random_seed(0)
nlp = English()
textcat = nlp.add_pipe("textcat_multilabel")
train_examples = []
for text, annotations in TRAIN_DATA_MULTI_LABEL:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
optimizer = nlp.initialize(get_examples=lambda: train_examples)
assert textcat.model.get_dim("nO") == 3
for i in range(100):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
assert losses["textcat_multilabel"] < 0.01
# test the trained model
test_text = "I am confused but happy."
doc = nlp(test_text)
cats = doc.cats
assert cats["HAPPY"] > 0.9
assert cats["CONFUSED"] > 0.9
# Also test the results are still the same after IO
with make_tempdir() as tmp_dir:
nlp.to_disk(tmp_dir)
nlp2 = util.load_model_from_path(tmp_dir)
doc2 = nlp2(test_text)
cats2 = doc2.cats
assert cats2["HAPPY"] > 0.9
assert cats2["CONFUSED"] > 0.9
# Test scoring
scores = nlp.evaluate(train_examples)
assert scores["cats_micro_f"] == 1.0
assert scores["cats_macro_f"] == 1.0
assert "cats_score_desc" in scores
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
texts = ["Just a sentence.", "I like green eggs.", "I am happy.", "I eat ham."]
batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)]
batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)]
no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]]
assert_equal(batch_deps_1, batch_deps_2)
assert_equal(batch_deps_1, no_batch_deps)
# fmt: off
@pytest.mark.parametrize(
"textcat_config",
"name,train_data,textcat_config",
[
{"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False},
{"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False},
{"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True},
{"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True},
{"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}},
{"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}},
{"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True},
{"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False},
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}),
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False}),
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True}),
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True}),
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}}),
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}}),
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True}),
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False}),
],
)
# fmt: on
def test_textcat_configs(textcat_config):
def test_textcat_configs(name, train_data, textcat_config):
pipe_config = {"model": textcat_config}
nlp = English()
textcat = nlp.add_pipe("textcat", config=pipe_config)
textcat = nlp.add_pipe(name, config=pipe_config)
train_examples = []
for text, annotations in TRAIN_DATA:
for text, annotations in train_data:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
for label, value in annotations.get("cats").items():
textcat.add_label(label)
@ -264,15 +354,24 @@ def test_textcat_configs(textcat_config):
def test_positive_class():
nlp = English()
textcat = nlp.add_pipe("textcat")
get_examples = make_get_examples(nlp)
get_examples = make_get_examples_single_label(nlp)
textcat.initialize(get_examples, labels=["POS", "NEG"], positive_label="POS")
assert textcat.labels == ("POS", "NEG")
assert textcat.cfg["positive_label"] == "POS"
textcat_multilabel = nlp.add_pipe("textcat_multilabel")
get_examples = make_get_examples_multi_label(nlp)
with pytest.raises(TypeError):
textcat_multilabel.initialize(get_examples, labels=["POS", "NEG"], positive_label="POS")
textcat_multilabel.initialize(get_examples, labels=["FICTION", "DRAMA"])
assert textcat_multilabel.labels == ("FICTION", "DRAMA")
assert "positive_label" not in textcat_multilabel.cfg
def test_positive_class_not_present():
nlp = English()
textcat = nlp.add_pipe("textcat")
get_examples = make_get_examples(nlp)
get_examples = make_get_examples_single_label(nlp)
with pytest.raises(ValueError):
textcat.initialize(get_examples, labels=["SOME", "THING"], positive_label="POS")
@ -280,11 +379,9 @@ def test_positive_class_not_present():
def test_positive_class_not_binary():
nlp = English()
textcat = nlp.add_pipe("textcat")
get_examples = make_get_examples(nlp)
get_examples = make_get_examples_multi_label(nlp)
with pytest.raises(ValueError):
textcat.initialize(
get_examples, labels=["SOME", "THING", "POS"], positive_label="POS"
)
textcat.initialize(get_examples, labels=["SOME", "THING", "POS"], positive_label="POS")
def test_textcat_evaluation():
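
The tests above split text classification into `textcat` (with `exclusive_classes` set to `True`) and `textcat_multilabel` (with `exclusive_classes` set to `False`). The sketch below illustrates the difference, assuming the usual reading of those settings: a softmax over mutually exclusive classes versus an independent sigmoid per label. The raw scores are made up, and the helper functions are illustrative rather than spaCy's model code.

```python
import math


def softmax(scores):
    """Mutually exclusive classes: the probabilities sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores):
    """Independent labels: each probability is computed on its own."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


raw = [4.0, 4.0, -4.0]  # hypothetical scores for ANGRY, CONFUSED, HAPPY
exclusive = softmax(raw)
independent = sigmoid(raw)
```

With exclusive classes, two equally strong labels have to share the probability mass, so neither can exceed 0.5. With independent labels, both can score above 0.9, which is what the multi-label overfitting test asserts for `HAPPY` and `CONFUSED`.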


@ -113,7 +113,7 @@ cfg_string = """
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
@ -123,7 +123,7 @@ cfg_string = """
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1


@ -288,35 +288,33 @@ def test_multiple_predictions():
dummy_pipe(doc)
@pytest.mark.skip(reason="removed Beam stuff during the Example/GoldParse refactor")
def test_issue4313():
""" This should not crash or exit with some strange error code """
beam_width = 16
beam_density = 0.0001
nlp = English()
config = {}
ner = nlp.create_pipe("ner", config=config)
config = {
"beam_width": beam_width,
"beam_density": beam_density,
}
ner = nlp.add_pipe("beam_ner", config=config)
ner.add_label("SOME_LABEL")
ner.initialize(lambda: [])
nlp.initialize()
# add a new label to the doc
doc = nlp("What do you think about Apple ?")
assert len(ner.labels) == 1
assert "SOME_LABEL" in ner.labels
ner.add_label("MY_ORG") # TODO: not sure if we want this to be necessary...
apple_ent = Span(doc, 5, 6, label="MY_ORG")
doc.ents = list(doc.ents) + [apple_ent]
# ensure the beam_parse still works with the new label
docs = [doc]
beams = nlp.entity.beam_parse(
docs, beam_width=beam_width, beam_density=beam_density
ner = nlp.get_pipe("beam_ner")
beams = ner.beam_parse(
docs, drop=0.0, beam_width=beam_width, beam_density=beam_density
)
for doc, beam in zip(docs, beams):
entity_scores = defaultdict(float)
for score, ents in nlp.entity.moves.get_beam_parses(beam):
for start, end, label in ents:
entity_scores[(start, end, label)] += score
def test_issue4348():
"""Test that training the tagger with empty data, doesn't throw errors"""


@ -2,8 +2,11 @@ import pytest
from thinc.api import Config, fix_random_seed
from spacy.lang.en import English
from spacy.pipeline.textcat import default_model_config, bow_model_config
from spacy.pipeline.textcat import cnn_model_config
from spacy.pipeline.textcat import single_label_default_config, single_label_bow_config
from spacy.pipeline.textcat import single_label_cnn_config
from spacy.pipeline.textcat_multilabel import multi_label_default_config
from spacy.pipeline.textcat_multilabel import multi_label_bow_config
from spacy.pipeline.textcat_multilabel import multi_label_cnn_config
from spacy.tokens import Span
from spacy import displacy
from spacy.pipeline import merge_entities
@ -11,7 +14,15 @@ from spacy.training import Example
@pytest.mark.parametrize(
"textcat_config", [default_model_config, bow_model_config, cnn_model_config]
"textcat_config",
[
single_label_default_config,
single_label_bow_config,
single_label_cnn_config,
multi_label_default_config,
multi_label_bow_config,
multi_label_cnn_config,
],
)
def test_issue5551(textcat_config):
"""Test that after fixing the random seed, the results of the pipeline are truly identical"""


@ -1,4 +1,3 @@
import pydantic
import pytest
from pydantic import ValidationError
from spacy.schemas import TokenPattern, TokenPatternSchema


@ -208,7 +208,7 @@ def test_create_nlp_from_pretraining_config():
config = Config().from_str(pretrain_config_string)
pretrain_config = load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
filled = config.merge(pretrain_config)
resolved = registry.resolve(filled["pretraining"], schema=ConfigSchemaPretrain)
registry.resolve(filled["pretraining"], schema=ConfigSchemaPretrain)
def test_create_nlp_from_config_multiple_instances():


@ -4,7 +4,7 @@ from spacy.pipeline import Tagger, DependencyParser, EntityRecognizer
from spacy.pipeline import TextCategorizer, SentenceRecognizer, TrainablePipe
from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
from spacy.pipeline.tagger import DEFAULT_TAGGER_MODEL
from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL
from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
from spacy.pipeline.senter import DEFAULT_SENTER_MODEL
from spacy.lang.en import English
from thinc.api import Linear
@ -24,7 +24,7 @@ def parser(en_vocab):
"update_with_oracle_cut_size": 100,
"beam_width": 1,
"beam_update_prob": 1.0,
"beam_density": 0.0
"beam_density": 0.0,
}
cfg = {"model": DEFAULT_PARSER_MODEL}
model = registry.resolve(cfg, validate=True)["model"]
@ -41,7 +41,7 @@ def blank_parser(en_vocab):
"update_with_oracle_cut_size": 100,
"beam_width": 1,
"beam_update_prob": 1.0,
"beam_density": 0.0
"beam_density": 0.0,
}
cfg = {"model": DEFAULT_PARSER_MODEL}
model = registry.resolve(cfg, validate=True)["model"]
@ -66,7 +66,7 @@ def test_serialize_parser_roundtrip_bytes(en_vocab, Parser):
"update_with_oracle_cut_size": 100,
"beam_width": 1,
"beam_update_prob": 1.0,
"beam_density": 0.0
"beam_density": 0.0,
}
cfg = {"model": DEFAULT_PARSER_MODEL}
model = registry.resolve(cfg, validate=True)["model"]
@ -90,7 +90,7 @@ def test_serialize_parser_strings(Parser):
"update_with_oracle_cut_size": 100,
"beam_width": 1,
"beam_update_prob": 1.0,
"beam_density": 0.0
"beam_density": 0.0,
}
cfg = {"model": DEFAULT_PARSER_MODEL}
model = registry.resolve(cfg, validate=True)["model"]
@ -112,7 +112,7 @@ def test_serialize_parser_roundtrip_disk(en_vocab, Parser):
"update_with_oracle_cut_size": 100,
"beam_width": 1,
"beam_update_prob": 1.0,
"beam_density": 0.0
"beam_density": 0.0,
}
cfg = {"model": DEFAULT_PARSER_MODEL}
model = registry.resolve(cfg, validate=True)["model"]
@ -140,9 +140,6 @@ def test_to_from_bytes(parser, blank_parser):
assert blank_parser.moves.n_moves == parser.moves.n_moves
@pytest.mark.skip(
reason="This seems to be a dict ordering bug somewhere. Only failing on some platforms."
)
def test_serialize_tagger_roundtrip_bytes(en_vocab, taggers):
tagger1 = taggers[0]
tagger1_b = tagger1.to_bytes()
@ -191,7 +188,7 @@ def test_serialize_tagger_strings(en_vocab, de_vocab, taggers):
def test_serialize_textcat_empty(en_vocab):
# See issue #1105
cfg = {"model": DEFAULT_TEXTCAT_MODEL}
cfg = {"model": DEFAULT_SINGLE_TEXTCAT_MODEL}
model = registry.resolve(cfg, validate=True)["model"]
textcat = TextCategorizer(en_vocab, model, threshold=0.5)
textcat.to_bytes(exclude=["vocab"])


@ -26,7 +26,6 @@ def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
assert tokenizer_reloaded.rules == {}
@pytest.mark.skip(reason="Currently unreliable across platforms")
@pytest.mark.parametrize("text", ["I💜you", "theyre", "“hello”"])
def test_serialize_tokenizer_roundtrip_bytes(en_tokenizer, text):
tokenizer = en_tokenizer
@ -38,7 +37,6 @@ def test_serialize_tokenizer_roundtrip_bytes(en_tokenizer, text):
assert [token.text for token in doc1] == [token.text for token in doc2]
@pytest.mark.skip(reason="Currently unreliable across platforms")
def test_serialize_tokenizer_roundtrip_disk(en_tokenizer):
tokenizer = en_tokenizer
with make_tempdir() as d:


@ -3,7 +3,9 @@ from click import NoSuchOption
from spacy.training import docs_to_json, offsets_to_biluo_tags
from spacy.training.converters import iob_to_docs, conll_ner_to_docs, conllu_to_docs
from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate
from spacy.lang.nl import Dutch
from spacy.util import ENV_VARS
from spacy.cli import info
from spacy.cli.init_config import init_config, RECOMMENDATIONS
from spacy.cli._util import validate_project_commands, parse_config_overrides
from spacy.cli._util import load_project_config, substitute_project_variables
@ -15,6 +17,16 @@ import os
from .util import make_tempdir
def test_cli_info():
nlp = Dutch()
nlp.add_pipe("textcat")
with make_tempdir() as tmp_dir:
nlp.to_disk(tmp_dir)
raw_data = info(tmp_dir, exclude=[""])
assert raw_data["lang"] == "nl"
assert raw_data["components"] == ["textcat"]
def test_cli_converters_conllu_to_docs():
# from NorNE: https://github.com/ltgoslo/norne/blob/3d23274965f513f23aa48455b28b1878dad23c05/ud/nob/no_bokmaal-ud-dev.conllu
lines = [


@ -83,6 +83,7 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
def test_prefer_gpu():
try:
import cupy # noqa: F401
prefer_gpu()
assert isinstance(get_current_ops(), CupyOps)
except ImportError:
@ -92,17 +93,20 @@ def test_prefer_gpu():
def test_require_gpu():
try:
import cupy # noqa: F401
require_gpu()
assert isinstance(get_current_ops(), CupyOps)
except ImportError:
with pytest.raises(ValueError):
require_gpu()
def test_require_cpu():
require_cpu()
assert isinstance(get_current_ops(), NumpyOps)
try:
import cupy # noqa: F401
require_gpu()
assert isinstance(get_current_ops(), CupyOps)
except ImportError:


@ -294,7 +294,7 @@ def test_partial_annotation(en_tokenizer):
# cats doesn't have an unset state
if key.startswith("cats"):
continue
assert scores[key] == None
assert scores[key] is None
# partially annotated reference, not overlapping with predicted annotation
ref_doc = en_tokenizer("a b c d e")
@ -306,13 +306,13 @@ def test_partial_annotation(en_tokenizer):
example = Example(pred_doc, ref_doc)
scorer = Scorer()
scores = scorer.score([example])
assert scores["token_acc"] == None
assert scores["token_acc"] is None
assert scores["tag_acc"] == 0.0
assert scores["pos_acc"] == 0.0
assert scores["morph_acc"] == 0.0
assert scores["dep_uas"] == 1.0
assert scores["dep_las"] == 0.0
assert scores["sents_f"] == None
assert scores["sents_f"] is None
# partially annotated reference, overlapping with predicted annotation
ref_doc = en_tokenizer("a b c d e")
@ -324,13 +324,13 @@ def test_partial_annotation(en_tokenizer):
example = Example(pred_doc, ref_doc)
scorer = Scorer()
scores = scorer.score([example])
assert scores["token_acc"] == None
assert scores["token_acc"] is None
assert scores["tag_acc"] == 1.0
assert scores["pos_acc"] == 1.0
assert scores["morph_acc"] == 0.0
assert scores["dep_uas"] == 1.0
assert scores["dep_las"] == 0.0
assert scores["sents_f"] == None
assert scores["sents_f"] is None
def test_roc_auc_score():
@ -391,7 +391,7 @@ def test_roc_auc_score():
score.score_set(0.25, 0)
score.score_set(0.75, 0)
with pytest.raises(ValueError):
s = score.score
_ = score.score # noqa: F841
y_true = [1, 1]
y_score = [0.25, 0.75]
@ -402,4 +402,4 @@ def test_roc_auc_score():
score.score_set(0.25, 1)
score.score_set(0.75, 1)
with pytest.raises(ValueError):
s = score.score
_ = score.score # noqa: F841
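
The `test_roc_auc_score` hunks above expect a `ValueError` when every gold label belongs to the same class. The rank-based sketch below shows why the metric is undefined in that case: it needs at least one positive/negative pair to compare. The function `roc_auc` is a stand-alone illustration, not the scorer class used in the tests.

```python
def roc_auc(y_true, y_score):
    """Rank-based AUC: chance that a positive outranks a negative (ties count half)."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        # No positive/negative pair exists, so there is nothing to rank.
        raise ValueError("ROC AUC is undefined when only one class is present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Calling `roc_auc([0, 0], [0.25, 0.75])` or `roc_auc([1, 1], [0.25, 0.75])` raises `ValueError`, mirroring the two single-class cases exercised in the test.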


@ -180,3 +180,9 @@ def test_tokenizer_special_cases_idx(tokenizer):
doc = tokenizer(text)
assert doc[1].idx == 4
assert doc[2].idx == 7
def test_tokenizer_special_cases_spaces(tokenizer):
assert [t.text for t in tokenizer("a b c")] == ["a", "b", "c"]
tokenizer.add_special_case("a b c", [{"ORTH": "a b c"}])
assert [t.text for t in tokenizer("a b c")] == ["a b c"]


@ -51,7 +51,7 @@ def test_readers():
for example in train_corpus(nlp):
nlp.update([example], sgd=optimizer)
scores = nlp.evaluate(list(dev_corpus(nlp)))
assert scores["cats_score"] == 0.0
assert scores["cats_macro_auc"] == 0.0
# ensure the pipeline runs
doc = nlp("Quick test")
assert doc.cats
@ -73,7 +73,7 @@ def test_cat_readers(reader, additional_config):
nlp_config_string = """
[training]
seed = 0
[training.score_weights]
cats_macro_auc = 1.0


@ -71,7 +71,6 @@ def test_table_api_to_from_bytes():
assert "def" not in new_table2
@pytest.mark.skip(reason="This fails on Python 3.5")
def test_lookups_to_from_bytes():
lookups = Lookups()
lookups.add_table("table1", {"foo": "bar", "hello": "world"})
@ -91,7 +90,6 @@ def test_lookups_to_from_bytes():
assert new_lookups.to_bytes() == lookups_bytes
@pytest.mark.skip(reason="This fails on Python 3.5")
def test_lookups_to_from_disk():
lookups = Lookups()
lookups.add_table("table1", {"foo": "bar", "hello": "world"})
@ -111,7 +109,6 @@ def test_lookups_to_from_disk():
assert table2["b"] == 2
@pytest.mark.skip(reason="This fails on Python 3.5")
def test_lookups_to_from_bytes_via_vocab():
table_name = "test"
vocab = Vocab()
@ -128,7 +125,6 @@ def test_lookups_to_from_bytes_via_vocab():
assert new_vocab.to_bytes() == vocab_bytes
@pytest.mark.skip(reason="This fails on Python 3.5")
def test_lookups_to_from_disk_via_vocab():
table_name = "test"
vocab = Vocab()


@ -258,6 +258,7 @@ cdef class Tokenizer:
tokens = doc.c
# Otherwise create a separate array to store modified tokens
else:
assert max_length > 0
tokens = <TokenC*>mem.alloc(max_length, sizeof(TokenC))
# Modify tokenization according to filtered special cases
offset = self._retokenize_special_spans(doc, tokens, span_data)
@ -610,7 +611,7 @@ cdef class Tokenizer:
self.mem.free(stale_special)
self._rules[string] = substrings
self._flush_cache()
if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string):
if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string) or " " in string:
self._special_matcher.add(string, None, self._tokenize_affixes(string, False))
def _reload_special_cases(self):


@ -188,8 +188,15 @@ def _merge(Doc doc, merges):
and doc.c[start - 1].ent_type == token.ent_type:
merged_iob = 1
token.ent_iob = merged_iob
# Set lemma to concatenated lemmas
merged_lemma = ""
for span_token in span:
merged_lemma += span_token.lemma_
if doc.c[span_token.i].spacy:
merged_lemma += " "
merged_lemma = merged_lemma.strip()
token.lemma = doc.vocab.strings.add(merged_lemma)
# Unset attributes that don't match new token
token.lemma = 0
token.norm = 0
tokens[merge_index] = token
# Resize the doc.tensor, if it's set. Let the last row for each token stand
@ -335,7 +342,9 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
token = &doc.c[token_index + i]
lex = doc.vocab.get(doc.mem, orth)
token.lex = lex
token.lemma = 0 # reset lemma
# If lemma is currently set, set default lemma to orth
if token.lemma != 0:
token.lemma = lex.orth
token.norm = 0 # reset norm
if to_process_tensor:
# setting the tensors of the split tokens to array of zeros


@ -225,6 +225,7 @@ cdef class Doc:
# Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
# However, we need to remember the true starting places, so that we can
# realloc.
assert size + (PADDING*2) > 0
data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
cdef int i
for i in range(size + (PADDING*2)):
@ -1097,7 +1098,7 @@ cdef class Doc:
(vocab,) = vocab
if attrs is None:
attrs = Doc._get_array_attrs()
attrs = list(Doc._get_array_attrs())
else:
if any(isinstance(attr, str) for attr in attrs): # resolve attribute names
attrs = [intify_attr(attr) for attr in attrs] # intify_attr returns None for invalid attrs
@ -1177,6 +1178,7 @@ cdef class Doc:
other.length = self.length
other.max_length = self.max_length
buff_size = other.max_length + (PADDING*2)
assert buff_size > 0
tokens = <TokenC*>other.mem.alloc(buff_size, sizeof(TokenC))
memcpy(tokens, self.c - PADDING, buff_size * sizeof(TokenC))
other.c = &tokens[PADDING]

View File

@ -37,9 +37,17 @@ def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language":
T = registry.resolve(config["training"], schema=ConfigSchemaTraining)
dot_names = [T["train_corpus"], T["dev_corpus"]]
if not isinstance(T["train_corpus"], str):
raise ConfigValidationError(desc=Errors.E897.format(field="training.train_corpus", type=type(T["train_corpus"])))
raise ConfigValidationError(
desc=Errors.E897.format(
field="training.train_corpus", type=type(T["train_corpus"])
)
)
if not isinstance(T["dev_corpus"], str):
raise ConfigValidationError(desc=Errors.E897.format(field="training.dev_corpus", type=type(T["dev_corpus"])))
raise ConfigValidationError(
desc=Errors.E897.format(
field="training.dev_corpus", type=type(T["dev_corpus"])
)
)
train_corpus, dev_corpus = resolve_dot_names(config, dot_names)
optimizer = T["optimizer"]
# Components that shouldn't be updated during training

View File

@ -59,6 +59,19 @@ def train(
batcher = T["batcher"]
train_logger = T["logger"]
before_to_disk = create_before_to_disk_callback(T["before_to_disk"])
# Helper function to save checkpoints. This is a closure for convenience,
# to avoid passing in all the args all the time.
def save_checkpoint(is_best):
with nlp.use_params(optimizer.averages):
before_to_disk(nlp).to_disk(output_path / DIR_MODEL_LAST)
if is_best:
# Avoid saving twice (saving will be more expensive than
# the dir copy)
if (output_path / DIR_MODEL_BEST).exists():
shutil.rmtree(output_path / DIR_MODEL_BEST)
shutil.copytree(output_path / DIR_MODEL_LAST, output_path / DIR_MODEL_BEST)
# Components that shouldn't be updated during training
frozen_components = T["frozen_components"]
# Create iterator, which yields out info after each optimization step.
@ -87,36 +100,31 @@ def train(
if is_best_checkpoint is not None and output_path is not None:
with nlp.select_pipes(disable=frozen_components):
update_meta(T, nlp, info)
with nlp.use_params(optimizer.averages):
nlp = before_to_disk(nlp)
nlp.to_disk(output_path / DIR_MODEL_BEST)
save_checkpoint(is_best_checkpoint)
except Exception as e:
if output_path is not None:
# We don't want to swallow the traceback if we don't have a
# specific error, but we do want to warn that we're trying
# to do something here.
stdout.write(
msg.warn(
f"Aborting and saving the final best model. "
f"Encountered exception: {str(e)}"
f"Encountered exception: {repr(e)}"
)
+ "\n"
)
raise e
finally:
finalize_logger()
if optimizer.averages:
nlp.use_params(optimizer.averages)
if output_path is not None:
final_model_path = output_path / DIR_MODEL_LAST
nlp.to_disk(final_model_path)
# This will only run if we don't hit an error
stdout.write(
msg.good("Saved pipeline to output directory", final_model_path) + "\n"
)
return (nlp, final_model_path)
else:
return (nlp, None)
save_checkpoint(False)
# This will only run if we didn't hit an error
if optimizer.averages:
nlp.use_params(optimizer.averages)
if output_path is not None:
stdout.write(
msg.good("Saved pipeline to output directory", output_path / DIR_MODEL_LAST)
+ "\n"
)
return (nlp, output_path / DIR_MODEL_LAST)
else:
return (nlp, None)
def train_while_improving(
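
The `save_checkpoint` closure in the `train` hunk above always writes the latest model and, for a best checkpoint, copies that directory instead of serializing a second time. The following is a minimal stdlib-only sketch of that pattern, assuming a text file as a stand-in for `nlp.to_disk`; the directory names, the `payload` argument and the explicit `output_path` parameter are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed directory names, used here only for illustration
DIR_MODEL_BEST = "model-best"
DIR_MODEL_LAST = "model-last"


def save_checkpoint(output_path: Path, payload: str, is_best: bool) -> None:
    """Write the latest checkpoint; copy it to the best directory if requested."""
    last_dir = output_path / DIR_MODEL_LAST
    last_dir.mkdir(parents=True, exist_ok=True)
    # Stand-in for the (expensive) serialization step
    (last_dir / "weights.txt").write_text(payload)
    if is_best:
        best_dir = output_path / DIR_MODEL_BEST
        # copytree expects the destination not to exist, so clear it first
        if best_dir.exists():
            shutil.rmtree(best_dir)
        shutil.copytree(last_dir, best_dir)


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    save_checkpoint(out, "step-1", is_best=True)
    save_checkpoint(out, "step-2", is_best=False)
    print((out / DIR_MODEL_LAST / "weights.txt").read_text())  # step-2
    print((out / DIR_MODEL_BEST / "weights.txt").read_text())  # step-1
```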

View File

@ -10,7 +10,7 @@ from wasabi import Printer
from .example import Example
from ..tokens import Doc
from ..schemas import ConfigSchemaTraining, ConfigSchemaPretrain
from ..schemas import ConfigSchemaPretrain
from ..util import registry, load_model_from_config, dot_to_object
@ -30,7 +30,6 @@ def pretrain(
set_gpu_allocator(allocator)
nlp = load_model_from_config(config)
_config = nlp.config.interpolate()
T = registry.resolve(_config["training"], schema=ConfigSchemaTraining)
P = registry.resolve(_config["pretraining"], schema=ConfigSchemaPretrain)
corpus = dot_to_object(_config, P["corpus"])
corpus = registry.resolve({"corpus": corpus})["corpus"]

View File

@ -69,7 +69,7 @@ CONFIG_SECTION_ORDER = ["paths", "variables", "system", "nlp", "components", "co
logger = logging.getLogger("spacy")
logger_stream_handler = logging.StreamHandler()
logger_stream_handler.setFormatter(logging.Formatter('%(message)s'))
logger_stream_handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(logger_stream_handler)

View File

@ -164,7 +164,7 @@ cdef class Vocab:
if len(string) < 3 or self.length < 10000:
mem = self.mem
cdef bint is_oov = mem is not self.mem
lex = <LexemeC*>mem.alloc(sizeof(LexemeC), 1)
lex = <LexemeC*>mem.alloc(1, sizeof(LexemeC))
lex.orth = self.strings.add(string)
lex.length = len(string)
if self.vectors is not None:

View File

@ -5,6 +5,7 @@ source: spacy/ml/models
menu:
- ['Tok2Vec', 'tok2vec-arch']
- ['Transformers', 'transformers']
- ['Pretraining', 'pretrain']
- ['Parser & NER', 'parser']
- ['Tagging', 'tagger']
- ['Text Classification', 'textcat']
@ -25,20 +26,20 @@ usage documentation on
## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"}
### spacy.Tok2Vec.v1 {#Tok2Vec}
### spacy.Tok2Vec.v2 {#Tok2Vec}
> #### Example config
>
> ```ini
> [model]
> @architectures = "spacy.Tok2Vec.v1"
> @architectures = "spacy.Tok2Vec.v2"
>
> [model.embed]
> @architectures = "spacy.CharacterEmbed.v1"
> # ...
>
> [model.encode]
> @architectures = "spacy.MaxoutWindowEncoder.v1"
> @architectures = "spacy.MaxoutWindowEncoder.v2"
> # ...
> ```
@ -196,13 +197,13 @@ network to construct a single vector to represent the information.
| `nC` | The number of UTF-8 bytes to embed per word. Recommended values are between `3` and `8`, although it may depend on the length of words in the language. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder}
### spacy.MaxoutWindowEncoder.v2 {#MaxoutWindowEncoder}
> #### Example config
>
> ```ini
> [model]
> @architectures = "spacy.MaxoutWindowEncoder.v1"
> @architectures = "spacy.MaxoutWindowEncoder.v2"
> width = 128
> window_size = 1
> maxout_pieces = 3
@ -220,13 +221,13 @@ and residual connections.
| `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
### spacy.MishWindowEncoder.v1 {#MishWindowEncoder}
### spacy.MishWindowEncoder.v2 {#MishWindowEncoder}
> #### Example config
>
> ```ini
> [model]
> @architectures = "spacy.MishWindowEncoder.v1"
> @architectures = "spacy.MishWindowEncoder.v2"
> width = 64
> window_size = 1
> depth = 4
@ -251,19 +252,19 @@ and residual connections.
> [model]
> @architectures = "spacy.TorchBiLSTMEncoder.v1"
> width = 64
> window_size = 1
> depth = 4
> depth = 2
> dropout = 0.0
> ```
Encode context using bidirectional LSTM layers. Requires
[PyTorch](https://pytorch.org).
| Name | Description |
| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `width` | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. ~~int~~ |
| `window_size` | The number of words to concatenate around each token to construct the convolution. Recommended value is `1`. ~~int~~ |
| `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
| Name | Description |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `width` | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. ~~int~~ |
| `depth` | The number of recurrent layers, for instance `depth=2` results in stacking two LSTMs together. ~~int~~ |
| `dropout` | Creates a Dropout layer on the outputs of each LSTM layer except the last layer. Set to 0.0 to disable this functionality. ~~float~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
### spacy.StaticVectors.v1 {#StaticVectors}
@ -426,6 +427,71 @@ one component.
| `grad_factor` | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. ~~float~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
## Pretraining architectures {#pretrain source="spacy/ml/models/multi_task.py"}
The `spacy pretrain` command lets you initialize a `Tok2Vec` layer in your
pipeline with information from raw text. To this end, additional layers are
added to build a network for a temporary task that forces the `Tok2Vec` layer to
learn something about sentence structure and word co-occurrence statistics. Two
pretraining objectives are available, both of which are variants of the cloze
task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced for
BERT.
For more information, see the section on
[pretraining](/usage/embeddings-transformers#pretraining).
### spacy.PretrainVectors.v1 {#pretrain_vectors}
> #### Example config
>
> ```ini
> [pretraining]
> component = "tok2vec"
> ...
>
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```
Predict the word's vector from a static embeddings table as pretraining
objective for a Tok2Vec layer.
| Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
| `loss` | The loss function can be either "cosine" or "L2". We typically recommend using "cosine". ~~str~~ |
| **CREATES** | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
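
To make the difference between the two `loss` options concrete, here is a rough sketch of a cosine distance and a squared L2 distance between a predicted and a target vector. It is not the loss code used by spaCy or Thinc; the function names are hypothetical and the inputs are assumed to be non-zero vectors of equal length.

```python
import math
from typing import Sequence


def cosine_distance(pred: Sequence[float], target: Sequence[float]) -> float:
    """1 - cosine similarity: compares direction only, ignoring vector length."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm


def l2_distance(pred: Sequence[float], target: Sequence[float]) -> float:
    """Squared Euclidean distance: sensitive to both direction and length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


pred = [2.0, 0.0]
target = [1.0, 0.0]
print(cosine_distance(pred, target))  # 0.0
print(l2_distance(pred, target))      # 1.0
```

The prediction points in the same direction as the target but is twice as long, so the cosine variant reports no error while the L2 variant does.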
### spacy.PretrainCharacters.v1 {#pretrain_chars}
> #### Example config
>
> ```ini
> [pretraining]
> component = "tok2vec"
> ...
>
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```
Predict some number of leading and trailing UTF-8 bytes as pretraining objective
for a Tok2Vec layer.
| Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
| `n_characters` | The window of characters - e.g. if `n_characters = 2`, the model will try to predict the first two and last two characters of the word. ~~int~~ |
| **CREATES** | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
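
As an illustration of what `n_characters` selects, the sketch below extracts the leading and trailing UTF-8 bytes of a word. It assumes the window is counted in bytes, as the description above states; `affix_bytes` is a hypothetical helper and not part of spaCy.

```python
from typing import Tuple


def affix_bytes(word: str, n_characters: int) -> Tuple[bytes, bytes]:
    """Return the first and last `n_characters` UTF-8 bytes of `word`."""
    if n_characters < 1:
        raise ValueError("n_characters must be at least 1")
    data = word.encode("utf8")
    return data[:n_characters], data[-n_characters:]


print(affix_bytes("pretraining", 2))  # (b'pr', b'ng')
# "é" takes two bytes in UTF-8, so the trailing window covers only that character
print(affix_bytes("café", 2))  # (b'ca', b'\xc3\xa9')
```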
## Parser & NER architectures {#parser}
### spacy.TransitionBasedParser.v2 {#TransitionBasedParser source="spacy/ml/models/parser.py"}
@ -534,7 +600,7 @@ specific data and challenge.
> no_output_layer = false
>
> [model.tok2vec]
> @architectures = "spacy.Tok2Vec.v1"
> @architectures = "spacy.Tok2Vec.v2"
>
> [model.tok2vec.embed]
> @architectures = "spacy.MultiHashEmbed.v1"
@ -544,7 +610,7 @@ specific data and challenge.
> include_static_vectors = false
>
> [model.tok2vec.encode]
> @architectures = "spacy.MaxoutWindowEncoder.v1"
> @architectures = "spacy.MaxoutWindowEncoder.v2"
> width = ${model.tok2vec.embed.width}
> window_size = 1
> maxout_pieces = 3

View File

@ -61,20 +61,27 @@ markup to copy-paste into
[GitHub issues](https://github.com/explosion/spaCy/issues).
```cli
$ python -m spacy info [--markdown] [--silent]
$ python -m spacy info [--markdown] [--silent] [--exclude]
```
> #### Example
>
> ```cli
> $ python -m spacy info en_core_web_lg --markdown
> ```
```cli
$ python -m spacy info [model] [--markdown] [--silent]
$ python -m spacy info [model] [--markdown] [--silent] [--exclude]
```
| Name | Description |
| ------------------------------------------------ | ----------------------------------------------------------------------------------------- |
| `model` | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ |
| `--markdown`, `-md` | Print information as Markdown. ~~bool (flag)~~ |
| `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | Don't print anything, just return the values. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **PRINTS** | Information about your spaCy installation. |
| Name | Description |
| ------------------------------------------------ | --------------------------------------------------------------------------------------------- |
| `model` | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ |
| `--markdown`, `-md` | Print information as Markdown. ~~bool (flag)~~ |
| `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | Don't print anything, just return the values. ~~bool (flag)~~ |
| `--exclude`, `-e` | Comma-separated keys to exclude from the print-out. Defaults to `"labels"`. ~~Optional[str]~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **PRINTS** | Information about your spaCy installation. |
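
The effect of a comma-separated `--exclude` value can be pictured with a small sketch. This is not the code behind `spacy info`; `filter_info` and the sample dictionary are hypothetical and only show how such a string could be split into keys and applied.

```python
from typing import Any, Dict


def filter_info(info: Dict[str, Any], exclude: str = "labels") -> Dict[str, Any]:
    """Drop the comma-separated keys listed in `exclude` from `info`."""
    excluded = {key.strip() for key in exclude.split(",") if key.strip()}
    return {key: value for key, value in info.items() if key not in excluded}


info = {"lang": "en", "name": "core_web_lg", "labels": {"ner": ["ORG"]}}
print(filter_info(info))                  # {'lang': 'en', 'name': 'core_web_lg'}
print(filter_info(info, "name, labels"))  # {'lang': 'en'}
```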
## validate {#validate new="2" tag="command"}
@ -121,7 +128,7 @@ customize those settings in your config file later.
> ```
```cli
$ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [--gpu] [--pretraining]
$ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [--gpu] [--pretraining] [--force]
```
| Name | Description |
@ -132,6 +139,7 @@ $ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [
| `--optimize`, `-o` | `"efficiency"` or `"accuracy"`. Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters. Defaults to `"efficiency"`. ~~str (option)~~ |
| `--gpu`, `-G` | Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters. ~~bool (flag)~~ |
| `--pretraining`, `-pt` | Include config for pretraining (with [`spacy pretrain`](/api/cli#pretrain)). Defaults to `False`. ~~bool (flag)~~ |
| `--force`, `-f` | Force overwriting the output file if it already exists. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **CREATES** | The config file for training. |
@ -783,6 +791,12 @@ in the section `[paths]`.
</Infobox>
> #### Example
>
> ```cli
> $ python -m spacy train config.cfg --output ./output --paths.train ./train --paths.dev ./dev
> ```
```cli
$ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id] [overrides]
```
@ -801,15 +815,16 @@ $ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id]
## pretrain {#pretrain new="2.1" tag="command,experimental"}
Pretrain the "token to vector" ([`Tok2vec`](/api/tok2vec)) layer of pipeline
components on [raw text](/api/data-formats#pretrain), using an approximate
language-modeling objective. Specifically, we load pretrained vectors, and train
a component like a CNN, BiLSTM, etc to predict vectors which match the
pretrained ones. The weights are saved to a directory after each epoch. You can
then include a **path to one of these pretrained weights files** in your
components on raw text, using an approximate language-modeling objective.
Specifically, we load pretrained vectors, and train a component like a CNN,
BiLSTM, etc. to predict vectors which match the pretrained ones. The weights are
saved to a directory after each epoch. You can then include a **path to one of
these pretrained weights files** in your
[training config](/usage/training#config) as the `init_tok2vec` setting when you
train your pipeline. This technique may be especially helpful if you have little
labelled data. See the usage docs on
[pretraining](/usage/embeddings-transformers#pretraining) for more info.
[pretraining](/usage/embeddings-transformers#pretraining) for more info. To read
the raw text, a [`JsonlCorpus`](/api/top-level#jsonlcorpus) is typically used.
<Infobox title="Changed in v3.0" variant="warning">
@ -823,6 +838,12 @@ auto-generated by setting `--pretraining` on
</Infobox>
> #### Example
>
> ```cli
> $ python -m spacy pretrain config.cfg ./output_pretrain --paths.raw_text ./data.jsonl
> ```
```cli
$ python -m spacy pretrain [config_path] [output_dir] [--code] [--resume-path] [--epoch-resume] [--gpu-id] [overrides]
```
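
A raw text file such as the `./data.jsonl` passed in the example can be produced with the standard library. The sketch below assumes one JSON object per line with a `"text"` key; the helper names and sample sentences are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_raw_text(path: Path, texts: Iterable[str]) -> None:
    """Write one JSON object per line, each holding a single "text" entry."""
    with path.open("w", encoding="utf8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


def read_raw_text(path: Path) -> List[str]:
    """Read the texts back, skipping empty lines."""
    with path.open("r", encoding="utf8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "data.jsonl"
    write_raw_text(data_path, ["This is a sentence.", "Raw text needs no annotations."])
    print(read_raw_text(data_path))
```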

View File

@ -94,7 +94,7 @@ Defines the `nlp` object, its tokenizer and
>
> [components.textcat.model]
> @architectures = "spacy.TextCatBOW.v1"
> exclusive_classes = false
> exclusive_classes = true
> ngram_size = 1
> no_output_layer = false
> ```
@ -148,7 +148,7 @@ This section defines a **dictionary** mapping of string keys to functions. Each
function takes an `nlp` object and yields [`Example`](/api/example) objects. By
default, the two keys `train` and `dev` are specified and each refer to a
[`Corpus`](/api/top-level#Corpus). When pretraining, an additional `pretrain`
section is added that defaults to a [`JsonlCorpus`](/api/top-level#JsonlCorpus).
section is added that defaults to a [`JsonlCorpus`](/api/top-level#jsonlcorpus).
You can also register custom functions that return a callable.
| Name | Description |

View File

@ -0,0 +1,454 @@
---
title: Multi-label TextCategorizer
tag: class
source: spacy/pipeline/textcat_multilabel.py
new: 3
teaser: 'Pipeline component for multi-label text classification'
api_base_class: /api/pipe
api_string_name: textcat_multilabel
api_trainable: true
---
The text categorizer predicts **categories over a whole document**. It
learns non-mutually exclusive labels, which means that zero or more labels
may be true per document.
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
[`config.cfg` for training](/usage/training#config). See the
[model architectures](/api/architectures) documentation for details on the
architectures and their arguments and hyperparameters.
> #### Example
>
> ```python
> from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL
> config = {
> "threshold": 0.5,
> "model": DEFAULT_MULTI_TEXTCAT_MODEL,
> }
> nlp.add_pipe("textcat_multilabel", config=config)
> ```
| Setting | Description |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
```python
%%GITHUB_SPACY/spacy/pipeline/textcat_multilabel.py
```
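
Because the labels are not mutually exclusive, applying the `threshold` to the per-label scores can yield any number of positive labels, including none. A minimal sketch, assuming scores in a `doc.cats`-style dictionary and a `>=` comparison (both assumptions); `positive_labels` is a hypothetical helper.

```python
from typing import Dict, List


def positive_labels(cats: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every label whose score reaches the threshold (possibly none)."""
    return [label for label, score in cats.items() if score >= threshold]


print(positive_labels({"SPORTS": 0.9, "POLITICS": 0.7, "TECH": 0.1}))  # ['SPORTS', 'POLITICS']
print(positive_labels({"SPORTS": 0.2, "POLITICS": 0.3, "TECH": 0.1}))  # []
```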
## MultiLabel_TextCategorizer.\_\_init\_\_ {#init tag="method"}
> #### Example
>
> ```python
> # Construction via add_pipe with default model
> textcat = nlp.add_pipe("textcat_multilabel")
>
> # Construction via add_pipe with custom model
> config = {"model": {"@architectures": "my_textcat"}}
> textcat = nlp.add_pipe("textcat_multilabel", config=config)
>
> # Construction from class
> from spacy.pipeline import MultiLabel_TextCategorizer
> textcat = MultiLabel_TextCategorizer(nlp.vocab, model, threshold=0.5)
> ```
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).
| Name | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
| _keyword-only_ | |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
## MultiLabel_TextCategorizer.\_\_call\_\_ {#call tag="method"}
Apply the pipe to one document. The document is modified in place, and returned.
This usually happens under the hood when the `nlp` object is called on a text
and all pipeline components are applied to the `Doc` in order. Both
[`__call__`](/api/multilabel_textcategorizer#call) and [`pipe`](/api/multilabel_textcategorizer#pipe)
delegate to the [`predict`](/api/multilabel_textcategorizer#predict) and
[`set_annotations`](/api/multilabel_textcategorizer#set_annotations) methods.
> #### Example
>
> ```python
> doc = nlp("This is a sentence.")
> textcat = nlp.add_pipe("textcat_multilabel")
> # This usually happens under the hood
> processed = textcat(doc)
> ```
| Name | Description |
| ----------- | -------------------------------- |
| `doc` | The document to process. ~~Doc~~ |
| **RETURNS** | The processed document. ~~Doc~~ |
## MultiLabel_TextCategorizer.pipe {#pipe tag="method"}
Apply the pipe to a stream of documents. This usually happens under the hood
when the `nlp` object is called on a text and all pipeline components are
applied to the `Doc` in order. Both [`__call__`](/api/multilabel_textcategorizer#call) and
[`pipe`](/api/multilabel_textcategorizer#pipe) delegate to the
[`predict`](/api/multilabel_textcategorizer#predict) and
[`set_annotations`](/api/multilabel_textcategorizer#set_annotations) methods.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> for doc in textcat.pipe(docs, batch_size=50):
> pass
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------- |
| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
| _keyword-only_ | |
| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
| **YIELDS** | The processed documents in order. ~~Doc~~ |
## MultiLabel_TextCategorizer.initialize {#initialize tag="method" new="3"}
Initialize the component for training. `get_examples` should be a function that
returns an iterable of [`Example`](/api/example) objects. The data examples are
used to **initialize the model** of the component and can either be the full
training data or a representative sample. Initialization includes validating the
network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data. This method is typically called
by [`Language.initialize`](/api/language#initialize) and lets you customize
arguments it receives via the
[`[initialize.components]`](/api/data-formats#config-initialize) block in the
config.
<Infobox variant="warning" title="Changed in v3.0" id="begin_training">
This method was previously called `begin_training`.
</Infobox>
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.initialize(lambda: [], nlp=nlp)
> ```
>
> ```ini
> ### config.cfg
> [initialize.components.textcat_multilabel]
>
> [initialize.components.textcat_multilabel.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/textcat.json"
> ```
| Name | Description |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
| _keyword-only_ | |
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |
## MultiLabel_TextCategorizer.predict {#predict tag="method"}
Apply the component's model to a batch of [`Doc`](/api/doc) objects without
modifying them.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict([doc1, doc2])
> ```
| Name | Description |
| ----------- | ------------------------------------------- |
| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
| **RETURNS** | The model's prediction for each document. |
## MultiLabel_TextCategorizer.set_annotations {#set_annotations tag="method"}
Modify a batch of [`Doc`](/api/doc) objects using pre-computed scores.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict(docs)
> textcat.set_annotations(docs, scores)
> ```
| Name | Description |
| -------- | --------------------------------------------------------- |
| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
| `scores` | The scores to set, produced by `MultiLabel_TextCategorizer.predict`. |
## MultiLabel_TextCategorizer.update {#update tag="method"}
Learn from a batch of [`Example`](/api/example) objects containing the
predictions and gold-standard annotations, and update the component's model.
Delegates to [`predict`](/api/multilabel_textcategorizer#predict) and
[`get_loss`](/api/multilabel_textcategorizer#get_loss).
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = nlp.initialize()
> losses = textcat.update(examples, sgd=optimizer)
> ```
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
## MultiLabel_TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"}
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model to try to address
the "catastrophic forgetting" problem. This feature is experimental.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = nlp.resume_training()
> losses = textcat.rehearse(examples, sgd=optimizer)
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
## MultiLabel_TextCategorizer.get_loss {#get_loss tag="method"}
Find the loss and gradient of loss for the batch of documents and their
predicted scores.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict([eg.predicted for eg in examples])
> loss, d_loss = textcat.get_loss(examples, scores)
> ```
| Name | Description |
| ----------- | --------------------------------------------------------------------------- |
| `examples` | The batch of examples. ~~Iterable[Example]~~ |
| `scores` | Scores representing the model's predictions. |
| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |
## MultiLabel_TextCategorizer.score {#score tag="method" new="3"}
Score a batch of examples.
> #### Example
>
> ```python
> scores = textcat.score(examples)
> ```
| Name | Description |
| ---------------- | -------------------------------------------------------------------------------------------------------------------- |
| `examples` | The examples to score. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |
## MultiLabel_TextCategorizer.create_optimizer {#create_optimizer tag="method"}
Create an optimizer for the pipeline component.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = textcat.create_optimizer()
> ```
| Name | Description |
| ----------- | ---------------------------- |
| **RETURNS** | The optimizer. ~~Optimizer~~ |
## MultiLabel_TextCategorizer.use_params {#use_params tag="method, contextmanager"}
Modify the pipe's model to use the given parameter values.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> with textcat.use_params(optimizer.averages):
> textcat.to_disk("/best_model")
> ```
| Name | Description |
| -------- | -------------------------------------------------- |
| `params` | The parameter values to use in the model. ~~dict~~ |
## MultiLabel_TextCategorizer.add_label {#add_label tag="method"}
Add a new label to the pipe. Raises an error if the output dimension is already
set, or if the model has already been fully [initialized](#initialize). Note
that you don't have to call this method if you provide a **representative data
sample** to the [`initialize`](#initialize) method. In this case, all labels
found in the sample will be automatically added to the model, and the output
dimension will be [inferred](/usage/layers-architectures#thinc-shape-inference)
automatically.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.add_label("MY_LABEL")
> ```
| Name | Description |
| ----------- | ----------------------------------------------------------- |
| `label` | The label to add. ~~str~~ |
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
## MultiLabel_TextCategorizer.to_disk {#to_disk tag="method"}
Serialize the pipe to disk.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.to_disk("/path/to/textcat")
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
## MultiLabel_TextCategorizer.from_disk {#from_disk tag="method"}
Load the pipe from disk. Modifies the object in place and returns it.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.from_disk("/path/to/textcat")
> ```
| Name | Description |
| -------------- | ----------------------------------------------------------------------------------------------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The modified `MultiLabel_TextCategorizer` object. ~~MultiLabel_TextCategorizer~~ |
## MultiLabel_TextCategorizer.to_bytes {#to_bytes tag="method"}
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat_bytes = textcat.to_bytes()
> ```
Serialize the pipe to a bytestring.
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------- |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The serialized form of the `MultiLabel_TextCategorizer` object. ~~bytes~~ |
## MultiLabel_TextCategorizer.from_bytes {#from_bytes tag="method"}
Load the pipe from a bytestring. Modifies the object in place and returns it.
> #### Example
>
> ```python
> textcat_bytes = textcat.to_bytes()
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.from_bytes(textcat_bytes)
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------- |
| `bytes_data` | The data to load from. ~~bytes~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The `MultiLabel_TextCategorizer` object. ~~MultiLabel_TextCategorizer~~ |
## MultiLabel_TextCategorizer.labels {#labels tag="property"}
The labels currently added to the component.
> #### Example
>
> ```python
> textcat.add_label("MY_LABEL")
> assert "MY_LABEL" in textcat.labels
> ```
| Name | Description |
| ----------- | ------------------------------------------------------ |
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
## MultiLabel_TextCategorizer.label_data {#label_data tag="property" new="3"}
The labels currently added to the component and their internal meta information.
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
[`MultiLabel_TextCategorizer.initialize`](/api/multilabel_textcategorizer#initialize) to initialize
the model with a pre-defined label set.
> #### Example
>
> ```python
> labels = textcat.label_data
> textcat.initialize(lambda: [], nlp=nlp, labels=labels)
> ```
| Name | Description |
| ----------- | ---------------------------------------------------------- |
| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> textcat.to_disk("/path", exclude=["vocab"])
> ```
| Name | Description |
| ------- | -------------------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |


@ -3,17 +3,15 @@ title: TextCategorizer
tag: class
source: spacy/pipeline/textcat.py
new: 2
teaser: 'Pipeline component for text classification'
teaser: 'Pipeline component for single-label text classification'
api_base_class: /api/pipe
api_string_name: textcat
api_trainable: true
---
The text categorizer predicts **categories over a whole document**. It can learn
one or more labels, and the labels can be mutually exclusive (i.e. one true
label per document) or non-mutually exclusive (i.e. zero or more labels may be
true per document). The multi-label setting is controlled by the model instance
that's provided.
one or more labels, and the labels are mutually exclusive - there is exactly one
true label per document.
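To make "mutually exclusive" concrete, the following is a minimal sketch in plain Python, not spaCy's implementation. It assumes the raw per-label scores are normalized with a softmax, which is a common choice for exclusive classes; the label names and score values are hypothetical.

```python
import math
from typing import Dict


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize raw per-label scores so that they sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_score = max(scores.values())
    exps = {label: math.exp(score - max_score) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical raw scores for one document.
raw_scores = {"SPORTS": 2.0, "POLITICS": 0.5, "TECH": -1.0}
cats = softmax(raw_scores)

# With exclusive labels, the prediction is the single highest-scoring label.
predicted = max(cats, key=cats.get)
```

Because the normalized scores compete for a total of 1, raising one label's score necessarily lowers the others. In a multi-label setting each label would instead be scored independently.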
## Config and implementation {#config}
@ -27,10 +25,10 @@ architectures and their arguments and hyperparameters.
> #### Example
>
> ```python
> from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL
> from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
> config = {
> "threshold": 0.5,
> "model": DEFAULT_TEXTCAT_MODEL,
> "model": DEFAULT_SINGLE_TEXTCAT_MODEL,
> }
> nlp.add_pipe("textcat", config=config)
> ```
@ -280,7 +278,6 @@ Score a batch of examples.
| ---------------- | -------------------------------------------------------------------------------------------------------------------- |
| `examples` | The examples to score. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `positive_label` | Optional positive label. ~~Optional[str]~~ |
| **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |
## TextCategorizer.create_optimizer {#create_optimizer tag="method"}


@ -129,13 +129,13 @@ the entity recognizer, use a
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
[components.ner]
factory = "ner"
@ -161,13 +161,13 @@ factory = "ner"
@architectures = "spacy.TransitionBasedParser.v1"
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"
[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
```
<!-- TODO: Once rehearsal is tested, mention it here. -->
@ -713,34 +713,39 @@ layer = "tok2vec"
#### Pretraining objectives {#pretraining-details}
Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.
> ```ini
> ### Characters objective
> [pretraining.objective]
> type = "characters"
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```
>
> ```ini
> ### Vectors objective
> [pretraining.objective]
> type = "vectors"
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```
- **Characters:** The `"characters"` objective asks the model to predict some
number of leading and trailing UTF-8 bytes for the words. For instance,
setting `n_characters = 2`, the model will try to predict the first two and
last two characters of the word.
Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.
- **Vectors:** The `"vectors"` objective asks the model to predict the word's
vector, from a static embeddings table. This requires a word vectors model to
be trained and loaded. The vectors objective can optimize either a cosine or
an L2 loss. We've generally found cosine loss to perform better.
- [`PretrainCharacters`](/api/architectures#pretrain_chars): The `"characters"`
objective asks the model to predict some number of leading and trailing UTF-8
bytes for the words. For instance, setting `n_characters = 2`, the model will
try to predict the first two and last two characters of the word.
- [`PretrainVectors`](/api/architectures#pretrain_vectors): The `"vectors"`
objective asks the model to predict the word's vector, from a static
embeddings table. This requires a word vectors model to be trained and loaded.
The vectors objective can optimize either a cosine or an L2 loss. We've
generally found cosine loss to perform better.
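
As a rough illustration of the `"characters"` objective, the sketch below computes the prediction targets for a single word. It is a simplified assumption about what the targets look like, not spaCy's implementation: the helper name `character_targets` is hypothetical, and padding for words shorter than `n_characters` is not handled.

```python
from typing import Tuple


def character_targets(word: str, n_characters: int) -> Tuple[bytes, bytes]:
    """Return the leading and trailing UTF-8 bytes the model would predict."""
    if n_characters < 1:
        raise ValueError("n_characters must be at least 1")
    encoded = word.encode("utf8")
    # Slices are simply shorter when the word has fewer bytes than requested.
    return encoded[:n_characters], encoded[-n_characters:]


leading, trailing = character_targets("pretraining", 2)
```

With `n_characters = 2`, the targets for `"pretraining"` are the bytes for `pr` and `ng`. For non-ASCII words, a byte slice may cut through a multi-byte character.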
These pretraining objectives use a trick that we term **language modelling with
approximate outputs (LMAO)**. The motivation for the trick is that predicting an


@ -134,7 +134,7 @@ labels = []
nO = null
[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"
[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
@ -144,7 +144,7 @@ attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false
[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = ${components.textcat.model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3
@ -152,7 +152,7 @@ depth = 2
[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
exclusive_classes = true
ngram_size = 1
no_output_layer = false
```
@ -170,7 +170,7 @@ labels = []
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null
@ -201,14 +201,14 @@ tokens, and their combination forms a typical
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
# ...
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
# ...
```
@ -224,7 +224,7 @@ architecture:
# ...
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
# ...
```
@ -716,7 +716,7 @@ that we want to classify as being related or not. As these candidate pairs are
typically formed within one document, this function takes a [`Doc`](/api/doc) as
input and outputs a `List` of `Span` tuples. For instance, the following
implementation takes any two entities from the same document, as long as they
are within a **maximum distance** (in number of tokens) of eachother:
are within a **maximum distance** (in number of tokens) of each other:
> #### config.cfg (excerpt)
>
@ -742,7 +742,7 @@ def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]
return get_candidates
```
This function in added to the [`@misc` registry](/api/top-level#registry) so we
This function is added to the [`@misc` registry](/api/top-level#registry) so we
can refer to it from the config, and easily swap it out for any other candidate
generation function.


@ -1060,7 +1060,7 @@ In this example we assume a custom function `read_custom_data` which loads or
generates texts with relevant text classification annotations. Then, small
lexical variations of the input text are created before generating the final
[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
you register the function creating the custom reader in the `readers`
you register the function creating the custom reader in the `readers`
[registry](/api/top-level#registry) and assign it a string name, so it can be
used in your config. All arguments on the registered function become available
as **config settings** (in this case, `source`).