Merge branch 'develop' of https://github.com/explosion/spaCy into develop
Commit 97d5a7ba99
.github/contributors/bratao.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Bruno Souza Cabral |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 24/12/2020 |
| GitHub username | bratao |
| Website (optional) | |
.gitignore (vendored, 1 line changed)

@@ -51,6 +51,7 @@ env3.*/
.pypyenv
.pytest_cache/
.mypy_cache/
.hypothesis/

# Distribution / packaging
env/
@@ -35,7 +35,10 @@ def download_cli(


def download(model: str, direct: bool = False, *pip_args) -> None:
    if not (is_package("spacy") or is_package("spacy-nightly")) and "--no-deps" not in pip_args:
    if (
        not (is_package("spacy") or is_package("spacy-nightly"))
        and "--no-deps" not in pip_args
    ):
        msg.warn(
            "Skipping pipeline package dependencies and setting `--no-deps`. "
            "You don't seem to have the spaCy package itself installed "
@@ -172,7 +172,9 @@ def render_parses(
            file_.write(html)


def print_prf_per_type(msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str) -> None:
def print_prf_per_type(
    msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str
) -> None:
    data = [
        (k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}")
        for k, v in scores.items()
@@ -1,10 +1,10 @@
from typing import Optional, Dict, Any, Union
from typing import Optional, Dict, Any, Union, List
import platform
from pathlib import Path
from wasabi import Printer, MarkdownRenderer
import srsly

from ._util import app, Arg, Opt
from ._util import app, Arg, Opt, string_to_list
from .. import util
from .. import about

@@ -15,20 +15,22 @@ def info_cli(
    model: Optional[str] = Arg(None, help="Optional loadable spaCy pipeline"),
    markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues"),
    silent: bool = Opt(False, "--silent", "-s", "-S", help="Don't print anything (just return)"),
    exclude: Optional[str] = Opt("labels", "--exclude", "-e", help="Comma-separated keys to exclude from the print-out"),
    # fmt: on
):
    """
    Print info about spaCy installation. If a pipeline is speficied as an argument,
    Print info about spaCy installation. If a pipeline is specified as an argument,
    print its meta information. Flag --markdown prints details in Markdown for easy
    copy-pasting to GitHub issues.

    DOCS: https://nightly.spacy.io/api/cli#info
    """
    info(model, markdown=markdown, silent=silent)
    exclude = string_to_list(exclude)
    info(model, markdown=markdown, silent=silent, exclude=exclude)


def info(
    model: Optional[str] = None, *, markdown: bool = False, silent: bool = True
    model: Optional[str] = None, *, markdown: bool = False, silent: bool = True, exclude: List[str]
) -> Union[str, dict]:
    msg = Printer(no_print=silent, pretty=not silent)
    if model:

@@ -42,13 +44,13 @@ def info(
        data["Pipelines"] = ", ".join(
            f"{n} ({v})" for n, v in data["Pipelines"].items()
        )
    markdown_data = get_markdown(data, title=title)
    markdown_data = get_markdown(data, title=title, exclude=exclude)
    if markdown:
        if not silent:
            print(markdown_data)
        return markdown_data
    if not silent:
        table_data = dict(data)
        table_data = {k: v for k, v in data.items() if k not in exclude}
        msg.table(table_data, title=title)
    return raw_data

@@ -82,7 +84,7 @@ def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
    if util.is_package(model):
        model_path = util.get_package_path(model)
    else:
        model_path = model
        model_path = Path(model)
    meta_path = model_path / "meta.json"
    if not meta_path.is_file():
        msg.fail("Can't find pipeline meta.json", meta_path, exits=1)

@@ -96,7 +98,7 @@ def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
    }


def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str:
def get_markdown(data: Dict[str, Any], title: Optional[str] = None, exclude: List[str] = None) -> str:
    """Get data in GitHub-flavoured Markdown format for issues etc.

    data (dict or list of tuples): Label/value pairs.

@@ -108,8 +110,16 @@ def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str:
        md.add(md.title(2, title))
    items = []
    for key, value in data.items():
        if isinstance(value, str) and Path(value).exists():
        if exclude and key in exclude:
            continue
        if isinstance(value, str):
            try:
                existing_path = Path(value).exists()
            except:
                # invalid Path, like a URL string
                existing_path = False
            if existing_path:
                continue
        items.append(f"{md.bold(f'{key}:')} {value}")
    md.add(md.list(items))
    return f"\n{md.text}\n"
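For context, a minimal sketch of how the new `--exclude` option is meant to flow from the CLI into `info()` after this change. The import paths and the `"labels,vectors"` value are illustrative assumptions, not part of the commit:

```python
# Illustrative sketch only; assumes the spacy-nightly code as patched above.
from spacy.cli._util import string_to_list  # helper used by the CLI in this diff
from spacy.cli.info import info

exclude = string_to_list("labels,vectors")  # "--exclude labels,vectors" -> ["labels", "vectors"]
data = info(None, markdown=False, silent=True, exclude=exclude)
# Keys named in `exclude` are skipped both in the printed table and in the
# Markdown produced by get_markdown().
```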
@@ -32,6 +32,7 @@ def init_config_cli(
    optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
    gpu: bool = Opt(False, "--gpu", "-G", help="Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters."),
    pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
    force_overwrite: bool = Opt(False, "--force", "-F", help="Force overwriting the output file"),
    # fmt: on
):
    """

@@ -46,6 +47,12 @@ def init_config_cli(
    optimize = optimize.value
    pipeline = string_to_list(pipeline)
    is_stdout = str(output_file) == "-"
    if not is_stdout and output_file.exists() and not force_overwrite:
        msg = Printer()
        msg.fail(
            "The provided output file already exists. To force overwriting the config file, set the --force or -F flag.",
            exits=1,
        )
    config = init_config(
        lang=lang,
        pipeline=pipeline,

@@ -162,7 +169,7 @@ def init_config(
        "Hardware": variables["hardware"].upper(),
        "Transformer": template_vars.transformer.get("name", False),
    }
    msg.info("Generated template specific for your use case")
    msg.info("Generated config template specific for your use case")
    for label, value in use_case.items():
        msg.text(f"- {label}: {value}")
    with show_validation_error(hint_fill=False):
@@ -149,13 +149,44 @@ grad_factor = 1.0

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
exclusive_classes = true
ngram_size = 1
no_output_layer = false

{% else -%}
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
{%- endif %}
{%- endif %}

{% if "textcat_multilabel" in components %}
[components.textcat_multilabel]
factory = "textcat_multilabel"

{% if optimize == "accuracy" %}
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.textcat_multilabel.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false

{% else -%}
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false

@@ -174,7 +205,7 @@ no_output_layer = false
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"

@@ -189,7 +220,7 @@ rows = [5000, 2500]
include_static_vectors = {{ "true" if optimize == "accuracy" else "false" }}

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = {{ 96 if optimize == "efficiency" else 256 }}
depth = {{ 4 if optimize == "efficiency" else 8 }}
window_size = 1

@@ -288,13 +319,41 @@ width = ${components.tok2vec.model.encode.width}

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
exclusive_classes = true
ngram_size = 1
no_output_layer = false

{% else -%}
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
{%- endif %}
{%- endif %}

{% if "textcat_multilabel" in components %}
[components.textcat_multilabel]
factory = "textcat_multilabel"

{% if optimize == "accuracy" %}
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false

{% else -%}
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false

@@ -303,7 +362,7 @@ no_output_layer = false
{% endif %}

{% for pipe in components %}
{% if pipe not in ["tagger", "morphologizer", "parser", "ner", "textcat", "entity_linker"] %}
{% if pipe not in ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker"] %}
{# Other components defined by the user: we just assume they're factories #}
[components.{{ pipe }}]
factory = "{{ pipe }}"
@@ -463,6 +463,10 @@ class Errors:
            "issue tracker: http://github.com/explosion/spaCy/issues")

    # TODO: fix numbering after merging develop into master
    E895 = ("The 'textcat' component received gold-standard annotations with "
            "multiple labels per document. In spaCy 3 you should use the "
            "'textcat_multilabel' component for this instead. "
            "Example of an offending annotation: {value}")
    E896 = ("There was an error using the static vectors. Ensure that the vectors "
            "of the vocab are properly initialized, or set 'include_static_vectors' "
            "to False.")
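As a hedged illustration of what the new E895 message refers to (the label names below are made up, not from the commit): the single-label 'textcat' component now rejects gold annotations where more than one category is marked true on the same document.

```python
# Hypothetical gold annotation that would trigger E895 for the single-label
# "textcat" component: two categories are both set to 1.0 on one document.
offending_cats = {"POSITIVE": 1.0, "NEGATIVE": 1.0}
# Such data should be trained with the "textcat_multilabel" component instead.
```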
@@ -214,8 +214,22 @@ _macedonian_lower = r"ѓѕјљњќѐѝ"
_macedonian_upper = r"ЃЅЈЉЊЌЀЍ"
_macedonian = r"ѓѕјљњќѐѝЃЅЈЉЊЌЀЍ"

_upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper + _macedonian_upper
_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower + _macedonian_lower
_upper = (
    LATIN_UPPER
    + _russian_upper
    + _tatar_upper
    + _greek_upper
    + _ukrainian_upper
    + _macedonian_upper
)
_lower = (
    LATIN_LOWER
    + _russian_lower
    + _tatar_lower
    + _greek_lower
    + _ukrainian_lower
    + _macedonian_lower
)

_uncased = (
    _bengali

@@ -230,7 +244,9 @@ _uncased = (
    + _cjk
)

ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _macedonian + _uncased)
ALPHA = group_chars(
    LATIN + _russian + _tatar + _greek + _ukrainian + _macedonian + _uncased
)
ALPHA_LOWER = group_chars(_lower + _uncased)
ALPHA_UPPER = group_chars(_upper + _uncased)
@@ -1,18 +1,11 @@
from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP
from ...language import Language
from ...attrs import LANG
from .lex_attrs import LEX_ATTRS
from ...language import Language


class CzechDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "cs"
    tag_map = TAG_MAP
    stop_words = STOP_WORDS
    lex_attr_getters = LEX_ATTRS
    stop_words = STOP_WORDS


class Czech(Language):
(File diff suppressed because it is too large.)
@@ -14,7 +14,7 @@ class MacedonianLemmatizer(Lemmatizer):
        if univ_pos in ("", "eol", "space"):
            return [string.lower()]

        if string[-3:] == 'јќи':
        if string[-3:] == "јќи":
            string = string[:-3]
            univ_pos = "verb"

@@ -23,7 +23,13 @@ class MacedonianLemmatizer(Lemmatizer):
        index_table = self.lookups.get_table("lemma_index", {})
        exc_table = self.lookups.get_table("lemma_exc", {})
        rules_table = self.lookups.get_table("lemma_rules", {})
        if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))):
        if not any(
            (
                index_table.get(univ_pos),
                exc_table.get(univ_pos),
                rules_table.get(univ_pos),
            )
        ):
            if univ_pos == "propn":
                return [string]
            else:
@@ -1,21 +1,104 @@
from ...attrs import LIKE_NUM

_num_words = [
    "нула", "еден", "една", "едно", "два", "две", "три", "четири", "пет", "шест", "седум", "осум", "девет", "десет",
    "единаесет", "дванаесет", "тринаесет", "четиринаесет", "петнаесет", "шеснаесет", "седумнаесет", "осумнаесет",
    "деветнаесет", "дваесет", "триесет", "четириесет", "педесет", "шеесет", "седумдесет", "осумдесет", "деведесет",
    "сто", "двесте", "триста", "четиристотини", "петстотини", "шестотини", "седумстотини", "осумстотини",
    "деветстотини", "илјада", "илјади", 'милион', 'милиони', 'милијарда', 'милијарди', 'билион', 'билиони',

    "двајца", "тројца", "четворица", "петмина", "шестмина", "седуммина", "осуммина", "деветмина", "обата", "обајцата",

    "прв", "втор", "трет", "четврт", "седм", "осм", "двестоти",

    "два-три", "два-триесет", "два-триесетмина", "два-тринаесет", "два-тројца", "две-три", "две-тристотини",
    "пет-шеесет", "пет-шеесетмина", "пет-шеснаесетмина", "пет-шест", "пет-шестмина", "пет-шестотини", "петина",
    "осмина", "седум-осум", "седум-осумдесет", "седум-осуммина", "седум-осумнаесет", "седум-осумнаесетмина",
    "три-четириесет", "три-четиринаесет", "шеесет", "шеесетина", "шеесетмина", "шеснаесет", "шеснаесетмина",
    "шест-седум", "шест-седумдесет", "шест-седумнаесет", "шест-седумстотини", "шестоти", "шестотини"
    "нула",
    "еден",
    "една",
    "едно",
    "два",
    "две",
    "три",
    "четири",
    "пет",
    "шест",
    "седум",
    "осум",
    "девет",
    "десет",
    "единаесет",
    "дванаесет",
    "тринаесет",
    "четиринаесет",
    "петнаесет",
    "шеснаесет",
    "седумнаесет",
    "осумнаесет",
    "деветнаесет",
    "дваесет",
    "триесет",
    "четириесет",
    "педесет",
    "шеесет",
    "седумдесет",
    "осумдесет",
    "деведесет",
    "сто",
    "двесте",
    "триста",
    "четиристотини",
    "петстотини",
    "шестотини",
    "седумстотини",
    "осумстотини",
    "деветстотини",
    "илјада",
    "илјади",
    "милион",
    "милиони",
    "милијарда",
    "милијарди",
    "билион",
    "билиони",
    "двајца",
    "тројца",
    "четворица",
    "петмина",
    "шестмина",
    "седуммина",
    "осуммина",
    "деветмина",
    "обата",
    "обајцата",
    "прв",
    "втор",
    "трет",
    "четврт",
    "седм",
    "осм",
    "двестоти",
    "два-три",
    "два-триесет",
    "два-триесетмина",
    "два-тринаесет",
    "два-тројца",
    "две-три",
    "две-тристотини",
    "пет-шеесет",
    "пет-шеесетмина",
    "пет-шеснаесетмина",
    "пет-шест",
    "пет-шестмина",
    "пет-шестотини",
    "петина",
    "осмина",
    "седум-осум",
    "седум-осумдесет",
    "седум-осуммина",
    "седум-осумнаесет",
    "седум-осумнаесетмина",
    "три-четириесет",
    "три-четиринаесет",
    "шеесет",
    "шеесетина",
    "шеесетмина",
    "шеснаесет",
    "шеснаесетмина",
    "шест-седум",
    "шест-седумдесет",
    "шест-седумнаесет",
    "шест-седумстотини",
    "шестоти",
    "шестотини",
]
@@ -21,8 +21,7 @@ _abbr_exc = [
    {ORTH: "хл", NORM: "хектолитар"},
    {ORTH: "дкл", NORM: "декалитар"},
    {ORTH: "л", NORM: "литар"},
    {ORTH: "дл", NORM: "децилитар"}
    {ORTH: "дл", NORM: "децилитар"},
]
for abbr in _abbr_exc:
    _exc[abbr[ORTH]] = [abbr]

@@ -33,7 +32,6 @@ _abbr_line_exc = [
    {ORTH: "г-ѓа", NORM: "госпоѓа"},
    {ORTH: "г-ца", NORM: "госпоѓица"},
    {ORTH: "г-дин", NORM: "господин"},
]

for abbr in _abbr_line_exc:

@@ -54,7 +52,6 @@ _abbr_dot_exc = [
    {ORTH: "т.", NORM: "точка"},
    {ORTH: "т.е.", NORM: "то ест"},
    {ORTH: "т.н.", NORM: "таканаречен"},
    {ORTH: "бр.", NORM: "број"},
    {ORTH: "гр.", NORM: "град"},
    {ORTH: "др.", NORM: "другар"},

@@ -68,7 +65,6 @@ _abbr_dot_exc = [
    {ORTH: "с.", NORM: "страница"},
    {ORTH: "стр.", NORM: "страница"},
    {ORTH: "чл.", NORM: "член"},
    {ORTH: "арх.", NORM: "архитект"},
    {ORTH: "бел.", NORM: "белешка"},
    {ORTH: "гимн.", NORM: "гимназија"},

@@ -89,8 +85,6 @@ _abbr_dot_exc = [
    {ORTH: "истор.", NORM: "историја"},
    {ORTH: "геогр.", NORM: "географија"},
    {ORTH: "литер.", NORM: "литература"},
]

for abbr in _abbr_dot_exc:
@@ -45,7 +45,7 @@ _abbr_period_exc = [
    {ORTH: "Doç.", NORM: "doçent"},
    {ORTH: "doğ."},
    {ORTH: "Dr.", NORM: "doktor"},
    {ORTH: "dr.", NORM:"doktor"},
    {ORTH: "dr.", NORM: "doktor"},
    {ORTH: "drl.", NORM: "derleyen"},
    {ORTH: "Dz.", NORM: "deniz"},
    {ORTH: "Dz.K.K.lığı"},

@@ -118,7 +118,7 @@ _abbr_period_exc = [
    {ORTH: "Uzm.", NORM: "uzman"},
    {ORTH: "Üçvş.", NORM: "üstçavuş"},
    {ORTH: "Üni.", NORM: "üniversitesi"},
    {ORTH: "Ütğm.", NORM: "üsteğmen"},
    {ORTH: "Ütğm.", NORM: "üsteğmen"},
    {ORTH: "vb."},
    {ORTH: "vs.", NORM: "vesaire"},
    {ORTH: "Yard.", NORM: "yardımcı"},

@@ -163,19 +163,29 @@ for abbr in _abbr_exc:
    _exc[abbr[ORTH]] = [abbr]


_num = r"[+-]?\d+([,.]\d+)*"
_ord_num = r"(\d+\.)"
_date = r"(((\d{1,2}[./-]){2})?(\d{4})|(\d{1,2}[./]\d{1,2}(\.)?))"
_dash_num = r"(([{al}\d]+/\d+)|(\d+/[{al}]))".format(al=ALPHA)
_roman_num = "M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"
_roman_num = "M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"
_roman_ord = r"({rn})\.".format(rn=_roman_num)
_time_exp = r"\d+(:\d+)*"

_inflections = r"'[{al}]+".format(al=ALPHA_LOWER)
_abbrev_inflected = r"[{a}]+\.'[{al}]+".format(a=ALPHA, al=ALPHA_LOWER)

_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(d=_date, dn=_dash_num, te=_time_exp, on=_ord_num, n=_num, ro=_roman_ord, rn=_roman_num, inf=_inflections)
_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(
    d=_date,
    dn=_dash_num,
    te=_time_exp,
    on=_ord_num,
    n=_num,
    ro=_roman_ord,
    rn=_roman_num,
    inf=_inflections,
)

TOKENIZER_EXCEPTIONS = _exc
TOKEN_MATCH = re.compile(r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)).match
TOKEN_MATCH = re.compile(
    r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)
).match
@@ -1,4 +1,3 @@
import numpy
from thinc.api import Model

from ..attrs import LOWER
@@ -21,14 +21,14 @@ def transition_parser_v1(
    nO: Optional[int] = None,
) -> Model:
    return build_tb_parser_model(
        tok2vec,
        state_type,
        extra_state_tokens,
        hidden_width,
        maxout_pieces,
        use_upper,
        nO,
    )
        tok2vec,
        state_type,
        extra_state_tokens,
        hidden_width,
        maxout_pieces,
        use_upper,
        nO,
    )


@registry.architectures.register("spacy.TransitionBasedParser.v2")

@@ -42,14 +42,15 @@ def transition_parser_v2(
    nO: Optional[int] = None,
) -> Model:
    return build_tb_parser_model(
        tok2vec,
        state_type,
        extra_state_tokens,
        hidden_width,
        maxout_pieces,
        use_upper,
        nO,
    )
        tok2vec,
        state_type,
        extra_state_tokens,
        hidden_width,
        maxout_pieces,
        use_upper,
        nO,
    )


def build_tb_parser_model(
    tok2vec: Model[List[Doc], List[Floats2d]],

@@ -162,8 +163,8 @@ def _resize_upper(model, new_nO):
    # just adding rows here.
    if smaller.has_dim("nO"):
        old_nO = smaller.get_dim("nO")
        larger_W[: old_nO] = smaller_W
        larger_b[: old_nO] = smaller_b
        larger_W[:old_nO] = smaller_W
        larger_b[:old_nO] = smaller_b
        for i in range(old_nO, new_nO):
            model.attrs["unseen_classes"].add(i)
@@ -6,6 +6,7 @@ from thinc.api import chain, concatenate, clone, Dropout, ParametricAttention
from thinc.api import SparseLinear, Softmax, softmax_activation, Maxout, reduce_sum
from thinc.api import HashEmbed, with_array, with_cpu, uniqued
from thinc.api import Relu, residual, expand_window
from thinc.layers.chain import init as init_chain

from ...attrs import ID, ORTH, PREFIX, SUFFIX, SHAPE, LOWER
from ...util import registry

@@ -13,6 +14,7 @@ from ..extract_ngrams import extract_ngrams
from ..staticvectors import StaticVectors
from ..featureextractor import FeatureExtractor
from ...tokens import Doc
from .tok2vec import get_tok2vec_width


@registry.architectures.register("spacy.TextCatCNN.v1")

@@ -69,13 +71,16 @@ def build_text_classifier_v2(
    exclusive_classes = not linear_model.attrs["multi_label"]
    with Model.define_operators({">>": chain, "|": concatenate}):
        width = tok2vec.maybe_get_dim("nO")
        attention_layer = ParametricAttention(width)  # TODO: benchmark performance difference of this layer
        maxout_layer = Maxout(nO=width, nI=width)
        linear_layer = Linear(nO=nO, nI=width)
        cnn_model = (
            tok2vec
            >> list2ragged()
            >> ParametricAttention(width)  # TODO: benchmark performance difference of this layer
            >> attention_layer
            >> reduce_sum()
            >> residual(Maxout(nO=width, nI=width))
            >> Linear(nO=nO, nI=width)
            >> residual(maxout_layer)
            >> linear_layer
            >> Dropout(0.0)
        )

@@ -89,9 +94,25 @@ def build_text_classifier_v2(
    if model.has_dim("nO") is not False:
        model.set_dim("nO", nO)
    model.set_ref("output_layer", linear_model.get_ref("output_layer"))
    model.set_ref("attention_layer", attention_layer)
    model.set_ref("maxout_layer", maxout_layer)
    model.set_ref("linear_layer", linear_layer)
    model.attrs["multi_label"] = not exclusive_classes

    model.init = init_ensemble_textcat
    return model


def init_ensemble_textcat(model, X, Y) -> Model:
    tok2vec_width = get_tok2vec_width(model)
    model.get_ref("attention_layer").set_dim("nO", tok2vec_width)
    model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
    model.get_ref("maxout_layer").set_dim("nI", tok2vec_width)
    model.get_ref("linear_layer").set_dim("nI", tok2vec_width)
    init_chain(model, X, Y)
    return model


# TODO: move to legacy
@registry.architectures.register("spacy.TextCatEnsemble.v1")
def build_text_classifier_v1(
@@ -20,6 +20,17 @@ def tok2vec_listener_v1(width: int, upstream: str = "*"):
    return tok2vec


def get_tok2vec_width(model: Model):
    nO = None
    if model.has_ref("tok2vec"):
        tok2vec = model.get_ref("tok2vec")
        if tok2vec.has_dim("nO"):
            nO = tok2vec.get_dim("nO")
        elif tok2vec.has_ref("listener"):
            nO = tok2vec.get_ref("listener").get_dim("nO")
    return nO


@registry.architectures.register("spacy.HashEmbedCNN.v1")
def build_hash_embed_cnn_tok2vec(
    *,

@@ -76,6 +87,7 @@ def build_hash_embed_cnn_tok2vec(
    )


# TODO: archive
@registry.architectures.register("spacy.Tok2Vec.v1")
def build_Tok2Vec_model(
    embed: Model[List[Doc], List[Floats2d]],

@@ -97,6 +109,28 @@ def build_Tok2Vec_model(
    return tok2vec


@registry.architectures.register("spacy.Tok2Vec.v2")
def build_Tok2Vec_model(
    embed: Model[List[Doc], List[Floats2d]],
    encode: Model[List[Floats2d], List[Floats2d]],
) -> Model[List[Doc], List[Floats2d]]:
    """Construct a tok2vec model out of embedding and encoding subnetworks.
    See https://explosion.ai/blog/deep-learning-formula-nlp

    embed (Model[List[Doc], List[Floats2d]]): Embed tokens into context-independent
        word vector representations.
    encode (Model[List[Floats2d], List[Floats2d]]): Encode context into the
        embeddings, using an architecture such as a CNN, BiLSTM or transformer.
    """
    tok2vec = chain(embed, encode)
    tok2vec.set_dim("nO", encode.get_dim("nO"))
    tok2vec.set_ref("embed", embed)
    tok2vec.set_ref("encode", encode)
    return tok2vec


@registry.architectures.register("spacy.MultiHashEmbed.v1")
def MultiHashEmbed(
    width: int,

@@ -244,6 +278,7 @@ def CharacterEmbed(
    return model


# TODO: archive
@registry.architectures.register("spacy.MaxoutWindowEncoder.v1")
def MaxoutWindowEncoder(
    width: int, window_size: int, maxout_pieces: int, depth: int

@@ -275,7 +310,39 @@ def MaxoutWindowEncoder(
    model.attrs["receptive_field"] = window_size * depth
    return model

@registry.architectures.register("spacy.MaxoutWindowEncoder.v2")
def MaxoutWindowEncoder(
    width: int, window_size: int, maxout_pieces: int, depth: int
) -> Model[List[Floats2d], List[Floats2d]]:
    """Encode context using convolutions with maxout activation, layer
    normalization and residual connections.

    width (int): The input and output width. These are required to be the same,
        to allow residual connections. This value will be determined by the
        width of the inputs. Recommended values are between 64 and 300.
    window_size (int): The number of words to concatenate around each token
        to construct the convolution. Recommended value is 1.
    maxout_pieces (int): The number of maxout pieces to use. Recommended
        values are 2 or 3.
    depth (int): The number of convolutional layers. Recommended value is 4.
    """
    cnn = chain(
        expand_window(window_size=window_size),
        Maxout(
            nO=width,
            nI=width * ((window_size * 2) + 1),
            nP=maxout_pieces,
            dropout=0.0,
            normalize=True,
        ),
    )
    model = clone(residual(cnn), depth)
    model.set_dim("nO", width)
    receptive_field = window_size * depth
    return with_array(model, pad=receptive_field)


# TODO: archive
@registry.architectures.register("spacy.MishWindowEncoder.v1")
def MishWindowEncoder(
    width: int, window_size: int, depth: int

@@ -299,6 +366,29 @@ def MishWindowEncoder(
    return model


@registry.architectures.register("spacy.MishWindowEncoder.v2")
def MishWindowEncoder(
    width: int, window_size: int, depth: int
) -> Model[List[Floats2d], List[Floats2d]]:
    """Encode context using convolutions with mish activation, layer
    normalization and residual connections.

    width (int): The input and output width. These are required to be the same,
        to allow residual connections. This value will be determined by the
        width of the inputs. Recommended values are between 64 and 300.
    window_size (int): The number of words to concatenate around each token
        to construct the convolution. Recommended value is 1.
    depth (int): The number of convolutional layers. Recommended value is 4.
    """
    cnn = chain(
        expand_window(window_size=window_size),
        Mish(nO=width, nI=width * ((window_size * 2) + 1), dropout=0.0, normalize=True),
    )
    model = clone(residual(cnn), depth)
    model.set_dim("nO", width)
    return with_array(model)


@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
def BiLSTMEncoder(
    width: int, depth: int, dropout: float

@@ -308,9 +398,9 @@ def BiLSTMEncoder(
    width (int): The input and output width. These are required to be the same,
        to allow residual connections. This value will be determined by the
        width of the inputs. Recommended values are between 64 and 300.
    window_size (int): The number of words to concatenate around each token
        to construct the convolution. Recommended value is 1.
    depth (int): The number of convolutional layers. Recommended value is 4.
    depth (int): The number of recurrent layers.
    dropout (float): Creates a Dropout layer on the outputs of each LSTM layer
        except the last layer. Set to 0 to disable this functionality.
    """
    if depth == 0:
        return noop()
@@ -47,8 +47,7 @@ def forward(
    except ValueError:
        raise RuntimeError(Errors.E896)
    output = Ragged(
        vectors_data,
        model.ops.asarray([len(doc) for doc in docs], dtype="i")
        vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i")
    )
    mask = None
    if is_train:
@@ -1,8 +1,10 @@
from thinc.api import Model, noop, use_ops, Linear
from thinc.api import Model, noop
from .parser_model import ParserStepModel


def TransitionModel(tok2vec, lower, upper, resize_output, dropout=0.2, unseen_classes=set()):
def TransitionModel(
    tok2vec, lower, upper, resize_output, dropout=0.2, unseen_classes=set()
):
    """Set up a stepwise transition-based model"""
    if upper is None:
        has_upper = False

@@ -44,4 +46,3 @@ def init(model, X=None, Y=None):
    if model.attrs["has_upper"]:
        statevecs = model.ops.alloc2f(2, lower.get_dim("nO"))
        model.get_ref("upper").initialize(X=statevecs)
@@ -133,8 +133,9 @@ cdef class Morphology:
        """
        cdef MorphAnalysisC tag
        tag.length = len(field_feature_pairs)
        tag.fields = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
        tag.features = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
        if tag.length > 0:
            tag.fields = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
            tag.features = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
        for i, (field, feature) in enumerate(field_feature_pairs):
            tag.fields[i] = field
            tag.features[i] = feature
@@ -11,6 +11,7 @@ from .senter import SentenceRecognizer
from .sentencizer import Sentencizer
from .tagger import Tagger
from .textcat import TextCategorizer
from .textcat_multilabel import MultiLabel_TextCategorizer
from .tok2vec import Tok2Vec
from .functions import merge_entities, merge_noun_chunks, merge_subtokens

@@ -22,13 +23,14 @@ __all__ = [
    "EntityRuler",
    "Morphologizer",
    "Lemmatizer",
    "TrainablePipe",
    "MultiLabel_TextCategorizer",
    "Pipe",
    "SentenceRecognizer",
    "Sentencizer",
    "Tagger",
    "TextCategorizer",
    "Tok2Vec",
    "TrainablePipe",
    "merge_entities",
    "merge_noun_chunks",
    "merge_subtokens",
@@ -255,7 +255,7 @@ def get_gradient(nr_class, beam_maps, histories, losses):
    for a beam state -- so we have "the gradient of loss for taking
    action i given history H."

    Histories: Each hitory is a list of actions
    Histories: Each history is a list of actions
    Each candidate has a history
    Each beam has multiple candidates
    Each batch has multiple beams
@@ -4,4 +4,4 @@ from .transition_system cimport Transition, TransitionSystem


cdef class ArcEager(TransitionSystem):
    pass
    cdef get_arcs(self, StateC* state)
@@ -1,6 +1,7 @@
# cython: profile=True, cdivision=True, infer_types=True
from cymem.cymem cimport Pool, Address
from libc.stdint cimport int32_t
from libcpp.vector cimport vector

from collections import defaultdict, Counter

@@ -10,9 +11,9 @@ from ...structs cimport TokenC
from ...tokens.doc cimport Doc, set_children_from_heads
from ...training.example cimport Example
from .stateclass cimport StateClass
from ._state cimport StateC
from ._state cimport StateC, ArcC
from ...errors import Errors
from thinc.extra.search cimport Beam

cdef weight_t MIN_SCORE = -90000
cdef attr_t SUBTOK_LABEL = hash_string(u'subtok')

@@ -65,6 +66,7 @@ cdef GoldParseStateC create_gold_state(Pool mem, const StateC* state,
    cdef GoldParseStateC gs
    gs.length = len(heads)
    gs.stride = 1
    assert gs.length > 0
    gs.labels = <attr_t*>mem.alloc(gs.length, sizeof(gs.labels[0]))
    gs.heads = <int32_t*>mem.alloc(gs.length, sizeof(gs.heads[0]))
    gs.n_kids = <int32_t*>mem.alloc(gs.length, sizeof(gs.n_kids[0]))

@@ -126,6 +128,7 @@ cdef GoldParseStateC create_gold_state(Pool mem, const StateC* state,
        1
    )
    # Make an array of pointers, pointing into the gs_kids_flat array.
    assert gs.length > 0
    gs.kids = <int32_t**>mem.alloc(gs.length, sizeof(int32_t*))
    for i in range(gs.length):
        if gs.n_kids[i] != 0:

@@ -609,7 +612,7 @@ cdef class ArcEager(TransitionSystem):
        return gold

    def init_gold_batch(self, examples):
        # TODO: Projectivitity?
        # TODO: Projectivity?
        all_states = self.init_batch([eg.predicted for eg in examples])
        golds = []
        states = []

@@ -705,6 +708,28 @@ cdef class ArcEager(TransitionSystem):
                doc.c[i].dep = self.root_label
        set_children_from_heads(doc.c, 0, doc.length)

    def get_beam_parses(self, Beam beam):
        parses = []
        probs = beam.probs
        for i in range(beam.size):
            state = <StateC*>beam.at(i)
            if state.is_final():
                prob = probs[i]
                parse = []
                arcs = self.get_arcs(state)
                if arcs:
                    for arc in arcs:
                        dep = arc["label"]
                        label = self.strings[dep]
                        parse.append((arc["head"], arc["child"], label))
                parses.append((prob, parse))
        return parses

    cdef get_arcs(self, StateC* state):
        cdef vector[ArcC] arcs
        state.get_arcs(&arcs)
        return list(arcs)

    def has_gold(self, Example eg, start=0, end=None):
        for word in eg.y[start:end]:
            if word.dep != 0:
@@ -2,6 +2,7 @@ from libc.stdint cimport int32_t
from cymem.cymem cimport Pool

from collections import Counter
from thinc.extra.search cimport Beam

from ...tokens.doc cimport Doc
from ...tokens.span import Span

@@ -63,6 +64,7 @@ cdef GoldNERStateC create_gold_state(
    Example example
) except *:
    cdef GoldNERStateC gs
    assert example.x.length > 0
    gs.ner = <Transition*>mem.alloc(example.x.length, sizeof(Transition))
    ner_tags = example.get_aligned_ner()
    for i, ner_tag in enumerate(ner_tags):

@@ -245,6 +247,21 @@ cdef class BiluoPushDown(TransitionSystem):
            if doc.c[i].ent_iob == 0:
                doc.c[i].ent_iob = 2

    def get_beam_parses(self, Beam beam):
        parses = []
        probs = beam.probs
        for i in range(beam.size):
            state = <StateC*>beam.at(i)
            if state.is_final():
                prob = probs[i]
                parse = []
                for j in range(state._ents.size()):
                    ent = state._ents.at(j)
                    if ent.start != -1 and ent.end != -1:
                        parse.append((ent.start, ent.end, self.strings[ent.label]))
                parses.append((prob, parse))
        return parses

    def init_gold(self, StateClass state, Example example):
        return BiluoGold(self, state, example)
|
@ -226,6 +226,7 @@ class AttributeRuler(Pipe):
|
|||
|
||||
DOCS: https://nightly.spacy.io/api/tagger#score
|
||||
"""
|
||||
|
||||
def morph_key_getter(token, attr):
|
||||
return getattr(token, attr).key
|
||||
|
||||
|
@ -240,8 +241,16 @@ class AttributeRuler(Pipe):
|
|||
elif attr == POS:
|
||||
results.update(Scorer.score_token_attr(examples, "pos", **kwargs))
|
||||
elif attr == MORPH:
|
||||
results.update(Scorer.score_token_attr(examples, "morph", getter=morph_key_getter, **kwargs))
|
||||
results.update(Scorer.score_token_attr_per_feat(examples, "morph", getter=morph_key_getter, **kwargs))
|
||||
results.update(
|
||||
Scorer.score_token_attr(
|
||||
examples, "morph", getter=morph_key_getter, **kwargs
|
||||
)
|
||||
)
|
||||
results.update(
|
||||
Scorer.score_token_attr_per_feat(
|
||||
examples, "morph", getter=morph_key_getter, **kwargs
|
||||
)
|
||||
)
|
||||
elif attr == LEMMA:
|
||||
results.update(Scorer.score_token_attr(examples, "lemma", **kwargs))
|
||||
return results
|
||||
|
|
|
@@ -1,4 +1,5 @@
# cython: infer_types=True, profile=True, binding=True
from collections import defaultdict
from typing import Optional, Iterable
from thinc.api import Model, Config

@@ -258,3 +259,20 @@ cdef class DependencyParser(Parser):
        results.update(Scorer.score_deps(examples, "dep", **kwargs))
        del results["sents_per_type"]
        return results

    def scored_parses(self, beams):
        """Return two dictionaries with scores for each beam/doc that was processed:
        one containing (i, head) keys, and another containing (i, label) keys.
        """
        head_scores = []
        label_scores = []
        for beam in beams:
            score_head_dict = defaultdict(float)
            score_label_dict = defaultdict(float)
            for score, parses in self.moves.get_beam_parses(beam):
                for head, i, label in parses:
                    score_head_dict[(i, head)] += score
                    score_label_dict[(i, label)] += score
            head_scores.append(score_head_dict)
            label_scores.append(score_label_dict)
        return head_scores, label_scores
@@ -24,7 +24,7 @@ default_model_config = """
@architectures = "spacy.Tagger.v1"

[model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"

[model.tok2vec.embed]
@architectures = "spacy.CharacterEmbed.v1"

@@ -35,7 +35,7 @@ nC = 8
include_static_vectors = false

[model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 128
depth = 4
window_size = 1
@@ -1,4 +1,5 @@
# cython: infer_types=True, profile=True, binding=True
from collections import defaultdict
from typing import Optional, Iterable
from thinc.api import Model, Config

@@ -197,3 +198,16 @@ cdef class EntityRecognizer(Parser):
        """
        validate_examples(examples, "EntityRecognizer.score")
        return get_ner_prf(examples)

    def scored_ents(self, beams):
        """Return a dictionary of (start, end, label) tuples with corresponding scores
        for each beam/doc that was processed.
        """
        entity_scores = []
        for beam in beams:
            score_dict = defaultdict(float)
            for score, ents in self.moves.get_beam_parses(beam):
                for start, end, label in ents:
                    score_dict[(start, end, label)] += score
            entity_scores.append(score_dict)
        return entity_scores
@@ -256,8 +256,14 @@ class Tagger(TrainablePipe):
        DOCS: https://nightly.spacy.io/api/tagger#get_loss
        """
        validate_examples(examples, "Tagger.get_loss")
        loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False, missing_value="")
        truths = [eg.get_aligned("TAG", as_string=True) for eg in examples]
        loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False)
        # Convert empty tag "" to missing value None so that both misaligned
        # tokens and tokens with missing annotation have the default missing
        # value None.
        truths = []
        for eg in examples:
            eg_truths = [tag if tag is not "" else None for tag in eg.get_aligned("TAG", as_string=True)]
            truths.append(eg_truths)
        d_scores, loss = loss_func(scores, truths)
        if self.model.ops.xp.isnan(loss):
            raise ValueError(Errors.E910.format(name=self.name))
@@ -14,12 +14,12 @@ from ..tokens import Doc
from ..vocab import Vocab


default_model_config = """
single_label_default_config = """
[model]
@architectures = "spacy.TextCatEnsemble.v2"

[model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"

[model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"

@@ -29,7 +29,7 @@ attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false

[model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = ${model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3

@@ -37,24 +37,24 @@ depth = 2

[model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
exclusive_classes = true
ngram_size = 1
no_output_layer = false
"""
DEFAULT_TEXTCAT_MODEL = Config().from_str(default_model_config)["model"]
DEFAULT_SINGLE_TEXTCAT_MODEL = Config().from_str(single_label_default_config)["model"]

bow_model_config = """
single_label_bow_config = """
[model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
exclusive_classes = true
ngram_size = 1
no_output_layer = false
"""

cnn_model_config = """
single_label_cnn_config = """
[model]
@architectures = "spacy.TextCatCNN.v1"
exclusive_classes = false
exclusive_classes = true

[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"

@@ -71,7 +71,7 @@ subword_features = true
@Language.factory(
    "textcat",
    assigns=["doc.cats"],
    default_config={"threshold": 0.5, "model": DEFAULT_TEXTCAT_MODEL},
    default_config={"threshold": 0.5, "model": DEFAULT_SINGLE_TEXTCAT_MODEL},
    default_score_weights={
        "cats_score": 1.0,
        "cats_score_desc": None,

@@ -103,7 +103,7 @@ def make_textcat(


class TextCategorizer(TrainablePipe):
    """Pipeline component for text classification.
    """Pipeline component for single-label text classification.

    DOCS: https://nightly.spacy.io/api/textcategorizer
    """

@@ -111,7 +111,7 @@ class TextCategorizer(TrainablePipe):
    def __init__(
        self, vocab: Vocab, model: Model, name: str = "textcat", *, threshold: float
    ) -> None:
        """Initialize a text categorizer.
        """Initialize a text categorizer for single-label classification.

        vocab (Vocab): The shared vocabulary.
        model (thinc.api.Model): The Thinc Model powering the pipeline component.

@@ -214,6 +214,7 @@ class TextCategorizer(TrainablePipe):
        losses = {}
        losses.setdefault(self.name, 0.0)
        validate_examples(examples, "TextCategorizer.update")
        self._validate_categories(examples)
        if not any(len(eg.predicted) if eg.predicted else 0 for eg in examples):
            # Handle cases where there are no tokens in any docs.
            return losses

@@ -256,6 +257,7 @@ class TextCategorizer(TrainablePipe):
        if self._rehearsal_model is None:
            return losses
        validate_examples(examples, "TextCategorizer.rehearse")
        self._validate_categories(examples)
        docs = [eg.predicted for eg in examples]
        if not any(len(doc) for doc in docs):
            # Handle cases where there are no tokens in any docs.

@@ -296,6 +298,7 @@ class TextCategorizer(TrainablePipe):
        DOCS: https://nightly.spacy.io/api/textcategorizer#get_loss
        """
        validate_examples(examples, "TextCategorizer.get_loss")
        self._validate_categories(examples)
        truths, not_missing = self._examples_to_truth(examples)
        not_missing = self.model.ops.asarray(not_missing)
        d_scores = (scores - truths) / scores.shape[0]

@@ -341,6 +344,7 @@ class TextCategorizer(TrainablePipe):
        DOCS: https://nightly.spacy.io/api/textcategorizer#initialize
        """
        validate_get_examples(get_examples, "TextCategorizer.initialize")
        self._validate_categories(get_examples())
        if labels is None:
            for example in get_examples():
                for cat in example.y.cats:

@@ -373,12 +377,20 @@ class TextCategorizer(TrainablePipe):
        DOCS: https://nightly.spacy.io/api/textcategorizer#score
        """
        validate_examples(examples, "TextCategorizer.score")
        self._validate_categories(examples)
        return Scorer.score_cats(
            examples,
            "cats",
            labels=self.labels,
            multi_label=self.model.attrs["multi_label"],
            multi_label=False,
            positive_label=self.cfg["positive_label"],
            threshold=self.cfg["threshold"],
            **kwargs,
        )

    def _validate_categories(self, examples: List[Example]):
        """Check whether the provided examples all have single-label cats annotations."""
        for ex in examples:
            if list(ex.reference.cats.values()).count(1.0) > 1:
                raise ValueError(Errors.E895.format(value=ex.reference.cats))
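A minimal usage sketch of the single-label/multi-label split introduced here, using the component names registered in this diff. The blank pipelines and the threshold value are illustrative, not part of the commit:

```python
# Sketch, assuming a spacy-nightly build that contains this commit.
import spacy

nlp_single = spacy.blank("en")
# "textcat": exactly one gold label per document is expected to be true
nlp_single.add_pipe("textcat", config={"threshold": 0.5})

nlp_multi = spacy.blank("en")
# "textcat_multilabel": zero or more labels may be true per document
nlp_multi.add_pipe("textcat_multilabel", config={"threshold": 0.5})
```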
spacy/pipeline/textcat_multilabel.py (new file, 191 lines)

@@ -0,0 +1,191 @@
from itertools import islice
from typing import Iterable, Optional, Dict, List, Callable, Any

from thinc.api import Model, Config
from thinc.types import Floats2d

from ..language import Language
from ..training import Example, validate_examples, validate_get_examples
from ..errors import Errors
from ..scorer import Scorer
from ..tokens import Doc
from ..vocab import Vocab
from .textcat import TextCategorizer


multi_label_default_config = """
[model]
@architectures = "spacy.TextCatEnsemble.v2"

[model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 64
rows = [2000, 2000, 1000, 1000, 1000, 1000]
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false

[model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = ${model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3
depth = 2

[model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
"""
DEFAULT_MULTI_TEXTCAT_MODEL = Config().from_str(multi_label_default_config)["model"]

multi_label_bow_config = """
[model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
"""

multi_label_cnn_config = """
[model]
@architectures = "spacy.TextCatCNN.v1"
exclusive_classes = false

[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
"""


@Language.factory(
    "textcat_multilabel",
    assigns=["doc.cats"],
    default_config={"threshold": 0.5, "model": DEFAULT_MULTI_TEXTCAT_MODEL},
    default_score_weights={
        "cats_score": 1.0,
        "cats_score_desc": None,
        "cats_micro_p": None,
        "cats_micro_r": None,
        "cats_micro_f": None,
        "cats_macro_p": None,
        "cats_macro_r": None,
        "cats_macro_f": None,
        "cats_macro_auc": None,
        "cats_f_per_type": None,
        "cats_macro_auc_per_type": None,
    },
)
def make_multilabel_textcat(
    nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float
) -> "TextCategorizer":
    """Create a TextCategorizer compoment. The text categorizer predicts categories
    over a whole document. It can learn one or more labels, and the labels can
    be mutually exclusive (i.e. one true label per doc) or non-mutually exclusive
    (i.e. zero or more labels may be true per doc). The multi-label setting is
    controlled by the model instance that's provided.

    model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
        scores for each category.
    threshold (float): Cutoff to consider a prediction "positive".
    """
    return MultiLabel_TextCategorizer(nlp.vocab, model, name, threshold=threshold)


class MultiLabel_TextCategorizer(TextCategorizer):
    """Pipeline component for multi-label text classification.

    DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer
    """

    def __init__(
        self,
        vocab: Vocab,
        model: Model,
        name: str = "textcat_multilabel",
        *,
        threshold: float,
    ) -> None:
        """Initialize a text categorizer for multi-label classification.

        vocab (Vocab): The shared vocabulary.
        model (thinc.api.Model): The Thinc Model powering the pipeline component.
        name (str): The component instance name, used to add entries to the
            losses during training.
        threshold (float): Cutoff to consider a prediction "positive".

        DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer#init
        """
        self.vocab = vocab
        self.model = model
        self.name = name
        self._rehearsal_model = None
        cfg = {"labels": [], "threshold": threshold}
        self.cfg = dict(cfg)

    def initialize(
        self,
        get_examples: Callable[[], Iterable[Example]],
        *,
        nlp: Optional[Language] = None,
        labels: Optional[Dict] = None,
    ):
        """Initialize the pipe for training, using a representative set
        of data examples.

        get_examples (Callable[[], Iterable[Example]]): Function that
            returns a representative sample of gold-standard Example objects.
        nlp (Language): The current nlp object the component is part of.
        labels: The labels to add to the component, typically generated by the
            `init labels` command. If no labels are provided, the get_examples
            callback is used to extract the labels from the data.

        DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer#initialize
        """
        validate_get_examples(get_examples, "MultiLabel_TextCategorizer.initialize")
        if labels is None:
            for example in get_examples():
                for cat in example.y.cats:
                    self.add_label(cat)
        else:
            for label in labels:
                self.add_label(label)
        subbatch = list(islice(get_examples(), 10))
        doc_sample = [eg.reference for eg in subbatch]
        label_sample, _ = self._examples_to_truth(subbatch)
        self._require_labels()
        assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
        assert len(label_sample) > 0, Errors.E923.format(name=self.name)
        self.model.initialize(X=doc_sample, Y=label_sample)

    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
        """Score a batch of examples.
|
||||
|
||||
examples (Iterable[Example]): The examples to score.
|
||||
RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.
|
||||
|
||||
DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer#score
|
||||
"""
|
||||
validate_examples(examples, "MultiLabel_TextCategorizer.score")
|
||||
return Scorer.score_cats(
|
||||
examples,
|
||||
"cats",
|
||||
labels=self.labels,
|
||||
multi_label=True,
|
||||
threshold=self.cfg["threshold"],
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
def _validate_categories(self, examples: List[Example]):
|
||||
"""This component allows any type of single- or multi-label annotations.
|
||||
This method overwrites the more strict one from 'textcat'. """
|
||||
pass
|
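
A short usage sketch for the new "textcat_multilabel" factory registered above, not taken from the diff. It assumes a nightly build that ships this component; the label names, texts and iteration count are illustrative.

# Usage sketch for "textcat_multilabel" (illustrative labels and texts).
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel", config={"threshold": 0.5})

train_data = [
    ("The match went to extra time", {"cats": {"SPORTS": 1.0, "POLITICS": 0.0}}),
    ("The minister resigned today", {"cats": {"SPORTS": 0.0, "POLITICS": 1.0}}),
]
examples = [Example.from_dict(nlp.make_doc(t), ann) for t, ann in train_data]

# Labels are picked up from the examples by MultiLabel_TextCategorizer.initialize
optimizer = nlp.initialize(lambda: examples)
for _ in range(20):
    nlp.update(examples, sgd=optimizer)

doc = nlp("The minister watched the match")
print(doc.cats)  # independent per-label scores; more than one can exceed the threshold
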
@@ -3,7 +3,7 @@ import numpy as np
 from collections import defaultdict

 from .training import Example
-from .tokens import Token, Doc, Span, MorphAnalysis
+from .tokens import Token, Doc, Span
 from .errors import Errors
 from .util import get_lang_class, SimpleFrozenList
 from .morphology import Morphology

@@ -176,7 +176,7 @@ class Scorer:
             "token_acc": None,
             "token_p": None,
             "token_r": None,
-            "token_f": None
+            "token_f": None,
         }

     @staticmethod
@@ -276,7 +276,10 @@ class Scorer:
             if gold_i not in missing_indices:
                 value = getter(token, attr)
                 morph = gold_doc.vocab.strings[value]
-                if value not in missing_values and morph != Morphology.EMPTY_MORPH:
+                if (
+                    value not in missing_values
+                    and morph != Morphology.EMPTY_MORPH
+                ):
                     for feat in morph.split(Morphology.FEATURE_SEP):
                         field, values = feat.split(Morphology.FIELD_SEP)
                         if field not in per_feat:
@@ -367,7 +370,6 @@ class Scorer:
             f"{attr}_per_type": None,
         }

-
     @staticmethod
     def score_cats(
         examples: Iterable[Example],
@@ -458,7 +460,7 @@ class Scorer:
                 gold_label, gold_score = max(gold_cats, key=lambda it: it[1])
                 if gold_score is not None and gold_score > 0:
                     f_per_type[gold_label].fn += 1
-            else:
+            elif pred_cats:
                 pred_label, pred_score = max(pred_cats, key=lambda it: it[1])
                 if pred_score >= threshold:
                     f_per_type[pred_label].fp += 1
@@ -473,7 +475,10 @@ class Scorer:
         macro_f = sum(prf.fscore for prf in f_per_type.values()) / n_cats
         # Limit macro_auc to those labels with gold annotations,
         # but still divide by all cats to avoid artificial boosting of datasets with missing labels
-        macro_auc = sum(auc.score if auc.is_binary() else 0.0 for auc in auc_per_type.values()) / n_cats
+        macro_auc = (
+            sum(auc.score if auc.is_binary() else 0.0 for auc in auc_per_type.values())
+            / n_cats
+        )
         results = {
             f"{attr}_score": None,
             f"{attr}_score_desc": None,

@@ -485,7 +490,9 @@ class Scorer:
             f"{attr}_macro_f": macro_f,
             f"{attr}_macro_auc": macro_auc,
             f"{attr}_f_per_type": {k: v.to_dict() for k, v in f_per_type.items()},
-            f"{attr}_auc_per_type": {k: v.score if v.is_binary() else None for k, v in auc_per_type.items()},
+            f"{attr}_auc_per_type": {
+                k: v.score if v.is_binary() else None for k, v in auc_per_type.items()
+            },
         }
         if len(labels) == 2 and not multi_label and positive_label:
             positive_label_f = results[f"{attr}_f_per_type"][positive_label]["f"]
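
As a reading aid for the Scorer.score_cats hunks above, a small sketch (not from the diff) of how the returned dict is typically consumed; the helper name, labels and the exact set of keys printed are illustrative, but the keys referenced in the comments appear in the reformatted block above.

# Illustrative consumer of Scorer.score_cats for attr="cats".
from spacy.scorer import Scorer

def summarize_cats(examples, labels):
    scores = Scorer.score_cats(examples, "cats", labels=labels, multi_label=True, threshold=0.5)
    # Keys mirror the dict built above, e.g.:
    #   scores["cats_macro_f"], scores["cats_macro_auc"]
    #   scores["cats_f_per_type"][label]["f"]
    #   scores["cats_auc_per_type"][label]  (None when the per-label AUC is not binary)
    return {k: v for k, v in scores.items() if k.startswith("cats_")}
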
@@ -675,8 +682,7 @@ class Scorer:


 def get_ner_prf(examples: Iterable[Example]) -> Dict[str, Any]:
-    """Compute micro-PRF and per-entity PRF scores for a sequence of examples.
-    """
+    """Compute micro-PRF and per-entity PRF scores for a sequence of examples."""
     score_per_type = defaultdict(PRFScore)
     for eg in examples:
         if not eg.y.has_annotation("ENT_IOB"):
|
|
|
@@ -154,10 +154,10 @@ def test_doc_api_serialize(en_tokenizer, text):

     logger = logging.getLogger("spacy")
     with mock.patch.object(logger, "warning") as mock_warning:
-        _ = tokens.to_bytes()
+        _ = tokens.to_bytes()  # noqa: F841
         mock_warning.assert_not_called()
         tokens.user_hooks["similarity"] = inner_func
-        _ = tokens.to_bytes()
+        _ = tokens.to_bytes()  # noqa: F841
         mock_warning.assert_called_once()

|
|
|
@@ -21,11 +21,13 @@ def test_doc_retokenize_merge(en_tokenizer):
     assert doc[4].text == "the beach boys"
     assert doc[4].text_with_ws == "the beach boys "
     assert doc[4].tag_ == "NAMED"
+    assert doc[4].lemma_ == "LEMMA"
     assert str(doc[4].morph) == "Number=Plur"
     assert doc[5].text == "all night"
     assert doc[5].text_with_ws == "all night"
     assert doc[5].tag_ == "NAMED"
     assert str(doc[5].morph) == "Number=Plur"
+    assert doc[5].lemma_ == "LEMMA"


 def test_doc_retokenize_merge_children(en_tokenizer):
|
@ -103,25 +105,29 @@ def test_doc_retokenize_spans_merge_tokens(en_tokenizer):
|
|||
|
||||
def test_doc_retokenize_spans_merge_tokens_default_attrs(en_vocab):
|
||||
words = ["The", "players", "start", "."]
|
||||
lemmas = [t.lower() for t in words]
|
||||
heads = [1, 2, 2, 2]
|
||||
tags = ["DT", "NN", "VBZ", "."]
|
||||
pos = ["DET", "NOUN", "VERB", "PUNCT"]
|
||||
doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads)
|
||||
doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads, lemmas=lemmas)
|
||||
assert len(doc) == 4
|
||||
assert doc[0].text == "The"
|
||||
assert doc[0].tag_ == "DT"
|
||||
assert doc[0].pos_ == "DET"
|
||||
assert doc[0].lemma_ == "the"
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.merge(doc[0:2])
|
||||
assert len(doc) == 3
|
||||
assert doc[0].text == "The players"
|
||||
assert doc[0].tag_ == "NN"
|
||||
assert doc[0].pos_ == "NOUN"
|
||||
doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads)
|
||||
assert doc[0].lemma_ == "the players"
|
||||
doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads, lemmas=lemmas)
|
||||
assert len(doc) == 4
|
||||
assert doc[0].text == "The"
|
||||
assert doc[0].tag_ == "DT"
|
||||
assert doc[0].pos_ == "DET"
|
||||
assert doc[0].lemma_ == "the"
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.merge(doc[0:2])
|
||||
retokenizer.merge(doc[2:4])
|
||||
|
@ -129,9 +135,11 @@ def test_doc_retokenize_spans_merge_tokens_default_attrs(en_vocab):
|
|||
assert doc[0].text == "The players"
|
||||
assert doc[0].tag_ == "NN"
|
||||
assert doc[0].pos_ == "NOUN"
|
||||
assert doc[0].lemma_ == "the players"
|
||||
assert doc[1].text == "start ."
|
||||
assert doc[1].tag_ == "VBZ"
|
||||
assert doc[1].pos_ == "VERB"
|
||||
assert doc[1].lemma_ == "start ."
|
||||
|
||||
|
||||
def test_doc_retokenize_spans_merge_heads(en_vocab):
|
||||
|
|
|
@ -39,6 +39,36 @@ def test_doc_retokenize_split(en_vocab):
|
|||
assert len(str(doc)) == 19
|
||||
|
||||
|
||||
def test_doc_retokenize_split_lemmas(en_vocab):
|
||||
# If lemmas are not set, leave unset
|
||||
words = ["LosAngeles", "start", "."]
|
||||
heads = [1, 2, 2]
|
||||
doc = Doc(en_vocab, words=words, heads=heads)
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.split(
|
||||
doc[0],
|
||||
["Los", "Angeles"],
|
||||
[(doc[0], 1), doc[1]],
|
||||
)
|
||||
assert doc[0].lemma_ == ""
|
||||
assert doc[1].lemma_ == ""
|
||||
|
||||
# If lemmas are set, use split orth as default lemma
|
||||
words = ["LosAngeles", "start", "."]
|
||||
heads = [1, 2, 2]
|
||||
doc = Doc(en_vocab, words=words, heads=heads)
|
||||
for t in doc:
|
||||
t.lemma_ = "a"
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.split(
|
||||
doc[0],
|
||||
["Los", "Angeles"],
|
||||
[(doc[0], 1), doc[1]],
|
||||
)
|
||||
assert doc[0].lemma_ == "Los"
|
||||
assert doc[1].lemma_ == "Angeles"
|
||||
|
||||
|
||||
def test_doc_retokenize_split_dependencies(en_vocab):
|
||||
doc = Doc(en_vocab, words=["LosAngeles", "start", "."])
|
||||
dep1 = doc.vocab.strings.add("amod")
|
||||
|
|
|
@@ -113,9 +113,8 @@ def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms):
     assert [token.norm_ for token in tokens] == norms


-@pytest.mark.skip
 @pytest.mark.parametrize(
-    "text,norm", [("radicalised", "radicalized"), ("cuz", "because")]
+    "text,norm", [("Jan.", "January"), ("'cuz", "because")]
 )
 def test_en_lex_attrs_norm_exceptions(en_tokenizer, text, norm):
     tokens = en_tokenizer(text)
|
|
|
@ -4,21 +4,21 @@ from spacy.lang.mk.lex_attrs import like_num
|
|||
|
||||
def test_tokenizer_handles_long_text(mk_tokenizer):
|
||||
text = """
|
||||
Во организациските работи или на нашите собранија со членството, никој од нас не зборуваше за
|
||||
организацијата и идеологијата. Работна беше нашата работа, а не идеолошка. Што се однесува до социјализмот на
|
||||
Делчев, неговата дејност зборува сама за себе - спротивно. Во суштина, водачите си имаа свои основни погледи и
|
||||
свои разбирања за положбата и работите, коишто стоеја пред нив и ги завршуваа со голема упорност, настојчивост и
|
||||
насоченост. Значи, идеологија имаше, само што нивната идеологија имаше своја оригиналност. Македонија денеска,
|
||||
чиста рожба на животот и положбата во Македонија, кои му служеа како база на неговите побуди, беше дејност која
|
||||
имаше потреба од ум за да си најде своја смисла. Таквата идеологија и заемното дејство на умот и срцето му
|
||||
помогнаа на Делчев да не се занесе по патот на својата идеологија... Во суштина, Организацијата и нејзините
|
||||
водачи имаа свои разбирања за работите и положбата во идеен поглед, но тоа беше врската, животот и положбата во
|
||||
Македонија и го внесуваа во својата идеологија гласот на своето срце, и на крај, прибегнуваа до умот,
|
||||
за да најдат смисла или да ѝ дадат. Тоа содејство и заемен сооднос на умот и срцето му помогнаа на Делчев да ја
|
||||
држи својата идеологија во сообразност со положбата на работите... Водачите навистина направија една жртва
|
||||
бидејќи на населението не му зборуваа за своите мисли и идеи. Тие се одрекоа од секаква субјективност во своите
|
||||
мисли. Целта беше да не се зголемуваат целите и задачите како и преданоста во работата. Населението не можеше да
|
||||
ги разбере овие идеи...
|
||||
Во организациските работи или на нашите собранија со членството, никој од нас не зборуваше за
|
||||
организацијата и идеологијата. Работна беше нашата работа, а не идеолошка. Што се однесува до социјализмот на
|
||||
Делчев, неговата дејност зборува сама за себе - спротивно. Во суштина, водачите си имаа свои основни погледи и
|
||||
свои разбирања за положбата и работите, коишто стоеја пред нив и ги завршуваа со голема упорност, настојчивост и
|
||||
насоченост. Значи, идеологија имаше, само што нивната идеологија имаше своја оригиналност. Македонија денеска,
|
||||
чиста рожба на животот и положбата во Македонија, кои му служеа како база на неговите побуди, беше дејност која
|
||||
имаше потреба од ум за да си најде своја смисла. Таквата идеологија и заемното дејство на умот и срцето му
|
||||
помогнаа на Делчев да не се занесе по патот на својата идеологија... Во суштина, Организацијата и нејзините
|
||||
водачи имаа свои разбирања за работите и положбата во идеен поглед, но тоа беше врската, животот и положбата во
|
||||
Македонија и го внесуваа во својата идеологија гласот на своето срце, и на крај, прибегнуваа до умот,
|
||||
за да најдат смисла или да ѝ дадат. Тоа содејство и заемен сооднос на умот и срцето му помогнаа на Делчев да ја
|
||||
држи својата идеологија во сообразност со положбата на работите... Водачите навистина направија една жртва
|
||||
бидејќи на населението не му зборуваа за своите мисли и идеи. Тие се одрекоа од секаква субјективност во своите
|
||||
мисли. Целта беше да не се зголемуваат целите и задачите како и преданоста во работата. Населението не можеше да
|
||||
ги разбере овие идеи...
|
||||
"""
|
||||
tokens = mk_tokenizer(text)
|
||||
assert len(tokens) == 297
|
||||
|
@ -45,7 +45,7 @@ def test_tokenizer_handles_long_text(mk_tokenizer):
|
|||
(",", False),
|
||||
("милијарда", True),
|
||||
("билион", True),
|
||||
]
|
||||
],
|
||||
)
|
||||
def test_mk_lex_attrs_like_number(mk_tokenizer, word, match):
|
||||
tokens = mk_tokenizer(word)
|
||||
|
@ -53,14 +53,7 @@ def test_mk_lex_attrs_like_number(mk_tokenizer, word, match):
|
|||
assert tokens[0].like_num == match
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word",
|
||||
[
|
||||
"двесте",
|
||||
"два-три",
|
||||
"пет-шест"
|
||||
]
|
||||
)
|
||||
@pytest.mark.parametrize("word", ["двесте", "два-три", "пет-шест"])
|
||||
def test_mk_lex_attrs_capitals(word):
|
||||
assert like_num(word)
|
||||
assert like_num(word.upper())
|
||||
|
@ -77,8 +70,8 @@ def test_mk_lex_attrs_capitals(word):
|
|||
"петто",
|
||||
"стоти",
|
||||
"шеесетите",
|
||||
"седумдесетите"
|
||||
]
|
||||
"седумдесетите",
|
||||
],
|
||||
)
|
||||
def test_mk_lex_attrs_like_number_for_ordinal(word):
|
||||
assert like_num(word)
|
||||
|
|
|
@ -5,24 +5,22 @@ from spacy.lang.tr.lex_attrs import like_num
|
|||
def test_tr_tokenizer_handles_long_text(tr_tokenizer):
|
||||
text = """Pamuk nasıl ipliğe dönüştürülür?
|
||||
|
||||
Sıkıştırılmış balyalar halindeki pamuk, iplik fabrikasına getirildiğinde hem
|
||||
lifleri birbirine dolaşmıştır, hem de tarladan toplanırken araya bitkinin
|
||||
parçaları karışmıştır. Üstelik balyalardaki pamuğun cinsi aynı olsa bile kalitesi
|
||||
Sıkıştırılmış balyalar halindeki pamuk, iplik fabrikasına getirildiğinde hem
|
||||
lifleri birbirine dolaşmıştır, hem de tarladan toplanırken araya bitkinin
|
||||
parçaları karışmıştır. Üstelik balyalardaki pamuğun cinsi aynı olsa bile kalitesi
|
||||
değişeceğinden, önce bütün balyaların birbirine karıştırılarak harmanlanması gerekir.
|
||||
|
||||
Daha sonra pamuk yığınları, liflerin açılıp temizlenmesi için tek bir birim halinde
|
||||
Daha sonra pamuk yığınları, liflerin açılıp temizlenmesi için tek bir birim halinde
|
||||
birleştirilmiş çeşitli makinelerden geçirilir.Bunlardan biri, dönen tokmaklarıyla
|
||||
pamuğu dövüp kabartarak dağınık yumaklar haline getiren ve liflerin arasındaki yabancı
|
||||
maddeleri temizleyen hallaç makinesidir. Daha sonra tarak makinesine giren pamuk demetleri,
|
||||
herbirinin yüzeyinde yüzbinlerce incecik iğne bulunan döner silindirlerin arasından geçerek lif lif ayrılır
|
||||
ve tül inceliğinde gevşek bir örtüye dönüşür. Ama bir sonraki makine bu lifleri dağınık
|
||||
ve tül inceliğinde gevşek bir örtüye dönüşür. Ama bir sonraki makine bu lifleri dağınık
|
||||
ve gevşek bir biçimde birbirine yaklaştırarak 2 cm eninde bir pamuk şeridi haline getirir."""
|
||||
tokens = tr_tokenizer(text)
|
||||
assert len(tokens) == 146
|
||||
|
||||
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word",
|
||||
[
|
||||
|
|
|
@ -2,145 +2,692 @@ import pytest
|
|||
|
||||
|
||||
ABBREV_TESTS = [
|
||||
("Dr. Murat Bey ile görüştüm.", ["Dr.", "Murat", "Bey", "ile", "görüştüm", "."]),
|
||||
("Dr.la görüştüm.", ["Dr.la", "görüştüm", "."]),
|
||||
("Dr.'la görüştüm.", ["Dr.'la", "görüştüm", "."]),
|
||||
("TBMM'de çalışıyormuş.", ["TBMM'de", "çalışıyormuş", "."]),
|
||||
("Hem İst. hem Ank. bu konuda gayet iyi durumda.", ["Hem", "İst.", "hem", "Ank.", "bu", "konuda", "gayet", "iyi", "durumda", "."]),
|
||||
("Hem İst. hem Ank.'da yağış var.", ["Hem", "İst.", "hem", "Ank.'da", "yağış", "var", "."]),
|
||||
("Dr.", ["Dr."]),
|
||||
("Yrd.Doç.", ["Yrd.Doç."]),
|
||||
("Prof.'un", ["Prof.'un"]),
|
||||
("Böl.'nde", ["Böl.'nde"]),
|
||||
("Dr. Murat Bey ile görüştüm.", ["Dr.", "Murat", "Bey", "ile", "görüştüm", "."]),
|
||||
("Dr.la görüştüm.", ["Dr.la", "görüştüm", "."]),
|
||||
("Dr.'la görüştüm.", ["Dr.'la", "görüştüm", "."]),
|
||||
("TBMM'de çalışıyormuş.", ["TBMM'de", "çalışıyormuş", "."]),
|
||||
(
|
||||
"Hem İst. hem Ank. bu konuda gayet iyi durumda.",
|
||||
["Hem", "İst.", "hem", "Ank.", "bu", "konuda", "gayet", "iyi", "durumda", "."],
|
||||
),
|
||||
(
|
||||
"Hem İst. hem Ank.'da yağış var.",
|
||||
["Hem", "İst.", "hem", "Ank.'da", "yağış", "var", "."],
|
||||
),
|
||||
("Dr.", ["Dr."]),
|
||||
("Yrd.Doç.", ["Yrd.Doç."]),
|
||||
("Prof.'un", ["Prof.'un"]),
|
||||
("Böl.'nde", ["Böl.'nde"]),
|
||||
]
|
||||
|
||||
|
||||
|
||||
URL_TESTS = [
|
||||
("Bizler de www.duygu.com.tr adında bir websitesi kurduk.", ["Bizler", "de", "www.duygu.com.tr", "adında", "bir", "websitesi", "kurduk", "."]),
|
||||
("Bizler de https://www.duygu.com.tr adında bir websitesi kurduk.", ["Bizler", "de", "https://www.duygu.com.tr", "adında", "bir", "websitesi", "kurduk", "."]),
|
||||
("Bizler de www.duygu.com.tr'dan satın aldık.", ["Bizler", "de", "www.duygu.com.tr'dan", "satın", "aldık", "."]),
|
||||
("Bizler de https://www.duygu.com.tr'dan satın aldık.", ["Bizler", "de", "https://www.duygu.com.tr'dan", "satın", "aldık", "."]),
|
||||
(
|
||||
"Bizler de www.duygu.com.tr adında bir websitesi kurduk.",
|
||||
[
|
||||
"Bizler",
|
||||
"de",
|
||||
"www.duygu.com.tr",
|
||||
"adında",
|
||||
"bir",
|
||||
"websitesi",
|
||||
"kurduk",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Bizler de https://www.duygu.com.tr adında bir websitesi kurduk.",
|
||||
[
|
||||
"Bizler",
|
||||
"de",
|
||||
"https://www.duygu.com.tr",
|
||||
"adında",
|
||||
"bir",
|
||||
"websitesi",
|
||||
"kurduk",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Bizler de www.duygu.com.tr'dan satın aldık.",
|
||||
["Bizler", "de", "www.duygu.com.tr'dan", "satın", "aldık", "."],
|
||||
),
|
||||
(
|
||||
"Bizler de https://www.duygu.com.tr'dan satın aldık.",
|
||||
["Bizler", "de", "https://www.duygu.com.tr'dan", "satın", "aldık", "."],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
|
||||
NUMBER_TESTS = [
|
||||
("Rakamla 6 yazılıydı.", ["Rakamla", "6", "yazılıydı", "."]),
|
||||
("Hava -4 dereceydi.", ["Hava", "-4", "dereceydi", "."]),
|
||||
("Hava sıcaklığı -4ten +6ya yükseldi.", ["Hava", "sıcaklığı", "-4ten", "+6ya", "yükseldi", "."]),
|
||||
("Hava sıcaklığı -4'ten +6'ya yükseldi.", ["Hava", "sıcaklığı", "-4'ten", "+6'ya", "yükseldi", "."]),
|
||||
("Yarışta 6. oldum.", ["Yarışta", "6.", "oldum", "."]),
|
||||
("Yarışta 438547745. oldum.", ["Yarışta", "438547745.", "oldum", "."]),
|
||||
("Kitap IV. Murat hakkında.",["Kitap", "IV.", "Murat", "hakkında", "."]),
|
||||
#("Bana söylediği sayı 6.", ["Bana", "söylediği", "sayı", "6", "."]),
|
||||
("Saat 6'da buluşalım.", ["Saat", "6'da", "buluşalım", "."]),
|
||||
("Saat 6dan sonra buluşalım.", ["Saat", "6dan", "sonra", "buluşalım", "."]),
|
||||
("6.dan sonra saymadım.", ["6.dan", "sonra", "saymadım", "."]),
|
||||
("6.'dan sonra saymadım.", ["6.'dan", "sonra", "saymadım", "."]),
|
||||
("Saat 6'ydı.", ["Saat", "6'ydı", "."]),
|
||||
("5'te", ["5'te"]),
|
||||
("6'da", ["6'da"]),
|
||||
("9dan", ["9dan"]),
|
||||
("19'da", ["19'da"]),
|
||||
("VI'da", ["VI'da"]),
|
||||
("5.", ["5."]),
|
||||
("72.", ["72."]),
|
||||
("VI.", ["VI."]),
|
||||
("6.'dan", ["6.'dan"]),
|
||||
("19.'dan", ["19.'dan"]),
|
||||
("6.dan", ["6.dan"]),
|
||||
("16.dan", ["16.dan"]),
|
||||
("VI.'dan", ["VI.'dan"]),
|
||||
("VI.dan", ["VI.dan"]),
|
||||
("Hepsi 1994 yılında oldu.", ["Hepsi", "1994", "yılında", "oldu", "."]),
|
||||
("Hepsi 1994'te oldu.", ["Hepsi", "1994'te", "oldu", "."]),
|
||||
("2/3 tarihli faturayı bulamadım.", ["2/3", "tarihli", "faturayı", "bulamadım", "."]),
|
||||
("2.3 tarihli faturayı bulamadım.", ["2.3", "tarihli", "faturayı", "bulamadım", "."]),
|
||||
("2.3. tarihli faturayı bulamadım.", ["2.3.", "tarihli", "faturayı", "bulamadım", "."]),
|
||||
("2/3/2020 tarihli faturayı bulamadm.", ["2/3/2020", "tarihli", "faturayı", "bulamadm", "."]),
|
||||
("2/3/1987 tarihinden beri burda yaşıyorum.", ["2/3/1987", "tarihinden", "beri", "burda", "yaşıyorum", "."]),
|
||||
("2-3-1987 tarihinden beri burdayım.", ["2-3-1987", "tarihinden", "beri", "burdayım", "."]),
|
||||
("2.3.1987 tarihinden beri burdayım.", ["2.3.1987", "tarihinden", "beri", "burdayım", "."]),
|
||||
("Bu olay 2005-2006 tarihleri arasında oldu.", ["Bu", "olay", "2005", "-", "2006", "tarihleri", "arasında", "oldu", "."]),
|
||||
("Bu olay 4/12/2005-21/3/2006 tarihleri arasında oldu.", ["Bu", "olay", "4/12/2005", "-", "21/3/2006", "tarihleri", "arasında", "oldu", ".",]),
|
||||
("Ek fıkra: 5/11/2003-4999/3 maddesine göre uygundur.", ["Ek", "fıkra", ":", "5/11/2003", "-", "4999/3", "maddesine", "göre", "uygundur", "."]),
|
||||
("2/A alanları: 6831 sayılı Kanunun 2nci maddesinin birinci fıkrasının (A) bendine göre", ["2/A", "alanları", ":", "6831", "sayılı", "Kanunun", "2nci", "maddesinin", "birinci", "fıkrasının", "(", "A", ")", "bendine", "göre"]),
|
||||
("ŞEHİTTEĞMENKALMAZ Cad. No: 2/311", ["ŞEHİTTEĞMENKALMAZ", "Cad.", "No", ":", "2/311"]),
|
||||
("2-3-2025", ["2-3-2025",]),
|
||||
("2/3/2025", ["2/3/2025"]),
|
||||
("Yıllardır 0.5 uç kullanıyorum.", ["Yıllardır", "0.5", "uç", "kullanıyorum", "."]),
|
||||
("Kan değerlerim 0.5-0.7 arasıydı.", ["Kan", "değerlerim", "0.5", "-", "0.7", "arasıydı", "."]),
|
||||
("0.5", ["0.5"]),
|
||||
("1/2", ["1/2"]),
|
||||
("%1", ["%", "1"]),
|
||||
("%1lik", ["%", "1lik"]),
|
||||
("%1'lik", ["%", "1'lik"]),
|
||||
("%1lik dilim", ["%", "1lik", "dilim"]),
|
||||
("%1'lik dilim", ["%", "1'lik", "dilim"]),
|
||||
("%1.5", ["%", "1.5"]),
|
||||
#("%1-%2 arası büyüme bekleniyor.", ["%", "1", "-", "%", "2", "arası", "büyüme", "bekleniyor", "."]),
|
||||
("%1-2 arası büyüme bekliyoruz.", ["%", "1", "-", "2", "arası", "büyüme", "bekliyoruz", "."]),
|
||||
("%11-12 arası büyüme bekliyoruz.", ["%", "11", "-", "12", "arası", "büyüme", "bekliyoruz", "."]),
|
||||
("%1.5luk büyüme bekliyoruz.", ["%", "1.5luk", "büyüme", "bekliyoruz", "."]),
|
||||
("Saat 1-2 arası gelin lütfen.", ["Saat", "1", "-", "2", "arası", "gelin", "lütfen", "."]),
|
||||
("Saat 15:30 gibi buluşalım.", ["Saat", "15:30", "gibi", "buluşalım", "."]),
|
||||
("Saat 15:30'da buluşalım.", ["Saat", "15:30'da", "buluşalım", "."]),
|
||||
("Saat 15.30'da buluşalım.", ["Saat", "15.30'da", "buluşalım", "."]),
|
||||
("Saat 15.30da buluşalım.", ["Saat", "15.30da", "buluşalım", "."]),
|
||||
("Saat 15 civarı buluşalım.", ["Saat", "15", "civarı", "buluşalım", "."]),
|
||||
("9’daki otobüse binsek mi?", ["9’daki", "otobüse", "binsek", "mi", "?"]),
|
||||
("Okulumuz 3-B şubesi", ["Okulumuz", "3-B", "şubesi"]),
|
||||
("Okulumuz 3/B şubesi", ["Okulumuz", "3/B", "şubesi"]),
|
||||
("Okulumuz 3B şubesi", ["Okulumuz", "3B", "şubesi"]),
|
||||
("Okulumuz 3b şubesi", ["Okulumuz", "3b", "şubesi"]),
|
||||
("Antonio Gaudí 20. yüzyılda, 1904-1914 yılları arasında on yıl süren bir reform süreci getirmiştir.", ["Antonio", "Gaudí", "20.", "yüzyılda", ",", "1904", "-", "1914", "yılları", "arasında", "on", "yıl", "süren", "bir", "reform", "süreci", "getirmiştir", "."]),
|
||||
("Dizel yakıtın avro bölgesi ortalaması olan 1,165 avroya kıyasla litre başına 1,335 avroya mal olduğunu gösteriyor.", ["Dizel", "yakıtın", "avro", "bölgesi", "ortalaması", "olan", "1,165", "avroya", "kıyasla", "litre", "başına", "1,335", "avroya", "mal", "olduğunu", "gösteriyor", "."]),
|
||||
("Marcus Antonius M.Ö. 1 Ocak 49'da, Sezar'dan Vali'nin kendisini barış dostu ilan ettiği bir bildiri yayınlamıştır.", ["Marcus", "Antonius", "M.Ö.", "1", "Ocak", "49'da", ",", "Sezar'dan", "Vali'nin", "kendisini", "barış", "dostu", "ilan", "ettiği", "bir", "bildiri", "yayınlamıştır", "."])
|
||||
("Rakamla 6 yazılıydı.", ["Rakamla", "6", "yazılıydı", "."]),
|
||||
("Hava -4 dereceydi.", ["Hava", "-4", "dereceydi", "."]),
|
||||
(
|
||||
"Hava sıcaklığı -4ten +6ya yükseldi.",
|
||||
["Hava", "sıcaklığı", "-4ten", "+6ya", "yükseldi", "."],
|
||||
),
|
||||
(
|
||||
"Hava sıcaklığı -4'ten +6'ya yükseldi.",
|
||||
["Hava", "sıcaklığı", "-4'ten", "+6'ya", "yükseldi", "."],
|
||||
),
|
||||
("Yarışta 6. oldum.", ["Yarışta", "6.", "oldum", "."]),
|
||||
("Yarışta 438547745. oldum.", ["Yarışta", "438547745.", "oldum", "."]),
|
||||
("Kitap IV. Murat hakkında.", ["Kitap", "IV.", "Murat", "hakkında", "."]),
|
||||
# ("Bana söylediği sayı 6.", ["Bana", "söylediği", "sayı", "6", "."]),
|
||||
("Saat 6'da buluşalım.", ["Saat", "6'da", "buluşalım", "."]),
|
||||
("Saat 6dan sonra buluşalım.", ["Saat", "6dan", "sonra", "buluşalım", "."]),
|
||||
("6.dan sonra saymadım.", ["6.dan", "sonra", "saymadım", "."]),
|
||||
("6.'dan sonra saymadım.", ["6.'dan", "sonra", "saymadım", "."]),
|
||||
("Saat 6'ydı.", ["Saat", "6'ydı", "."]),
|
||||
("5'te", ["5'te"]),
|
||||
("6'da", ["6'da"]),
|
||||
("9dan", ["9dan"]),
|
||||
("19'da", ["19'da"]),
|
||||
("VI'da", ["VI'da"]),
|
||||
("5.", ["5."]),
|
||||
("72.", ["72."]),
|
||||
("VI.", ["VI."]),
|
||||
("6.'dan", ["6.'dan"]),
|
||||
("19.'dan", ["19.'dan"]),
|
||||
("6.dan", ["6.dan"]),
|
||||
("16.dan", ["16.dan"]),
|
||||
("VI.'dan", ["VI.'dan"]),
|
||||
("VI.dan", ["VI.dan"]),
|
||||
("Hepsi 1994 yılında oldu.", ["Hepsi", "1994", "yılında", "oldu", "."]),
|
||||
("Hepsi 1994'te oldu.", ["Hepsi", "1994'te", "oldu", "."]),
|
||||
(
|
||||
"2/3 tarihli faturayı bulamadım.",
|
||||
["2/3", "tarihli", "faturayı", "bulamadım", "."],
|
||||
),
|
||||
(
|
||||
"2.3 tarihli faturayı bulamadım.",
|
||||
["2.3", "tarihli", "faturayı", "bulamadım", "."],
|
||||
),
|
||||
(
|
||||
"2.3. tarihli faturayı bulamadım.",
|
||||
["2.3.", "tarihli", "faturayı", "bulamadım", "."],
|
||||
),
|
||||
(
|
||||
"2/3/2020 tarihli faturayı bulamadm.",
|
||||
["2/3/2020", "tarihli", "faturayı", "bulamadm", "."],
|
||||
),
|
||||
(
|
||||
"2/3/1987 tarihinden beri burda yaşıyorum.",
|
||||
["2/3/1987", "tarihinden", "beri", "burda", "yaşıyorum", "."],
|
||||
),
|
||||
(
|
||||
"2-3-1987 tarihinden beri burdayım.",
|
||||
["2-3-1987", "tarihinden", "beri", "burdayım", "."],
|
||||
),
|
||||
(
|
||||
"2.3.1987 tarihinden beri burdayım.",
|
||||
["2.3.1987", "tarihinden", "beri", "burdayım", "."],
|
||||
),
|
||||
(
|
||||
"Bu olay 2005-2006 tarihleri arasında oldu.",
|
||||
["Bu", "olay", "2005", "-", "2006", "tarihleri", "arasında", "oldu", "."],
|
||||
),
|
||||
(
|
||||
"Bu olay 4/12/2005-21/3/2006 tarihleri arasında oldu.",
|
||||
[
|
||||
"Bu",
|
||||
"olay",
|
||||
"4/12/2005",
|
||||
"-",
|
||||
"21/3/2006",
|
||||
"tarihleri",
|
||||
"arasında",
|
||||
"oldu",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Ek fıkra: 5/11/2003-4999/3 maddesine göre uygundur.",
|
||||
[
|
||||
"Ek",
|
||||
"fıkra",
|
||||
":",
|
||||
"5/11/2003",
|
||||
"-",
|
||||
"4999/3",
|
||||
"maddesine",
|
||||
"göre",
|
||||
"uygundur",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"2/A alanları: 6831 sayılı Kanunun 2nci maddesinin birinci fıkrasının (A) bendine göre",
|
||||
[
|
||||
"2/A",
|
||||
"alanları",
|
||||
":",
|
||||
"6831",
|
||||
"sayılı",
|
||||
"Kanunun",
|
||||
"2nci",
|
||||
"maddesinin",
|
||||
"birinci",
|
||||
"fıkrasının",
|
||||
"(",
|
||||
"A",
|
||||
")",
|
||||
"bendine",
|
||||
"göre",
|
||||
],
|
||||
),
|
||||
(
|
||||
"ŞEHİTTEĞMENKALMAZ Cad. No: 2/311",
|
||||
["ŞEHİTTEĞMENKALMAZ", "Cad.", "No", ":", "2/311"],
|
||||
),
|
||||
(
|
||||
"2-3-2025",
|
||||
[
|
||||
"2-3-2025",
|
||||
],
|
||||
),
|
||||
("2/3/2025", ["2/3/2025"]),
|
||||
("Yıllardır 0.5 uç kullanıyorum.", ["Yıllardır", "0.5", "uç", "kullanıyorum", "."]),
|
||||
(
|
||||
"Kan değerlerim 0.5-0.7 arasıydı.",
|
||||
["Kan", "değerlerim", "0.5", "-", "0.7", "arasıydı", "."],
|
||||
),
|
||||
("0.5", ["0.5"]),
|
||||
("1/2", ["1/2"]),
|
||||
("%1", ["%", "1"]),
|
||||
("%1lik", ["%", "1lik"]),
|
||||
("%1'lik", ["%", "1'lik"]),
|
||||
("%1lik dilim", ["%", "1lik", "dilim"]),
|
||||
("%1'lik dilim", ["%", "1'lik", "dilim"]),
|
||||
("%1.5", ["%", "1.5"]),
|
||||
# ("%1-%2 arası büyüme bekleniyor.", ["%", "1", "-", "%", "2", "arası", "büyüme", "bekleniyor", "."]),
|
||||
(
|
||||
"%1-2 arası büyüme bekliyoruz.",
|
||||
["%", "1", "-", "2", "arası", "büyüme", "bekliyoruz", "."],
|
||||
),
|
||||
(
|
||||
"%11-12 arası büyüme bekliyoruz.",
|
||||
["%", "11", "-", "12", "arası", "büyüme", "bekliyoruz", "."],
|
||||
),
|
||||
("%1.5luk büyüme bekliyoruz.", ["%", "1.5luk", "büyüme", "bekliyoruz", "."]),
|
||||
(
|
||||
"Saat 1-2 arası gelin lütfen.",
|
||||
["Saat", "1", "-", "2", "arası", "gelin", "lütfen", "."],
|
||||
),
|
||||
("Saat 15:30 gibi buluşalım.", ["Saat", "15:30", "gibi", "buluşalım", "."]),
|
||||
("Saat 15:30'da buluşalım.", ["Saat", "15:30'da", "buluşalım", "."]),
|
||||
("Saat 15.30'da buluşalım.", ["Saat", "15.30'da", "buluşalım", "."]),
|
||||
("Saat 15.30da buluşalım.", ["Saat", "15.30da", "buluşalım", "."]),
|
||||
("Saat 15 civarı buluşalım.", ["Saat", "15", "civarı", "buluşalım", "."]),
|
||||
("9’daki otobüse binsek mi?", ["9’daki", "otobüse", "binsek", "mi", "?"]),
|
||||
("Okulumuz 3-B şubesi", ["Okulumuz", "3-B", "şubesi"]),
|
||||
("Okulumuz 3/B şubesi", ["Okulumuz", "3/B", "şubesi"]),
|
||||
("Okulumuz 3B şubesi", ["Okulumuz", "3B", "şubesi"]),
|
||||
("Okulumuz 3b şubesi", ["Okulumuz", "3b", "şubesi"]),
|
||||
(
|
||||
"Antonio Gaudí 20. yüzyılda, 1904-1914 yılları arasında on yıl süren bir reform süreci getirmiştir.",
|
||||
[
|
||||
"Antonio",
|
||||
"Gaudí",
|
||||
"20.",
|
||||
"yüzyılda",
|
||||
",",
|
||||
"1904",
|
||||
"-",
|
||||
"1914",
|
||||
"yılları",
|
||||
"arasında",
|
||||
"on",
|
||||
"yıl",
|
||||
"süren",
|
||||
"bir",
|
||||
"reform",
|
||||
"süreci",
|
||||
"getirmiştir",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Dizel yakıtın avro bölgesi ortalaması olan 1,165 avroya kıyasla litre başına 1,335 avroya mal olduğunu gösteriyor.",
|
||||
[
|
||||
"Dizel",
|
||||
"yakıtın",
|
||||
"avro",
|
||||
"bölgesi",
|
||||
"ortalaması",
|
||||
"olan",
|
||||
"1,165",
|
||||
"avroya",
|
||||
"kıyasla",
|
||||
"litre",
|
||||
"başına",
|
||||
"1,335",
|
||||
"avroya",
|
||||
"mal",
|
||||
"olduğunu",
|
||||
"gösteriyor",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Marcus Antonius M.Ö. 1 Ocak 49'da, Sezar'dan Vali'nin kendisini barış dostu ilan ettiği bir bildiri yayınlamıştır.",
|
||||
[
|
||||
"Marcus",
|
||||
"Antonius",
|
||||
"M.Ö.",
|
||||
"1",
|
||||
"Ocak",
|
||||
"49'da",
|
||||
",",
|
||||
"Sezar'dan",
|
||||
"Vali'nin",
|
||||
"kendisini",
|
||||
"barış",
|
||||
"dostu",
|
||||
"ilan",
|
||||
"ettiği",
|
||||
"bir",
|
||||
"bildiri",
|
||||
"yayınlamıştır",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
PUNCT_TESTS = [
|
||||
("Gitmedim dedim ya!", ["Gitmedim", "dedim", "ya", "!"]),
|
||||
("Gitmedim dedim ya!!", ["Gitmedim", "dedim", "ya", "!", "!"]),
|
||||
("Gitsek mi?", ["Gitsek", "mi", "?"]),
|
||||
("Gitsek mi??", ["Gitsek", "mi", "?", "?"]),
|
||||
("Gitsek mi?!?", ["Gitsek", "mi", "?", "!", "?"]),
|
||||
("Ankara - Antalya arası otobüs işliyor.", ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."]),
|
||||
("Ankara-Antalya arası otobüs işliyor.", ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."]),
|
||||
("Sen--ben, ya da onlar.", ["Sen", "--", "ben", ",", "ya", "da", "onlar", "."]),
|
||||
("Senden, benden, bizden şarkısını biliyor musun?", ["Senden", ",", "benden", ",", "bizden", "şarkısını", "biliyor", "musun", "?"]),
|
||||
("Akif'le geldik, sonra da o ayrıldı.", ["Akif'le", "geldik", ",", "sonra", "da", "o", "ayrıldı", "."]),
|
||||
("Bu adam ne dedi şimdi???", ["Bu", "adam", "ne", "dedi", "şimdi", "?", "?", "?"]),
|
||||
("Yok hasta olmuş, yok annesi hastaymış, bahaneler işte...", ["Yok", "hasta", "olmuş", ",", "yok", "annesi", "hastaymış", ",", "bahaneler", "işte", "..."]),
|
||||
("Ankara'dan İstanbul'a ... bir aşk hikayesi.", ["Ankara'dan", "İstanbul'a", "...", "bir", "aşk", "hikayesi", "."]),
|
||||
("Ahmet'te", ["Ahmet'te"]),
|
||||
("İstanbul'da", ["İstanbul'da"]),
|
||||
("Gitmedim dedim ya!", ["Gitmedim", "dedim", "ya", "!"]),
|
||||
("Gitmedim dedim ya!!", ["Gitmedim", "dedim", "ya", "!", "!"]),
|
||||
("Gitsek mi?", ["Gitsek", "mi", "?"]),
|
||||
("Gitsek mi??", ["Gitsek", "mi", "?", "?"]),
|
||||
("Gitsek mi?!?", ["Gitsek", "mi", "?", "!", "?"]),
|
||||
(
|
||||
"Ankara - Antalya arası otobüs işliyor.",
|
||||
["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."],
|
||||
),
|
||||
(
|
||||
"Ankara-Antalya arası otobüs işliyor.",
|
||||
["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."],
|
||||
),
|
||||
("Sen--ben, ya da onlar.", ["Sen", "--", "ben", ",", "ya", "da", "onlar", "."]),
|
||||
(
|
||||
"Senden, benden, bizden şarkısını biliyor musun?",
|
||||
["Senden", ",", "benden", ",", "bizden", "şarkısını", "biliyor", "musun", "?"],
|
||||
),
|
||||
(
|
||||
"Akif'le geldik, sonra da o ayrıldı.",
|
||||
["Akif'le", "geldik", ",", "sonra", "da", "o", "ayrıldı", "."],
|
||||
),
|
||||
("Bu adam ne dedi şimdi???", ["Bu", "adam", "ne", "dedi", "şimdi", "?", "?", "?"]),
|
||||
(
|
||||
"Yok hasta olmuş, yok annesi hastaymış, bahaneler işte...",
|
||||
[
|
||||
"Yok",
|
||||
"hasta",
|
||||
"olmuş",
|
||||
",",
|
||||
"yok",
|
||||
"annesi",
|
||||
"hastaymış",
|
||||
",",
|
||||
"bahaneler",
|
||||
"işte",
|
||||
"...",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Ankara'dan İstanbul'a ... bir aşk hikayesi.",
|
||||
["Ankara'dan", "İstanbul'a", "...", "bir", "aşk", "hikayesi", "."],
|
||||
),
|
||||
("Ahmet'te", ["Ahmet'te"]),
|
||||
("İstanbul'da", ["İstanbul'da"]),
|
||||
]
|
||||
|
||||
GENERAL_TESTS = [
|
||||
("1914'teki Endurance seferinde, Sir Ernest Shackleton'ın kaptanlığını yaptığı İngiliz Endurance gemisi yirmi sekiz kişi ile Antarktika'yı geçmek üzere yelken açtı.", ["1914'teki", "Endurance", "seferinde", ",", "Sir", "Ernest", "Shackleton'ın", "kaptanlığını", "yaptığı", "İngiliz", "Endurance", "gemisi", "yirmi", "sekiz", "kişi", "ile", "Antarktika'yı", "geçmek", "üzere", "yelken", "açtı", "."]),
|
||||
("Danışılan \"%100 Cospedal\" olduğunu belirtti.", ["Danışılan", '"', "%", "100", "Cospedal", '"', "olduğunu", "belirtti", "."]),
|
||||
("1976'da parkur artık kullanılmıyordu; 1990'da ise bir yangın, daha sonraları ahırlarla birlikte yıkılacak olan tahta tribünlerden geri kalanları da yok etmişti.", ["1976'da", "parkur", "artık", "kullanılmıyordu", ";", "1990'da", "ise", "bir", "yangın", ",", "daha", "sonraları", "ahırlarla", "birlikte", "yıkılacak", "olan", "tahta", "tribünlerden", "geri", "kalanları", "da", "yok", "etmişti", "."]),
|
||||
("Dahiyane bir ameliyat ve zorlu bir rehabilitasyon sürecinden sonra, tamamen iyileştim.", ["Dahiyane", "bir", "ameliyat", "ve", "zorlu", "bir", "rehabilitasyon", "sürecinden", "sonra", ",", "tamamen", "iyileştim", "."]),
|
||||
("Yaklaşık iki hafta süren bireysel erken oy kullanma döneminin ardından 5,7 milyondan fazla Floridalı sandık başına gitti.", ["Yaklaşık", "iki", "hafta", "süren", "bireysel", "erken", "oy", "kullanma", "döneminin", "ardından", "5,7", "milyondan", "fazla", "Floridalı", "sandık", "başına", "gitti", "."]),
|
||||
("Ancak, bu ABD Çevre Koruma Ajansı'nın dünyayı bu konularda uyarmasının ardından ortaya çıktı.", ["Ancak", ",", "bu", "ABD", "Çevre", "Koruma", "Ajansı'nın", "dünyayı", "bu", "konularda", "uyarmasının", "ardından", "ortaya", "çıktı", "."]),
|
||||
("Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.", ["Ortalama", "şansa", "ve", "10.000", "Sterlin", "değerinde", "tahvillere", "sahip", "bir", "yatırımcı", "yılda", "125", "Sterlin", "ikramiye", "kazanabilir", "."]),
|
||||
("Granit adaları; Seyşeller ve Tioman ile Saint Helena gibi volkanik adaları kapsar." , ["Granit", "adaları", ";", "Seyşeller", "ve", "Tioman", "ile", "Saint", "Helena", "gibi", "volkanik", "adaları", "kapsar", "."]),
|
||||
("Barış antlaşmasıyla İspanya, Amerika'ya Porto Riko, Guam ve Filipinler kolonilerini devretti.", ["Barış", "antlaşmasıyla", "İspanya", ",", "Amerika'ya", "Porto", "Riko", ",", "Guam", "ve", "Filipinler", "kolonilerini", "devretti", "."]),
|
||||
("Makedonya\'nın sınır bölgelerini güvence altına alan Philip, büyük bir Makedon ordusu kurdu ve uzun bir fetih seferi için Trakya\'ya doğru yürüdü.", ["Makedonya\'nın", "sınır", "bölgelerini", "güvence", "altına", "alan", "Philip", ",", "büyük", "bir", "Makedon", "ordusu", "kurdu", "ve", "uzun", "bir", "fetih", "seferi", "için", "Trakya\'ya", "doğru", "yürüdü", "."]),
|
||||
("Fransız gazetesi Le Figaro'ya göre bu hükumet planı sayesinde 42 milyon Euro kazanç sağlanabilir ve elde edilen paranın 15.5 milyonu ulusal güvenlik için kullanılabilir.", ["Fransız", "gazetesi", "Le", "Figaro'ya", "göre", "bu", "hükumet", "planı", "sayesinde", "42", "milyon", "Euro", "kazanç", "sağlanabilir", "ve", "elde", "edilen", "paranın", "15.5", "milyonu", "ulusal", "güvenlik", "için", "kullanılabilir", "."]),
|
||||
("Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.", ["Ortalama", "şansa", "ve", "10.000", "Sterlin", "değerinde", "tahvillere", "sahip", "bir", "yatırımcı", "yılda", "125", "Sterlin", "ikramiye", "kazanabilir", "."]),
|
||||
("3 Kasım Salı günü, Ankara Belediye Başkanı 2014'te hükümetle birlikte oluşturulan kentsel gelişim anlaşmasını askıya alma kararı verdi.", ["3", "Kasım", "Salı", "günü", ",", "Ankara", "Belediye", "Başkanı", "2014'te", "hükümetle", "birlikte", "oluşturulan", "kentsel", "gelişim", "anlaşmasını", "askıya", "alma", "kararı", "verdi", "."]),
|
||||
("Stalin, Abakumov'u Beria'nın enerji bakanlıkları üzerindeki baskınlığına karşı MGB içinde kendi ağını kurmaya teşvik etmeye başlamıştı.", ["Stalin", ",", "Abakumov'u", "Beria'nın", "enerji", "bakanlıkları", "üzerindeki", "baskınlığına", "karşı", "MGB", "içinde", "kendi", "ağını", "kurmaya", "teşvik", "etmeye", "başlamıştı", "."]),
|
||||
("Güney Avrupa'daki kazı alanlarının çoğunluğu gibi, bu bulgu M.Ö. 5. yüzyılın başlar", ["Güney", "Avrupa'daki", "kazı", "alanlarının", "çoğunluğu", "gibi", ",", "bu", "bulgu", "M.Ö.", "5.", "yüzyılın", "başlar"]),
|
||||
("Sağlığın bozulması Hitchcock hayatının son yirmi yılında üretimini azalttı.", ["Sağlığın", "bozulması", "Hitchcock", "hayatının", "son", "yirmi", "yılında", "üretimini", "azalttı", "."]),
|
||||
(
|
||||
"1914'teki Endurance seferinde, Sir Ernest Shackleton'ın kaptanlığını yaptığı İngiliz Endurance gemisi yirmi sekiz kişi ile Antarktika'yı geçmek üzere yelken açtı.",
|
||||
[
|
||||
"1914'teki",
|
||||
"Endurance",
|
||||
"seferinde",
|
||||
",",
|
||||
"Sir",
|
||||
"Ernest",
|
||||
"Shackleton'ın",
|
||||
"kaptanlığını",
|
||||
"yaptığı",
|
||||
"İngiliz",
|
||||
"Endurance",
|
||||
"gemisi",
|
||||
"yirmi",
|
||||
"sekiz",
|
||||
"kişi",
|
||||
"ile",
|
||||
"Antarktika'yı",
|
||||
"geçmek",
|
||||
"üzere",
|
||||
"yelken",
|
||||
"açtı",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
'Danışılan "%100 Cospedal" olduğunu belirtti.',
|
||||
["Danışılan", '"', "%", "100", "Cospedal", '"', "olduğunu", "belirtti", "."],
|
||||
),
|
||||
(
|
||||
"1976'da parkur artık kullanılmıyordu; 1990'da ise bir yangın, daha sonraları ahırlarla birlikte yıkılacak olan tahta tribünlerden geri kalanları da yok etmişti.",
|
||||
[
|
||||
"1976'da",
|
||||
"parkur",
|
||||
"artık",
|
||||
"kullanılmıyordu",
|
||||
";",
|
||||
"1990'da",
|
||||
"ise",
|
||||
"bir",
|
||||
"yangın",
|
||||
",",
|
||||
"daha",
|
||||
"sonraları",
|
||||
"ahırlarla",
|
||||
"birlikte",
|
||||
"yıkılacak",
|
||||
"olan",
|
||||
"tahta",
|
||||
"tribünlerden",
|
||||
"geri",
|
||||
"kalanları",
|
||||
"da",
|
||||
"yok",
|
||||
"etmişti",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Dahiyane bir ameliyat ve zorlu bir rehabilitasyon sürecinden sonra, tamamen iyileştim.",
|
||||
[
|
||||
"Dahiyane",
|
||||
"bir",
|
||||
"ameliyat",
|
||||
"ve",
|
||||
"zorlu",
|
||||
"bir",
|
||||
"rehabilitasyon",
|
||||
"sürecinden",
|
||||
"sonra",
|
||||
",",
|
||||
"tamamen",
|
||||
"iyileştim",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Yaklaşık iki hafta süren bireysel erken oy kullanma döneminin ardından 5,7 milyondan fazla Floridalı sandık başına gitti.",
|
||||
[
|
||||
"Yaklaşık",
|
||||
"iki",
|
||||
"hafta",
|
||||
"süren",
|
||||
"bireysel",
|
||||
"erken",
|
||||
"oy",
|
||||
"kullanma",
|
||||
"döneminin",
|
||||
"ardından",
|
||||
"5,7",
|
||||
"milyondan",
|
||||
"fazla",
|
||||
"Floridalı",
|
||||
"sandık",
|
||||
"başına",
|
||||
"gitti",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Ancak, bu ABD Çevre Koruma Ajansı'nın dünyayı bu konularda uyarmasının ardından ortaya çıktı.",
|
||||
[
|
||||
"Ancak",
|
||||
",",
|
||||
"bu",
|
||||
"ABD",
|
||||
"Çevre",
|
||||
"Koruma",
|
||||
"Ajansı'nın",
|
||||
"dünyayı",
|
||||
"bu",
|
||||
"konularda",
|
||||
"uyarmasının",
|
||||
"ardından",
|
||||
"ortaya",
|
||||
"çıktı",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.",
|
||||
[
|
||||
"Ortalama",
|
||||
"şansa",
|
||||
"ve",
|
||||
"10.000",
|
||||
"Sterlin",
|
||||
"değerinde",
|
||||
"tahvillere",
|
||||
"sahip",
|
||||
"bir",
|
||||
"yatırımcı",
|
||||
"yılda",
|
||||
"125",
|
||||
"Sterlin",
|
||||
"ikramiye",
|
||||
"kazanabilir",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Granit adaları; Seyşeller ve Tioman ile Saint Helena gibi volkanik adaları kapsar.",
|
||||
[
|
||||
"Granit",
|
||||
"adaları",
|
||||
";",
|
||||
"Seyşeller",
|
||||
"ve",
|
||||
"Tioman",
|
||||
"ile",
|
||||
"Saint",
|
||||
"Helena",
|
||||
"gibi",
|
||||
"volkanik",
|
||||
"adaları",
|
||||
"kapsar",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Barış antlaşmasıyla İspanya, Amerika'ya Porto Riko, Guam ve Filipinler kolonilerini devretti.",
|
||||
[
|
||||
"Barış",
|
||||
"antlaşmasıyla",
|
||||
"İspanya",
|
||||
",",
|
||||
"Amerika'ya",
|
||||
"Porto",
|
||||
"Riko",
|
||||
",",
|
||||
"Guam",
|
||||
"ve",
|
||||
"Filipinler",
|
||||
"kolonilerini",
|
||||
"devretti",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Makedonya'nın sınır bölgelerini güvence altına alan Philip, büyük bir Makedon ordusu kurdu ve uzun bir fetih seferi için Trakya'ya doğru yürüdü.",
|
||||
[
|
||||
"Makedonya'nın",
|
||||
"sınır",
|
||||
"bölgelerini",
|
||||
"güvence",
|
||||
"altına",
|
||||
"alan",
|
||||
"Philip",
|
||||
",",
|
||||
"büyük",
|
||||
"bir",
|
||||
"Makedon",
|
||||
"ordusu",
|
||||
"kurdu",
|
||||
"ve",
|
||||
"uzun",
|
||||
"bir",
|
||||
"fetih",
|
||||
"seferi",
|
||||
"için",
|
||||
"Trakya'ya",
|
||||
"doğru",
|
||||
"yürüdü",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Fransız gazetesi Le Figaro'ya göre bu hükumet planı sayesinde 42 milyon Euro kazanç sağlanabilir ve elde edilen paranın 15.5 milyonu ulusal güvenlik için kullanılabilir.",
|
||||
[
|
||||
"Fransız",
|
||||
"gazetesi",
|
||||
"Le",
|
||||
"Figaro'ya",
|
||||
"göre",
|
||||
"bu",
|
||||
"hükumet",
|
||||
"planı",
|
||||
"sayesinde",
|
||||
"42",
|
||||
"milyon",
|
||||
"Euro",
|
||||
"kazanç",
|
||||
"sağlanabilir",
|
||||
"ve",
|
||||
"elde",
|
||||
"edilen",
|
||||
"paranın",
|
||||
"15.5",
|
||||
"milyonu",
|
||||
"ulusal",
|
||||
"güvenlik",
|
||||
"için",
|
||||
"kullanılabilir",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.",
|
||||
[
|
||||
"Ortalama",
|
||||
"şansa",
|
||||
"ve",
|
||||
"10.000",
|
||||
"Sterlin",
|
||||
"değerinde",
|
||||
"tahvillere",
|
||||
"sahip",
|
||||
"bir",
|
||||
"yatırımcı",
|
||||
"yılda",
|
||||
"125",
|
||||
"Sterlin",
|
||||
"ikramiye",
|
||||
"kazanabilir",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"3 Kasım Salı günü, Ankara Belediye Başkanı 2014'te hükümetle birlikte oluşturulan kentsel gelişim anlaşmasını askıya alma kararı verdi.",
|
||||
[
|
||||
"3",
|
||||
"Kasım",
|
||||
"Salı",
|
||||
"günü",
|
||||
",",
|
||||
"Ankara",
|
||||
"Belediye",
|
||||
"Başkanı",
|
||||
"2014'te",
|
||||
"hükümetle",
|
||||
"birlikte",
|
||||
"oluşturulan",
|
||||
"kentsel",
|
||||
"gelişim",
|
||||
"anlaşmasını",
|
||||
"askıya",
|
||||
"alma",
|
||||
"kararı",
|
||||
"verdi",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Stalin, Abakumov'u Beria'nın enerji bakanlıkları üzerindeki baskınlığına karşı MGB içinde kendi ağını kurmaya teşvik etmeye başlamıştı.",
|
||||
[
|
||||
"Stalin",
|
||||
",",
|
||||
"Abakumov'u",
|
||||
"Beria'nın",
|
||||
"enerji",
|
||||
"bakanlıkları",
|
||||
"üzerindeki",
|
||||
"baskınlığına",
|
||||
"karşı",
|
||||
"MGB",
|
||||
"içinde",
|
||||
"kendi",
|
||||
"ağını",
|
||||
"kurmaya",
|
||||
"teşvik",
|
||||
"etmeye",
|
||||
"başlamıştı",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Güney Avrupa'daki kazı alanlarının çoğunluğu gibi, bu bulgu M.Ö. 5. yüzyılın başlar",
|
||||
[
|
||||
"Güney",
|
||||
"Avrupa'daki",
|
||||
"kazı",
|
||||
"alanlarının",
|
||||
"çoğunluğu",
|
||||
"gibi",
|
||||
",",
|
||||
"bu",
|
||||
"bulgu",
|
||||
"M.Ö.",
|
||||
"5.",
|
||||
"yüzyılın",
|
||||
"başlar",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Sağlığın bozulması Hitchcock hayatının son yirmi yılında üretimini azalttı.",
|
||||
[
|
||||
"Sağlığın",
|
||||
"bozulması",
|
||||
"Hitchcock",
|
||||
"hayatının",
|
||||
"son",
|
||||
"yirmi",
|
||||
"yılında",
|
||||
"üretimini",
|
||||
"azalttı",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
|
||||
TESTS = (ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS)
|
||||
|
||||
TESTS = ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", TESTS)
|
||||
|
@ -149,4 +696,3 @@ def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens):
|
|||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
print(token_list)
|
||||
assert expected_tokens == token_list
|
||||
|
||||
|
|
|
@@ -89,7 +89,6 @@ def test_uk_tokenizer_splits_open_appostrophe(uk_tokenizer, text):
     assert tokens[0].text == "'"


-@pytest.mark.skip(reason="See Issue #3327 and PR #3329")
 @pytest.mark.parametrize("text", ["Тест''"])
 def test_uk_tokenizer_splits_double_end_quote(uk_tokenizer, text):
     tokens = uk_tokenizer(text)
|
|
@ -7,7 +7,6 @@ from spacy.tokens import Doc
|
|||
from spacy.pipeline._parser_internals.nonproj import projectivize
|
||||
from spacy.pipeline._parser_internals.arc_eager import ArcEager
|
||||
from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
|
||||
from spacy.pipeline._parser_internals.stateclass import StateClass
|
||||
|
||||
|
||||
def get_sequence_costs(M, words, heads, deps, transitions):
|
||||
|
@ -59,7 +58,7 @@ def test_oracle_four_words(arc_eager, vocab):
|
|||
["S"],
|
||||
["L-left"],
|
||||
["S"],
|
||||
["D"]
|
||||
["D"],
|
||||
]
|
||||
assert state.is_final()
|
||||
for i, state_costs in enumerate(cost_history):
|
||||
|
@ -185,9 +184,9 @@ def test_oracle_dev_sentence(vocab, arc_eager):
|
|||
"L-nn", # Attach 'Cars' to 'Inc.'
|
||||
"L-nn", # Attach 'Motor' to 'Inc.'
|
||||
"L-nn", # Attach 'Rolls-Royce' to 'Inc.'
|
||||
"S", # Shift "Inc."
|
||||
"S", # Shift "Inc."
|
||||
"L-nsubj", # Attach 'Inc.' to 'said'
|
||||
"S", # Shift 'said'
|
||||
"S", # Shift 'said'
|
||||
"S", # Shift 'it'
|
||||
"L-nsubj", # Attach 'it.' to 'expects'
|
||||
"R-ccomp", # Attach 'expects' to 'said'
|
||||
|
@ -251,7 +250,7 @@ def test_oracle_bad_tokenization(vocab, arc_eager):
|
|||
is root is
|
||||
bad comp is
|
||||
"""
|
||||
|
||||
|
||||
gold_words = []
|
||||
gold_deps = []
|
||||
gold_heads = []
|
||||
|
@ -268,7 +267,9 @@ def test_oracle_bad_tokenization(vocab, arc_eager):
|
|||
arc_eager.add_action(2, dep) # Left
|
||||
arc_eager.add_action(3, dep) # Right
|
||||
reference = Doc(Vocab(), words=gold_words, deps=gold_deps, heads=gold_heads)
|
||||
predicted = Doc(reference.vocab, words=["[", "catalase", "]", ":", "that", "is", "bad"])
|
||||
predicted = Doc(
|
||||
reference.vocab, words=["[", "catalase", "]", ":", "that", "is", "bad"]
|
||||
)
|
||||
example = Example(predicted=predicted, reference=reference)
|
||||
ae_oracle_actions = arc_eager.get_oracle_sequence(example, _debug=False)
|
||||
ae_oracle_actions = [arc_eager.get_class_name(i) for i in ae_oracle_actions]
|
||||
|
|
|
@@ -301,11 +301,9 @@ def test_block_ner():
     assert [token.ent_type_ for token in doc] == expected_types


-@pytest.mark.parametrize(
-    "use_upper", [True, False]
-)
+@pytest.mark.parametrize("use_upper", [True, False])
 def test_overfitting_IO(use_upper):
-    # Simple test to try and quickly overfit the NER component - ensuring the ML models work correctly
+    # Simple test to try and quickly overfit the NER component
     nlp = English()
     ner = nlp.add_pipe("ner", config={"model": {"use_upper": use_upper}})
     train_examples = []
|
@ -361,6 +359,84 @@ def test_overfitting_IO(use_upper):
|
|||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
||||
|
||||
def test_beam_ner_scores():
|
||||
# Test that we can get confidence values out of the beam_ner pipe
|
||||
beam_width = 16
|
||||
beam_density = 0.0001
|
||||
nlp = English()
|
||||
config = {
|
||||
"beam_width": beam_width,
|
||||
"beam_density": beam_density,
|
||||
}
|
||||
ner = nlp.add_pipe("beam_ner", config=config)
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
for ent in annotations.get("entities"):
|
||||
ner.add_label(ent[2])
|
||||
optimizer = nlp.initialize()
|
||||
|
||||
# update once
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
|
||||
# test the scores from the beam
|
||||
test_text = "I like London."
|
||||
doc = nlp.make_doc(test_text)
|
||||
docs = [doc]
|
||||
beams = ner.predict(docs)
|
||||
entity_scores = ner.scored_ents(beams)[0]
|
||||
|
||||
for j in range(len(doc)):
|
||||
for label in ner.labels:
|
||||
score = entity_scores[(j, j+1, label)]
|
||||
eps = 0.00001
|
||||
assert 0 - eps <= score <= 1 + eps
|
||||
|
||||
|
||||
def test_beam_overfitting_IO():
|
||||
# Simple test to try and quickly overfit the Beam NER component
|
||||
nlp = English()
|
||||
beam_width = 16
|
||||
beam_density = 0.0001
|
||||
config = {
|
||||
"beam_width": beam_width,
|
||||
"beam_density": beam_density,
|
||||
}
|
||||
ner = nlp.add_pipe("beam_ner", config=config)
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
for ent in annotations.get("entities"):
|
||||
ner.add_label(ent[2])
|
||||
optimizer = nlp.initialize()
|
||||
|
||||
# run overfitting
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["beam_ner"] < 0.0001
|
||||
|
||||
# test the scores from the beam
|
||||
test_text = "I like London."
|
||||
docs = [nlp.make_doc(test_text)]
|
||||
beams = ner.predict(docs)
|
||||
entity_scores = ner.scored_ents(beams)[0]
|
||||
assert entity_scores[(2, 3, "LOC")] == 1.0
|
||||
assert entity_scores[(2, 3, "PERSON")] == 0.0
|
||||
|
||||
# Also test the results are still the same after IO
|
||||
with make_tempdir() as tmp_dir:
|
||||
nlp.to_disk(tmp_dir)
|
||||
nlp2 = util.load_model_from_path(tmp_dir)
|
||||
docs2 = [nlp2.make_doc(test_text)]
|
||||
ner2 = nlp2.get_pipe("beam_ner")
|
||||
beams2 = ner2.predict(docs2)
|
||||
entity_scores2 = ner2.scored_ents(beams2)[0]
|
||||
assert entity_scores2[(2, 3, "LOC")] == 1.0
|
||||
assert entity_scores2[(2, 3, "PERSON")] == 0.0
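
A condensed sketch, not from the diff, of how the beam confidences exercised by the two tests above can be consumed in application code. It assumes a pipeline whose "beam_ner" component has already been trained; the helper name and the 0.5 cutoff are illustrative.

# Sketch: reading span confidences from a trained "beam_ner" pipe.
def entity_confidences(nlp, text):
    ner = nlp.get_pipe("beam_ner")
    doc = nlp.make_doc(text)
    beams = ner.predict([doc])
    scores = ner.scored_ents(beams)[0]  # {(start, end, label): probability}
    # Keep only spans the beam considers reasonably likely
    return {key: p for key, p in scores.items() if p >= 0.5}
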
|
||||
|
||||
|
||||
def test_ner_warns_no_lookups(caplog):
|
||||
nlp = English()
|
||||
assert nlp.lang in util.LEXEME_NORM_LANGS
|
||||
|
|
|
@ -1,13 +1,9 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import hypothesis
|
||||
import hypothesis.strategies
|
||||
import numpy
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.language import Language
|
||||
from spacy.pipeline import DependencyParser
|
||||
from spacy.pipeline._parser_internals.arc_eager import ArcEager
|
||||
from spacy.tokens import Doc
|
||||
from spacy.pipeline._parser_internals._beam_utils import BeamBatch
|
||||
|
@ -44,7 +40,7 @@ def docs(vocab):
|
|||
words=["Rats", "bite", "things"],
|
||||
heads=[1, 1, 1],
|
||||
deps=["nsubj", "ROOT", "dobj"],
|
||||
sent_starts=[True, False, False]
|
||||
sent_starts=[True, False, False],
|
||||
)
|
||||
]
|
||||
|
||||
|
@ -77,10 +73,12 @@ def batch_size(docs):
|
|||
def beam_width():
|
||||
return 4
|
||||
|
||||
|
||||
@pytest.fixture(params=[0.0, 0.5, 1.0])
|
||||
def beam_density(request):
|
||||
return request.param
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def vector_size():
|
||||
return 6
|
||||
|
@ -100,7 +98,9 @@ def scores(moves, batch_size, beam_width):
|
|||
numpy.random.uniform(-0.1, 0.1, (beam_width, moves.n_moves))
|
||||
for _ in range(batch_size)
|
||||
]
|
||||
), dtype="float32")
|
||||
),
|
||||
dtype="float32",
|
||||
)
|
||||
|
||||
|
||||
def test_create_beam(beam):
|
||||
|
@ -128,8 +128,6 @@ def test_beam_parse(examples, beam_width):
|
|||
parser(doc)
|
||||
|
||||
|
||||
|
||||
|
||||
@hypothesis.given(hyp=hypothesis.strategies.data())
|
||||
def test_beam_density(moves, examples, beam_width, hyp):
|
||||
beam_density = float(hyp.draw(hypothesis.strategies.floats(0.0, 1.0, width=32)))
|
||||
|
|
|
@@ -28,6 +28,26 @@ TRAIN_DATA = [
]


CONFLICTING_DATA = [
    (
        "I like London and Berlin.",
        {
            "heads": [1, 1, 1, 2, 2, 1],
            "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
        },
    ),
    (
        "I like London and Berlin.",
        {
            "heads": [0, 0, 0, 0, 0, 0],
            "deps": ["ROOT", "nsubj", "nsubj", "cc", "conj", "punct"],
        },
    ),
]

eps = 0.01


def test_parser_root(en_vocab):
    words = ["i", "do", "n't", "have", "other", "assistance"]
    heads = [3, 3, 3, 3, 5, 3]
@@ -185,26 +205,31 @@ def test_parser_set_sent_starts(en_vocab):
            assert token.head in sent


def test_overfitting_IO():
    # Simple test to try and quickly overfit the dependency parser - ensuring the ML models work correctly
@pytest.mark.parametrize("pipe_name", ["parser", "beam_parser"])
def test_overfitting_IO(pipe_name):
    # Simple test to try and quickly overfit the dependency parser (normal or beam)
    nlp = English()
    parser = nlp.add_pipe("parser")
    parser = nlp.add_pipe(pipe_name)
    train_examples = []
    for text, annotations in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
        for dep in annotations.get("deps", []):
            parser.add_label(dep)
    optimizer = nlp.initialize()
    for i in range(100):
    # run overfitting
    for i in range(150):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)
    assert losses["parser"] < 0.0001
    assert losses[pipe_name] < 0.0001
    # test the trained model
    test_text = "I like securities."
    doc = nlp(test_text)
    assert doc[0].dep_ == "nsubj"
    assert doc[2].dep_ == "dobj"
    assert doc[3].dep_ == "punct"
    assert doc[0].head.i == 1
    assert doc[2].head.i == 1
    assert doc[3].head.i == 1
    # Also test the results are still the same after IO
    with make_tempdir() as tmp_dir:
        nlp.to_disk(tmp_dir)
@@ -213,6 +238,9 @@ def test_overfitting_IO():
        assert doc2[0].dep_ == "nsubj"
        assert doc2[2].dep_ == "dobj"
        assert doc2[3].dep_ == "punct"
        assert doc2[0].head.i == 1
        assert doc2[2].head.i == 1
        assert doc2[3].head.i == 1

    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
    texts = [
@@ -226,3 +254,123 @@ def test_overfitting_IO():
    no_batch_deps = [doc.to_array([DEP]) for doc in [nlp(text) for text in texts]]
    assert_equal(batch_deps_1, batch_deps_2)
    assert_equal(batch_deps_1, no_batch_deps)


def test_beam_parser_scores():
    # Test that we can get confidence values out of the beam_parser pipe
    beam_width = 16
    beam_density = 0.0001
    nlp = English()
    config = {
        "beam_width": beam_width,
        "beam_density": beam_density,
    }
    parser = nlp.add_pipe("beam_parser", config=config)
    train_examples = []
    for text, annotations in CONFLICTING_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
        for dep in annotations.get("deps", []):
            parser.add_label(dep)
    optimizer = nlp.initialize()

    # update a bit with conflicting data
    for i in range(10):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)

    # test the scores from the beam
    test_text = "I like securities."
    doc = nlp.make_doc(test_text)
    docs = [doc]
    beams = parser.predict(docs)
    head_scores, label_scores = parser.scored_parses(beams)

    for j in range(len(doc)):
        for label in parser.labels:
            label_score = label_scores[0][(j, label)]
            assert 0 - eps <= label_score <= 1 + eps
        for i in range(len(doc)):
            head_score = head_scores[0][(j, i)]
            assert 0 - eps <= head_score <= 1 + eps


def test_beam_overfitting_IO():
    # Simple test to try and quickly overfit the Beam dependency parser
    nlp = English()
    beam_width = 16
    beam_density = 0.0001
    config = {
        "beam_width": beam_width,
        "beam_density": beam_density,
    }
    parser = nlp.add_pipe("beam_parser", config=config)
    train_examples = []
    for text, annotations in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
        for dep in annotations.get("deps", []):
            parser.add_label(dep)
    optimizer = nlp.initialize()
    # run overfitting
    for i in range(150):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)
    assert losses["beam_parser"] < 0.0001
    # test the scores from the beam
    test_text = "I like securities."
    docs = [nlp.make_doc(test_text)]
    beams = parser.predict(docs)
    head_scores, label_scores = parser.scored_parses(beams)
    # we only processed one document
    head_scores = head_scores[0]
    label_scores = label_scores[0]
    # test label annotations: 0=nsubj, 2=dobj, 3=punct
    assert label_scores[(0, "nsubj")] == pytest.approx(1.0, eps)
    assert label_scores[(0, "dobj")] == pytest.approx(0.0, eps)
    assert label_scores[(0, "punct")] == pytest.approx(0.0, eps)
    assert label_scores[(2, "nsubj")] == pytest.approx(0.0, eps)
    assert label_scores[(2, "dobj")] == pytest.approx(1.0, eps)
    assert label_scores[(2, "punct")] == pytest.approx(0.0, eps)
    assert label_scores[(3, "nsubj")] == pytest.approx(0.0, eps)
    assert label_scores[(3, "dobj")] == pytest.approx(0.0, eps)
    assert label_scores[(3, "punct")] == pytest.approx(1.0, eps)
    # test head annotations: the root is token at index 1
    assert head_scores[(0, 0)] == pytest.approx(0.0, eps)
    assert head_scores[(0, 1)] == pytest.approx(1.0, eps)
    assert head_scores[(0, 2)] == pytest.approx(0.0, eps)
    assert head_scores[(2, 0)] == pytest.approx(0.0, eps)
    assert head_scores[(2, 1)] == pytest.approx(1.0, eps)
    assert head_scores[(2, 2)] == pytest.approx(0.0, eps)
    assert head_scores[(3, 0)] == pytest.approx(0.0, eps)
    assert head_scores[(3, 1)] == pytest.approx(1.0, eps)
    assert head_scores[(3, 2)] == pytest.approx(0.0, eps)

    # Also test the results are still the same after IO
    with make_tempdir() as tmp_dir:
        nlp.to_disk(tmp_dir)
        nlp2 = util.load_model_from_path(tmp_dir)
        docs2 = [nlp2.make_doc(test_text)]
        parser2 = nlp2.get_pipe("beam_parser")
        beams2 = parser2.predict(docs2)
        head_scores2, label_scores2 = parser2.scored_parses(beams2)
        # we only processed one document
        head_scores2 = head_scores2[0]
        label_scores2 = label_scores2[0]
        # check the results again
        assert label_scores2[(0, "nsubj")] == pytest.approx(1.0, eps)
        assert label_scores2[(0, "dobj")] == pytest.approx(0.0, eps)
        assert label_scores2[(0, "punct")] == pytest.approx(0.0, eps)
        assert label_scores2[(2, "nsubj")] == pytest.approx(0.0, eps)
        assert label_scores2[(2, "dobj")] == pytest.approx(1.0, eps)
        assert label_scores2[(2, "punct")] == pytest.approx(0.0, eps)
        assert label_scores2[(3, "nsubj")] == pytest.approx(0.0, eps)
        assert label_scores2[(3, "dobj")] == pytest.approx(0.0, eps)
        assert label_scores2[(3, "punct")] == pytest.approx(1.0, eps)
        assert head_scores2[(0, 0)] == pytest.approx(0.0, eps)
        assert head_scores2[(0, 1)] == pytest.approx(1.0, eps)
        assert head_scores2[(0, 2)] == pytest.approx(0.0, eps)
        assert head_scores2[(2, 0)] == pytest.approx(0.0, eps)
        assert head_scores2[(2, 1)] == pytest.approx(1.0, eps)
        assert head_scores2[(2, 2)] == pytest.approx(0.0, eps)
        assert head_scores2[(3, 0)] == pytest.approx(0.0, eps)
        assert head_scores2[(3, 1)] == pytest.approx(1.0, eps)
        assert head_scores2[(3, 2)] == pytest.approx(0.0, eps)
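Read as documentation, scored_parses returns two lists with one dict per doc: head scores keyed by (token index, candidate head index) and label scores keyed by (token index, dependency label). Below is a small sketch, not part of the diff, of reducing that output to a best head and label per token; it assumes a trained "beam_parser" pipe and the function name is made up for illustration.

def most_likely_parse(nlp, text):
    # Hypothetical helper (sketch only): summarize beam parser confidences.
    parser = nlp.get_pipe("beam_parser")
    doc = nlp.make_doc(text)
    head_scores, label_scores = parser.scored_parses(parser.predict([doc]))
    head_scores, label_scores = head_scores[0], label_scores[0]  # one doc
    summary = {}
    for j, token in enumerate(doc):
        # pick the highest-scoring candidate head and dependency label per token
        best_head = max(range(len(doc)), key=lambda i: head_scores.get((j, i), 0.0))
        best_label = max(parser.labels, key=lambda lab: label_scores.get((j, lab), 0.0))
        summary[token.text] = (best_head, best_label)
    return summary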
@@ -4,14 +4,17 @@ from spacy.tokens.doc import Doc
from spacy.vocab import Vocab
from spacy.pipeline._parser_internals.stateclass import StateClass


@pytest.fixture
def vocab():
    return Vocab()


@pytest.fixture
def doc(vocab):
    return Doc(vocab, words=["a", "b", "c", "d"])


def test_init_state(doc):
    state = StateClass(doc)
    assert state.stack == []

@@ -19,6 +22,7 @@ def test_init_state(doc):
    assert not state.is_final()
    assert state.buffer_length() == 4


def test_push_pop(doc):
    state = StateClass(doc)
    state.push()

@@ -33,6 +37,7 @@ def test_push_pop(doc):
    assert state.stack == [0]
    assert 1 not in state.queue


def test_stack_depth(doc):
    state = StateClass(doc)
    assert state.stack_depth() == 0
@@ -161,7 +161,7 @@ def test_attributeruler_score(nlp, pattern_dicts):
    # "cat" is the only correct lemma
    assert scores["lemma_acc"] == pytest.approx(0.2)
    # no morphs are set
    assert scores["morph_acc"] == None
    assert scores["morph_acc"] is None


def test_attributeruler_rule_order(nlp):
@@ -201,13 +201,9 @@ def test_entity_ruler_overlapping_spans(nlp):

@pytest.mark.parametrize("n_process", [1, 2])
def test_entity_ruler_multiprocessing(nlp, n_process):
    texts = [
        "I enjoy eating Pizza Hut pizza."
    ]
    texts = ["I enjoy eating Pizza Hut pizza."]

    patterns = [
        {"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}
    ]
    patterns = [{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}]

    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)
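The multiprocessing test above only varies n_process; the feature it exercises boils down to the n_process argument of Language.pipe. As a rough usage sketch (assuming a pipeline `nlp` that already contains the entity_ruler with the FASTFOOD pattern added):

texts = ["I enjoy eating Pizza Hut pizza."] * 10
for doc in nlp.pipe(texts, n_process=2):
    # each worker process applies the same entity_ruler patterns
    print([(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents])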
@@ -159,8 +159,12 @@ def test_pipe_class_component_model():
        "model": {
            "@architectures": "spacy.TextCatEnsemble.v2",
            "tok2vec": DEFAULT_TOK2VEC_MODEL,
            "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1,
                             "no_output_layer": False},
            "linear_model": {
                "@architectures": "spacy.TextCatBOW.v1",
                "exclusive_classes": False,
                "ngram_size": 1,
                "no_output_layer": False,
            },
        },
        "value1": 10,
    }
@@ -37,7 +37,16 @@ TRAIN_DATA = [
]

PARTIAL_DATA = [
    # partial annotation
    ("I like green eggs", {"tags": ["", "V", "J", ""]}),
    # misaligned partial annotation
    (
        "He hates green eggs",
        {
            "words": ["He", "hate", "s", "green", "eggs"],
            "tags": ["", "V", "S", "J", ""],
        },
    ),
]


@@ -126,6 +135,7 @@ def test_incomplete_data():
    assert doc[1].tag_ is "V"
    assert doc[2].tag_ is "J"


def test_overfitting_IO():
    # Simple test to try and quickly overfit the tagger - ensuring the ML models work correctly
    nlp = English()
@ -15,15 +15,31 @@ from spacy.training import Example
|
|||
from ..util import make_tempdir
|
||||
|
||||
|
||||
TRAIN_DATA = [
|
||||
TRAIN_DATA_SINGLE_LABEL = [
|
||||
("I'm so happy.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
|
||||
("I'm so angry", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
|
||||
]
|
||||
|
||||
TRAIN_DATA_MULTI_LABEL = [
|
||||
("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
|
||||
("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
|
||||
]
|
||||
|
||||
def make_get_examples(nlp):
|
||||
|
||||
def make_get_examples_single_label(nlp):
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA:
|
||||
for t in TRAIN_DATA_SINGLE_LABEL:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
|
||||
def get_examples():
|
||||
return train_examples
|
||||
|
||||
return get_examples
|
||||
|
||||
|
||||
def make_get_examples_multi_label(nlp):
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA_MULTI_LABEL:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
|
||||
def get_examples():
|
||||
|
@ -85,49 +101,75 @@ def test_textcat_learns_multilabel():
|
|||
assert score > 0.5
|
||||
|
||||
|
||||
def test_label_types():
|
||||
@pytest.mark.parametrize("name", ["textcat", "textcat_multilabel"])
|
||||
def test_label_types(name):
|
||||
nlp = Language()
|
||||
textcat = nlp.add_pipe("textcat")
|
||||
textcat = nlp.add_pipe(name)
|
||||
textcat.add_label("answer")
|
||||
with pytest.raises(ValueError):
|
||||
textcat.add_label(9)
|
||||
|
||||
|
||||
def test_no_label():
|
||||
@pytest.mark.parametrize("name", ["textcat", "textcat_multilabel"])
|
||||
def test_no_label(name):
|
||||
nlp = Language()
|
||||
nlp.add_pipe("textcat")
|
||||
nlp.add_pipe(name)
|
||||
with pytest.raises(ValueError):
|
||||
nlp.initialize()
|
||||
|
||||
|
||||
def test_implicit_label():
|
||||
@pytest.mark.parametrize(
|
||||
"name,get_examples",
|
||||
[
|
||||
("textcat", make_get_examples_single_label),
|
||||
("textcat_multilabel", make_get_examples_multi_label),
|
||||
],
|
||||
)
|
||||
def test_implicit_label(name, get_examples):
|
||||
nlp = Language()
|
||||
nlp.add_pipe("textcat")
|
||||
nlp.initialize(get_examples=make_get_examples(nlp))
|
||||
nlp.add_pipe(name)
|
||||
nlp.initialize(get_examples=get_examples(nlp))
|
||||
|
||||
|
||||
def test_no_resize():
|
||||
@pytest.mark.parametrize("name", ["textcat", "textcat_multilabel"])
|
||||
def test_no_resize(name):
|
||||
nlp = Language()
|
||||
textcat = nlp.add_pipe("textcat")
|
||||
textcat = nlp.add_pipe(name)
|
||||
textcat.add_label("POSITIVE")
|
||||
textcat.add_label("NEGATIVE")
|
||||
nlp.initialize()
|
||||
assert textcat.model.get_dim("nO") == 2
|
||||
assert textcat.model.get_dim("nO") >= 2
|
||||
# this throws an error because the textcat can't be resized after initialization
|
||||
with pytest.raises(ValueError):
|
||||
textcat.add_label("NEUTRAL")
|
||||
|
||||
|
||||
def test_initialize_examples():
|
||||
def test_error_with_multi_labels():
|
||||
nlp = Language()
|
||||
textcat = nlp.add_pipe("textcat")
|
||||
for text, annotations in TRAIN_DATA:
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA_MULTI_LABEL:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
with pytest.raises(ValueError):
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"name,get_examples, train_data",
|
||||
[
|
||||
("textcat", make_get_examples_single_label, TRAIN_DATA_SINGLE_LABEL),
|
||||
("textcat_multilabel", make_get_examples_multi_label, TRAIN_DATA_MULTI_LABEL),
|
||||
],
|
||||
)
|
||||
def test_initialize_examples(name, get_examples, train_data):
|
||||
nlp = Language()
|
||||
textcat = nlp.add_pipe(name)
|
||||
for text, annotations in train_data:
|
||||
for label, value in annotations.get("cats").items():
|
||||
textcat.add_label(label)
|
||||
# you shouldn't really call this more than once, but for testing it should be fine
|
||||
nlp.initialize()
|
||||
get_examples = make_get_examples(nlp)
|
||||
nlp.initialize(get_examples=get_examples)
|
||||
nlp.initialize(get_examples=get_examples(nlp))
|
||||
with pytest.raises(TypeError):
|
||||
nlp.initialize(get_examples=lambda: None)
|
||||
with pytest.raises(TypeError):
|
||||
|
@ -138,12 +180,10 @@ def test_overfitting_IO():
|
|||
# Simple test to try and quickly overfit the single-label textcat component - ensuring the ML models work correctly
|
||||
fix_random_seed(0)
|
||||
nlp = English()
|
||||
nlp.config["initialize"]["components"]["textcat"] = {"positive_label": "POSITIVE"}
|
||||
# Set exclusive labels
|
||||
config = {"model": {"linear_model": {"exclusive_classes": True}}}
|
||||
textcat = nlp.add_pipe("textcat", config=config)
|
||||
textcat = nlp.add_pipe("textcat")
|
||||
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA:
|
||||
for text, annotations in TRAIN_DATA_SINGLE_LABEL:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
assert textcat.model.get_dim("nO") == 2
|
||||
|
@ -172,6 +212,8 @@ def test_overfitting_IO():
|
|||
# Test scoring
|
||||
scores = nlp.evaluate(train_examples)
|
||||
assert scores["cats_micro_f"] == 1.0
|
||||
assert scores["cats_macro_f"] == 1.0
|
||||
assert scores["cats_macro_auc"] == 1.0
|
||||
assert scores["cats_score"] == 1.0
|
||||
assert "cats_score_desc" in scores
|
||||
|
||||
|
@ -192,7 +234,7 @@ def test_overfitting_IO_multi():
|
|||
config = {"model": {"linear_model": {"exclusive_classes": False}}}
|
||||
textcat = nlp.add_pipe("textcat", config=config)
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA:
|
||||
for text, annotations in TRAIN_DATA_MULTI_LABEL:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
assert textcat.model.get_dim("nO") == 2
|
||||
|
@ -231,27 +273,75 @@ def test_overfitting_IO_multi():
|
|||
assert_equal(batch_cats_1, no_batch_cats)
|
||||
|
||||
|
||||
def test_overfitting_IO_multi():
|
||||
# Simple test to try and quickly overfit the multi-label textcat component - ensuring the ML models work correctly
|
||||
fix_random_seed(0)
|
||||
nlp = English()
|
||||
textcat = nlp.add_pipe("textcat_multilabel")
|
||||
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA_MULTI_LABEL:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
assert textcat.model.get_dim("nO") == 3
|
||||
|
||||
for i in range(100):
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["textcat_multilabel"] < 0.01
|
||||
|
||||
# test the trained model
|
||||
test_text = "I am confused but happy."
|
||||
doc = nlp(test_text)
|
||||
cats = doc.cats
|
||||
assert cats["HAPPY"] > 0.9
|
||||
assert cats["CONFUSED"] > 0.9
|
||||
|
||||
# Also test the results are still the same after IO
|
||||
with make_tempdir() as tmp_dir:
|
||||
nlp.to_disk(tmp_dir)
|
||||
nlp2 = util.load_model_from_path(tmp_dir)
|
||||
doc2 = nlp2(test_text)
|
||||
cats2 = doc2.cats
|
||||
assert cats2["HAPPY"] > 0.9
|
||||
assert cats2["CONFUSED"] > 0.9
|
||||
|
||||
# Test scoring
|
||||
scores = nlp.evaluate(train_examples)
|
||||
assert scores["cats_micro_f"] == 1.0
|
||||
assert scores["cats_macro_f"] == 1.0
|
||||
assert "cats_score_desc" in scores
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = ["Just a sentence.", "I like green eggs.", "I am happy.", "I eat ham."]
|
||||
batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
||||
|
||||
# fmt: off
|
||||
@pytest.mark.parametrize(
|
||||
"textcat_config",
|
||||
"name,train_data,textcat_config",
|
||||
[
|
||||
{"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False},
|
||||
{"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False},
|
||||
{"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True},
|
||||
{"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True},
|
||||
{"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}},
|
||||
{"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}},
|
||||
{"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True},
|
||||
{"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False},
|
||||
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}),
|
||||
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False}),
|
||||
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True}),
|
||||
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True}),
|
||||
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}}),
|
||||
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}}),
|
||||
("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True}),
|
||||
("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False}),
|
||||
],
|
||||
)
|
||||
# fmt: on
|
||||
def test_textcat_configs(textcat_config):
|
||||
def test_textcat_configs(name, train_data, textcat_config):
|
||||
pipe_config = {"model": textcat_config}
|
||||
nlp = English()
|
||||
textcat = nlp.add_pipe("textcat", config=pipe_config)
|
||||
textcat = nlp.add_pipe(name, config=pipe_config)
|
||||
train_examples = []
|
||||
for text, annotations in TRAIN_DATA:
|
||||
for text, annotations in train_data:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||
for label, value in annotations.get("cats").items():
|
||||
textcat.add_label(label)
|
||||
|
@ -264,15 +354,24 @@ def test_textcat_configs(textcat_config):
|
|||
def test_positive_class():
|
||||
nlp = English()
|
||||
textcat = nlp.add_pipe("textcat")
|
||||
get_examples = make_get_examples(nlp)
|
||||
get_examples = make_get_examples_single_label(nlp)
|
||||
textcat.initialize(get_examples, labels=["POS", "NEG"], positive_label="POS")
|
||||
assert textcat.labels == ("POS", "NEG")
|
||||
assert textcat.cfg["positive_label"] == "POS"
|
||||
|
||||
textcat_multilabel = nlp.add_pipe("textcat_multilabel")
|
||||
get_examples = make_get_examples_multi_label(nlp)
|
||||
with pytest.raises(TypeError):
|
||||
textcat_multilabel.initialize(get_examples, labels=["POS", "NEG"], positive_label="POS")
|
||||
textcat_multilabel.initialize(get_examples, labels=["FICTION", "DRAMA"])
|
||||
assert textcat_multilabel.labels == ("FICTION", "DRAMA")
|
||||
assert "positive_label" not in textcat_multilabel.cfg
|
||||
|
||||
|
||||
def test_positive_class_not_present():
|
||||
nlp = English()
|
||||
textcat = nlp.add_pipe("textcat")
|
||||
get_examples = make_get_examples(nlp)
|
||||
get_examples = make_get_examples_single_label(nlp)
|
||||
with pytest.raises(ValueError):
|
||||
textcat.initialize(get_examples, labels=["SOME", "THING"], positive_label="POS")
|
||||
|
||||
|
@ -280,11 +379,9 @@ def test_positive_class_not_present():
|
|||
def test_positive_class_not_binary():
|
||||
nlp = English()
|
||||
textcat = nlp.add_pipe("textcat")
|
||||
get_examples = make_get_examples(nlp)
|
||||
get_examples = make_get_examples_multi_label(nlp)
|
||||
with pytest.raises(ValueError):
|
||||
textcat.initialize(
|
||||
get_examples, labels=["SOME", "THING", "POS"], positive_label="POS"
|
||||
)
|
||||
textcat.initialize(get_examples, labels=["SOME", "THING", "POS"], positive_label="POS")
|
||||
|
||||
|
||||
def test_textcat_evaluation():
|
||||
|
|
|
@ -113,7 +113,7 @@ cfg_string = """
|
|||
factory = "tok2vec"
|
||||
|
||||
[components.tok2vec.model]
|
||||
@architectures = "spacy.Tok2Vec.v1"
|
||||
@architectures = "spacy.Tok2Vec.v2"
|
||||
|
||||
[components.tok2vec.model.embed]
|
||||
@architectures = "spacy.MultiHashEmbed.v1"
|
||||
|
@ -123,7 +123,7 @@ cfg_string = """
|
|||
include_static_vectors = false
|
||||
|
||||
[components.tok2vec.model.encode]
|
||||
@architectures = "spacy.MaxoutWindowEncoder.v1"
|
||||
@architectures = "spacy.MaxoutWindowEncoder.v2"
|
||||
width = 96
|
||||
depth = 4
|
||||
window_size = 1
|
||||
|
|
|
@ -288,35 +288,33 @@ def test_multiple_predictions():
|
|||
dummy_pipe(doc)
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="removed Beam stuff during the Example/GoldParse refactor")
|
||||
def test_issue4313():
|
||||
""" This should not crash or exit with some strange error code """
|
||||
beam_width = 16
|
||||
beam_density = 0.0001
|
||||
nlp = English()
|
||||
config = {}
|
||||
ner = nlp.create_pipe("ner", config=config)
|
||||
config = {
|
||||
"beam_width": beam_width,
|
||||
"beam_density": beam_density,
|
||||
}
|
||||
ner = nlp.add_pipe("beam_ner", config=config)
|
||||
ner.add_label("SOME_LABEL")
|
||||
ner.initialize(lambda: [])
|
||||
nlp.initialize()
|
||||
# add a new label to the doc
|
||||
doc = nlp("What do you think about Apple ?")
|
||||
assert len(ner.labels) == 1
|
||||
assert "SOME_LABEL" in ner.labels
|
||||
ner.add_label("MY_ORG") # TODO: not sure if we want this to be necessary...
|
||||
apple_ent = Span(doc, 5, 6, label="MY_ORG")
|
||||
doc.ents = list(doc.ents) + [apple_ent]
|
||||
|
||||
# ensure the beam_parse still works with the new label
|
||||
docs = [doc]
|
||||
beams = nlp.entity.beam_parse(
|
||||
docs, beam_width=beam_width, beam_density=beam_density
|
||||
ner = nlp.get_pipe("beam_ner")
|
||||
beams = ner.beam_parse(
|
||||
docs, drop=0.0, beam_width=beam_width, beam_density=beam_density
|
||||
)
|
||||
|
||||
for doc, beam in zip(docs, beams):
|
||||
entity_scores = defaultdict(float)
|
||||
for score, ents in nlp.entity.moves.get_beam_parses(beam):
|
||||
for start, end, label in ents:
|
||||
entity_scores[(start, end, label)] += score
|
||||
|
||||
|
||||
def test_issue4348():
|
||||
"""Test that training the tagger with empty data, doesn't throw errors"""
|
||||
|
|
|
@ -2,8 +2,11 @@ import pytest
|
|||
from thinc.api import Config, fix_random_seed
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.pipeline.textcat import default_model_config, bow_model_config
|
||||
from spacy.pipeline.textcat import cnn_model_config
|
||||
from spacy.pipeline.textcat import single_label_default_config, single_label_bow_config
|
||||
from spacy.pipeline.textcat import single_label_cnn_config
|
||||
from spacy.pipeline.textcat_multilabel import multi_label_default_config
|
||||
from spacy.pipeline.textcat_multilabel import multi_label_bow_config
|
||||
from spacy.pipeline.textcat_multilabel import multi_label_cnn_config
|
||||
from spacy.tokens import Span
|
||||
from spacy import displacy
|
||||
from spacy.pipeline import merge_entities
|
||||
|
@ -11,7 +14,15 @@ from spacy.training import Example
|
|||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"textcat_config", [default_model_config, bow_model_config, cnn_model_config]
|
||||
"textcat_config",
|
||||
[
|
||||
single_label_default_config,
|
||||
single_label_bow_config,
|
||||
single_label_cnn_config,
|
||||
multi_label_default_config,
|
||||
multi_label_bow_config,
|
||||
multi_label_cnn_config,
|
||||
],
|
||||
)
|
||||
def test_issue5551(textcat_config):
|
||||
"""Test that after fixing the random seed, the results of the pipeline are truly identical"""
|
||||
|
|
|
@ -1,4 +1,3 @@
|
|||
import pydantic
|
||||
import pytest
|
||||
from pydantic import ValidationError
|
||||
from spacy.schemas import TokenPattern, TokenPatternSchema
|
||||
|
|
|
@ -208,7 +208,7 @@ def test_create_nlp_from_pretraining_config():
|
|||
config = Config().from_str(pretrain_config_string)
|
||||
pretrain_config = load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
|
||||
filled = config.merge(pretrain_config)
|
||||
resolved = registry.resolve(filled["pretraining"], schema=ConfigSchemaPretrain)
|
||||
registry.resolve(filled["pretraining"], schema=ConfigSchemaPretrain)
|
||||
|
||||
|
||||
def test_create_nlp_from_config_multiple_instances():
|
||||
|
|
|
@ -4,7 +4,7 @@ from spacy.pipeline import Tagger, DependencyParser, EntityRecognizer
|
|||
from spacy.pipeline import TextCategorizer, SentenceRecognizer, TrainablePipe
|
||||
from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
|
||||
from spacy.pipeline.tagger import DEFAULT_TAGGER_MODEL
|
||||
from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL
|
||||
from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
|
||||
from spacy.pipeline.senter import DEFAULT_SENTER_MODEL
|
||||
from spacy.lang.en import English
|
||||
from thinc.api import Linear
|
||||
|
@ -24,7 +24,7 @@ def parser(en_vocab):
|
|||
"update_with_oracle_cut_size": 100,
|
||||
"beam_width": 1,
|
||||
"beam_update_prob": 1.0,
|
||||
"beam_density": 0.0
|
||||
"beam_density": 0.0,
|
||||
}
|
||||
cfg = {"model": DEFAULT_PARSER_MODEL}
|
||||
model = registry.resolve(cfg, validate=True)["model"]
|
||||
|
@ -41,7 +41,7 @@ def blank_parser(en_vocab):
|
|||
"update_with_oracle_cut_size": 100,
|
||||
"beam_width": 1,
|
||||
"beam_update_prob": 1.0,
|
||||
"beam_density": 0.0
|
||||
"beam_density": 0.0,
|
||||
}
|
||||
cfg = {"model": DEFAULT_PARSER_MODEL}
|
||||
model = registry.resolve(cfg, validate=True)["model"]
|
||||
|
@ -66,7 +66,7 @@ def test_serialize_parser_roundtrip_bytes(en_vocab, Parser):
|
|||
"update_with_oracle_cut_size": 100,
|
||||
"beam_width": 1,
|
||||
"beam_update_prob": 1.0,
|
||||
"beam_density": 0.0
|
||||
"beam_density": 0.0,
|
||||
}
|
||||
cfg = {"model": DEFAULT_PARSER_MODEL}
|
||||
model = registry.resolve(cfg, validate=True)["model"]
|
||||
|
@ -90,7 +90,7 @@ def test_serialize_parser_strings(Parser):
|
|||
"update_with_oracle_cut_size": 100,
|
||||
"beam_width": 1,
|
||||
"beam_update_prob": 1.0,
|
||||
"beam_density": 0.0
|
||||
"beam_density": 0.0,
|
||||
}
|
||||
cfg = {"model": DEFAULT_PARSER_MODEL}
|
||||
model = registry.resolve(cfg, validate=True)["model"]
|
||||
|
@ -112,7 +112,7 @@ def test_serialize_parser_roundtrip_disk(en_vocab, Parser):
|
|||
"update_with_oracle_cut_size": 100,
|
||||
"beam_width": 1,
|
||||
"beam_update_prob": 1.0,
|
||||
"beam_density": 0.0
|
||||
"beam_density": 0.0,
|
||||
}
|
||||
cfg = {"model": DEFAULT_PARSER_MODEL}
|
||||
model = registry.resolve(cfg, validate=True)["model"]
|
||||
|
@ -140,9 +140,6 @@ def test_to_from_bytes(parser, blank_parser):
|
|||
assert blank_parser.moves.n_moves == parser.moves.n_moves
|
||||
|
||||
|
||||
@pytest.mark.skip(
|
||||
reason="This seems to be a dict ordering bug somewhere. Only failing on some platforms."
|
||||
)
|
||||
def test_serialize_tagger_roundtrip_bytes(en_vocab, taggers):
|
||||
tagger1 = taggers[0]
|
||||
tagger1_b = tagger1.to_bytes()
|
||||
|
@ -191,7 +188,7 @@ def test_serialize_tagger_strings(en_vocab, de_vocab, taggers):
|
|||
|
||||
def test_serialize_textcat_empty(en_vocab):
|
||||
# See issue #1105
|
||||
cfg = {"model": DEFAULT_TEXTCAT_MODEL}
|
||||
cfg = {"model": DEFAULT_SINGLE_TEXTCAT_MODEL}
|
||||
model = registry.resolve(cfg, validate=True)["model"]
|
||||
textcat = TextCategorizer(en_vocab, model, threshold=0.5)
|
||||
textcat.to_bytes(exclude=["vocab"])
|
||||
|
|
|
@ -26,7 +26,6 @@ def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
|
|||
assert tokenizer_reloaded.rules == {}
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="Currently unreliable across platforms")
|
||||
@pytest.mark.parametrize("text", ["I💜you", "they’re", "“hello”"])
|
||||
def test_serialize_tokenizer_roundtrip_bytes(en_tokenizer, text):
|
||||
tokenizer = en_tokenizer
|
||||
|
@ -38,7 +37,6 @@ def test_serialize_tokenizer_roundtrip_bytes(en_tokenizer, text):
|
|||
assert [token.text for token in doc1] == [token.text for token in doc2]
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="Currently unreliable across platforms")
|
||||
def test_serialize_tokenizer_roundtrip_disk(en_tokenizer):
|
||||
tokenizer = en_tokenizer
|
||||
with make_tempdir() as d:
|
||||
|
|
|
@ -3,7 +3,9 @@ from click import NoSuchOption
|
|||
from spacy.training import docs_to_json, offsets_to_biluo_tags
|
||||
from spacy.training.converters import iob_to_docs, conll_ner_to_docs, conllu_to_docs
|
||||
from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate
|
||||
from spacy.lang.nl import Dutch
|
||||
from spacy.util import ENV_VARS
|
||||
from spacy.cli import info
|
||||
from spacy.cli.init_config import init_config, RECOMMENDATIONS
|
||||
from spacy.cli._util import validate_project_commands, parse_config_overrides
|
||||
from spacy.cli._util import load_project_config, substitute_project_variables
|
||||
|
@ -15,6 +17,16 @@ import os
|
|||
from .util import make_tempdir
|
||||
|
||||
|
||||
def test_cli_info():
|
||||
nlp = Dutch()
|
||||
nlp.add_pipe("textcat")
|
||||
with make_tempdir() as tmp_dir:
|
||||
nlp.to_disk(tmp_dir)
|
||||
raw_data = info(tmp_dir, exclude=[""])
|
||||
assert raw_data["lang"] == "nl"
|
||||
assert raw_data["components"] == ["textcat"]
|
||||
|
||||
|
||||
def test_cli_converters_conllu_to_docs():
|
||||
# from NorNE: https://github.com/ltgoslo/norne/blob/3d23274965f513f23aa48455b28b1878dad23c05/ud/nob/no_bokmaal-ud-dev.conllu
|
||||
lines = [
|
||||
|
|
|
@ -83,6 +83,7 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
|
|||
def test_prefer_gpu():
|
||||
try:
|
||||
import cupy # noqa: F401
|
||||
|
||||
prefer_gpu()
|
||||
assert isinstance(get_current_ops(), CupyOps)
|
||||
except ImportError:
|
||||
|
@ -92,17 +93,20 @@ def test_prefer_gpu():
|
|||
def test_require_gpu():
|
||||
try:
|
||||
import cupy # noqa: F401
|
||||
|
||||
require_gpu()
|
||||
assert isinstance(get_current_ops(), CupyOps)
|
||||
except ImportError:
|
||||
with pytest.raises(ValueError):
|
||||
require_gpu()
|
||||
|
||||
|
||||
def test_require_cpu():
|
||||
require_cpu()
|
||||
assert isinstance(get_current_ops(), NumpyOps)
|
||||
try:
|
||||
import cupy # noqa: F401
|
||||
|
||||
require_gpu()
|
||||
assert isinstance(get_current_ops(), CupyOps)
|
||||
except ImportError:
|
||||
|
|
|
@ -294,7 +294,7 @@ def test_partial_annotation(en_tokenizer):
|
|||
# cats doesn't have an unset state
|
||||
if key.startswith("cats"):
|
||||
continue
|
||||
assert scores[key] == None
|
||||
assert scores[key] is None
|
||||
|
||||
# partially annotated reference, not overlapping with predicted annotation
|
||||
ref_doc = en_tokenizer("a b c d e")
|
||||
|
@ -306,13 +306,13 @@ def test_partial_annotation(en_tokenizer):
|
|||
example = Example(pred_doc, ref_doc)
|
||||
scorer = Scorer()
|
||||
scores = scorer.score([example])
|
||||
assert scores["token_acc"] == None
|
||||
assert scores["token_acc"] is None
|
||||
assert scores["tag_acc"] == 0.0
|
||||
assert scores["pos_acc"] == 0.0
|
||||
assert scores["morph_acc"] == 0.0
|
||||
assert scores["dep_uas"] == 1.0
|
||||
assert scores["dep_las"] == 0.0
|
||||
assert scores["sents_f"] == None
|
||||
assert scores["sents_f"] is None
|
||||
|
||||
# partially annotated reference, overlapping with predicted annotation
|
||||
ref_doc = en_tokenizer("a b c d e")
|
||||
|
@ -324,13 +324,13 @@ def test_partial_annotation(en_tokenizer):
|
|||
example = Example(pred_doc, ref_doc)
|
||||
scorer = Scorer()
|
||||
scores = scorer.score([example])
|
||||
assert scores["token_acc"] == None
|
||||
assert scores["token_acc"] is None
|
||||
assert scores["tag_acc"] == 1.0
|
||||
assert scores["pos_acc"] == 1.0
|
||||
assert scores["morph_acc"] == 0.0
|
||||
assert scores["dep_uas"] == 1.0
|
||||
assert scores["dep_las"] == 0.0
|
||||
assert scores["sents_f"] == None
|
||||
assert scores["sents_f"] is None
|
||||
|
||||
|
||||
def test_roc_auc_score():
|
||||
|
@ -391,7 +391,7 @@ def test_roc_auc_score():
|
|||
score.score_set(0.25, 0)
|
||||
score.score_set(0.75, 0)
|
||||
with pytest.raises(ValueError):
|
||||
s = score.score
|
||||
_ = score.score # noqa: F841
|
||||
|
||||
y_true = [1, 1]
|
||||
y_score = [0.25, 0.75]
|
||||
|
@ -402,4 +402,4 @@ def test_roc_auc_score():
|
|||
score.score_set(0.25, 1)
|
||||
score.score_set(0.75, 1)
|
||||
with pytest.raises(ValueError):
|
||||
s = score.score
|
||||
_ = score.score # noqa: F841
|
||||
|
|
|
@ -180,3 +180,9 @@ def test_tokenizer_special_cases_idx(tokenizer):
|
|||
doc = tokenizer(text)
|
||||
assert doc[1].idx == 4
|
||||
assert doc[2].idx == 7
|
||||
|
||||
|
||||
def test_tokenizer_special_cases_spaces(tokenizer):
|
||||
assert [t.text for t in tokenizer("a b c")] == ["a", "b", "c"]
|
||||
tokenizer.add_special_case("a b c", [{"ORTH": "a b c"}])
|
||||
assert [t.text for t in tokenizer("a b c")] == ["a b c"]
|
||||
|
|
|
@ -51,7 +51,7 @@ def test_readers():
|
|||
for example in train_corpus(nlp):
|
||||
nlp.update([example], sgd=optimizer)
|
||||
scores = nlp.evaluate(list(dev_corpus(nlp)))
|
||||
assert scores["cats_score"] == 0.0
|
||||
assert scores["cats_macro_auc"] == 0.0
|
||||
# ensure the pipeline runs
|
||||
doc = nlp("Quick test")
|
||||
assert doc.cats
|
||||
|
@ -73,7 +73,7 @@ def test_cat_readers(reader, additional_config):
|
|||
nlp_config_string = """
|
||||
[training]
|
||||
seed = 0
|
||||
|
||||
|
||||
[training.score_weights]
|
||||
cats_macro_auc = 1.0
|
||||
|
||||
|
|
|
@ -71,7 +71,6 @@ def test_table_api_to_from_bytes():
|
|||
assert "def" not in new_table2
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="This fails on Python 3.5")
|
||||
def test_lookups_to_from_bytes():
|
||||
lookups = Lookups()
|
||||
lookups.add_table("table1", {"foo": "bar", "hello": "world"})
|
||||
|
@ -91,7 +90,6 @@ def test_lookups_to_from_bytes():
|
|||
assert new_lookups.to_bytes() == lookups_bytes
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="This fails on Python 3.5")
|
||||
def test_lookups_to_from_disk():
|
||||
lookups = Lookups()
|
||||
lookups.add_table("table1", {"foo": "bar", "hello": "world"})
|
||||
|
@ -111,7 +109,6 @@ def test_lookups_to_from_disk():
|
|||
assert table2["b"] == 2
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="This fails on Python 3.5")
|
||||
def test_lookups_to_from_bytes_via_vocab():
|
||||
table_name = "test"
|
||||
vocab = Vocab()
|
||||
|
@ -128,7 +125,6 @@ def test_lookups_to_from_bytes_via_vocab():
|
|||
assert new_vocab.to_bytes() == vocab_bytes
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="This fails on Python 3.5")
|
||||
def test_lookups_to_from_disk_via_vocab():
|
||||
table_name = "test"
|
||||
vocab = Vocab()
|
||||
|
|
|
@ -258,6 +258,7 @@ cdef class Tokenizer:
|
|||
tokens = doc.c
|
||||
# Otherwise create a separate array to store modified tokens
|
||||
else:
|
||||
assert max_length > 0
|
||||
tokens = <TokenC*>mem.alloc(max_length, sizeof(TokenC))
|
||||
# Modify tokenization according to filtered special cases
|
||||
offset = self._retokenize_special_spans(doc, tokens, span_data)
|
||||
|
@ -610,7 +611,7 @@ cdef class Tokenizer:
|
|||
self.mem.free(stale_special)
|
||||
self._rules[string] = substrings
|
||||
self._flush_cache()
|
||||
if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string):
|
||||
if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string) or " " in string:
|
||||
self._special_matcher.add(string, None, self._tokenize_affixes(string, False))
|
||||
|
||||
def _reload_special_cases(self):
|
||||
|
|
|
@ -188,8 +188,15 @@ def _merge(Doc doc, merges):
|
|||
and doc.c[start - 1].ent_type == token.ent_type:
|
||||
merged_iob = 1
|
||||
token.ent_iob = merged_iob
|
||||
# Set lemma to concatenated lemmas
|
||||
merged_lemma = ""
|
||||
for span_token in span:
|
||||
merged_lemma += span_token.lemma_
|
||||
if doc.c[span_token.i].spacy:
|
||||
merged_lemma += " "
|
||||
merged_lemma = merged_lemma.strip()
|
||||
token.lemma = doc.vocab.strings.add(merged_lemma)
|
||||
# Unset attributes that don't match new token
|
||||
token.lemma = 0
|
||||
token.norm = 0
|
||||
tokens[merge_index] = token
|
||||
# Resize the doc.tensor, if it's set. Let the last row for each token stand
|
||||
|
@ -335,7 +342,9 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
|
|||
token = &doc.c[token_index + i]
|
||||
lex = doc.vocab.get(doc.mem, orth)
|
||||
token.lex = lex
|
||||
token.lemma = 0 # reset lemma
|
||||
# If lemma is currently set, set default lemma to orth
|
||||
if token.lemma != 0:
|
||||
token.lemma = lex.orth
|
||||
token.norm = 0 # reset norm
|
||||
if to_process_tensor:
|
||||
# setting the tensors of the split tokens to array of zeros
|
||||
|
|
|
@ -225,6 +225,7 @@ cdef class Doc:
|
|||
# Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
|
||||
# However, we need to remember the true starting places, so that we can
|
||||
# realloc.
|
||||
assert size + (PADDING*2) > 0
|
||||
data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
|
||||
cdef int i
|
||||
for i in range(size + (PADDING*2)):
|
||||
|
@ -1097,7 +1098,7 @@ cdef class Doc:
|
|||
(vocab,) = vocab
|
||||
|
||||
if attrs is None:
|
||||
attrs = Doc._get_array_attrs()
|
||||
attrs = list(Doc._get_array_attrs())
|
||||
else:
|
||||
if any(isinstance(attr, str) for attr in attrs): # resolve attribute names
|
||||
attrs = [intify_attr(attr) for attr in attrs] # intify_attr returns None for invalid attrs
|
||||
|
@ -1177,6 +1178,7 @@ cdef class Doc:
|
|||
other.length = self.length
|
||||
other.max_length = self.max_length
|
||||
buff_size = other.max_length + (PADDING*2)
|
||||
assert buff_size > 0
|
||||
tokens = <TokenC*>other.mem.alloc(buff_size, sizeof(TokenC))
|
||||
memcpy(tokens, self.c - PADDING, buff_size * sizeof(TokenC))
|
||||
other.c = &tokens[PADDING]
|
||||
|
|
|
@ -37,9 +37,17 @@ def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language":
|
|||
T = registry.resolve(config["training"], schema=ConfigSchemaTraining)
|
||||
dot_names = [T["train_corpus"], T["dev_corpus"]]
|
||||
if not isinstance(T["train_corpus"], str):
|
||||
raise ConfigValidationError(desc=Errors.E897.format(field="training.train_corpus", type=type(T["train_corpus"])))
|
||||
raise ConfigValidationError(
|
||||
desc=Errors.E897.format(
|
||||
field="training.train_corpus", type=type(T["train_corpus"])
|
||||
)
|
||||
)
|
||||
if not isinstance(T["dev_corpus"], str):
|
||||
raise ConfigValidationError(desc=Errors.E897.format(field="training.dev_corpus", type=type(T["dev_corpus"])))
|
||||
raise ConfigValidationError(
|
||||
desc=Errors.E897.format(
|
||||
field="training.dev_corpus", type=type(T["dev_corpus"])
|
||||
)
|
||||
)
|
||||
train_corpus, dev_corpus = resolve_dot_names(config, dot_names)
|
||||
optimizer = T["optimizer"]
|
||||
# Components that shouldn't be updated during training
|
||||
|
|
|
@ -59,6 +59,19 @@ def train(
|
|||
batcher = T["batcher"]
|
||||
train_logger = T["logger"]
|
||||
before_to_disk = create_before_to_disk_callback(T["before_to_disk"])
|
||||
|
||||
# Helper function to save checkpoints. This is a closure for convenience,
|
||||
# to avoid passing in all the args all the time.
|
||||
def save_checkpoint(is_best):
|
||||
with nlp.use_params(optimizer.averages):
|
||||
before_to_disk(nlp).to_disk(output_path / DIR_MODEL_LAST)
|
||||
if is_best:
|
||||
# Avoid saving twice (saving will be more expensive than
|
||||
# the dir copy)
|
||||
if (output_path / DIR_MODEL_BEST).exists():
|
||||
shutil.rmtree(output_path / DIR_MODEL_BEST)
|
||||
shutil.copytree(output_path / DIR_MODEL_LAST, output_path / DIR_MODEL_BEST)
|
||||
|
||||
# Components that shouldn't be updated during training
|
||||
frozen_components = T["frozen_components"]
|
||||
# Create iterator, which yields out info after each optimization step.
|
||||
|
@ -87,36 +100,31 @@ def train(
|
|||
if is_best_checkpoint is not None and output_path is not None:
|
||||
with nlp.select_pipes(disable=frozen_components):
|
||||
update_meta(T, nlp, info)
|
||||
with nlp.use_params(optimizer.averages):
|
||||
nlp = before_to_disk(nlp)
|
||||
nlp.to_disk(output_path / DIR_MODEL_BEST)
|
||||
save_checkpoint(is_best_checkpoint)
|
||||
except Exception as e:
|
||||
if output_path is not None:
|
||||
# We don't want to swallow the traceback if we don't have a
|
||||
# specific error, but we do want to warn that we're trying
|
||||
# to do something here.
|
||||
stdout.write(
|
||||
msg.warn(
|
||||
f"Aborting and saving the final best model. "
|
||||
f"Encountered exception: {str(e)}"
|
||||
f"Encountered exception: {repr(e)}"
|
||||
)
|
||||
+ "\n"
|
||||
)
|
||||
raise e
|
||||
finally:
|
||||
finalize_logger()
|
||||
if optimizer.averages:
|
||||
nlp.use_params(optimizer.averages)
|
||||
if output_path is not None:
|
||||
final_model_path = output_path / DIR_MODEL_LAST
|
||||
nlp.to_disk(final_model_path)
|
||||
# This will only run if we don't hit an error
|
||||
stdout.write(
|
||||
msg.good("Saved pipeline to output directory", final_model_path) + "\n"
|
||||
)
|
||||
return (nlp, final_model_path)
|
||||
else:
|
||||
return (nlp, None)
|
||||
save_checkpoint(False)
|
||||
# This will only run if we didn't hit an error
|
||||
if optimizer.averages:
|
||||
nlp.use_params(optimizer.averages)
|
||||
if output_path is not None:
|
||||
stdout.write(
|
||||
msg.good("Saved pipeline to output directory", output_path / DIR_MODEL_LAST)
|
||||
+ "\n"
|
||||
)
|
||||
return (nlp, output_path / DIR_MODEL_LAST)
|
||||
else:
|
||||
return (nlp, None)
|
||||
|
||||
|
||||
def train_while_improving(
|
||||
|
|
|
@ -10,7 +10,7 @@ from wasabi import Printer
|
|||
|
||||
from .example import Example
|
||||
from ..tokens import Doc
|
||||
from ..schemas import ConfigSchemaTraining, ConfigSchemaPretrain
|
||||
from ..schemas import ConfigSchemaPretrain
|
||||
from ..util import registry, load_model_from_config, dot_to_object
|
||||
|
||||
|
||||
|
@ -30,7 +30,6 @@ def pretrain(
|
|||
set_gpu_allocator(allocator)
|
||||
nlp = load_model_from_config(config)
|
||||
_config = nlp.config.interpolate()
|
||||
T = registry.resolve(_config["training"], schema=ConfigSchemaTraining)
|
||||
P = registry.resolve(_config["pretraining"], schema=ConfigSchemaPretrain)
|
||||
corpus = dot_to_object(_config, P["corpus"])
|
||||
corpus = registry.resolve({"corpus": corpus})["corpus"]
|
||||
|
|
|
@ -69,7 +69,7 @@ CONFIG_SECTION_ORDER = ["paths", "variables", "system", "nlp", "components", "co
|
|||
|
||||
logger = logging.getLogger("spacy")
|
||||
logger_stream_handler = logging.StreamHandler()
|
||||
logger_stream_handler.setFormatter(logging.Formatter('%(message)s'))
|
||||
logger_stream_handler.setFormatter(logging.Formatter("%(message)s"))
|
||||
logger.addHandler(logger_stream_handler)
|
||||
|
||||
|
||||
|
|
|
@ -164,7 +164,7 @@ cdef class Vocab:
|
|||
if len(string) < 3 or self.length < 10000:
|
||||
mem = self.mem
|
||||
cdef bint is_oov = mem is not self.mem
|
||||
lex = <LexemeC*>mem.alloc(sizeof(LexemeC), 1)
|
||||
lex = <LexemeC*>mem.alloc(1, sizeof(LexemeC))
|
||||
lex.orth = self.strings.add(string)
|
||||
lex.length = len(string)
|
||||
if self.vectors is not None:
|
||||
|
|
|
@ -5,6 +5,7 @@ source: spacy/ml/models
|
|||
menu:
|
||||
- ['Tok2Vec', 'tok2vec-arch']
|
||||
- ['Transformers', 'transformers']
|
||||
- ['Pretraining', 'pretrain']
|
||||
- ['Parser & NER', 'parser']
|
||||
- ['Tagging', 'tagger']
|
||||
- ['Text Classification', 'textcat']
|
||||
|
@ -25,20 +26,20 @@ usage documentation on
|
|||
|
||||
## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"}
|
||||
|
||||
### spacy.Tok2Vec.v1 {#Tok2Vec}
|
||||
### spacy.Tok2Vec.v2 {#Tok2Vec}
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [model]
|
||||
> @architectures = "spacy.Tok2Vec.v1"
|
||||
> @architectures = "spacy.Tok2Vec.v2"
|
||||
>
|
||||
> [model.embed]
|
||||
> @architectures = "spacy.CharacterEmbed.v1"
|
||||
> # ...
|
||||
>
|
||||
> [model.encode]
|
||||
> @architectures = "spacy.MaxoutWindowEncoder.v1"
|
||||
> @architectures = "spacy.MaxoutWindowEncoder.v2"
|
||||
> # ...
|
||||
> ```
|
||||
|
||||
|
@ -196,13 +197,13 @@ network to construct a single vector to represent the information.
|
|||
| `nC` | The number of UTF-8 bytes to embed per word. Recommended values are between `3` and `8`, although it may depend on the length of words in the language. ~~int~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
|
||||
### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder}
|
||||
### spacy.MaxoutWindowEncoder.v2 {#MaxoutWindowEncoder}
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [model]
|
||||
> @architectures = "spacy.MaxoutWindowEncoder.v1"
|
||||
> @architectures = "spacy.MaxoutWindowEncoder.v2"
|
||||
> width = 128
|
||||
> window_size = 1
|
||||
> maxout_pieces = 3
|
||||
|
@ -220,13 +221,13 @@ and residual connections.
|
|||
| `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
|
||||
|
||||
### spacy.MishWindowEncoder.v1 {#MishWindowEncoder}
|
||||
### spacy.MishWindowEncoder.v2 {#MishWindowEncoder}
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [model]
|
||||
> @architectures = "spacy.MishWindowEncoder.v1"
|
||||
> @architectures = "spacy.MishWindowEncoder.v2"
|
||||
> width = 64
|
||||
> window_size = 1
|
||||
> depth = 4
|
||||
|
@ -251,19 +252,19 @@ and residual connections.
|
|||
> [model]
|
||||
> @architectures = "spacy.TorchBiLSTMEncoder.v1"
|
||||
> width = 64
|
||||
> window_size = 1
|
||||
> depth = 4
|
||||
> depth = 2
|
||||
> dropout = 0.0
|
||||
> ```
|
||||
|
||||
Encode context using bidirectional LSTM layers. Requires
|
||||
[PyTorch](https://pytorch.org).
|
||||
|
||||
| Name | Description |
|
||||
| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `width` | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. ~~int~~ |
|
||||
| `window_size` | The number of words to concatenate around each token to construct the convolution. Recommended value is `1`. ~~int~~ |
|
||||
| `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
|
||||
| Name | Description |
|
||||
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `width` | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. ~~int~~ |
|
||||
| `depth` | The number of recurrent layers, for instance `depth=2` results in stacking two LSTMs together. ~~int~~ |
|
||||
| `dropout` | Creates a Dropout layer on the outputs of each LSTM layer except the last layer. Set to 0.0 to disable this functionality. ~~float~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |
|
||||
|
||||
### spacy.StaticVectors.v1 {#StaticVectors}
|
||||
|
||||
|
@ -426,6 +427,71 @@ one component.
|
|||
| `grad_factor` | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. ~~float~~ |
|
||||
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
|
||||
|
||||
## Pretraining architectures {#pretrain source="spacy/ml/models/multi_task.py"}
|
||||
|
||||
The spacy `pretrain` command lets you initialize a `Tok2Vec` layer in your
|
||||
pipeline with information from raw text. To this end, additional layers are
|
||||
added to build a network for a temporary task that forces the `Tok2Vec` layer to
|
||||
learn something about sentence structure and word cooccurrence statistics. Two
|
||||
pretraining objectives are available, both of which are variants of the cloze
|
||||
task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced for
|
||||
BERT.
|
||||
|
||||
For more information, see the section on
|
||||
[pretraining](/usage/embeddings-transformers#pretraining).
|
||||
|
||||
### spacy.PretrainVectors.v1 {#pretrain_vectors}
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [pretraining]
|
||||
> component = "tok2vec"
|
||||
> ...
|
||||
>
|
||||
> [pretraining.objective]
|
||||
> @architectures = "spacy.PretrainVectors.v1"
|
||||
> maxout_pieces = 3
|
||||
> hidden_size = 300
|
||||
> loss = "cosine"
|
||||
> ```
|
||||
|
||||
Predict the word's vector from a static embeddings table as pretraining
|
||||
objective for a Tok2Vec layer.
|
||||
|
||||
| Name | Description |
|
||||
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
|
||||
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
|
||||
| `loss` | The loss function can be either "cosine" or "L2". We typically recommend using "cosine". ~~str~~ |
|
||||
| **CREATES** | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
|
||||
|
||||
### spacy.PretrainCharacters.v1 {#pretrain_chars}
|
||||
|
||||
> #### Example config
|
||||
>
|
||||
> ```ini
|
||||
> [pretraining]
|
||||
> component = "tok2vec"
|
||||
> ...
|
||||
>
|
||||
> [pretraining.objective]
|
||||
> @architectures = "spacy.PretrainCharacters.v1"
|
||||
> maxout_pieces = 3
|
||||
> hidden_size = 300
|
||||
> n_characters = 4
|
||||
> ```
|
||||
|
||||
Predict some number of leading and trailing UTF-8 bytes as pretraining objective
|
||||
for a Tok2Vec layer.
|
||||
|
||||
| Name | Description |
|
||||
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
|
||||
| `hidden_size` | Size of the hidden layer of the model. ~~int~~ |
|
||||
| `n_characters` | The window of characters - e.g. if `n_characters = 2`, the model will try to predict the first two and last two characters of the word. ~~int~~ |
|
||||
| **CREATES** | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
|
||||
|
||||
## Parser & NER architectures {#parser}
|
||||
|
||||
### spacy.TransitionBasedParser.v2 {#TransitionBasedParser source="spacy/ml/models/parser.py"}
|
||||
|
@ -534,7 +600,7 @@ specific data and challenge.
|
|||
> no_output_layer = false
|
||||
>
|
||||
> [model.tok2vec]
|
||||
> @architectures = "spacy.Tok2Vec.v1"
|
||||
> @architectures = "spacy.Tok2Vec.v2"
|
||||
>
|
||||
> [model.tok2vec.embed]
|
||||
> @architectures = "spacy.MultiHashEmbed.v1"
|
||||
|
@ -544,7 +610,7 @@ specific data and challenge.
|
|||
> include_static_vectors = false
|
||||
>
|
||||
> [model.tok2vec.encode]
|
||||
> @architectures = "spacy.MaxoutWindowEncoder.v1"
|
||||
> @architectures = "spacy.MaxoutWindowEncoder.v2"
|
||||
> width = ${model.tok2vec.embed.width}
|
||||
> window_size = 1
|
||||
> maxout_pieces = 3
|
||||
|
|
|
@@ -61,20 +61,27 @@ markup to copy-paste into
[GitHub issues](https://github.com/explosion/spaCy/issues).

```cli
$ python -m spacy info [--markdown] [--silent]
$ python -m spacy info [--markdown] [--silent] [--exclude]
```

> #### Example
>
> ```cli
> $ python -m spacy info en_core_web_lg --markdown
> ```

```cli
$ python -m spacy info [model] [--markdown] [--silent]
$ python -m spacy info [model] [--markdown] [--silent] [--exclude]
```

| Name                                             | Description |
| ------------------------------------------------ | ----------- |
| `model`                                          | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ |
| `--markdown`, `-md`                              | Print information as Markdown. ~~bool (flag)~~ |
| `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | Don't print anything, just return the values. ~~bool (flag)~~ |
| `--help`, `-h`                                   | Show help message and available arguments. ~~bool (flag)~~ |
| **PRINTS**                                       | Information about your spaCy installation. |
| Name                                             | Description |
| ------------------------------------------------ | ----------- |
| `model`                                          | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ |
| `--markdown`, `-md`                              | Print information as Markdown. ~~bool (flag)~~ |
| `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | Don't print anything, just return the values. ~~bool (flag)~~ |
| `--exclude`, `-e`                                | Comma-separated keys to exclude from the print-out. Defaults to `"labels"`. ~~Optional[str]~~ |
| `--help`, `-h`                                   | Show help message and available arguments. ~~bool (flag)~~ |
| **PRINTS**                                       | Information about your spaCy installation. |

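For instance, skipping the potentially long label listing in the print-out might look like the following sketch (the pipeline name is just an example and assumes that package is installed):

```cli
$ python -m spacy info en_core_web_sm --exclude labels
```
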
## validate {#validate new="2" tag="command"}

@@ -121,7 +128,7 @@ customize those settings in your config file later.
> ```

```cli
$ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [--gpu] [--pretraining]
$ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [--gpu] [--pretraining] [--force]
```

| Name | Description |

@@ -132,6 +139,7 @@ $ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [
| `--optimize`, `-o`      | `"efficiency"` or `"accuracy"`. Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters. Defaults to `"efficiency"`. ~~str (option)~~ |
| `--gpu`, `-G`           | Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters. ~~bool (flag)~~ |
| `--pretraining`, `-pt`  | Include config for pretraining (with [`spacy pretrain`](/api/cli#pretrain)). Defaults to `False`. ~~bool (flag)~~ |
| `--force`, `-f`         | Force overwriting the output file if it already exists. ~~bool (flag)~~ |
| `--help`, `-h`          | Show help message and available arguments. ~~bool (flag)~~ |
| **CREATES**             | The config file for training. |

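As a quick illustration of the new flag, overwriting an existing config for an English NER pipeline could look like this (file name and pipeline choice are hypothetical):

```cli
$ python -m spacy init config config.cfg --lang en --pipeline ner --force
```
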
@@ -783,6 +791,12 @@ in the section `[paths]`.

</Infobox>

> #### Example
>
> ```cli
> $ python -m spacy train config.cfg --output ./output --paths.train ./train --paths.dev ./dev
> ```

```cli
$ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id] [overrides]
```

@@ -801,15 +815,16 @@ $ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id]
## pretrain {#pretrain new="2.1" tag="command,experimental"}

Pretrain the "token to vector" ([`Tok2vec`](/api/tok2vec)) layer of pipeline
components on [raw text](/api/data-formats#pretrain), using an approximate
language-modeling objective. Specifically, we load pretrained vectors, and train
a component like a CNN, BiLSTM, etc to predict vectors which match the
pretrained ones. The weights are saved to a directory after each epoch. You can
then include a **path to one of these pretrained weights files** in your
components on raw text, using an approximate language-modeling objective.
Specifically, we load pretrained vectors, and train a component like a CNN,
BiLSTM, etc to predict vectors which match the pretrained ones. The weights are
saved to a directory after each epoch. You can then include a **path to one of
these pretrained weights files** in your
[training config](/usage/training#config) as the `init_tok2vec` setting when you
train your pipeline. This technique may be especially helpful if you have little
labelled data. See the usage docs on
[pretraining](/usage/embeddings-transformers#pretraining) for more info.
[pretraining](/usage/embeddings-transformers#pretraining) for more info. To read
the raw text, a [`JsonlCorpus`](/api/top-level#jsonlcorpus) is typically used.

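For reference, such a raw-text file is newline-delimited JSON where each line carries a `"text"` key. A minimal sketch (the sentences are made up):

```json
{"text": "Can I ask where you work now and what you do?"}
{"text": "They may be useful to add to your spaCy pipeline."}
```
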
<Infobox title="Changed in v3.0" variant="warning">

@@ -823,6 +838,12 @@ auto-generated by setting `--pretraining` on

</Infobox>

> #### Example
>
> ```cli
> $ python -m spacy pretrain config.cfg ./output_pretrain --paths.raw_text ./data.jsonl
> ```

```cli
$ python -m spacy pretrain [config_path] [output_dir] [--code] [--resume-path] [--epoch-resume] [--gpu-id] [overrides]
```

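As a rough sketch of the follow-up step (the file name is hypothetical), the resulting weights can then be referenced from the training config via the `init_tok2vec` setting:

```ini
[paths]
init_tok2vec = "output_pretrain/model99.bin"

[initialize]
init_tok2vec = ${paths.init_tok2vec}
```
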
@@ -94,7 +94,7 @@ Defines the `nlp` object, its tokenizer and
>
> [components.textcat.model]
> @architectures = "spacy.TextCatBOW.v1"
> exclusive_classes = false
> exclusive_classes = true
> ngram_size = 1
> no_output_layer = false
> ```

@@ -148,7 +148,7 @@ This section defines a **dictionary** mapping of string keys to functions. Each
function takes an `nlp` object and yields [`Example`](/api/example) objects. By
default, the two keys `train` and `dev` are specified and each refer to a
[`Corpus`](/api/top-level#Corpus). When pretraining, an additional `pretrain`
section is added that defaults to a [`JsonlCorpus`](/api/top-level#JsonlCorpus).
section is added that defaults to a [`JsonlCorpus`](/api/top-level#jsonlcorpus).
You can also register custom functions that return a callable.

| Name | Description |

454 website/docs/api/multilabel_textcategorizer.md Normal file

@@ -0,0 +1,454 @@
---
title: Multi-label TextCategorizer
tag: class
source: spacy/pipeline/textcat_multilabel.py
new: 3
teaser: 'Pipeline component for multi-label text classification'
api_base_class: /api/pipe
api_string_name: textcat_multilabel
api_trainable: true
---

The text categorizer predicts **categories over a whole document**. It
learns non-mutually exclusive labels, which means that zero or more labels
may be true per document.

## Config and implementation {#config}

The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
[`config.cfg` for training](/usage/training#config). See the
[model architectures](/api/architectures) documentation for details on the
architectures and their arguments and hyperparameters.

> #### Example
>
> ```python
> from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL
> config = {
>     "threshold": 0.5,
>     "model": DEFAULT_MULTI_TEXTCAT_MODEL,
> }
> nlp.add_pipe("textcat_multilabel", config=config)
> ```

| Setting     | Description |
| ----------- | ----------- |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
| `model`     | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |

```python
%%GITHUB_SPACY/spacy/pipeline/textcat_multilabel.py
```

## MultiLabel_TextCategorizer.\_\_init\_\_ {#init tag="method"}

> #### Example
>
> ```python
> # Construction via add_pipe with default model
> textcat = nlp.add_pipe("textcat_multilabel")
>
> # Construction via add_pipe with custom model
> config = {"model": {"@architectures": "my_textcat"}}
> textcat = nlp.add_pipe("textcat_multilabel", config=config)
>
> # Construction from class
> from spacy.pipeline import MultiLabel_TextCategorizer
> textcat = MultiLabel_TextCategorizer(nlp.vocab, model, threshold=0.5)
> ```

Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#create_pipe).

| Name           | Description |
| -------------- | ----------- |
| `vocab`        | The shared vocabulary. ~~Vocab~~ |
| `model`        | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
| `name`         | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
| _keyword-only_ | |
| `threshold`    | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |

## MultiLabel_TextCategorizer.\_\_call\_\_ {#call tag="method"}

Apply the pipe to one document. The document is modified in place, and returned.
This usually happens under the hood when the `nlp` object is called on a text
and all pipeline components are applied to the `Doc` in order. Both
[`__call__`](/api/multilabel_textcategorizer#call) and [`pipe`](/api/multilabel_textcategorizer#pipe)
delegate to the [`predict`](/api/multilabel_textcategorizer#predict) and
[`set_annotations`](/api/multilabel_textcategorizer#set_annotations) methods.

> #### Example
>
> ```python
> doc = nlp("This is a sentence.")
> textcat = nlp.add_pipe("textcat_multilabel")
> # This usually happens under the hood
> processed = textcat(doc)
> ```

| Name        | Description |
| ----------- | -------------------------------- |
| `doc`       | The document to process. ~~Doc~~ |
| **RETURNS** | The processed document. ~~Doc~~ |

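After the component has run, the predicted scores are available under `Doc.cats`. A small sketch of what that looks like (the label names and scores are made up and assume a trained pipeline):

```python
doc = nlp("This is a sentence.")
# In the multi-label setting each label gets an independent score
# between 0 and 1, so several labels can be "positive" at once.
print(doc.cats)
# e.g. {"POSITIVE": 0.87, "URGENT": 0.64, "SPAM": 0.02}
```
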
## MultiLabel_TextCategorizer.pipe {#pipe tag="method"}

Apply the pipe to a stream of documents. This usually happens under the hood
when the `nlp` object is called on a text and all pipeline components are
applied to the `Doc` in order. Both [`__call__`](/api/multilabel_textcategorizer#call) and
[`pipe`](/api/multilabel_textcategorizer#pipe) delegate to the
[`predict`](/api/multilabel_textcategorizer#predict) and
[`set_annotations`](/api/multilabel_textcategorizer#set_annotations) methods.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> for doc in textcat.pipe(docs, batch_size=50):
>     pass
> ```

| Name           | Description |
| -------------- | ------------------------------------------------------------- |
| `stream`       | A stream of documents. ~~Iterable[Doc]~~ |
| _keyword-only_ | |
| `batch_size`   | The number of documents to buffer. Defaults to `128`. ~~int~~ |
| **YIELDS**     | The processed documents in order. ~~Doc~~ |

## MultiLabel_TextCategorizer.initialize {#initialize tag="method" new="3"}

Initialize the component for training. `get_examples` should be a function that
returns an iterable of [`Example`](/api/example) objects. The data examples are
used to **initialize the model** of the component and can either be the full
training data or a representative sample. Initialization includes validating the
network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data. This method is typically called
by [`Language.initialize`](/api/language#initialize) and lets you customize
arguments it receives via the
[`[initialize.components]`](/api/data-formats#config-initialize) block in the
config.

<Infobox variant="warning" title="Changed in v3.0" id="begin_training">

This method was previously called `begin_training`.

</Infobox>

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.initialize(lambda: [], nlp=nlp)
> ```
>
> ```ini
> ### config.cfg
> [initialize.components.textcat_multilabel]
>
> [initialize.components.textcat_multilabel.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/textcat.json"
> ```

| Name           | Description |
| -------------- | ----------- |
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
| _keyword-only_ | |
| `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
| `labels`       | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |

## MultiLabel_TextCategorizer.predict {#predict tag="method"}

Apply the component's model to a batch of [`Doc`](/api/doc) objects without
modifying them.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict([doc1, doc2])
> ```

| Name        | Description |
| ----------- | ------------------------------------------- |
| `docs`      | The documents to predict. ~~Iterable[Doc]~~ |
| **RETURNS** | The model's prediction for each document. |

## MultiLabel_TextCategorizer.set_annotations {#set_annotations tag="method"}

Modify a batch of [`Doc`](/api/doc) objects using pre-computed scores.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict(docs)
> textcat.set_annotations(docs, scores)
> ```

| Name     | Description |
| -------- | --------------------------------------------------------------------- |
| `docs`   | The documents to modify. ~~Iterable[Doc]~~ |
| `scores` | The scores to set, produced by `MultiLabel_TextCategorizer.predict`. |

## MultiLabel_TextCategorizer.update {#update tag="method"}

Learn from a batch of [`Example`](/api/example) objects containing the
predictions and gold-standard annotations, and update the component's model.
Delegates to [`predict`](/api/multilabel_textcategorizer#predict) and
[`get_loss`](/api/multilabel_textcategorizer#get_loss).

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = nlp.initialize()
> losses = textcat.update(examples, sgd=optimizer)
> ```

| Name              | Description |
| ----------------- | ----------- |
| `examples`        | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_    | |
| `drop`            | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd`             | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses`          | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
| **RETURNS**       | The updated `losses` dictionary. ~~Dict[str, float]~~ |

## MultiLabel_TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"}

Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model to try to address
the "catastrophic forgetting" problem. This feature is experimental.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = nlp.resume_training()
> losses = textcat.rehearse(examples, sgd=optimizer)
> ```

| Name           | Description |
| -------------- | ----------- |
| `examples`     | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop`         | The dropout rate. ~~float~~ |
| `sgd`          | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses`       | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
| **RETURNS**    | The updated `losses` dictionary. ~~Dict[str, float]~~ |

## MultiLabel_TextCategorizer.get_loss {#get_loss tag="method"}

Find the loss and gradient of loss for the batch of documents and their
predicted scores.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict([eg.predicted for eg in examples])
> loss, d_loss = textcat.get_loss(examples, scores)
> ```

| Name        | Description |
| ----------- | --------------------------------------------------------------------------- |
| `examples`  | The batch of examples. ~~Iterable[Example]~~ |
| `scores`    | Scores representing the model's predictions. |
| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |

## MultiLabel_TextCategorizer.score {#score tag="method" new="3"}

Score a batch of examples.

> #### Example
>
> ```python
> scores = textcat.score(examples)
> ```

| Name           | Description |
| -------------- | ---------------------------------------------------------------------------------------------------------------------- |
| `examples`     | The examples to score. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| **RETURNS**    | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |

## MultiLabel_TextCategorizer.create_optimizer {#create_optimizer tag="method"}

Create an optimizer for the pipeline component.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = textcat.create_optimizer()
> ```

| Name        | Description |
| ----------- | ---------------------------- |
| **RETURNS** | The optimizer. ~~Optimizer~~ |

## MultiLabel_TextCategorizer.use_params {#use_params tag="method, contextmanager"}

Modify the pipe's model to use the given parameter values.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> with textcat.use_params(optimizer.averages):
>     textcat.to_disk("/best_model")
> ```

| Name     | Description |
| -------- | -------------------------------------------------- |
| `params` | The parameter values to use in the model. ~~dict~~ |

## MultiLabel_TextCategorizer.add_label {#add_label tag="method"}

Add a new label to the pipe. Raises an error if the output dimension is already
set, or if the model has already been fully [initialized](#initialize). Note
that you don't have to call this method if you provide a **representative data
sample** to the [`initialize`](#initialize) method. In this case, all labels
found in the sample will be automatically added to the model, and the output
dimension will be [inferred](/usage/layers-architectures#thinc-shape-inference)
automatically.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.add_label("MY_LABEL")
> ```

| Name        | Description |
| ----------- | ----------------------------------------------------------- |
| `label`     | The label to add. ~~str~~ |
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |

## MultiLabel_TextCategorizer.to_disk {#to_disk tag="method"}

Serialize the pipe to disk.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.to_disk("/path/to/textcat")
> ```

| Name           | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path`         | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |

## MultiLabel_TextCategorizer.from_disk {#from_disk tag="method"}

Load the pipe from disk. Modifies the object in place and returns it.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.from_disk("/path/to/textcat")
> ```

| Name           | Description |
| -------------- | ----------------------------------------------------------------------------------------------- |
| `path`         | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS**    | The modified `MultiLabel_TextCategorizer` object. ~~MultiLabel_TextCategorizer~~ |

## MultiLabel_TextCategorizer.to_bytes {#to_bytes tag="method"}

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat_bytes = textcat.to_bytes()
> ```

Serialize the pipe to a bytestring.

| Name           | Description |
| -------------- | --------------------------------------------------------------------------------------------- |
| _keyword-only_ | |
| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS**    | The serialized form of the `MultiLabel_TextCategorizer` object. ~~bytes~~ |

## MultiLabel_TextCategorizer.from_bytes {#from_bytes tag="method"}

Load the pipe from a bytestring. Modifies the object in place and returns it.

> #### Example
>
> ```python
> textcat_bytes = textcat.to_bytes()
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.from_bytes(textcat_bytes)
> ```

| Name           | Description |
| -------------- | --------------------------------------------------------------------------------------------- |
| `bytes_data`   | The data to load from. ~~bytes~~ |
| _keyword-only_ | |
| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS**    | The `MultiLabel_TextCategorizer` object. ~~MultiLabel_TextCategorizer~~ |

## MultiLabel_TextCategorizer.labels {#labels tag="property"}

The labels currently added to the component.

> #### Example
>
> ```python
> textcat.add_label("MY_LABEL")
> assert "MY_LABEL" in textcat.labels
> ```

| Name        | Description |
| ----------- | ------------------------------------------------------ |
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

## MultiLabel_TextCategorizer.label_data {#label_data tag="property" new="3"}

The labels currently added to the component and their internal meta information.
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
[`MultiLabel_TextCategorizer.initialize`](/api/multilabel_textcategorizer#initialize) to initialize
the model with a pre-defined label set.

> #### Example
>
> ```python
> labels = textcat.label_data
> textcat.initialize(lambda: [], nlp=nlp, labels=labels)
> ```

| Name        | Description |
| ----------- | ---------------------------------------------------------- |
| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |

## Serialization fields {#serialization-fields}

During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.

> #### Example
>
> ```python
> data = textcat.to_disk("/path", exclude=["vocab"])
> ```

| Name    | Description |
| ------- | -------------------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg`   | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |

@@ -3,17 +3,15 @@ title: TextCategorizer
tag: class
source: spacy/pipeline/textcat.py
new: 2
teaser: 'Pipeline component for text classification'
teaser: 'Pipeline component for single-label text classification'
api_base_class: /api/pipe
api_string_name: textcat
api_trainable: true
---

The text categorizer predicts **categories over a whole document**. It can learn
one or more labels, and the labels can be mutually exclusive (i.e. one true
label per document) or non-mutually exclusive (i.e. zero or more labels may be
true per document). The multi-label setting is controlled by the model instance
that's provided.
one or more labels, and the labels are mutually exclusive - there is exactly one
true label per document.

## Config and implementation {#config}

@@ -27,10 +25,10 @@ architectures and their arguments and hyperparameters.
> #### Example
>
> ```python
> from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL
> from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
> config = {
>     "threshold": 0.5,
>     "model": DEFAULT_TEXTCAT_MODEL,
>     "model": DEFAULT_SINGLE_TEXTCAT_MODEL,
> }
> nlp.add_pipe("textcat", config=config)
> ```

@@ -280,7 +278,6 @@ Score a batch of examples.
| ---------------- | ---------------------------------------------------------------------------------------------------------------------- |
| `examples`       | The examples to score. ~~Iterable[Example]~~ |
| _keyword-only_   | |
| `positive_label` | Optional positive label. ~~Optional[str]~~ |
| **RETURNS**      | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |

## TextCategorizer.create_optimizer {#create_optimizer tag="method"}

@@ -129,13 +129,13 @@ the entity recognizer, use a
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"

[components.ner]
factory = "ner"

@@ -161,13 +161,13 @@ factory = "ner"
@architectures = "spacy.TransitionBasedParser.v1"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
```

<!-- TODO: Once rehearsal is tested, mention it here. -->

@@ -713,34 +713,39 @@ layer = "tok2vec"

#### Pretraining objectives {#pretraining-details}

Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.

> ```ini
> ### Characters objective
> [pretraining.objective]
> type = "characters"
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```
>
> ```ini
> ### Vectors objective
> [pretraining.objective]
> type = "vectors"
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```

- **Characters:** The `"characters"` objective asks the model to predict some
  number of leading and trailing UTF-8 bytes for the words. For instance,
  setting `n_characters = 2`, the model will try to predict the first two and
  last two characters of the word.
Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.

- **Vectors:** The `"vectors"` objective asks the model to predict the word's
  vector, from a static embeddings table. This requires a word vectors model to
  be trained and loaded. The vectors objective can optimize either a cosine or
  an L2 loss. We've generally found cosine loss to perform better.
- [`PretrainCharacters`](/api/architectures#pretrain_chars): The `"characters"`
  objective asks the model to predict some number of leading and trailing UTF-8
  bytes for the words. For instance, setting `n_characters = 2`, the model will
  try to predict the first two and last two characters of the word.

- [`PretrainVectors`](/api/architectures#pretrain_vectors): The `"vectors"`
  objective asks the model to predict the word's vector, from a static
  embeddings table. This requires a word vectors model to be trained and loaded.
  The vectors objective can optimize either a cosine or an L2 loss. We've
  generally found cosine loss to perform better.

These pretraining objectives use a trick that we term **language modelling with
approximate outputs (LMAO)**. The motivation for the trick is that predicting an

@@ -134,7 +134,7 @@ labels = []
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"

[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"

@@ -144,7 +144,7 @@ attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false

[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = ${components.textcat.model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3

@@ -152,7 +152,7 @@ depth = 2

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
exclusive_classes = true
ngram_size = 1
no_output_layer = false
```

@@ -170,7 +170,7 @@ labels = []

[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null

@@ -201,14 +201,14 @@ tokens, and their combination forms a typical
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
# ...

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
# ...
```

@@ -224,7 +224,7 @@ architecture:
# ...

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@architectures = "spacy.MaxoutWindowEncoder.v2"
# ...
```

@@ -716,7 +716,7 @@ that we want to classify as being related or not. As these candidate pairs are
typically formed within one document, this function takes a [`Doc`](/api/doc) as
input and outputs a `List` of `Span` tuples. For instance, the following
implementation takes any two entities from the same document, as long as they
are within a **maximum distance** (in number of tokens) of eachother:
are within a **maximum distance** (in number of tokens) of each other:

> #### config.cfg (excerpt)
>

@@ -742,7 +742,7 @@ def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]
    return get_candidates
```

This function in added to the [`@misc` registry](/api/top-level#registry) so we
This function is added to the [`@misc` registry](/api/top-level#registry) so we
can refer to it from the config, and easily swap it out for any other candidate
generation function.

@@ -1060,7 +1060,7 @@ In this example we assume a custom function `read_custom_data` which loads or
generates texts with relevant text classification annotations. Then, small
lexical variations of the input text are created before generating the final
[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
you register the function creating the custom reader in the `readers`
you register the function creating the custom reader in the `readers`
[registry](/api/top-level#registry) and assign it a string name, so it can be
used in your config. All arguments on the registered function become available
as **config settings** – in this case, `source`.