Mirror of https://github.com/explosion/spaCy.git, synced 2025-01-12 02:06:31 +03:00

Merge branch 'develop' of https://github.com/explosion/spaCy into develop

Commit 97d5a7ba99
.github/contributors/bratao.md (vendored, new file, 106 lines added)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [X] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Bruno Souza Cabral   |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 24/12/2020           |
| GitHub username                | bratao               |
| Website (optional)             |                      |
.gitignore (vendored, 1 line added)

@@ -51,6 +51,7 @@ env3.*/
 .pypyenv
 .pytest_cache/
 .mypy_cache/
+.hypothesis/

 # Distribution / packaging
 env/
@@ -35,7 +35,10 @@ def download_cli(


 def download(model: str, direct: bool = False, *pip_args) -> None:
-    if not (is_package("spacy") or is_package("spacy-nightly")) and "--no-deps" not in pip_args:
+    if (
+        not (is_package("spacy") or is_package("spacy-nightly"))
+        and "--no-deps" not in pip_args
+    ):
        msg.warn(
            "Skipping pipeline package dependencies and setting `--no-deps`. "
            "You don't seem to have the spaCy package itself installed "
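For reference (not part of the commit), a minimal usage sketch; it assumes this hunk is spaCy's download CLI module, which the `download_cli`/`is_package` names suggest:

    from spacy.cli import download

    # Resolves the pipeline package for the installed spaCy version and
    # pip-installs it; extra positional arguments are passed through to pip.
    download("en_core_web_sm")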
@@ -172,7 +172,9 @@ def render_parses(
         file_.write(html)


-def print_prf_per_type(msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str) -> None:
+def print_prf_per_type(
+    msg: Printer, scores: Dict[str, Dict[str, float]], name: str, type: str
+) -> None:
     data = [
         (k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}")
         for k, v in scores.items()
@@ -1,10 +1,10 @@
-from typing import Optional, Dict, Any, Union
+from typing import Optional, Dict, Any, Union, List
 import platform
 from pathlib import Path
 from wasabi import Printer, MarkdownRenderer
 import srsly

-from ._util import app, Arg, Opt
+from ._util import app, Arg, Opt, string_to_list
 from .. import util
 from .. import about

@@ -15,20 +15,22 @@ def info_cli(
     model: Optional[str] = Arg(None, help="Optional loadable spaCy pipeline"),
     markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues"),
     silent: bool = Opt(False, "--silent", "-s", "-S", help="Don't print anything (just return)"),
+    exclude: Optional[str] = Opt("labels", "--exclude", "-e", help="Comma-separated keys to exclude from the print-out"),
     # fmt: on
 ):
     """
-    Print info about spaCy installation. If a pipeline is speficied as an argument,
+    Print info about spaCy installation. If a pipeline is specified as an argument,
     print its meta information. Flag --markdown prints details in Markdown for easy
     copy-pasting to GitHub issues.

     DOCS: https://nightly.spacy.io/api/cli#info
     """
-    info(model, markdown=markdown, silent=silent)
+    exclude = string_to_list(exclude)
+    info(model, markdown=markdown, silent=silent, exclude=exclude)


 def info(
-    model: Optional[str] = None, *, markdown: bool = False, silent: bool = True
+    model: Optional[str] = None, *, markdown: bool = False, silent: bool = True, exclude: List[str]
 ) -> Union[str, dict]:
     msg = Printer(no_print=silent, pretty=not silent)
     if model:

@@ -42,13 +44,13 @@ def info(
         data["Pipelines"] = ", ".join(
             f"{n} ({v})" for n, v in data["Pipelines"].items()
         )
-    markdown_data = get_markdown(data, title=title)
+    markdown_data = get_markdown(data, title=title, exclude=exclude)
     if markdown:
         if not silent:
             print(markdown_data)
         return markdown_data
     if not silent:
-        table_data = dict(data)
+        table_data = {k: v for k, v in data.items() if k not in exclude}
         msg.table(table_data, title=title)
     return raw_data

@@ -82,7 +84,7 @@ def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
     if util.is_package(model):
         model_path = util.get_package_path(model)
     else:
-        model_path = model
+        model_path = Path(model)
     meta_path = model_path / "meta.json"
     if not meta_path.is_file():
         msg.fail("Can't find pipeline meta.json", meta_path, exits=1)

@@ -96,7 +98,7 @@ def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]:
     }


-def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str:
+def get_markdown(data: Dict[str, Any], title: Optional[str] = None, exclude: List[str] = None) -> str:
     """Get data in GitHub-flavoured Markdown format for issues etc.

     data (dict or list of tuples): Label/value pairs.

@@ -108,8 +110,16 @@ def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str:
         md.add(md.title(2, title))
     items = []
     for key, value in data.items():
-        if isinstance(value, str) and Path(value).exists():
+        if exclude and key in exclude:
             continue
+        if isinstance(value, str):
+            try:
+                existing_path = Path(value).exists()
+            except:
+                # invalid Path, like a URL string
+                existing_path = False
+            if existing_path:
+                continue
         items.append(f"{md.bold(f'{key}:')} {value}")
     md.add(md.list(items))
     return f"\n{md.text}\n"
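For reference (not part of the commit), a sketch of the new exclude behaviour, assuming the hunk above comes from spaCy's `spacy.cli.info` module, which the `info_cli`/`get_markdown` names suggest:

    from spacy.cli.info import info

    # Keys listed in exclude (the CLI default is "labels") are dropped from
    # both the Markdown output and the printed table.
    print(info(markdown=True, silent=True, exclude=["labels"]))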
@@ -32,6 +32,7 @@ def init_config_cli(
     optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
     gpu: bool = Opt(False, "--gpu", "-G", help="Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters."),
     pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
+    force_overwrite: bool = Opt(False, "--force", "-F", help="Force overwriting the output file"),
     # fmt: on
 ):
     """

@@ -46,6 +47,12 @@ def init_config_cli(
     optimize = optimize.value
     pipeline = string_to_list(pipeline)
     is_stdout = str(output_file) == "-"
+    if not is_stdout and output_file.exists() and not force_overwrite:
+        msg = Printer()
+        msg.fail(
+            "The provided output file already exists. To force overwriting the config file, set the --force or -F flag.",
+            exits=1,
+        )
     config = init_config(
         lang=lang,
         pipeline=pipeline,

@@ -162,7 +169,7 @@ def init_config(
         "Hardware": variables["hardware"].upper(),
         "Transformer": template_vars.transformer.get("name", False),
     }
-    msg.info("Generated template specific for your use case")
+    msg.info("Generated config template specific for your use case")
     for label, value in use_case.items():
         msg.text(f"- {label}: {value}")
     with show_validation_error(hint_fill=False):
@@ -149,13 +149,44 @@ grad_factor = 1.0

 [components.textcat.model.linear_model]
 @architectures = "spacy.TextCatBOW.v1"
-exclusive_classes = false
+exclusive_classes = true
 ngram_size = 1
 no_output_layer = false

 {% else -%}
 [components.textcat.model]
 @architectures = "spacy.TextCatBOW.v1"
+exclusive_classes = true
+ngram_size = 1
+no_output_layer = false
+{%- endif %}
+{%- endif %}
+
+{% if "textcat_multilabel" in components %}
+[components.textcat_multilabel]
+factory = "textcat_multilabel"
+
+{% if optimize == "accuracy" %}
+[components.textcat_multilabel.model]
+@architectures = "spacy.TextCatEnsemble.v2"
+nO = null
+
+[components.textcat_multilabel.model.tok2vec]
+@architectures = "spacy-transformers.TransformerListener.v1"
+grad_factor = 1.0
+
+[components.textcat_multilabel.model.tok2vec.pooling]
+@layers = "reduce_mean.v1"
+
+[components.textcat_multilabel.model.linear_model]
+@architectures = "spacy.TextCatBOW.v1"
+exclusive_classes = false
+ngram_size = 1
+no_output_layer = false
+
+{% else -%}
+[components.textcat_multilabel.model]
+@architectures = "spacy.TextCatBOW.v1"
 exclusive_classes = false
 ngram_size = 1
 no_output_layer = false

@@ -174,7 +205,7 @@ no_output_layer = false
 factory = "tok2vec"

 [components.tok2vec.model]
-@architectures = "spacy.Tok2Vec.v1"
+@architectures = "spacy.Tok2Vec.v2"

 [components.tok2vec.model.embed]
 @architectures = "spacy.MultiHashEmbed.v1"

@@ -189,7 +220,7 @@ rows = [5000, 2500]
 include_static_vectors = {{ "true" if optimize == "accuracy" else "false" }}

 [components.tok2vec.model.encode]
-@architectures = "spacy.MaxoutWindowEncoder.v1"
+@architectures = "spacy.MaxoutWindowEncoder.v2"
 width = {{ 96 if optimize == "efficiency" else 256 }}
 depth = {{ 4 if optimize == "efficiency" else 8 }}
 window_size = 1

@@ -288,13 +319,41 @@ width = ${components.tok2vec.model.encode.width}

 [components.textcat.model.linear_model]
 @architectures = "spacy.TextCatBOW.v1"
-exclusive_classes = false
+exclusive_classes = true
 ngram_size = 1
 no_output_layer = false

 {% else -%}
 [components.textcat.model]
 @architectures = "spacy.TextCatBOW.v1"
+exclusive_classes = true
+ngram_size = 1
+no_output_layer = false
+{%- endif %}
+{%- endif %}
+
+{% if "textcat_multilabel" in components %}
+[components.textcat_multilabel]
+factory = "textcat_multilabel"
+
+{% if optimize == "accuracy" %}
+[components.textcat_multilabel.model]
+@architectures = "spacy.TextCatEnsemble.v2"
+nO = null
+
+[components.textcat_multilabel.model.tok2vec]
+@architectures = "spacy.Tok2VecListener.v1"
+width = ${components.tok2vec.model.encode.width}
+
+[components.textcat_multilabel.model.linear_model]
+@architectures = "spacy.TextCatBOW.v1"
+exclusive_classes = false
+ngram_size = 1
+no_output_layer = false
+
+{% else -%}
+[components.textcat_multilabel.model]
+@architectures = "spacy.TextCatBOW.v1"
 exclusive_classes = false
 ngram_size = 1
 no_output_layer = false

@@ -303,7 +362,7 @@ no_output_layer = false
 {% endif %}

 {% for pipe in components %}
-{% if pipe not in ["tagger", "morphologizer", "parser", "ner", "textcat", "entity_linker"] %}
+{% if pipe not in ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker"] %}
 {# Other components defined by the user: we just assume they're factories #}
 [components.{{ pipe }}]
 factory = "{{ pipe }}"
@@ -463,6 +463,10 @@ class Errors:
             "issue tracker: http://github.com/explosion/spaCy/issues")

     # TODO: fix numbering after merging develop into master
+    E895 = ("The 'textcat' component received gold-standard annotations with "
+            "multiple labels per document. In spaCy 3 you should use the "
+            "'textcat_multilabel' component for this instead. "
+            "Example of an offending annotation: {value}")
     E896 = ("There was an error using the static vectors. Ensure that the vectors "
             "of the vocab are properly initialized, or set 'include_static_vectors' "
             "to False.")
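For reference (not part of the commit), what E895 steers users towards; the factory name comes from the quickstart template hunk above:

    import spacy

    nlp = spacy.blank("en")
    textcat = nlp.add_pipe("textcat_multilabel")
    # With the multi-label component, several labels may legitimately apply
    # to the same document.
    textcat.add_label("POSITIVE")
    textcat.add_label("URGENT")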
@@ -214,8 +214,22 @@ _macedonian_lower = r"ѓѕјљњќѐѝ"
 _macedonian_upper = r"ЃЅЈЉЊЌЀЍ"
 _macedonian = r"ѓѕјљњќѐѝЃЅЈЉЊЌЀЍ"

-_upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian_upper + _macedonian_upper
-_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower + _macedonian_lower
+_upper = (
+    LATIN_UPPER
+    + _russian_upper
+    + _tatar_upper
+    + _greek_upper
+    + _ukrainian_upper
+    + _macedonian_upper
+)
+_lower = (
+    LATIN_LOWER
+    + _russian_lower
+    + _tatar_lower
+    + _greek_lower
+    + _ukrainian_lower
+    + _macedonian_lower
+)

 _uncased = (
     _bengali

@@ -230,7 +244,9 @@ _uncased = (
     + _cjk
 )

-ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _macedonian + _uncased)
+ALPHA = group_chars(
+    LATIN + _russian + _tatar + _greek + _ukrainian + _macedonian + _uncased
+)
 ALPHA_LOWER = group_chars(_lower + _uncased)
 ALPHA_UPPER = group_chars(_upper + _uncased)

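For reference (not part of the commit), a quick check that the reflowed classes still behave the same, assuming the hunk is spaCy's `spacy.lang.char_classes` module as the `ALPHA*` names suggest:

    import re
    from spacy.lang.char_classes import ALPHA_UPPER

    # The merged uppercase class includes the Macedonian letters listed above.
    print(bool(re.match(f"[{ALPHA_UPPER}]", "Ѕ")))  # True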
@@ -1,18 +1,11 @@
 from .stop_words import STOP_WORDS
-from .tag_map import TAG_MAP
-from ...language import Language
-from ...attrs import LANG
 from .lex_attrs import LEX_ATTRS
 from ...language import Language


 class CzechDefaults(Language.Defaults):
-    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
-    lex_attr_getters.update(LEX_ATTRS)
-    lex_attr_getters[LANG] = lambda text: "cs"
-    tag_map = TAG_MAP
-    stop_words = STOP_WORDS
     lex_attr_getters = LEX_ATTRS
+    stop_words = STOP_WORDS


 class Czech(Language):
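For reference (not part of the commit), the slimmed-down defaults are picked up by a blank Czech pipeline; a minimal sketch:

    import spacy

    nlp = spacy.blank("cs")
    doc = nlp("Ahoj, to je test.")
    print([(token.text, token.is_stop) for token in doc])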
(File diff suppressed because it is too large.)
@@ -14,7 +14,7 @@ class MacedonianLemmatizer(Lemmatizer):
         if univ_pos in ("", "eol", "space"):
             return [string.lower()]

-        if string[-3:] == 'јќи':
+        if string[-3:] == "јќи":
             string = string[:-3]
             univ_pos = "verb"

@@ -23,7 +23,13 @@ class MacedonianLemmatizer(Lemmatizer):
         index_table = self.lookups.get_table("lemma_index", {})
         exc_table = self.lookups.get_table("lemma_exc", {})
         rules_table = self.lookups.get_table("lemma_rules", {})
-        if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))):
+        if not any(
+            (
+                index_table.get(univ_pos),
+                exc_table.get(univ_pos),
+                rules_table.get(univ_pos),
+            )
+        ):
             if univ_pos == "propn":
                 return [string]
             else:
@@ -1,21 +1,104 @@
 from ...attrs import LIKE_NUM

 _num_words = [
-    "нула", "еден", "една", "едно", "два", "две", "три", "четири", "пет", "шест", "седум", "осум", "девет", "десет",
-    "единаесет", "дванаесет", "тринаесет", "четиринаесет", "петнаесет", "шеснаесет", "седумнаесет", "осумнаесет",
-    "деветнаесет", "дваесет", "триесет", "четириесет", "педесет", "шеесет", "седумдесет", "осумдесет", "деведесет",
-    "сто", "двесте", "триста", "четиристотини", "петстотини", "шестотини", "седумстотини", "осумстотини",
-    "деветстотини", "илјада", "илјади", 'милион', 'милиони', 'милијарда', 'милијарди', 'билион', 'билиони',
-
-    "двајца", "тројца", "четворица", "петмина", "шестмина", "седуммина", "осуммина", "деветмина", "обата", "обајцата",
-
-    "прв", "втор", "трет", "четврт", "седм", "осм", "двестоти",
-
-    "два-три", "два-триесет", "два-триесетмина", "два-тринаесет", "два-тројца", "две-три", "две-тристотини",
-    "пет-шеесет", "пет-шеесетмина", "пет-шеснаесетмина", "пет-шест", "пет-шестмина", "пет-шестотини", "петина",
-    "осмина", "седум-осум", "седум-осумдесет", "седум-осуммина", "седум-осумнаесет", "седум-осумнаесетмина",
-    "три-четириесет", "три-четиринаесет", "шеесет", "шеесетина", "шеесетмина", "шеснаесет", "шеснаесетмина",
-    "шест-седум", "шест-седумдесет", "шест-седумнаесет", "шест-седумстотини", "шестоти", "шестотини"
+    "нула",
+    "еден",
+    "една",
+    "едно",
+    "два",
+    "две",
+    "три",
+    "четири",
+    "пет",
+    "шест",
+    "седум",
+    "осум",
+    "девет",
+    "десет",
+    "единаесет",
+    "дванаесет",
+    "тринаесет",
+    "четиринаесет",
+    "петнаесет",
+    "шеснаесет",
+    "седумнаесет",
+    "осумнаесет",
+    "деветнаесет",
+    "дваесет",
+    "триесет",
+    "четириесет",
+    "педесет",
+    "шеесет",
+    "седумдесет",
+    "осумдесет",
+    "деведесет",
+    "сто",
+    "двесте",
+    "триста",
+    "четиристотини",
+    "петстотини",
+    "шестотини",
+    "седумстотини",
+    "осумстотини",
+    "деветстотини",
+    "илјада",
+    "илјади",
+    "милион",
+    "милиони",
+    "милијарда",
+    "милијарди",
+    "билион",
+    "билиони",
+    "двајца",
+    "тројца",
+    "четворица",
+    "петмина",
+    "шестмина",
+    "седуммина",
+    "осуммина",
+    "деветмина",
+    "обата",
+    "обајцата",
+    "прв",
+    "втор",
+    "трет",
+    "четврт",
+    "седм",
+    "осм",
+    "двестоти",
+    "два-три",
+    "два-триесет",
+    "два-триесетмина",
+    "два-тринаесет",
+    "два-тројца",
+    "две-три",
+    "две-тристотини",
+    "пет-шеесет",
+    "пет-шеесетмина",
+    "пет-шеснаесетмина",
+    "пет-шест",
+    "пет-шестмина",
+    "пет-шестотини",
+    "петина",
+    "осмина",
+    "седум-осум",
+    "седум-осумдесет",
+    "седум-осуммина",
+    "седум-осумнаесет",
+    "седум-осумнаесетмина",
+    "три-четириесет",
+    "три-четиринаесет",
+    "шеесет",
+    "шеесетина",
+    "шеесетмина",
+    "шеснаесет",
+    "шеснаесетмина",
+    "шест-седум",
+    "шест-седумдесет",
+    "шест-седумнаесет",
+    "шест-седумстотини",
+    "шестоти",
+    "шестотини",
 ]

@@ -21,8 +21,7 @@ _abbr_exc = [
     {ORTH: "хл", NORM: "хектолитар"},
     {ORTH: "дкл", NORM: "декалитар"},
     {ORTH: "л", NORM: "литар"},
-    {ORTH: "дл", NORM: "децилитар"}
-
+    {ORTH: "дл", NORM: "децилитар"},
 ]
 for abbr in _abbr_exc:
     _exc[abbr[ORTH]] = [abbr]

@@ -33,7 +32,6 @@ _abbr_line_exc = [
     {ORTH: "г-ѓа", NORM: "госпоѓа"},
     {ORTH: "г-ца", NORM: "госпоѓица"},
     {ORTH: "г-дин", NORM: "господин"},
-
 ]

 for abbr in _abbr_line_exc:

@@ -54,7 +52,6 @@ _abbr_dot_exc = [
     {ORTH: "т.", NORM: "точка"},
     {ORTH: "т.е.", NORM: "то ест"},
     {ORTH: "т.н.", NORM: "таканаречен"},
-
     {ORTH: "бр.", NORM: "број"},
     {ORTH: "гр.", NORM: "град"},
     {ORTH: "др.", NORM: "другар"},

@@ -68,7 +65,6 @@ _abbr_dot_exc = [
     {ORTH: "с.", NORM: "страница"},
     {ORTH: "стр.", NORM: "страница"},
     {ORTH: "чл.", NORM: "член"},
-
     {ORTH: "арх.", NORM: "архитект"},
     {ORTH: "бел.", NORM: "белешка"},
     {ORTH: "гимн.", NORM: "гимназија"},

@@ -89,8 +85,6 @@ _abbr_dot_exc = [
     {ORTH: "истор.", NORM: "историја"},
     {ORTH: "геогр.", NORM: "географија"},
     {ORTH: "литер.", NORM: "литература"},
-
-
 ]

 for abbr in _abbr_dot_exc:
@@ -45,7 +45,7 @@ _abbr_period_exc = [
     {ORTH: "Doç.", NORM: "doçent"},
     {ORTH: "doğ."},
     {ORTH: "Dr.", NORM: "doktor"},
-    {ORTH: "dr.", NORM:"doktor"},
+    {ORTH: "dr.", NORM: "doktor"},
     {ORTH: "drl.", NORM: "derleyen"},
     {ORTH: "Dz.", NORM: "deniz"},
     {ORTH: "Dz.K.K.lığı"},

@@ -118,7 +118,7 @@ _abbr_period_exc = [
     {ORTH: "Uzm.", NORM: "uzman"},
     {ORTH: "Üçvş.", NORM: "üstçavuş"},
     {ORTH: "Üni.", NORM: "üniversitesi"},
     {ORTH: "Ütğm.", NORM: "üsteğmen"},
     {ORTH: "vb."},
     {ORTH: "vs.", NORM: "vesaire"},
     {ORTH: "Yard.", NORM: "yardımcı"},

@@ -163,19 +163,29 @@ for abbr in _abbr_exc:
     _exc[abbr[ORTH]] = [abbr]



 _num = r"[+-]?\d+([,.]\d+)*"
 _ord_num = r"(\d+\.)"
 _date = r"(((\d{1,2}[./-]){2})?(\d{4})|(\d{1,2}[./]\d{1,2}(\.)?))"
 _dash_num = r"(([{al}\d]+/\d+)|(\d+/[{al}]))".format(al=ALPHA)
 _roman_num = "M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"
 _roman_ord = r"({rn})\.".format(rn=_roman_num)
 _time_exp = r"\d+(:\d+)*"

 _inflections = r"'[{al}]+".format(al=ALPHA_LOWER)
 _abbrev_inflected = r"[{a}]+\.'[{al}]+".format(a=ALPHA, al=ALPHA_LOWER)

-_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(d=_date, dn=_dash_num, te=_time_exp, on=_ord_num, n=_num, ro=_roman_ord, rn=_roman_num, inf=_inflections)
+_nums = r"(({d})|({dn})|({te})|({on})|({n})|({ro})|({rn}))({inf})?".format(
+    d=_date,
+    dn=_dash_num,
+    te=_time_exp,
+    on=_ord_num,
+    n=_num,
+    ro=_roman_ord,
+    rn=_roman_num,
+    inf=_inflections,
+)

 TOKENIZER_EXCEPTIONS = _exc
-TOKEN_MATCH = re.compile(r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)).match
+TOKEN_MATCH = re.compile(
+    r"^({abbr})|({n})$".format(n=_nums, abbr=_abbrev_inflected)
+).match
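For reference (not part of the commit), a self-contained check of the roman-ordinal pattern copied from the hunk above:

    import re

    _roman_num = "M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"
    _roman_ord = r"({rn})\.".format(rn=_roman_num)
    # Ordinals written as roman numerals with a trailing period should match.
    print(bool(re.fullmatch(_roman_ord, "XIV.")))  # True
    print(bool(re.fullmatch(_roman_ord, "XIV")))   # False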
@@ -1,4 +1,3 @@
-import numpy
 from thinc.api import Model

 from ..attrs import LOWER
@@ -21,14 +21,14 @@ def transition_parser_v1(
     nO: Optional[int] = None,
 ) -> Model:
     return build_tb_parser_model(
         tok2vec,
         state_type,
         extra_state_tokens,
         hidden_width,
         maxout_pieces,
         use_upper,
         nO,
     )


 @registry.architectures.register("spacy.TransitionBasedParser.v2")

@@ -42,14 +42,15 @@ def transition_parser_v2(
     nO: Optional[int] = None,
 ) -> Model:
     return build_tb_parser_model(
         tok2vec,
         state_type,
         extra_state_tokens,
         hidden_width,
         maxout_pieces,
         use_upper,
         nO,
     )


 def build_tb_parser_model(
     tok2vec: Model[List[Doc], List[Floats2d]],

@@ -162,8 +163,8 @@ def _resize_upper(model, new_nO):
     # just adding rows here.
     if smaller.has_dim("nO"):
         old_nO = smaller.get_dim("nO")
-        larger_W[: old_nO] = smaller_W
-        larger_b[: old_nO] = smaller_b
+        larger_W[:old_nO] = smaller_W
+        larger_b[:old_nO] = smaller_b
         for i in range(old_nO, new_nO):
             model.attrs["unseen_classes"].add(i)

@@ -6,6 +6,7 @@ from thinc.api import chain, concatenate, clone, Dropout, ParametricAttention
 from thinc.api import SparseLinear, Softmax, softmax_activation, Maxout, reduce_sum
 from thinc.api import HashEmbed, with_array, with_cpu, uniqued
 from thinc.api import Relu, residual, expand_window
+from thinc.layers.chain import init as init_chain

 from ...attrs import ID, ORTH, PREFIX, SUFFIX, SHAPE, LOWER
 from ...util import registry

@@ -13,6 +14,7 @@ from ..extract_ngrams import extract_ngrams
 from ..staticvectors import StaticVectors
 from ..featureextractor import FeatureExtractor
 from ...tokens import Doc
+from .tok2vec import get_tok2vec_width


 @registry.architectures.register("spacy.TextCatCNN.v1")

@@ -69,13 +71,16 @@ def build_text_classifier_v2(
     exclusive_classes = not linear_model.attrs["multi_label"]
     with Model.define_operators({">>": chain, "|": concatenate}):
         width = tok2vec.maybe_get_dim("nO")
+        attention_layer = ParametricAttention(width)  # TODO: benchmark performance difference of this layer
+        maxout_layer = Maxout(nO=width, nI=width)
+        linear_layer = Linear(nO=nO, nI=width)
         cnn_model = (
             tok2vec
             >> list2ragged()
-            >> ParametricAttention(width)  # TODO: benchmark performance difference of this layer
+            >> attention_layer
             >> reduce_sum()
-            >> residual(Maxout(nO=width, nI=width))
-            >> Linear(nO=nO, nI=width)
+            >> residual(maxout_layer)
+            >> linear_layer
             >> Dropout(0.0)
         )

@@ -89,9 +94,25 @@ def build_text_classifier_v2(
     if model.has_dim("nO") is not False:
         model.set_dim("nO", nO)
     model.set_ref("output_layer", linear_model.get_ref("output_layer"))
+    model.set_ref("attention_layer", attention_layer)
+    model.set_ref("maxout_layer", maxout_layer)
+    model.set_ref("linear_layer", linear_layer)
     model.attrs["multi_label"] = not exclusive_classes

+    model.init = init_ensemble_textcat
     return model


+def init_ensemble_textcat(model, X, Y) -> Model:
+    tok2vec_width = get_tok2vec_width(model)
+    model.get_ref("attention_layer").set_dim("nO", tok2vec_width)
+    model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
+    model.get_ref("maxout_layer").set_dim("nI", tok2vec_width)
+    model.get_ref("linear_layer").set_dim("nI", tok2vec_width)
+    init_chain(model, X, Y)
+    return model
+
+
 # TODO: move to legacy
 @registry.architectures.register("spacy.TextCatEnsemble.v1")
 def build_text_classifier_v1(
@@ -20,6 +20,17 @@ def tok2vec_listener_v1(width: int, upstream: str = "*"):
     return tok2vec


+def get_tok2vec_width(model: Model):
+    nO = None
+    if model.has_ref("tok2vec"):
+        tok2vec = model.get_ref("tok2vec")
+        if tok2vec.has_dim("nO"):
+            nO = tok2vec.get_dim("nO")
+        elif tok2vec.has_ref("listener"):
+            nO = tok2vec.get_ref("listener").get_dim("nO")
+    return nO
+
+
 @registry.architectures.register("spacy.HashEmbedCNN.v1")
 def build_hash_embed_cnn_tok2vec(
     *,

@@ -76,6 +87,7 @@ def build_hash_embed_cnn_tok2vec(
     )


+# TODO: archive
 @registry.architectures.register("spacy.Tok2Vec.v1")
 def build_Tok2Vec_model(
     embed: Model[List[Doc], List[Floats2d]],

@@ -97,6 +109,28 @@ def build_Tok2Vec_model(
     return tok2vec


+@registry.architectures.register("spacy.Tok2Vec.v2")
+def build_Tok2Vec_model(
+    embed: Model[List[Doc], List[Floats2d]],
+    encode: Model[List[Floats2d], List[Floats2d]],
+) -> Model[List[Doc], List[Floats2d]]:
+    """Construct a tok2vec model out of embedding and encoding subnetworks.
+    See https://explosion.ai/blog/deep-learning-formula-nlp
+
+    embed (Model[List[Doc], List[Floats2d]]): Embed tokens into context-independent
+        word vector representations.
+    encode (Model[List[Floats2d], List[Floats2d]]): Encode context into the
+        embeddings, using an architecture such as a CNN, BiLSTM or transformer.
+    """
+    tok2vec = chain(embed, encode)
+    tok2vec.set_dim("nO", encode.get_dim("nO"))
+    tok2vec.set_ref("embed", embed)
+    tok2vec.set_ref("encode", encode)
+    return tok2vec
+
+
 @registry.architectures.register("spacy.MultiHashEmbed.v1")
 def MultiHashEmbed(
     width: int,

@@ -244,6 +278,7 @@ def CharacterEmbed(
     return model


+# TODO: archive
 @registry.architectures.register("spacy.MaxoutWindowEncoder.v1")
 def MaxoutWindowEncoder(
     width: int, window_size: int, maxout_pieces: int, depth: int

@@ -275,7 +310,39 @@ def MaxoutWindowEncoder(
     model.attrs["receptive_field"] = window_size * depth
     return model

+
+@registry.architectures.register("spacy.MaxoutWindowEncoder.v2")
+def MaxoutWindowEncoder(
+    width: int, window_size: int, maxout_pieces: int, depth: int
+) -> Model[List[Floats2d], List[Floats2d]]:
+    """Encode context using convolutions with maxout activation, layer
+    normalization and residual connections.
+
+    width (int): The input and output width. These are required to be the same,
+        to allow residual connections. This value will be determined by the
+        width of the inputs. Recommended values are between 64 and 300.
+    window_size (int): The number of words to concatenate around each token
+        to construct the convolution. Recommended value is 1.
+    maxout_pieces (int): The number of maxout pieces to use. Recommended
+        values are 2 or 3.
+    depth (int): The number of convolutional layers. Recommended value is 4.
+    """
+    cnn = chain(
+        expand_window(window_size=window_size),
+        Maxout(
+            nO=width,
+            nI=width * ((window_size * 2) + 1),
+            nP=maxout_pieces,
+            dropout=0.0,
+            normalize=True,
+        ),
+    )
+    model = clone(residual(cnn), depth)
+    model.set_dim("nO", width)
+    receptive_field = window_size * depth
+    return with_array(model, pad=receptive_field)
+
+
+# TODO: archive
 @registry.architectures.register("spacy.MishWindowEncoder.v1")
 def MishWindowEncoder(
     width: int, window_size: int, depth: int

@@ -299,6 +366,29 @@ def MishWindowEncoder(
     return model


+@registry.architectures.register("spacy.MishWindowEncoder.v2")
+def MishWindowEncoder(
+    width: int, window_size: int, depth: int
+) -> Model[List[Floats2d], List[Floats2d]]:
+    """Encode context using convolutions with mish activation, layer
+    normalization and residual connections.
+
+    width (int): The input and output width. These are required to be the same,
+        to allow residual connections. This value will be determined by the
+        width of the inputs. Recommended values are between 64 and 300.
+    window_size (int): The number of words to concatenate around each token
+        to construct the convolution. Recommended value is 1.
+    depth (int): The number of convolutional layers. Recommended value is 4.
+    """
+    cnn = chain(
+        expand_window(window_size=window_size),
+        Mish(nO=width, nI=width * ((window_size * 2) + 1), dropout=0.0, normalize=True),
+    )
+    model = clone(residual(cnn), depth)
+    model.set_dim("nO", width)
+    return with_array(model)
+
+
 @registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
 def BiLSTMEncoder(
     width: int, depth: int, dropout: float

@@ -308,9 +398,9 @@ def BiLSTMEncoder(
     width (int): The input and output width. These are required to be the same,
         to allow residual connections. This value will be determined by the
         width of the inputs. Recommended values are between 64 and 300.
-    window_size (int): The number of words to concatenate around each token
-        to construct the convolution. Recommended value is 1.
-    depth (int): The number of convolutional layers. Recommended value is 4.
+    depth (int): The number of recurrent layers.
+    dropout (float): Creates a Dropout layer on the outputs of each LSTM layer
+        except the last layer. Set to 0 to disable this functionality.
     """
     if depth == 0:
         return noop()
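For reference (not part of the commit), the new v2 entries are ordinary registry functions and can be resolved by the names used in the decorators above; a minimal sketch:

    from spacy.util import registry

    # "spacy.Tok2Vec.v2" and "spacy.MishWindowEncoder.v2" can be resolved the
    # same way; keyword arguments follow the signatures in the hunk above.
    make_encoder = registry.architectures.get("spacy.MaxoutWindowEncoder.v2")
    encoder = make_encoder(width=96, window_size=1, maxout_pieces=3, depth=4)
    print(encoder.name)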
@@ -47,8 +47,7 @@ def forward(
    except ValueError:
        raise RuntimeError(Errors.E896)
    output = Ragged(
-        vectors_data,
-        model.ops.asarray([len(doc) for doc in docs], dtype="i")
+        vectors_data, model.ops.asarray([len(doc) for doc in docs], dtype="i")
    )
    mask = None
    if is_train:
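For reference (not part of the commit), a toy illustration of the Ragged value built above: a flat array of row vectors plus a lengths vector saying how many rows belong to each doc:

    import numpy
    from thinc.types import Ragged

    vectors_data = numpy.zeros((5, 4), dtype="f")
    lengths = numpy.asarray([3, 2], dtype="i")  # two docs: 3 tokens and 2 tokens
    output = Ragged(vectors_data, lengths)
    print(output.lengths)  # [3 2]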
@@ -1,8 +1,10 @@
-from thinc.api import Model, noop, use_ops, Linear
+from thinc.api import Model, noop
 from .parser_model import ParserStepModel


-def TransitionModel(tok2vec, lower, upper, resize_output, dropout=0.2, unseen_classes=set()):
+def TransitionModel(
+    tok2vec, lower, upper, resize_output, dropout=0.2, unseen_classes=set()
+):
     """Set up a stepwise transition-based model"""
     if upper is None:
         has_upper = False

@@ -44,4 +46,3 @@ def init(model, X=None, Y=None):
     if model.attrs["has_upper"]:
         statevecs = model.ops.alloc2f(2, lower.get_dim("nO"))
         model.get_ref("upper").initialize(X=statevecs)
-
@@ -133,8 +133,9 @@ cdef class Morphology:
         """
         cdef MorphAnalysisC tag
         tag.length = len(field_feature_pairs)
-        tag.fields = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
-        tag.features = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
+        if tag.length > 0:
+            tag.fields = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
+            tag.features = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
         for i, (field, feature) in enumerate(field_feature_pairs):
             tag.fields[i] = field
             tag.features[i] = feature
@@ -11,6 +11,7 @@ from .senter import SentenceRecognizer
 from .sentencizer import Sentencizer
 from .tagger import Tagger
 from .textcat import TextCategorizer
+from .textcat_multilabel import MultiLabel_TextCategorizer
 from .tok2vec import Tok2Vec
 from .functions import merge_entities, merge_noun_chunks, merge_subtokens

@@ -22,13 +23,14 @@ __all__ = [
     "EntityRuler",
     "Morphologizer",
     "Lemmatizer",
-    "TrainablePipe",
+    "MultiLabel_TextCategorizer",
     "Pipe",
     "SentenceRecognizer",
     "Sentencizer",
     "Tagger",
     "TextCategorizer",
     "Tok2Vec",
+    "TrainablePipe",
     "merge_entities",
     "merge_noun_chunks",
     "merge_subtokens",
@@ -255,7 +255,7 @@ def get_gradient(nr_class, beam_maps, histories, losses):
    for a beam state -- so we have "the gradient of loss for taking
    action i given history H."

-    Histories: Each hitory is a list of actions
+    Histories: Each history is a list of actions
    Each candidate has a history
    Each beam has multiple candidates
    Each batch has multiple beams
@@ -4,4 +4,4 @@ from .transition_system cimport Transition, TransitionSystem


 cdef class ArcEager(TransitionSystem):
-    pass
+    cdef get_arcs(self, StateC* state)
@@ -1,6 +1,7 @@
 # cython: profile=True, cdivision=True, infer_types=True
 from cymem.cymem cimport Pool, Address
 from libc.stdint cimport int32_t
+from libcpp.vector cimport vector

 from collections import defaultdict, Counter

@@ -10,9 +11,9 @@ from ...structs cimport TokenC
 from ...tokens.doc cimport Doc, set_children_from_heads
 from ...training.example cimport Example
 from .stateclass cimport StateClass
-from ._state cimport StateC
+from ._state cimport StateC, ArcC

 from ...errors import Errors
+from thinc.extra.search cimport Beam

 cdef weight_t MIN_SCORE = -90000
 cdef attr_t SUBTOK_LABEL = hash_string(u'subtok')

@@ -65,6 +66,7 @@ cdef GoldParseStateC create_gold_state(Pool mem, const StateC* state,
     cdef GoldParseStateC gs
     gs.length = len(heads)
     gs.stride = 1
+    assert gs.length > 0
     gs.labels = <attr_t*>mem.alloc(gs.length, sizeof(gs.labels[0]))
     gs.heads = <int32_t*>mem.alloc(gs.length, sizeof(gs.heads[0]))
     gs.n_kids = <int32_t*>mem.alloc(gs.length, sizeof(gs.n_kids[0]))

@@ -126,6 +128,7 @@ cdef GoldParseStateC create_gold_state(Pool mem, const StateC* state,
             1
         )
     # Make an array of pointers, pointing into the gs_kids_flat array.
+    assert gs.length > 0
     gs.kids = <int32_t**>mem.alloc(gs.length, sizeof(int32_t*))
     for i in range(gs.length):
         if gs.n_kids[i] != 0:

@@ -609,7 +612,7 @@ cdef class ArcEager(TransitionSystem):
         return gold

     def init_gold_batch(self, examples):
-        # TODO: Projectivitity?
+        # TODO: Projectivity?
         all_states = self.init_batch([eg.predicted for eg in examples])
         golds = []
         states = []

@@ -705,6 +708,28 @@ cdef class ArcEager(TransitionSystem):
                 doc.c[i].dep = self.root_label
         set_children_from_heads(doc.c, 0, doc.length)

+    def get_beam_parses(self, Beam beam):
+        parses = []
+        probs = beam.probs
+        for i in range(beam.size):
+            state = <StateC*>beam.at(i)
+            if state.is_final():
+                prob = probs[i]
+                parse = []
+                arcs = self.get_arcs(state)
+                if arcs:
+                    for arc in arcs:
+                        dep = arc["label"]
+                        label = self.strings[dep]
+                        parse.append((arc["head"], arc["child"], label))
+                parses.append((prob, parse))
+        return parses
+
+    cdef get_arcs(self, StateC* state):
+        cdef vector[ArcC] arcs
+        state.get_arcs(&arcs)
+        return list(arcs)
+
     def has_gold(self, Example eg, start=0, end=None):
         for word in eg.y[start:end]:
             if word.dep != 0:
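The get_beam_parses method added above turns every finished state in a beam into a (probability, parse) pair, where a parse is a list of (head, child, dep_label) triples. A minimal stand-alone sketch of that output shape and how it can be consumed; the beam output below is invented for illustration and does not come from a real Beam object:

    # Invented output with the same shape as ArcEager.get_beam_parses():
    # a list of (probability, parse) pairs; each parse lists
    # (head, child, dep_label) triples.
    beam_parses = [
        (0.75, [(1, 0, "nsubj"), (1, 2, "dobj")]),
        (0.25, [(2, 0, "nsubj"), (1, 2, "dobj")]),
    ]
    for prob, parse in beam_parses:
        for head, child, label in parse:
            print(f"p={prob:.2f}: token {child} <-{label}- head {head}")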
@@ -2,6 +2,7 @@ from libc.stdint cimport int32_t
 from cymem.cymem cimport Pool

 from collections import Counter
+from thinc.extra.search cimport Beam

 from ...tokens.doc cimport Doc
 from ...tokens.span import Span

@@ -63,6 +64,7 @@ cdef GoldNERStateC create_gold_state(
     Example example
 ) except *:
     cdef GoldNERStateC gs
+    assert example.x.length > 0
     gs.ner = <Transition*>mem.alloc(example.x.length, sizeof(Transition))
     ner_tags = example.get_aligned_ner()
     for i, ner_tag in enumerate(ner_tags):

@@ -245,6 +247,21 @@ cdef class BiluoPushDown(TransitionSystem):
             if doc.c[i].ent_iob == 0:
                 doc.c[i].ent_iob = 2

+    def get_beam_parses(self, Beam beam):
+        parses = []
+        probs = beam.probs
+        for i in range(beam.size):
+            state = <StateC*>beam.at(i)
+            if state.is_final():
+                prob = probs[i]
+                parse = []
+                for j in range(state._ents.size()):
+                    ent = state._ents.at(j)
+                    if ent.start != -1 and ent.end != -1:
+                        parse.append((ent.start, ent.end, self.strings[ent.label]))
+                parses.append((prob, parse))
+        return parses
+
     def init_gold(self, StateClass state, Example example):
         return BiluoGold(self, state, example)
@@ -226,6 +226,7 @@ class AttributeRuler(Pipe):

        DOCS: https://nightly.spacy.io/api/tagger#score
        """
+
        def morph_key_getter(token, attr):
            return getattr(token, attr).key

@@ -240,8 +241,16 @@ class AttributeRuler(Pipe):
            elif attr == POS:
                results.update(Scorer.score_token_attr(examples, "pos", **kwargs))
            elif attr == MORPH:
-                results.update(Scorer.score_token_attr(examples, "morph", getter=morph_key_getter, **kwargs))
-                results.update(Scorer.score_token_attr_per_feat(examples, "morph", getter=morph_key_getter, **kwargs))
+                results.update(
+                    Scorer.score_token_attr(
+                        examples, "morph", getter=morph_key_getter, **kwargs
+                    )
+                )
+                results.update(
+                    Scorer.score_token_attr_per_feat(
+                        examples, "morph", getter=morph_key_getter, **kwargs
+                    )
+                )
            elif attr == LEMMA:
                results.update(Scorer.score_token_attr(examples, "lemma", **kwargs))
        return results
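The morph scoring above relies on a getter so that Scorer.score_token_attr compares whole morph bundles by their hash key, while score_token_attr_per_feat breaks the bundle into individual features. A hedged sketch of the getter pattern with stand-in token objects (namedtuples here, not real spaCy tokens):

    from collections import namedtuple

    # Stand-ins for spaCy tokens: only the attribute the getter reads is modelled.
    Morph = namedtuple("Morph", ["key"])
    Token = namedtuple("Token", ["morph"])

    def morph_key_getter(token, attr):
        # Same idea as above: reduce the morph analysis to its hash key.
        return getattr(token, attr).key

    pred = Token(morph=Morph(key=12345))
    gold = Token(morph=Morph(key=12345))
    print(morph_key_getter(pred, "morph") == morph_key_getter(gold, "morph"))  # True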
@@ -1,4 +1,5 @@
 # cython: infer_types=True, profile=True, binding=True
+from collections import defaultdict
 from typing import Optional, Iterable
 from thinc.api import Model, Config

@@ -258,3 +259,20 @@ cdef class DependencyParser(Parser):
         results.update(Scorer.score_deps(examples, "dep", **kwargs))
         del results["sents_per_type"]
         return results
+
+    def scored_parses(self, beams):
+        """Return two dictionaries with scores for each beam/doc that was processed:
+        one containing (i, head) keys, and another containing (i, label) keys.
+        """
+        head_scores = []
+        label_scores = []
+        for beam in beams:
+            score_head_dict = defaultdict(float)
+            score_label_dict = defaultdict(float)
+            for score, parses in self.moves.get_beam_parses(beam):
+                for head, i, label in parses:
+                    score_head_dict[(i, head)] += score
+                    score_label_dict[(i, label)] += score
+            head_scores.append(score_head_dict)
+            label_scores.append(score_label_dict)
+        return head_scores, label_scores
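scored_parses sums the beam probability of every candidate analysis into per-token head and label scores. The same aggregation can be sketched with plain Python and invented beam output (no thinc Beam objects involved):

    from collections import defaultdict

    # Invented get_beam_parses() output for one doc: (score, parses) pairs,
    # each parse a list of (head, token_index, label) triples.
    beam_output = [
        (0.75, [(1, 0, "nsubj"), (1, 2, "dobj")]),
        (0.25, [(2, 0, "nsubj"), (1, 2, "dobj")]),
    ]
    score_head_dict = defaultdict(float)
    score_label_dict = defaultdict(float)
    for score, parses in beam_output:
        for head, i, label in parses:
            score_head_dict[(i, head)] += score
            score_label_dict[(i, label)] += score
    print(dict(score_head_dict))   # {(0, 1): 0.75, (2, 1): 1.0, (0, 2): 0.25}
    print(dict(score_label_dict))  # {(0, 'nsubj'): 1.0, (2, 'dobj'): 1.0}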
@@ -24,7 +24,7 @@ default_model_config = """
 @architectures = "spacy.Tagger.v1"

 [model.tok2vec]
-@architectures = "spacy.Tok2Vec.v1"
+@architectures = "spacy.Tok2Vec.v2"

 [model.tok2vec.embed]
 @architectures = "spacy.CharacterEmbed.v1"

@@ -35,7 +35,7 @@ nC = 8
 include_static_vectors = false

 [model.tok2vec.encode]
-@architectures = "spacy.MaxoutWindowEncoder.v1"
+@architectures = "spacy.MaxoutWindowEncoder.v2"
 width = 128
 depth = 4
 window_size = 1
@@ -1,4 +1,5 @@
 # cython: infer_types=True, profile=True, binding=True
+from collections import defaultdict
 from typing import Optional, Iterable
 from thinc.api import Model, Config

@@ -197,3 +198,16 @@ cdef class EntityRecognizer(Parser):
         """
         validate_examples(examples, "EntityRecognizer.score")
         return get_ner_prf(examples)
+
+    def scored_ents(self, beams):
+        """Return a dictionary of (start, end, label) tuples with corresponding scores
+        for each beam/doc that was processed.
+        """
+        entity_scores = []
+        for beam in beams:
+            score_dict = defaultdict(float)
+            for score, ents in self.moves.get_beam_parses(beam):
+                for start, end, label in ents:
+                    score_dict[(start, end, label)] += score
+            entity_scores.append(score_dict)
+        return entity_scores
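scored_ents aggregates in the same way, keyed by (start, end, label) spans. A hypothetical follow-up step, with invented numbers, showing how the aggregated scores might be filtered with a cutoff afterwards:

    # One dict per doc, as returned by scored_ents(); values are summed beam
    # probabilities. Both the spans and the cutoff are invented for illustration.
    entity_scores = [{(0, 2, "ORG"): 0.75, (0, 1, "ORG"): 0.25, (5, 6, "GPE"): 1.0}]
    cutoff = 0.5
    for doc_scores in entity_scores:
        kept = [span for span, score in doc_scores.items() if score >= cutoff]
        print(kept)  # [(0, 2, 'ORG'), (5, 6, 'GPE')]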
@@ -256,8 +256,14 @@ class Tagger(TrainablePipe):
         DOCS: https://nightly.spacy.io/api/tagger#get_loss
         """
         validate_examples(examples, "Tagger.get_loss")
-        loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False, missing_value="")
-        truths = [eg.get_aligned("TAG", as_string=True) for eg in examples]
+        loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False)
+        # Convert empty tag "" to missing value None so that both misaligned
+        # tokens and tokens with missing annotation have the default missing
+        # value None.
+        truths = []
+        for eg in examples:
+            eg_truths = [tag if tag is not "" else None for tag in eg.get_aligned("TAG", as_string=True)]
+            truths.append(eg_truths)
         d_scores, loss = loss_func(scores, truths)
         if self.model.ops.xp.isnan(loss):
             raise ValueError(Errors.E910.format(name=self.name))
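The new loss code maps the empty string returned by Example.get_aligned for misaligned or unannotated tokens to None, so that SequenceCategoricalCrossentropy treats both cases as its default missing value. The conversion in isolation, on invented aligned tags (written with != rather than the identity check used in the diff):

    aligned_tags = ["NN", "", "VBZ", ""]  # "" = misaligned or unannotated token
    eg_truths = [tag if tag != "" else None for tag in aligned_tags]
    print(eg_truths)  # ['NN', None, 'VBZ', None]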
@@ -14,12 +14,12 @@ from ..tokens import Doc
 from ..vocab import Vocab


-default_model_config = """
+single_label_default_config = """
 [model]
 @architectures = "spacy.TextCatEnsemble.v2"

 [model.tok2vec]
-@architectures = "spacy.Tok2Vec.v1"
+@architectures = "spacy.Tok2Vec.v2"

 [model.tok2vec.embed]
 @architectures = "spacy.MultiHashEmbed.v1"

@@ -29,7 +29,7 @@ attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
 include_static_vectors = false

 [model.tok2vec.encode]
-@architectures = "spacy.MaxoutWindowEncoder.v1"
+@architectures = "spacy.MaxoutWindowEncoder.v2"
 width = ${model.tok2vec.embed.width}
 window_size = 1
 maxout_pieces = 3

@@ -37,24 +37,24 @@ depth = 2

 [model.linear_model]
 @architectures = "spacy.TextCatBOW.v1"
-exclusive_classes = false
+exclusive_classes = true
 ngram_size = 1
 no_output_layer = false
 """
-DEFAULT_TEXTCAT_MODEL = Config().from_str(default_model_config)["model"]
+DEFAULT_SINGLE_TEXTCAT_MODEL = Config().from_str(single_label_default_config)["model"]

-bow_model_config = """
+single_label_bow_config = """
 [model]
 @architectures = "spacy.TextCatBOW.v1"
-exclusive_classes = false
+exclusive_classes = true
 ngram_size = 1
 no_output_layer = false
 """

-cnn_model_config = """
+single_label_cnn_config = """
 [model]
 @architectures = "spacy.TextCatCNN.v1"
-exclusive_classes = false
+exclusive_classes = true

 [model.tok2vec]
 @architectures = "spacy.HashEmbedCNN.v1"

@@ -71,7 +71,7 @@ subword_features = true
 @Language.factory(
     "textcat",
     assigns=["doc.cats"],
-    default_config={"threshold": 0.5, "model": DEFAULT_TEXTCAT_MODEL},
+    default_config={"threshold": 0.5, "model": DEFAULT_SINGLE_TEXTCAT_MODEL},
     default_score_weights={
         "cats_score": 1.0,
         "cats_score_desc": None,

@@ -103,7 +103,7 @@ def make_textcat(


 class TextCategorizer(TrainablePipe):
-    """Pipeline component for text classification.
+    """Pipeline component for single-label text classification.

     DOCS: https://nightly.spacy.io/api/textcategorizer
     """

@@ -111,7 +111,7 @@ class TextCategorizer(TrainablePipe):
     def __init__(
         self, vocab: Vocab, model: Model, name: str = "textcat", *, threshold: float
     ) -> None:
-        """Initialize a text categorizer.
+        """Initialize a text categorizer for single-label classification.

         vocab (Vocab): The shared vocabulary.
         model (thinc.api.Model): The Thinc Model powering the pipeline component.

@@ -214,6 +214,7 @@ class TextCategorizer(TrainablePipe):
         losses = {}
         losses.setdefault(self.name, 0.0)
         validate_examples(examples, "TextCategorizer.update")
+        self._validate_categories(examples)
         if not any(len(eg.predicted) if eg.predicted else 0 for eg in examples):
             # Handle cases where there are no tokens in any docs.
             return losses

@@ -256,6 +257,7 @@ class TextCategorizer(TrainablePipe):
         if self._rehearsal_model is None:
             return losses
         validate_examples(examples, "TextCategorizer.rehearse")
+        self._validate_categories(examples)
         docs = [eg.predicted for eg in examples]
         if not any(len(doc) for doc in docs):
             # Handle cases where there are no tokens in any docs.

@@ -296,6 +298,7 @@ class TextCategorizer(TrainablePipe):
         DOCS: https://nightly.spacy.io/api/textcategorizer#get_loss
         """
         validate_examples(examples, "TextCategorizer.get_loss")
+        self._validate_categories(examples)
         truths, not_missing = self._examples_to_truth(examples)
         not_missing = self.model.ops.asarray(not_missing)
         d_scores = (scores - truths) / scores.shape[0]

@@ -341,6 +344,7 @@ class TextCategorizer(TrainablePipe):
         DOCS: https://nightly.spacy.io/api/textcategorizer#initialize
         """
         validate_get_examples(get_examples, "TextCategorizer.initialize")
+        self._validate_categories(get_examples())
         if labels is None:
             for example in get_examples():
                 for cat in example.y.cats:

@@ -373,12 +377,20 @@ class TextCategorizer(TrainablePipe):
         DOCS: https://nightly.spacy.io/api/textcategorizer#score
         """
         validate_examples(examples, "TextCategorizer.score")
+        self._validate_categories(examples)
         return Scorer.score_cats(
             examples,
             "cats",
             labels=self.labels,
-            multi_label=self.model.attrs["multi_label"],
+            multi_label=False,
             positive_label=self.cfg["positive_label"],
             threshold=self.cfg["threshold"],
             **kwargs,
         )
+
+    def _validate_categories(self, examples: List[Example]):
+        """Check whether the provided examples all have single-label cats annotations."""
+        for ex in examples:
+            if list(ex.reference.cats.values()).count(1.0) > 1:
+                raise ValueError(Errors.E895.format(value=ex.reference.cats))
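_validate_categories makes the single-label component reject reference docs in which more than one category is marked 1.0. A small sketch of the same check outside the pipeline, with invented cats dicts:

    def has_single_gold_label(cats):
        # Mirrors the check above: at most one category may be marked 1.0.
        return list(cats.values()).count(1.0) <= 1

    print(has_single_gold_label({"POSITIVE": 1.0, "NEGATIVE": 0.0}))  # True
    print(has_single_gold_label({"SPORTS": 1.0, "POLITICS": 1.0}))    # False -> E895 in the pipe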
191  spacy/pipeline/textcat_multilabel.py  Normal file

@@ -0,0 +1,191 @@
+from itertools import islice
+from typing import Iterable, Optional, Dict, List, Callable, Any
+
+from thinc.api import Model, Config
+from thinc.types import Floats2d
+
+from ..language import Language
+from ..training import Example, validate_examples, validate_get_examples
+from ..errors import Errors
+from ..scorer import Scorer
+from ..tokens import Doc
+from ..vocab import Vocab
+from .textcat import TextCategorizer
+
+
+multi_label_default_config = """
+[model]
+@architectures = "spacy.TextCatEnsemble.v2"
+
+[model.tok2vec]
+@architectures = "spacy.Tok2Vec.v1"
+
+[model.tok2vec.embed]
+@architectures = "spacy.MultiHashEmbed.v1"
+width = 64
+rows = [2000, 2000, 1000, 1000, 1000, 1000]
+attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
+include_static_vectors = false
+
+[model.tok2vec.encode]
+@architectures = "spacy.MaxoutWindowEncoder.v1"
+width = ${model.tok2vec.embed.width}
+window_size = 1
+maxout_pieces = 3
+depth = 2
+
+[model.linear_model]
+@architectures = "spacy.TextCatBOW.v1"
+exclusive_classes = false
+ngram_size = 1
+no_output_layer = false
+"""
+DEFAULT_MULTI_TEXTCAT_MODEL = Config().from_str(multi_label_default_config)["model"]
+
+multi_label_bow_config = """
+[model]
+@architectures = "spacy.TextCatBOW.v1"
+exclusive_classes = false
+ngram_size = 1
+no_output_layer = false
+"""
+
+multi_label_cnn_config = """
+[model]
+@architectures = "spacy.TextCatCNN.v1"
+exclusive_classes = false
+
+[model.tok2vec]
+@architectures = "spacy.HashEmbedCNN.v1"
+pretrained_vectors = null
+width = 96
+depth = 4
+embed_size = 2000
+window_size = 1
+maxout_pieces = 3
+subword_features = true
+"""
+
+
+@Language.factory(
+    "textcat_multilabel",
+    assigns=["doc.cats"],
+    default_config={"threshold": 0.5, "model": DEFAULT_MULTI_TEXTCAT_MODEL},
+    default_score_weights={
+        "cats_score": 1.0,
+        "cats_score_desc": None,
+        "cats_micro_p": None,
+        "cats_micro_r": None,
+        "cats_micro_f": None,
+        "cats_macro_p": None,
+        "cats_macro_r": None,
+        "cats_macro_f": None,
+        "cats_macro_auc": None,
+        "cats_f_per_type": None,
+        "cats_macro_auc_per_type": None,
+    },
+)
+def make_multilabel_textcat(
+    nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float
+) -> "TextCategorizer":
+    """Create a TextCategorizer compoment. The text categorizer predicts categories
+    over a whole document. It can learn one or more labels, and the labels can
+    be mutually exclusive (i.e. one true label per doc) or non-mutually exclusive
+    (i.e. zero or more labels may be true per doc). The multi-label setting is
+    controlled by the model instance that's provided.
+
+    model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
+        scores for each category.
+    threshold (float): Cutoff to consider a prediction "positive".
+    """
+    return MultiLabel_TextCategorizer(nlp.vocab, model, name, threshold=threshold)
+
+
+class MultiLabel_TextCategorizer(TextCategorizer):
+    """Pipeline component for multi-label text classification.
+
+    DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer
+    """
+
+    def __init__(
+        self,
+        vocab: Vocab,
+        model: Model,
+        name: str = "textcat_multilabel",
+        *,
+        threshold: float,
+    ) -> None:
+        """Initialize a text categorizer for multi-label classification.
+
+        vocab (Vocab): The shared vocabulary.
+        model (thinc.api.Model): The Thinc Model powering the pipeline component.
+        name (str): The component instance name, used to add entries to the
+            losses during training.
+        threshold (float): Cutoff to consider a prediction "positive".
+
+        DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer#init
+        """
+        self.vocab = vocab
+        self.model = model
+        self.name = name
+        self._rehearsal_model = None
+        cfg = {"labels": [], "threshold": threshold}
+        self.cfg = dict(cfg)
+
+    def initialize(
+        self,
+        get_examples: Callable[[], Iterable[Example]],
+        *,
+        nlp: Optional[Language] = None,
+        labels: Optional[Dict] = None,
+    ):
+        """Initialize the pipe for training, using a representative set
+        of data examples.
+
+        get_examples (Callable[[], Iterable[Example]]): Function that
+            returns a representative sample of gold-standard Example objects.
+        nlp (Language): The current nlp object the component is part of.
+        labels: The labels to add to the component, typically generated by the
+            `init labels` command. If no labels are provided, the get_examples
+            callback is used to extract the labels from the data.
+
+        DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer#initialize
+        """
+        validate_get_examples(get_examples, "MultiLabel_TextCategorizer.initialize")
+        if labels is None:
+            for example in get_examples():
+                for cat in example.y.cats:
+                    self.add_label(cat)
+        else:
+            for label in labels:
+                self.add_label(label)
+        subbatch = list(islice(get_examples(), 10))
+        doc_sample = [eg.reference for eg in subbatch]
+        label_sample, _ = self._examples_to_truth(subbatch)
+        self._require_labels()
+        assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
+        assert len(label_sample) > 0, Errors.E923.format(name=self.name)
+        self.model.initialize(X=doc_sample, Y=label_sample)
+
+    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
+        """Score a batch of examples.
+
+        examples (Iterable[Example]): The examples to score.
+        RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.
+
+        DOCS: https://nightly.spacy.io/api/multilabel_textcategorizer#score
+        """
+        validate_examples(examples, "MultiLabel_TextCategorizer.score")
+        return Scorer.score_cats(
+            examples,
+            "cats",
+            labels=self.labels,
+            multi_label=True,
+            threshold=self.cfg["threshold"],
+            **kwargs,
+        )
+
+    def _validate_categories(self, examples: List[Example]):
+        """This component allows any type of single- or multi-label annotations.
+        This method overwrites the more strict one from 'textcat'."""
+        pass
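A hedged usage sketch for the new factory: add "textcat_multilabel" to a blank pipeline and register labels. The labels and the threshold override are invented; representative training examples would still be needed before initialize() and training.

    import spacy

    nlp = spacy.blank("en")
    textcat = nlp.add_pipe("textcat_multilabel", config={"threshold": 0.5})
    textcat.add_label("SPORTS")
    textcat.add_label("POLITICS")
    print(nlp.pipe_names)  # ['textcat_multilabel']
    print(textcat.labels)  # ('SPORTS', 'POLITICS')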
@@ -3,7 +3,7 @@ import numpy as np
 from collections import defaultdict

 from .training import Example
-from .tokens import Token, Doc, Span, MorphAnalysis
+from .tokens import Token, Doc, Span
 from .errors import Errors
 from .util import get_lang_class, SimpleFrozenList
 from .morphology import Morphology

@@ -176,7 +176,7 @@ class Scorer:
            "token_acc": None,
            "token_p": None,
            "token_r": None,
-            "token_f": None
+            "token_f": None,
        }

    @staticmethod

@@ -276,7 +276,10 @@ class Scorer:
            if gold_i not in missing_indices:
                value = getter(token, attr)
                morph = gold_doc.vocab.strings[value]
-                if value not in missing_values and morph != Morphology.EMPTY_MORPH:
+                if (
+                    value not in missing_values
+                    and morph != Morphology.EMPTY_MORPH
+                ):
                    for feat in morph.split(Morphology.FEATURE_SEP):
                        field, values = feat.split(Morphology.FIELD_SEP)
                        if field not in per_feat:

@@ -367,7 +370,6 @@ class Scorer:
            f"{attr}_per_type": None,
        }

-
    @staticmethod
    def score_cats(
        examples: Iterable[Example],

@@ -458,7 +460,7 @@ class Scorer:
                gold_label, gold_score = max(gold_cats, key=lambda it: it[1])
                if gold_score is not None and gold_score > 0:
                    f_per_type[gold_label].fn += 1
-                else:
+                elif pred_cats:
                    pred_label, pred_score = max(pred_cats, key=lambda it: it[1])
                    if pred_score >= threshold:
                        f_per_type[pred_label].fp += 1

@@ -473,7 +475,10 @@ class Scorer:
        macro_f = sum(prf.fscore for prf in f_per_type.values()) / n_cats
        # Limit macro_auc to those labels with gold annotations,
        # but still divide by all cats to avoid artificial boosting of datasets with missing labels
-        macro_auc = sum(auc.score if auc.is_binary() else 0.0 for auc in auc_per_type.values()) / n_cats
+        macro_auc = (
+            sum(auc.score if auc.is_binary() else 0.0 for auc in auc_per_type.values())
+            / n_cats
+        )
        results = {
            f"{attr}_score": None,
            f"{attr}_score_desc": None,

@@ -485,7 +490,9 @@ class Scorer:
            f"{attr}_macro_f": macro_f,
            f"{attr}_macro_auc": macro_auc,
            f"{attr}_f_per_type": {k: v.to_dict() for k, v in f_per_type.items()},
-            f"{attr}_auc_per_type": {k: v.score if v.is_binary() else None for k, v in auc_per_type.items()},
+            f"{attr}_auc_per_type": {
+                k: v.score if v.is_binary() else None for k, v in auc_per_type.items()
+            },
        }
        if len(labels) == 2 and not multi_label and positive_label:
            positive_label_f = results[f"{attr}_f_per_type"][positive_label]["f"]

@@ -675,8 +682,7 @@ class Scorer:


 def get_ner_prf(examples: Iterable[Example]) -> Dict[str, Any]:
-    """Compute micro-PRF and per-entity PRF scores for a sequence of examples.
-    """
+    """Compute micro-PRF and per-entity PRF scores for a sequence of examples."""
     score_per_type = defaultdict(PRFScore)
     for eg in examples:
         if not eg.y.has_annotation("ENT_IOB"):
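The switch from a bare else to elif pred_cats presumably guards the single-label branch against documents with no predicted categories, where max() over an empty sequence would raise. A toy walk-through with invented gold and predicted cats:

    gold_cats = [("SPORTS", 0.0)]  # no positive gold label for this doc
    pred_cats = []                 # and the model produced no scores either
    threshold = 0.5
    false_positives = 0

    gold_label, gold_score = max(gold_cats, key=lambda it: it[1])
    if gold_score is not None and gold_score > 0:
        pass  # would count a false negative, as in the code above
    elif pred_cats:  # old code: bare "else" -> max() on an empty list raises
        pred_label, pred_score = max(pred_cats, key=lambda it: it[1])
        if pred_score >= threshold:
            false_positives += 1

    print(false_positives)  # 0, and no ValueError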
@@ -154,10 +154,10 @@ def test_doc_api_serialize(en_tokenizer, text):

     logger = logging.getLogger("spacy")
     with mock.patch.object(logger, "warning") as mock_warning:
-        _ = tokens.to_bytes()
+        _ = tokens.to_bytes()  # noqa: F841
         mock_warning.assert_not_called()
         tokens.user_hooks["similarity"] = inner_func
-        _ = tokens.to_bytes()
+        _ = tokens.to_bytes()  # noqa: F841
         mock_warning.assert_called_once()

@@ -21,11 +21,13 @@ def test_doc_retokenize_merge(en_tokenizer):
     assert doc[4].text == "the beach boys"
     assert doc[4].text_with_ws == "the beach boys "
     assert doc[4].tag_ == "NAMED"
+    assert doc[4].lemma_ == "LEMMA"
     assert str(doc[4].morph) == "Number=Plur"
     assert doc[5].text == "all night"
     assert doc[5].text_with_ws == "all night"
     assert doc[5].tag_ == "NAMED"
     assert str(doc[5].morph) == "Number=Plur"
+    assert doc[5].lemma_ == "LEMMA"


 def test_doc_retokenize_merge_children(en_tokenizer):

@@ -103,25 +105,29 @@ def test_doc_retokenize_spans_merge_tokens(en_tokenizer):


 def test_doc_retokenize_spans_merge_tokens_default_attrs(en_vocab):
     words = ["The", "players", "start", "."]
+    lemmas = [t.lower() for t in words]
     heads = [1, 2, 2, 2]
     tags = ["DT", "NN", "VBZ", "."]
     pos = ["DET", "NOUN", "VERB", "PUNCT"]
-    doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads)
+    doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads, lemmas=lemmas)
     assert len(doc) == 4
     assert doc[0].text == "The"
     assert doc[0].tag_ == "DT"
     assert doc[0].pos_ == "DET"
+    assert doc[0].lemma_ == "the"
     with doc.retokenize() as retokenizer:
         retokenizer.merge(doc[0:2])
     assert len(doc) == 3
     assert doc[0].text == "The players"
     assert doc[0].tag_ == "NN"
     assert doc[0].pos_ == "NOUN"
-    doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads)
+    assert doc[0].lemma_ == "the players"
+    doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads, lemmas=lemmas)
     assert len(doc) == 4
     assert doc[0].text == "The"
     assert doc[0].tag_ == "DT"
     assert doc[0].pos_ == "DET"
+    assert doc[0].lemma_ == "the"
     with doc.retokenize() as retokenizer:
         retokenizer.merge(doc[0:2])
         retokenizer.merge(doc[2:4])

@@ -129,9 +135,11 @@ def test_doc_retokenize_spans_merge_tokens_default_attrs(en_vocab):
     assert doc[0].text == "The players"
     assert doc[0].tag_ == "NN"
     assert doc[0].pos_ == "NOUN"
+    assert doc[0].lemma_ == "the players"
     assert doc[1].text == "start ."
     assert doc[1].tag_ == "VBZ"
     assert doc[1].pos_ == "VERB"
+    assert doc[1].lemma_ == "start ."


 def test_doc_retokenize_spans_merge_heads(en_vocab):
@@ -39,6 +39,36 @@ def test_doc_retokenize_split(en_vocab):
     assert len(str(doc)) == 19


+def test_doc_retokenize_split_lemmas(en_vocab):
+    # If lemmas are not set, leave unset
+    words = ["LosAngeles", "start", "."]
+    heads = [1, 2, 2]
+    doc = Doc(en_vocab, words=words, heads=heads)
+    with doc.retokenize() as retokenizer:
+        retokenizer.split(
+            doc[0],
+            ["Los", "Angeles"],
+            [(doc[0], 1), doc[1]],
+        )
+    assert doc[0].lemma_ == ""
+    assert doc[1].lemma_ == ""
+
+    # If lemmas are set, use split orth as default lemma
+    words = ["LosAngeles", "start", "."]
+    heads = [1, 2, 2]
+    doc = Doc(en_vocab, words=words, heads=heads)
+    for t in doc:
+        t.lemma_ = "a"
+    with doc.retokenize() as retokenizer:
+        retokenizer.split(
+            doc[0],
+            ["Los", "Angeles"],
+            [(doc[0], 1), doc[1]],
+        )
+    assert doc[0].lemma_ == "Los"
+    assert doc[1].lemma_ == "Angeles"
+
+
 def test_doc_retokenize_split_dependencies(en_vocab):
     doc = Doc(en_vocab, words=["LosAngeles", "start", "."])
     dep1 = doc.vocab.strings.add("amod")
@@ -113,9 +113,8 @@ def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms):
     assert [token.norm_ for token in tokens] == norms


-@pytest.mark.skip
 @pytest.mark.parametrize(
-    "text,norm", [("radicalised", "radicalized"), ("cuz", "because")]
+    "text,norm", [("Jan.", "January"), ("'cuz", "because")]
 )
 def test_en_lex_attrs_norm_exceptions(en_tokenizer, text, norm):
     tokens = en_tokenizer(text)
@@ -4,21 +4,21 @@ from spacy.lang.mk.lex_attrs import like_num


 def test_tokenizer_handles_long_text(mk_tokenizer):
     text = """
 Во организациските работи или на нашите собранија со членството, никој од нас не зборуваше за
 организацијата и идеологијата. Работна беше нашата работа, а не идеолошка. Што се однесува до социјализмот на
 Делчев, неговата дејност зборува сама за себе - спротивно. Во суштина, водачите си имаа свои основни погледи и
 свои разбирања за положбата и работите, коишто стоеја пред нив и ги завршуваа со голема упорност, настојчивост и
 насоченост. Значи, идеологија имаше, само што нивната идеологија имаше своја оригиналност. Македонија денеска,
 чиста рожба на животот и положбата во Македонија, кои му служеа како база на неговите побуди, беше дејност која
 имаше потреба од ум за да си најде своја смисла. Таквата идеологија и заемното дејство на умот и срцето му
 помогнаа на Делчев да не се занесе по патот на својата идеологија... Во суштина, Организацијата и нејзините
 водачи имаа свои разбирања за работите и положбата во идеен поглед, но тоа беше врската, животот и положбата во
 Македонија и го внесуваа во својата идеологија гласот на своето срце, и на крај, прибегнуваа до умот,
 за да најдат смисла или да ѝ дадат. Тоа содејство и заемен сооднос на умот и срцето му помогнаа на Делчев да ја
 држи својата идеологија во сообразност со положбата на работите... Водачите навистина направија една жртва
 бидејќи на населението не му зборуваа за своите мисли и идеи. Тие се одрекоа од секаква субјективност во своите
 мисли. Целта беше да не се зголемуваат целите и задачите како и преданоста во работата. Населението не можеше да
 ги разбере овие идеи...
 """
     tokens = mk_tokenizer(text)
     assert len(tokens) == 297

@@ -45,7 +45,7 @@ def test_tokenizer_handles_long_text(mk_tokenizer):
         (",", False),
         ("милијарда", True),
         ("билион", True),
-    ]
+    ],
 )
 def test_mk_lex_attrs_like_number(mk_tokenizer, word, match):
     tokens = mk_tokenizer(word)

@@ -53,14 +53,7 @@ def test_mk_lex_attrs_like_number(mk_tokenizer, word, match):
     assert tokens[0].like_num == match


-@pytest.mark.parametrize(
-    "word",
-    [
-        "двесте",
-        "два-три",
-        "пет-шест"
-    ]
-)
+@pytest.mark.parametrize("word", ["двесте", "два-три", "пет-шест"])
 def test_mk_lex_attrs_capitals(word):
     assert like_num(word)
     assert like_num(word.upper())

@@ -77,8 +70,8 @@ def test_mk_lex_attrs_capitals(word):
         "петто",
         "стоти",
         "шеесетите",
-        "седумдесетите"
-    ]
+        "седумдесетите",
+    ],
 )
 def test_mk_lex_attrs_like_number_for_ordinal(word):
     assert like_num(word)
@@ -5,24 +5,22 @@ from spacy.lang.tr.lex_attrs import like_num

 def test_tr_tokenizer_handles_long_text(tr_tokenizer):
     text = """Pamuk nasıl ipliğe dönüştürülür?
-
 Sıkıştırılmış balyalar halindeki pamuk, iplik fabrikasına getirildiğinde hem
 lifleri birbirine dolaşmıştır, hem de tarladan toplanırken araya bitkinin
 parçaları karışmıştır. Üstelik balyalardaki pamuğun cinsi aynı olsa bile kalitesi
 değişeceğinden, önce bütün balyaların birbirine karıştırılarak harmanlanması gerekir.
-
 Daha sonra pamuk yığınları, liflerin açılıp temizlenmesi için tek bir birim halinde
 birleştirilmiş çeşitli makinelerden geçirilir.Bunlardan biri, dönen tokmaklarıyla
 pamuğu dövüp kabartarak dağınık yumaklar haline getiren ve liflerin arasındaki yabancı
 maddeleri temizleyen hallaç makinesidir. Daha sonra tarak makinesine giren pamuk demetleri,
 herbirinin yüzeyinde yüzbinlerce incecik iğne bulunan döner silindirlerin arasından geçerek lif lif ayrılır
 ve tül inceliğinde gevşek bir örtüye dönüşür. Ama bir sonraki makine bu lifleri dağınık
 ve gevşek bir biçimde birbirine yaklaştırarak 2 cm eninde bir pamuk şeridi haline getirir."""
     tokens = tr_tokenizer(text)
     assert len(tokens) == 146


 @pytest.mark.parametrize(
     "word",
     [
@ -2,145 +2,692 @@ import pytest
|
||||||
|
|
||||||
|
|
||||||
ABBREV_TESTS = [
|
ABBREV_TESTS = [
|
||||||
("Dr. Murat Bey ile görüştüm.", ["Dr.", "Murat", "Bey", "ile", "görüştüm", "."]),
|
("Dr. Murat Bey ile görüştüm.", ["Dr.", "Murat", "Bey", "ile", "görüştüm", "."]),
|
||||||
("Dr.la görüştüm.", ["Dr.la", "görüştüm", "."]),
|
("Dr.la görüştüm.", ["Dr.la", "görüştüm", "."]),
|
||||||
("Dr.'la görüştüm.", ["Dr.'la", "görüştüm", "."]),
|
("Dr.'la görüştüm.", ["Dr.'la", "görüştüm", "."]),
|
||||||
("TBMM'de çalışıyormuş.", ["TBMM'de", "çalışıyormuş", "."]),
|
("TBMM'de çalışıyormuş.", ["TBMM'de", "çalışıyormuş", "."]),
|
||||||
("Hem İst. hem Ank. bu konuda gayet iyi durumda.", ["Hem", "İst.", "hem", "Ank.", "bu", "konuda", "gayet", "iyi", "durumda", "."]),
|
(
|
||||||
("Hem İst. hem Ank.'da yağış var.", ["Hem", "İst.", "hem", "Ank.'da", "yağış", "var", "."]),
|
"Hem İst. hem Ank. bu konuda gayet iyi durumda.",
|
||||||
("Dr.", ["Dr."]),
|
["Hem", "İst.", "hem", "Ank.", "bu", "konuda", "gayet", "iyi", "durumda", "."],
|
||||||
("Yrd.Doç.", ["Yrd.Doç."]),
|
),
|
||||||
("Prof.'un", ["Prof.'un"]),
|
(
|
||||||
("Böl.'nde", ["Böl.'nde"]),
|
"Hem İst. hem Ank.'da yağış var.",
|
||||||
|
["Hem", "İst.", "hem", "Ank.'da", "yağış", "var", "."],
|
||||||
|
),
|
||||||
|
("Dr.", ["Dr."]),
|
||||||
|
("Yrd.Doç.", ["Yrd.Doç."]),
|
||||||
|
("Prof.'un", ["Prof.'un"]),
|
||||||
|
("Böl.'nde", ["Böl.'nde"]),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
URL_TESTS = [
|
URL_TESTS = [
|
||||||
("Bizler de www.duygu.com.tr adında bir websitesi kurduk.", ["Bizler", "de", "www.duygu.com.tr", "adında", "bir", "websitesi", "kurduk", "."]),
|
(
|
||||||
("Bizler de https://www.duygu.com.tr adında bir websitesi kurduk.", ["Bizler", "de", "https://www.duygu.com.tr", "adında", "bir", "websitesi", "kurduk", "."]),
|
"Bizler de www.duygu.com.tr adında bir websitesi kurduk.",
|
||||||
("Bizler de www.duygu.com.tr'dan satın aldık.", ["Bizler", "de", "www.duygu.com.tr'dan", "satın", "aldık", "."]),
|
[
|
||||||
("Bizler de https://www.duygu.com.tr'dan satın aldık.", ["Bizler", "de", "https://www.duygu.com.tr'dan", "satın", "aldık", "."]),
|
"Bizler",
|
||||||
|
"de",
|
||||||
|
"www.duygu.com.tr",
|
||||||
|
"adında",
|
||||||
|
"bir",
|
||||||
|
"websitesi",
|
||||||
|
"kurduk",
|
||||||
|
".",
|
||||||
|
],
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Bizler de https://www.duygu.com.tr adında bir websitesi kurduk.",
|
||||||
|
[
|
||||||
|
"Bizler",
|
||||||
|
"de",
|
||||||
|
"https://www.duygu.com.tr",
|
||||||
|
"adında",
|
||||||
|
"bir",
|
||||||
|
"websitesi",
|
||||||
|
"kurduk",
|
||||||
|
".",
|
||||||
|
],
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Bizler de www.duygu.com.tr'dan satın aldık.",
|
||||||
|
["Bizler", "de", "www.duygu.com.tr'dan", "satın", "aldık", "."],
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Bizler de https://www.duygu.com.tr'dan satın aldık.",
|
||||||
|
["Bizler", "de", "https://www.duygu.com.tr'dan", "satın", "aldık", "."],
|
||||||
|
),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
NUMBER_TESTS = [
|
NUMBER_TESTS = [
|
||||||
("Rakamla 6 yazılıydı.", ["Rakamla", "6", "yazılıydı", "."]),
|
("Rakamla 6 yazılıydı.", ["Rakamla", "6", "yazılıydı", "."]),
|
||||||
("Hava -4 dereceydi.", ["Hava", "-4", "dereceydi", "."]),
|
("Hava -4 dereceydi.", ["Hava", "-4", "dereceydi", "."]),
|
||||||
("Hava sıcaklığı -4ten +6ya yükseldi.", ["Hava", "sıcaklığı", "-4ten", "+6ya", "yükseldi", "."]),
|
(
|
||||||
("Hava sıcaklığı -4'ten +6'ya yükseldi.", ["Hava", "sıcaklığı", "-4'ten", "+6'ya", "yükseldi", "."]),
|
"Hava sıcaklığı -4ten +6ya yükseldi.",
|
||||||
("Yarışta 6. oldum.", ["Yarışta", "6.", "oldum", "."]),
|
["Hava", "sıcaklığı", "-4ten", "+6ya", "yükseldi", "."],
|
||||||
("Yarışta 438547745. oldum.", ["Yarışta", "438547745.", "oldum", "."]),
|
),
|
||||||
("Kitap IV. Murat hakkında.",["Kitap", "IV.", "Murat", "hakkında", "."]),
|
(
|
||||||
#("Bana söylediği sayı 6.", ["Bana", "söylediği", "sayı", "6", "."]),
|
"Hava sıcaklığı -4'ten +6'ya yükseldi.",
|
||||||
("Saat 6'da buluşalım.", ["Saat", "6'da", "buluşalım", "."]),
|
["Hava", "sıcaklığı", "-4'ten", "+6'ya", "yükseldi", "."],
|
||||||
("Saat 6dan sonra buluşalım.", ["Saat", "6dan", "sonra", "buluşalım", "."]),
|
),
|
||||||
("6.dan sonra saymadım.", ["6.dan", "sonra", "saymadım", "."]),
|
("Yarışta 6. oldum.", ["Yarışta", "6.", "oldum", "."]),
|
||||||
("6.'dan sonra saymadım.", ["6.'dan", "sonra", "saymadım", "."]),
|
("Yarışta 438547745. oldum.", ["Yarışta", "438547745.", "oldum", "."]),
|
||||||
("Saat 6'ydı.", ["Saat", "6'ydı", "."]),
|
("Kitap IV. Murat hakkında.", ["Kitap", "IV.", "Murat", "hakkında", "."]),
|
||||||
("5'te", ["5'te"]),
|
# ("Bana söylediği sayı 6.", ["Bana", "söylediği", "sayı", "6", "."]),
|
||||||
("6'da", ["6'da"]),
|
("Saat 6'da buluşalım.", ["Saat", "6'da", "buluşalım", "."]),
|
||||||
("9dan", ["9dan"]),
|
("Saat 6dan sonra buluşalım.", ["Saat", "6dan", "sonra", "buluşalım", "."]),
|
||||||
("19'da", ["19'da"]),
|
("6.dan sonra saymadım.", ["6.dan", "sonra", "saymadım", "."]),
|
||||||
("VI'da", ["VI'da"]),
|
("6.'dan sonra saymadım.", ["6.'dan", "sonra", "saymadım", "."]),
|
||||||
("5.", ["5."]),
|
("Saat 6'ydı.", ["Saat", "6'ydı", "."]),
|
||||||
("72.", ["72."]),
|
("5'te", ["5'te"]),
|
||||||
("VI.", ["VI."]),
|
("6'da", ["6'da"]),
|
||||||
("6.'dan", ["6.'dan"]),
|
("9dan", ["9dan"]),
|
||||||
("19.'dan", ["19.'dan"]),
|
("19'da", ["19'da"]),
|
||||||
("6.dan", ["6.dan"]),
|
("VI'da", ["VI'da"]),
|
||||||
("16.dan", ["16.dan"]),
|
("5.", ["5."]),
|
||||||
("VI.'dan", ["VI.'dan"]),
|
("72.", ["72."]),
|
||||||
("VI.dan", ["VI.dan"]),
|
("VI.", ["VI."]),
|
||||||
("Hepsi 1994 yılında oldu.", ["Hepsi", "1994", "yılında", "oldu", "."]),
|
("6.'dan", ["6.'dan"]),
|
||||||
("Hepsi 1994'te oldu.", ["Hepsi", "1994'te", "oldu", "."]),
|
("19.'dan", ["19.'dan"]),
|
||||||
("2/3 tarihli faturayı bulamadım.", ["2/3", "tarihli", "faturayı", "bulamadım", "."]),
|
("6.dan", ["6.dan"]),
|
||||||
("2.3 tarihli faturayı bulamadım.", ["2.3", "tarihli", "faturayı", "bulamadım", "."]),
|
("16.dan", ["16.dan"]),
|
||||||
("2.3. tarihli faturayı bulamadım.", ["2.3.", "tarihli", "faturayı", "bulamadım", "."]),
|
("VI.'dan", ["VI.'dan"]),
|
||||||
("2/3/2020 tarihli faturayı bulamadm.", ["2/3/2020", "tarihli", "faturayı", "bulamadm", "."]),
|
("VI.dan", ["VI.dan"]),
|
||||||
("2/3/1987 tarihinden beri burda yaşıyorum.", ["2/3/1987", "tarihinden", "beri", "burda", "yaşıyorum", "."]),
|
("Hepsi 1994 yılında oldu.", ["Hepsi", "1994", "yılında", "oldu", "."]),
|
||||||
("2-3-1987 tarihinden beri burdayım.", ["2-3-1987", "tarihinden", "beri", "burdayım", "."]),
|
("Hepsi 1994'te oldu.", ["Hepsi", "1994'te", "oldu", "."]),
|
||||||
("2.3.1987 tarihinden beri burdayım.", ["2.3.1987", "tarihinden", "beri", "burdayım", "."]),
|
(
|
||||||
("Bu olay 2005-2006 tarihleri arasında oldu.", ["Bu", "olay", "2005", "-", "2006", "tarihleri", "arasında", "oldu", "."]),
|
"2/3 tarihli faturayı bulamadım.",
|
||||||
("Bu olay 4/12/2005-21/3/2006 tarihleri arasında oldu.", ["Bu", "olay", "4/12/2005", "-", "21/3/2006", "tarihleri", "arasında", "oldu", ".",]),
|
["2/3", "tarihli", "faturayı", "bulamadım", "."],
|
||||||
("Ek fıkra: 5/11/2003-4999/3 maddesine göre uygundur.", ["Ek", "fıkra", ":", "5/11/2003", "-", "4999/3", "maddesine", "göre", "uygundur", "."]),
|
),
|
||||||
("2/A alanları: 6831 sayılı Kanunun 2nci maddesinin birinci fıkrasının (A) bendine göre", ["2/A", "alanları", ":", "6831", "sayılı", "Kanunun", "2nci", "maddesinin", "birinci", "fıkrasının", "(", "A", ")", "bendine", "göre"]),
|
(
|
||||||
("ŞEHİTTEĞMENKALMAZ Cad. No: 2/311", ["ŞEHİTTEĞMENKALMAZ", "Cad.", "No", ":", "2/311"]),
|
"2.3 tarihli faturayı bulamadım.",
|
||||||
("2-3-2025", ["2-3-2025",]),
|
["2.3", "tarihli", "faturayı", "bulamadım", "."],
|
||||||
("2/3/2025", ["2/3/2025"]),
|
),
|
||||||
("Yıllardır 0.5 uç kullanıyorum.", ["Yıllardır", "0.5", "uç", "kullanıyorum", "."]),
|
(
|
||||||
("Kan değerlerim 0.5-0.7 arasıydı.", ["Kan", "değerlerim", "0.5", "-", "0.7", "arasıydı", "."]),
|
"2.3. tarihli faturayı bulamadım.",
|
||||||
("0.5", ["0.5"]),
|
["2.3.", "tarihli", "faturayı", "bulamadım", "."],
|
||||||
("1/2", ["1/2"]),
|
),
|
||||||
("%1", ["%", "1"]),
|
(
|
||||||
("%1lik", ["%", "1lik"]),
|
"2/3/2020 tarihli faturayı bulamadm.",
|
||||||
("%1'lik", ["%", "1'lik"]),
|
["2/3/2020", "tarihli", "faturayı", "bulamadm", "."],
|
||||||
("%1lik dilim", ["%", "1lik", "dilim"]),
|
),
|
||||||
("%1'lik dilim", ["%", "1'lik", "dilim"]),
|
(
|
||||||
("%1.5", ["%", "1.5"]),
|
"2/3/1987 tarihinden beri burda yaşıyorum.",
|
||||||
#("%1-%2 arası büyüme bekleniyor.", ["%", "1", "-", "%", "2", "arası", "büyüme", "bekleniyor", "."]),
|
["2/3/1987", "tarihinden", "beri", "burda", "yaşıyorum", "."],
|
||||||
("%1-2 arası büyüme bekliyoruz.", ["%", "1", "-", "2", "arası", "büyüme", "bekliyoruz", "."]),
|
),
|
||||||
("%11-12 arası büyüme bekliyoruz.", ["%", "11", "-", "12", "arası", "büyüme", "bekliyoruz", "."]),
|
(
|
||||||
("%1.5luk büyüme bekliyoruz.", ["%", "1.5luk", "büyüme", "bekliyoruz", "."]),
|
"2-3-1987 tarihinden beri burdayım.",
|
||||||
("Saat 1-2 arası gelin lütfen.", ["Saat", "1", "-", "2", "arası", "gelin", "lütfen", "."]),
|
["2-3-1987", "tarihinden", "beri", "burdayım", "."],
|
||||||
("Saat 15:30 gibi buluşalım.", ["Saat", "15:30", "gibi", "buluşalım", "."]),
|
),
|
||||||
("Saat 15:30'da buluşalım.", ["Saat", "15:30'da", "buluşalım", "."]),
|
(
|
||||||
("Saat 15.30'da buluşalım.", ["Saat", "15.30'da", "buluşalım", "."]),
|
"2.3.1987 tarihinden beri burdayım.",
|
||||||
("Saat 15.30da buluşalım.", ["Saat", "15.30da", "buluşalım", "."]),
|
["2.3.1987", "tarihinden", "beri", "burdayım", "."],
|
||||||
("Saat 15 civarı buluşalım.", ["Saat", "15", "civarı", "buluşalım", "."]),
|
),
|
||||||
("9’daki otobüse binsek mi?", ["9’daki", "otobüse", "binsek", "mi", "?"]),
|
(
|
||||||
("Okulumuz 3-B şubesi", ["Okulumuz", "3-B", "şubesi"]),
|
"Bu olay 2005-2006 tarihleri arasında oldu.",
|
||||||
("Okulumuz 3/B şubesi", ["Okulumuz", "3/B", "şubesi"]),
|
["Bu", "olay", "2005", "-", "2006", "tarihleri", "arasında", "oldu", "."],
|
||||||
("Okulumuz 3B şubesi", ["Okulumuz", "3B", "şubesi"]),
|
),
|
||||||
("Okulumuz 3b şubesi", ["Okulumuz", "3b", "şubesi"]),
|
(
|
||||||
("Antonio Gaudí 20. yüzyılda, 1904-1914 yılları arasında on yıl süren bir reform süreci getirmiştir.", ["Antonio", "Gaudí", "20.", "yüzyılda", ",", "1904", "-", "1914", "yılları", "arasında", "on", "yıl", "süren", "bir", "reform", "süreci", "getirmiştir", "."]),
|
"Bu olay 4/12/2005-21/3/2006 tarihleri arasında oldu.",
|
||||||
("Dizel yakıtın avro bölgesi ortalaması olan 1,165 avroya kıyasla litre başına 1,335 avroya mal olduğunu gösteriyor.", ["Dizel", "yakıtın", "avro", "bölgesi", "ortalaması", "olan", "1,165", "avroya", "kıyasla", "litre", "başına", "1,335", "avroya", "mal", "olduğunu", "gösteriyor", "."]),
|
[
|
||||||
("Marcus Antonius M.Ö. 1 Ocak 49'da, Sezar'dan Vali'nin kendisini barış dostu ilan ettiği bir bildiri yayınlamıştır.", ["Marcus", "Antonius", "M.Ö.", "1", "Ocak", "49'da", ",", "Sezar'dan", "Vali'nin", "kendisini", "barış", "dostu", "ilan", "ettiği", "bir", "bildiri", "yayınlamıştır", "."])
|
"Bu",
|
||||||
|
"olay",
|
||||||
|
"4/12/2005",
|
||||||
|
"-",
|
||||||
|
"21/3/2006",
|
||||||
|
"tarihleri",
|
||||||
|
"arasında",
|
||||||
|
"oldu",
|
||||||
|
".",
|
||||||
|
],
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Ek fıkra: 5/11/2003-4999/3 maddesine göre uygundur.",
|
||||||
|
[
|
||||||
|
"Ek",
|
||||||
|
"fıkra",
|
||||||
|
":",
|
||||||
|
"5/11/2003",
|
||||||
|
"-",
|
||||||
|
"4999/3",
|
||||||
|
"maddesine",
|
||||||
|
"göre",
|
||||||
|
"uygundur",
|
||||||
|
".",
|
||||||
|
],
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"2/A alanları: 6831 sayılı Kanunun 2nci maddesinin birinci fıkrasının (A) bendine göre",
|
||||||
|
[
|
||||||
|
"2/A",
|
||||||
|
"alanları",
|
||||||
|
":",
|
||||||
|
"6831",
|
||||||
|
"sayılı",
|
||||||
|
"Kanunun",
|
||||||
|
"2nci",
|
||||||
|
"maddesinin",
|
||||||
|
"birinci",
|
||||||
|
"fıkrasının",
|
||||||
|
"(",
|
||||||
|
"A",
|
||||||
|
")",
|
||||||
|
"bendine",
|
||||||
|
"göre",
|
||||||
|
],
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"ŞEHİTTEĞMENKALMAZ Cad. No: 2/311",
|
||||||
|
["ŞEHİTTEĞMENKALMAZ", "Cad.", "No", ":", "2/311"],
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"2-3-2025",
|
||||||
|
[
|
||||||
|
"2-3-2025",
|
||||||
|
],
|
||||||
|
),
|
||||||
|
("2/3/2025", ["2/3/2025"]),
|
||||||
|
("Yıllardır 0.5 uç kullanıyorum.", ["Yıllardır", "0.5", "uç", "kullanıyorum", "."]),
|
||||||
|
(
|
||||||
|
"Kan değerlerim 0.5-0.7 arasıydı.",
|
||||||
|
["Kan", "değerlerim", "0.5", "-", "0.7", "arasıydı", "."],
|
||||||
|
),
|
||||||
|
("0.5", ["0.5"]),
|
||||||
|
("1/2", ["1/2"]),
|
||||||
|
("%1", ["%", "1"]),
|
||||||
|
("%1lik", ["%", "1lik"]),
|
||||||
|
("%1'lik", ["%", "1'lik"]),
|
||||||
|
("%1lik dilim", ["%", "1lik", "dilim"]),
|
||||||
|
("%1'lik dilim", ["%", "1'lik", "dilim"]),
|
||||||
|
("%1.5", ["%", "1.5"]),
|
||||||
|
# ("%1-%2 arası büyüme bekleniyor.", ["%", "1", "-", "%", "2", "arası", "büyüme", "bekleniyor", "."]),
|
||||||
|
(
|
||||||
|
"%1-2 arası büyüme bekliyoruz.",
|
||||||
|
["%", "1", "-", "2", "arası", "büyüme", "bekliyoruz", "."],
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"%11-12 arası büyüme bekliyoruz.",
|
||||||
|
["%", "11", "-", "12", "arası", "büyüme", "bekliyoruz", "."],
|
||||||
|
),
|
||||||
|
("%1.5luk büyüme bekliyoruz.", ["%", "1.5luk", "büyüme", "bekliyoruz", "."]),
|
||||||
|
(
|
||||||
|
"Saat 1-2 arası gelin lütfen.",
|
||||||
|
["Saat", "1", "-", "2", "arası", "gelin", "lütfen", "."],
|
||||||
|
),
|
||||||
|
("Saat 15:30 gibi buluşalım.", ["Saat", "15:30", "gibi", "buluşalım", "."]),
|
||||||
|
("Saat 15:30'da buluşalım.", ["Saat", "15:30'da", "buluşalım", "."]),
|
||||||
|
("Saat 15.30'da buluşalım.", ["Saat", "15.30'da", "buluşalım", "."]),
|
||||||
|
("Saat 15.30da buluşalım.", ["Saat", "15.30da", "buluşalım", "."]),
|
||||||
|
("Saat 15 civarı buluşalım.", ["Saat", "15", "civarı", "buluşalım", "."]),
|
||||||
|
("9’daki otobüse binsek mi?", ["9’daki", "otobüse", "binsek", "mi", "?"]),
|
||||||
|
("Okulumuz 3-B şubesi", ["Okulumuz", "3-B", "şubesi"]),
|
||||||
|
("Okulumuz 3/B şubesi", ["Okulumuz", "3/B", "şubesi"]),
|
||||||
|
("Okulumuz 3B şubesi", ["Okulumuz", "3B", "şubesi"]),
|
||||||
|
("Okulumuz 3b şubesi", ["Okulumuz", "3b", "şubesi"]),
|
||||||
|
(
|
||||||
|
"Antonio Gaudí 20. yüzyılda, 1904-1914 yılları arasında on yıl süren bir reform süreci getirmiştir.",
|
||||||
|
[
|
||||||
|
"Antonio",
|
||||||
|
"Gaudí",
|
||||||
|
"20.",
|
||||||
|
"yüzyılda",
|
||||||
|
",",
|
||||||
|
"1904",
|
||||||
|
"-",
|
||||||
|
"1914",
|
||||||
|
"yılları",
|
||||||
|
"arasında",
|
||||||
|
"on",
|
||||||
|
"yıl",
|
||||||
|
"süren",
|
||||||
|
"bir",
|
||||||
|
"reform",
|
||||||
|
"süreci",
|
||||||
|
"getirmiştir",
|
||||||
|
".",
|
||||||
|
],
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Dizel yakıtın avro bölgesi ortalaması olan 1,165 avroya kıyasla litre başına 1,335 avroya mal olduğunu gösteriyor.",
|
||||||
|
[
|
||||||
|
"Dizel",
|
||||||
|
"yakıtın",
|
||||||
|
"avro",
|
||||||
|
"bölgesi",
|
||||||
|
"ortalaması",
|
||||||
|
"olan",
|
||||||
|
"1,165",
|
||||||
|
"avroya",
|
||||||
|
"kıyasla",
|
||||||
|
"litre",
|
||||||
|
"başına",
|
||||||
|
"1,335",
|
||||||
|
"avroya",
|
||||||
|
"mal",
|
||||||
|
"olduğunu",
|
||||||
|
"gösteriyor",
|
||||||
|
".",
|
||||||
|
],
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"Marcus Antonius M.Ö. 1 Ocak 49'da, Sezar'dan Vali'nin kendisini barış dostu ilan ettiği bir bildiri yayınlamıştır.",
|
||||||
|
[
|
||||||
|
"Marcus",
|
||||||
|
"Antonius",
|
||||||
|
"M.Ö.",
|
||||||
|
"1",
|
||||||
|
"Ocak",
|
||||||
|
"49'da",
|
||||||
|
",",
|
||||||
|
"Sezar'dan",
|
||||||
|
"Vali'nin",
|
||||||
|
"kendisini",
|
||||||
|
"barış",
|
||||||
|
"dostu",
|
||||||
|
"ilan",
|
||||||
|
"ettiği",
|
||||||
|
"bir",
|
||||||
|
"bildiri",
|
||||||
|
"yayınlamıştır",
|
||||||
|
".",
|
||||||
|
],
|
||||||
|
),
|
||||||
]
|
]


PUNCT_TESTS = [
    ("Gitmedim dedim ya!", ["Gitmedim", "dedim", "ya", "!"]),
    ("Gitmedim dedim ya!!", ["Gitmedim", "dedim", "ya", "!", "!"]),
    ("Gitsek mi?", ["Gitsek", "mi", "?"]),
    ("Gitsek mi??", ["Gitsek", "mi", "?", "?"]),
    ("Gitsek mi?!?", ["Gitsek", "mi", "?", "!", "?"]),
    ("Ankara - Antalya arası otobüs işliyor.", ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."]),
    ("Ankara-Antalya arası otobüs işliyor.", ["Ankara", "-", "Antalya", "arası", "otobüs", "işliyor", "."]),
    ("Sen--ben, ya da onlar.", ["Sen", "--", "ben", ",", "ya", "da", "onlar", "."]),
    ("Senden, benden, bizden şarkısını biliyor musun?", ["Senden", ",", "benden", ",", "bizden", "şarkısını", "biliyor", "musun", "?"]),
    ("Akif'le geldik, sonra da o ayrıldı.", ["Akif'le", "geldik", ",", "sonra", "da", "o", "ayrıldı", "."]),
    ("Bu adam ne dedi şimdi???", ["Bu", "adam", "ne", "dedi", "şimdi", "?", "?", "?"]),
    ("Yok hasta olmuş, yok annesi hastaymış, bahaneler işte...", ["Yok", "hasta", "olmuş", ",", "yok", "annesi", "hastaymış", ",", "bahaneler", "işte", "..."]),
    ("Ankara'dan İstanbul'a ... bir aşk hikayesi.", ["Ankara'dan", "İstanbul'a", "...", "bir", "aşk", "hikayesi", "."]),
    ("Ahmet'te", ["Ahmet'te"]),
    ("İstanbul'da", ["İstanbul'da"]),
]


GENERAL_TESTS = [
    ("1914'teki Endurance seferinde, Sir Ernest Shackleton'ın kaptanlığını yaptığı İngiliz Endurance gemisi yirmi sekiz kişi ile Antarktika'yı geçmek üzere yelken açtı.", ["1914'teki", "Endurance", "seferinde", ",", "Sir", "Ernest", "Shackleton'ın", "kaptanlığını", "yaptığı", "İngiliz", "Endurance", "gemisi", "yirmi", "sekiz", "kişi", "ile", "Antarktika'yı", "geçmek", "üzere", "yelken", "açtı", "."]),
    ('Danışılan "%100 Cospedal" olduğunu belirtti.', ["Danışılan", '"', "%", "100", "Cospedal", '"', "olduğunu", "belirtti", "."]),
    ("1976'da parkur artık kullanılmıyordu; 1990'da ise bir yangın, daha sonraları ahırlarla birlikte yıkılacak olan tahta tribünlerden geri kalanları da yok etmişti.", ["1976'da", "parkur", "artık", "kullanılmıyordu", ";", "1990'da", "ise", "bir", "yangın", ",", "daha", "sonraları", "ahırlarla", "birlikte", "yıkılacak", "olan", "tahta", "tribünlerden", "geri", "kalanları", "da", "yok", "etmişti", "."]),
    ("Dahiyane bir ameliyat ve zorlu bir rehabilitasyon sürecinden sonra, tamamen iyileştim.", ["Dahiyane", "bir", "ameliyat", "ve", "zorlu", "bir", "rehabilitasyon", "sürecinden", "sonra", ",", "tamamen", "iyileştim", "."]),
    ("Yaklaşık iki hafta süren bireysel erken oy kullanma döneminin ardından 5,7 milyondan fazla Floridalı sandık başına gitti.", ["Yaklaşık", "iki", "hafta", "süren", "bireysel", "erken", "oy", "kullanma", "döneminin", "ardından", "5,7", "milyondan", "fazla", "Floridalı", "sandık", "başına", "gitti", "."]),
    ("Ancak, bu ABD Çevre Koruma Ajansı'nın dünyayı bu konularda uyarmasının ardından ortaya çıktı.", ["Ancak", ",", "bu", "ABD", "Çevre", "Koruma", "Ajansı'nın", "dünyayı", "bu", "konularda", "uyarmasının", "ardından", "ortaya", "çıktı", "."]),
    ("Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.", ["Ortalama", "şansa", "ve", "10.000", "Sterlin", "değerinde", "tahvillere", "sahip", "bir", "yatırımcı", "yılda", "125", "Sterlin", "ikramiye", "kazanabilir", "."]),
    ("Granit adaları; Seyşeller ve Tioman ile Saint Helena gibi volkanik adaları kapsar.", ["Granit", "adaları", ";", "Seyşeller", "ve", "Tioman", "ile", "Saint", "Helena", "gibi", "volkanik", "adaları", "kapsar", "."]),
    ("Barış antlaşmasıyla İspanya, Amerika'ya Porto Riko, Guam ve Filipinler kolonilerini devretti.", ["Barış", "antlaşmasıyla", "İspanya", ",", "Amerika'ya", "Porto", "Riko", ",", "Guam", "ve", "Filipinler", "kolonilerini", "devretti", "."]),
    ("Makedonya'nın sınır bölgelerini güvence altına alan Philip, büyük bir Makedon ordusu kurdu ve uzun bir fetih seferi için Trakya'ya doğru yürüdü.", ["Makedonya'nın", "sınır", "bölgelerini", "güvence", "altına", "alan", "Philip", ",", "büyük", "bir", "Makedon", "ordusu", "kurdu", "ve", "uzun", "bir", "fetih", "seferi", "için", "Trakya'ya", "doğru", "yürüdü", "."]),
    ("Fransız gazetesi Le Figaro'ya göre bu hükumet planı sayesinde 42 milyon Euro kazanç sağlanabilir ve elde edilen paranın 15.5 milyonu ulusal güvenlik için kullanılabilir.", ["Fransız", "gazetesi", "Le", "Figaro'ya", "göre", "bu", "hükumet", "planı", "sayesinde", "42", "milyon", "Euro", "kazanç", "sağlanabilir", "ve", "elde", "edilen", "paranın", "15.5", "milyonu", "ulusal", "güvenlik", "için", "kullanılabilir", "."]),
    ("Ortalama şansa ve 10.000 Sterlin değerinde tahvillere sahip bir yatırımcı yılda 125 Sterlin ikramiye kazanabilir.", ["Ortalama", "şansa", "ve", "10.000", "Sterlin", "değerinde", "tahvillere", "sahip", "bir", "yatırımcı", "yılda", "125", "Sterlin", "ikramiye", "kazanabilir", "."]),
    ("3 Kasım Salı günü, Ankara Belediye Başkanı 2014'te hükümetle birlikte oluşturulan kentsel gelişim anlaşmasını askıya alma kararı verdi.", ["3", "Kasım", "Salı", "günü", ",", "Ankara", "Belediye", "Başkanı", "2014'te", "hükümetle", "birlikte", "oluşturulan", "kentsel", "gelişim", "anlaşmasını", "askıya", "alma", "kararı", "verdi", "."]),
    ("Stalin, Abakumov'u Beria'nın enerji bakanlıkları üzerindeki baskınlığına karşı MGB içinde kendi ağını kurmaya teşvik etmeye başlamıştı.", ["Stalin", ",", "Abakumov'u", "Beria'nın", "enerji", "bakanlıkları", "üzerindeki", "baskınlığına", "karşı", "MGB", "içinde", "kendi", "ağını", "kurmaya", "teşvik", "etmeye", "başlamıştı", "."]),
    ("Güney Avrupa'daki kazı alanlarının çoğunluğu gibi, bu bulgu M.Ö. 5. yüzyılın başlar", ["Güney", "Avrupa'daki", "kazı", "alanlarının", "çoğunluğu", "gibi", ",", "bu", "bulgu", "M.Ö.", "5.", "yüzyılın", "başlar"]),
    ("Sağlığın bozulması Hitchcock hayatının son yirmi yılında üretimini azalttı.", ["Sağlığın", "bozulması", "Hitchcock", "hayatının", "son", "yirmi", "yılında", "üretimini", "azalttı", "."]),
]


TESTS = (ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS)


@pytest.mark.parametrize("text,expected_tokens", TESTS)
@@ -149,4 +696,3 @@ def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens):
    token_list = [token.text for token in tokens if not token.is_space]
    print(token_list)
    assert expected_tokens == token_list
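
Illustrative sketch, not part of the diff: each tuple in the lists above pairs a raw Turkish string with the token texts the tokenizer is expected to produce, and the parametrized test simply compares the two for every entry in TESTS. Assuming a blank Turkish pipeline from spacy.blank("tr"), a single case can be checked by hand like this:

    # Sketch only: verify one test case against the blank Turkish pipeline.
    import spacy

    nlp = spacy.blank("tr")
    text, expected_tokens = ("Saat 15:30'da buluşalım.", ["Saat", "15:30'da", "buluşalım", "."])
    token_list = [token.text for token in nlp(text) if not token.is_space]
    assert token_list == expected_tokens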
@@ -89,7 +89,6 @@ def test_uk_tokenizer_splits_open_appostrophe(uk_tokenizer, text):
    assert tokens[0].text == "'"


@pytest.mark.parametrize("text", ["Тест''"])
def test_uk_tokenizer_splits_double_end_quote(uk_tokenizer, text):
    tokens = uk_tokenizer(text)
@@ -7,7 +7,6 @@ from spacy.tokens import Doc
from spacy.pipeline._parser_internals.nonproj import projectivize
from spacy.pipeline._parser_internals.arc_eager import ArcEager
from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL


def get_sequence_costs(M, words, heads, deps, transitions):
@@ -59,7 +58,7 @@ def test_oracle_four_words(arc_eager, vocab):
        ["S"],
        ["L-left"],
        ["S"],
        ["D"],
    ]
    assert state.is_final()
    for i, state_costs in enumerate(cost_history):
@@ -185,9 +184,9 @@ def test_oracle_dev_sentence(vocab, arc_eager):
        "L-nn",  # Attach 'Cars' to 'Inc.'
        "L-nn",  # Attach 'Motor' to 'Inc.'
        "L-nn",  # Attach 'Rolls-Royce' to 'Inc.'
        "S",  # Shift "Inc."
        "L-nsubj",  # Attach 'Inc.' to 'said'
        "S",  # Shift 'said'
        "S",  # Shift 'it'
        "L-nsubj",  # Attach 'it.' to 'expects'
        "R-ccomp",  # Attach 'expects' to 'said'
@@ -251,7 +250,7 @@ def test_oracle_bad_tokenization(vocab, arc_eager):
    is root is
    bad comp is
    """

    gold_words = []
    gold_deps = []
    gold_heads = []
@@ -268,7 +267,9 @@ def test_oracle_bad_tokenization(vocab, arc_eager):
    arc_eager.add_action(2, dep)  # Left
    arc_eager.add_action(3, dep)  # Right
    reference = Doc(Vocab(), words=gold_words, deps=gold_deps, heads=gold_heads)
    predicted = Doc(
        reference.vocab, words=["[", "catalase", "]", ":", "that", "is", "bad"]
    )
    example = Example(predicted=predicted, reference=reference)
    ae_oracle_actions = arc_eager.get_oracle_sequence(example, _debug=False)
    ae_oracle_actions = [arc_eager.get_class_name(i) for i in ae_oracle_actions]
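
Side note, a sketch rather than part of the diff: the transition names asserted in these oracle tests come from the two ArcEager calls used in the hunk above, assuming an `arc_eager` moves object and an `example` set up as in these tests:

    # Sketch only, mirroring the calls shown above.
    ae_oracle_actions = arc_eager.get_oracle_sequence(example, _debug=False)
    action_names = [arc_eager.get_class_name(i) for i in ae_oracle_actions]
    # action_names is a list of transition names such as "S", "L-nsubj" or "R-ccomp".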
@@ -301,11 +301,9 @@ def test_block_ner():
    assert [token.ent_type_ for token in doc] == expected_types


@pytest.mark.parametrize("use_upper", [True, False])
def test_overfitting_IO(use_upper):
    # Simple test to try and quickly overfit the NER component
    nlp = English()
    ner = nlp.add_pipe("ner", config={"model": {"use_upper": use_upper}})
    train_examples = []
@@ -361,6 +359,84 @@ def test_overfitting_IO(use_upper):
    assert_equal(batch_deps_1, no_batch_deps)


def test_beam_ner_scores():
    # Test that we can get confidence values out of the beam_ner pipe
    beam_width = 16
    beam_density = 0.0001
    nlp = English()
    config = {
        "beam_width": beam_width,
        "beam_density": beam_density,
    }
    ner = nlp.add_pipe("beam_ner", config=config)
    train_examples = []
    for text, annotations in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    optimizer = nlp.initialize()

    # update once
    losses = {}
    nlp.update(train_examples, sgd=optimizer, losses=losses)

    # test the scores from the beam
    test_text = "I like London."
    doc = nlp.make_doc(test_text)
    docs = [doc]
    beams = ner.predict(docs)
    entity_scores = ner.scored_ents(beams)[0]

    for j in range(len(doc)):
        for label in ner.labels:
            score = entity_scores[(j, j + 1, label)]
            eps = 0.00001
            assert 0 - eps <= score <= 1 + eps


def test_beam_overfitting_IO():
    # Simple test to try and quickly overfit the Beam NER component
    nlp = English()
    beam_width = 16
    beam_density = 0.0001
    config = {
        "beam_width": beam_width,
        "beam_density": beam_density,
    }
    ner = nlp.add_pipe("beam_ner", config=config)
    train_examples = []
    for text, annotations in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    optimizer = nlp.initialize()

    # run overfitting
    for i in range(50):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)
    assert losses["beam_ner"] < 0.0001

    # test the scores from the beam
    test_text = "I like London."
    docs = [nlp.make_doc(test_text)]
    beams = ner.predict(docs)
    entity_scores = ner.scored_ents(beams)[0]
    assert entity_scores[(2, 3, "LOC")] == 1.0
    assert entity_scores[(2, 3, "PERSON")] == 0.0

    # Also test the results are still the same after IO
    with make_tempdir() as tmp_dir:
        nlp.to_disk(tmp_dir)
        nlp2 = util.load_model_from_path(tmp_dir)
        docs2 = [nlp2.make_doc(test_text)]
        ner2 = nlp2.get_pipe("beam_ner")
        beams2 = ner2.predict(docs2)
        entity_scores2 = ner2.scored_ents(beams2)[0]
        assert entity_scores2[(2, 3, "LOC")] == 1.0
        assert entity_scores2[(2, 3, "PERSON")] == 0.0


def test_ner_warns_no_lookups(caplog):
    nlp = English()
    assert nlp.lang in util.LEXEME_NORM_LANGS
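
A hedged usage sketch, not part of the diff: outside the test suite, the beam NER confidences exercised above could be inspected as follows, assuming `nlp` already contains a trained "beam_ner" pipe as in these tests:

    # Sketch only: dump per-span confidences from the beam.
    ner = nlp.get_pipe("beam_ner")
    doc = nlp.make_doc("I like London.")
    beams = ner.predict([doc])
    entity_scores = ner.scored_ents(beams)[0]  # dict keyed by (start, end, label)
    for (start, end, label), score in entity_scores.items():
        print(doc[start:end].text, label, score)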
@@ -1,13 +1,9 @@
import pytest
import hypothesis
import hypothesis.strategies
import numpy
from spacy.vocab import Vocab
from spacy.language import Language
from spacy.pipeline._parser_internals.arc_eager import ArcEager
from spacy.tokens import Doc
from spacy.pipeline._parser_internals._beam_utils import BeamBatch
@@ -44,7 +40,7 @@ def docs(vocab):
            words=["Rats", "bite", "things"],
            heads=[1, 1, 1],
            deps=["nsubj", "ROOT", "dobj"],
            sent_starts=[True, False, False],
        )
    ]
@@ -77,10 +73,12 @@ def batch_size(docs):
def beam_width():
    return 4


@pytest.fixture(params=[0.0, 0.5, 1.0])
def beam_density(request):
    return request.param


@pytest.fixture
def vector_size():
    return 6
@@ -100,7 +98,9 @@ def scores(moves, batch_size, beam_width):
            numpy.random.uniform(-0.1, 0.1, (beam_width, moves.n_moves))
            for _ in range(batch_size)
        ]
        ),
        dtype="float32",
    )


def test_create_beam(beam):
@@ -128,8 +128,6 @@ def test_beam_parse(examples, beam_width):
    parser(doc)


@hypothesis.given(hyp=hypothesis.strategies.data())
def test_beam_density(moves, examples, beam_width, hyp):
    beam_density = float(hyp.draw(hypothesis.strategies.floats(0.0, 1.0, width=32)))
@@ -28,6 +28,26 @@ TRAIN_DATA = [
]


CONFLICTING_DATA = [
    (
        "I like London and Berlin.",
        {
            "heads": [1, 1, 1, 2, 2, 1],
            "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
        },
    ),
    (
        "I like London and Berlin.",
        {
            "heads": [0, 0, 0, 0, 0, 0],
            "deps": ["ROOT", "nsubj", "nsubj", "cc", "conj", "punct"],
        },
    ),
]

eps = 0.01


def test_parser_root(en_vocab):
    words = ["i", "do", "n't", "have", "other", "assistance"]
    heads = [3, 3, 3, 3, 5, 3]
@@ -185,26 +205,31 @@ def test_parser_set_sent_starts(en_vocab):
        assert token.head in sent


@pytest.mark.parametrize("pipe_name", ["parser", "beam_parser"])
def test_overfitting_IO(pipe_name):
    # Simple test to try and quickly overfit the dependency parser (normal or beam)
    nlp = English()
    parser = nlp.add_pipe(pipe_name)
    train_examples = []
    for text, annotations in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
        for dep in annotations.get("deps", []):
            parser.add_label(dep)
    optimizer = nlp.initialize()
    # run overfitting
    for i in range(150):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)
    assert losses[pipe_name] < 0.0001
    # test the trained model
    test_text = "I like securities."
    doc = nlp(test_text)
    assert doc[0].dep_ == "nsubj"
    assert doc[2].dep_ == "dobj"
    assert doc[3].dep_ == "punct"
    assert doc[0].head.i == 1
    assert doc[2].head.i == 1
    assert doc[3].head.i == 1
    # Also test the results are still the same after IO
    with make_tempdir() as tmp_dir:
        nlp.to_disk(tmp_dir)
|
||||||
assert doc2[0].dep_ == "nsubj"
|
assert doc2[0].dep_ == "nsubj"
|
||||||
assert doc2[2].dep_ == "dobj"
|
assert doc2[2].dep_ == "dobj"
|
||||||
assert doc2[3].dep_ == "punct"
|
assert doc2[3].dep_ == "punct"
|
||||||
|
assert doc2[0].head.i == 1
|
||||||
|
assert doc2[2].head.i == 1
|
||||||
|
assert doc2[3].head.i == 1
|
||||||
|
|
||||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||||
texts = [
|
texts = [
|
||||||
|
@ -226,3 +254,123 @@ def test_overfitting_IO():
|
||||||
no_batch_deps = [doc.to_array([DEP]) for doc in [nlp(text) for text in texts]]
|
no_batch_deps = [doc.to_array([DEP]) for doc in [nlp(text) for text in texts]]
|
||||||
assert_equal(batch_deps_1, batch_deps_2)
|
assert_equal(batch_deps_1, batch_deps_2)
|
||||||
assert_equal(batch_deps_1, no_batch_deps)
|
assert_equal(batch_deps_1, no_batch_deps)
|
||||||
|
|
||||||
|
|
||||||
|
def test_beam_parser_scores():
|
||||||
|
# Test that we can get confidence values out of the beam_parser pipe
|
||||||
|
beam_width = 16
|
||||||
|
beam_density = 0.0001
|
||||||
|
nlp = English()
|
||||||
|
config = {
|
||||||
|
"beam_width": beam_width,
|
||||||
|
"beam_density": beam_density,
|
||||||
|
}
|
||||||
|
parser = nlp.add_pipe("beam_parser", config=config)
|
||||||
|
train_examples = []
|
||||||
|
for text, annotations in CONFLICTING_DATA:
|
||||||
|
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||||
|
for dep in annotations.get("deps", []):
|
||||||
|
parser.add_label(dep)
|
||||||
|
optimizer = nlp.initialize()
|
||||||
|
|
||||||
|
# update a bit with conflicting data
|
||||||
|
for i in range(10):
|
||||||
|
losses = {}
|
||||||
|
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||||
|
|
||||||
|
# test the scores from the beam
|
||||||
|
test_text = "I like securities."
|
||||||
|
doc = nlp.make_doc(test_text)
|
||||||
|
docs = [doc]
|
||||||
|
beams = parser.predict(docs)
|
||||||
|
head_scores, label_scores = parser.scored_parses(beams)
|
||||||
|
|
||||||
|
for j in range(len(doc)):
|
||||||
|
for label in parser.labels:
|
||||||
|
label_score = label_scores[0][(j, label)]
|
||||||
|
assert 0 - eps <= label_score <= 1 + eps
|
||||||
|
for i in range(len(doc)):
|
||||||
|
head_score = head_scores[0][(j, i)]
|
||||||
|
assert 0 - eps <= head_score <= 1 + eps
|
||||||
|
|
||||||
|
|
||||||
|
def test_beam_overfitting_IO():
|
||||||
|
# Simple test to try and quickly overfit the Beam dependency parser
|
||||||
|
nlp = English()
|
||||||
|
beam_width = 16
|
||||||
|
beam_density = 0.0001
|
||||||
|
config = {
|
||||||
|
"beam_width": beam_width,
|
||||||
|
"beam_density": beam_density,
|
||||||
|
}
|
||||||
|
parser = nlp.add_pipe("beam_parser", config=config)
|
||||||
|
train_examples = []
|
||||||
|
for text, annotations in TRAIN_DATA:
|
||||||
|
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
|
||||||
|
for dep in annotations.get("deps", []):
|
||||||
|
parser.add_label(dep)
|
||||||
|
optimizer = nlp.initialize()
|
||||||
|
# run overfitting
|
||||||
|
for i in range(150):
|
||||||
|
losses = {}
|
||||||
|
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||||
|
assert losses["beam_parser"] < 0.0001
|
||||||
|
# test the scores from the beam
|
||||||
|
test_text = "I like securities."
|
||||||
|
docs = [nlp.make_doc(test_text)]
|
||||||
|
beams = parser.predict(docs)
|
||||||
|
head_scores, label_scores = parser.scored_parses(beams)
|
||||||
|
# we only processed one document
|
||||||
|
head_scores = head_scores[0]
|
||||||
|
label_scores = label_scores[0]
|
||||||
|
# test label annotations: 0=nsubj, 2=dobj, 3=punct
|
||||||
|
assert label_scores[(0, "nsubj")] == pytest.approx(1.0, eps)
|
||||||
|
assert label_scores[(0, "dobj")] == pytest.approx(0.0, eps)
|
||||||
|
assert label_scores[(0, "punct")] == pytest.approx(0.0, eps)
|
||||||
|
assert label_scores[(2, "nsubj")] == pytest.approx(0.0, eps)
|
||||||
|
assert label_scores[(2, "dobj")] == pytest.approx(1.0, eps)
|
||||||
|
assert label_scores[(2, "punct")] == pytest.approx(0.0, eps)
|
||||||
|
assert label_scores[(3, "nsubj")] == pytest.approx(0.0, eps)
|
||||||
|
assert label_scores[(3, "dobj")] == pytest.approx(0.0, eps)
|
||||||
|
assert label_scores[(3, "punct")] == pytest.approx(1.0, eps)
|
||||||
|
# test head annotations: the root is token at index 1
|
||||||
|
assert head_scores[(0, 0)] == pytest.approx(0.0, eps)
|
||||||
|
assert head_scores[(0, 1)] == pytest.approx(1.0, eps)
|
||||||
|
assert head_scores[(0, 2)] == pytest.approx(0.0, eps)
|
||||||
|
assert head_scores[(2, 0)] == pytest.approx(0.0, eps)
|
||||||
|
assert head_scores[(2, 1)] == pytest.approx(1.0, eps)
|
||||||
|
assert head_scores[(2, 2)] == pytest.approx(0.0, eps)
|
||||||
|
assert head_scores[(3, 0)] == pytest.approx(0.0, eps)
|
||||||
|
assert head_scores[(3, 1)] == pytest.approx(1.0, eps)
|
||||||
|
assert head_scores[(3, 2)] == pytest.approx(0.0, eps)
|
||||||
|
|
||||||
|
# Also test the results are still the same after IO
|
||||||
|
with make_tempdir() as tmp_dir:
|
||||||
|
nlp.to_disk(tmp_dir)
|
||||||
|
nlp2 = util.load_model_from_path(tmp_dir)
|
||||||
|
docs2 = [nlp2.make_doc(test_text)]
|
||||||
|
parser2 = nlp2.get_pipe("beam_parser")
|
||||||
|
beams2 = parser2.predict(docs2)
|
||||||
|
head_scores2, label_scores2 = parser2.scored_parses(beams2)
|
||||||
|
# we only processed one document
|
||||||
|
head_scores2 = head_scores2[0]
|
||||||
|
label_scores2 = label_scores2[0]
|
||||||
|
# check the results again
|
||||||
|
assert label_scores2[(0, "nsubj")] == pytest.approx(1.0, eps)
|
||||||
|
assert label_scores2[(0, "dobj")] == pytest.approx(0.0, eps)
|
||||||
|
assert label_scores2[(0, "punct")] == pytest.approx(0.0, eps)
|
||||||
|
assert label_scores2[(2, "nsubj")] == pytest.approx(0.0, eps)
|
||||||
|
assert label_scores2[(2, "dobj")] == pytest.approx(1.0, eps)
|
||||||
|
assert label_scores2[(2, "punct")] == pytest.approx(0.0, eps)
|
||||||
|
assert label_scores2[(3, "nsubj")] == pytest.approx(0.0, eps)
|
||||||
|
assert label_scores2[(3, "dobj")] == pytest.approx(0.0, eps)
|
||||||
|
assert label_scores2[(3, "punct")] == pytest.approx(1.0, eps)
|
||||||
|
assert head_scores2[(0, 0)] == pytest.approx(0.0, eps)
|
||||||
|
assert head_scores2[(0, 1)] == pytest.approx(1.0, eps)
|
||||||
|
assert head_scores2[(0, 2)] == pytest.approx(0.0, eps)
|
||||||
|
assert head_scores2[(2, 0)] == pytest.approx(0.0, eps)
|
||||||
|
assert head_scores2[(2, 1)] == pytest.approx(1.0, eps)
|
||||||
|
assert head_scores2[(2, 2)] == pytest.approx(0.0, eps)
|
||||||
|
assert head_scores2[(3, 0)] == pytest.approx(0.0, eps)
|
||||||
|
assert head_scores2[(3, 1)] == pytest.approx(1.0, eps)
|
||||||
|
assert head_scores2[(3, 2)] == pytest.approx(0.0, eps)
|
||||||
|
|
|
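
Similarly, a sketch rather than part of the diff: the beam parser scores asserted above map onto per-token head and label confidences, assuming `nlp` already contains a trained "beam_parser" pipe:

    # Sketch only: one scores dict per processed doc.
    parser = nlp.get_pipe("beam_parser")
    docs = [nlp.make_doc("I like securities.")]
    beams = parser.predict(docs)
    head_scores, label_scores = parser.scored_parses(beams)
    print(head_scores[0][(0, 1)])  # confidence that token 0 attaches to token 1
    print(label_scores[0][(0, "nsubj")])  # confidence that token 0 gets the dep label nsubj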
@@ -4,14 +4,17 @@ from spacy.tokens.doc import Doc
from spacy.vocab import Vocab
from spacy.pipeline._parser_internals.stateclass import StateClass


@pytest.fixture
def vocab():
    return Vocab()


@pytest.fixture
def doc(vocab):
    return Doc(vocab, words=["a", "b", "c", "d"])


def test_init_state(doc):
    state = StateClass(doc)
    assert state.stack == []
@@ -19,6 +22,7 @@ def test_init_state(doc):
    assert not state.is_final()
    assert state.buffer_length() == 4


def test_push_pop(doc):
    state = StateClass(doc)
    state.push()
@@ -33,6 +37,7 @@ def test_push_pop(doc):
    assert state.stack == [0]
    assert 1 not in state.queue


def test_stack_depth(doc):
    state = StateClass(doc)
    assert state.stack_depth() == 0
@@ -161,7 +161,7 @@ def test_attributeruler_score(nlp, pattern_dicts):
    # "cat" is the only correct lemma
    assert scores["lemma_acc"] == pytest.approx(0.2)
    # no morphs are set
    assert scores["morph_acc"] is None


def test_attributeruler_rule_order(nlp):
@@ -201,13 +201,9 @@ def test_entity_ruler_overlapping_spans(nlp):


@pytest.mark.parametrize("n_process", [1, 2])
def test_entity_ruler_multiprocessing(nlp, n_process):
    texts = ["I enjoy eating Pizza Hut pizza."]

    patterns = [{"label": "FASTFOOD", "pattern": "Pizza Hut", "id": "1234"}]

    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)
@@ -159,8 +159,12 @@ def test_pipe_class_component_model():
        "model": {
            "@architectures": "spacy.TextCatEnsemble.v2",
            "tok2vec": DEFAULT_TOK2VEC_MODEL,
            "linear_model": {
                "@architectures": "spacy.TextCatBOW.v1",
                "exclusive_classes": False,
                "ngram_size": 1,
                "no_output_layer": False,
            },
        },
        "value1": 10,
    }
@@ -37,7 +37,16 @@ TRAIN_DATA = [
]

PARTIAL_DATA = [
    # partial annotation
    ("I like green eggs", {"tags": ["", "V", "J", ""]}),
    # misaligned partial annotation
    (
        "He hates green eggs",
        {
            "words": ["He", "hate", "s", "green", "eggs"],
            "tags": ["", "V", "S", "J", ""],
        },
    ),
]
@@ -126,6 +135,7 @@ def test_incomplete_data():
    assert doc[1].tag_ is "V"
    assert doc[2].tag_ is "J"


def test_overfitting_IO():
    # Simple test to try and quickly overfit the tagger - ensuring the ML models work correctly
    nlp = English()
@@ -15,15 +15,31 @@ from spacy.training import Example
from ..util import make_tempdir


TRAIN_DATA_SINGLE_LABEL = [
    ("I'm so happy.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("I'm so angry", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]

TRAIN_DATA_MULTI_LABEL = [
    ("I'm angry and confused", {"cats": {"ANGRY": 1.0, "CONFUSED": 1.0, "HAPPY": 0.0}}),
    ("I'm confused but happy", {"cats": {"ANGRY": 0.0, "CONFUSED": 1.0, "HAPPY": 1.0}}),
]


def make_get_examples_single_label(nlp):
    train_examples = []
    for t in TRAIN_DATA_SINGLE_LABEL:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))

    def get_examples():
        return train_examples

    return get_examples


def make_get_examples_multi_label(nlp):
    train_examples = []
    for t in TRAIN_DATA_MULTI_LABEL:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))

    def get_examples():
@@ -85,49 +101,75 @@ def test_textcat_learns_multilabel():
    assert score > 0.5


@pytest.mark.parametrize("name", ["textcat", "textcat_multilabel"])
def test_label_types(name):
    nlp = Language()
    textcat = nlp.add_pipe(name)
    textcat.add_label("answer")
    with pytest.raises(ValueError):
        textcat.add_label(9)


@pytest.mark.parametrize("name", ["textcat", "textcat_multilabel"])
def test_no_label(name):
    nlp = Language()
    nlp.add_pipe(name)
    with pytest.raises(ValueError):
        nlp.initialize()


@pytest.mark.parametrize(
    "name,get_examples",
    [
        ("textcat", make_get_examples_single_label),
        ("textcat_multilabel", make_get_examples_multi_label),
    ],
)
def test_implicit_label(name, get_examples):
    nlp = Language()
    nlp.add_pipe(name)
    nlp.initialize(get_examples=get_examples(nlp))


@pytest.mark.parametrize("name", ["textcat", "textcat_multilabel"])
def test_no_resize(name):
    nlp = Language()
    textcat = nlp.add_pipe(name)
    textcat.add_label("POSITIVE")
    textcat.add_label("NEGATIVE")
    nlp.initialize()
    assert textcat.model.get_dim("nO") >= 2
    # this throws an error because the textcat can't be resized after initialization
    with pytest.raises(ValueError):
        textcat.add_label("NEUTRAL")


def test_error_with_multi_labels():
    nlp = Language()
    textcat = nlp.add_pipe("textcat")
    train_examples = []
    for text, annotations in TRAIN_DATA_MULTI_LABEL:
        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
    with pytest.raises(ValueError):
        optimizer = nlp.initialize(get_examples=lambda: train_examples)


@pytest.mark.parametrize(
    "name,get_examples, train_data",
    [
        ("textcat", make_get_examples_single_label, TRAIN_DATA_SINGLE_LABEL),
        ("textcat_multilabel", make_get_examples_multi_label, TRAIN_DATA_MULTI_LABEL),
    ],
)
def test_initialize_examples(name, get_examples, train_data):
    nlp = Language()
    textcat = nlp.add_pipe(name)
    for text, annotations in train_data:
        for label, value in annotations.get("cats").items():
            textcat.add_label(label)
    # you shouldn't really call this more than once, but for testing it should be fine
    nlp.initialize()
    nlp.initialize(get_examples=get_examples(nlp))
    with pytest.raises(TypeError):
        nlp.initialize(get_examples=lambda: None)
    with pytest.raises(TypeError):
@@ -138,12 +180,10 @@ def test_overfitting_IO():
    # Simple test to try and quickly overfit the single-label textcat component - ensuring the ML models work correctly
    fix_random_seed(0)
    nlp = English()
    textcat = nlp.add_pipe("textcat")
    train_examples = []
    for text, annotations in TRAIN_DATA_SINGLE_LABEL:
        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
    optimizer = nlp.initialize(get_examples=lambda: train_examples)
    assert textcat.model.get_dim("nO") == 2
@@ -172,6 +212,8 @@ def test_overfitting_IO():
    # Test scoring
    scores = nlp.evaluate(train_examples)
    assert scores["cats_micro_f"] == 1.0
    assert scores["cats_macro_f"] == 1.0
    assert scores["cats_macro_auc"] == 1.0
    assert scores["cats_score"] == 1.0
    assert "cats_score_desc" in scores
@@ -192,7 +234,7 @@ def test_overfitting_IO_multi():
    config = {"model": {"linear_model": {"exclusive_classes": False}}}
    textcat = nlp.add_pipe("textcat", config=config)
    train_examples = []
    for text, annotations in TRAIN_DATA_MULTI_LABEL:
        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
    optimizer = nlp.initialize(get_examples=lambda: train_examples)
    assert textcat.model.get_dim("nO") == 2
@@ -231,27 +273,75 @@ def test_overfitting_IO_multi():
    assert_equal(batch_cats_1, no_batch_cats)


def test_overfitting_IO_multi():
    # Simple test to try and quickly overfit the multi-label textcat component - ensuring the ML models work correctly
    fix_random_seed(0)
    nlp = English()
    textcat = nlp.add_pipe("textcat_multilabel")

    train_examples = []
    for text, annotations in TRAIN_DATA_MULTI_LABEL:
        train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
    optimizer = nlp.initialize(get_examples=lambda: train_examples)
    assert textcat.model.get_dim("nO") == 3

    for i in range(100):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)
    assert losses["textcat_multilabel"] < 0.01

    # test the trained model
    test_text = "I am confused but happy."
    doc = nlp(test_text)
    cats = doc.cats
    assert cats["HAPPY"] > 0.9
    assert cats["CONFUSED"] > 0.9

    # Also test the results are still the same after IO
    with make_tempdir() as tmp_dir:
        nlp.to_disk(tmp_dir)
        nlp2 = util.load_model_from_path(tmp_dir)
        doc2 = nlp2(test_text)
        cats2 = doc2.cats
        assert cats2["HAPPY"] > 0.9
        assert cats2["CONFUSED"] > 0.9

    # Test scoring
    scores = nlp.evaluate(train_examples)
    assert scores["cats_micro_f"] == 1.0
    assert scores["cats_macro_f"] == 1.0
    assert "cats_score_desc" in scores

    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
    texts = ["Just a sentence.", "I like green eggs.", "I am happy.", "I eat ham."]
    batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)]
    batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)]
    no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]]
    assert_equal(batch_deps_1, batch_deps_2)
    assert_equal(batch_deps_1, no_batch_deps)

 # fmt: off
 @pytest.mark.parametrize(
-    "textcat_config",
+    "name,train_data,textcat_config",
     [
-        {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False},
-        {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False},
-        {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True},
-        {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True},
-        {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}},
-        {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}},
-        {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True},
-        {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False},
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}),
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False}),
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True}),
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True}),
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}}),
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatEnsemble.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL, "linear_model": {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 5, "no_output_layer": False}}),
+        ("textcat", TRAIN_DATA_SINGLE_LABEL, {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True}),
+        ("textcat_multilabel", TRAIN_DATA_MULTI_LABEL, {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False}),
     ],
 )
 # fmt: on
-def test_textcat_configs(textcat_config):
+def test_textcat_configs(name, train_data, textcat_config):
     pipe_config = {"model": textcat_config}
     nlp = English()
-    textcat = nlp.add_pipe("textcat", config=pipe_config)
+    textcat = nlp.add_pipe(name, config=pipe_config)
     train_examples = []
-    for text, annotations in TRAIN_DATA:
+    for text, annotations in train_data:
         train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
         for label, value in annotations.get("cats").items():
             textcat.add_label(label)
@@ -264,15 +354,24 @@ def test_textcat_configs(textcat_config):
 def test_positive_class():
     nlp = English()
     textcat = nlp.add_pipe("textcat")
-    get_examples = make_get_examples(nlp)
+    get_examples = make_get_examples_single_label(nlp)
     textcat.initialize(get_examples, labels=["POS", "NEG"], positive_label="POS")
     assert textcat.labels == ("POS", "NEG")
+    assert textcat.cfg["positive_label"] == "POS"
+
+    textcat_multilabel = nlp.add_pipe("textcat_multilabel")
+    get_examples = make_get_examples_multi_label(nlp)
+    with pytest.raises(TypeError):
+        textcat_multilabel.initialize(get_examples, labels=["POS", "NEG"], positive_label="POS")
+    textcat_multilabel.initialize(get_examples, labels=["FICTION", "DRAMA"])
+    assert textcat_multilabel.labels == ("FICTION", "DRAMA")
+    assert "positive_label" not in textcat_multilabel.cfg


 def test_positive_class_not_present():
     nlp = English()
     textcat = nlp.add_pipe("textcat")
-    get_examples = make_get_examples(nlp)
+    get_examples = make_get_examples_single_label(nlp)
     with pytest.raises(ValueError):
         textcat.initialize(get_examples, labels=["SOME", "THING"], positive_label="POS")

@@ -280,11 +379,9 @@ def test_positive_class_not_present():
 def test_positive_class_not_binary():
     nlp = English()
     textcat = nlp.add_pipe("textcat")
-    get_examples = make_get_examples(nlp)
+    get_examples = make_get_examples_multi_label(nlp)
     with pytest.raises(ValueError):
-        textcat.initialize(
-            get_examples, labels=["SOME", "THING", "POS"], positive_label="POS"
-        )
+        textcat.initialize(get_examples, labels=["SOME", "THING", "POS"], positive_label="POS")


 def test_textcat_evaluation():
@@ -113,7 +113,7 @@ cfg_string = """
     factory = "tok2vec"

     [components.tok2vec.model]
-    @architectures = "spacy.Tok2Vec.v1"
+    @architectures = "spacy.Tok2Vec.v2"

     [components.tok2vec.model.embed]
     @architectures = "spacy.MultiHashEmbed.v1"

@@ -123,7 +123,7 @@ cfg_string = """
     include_static_vectors = false

     [components.tok2vec.model.encode]
-    @architectures = "spacy.MaxoutWindowEncoder.v1"
+    @architectures = "spacy.MaxoutWindowEncoder.v2"
     width = 96
     depth = 4
     window_size = 1
@@ -288,35 +288,33 @@ def test_multiple_predictions():
     dummy_pipe(doc)


-@pytest.mark.skip(reason="removed Beam stuff during the Example/GoldParse refactor")
 def test_issue4313():
     """ This should not crash or exit with some strange error code """
     beam_width = 16
     beam_density = 0.0001
     nlp = English()
-    config = {}
-    ner = nlp.create_pipe("ner", config=config)
+    config = {
+        "beam_width": beam_width,
+        "beam_density": beam_density,
+    }
+    ner = nlp.add_pipe("beam_ner", config=config)
     ner.add_label("SOME_LABEL")
-    ner.initialize(lambda: [])
+    nlp.initialize()
     # add a new label to the doc
     doc = nlp("What do you think about Apple ?")
     assert len(ner.labels) == 1
     assert "SOME_LABEL" in ner.labels
+    ner.add_label("MY_ORG")  # TODO: not sure if we want this to be necessary...
     apple_ent = Span(doc, 5, 6, label="MY_ORG")
     doc.ents = list(doc.ents) + [apple_ent]

     # ensure the beam_parse still works with the new label
     docs = [doc]
-    beams = nlp.entity.beam_parse(
-        docs, beam_width=beam_width, beam_density=beam_density
+    ner = nlp.get_pipe("beam_ner")
+    beams = ner.beam_parse(
+        docs, drop=0.0, beam_width=beam_width, beam_density=beam_density
     )

-    for doc, beam in zip(docs, beams):
-        entity_scores = defaultdict(float)
-        for score, ents in nlp.entity.moves.get_beam_parses(beam):
-            for start, end, label in ents:
-                entity_scores[(start, end, label)] += score


 def test_issue4348():
     """Test that training the tagger with empty data, doesn't throw errors"""
@@ -2,8 +2,11 @@ import pytest
 from thinc.api import Config, fix_random_seed

 from spacy.lang.en import English
-from spacy.pipeline.textcat import default_model_config, bow_model_config
-from spacy.pipeline.textcat import cnn_model_config
+from spacy.pipeline.textcat import single_label_default_config, single_label_bow_config
+from spacy.pipeline.textcat import single_label_cnn_config
+from spacy.pipeline.textcat_multilabel import multi_label_default_config
+from spacy.pipeline.textcat_multilabel import multi_label_bow_config
+from spacy.pipeline.textcat_multilabel import multi_label_cnn_config
 from spacy.tokens import Span
 from spacy import displacy
 from spacy.pipeline import merge_entities

@@ -11,7 +14,15 @@ from spacy.training import Example


 @pytest.mark.parametrize(
-    "textcat_config", [default_model_config, bow_model_config, cnn_model_config]
+    "textcat_config",
+    [
+        single_label_default_config,
+        single_label_bow_config,
+        single_label_cnn_config,
+        multi_label_default_config,
+        multi_label_bow_config,
+        multi_label_cnn_config,
+    ],
 )
 def test_issue5551(textcat_config):
     """Test that after fixing the random seed, the results of the pipeline are truly identical"""
@@ -1,4 +1,3 @@
-import pydantic
 import pytest
 from pydantic import ValidationError
 from spacy.schemas import TokenPattern, TokenPatternSchema

@@ -208,7 +208,7 @@ def test_create_nlp_from_pretraining_config():
     config = Config().from_str(pretrain_config_string)
     pretrain_config = load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
     filled = config.merge(pretrain_config)
-    resolved = registry.resolve(filled["pretraining"], schema=ConfigSchemaPretrain)
+    registry.resolve(filled["pretraining"], schema=ConfigSchemaPretrain)


 def test_create_nlp_from_config_multiple_instances():
@@ -4,7 +4,7 @@ from spacy.pipeline import Tagger, DependencyParser, EntityRecognizer
 from spacy.pipeline import TextCategorizer, SentenceRecognizer, TrainablePipe
 from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
 from spacy.pipeline.tagger import DEFAULT_TAGGER_MODEL
-from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL
+from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
 from spacy.pipeline.senter import DEFAULT_SENTER_MODEL
 from spacy.lang.en import English
 from thinc.api import Linear

@@ -24,7 +24,7 @@ def parser(en_vocab):
         "update_with_oracle_cut_size": 100,
         "beam_width": 1,
         "beam_update_prob": 1.0,
-        "beam_density": 0.0
+        "beam_density": 0.0,
     }
     cfg = {"model": DEFAULT_PARSER_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]

@@ -41,7 +41,7 @@ def blank_parser(en_vocab):
         "update_with_oracle_cut_size": 100,
         "beam_width": 1,
         "beam_update_prob": 1.0,
-        "beam_density": 0.0
+        "beam_density": 0.0,
     }
     cfg = {"model": DEFAULT_PARSER_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]

@@ -66,7 +66,7 @@ def test_serialize_parser_roundtrip_bytes(en_vocab, Parser):
         "update_with_oracle_cut_size": 100,
         "beam_width": 1,
         "beam_update_prob": 1.0,
-        "beam_density": 0.0
+        "beam_density": 0.0,
     }
     cfg = {"model": DEFAULT_PARSER_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]

@@ -90,7 +90,7 @@ def test_serialize_parser_strings(Parser):
         "update_with_oracle_cut_size": 100,
         "beam_width": 1,
         "beam_update_prob": 1.0,
-        "beam_density": 0.0
+        "beam_density": 0.0,
     }
     cfg = {"model": DEFAULT_PARSER_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]

@@ -112,7 +112,7 @@ def test_serialize_parser_roundtrip_disk(en_vocab, Parser):
         "update_with_oracle_cut_size": 100,
         "beam_width": 1,
         "beam_update_prob": 1.0,
-        "beam_density": 0.0
+        "beam_density": 0.0,
     }
     cfg = {"model": DEFAULT_PARSER_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]

@@ -140,9 +140,6 @@ def test_to_from_bytes(parser, blank_parser):
     assert blank_parser.moves.n_moves == parser.moves.n_moves


-@pytest.mark.skip(
-    reason="This seems to be a dict ordering bug somewhere. Only failing on some platforms."
-)
 def test_serialize_tagger_roundtrip_bytes(en_vocab, taggers):
     tagger1 = taggers[0]
     tagger1_b = tagger1.to_bytes()

@@ -191,7 +188,7 @@ def test_serialize_tagger_strings(en_vocab, de_vocab, taggers):

 def test_serialize_textcat_empty(en_vocab):
     # See issue #1105
-    cfg = {"model": DEFAULT_TEXTCAT_MODEL}
+    cfg = {"model": DEFAULT_SINGLE_TEXTCAT_MODEL}
     model = registry.resolve(cfg, validate=True)["model"]
     textcat = TextCategorizer(en_vocab, model, threshold=0.5)
     textcat.to_bytes(exclude=["vocab"])
@@ -26,7 +26,6 @@ def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
     assert tokenizer_reloaded.rules == {}


-@pytest.mark.skip(reason="Currently unreliable across platforms")
 @pytest.mark.parametrize("text", ["I💜you", "they’re", "“hello”"])
 def test_serialize_tokenizer_roundtrip_bytes(en_tokenizer, text):
     tokenizer = en_tokenizer

@@ -38,7 +37,6 @@ def test_serialize_tokenizer_roundtrip_bytes(en_tokenizer, text):
     assert [token.text for token in doc1] == [token.text for token in doc2]


-@pytest.mark.skip(reason="Currently unreliable across platforms")
 def test_serialize_tokenizer_roundtrip_disk(en_tokenizer):
     tokenizer = en_tokenizer
     with make_tempdir() as d:
@@ -3,7 +3,9 @@ from click import NoSuchOption
 from spacy.training import docs_to_json, offsets_to_biluo_tags
 from spacy.training.converters import iob_to_docs, conll_ner_to_docs, conllu_to_docs
 from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate
+from spacy.lang.nl import Dutch
 from spacy.util import ENV_VARS
+from spacy.cli import info
 from spacy.cli.init_config import init_config, RECOMMENDATIONS
 from spacy.cli._util import validate_project_commands, parse_config_overrides
 from spacy.cli._util import load_project_config, substitute_project_variables

@@ -15,6 +17,16 @@ import os
 from .util import make_tempdir


+def test_cli_info():
+    nlp = Dutch()
+    nlp.add_pipe("textcat")
+    with make_tempdir() as tmp_dir:
+        nlp.to_disk(tmp_dir)
+        raw_data = info(tmp_dir, exclude=[""])
+        assert raw_data["lang"] == "nl"
+        assert raw_data["components"] == ["textcat"]
+
+
 def test_cli_converters_conllu_to_docs():
     # from NorNE: https://github.com/ltgoslo/norne/blob/3d23274965f513f23aa48455b28b1878dad23c05/ud/nob/no_bokmaal-ud-dev.conllu
     lines = [
@@ -83,6 +83,7 @@ def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
 def test_prefer_gpu():
     try:
         import cupy  # noqa: F401
+
         prefer_gpu()
         assert isinstance(get_current_ops(), CupyOps)
     except ImportError:

@@ -92,17 +93,20 @@ def test_prefer_gpu():
 def test_require_gpu():
     try:
         import cupy  # noqa: F401
+
         require_gpu()
         assert isinstance(get_current_ops(), CupyOps)
     except ImportError:
         with pytest.raises(ValueError):
             require_gpu()


 def test_require_cpu():
     require_cpu()
     assert isinstance(get_current_ops(), NumpyOps)
     try:
         import cupy  # noqa: F401
+
         require_gpu()
         assert isinstance(get_current_ops(), CupyOps)
     except ImportError:
@@ -294,7 +294,7 @@ def test_partial_annotation(en_tokenizer):
         # cats doesn't have an unset state
         if key.startswith("cats"):
             continue
-        assert scores[key] == None
+        assert scores[key] is None

     # partially annotated reference, not overlapping with predicted annotation
     ref_doc = en_tokenizer("a b c d e")

@@ -306,13 +306,13 @@ def test_partial_annotation(en_tokenizer):
     example = Example(pred_doc, ref_doc)
     scorer = Scorer()
     scores = scorer.score([example])
-    assert scores["token_acc"] == None
+    assert scores["token_acc"] is None
     assert scores["tag_acc"] == 0.0
     assert scores["pos_acc"] == 0.0
     assert scores["morph_acc"] == 0.0
     assert scores["dep_uas"] == 1.0
     assert scores["dep_las"] == 0.0
-    assert scores["sents_f"] == None
+    assert scores["sents_f"] is None

     # partially annotated reference, overlapping with predicted annotation
     ref_doc = en_tokenizer("a b c d e")

@@ -324,13 +324,13 @@ def test_partial_annotation(en_tokenizer):
     example = Example(pred_doc, ref_doc)
     scorer = Scorer()
     scores = scorer.score([example])
-    assert scores["token_acc"] == None
+    assert scores["token_acc"] is None
     assert scores["tag_acc"] == 1.0
     assert scores["pos_acc"] == 1.0
     assert scores["morph_acc"] == 0.0
     assert scores["dep_uas"] == 1.0
     assert scores["dep_las"] == 0.0
-    assert scores["sents_f"] == None
+    assert scores["sents_f"] is None


 def test_roc_auc_score():

@@ -391,7 +391,7 @@ def test_roc_auc_score():
     score.score_set(0.25, 0)
     score.score_set(0.75, 0)
     with pytest.raises(ValueError):
-        s = score.score
+        _ = score.score  # noqa: F841

     y_true = [1, 1]
     y_score = [0.25, 0.75]

@@ -402,4 +402,4 @@ def test_roc_auc_score():
     score.score_set(0.25, 1)
     score.score_set(0.75, 1)
     with pytest.raises(ValueError):
-        s = score.score
+        _ = score.score  # noqa: F841
@@ -180,3 +180,9 @@ def test_tokenizer_special_cases_idx(tokenizer):
     doc = tokenizer(text)
     assert doc[1].idx == 4
     assert doc[2].idx == 7
+
+
+def test_tokenizer_special_cases_spaces(tokenizer):
+    assert [t.text for t in tokenizer("a b c")] == ["a", "b", "c"]
+    tokenizer.add_special_case("a b c", [{"ORTH": "a b c"}])
+    assert [t.text for t in tokenizer("a b c")] == ["a b c"]
@@ -51,7 +51,7 @@ def test_readers():
     for example in train_corpus(nlp):
         nlp.update([example], sgd=optimizer)
     scores = nlp.evaluate(list(dev_corpus(nlp)))
-    assert scores["cats_score"] == 0.0
+    assert scores["cats_macro_auc"] == 0.0
     # ensure the pipeline runs
     doc = nlp("Quick test")
     assert doc.cats

@@ -73,7 +73,7 @@ def test_cat_readers(reader, additional_config):
     nlp_config_string = """
     [training]
     seed = 0

     [training.score_weights]
     cats_macro_auc = 1.0
@@ -71,7 +71,6 @@ def test_table_api_to_from_bytes():
     assert "def" not in new_table2


-@pytest.mark.skip(reason="This fails on Python 3.5")
 def test_lookups_to_from_bytes():
     lookups = Lookups()
     lookups.add_table("table1", {"foo": "bar", "hello": "world"})

@@ -91,7 +90,6 @@ def test_lookups_to_from_bytes():
     assert new_lookups.to_bytes() == lookups_bytes


-@pytest.mark.skip(reason="This fails on Python 3.5")
 def test_lookups_to_from_disk():
     lookups = Lookups()
     lookups.add_table("table1", {"foo": "bar", "hello": "world"})

@@ -111,7 +109,6 @@ def test_lookups_to_from_disk():
     assert table2["b"] == 2


-@pytest.mark.skip(reason="This fails on Python 3.5")
 def test_lookups_to_from_bytes_via_vocab():
     table_name = "test"
     vocab = Vocab()

@@ -128,7 +125,6 @@ def test_lookups_to_from_bytes_via_vocab():
     assert new_vocab.to_bytes() == vocab_bytes


-@pytest.mark.skip(reason="This fails on Python 3.5")
 def test_lookups_to_from_disk_via_vocab():
     table_name = "test"
     vocab = Vocab()
@@ -258,6 +258,7 @@ cdef class Tokenizer:
             tokens = doc.c
         # Otherwise create a separate array to store modified tokens
         else:
+            assert max_length > 0
             tokens = <TokenC*>mem.alloc(max_length, sizeof(TokenC))
         # Modify tokenization according to filtered special cases
         offset = self._retokenize_special_spans(doc, tokens, span_data)

@@ -610,7 +611,7 @@ cdef class Tokenizer:
             self.mem.free(stale_special)
         self._rules[string] = substrings
         self._flush_cache()
-        if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string):
+        if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string) or " " in string:
             self._special_matcher.add(string, None, self._tokenize_affixes(string, False))

     def _reload_special_cases(self):
@@ -188,8 +188,15 @@ def _merge(Doc doc, merges):
                 and doc.c[start - 1].ent_type == token.ent_type:
             merged_iob = 1
         token.ent_iob = merged_iob
+        # Set lemma to concatenated lemmas
+        merged_lemma = ""
+        for span_token in span:
+            merged_lemma += span_token.lemma_
+            if doc.c[span_token.i].spacy:
+                merged_lemma += " "
+        merged_lemma = merged_lemma.strip()
+        token.lemma = doc.vocab.strings.add(merged_lemma)
         # Unset attributes that don't match new token
-        token.lemma = 0
         token.norm = 0
         tokens[merge_index] = token
     # Resize the doc.tensor, if it's set. Let the last row for each token stand

@@ -335,7 +342,9 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
         token = &doc.c[token_index + i]
         lex = doc.vocab.get(doc.mem, orth)
         token.lex = lex
-        token.lemma = 0  # reset lemma
+        # If lemma is currently set, set default lemma to orth
+        if token.lemma != 0:
+            token.lemma = lex.orth
         token.norm = 0  # reset norm
         if to_process_tensor:
             # setting the tensors of the split tokens to array of zeros
@@ -225,6 +225,7 @@ cdef class Doc:
         # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
         # However, we need to remember the true starting places, so that we can
        # realloc.
+        assert size + (PADDING*2) > 0
         data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
         cdef int i
         for i in range(size + (PADDING*2)):

@@ -1097,7 +1098,7 @@ cdef class Doc:
             (vocab,) = vocab

         if attrs is None:
-            attrs = Doc._get_array_attrs()
+            attrs = list(Doc._get_array_attrs())
         else:
             if any(isinstance(attr, str) for attr in attrs):  # resolve attribute names
                 attrs = [intify_attr(attr) for attr in attrs]  # intify_attr returns None for invalid attrs

@@ -1177,6 +1178,7 @@ cdef class Doc:
         other.length = self.length
         other.max_length = self.max_length
         buff_size = other.max_length + (PADDING*2)
+        assert buff_size > 0
         tokens = <TokenC*>other.mem.alloc(buff_size, sizeof(TokenC))
         memcpy(tokens, self.c - PADDING, buff_size * sizeof(TokenC))
         other.c = &tokens[PADDING]
@@ -37,9 +37,17 @@ def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language":
     T = registry.resolve(config["training"], schema=ConfigSchemaTraining)
     dot_names = [T["train_corpus"], T["dev_corpus"]]
     if not isinstance(T["train_corpus"], str):
-        raise ConfigValidationError(desc=Errors.E897.format(field="training.train_corpus", type=type(T["train_corpus"])))
+        raise ConfigValidationError(
+            desc=Errors.E897.format(
+                field="training.train_corpus", type=type(T["train_corpus"])
+            )
+        )
     if not isinstance(T["dev_corpus"], str):
-        raise ConfigValidationError(desc=Errors.E897.format(field="training.dev_corpus", type=type(T["dev_corpus"])))
+        raise ConfigValidationError(
+            desc=Errors.E897.format(
+                field="training.dev_corpus", type=type(T["dev_corpus"])
+            )
+        )
     train_corpus, dev_corpus = resolve_dot_names(config, dot_names)
     optimizer = T["optimizer"]
     # Components that shouldn't be updated during training
@@ -59,6 +59,19 @@ def train(
     batcher = T["batcher"]
     train_logger = T["logger"]
     before_to_disk = create_before_to_disk_callback(T["before_to_disk"])
+
+    # Helper function to save checkpoints. This is a closure for convenience,
+    # to avoid passing in all the args all the time.
+    def save_checkpoint(is_best):
+        with nlp.use_params(optimizer.averages):
+            before_to_disk(nlp).to_disk(output_path / DIR_MODEL_LAST)
+        if is_best:
+            # Avoid saving twice (saving will be more expensive than
+            # the dir copy)
+            if (output_path / DIR_MODEL_BEST).exists():
+                shutil.rmtree(output_path / DIR_MODEL_BEST)
+            shutil.copytree(output_path / DIR_MODEL_LAST, output_path / DIR_MODEL_BEST)
+
     # Components that shouldn't be updated during training
     frozen_components = T["frozen_components"]
     # Create iterator, which yields out info after each optimization step.

@@ -87,36 +100,31 @@ def train(
             if is_best_checkpoint is not None and output_path is not None:
                 with nlp.select_pipes(disable=frozen_components):
                     update_meta(T, nlp, info)
-                with nlp.use_params(optimizer.averages):
-                    nlp = before_to_disk(nlp)
-                    nlp.to_disk(output_path / DIR_MODEL_BEST)
+                save_checkpoint(is_best_checkpoint)
     except Exception as e:
         if output_path is not None:
-            # We don't want to swallow the traceback if we don't have a
-            # specific error, but we do want to warn that we're trying
-            # to do something here.
             stdout.write(
                 msg.warn(
                     f"Aborting and saving the final best model. "
-                    f"Encountered exception: {str(e)}"
+                    f"Encountered exception: {repr(e)}"
                 )
                 + "\n"
             )
         raise e
     finally:
         finalize_logger()
-    if optimizer.averages:
-        nlp.use_params(optimizer.averages)
-    if output_path is not None:
-        final_model_path = output_path / DIR_MODEL_LAST
-        nlp.to_disk(final_model_path)
-        # This will only run if we don't hit an error
-        stdout.write(
-            msg.good("Saved pipeline to output directory", final_model_path) + "\n"
+        save_checkpoint(False)
+    # This will only run if we did't hit an error
+    if optimizer.averages:
+        nlp.use_params(optimizer.averages)
+    if output_path is not None:
+        stdout.write(
+            msg.good("Saved pipeline to output directory", output_path / DIR_MODEL_LAST)
+            + "\n"
         )
-        return (nlp, final_model_path)
+        return (nlp, output_path / DIR_MODEL_LAST)
     else:
         return (nlp, None)


 def train_while_improving(
@@ -10,7 +10,7 @@ from wasabi import Printer

 from .example import Example
 from ..tokens import Doc
-from ..schemas import ConfigSchemaTraining, ConfigSchemaPretrain
+from ..schemas import ConfigSchemaPretrain
 from ..util import registry, load_model_from_config, dot_to_object


@@ -30,7 +30,6 @@ def pretrain(
     set_gpu_allocator(allocator)
     nlp = load_model_from_config(config)
     _config = nlp.config.interpolate()
-    T = registry.resolve(_config["training"], schema=ConfigSchemaTraining)
     P = registry.resolve(_config["pretraining"], schema=ConfigSchemaPretrain)
     corpus = dot_to_object(_config, P["corpus"])
     corpus = registry.resolve({"corpus": corpus})["corpus"]
@@ -69,7 +69,7 @@ CONFIG_SECTION_ORDER = ["paths", "variables", "system", "nlp", "components", "co

 logger = logging.getLogger("spacy")
 logger_stream_handler = logging.StreamHandler()
-logger_stream_handler.setFormatter(logging.Formatter('%(message)s'))
+logger_stream_handler.setFormatter(logging.Formatter("%(message)s"))
 logger.addHandler(logger_stream_handler)
@@ -164,7 +164,7 @@ cdef class Vocab:
         if len(string) < 3 or self.length < 10000:
             mem = self.mem
         cdef bint is_oov = mem is not self.mem
-        lex = <LexemeC*>mem.alloc(sizeof(LexemeC), 1)
+        lex = <LexemeC*>mem.alloc(1, sizeof(LexemeC))
         lex.orth = self.strings.add(string)
         lex.length = len(string)
         if self.vectors is not None:
@@ -5,6 +5,7 @@ source: spacy/ml/models
 menu:
   - ['Tok2Vec', 'tok2vec-arch']
   - ['Transformers', 'transformers']
+  - ['Pretraining', 'pretrain']
   - ['Parser & NER', 'parser']
   - ['Tagging', 'tagger']
   - ['Text Classification', 'textcat']

@@ -25,20 +26,20 @@ usage documentation on

 ## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"}

-### spacy.Tok2Vec.v1 {#Tok2Vec}
+### spacy.Tok2Vec.v2 {#Tok2Vec}

 > #### Example config
 >
 > ```ini
 > [model]
-> @architectures = "spacy.Tok2Vec.v1"
+> @architectures = "spacy.Tok2Vec.v2"
 >
 > [model.embed]
 > @architectures = "spacy.CharacterEmbed.v1"
 > # ...
 >
 > [model.encode]
-> @architectures = "spacy.MaxoutWindowEncoder.v1"
+> @architectures = "spacy.MaxoutWindowEncoder.v2"
 > # ...
 > ```

@@ -196,13 +197,13 @@ network to construct a single vector to represent the information.
 | `nC`        | The number of UTF-8 bytes to embed per word. Recommended values are between `3` and `8`, although it may depend on the length of words in the language. ~~int~~ |
 | **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |

-### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder}
+### spacy.MaxoutWindowEncoder.v2 {#MaxoutWindowEncoder}

 > #### Example config
 >
 > ```ini
 > [model]
-> @architectures = "spacy.MaxoutWindowEncoder.v1"
+> @architectures = "spacy.MaxoutWindowEncoder.v2"
 > width = 128
 > window_size = 1
 > maxout_pieces = 3

@@ -220,13 +221,13 @@ and residual connections.
 | `depth`     | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
 | **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |

-### spacy.MishWindowEncoder.v1 {#MishWindowEncoder}
+### spacy.MishWindowEncoder.v2 {#MishWindowEncoder}

 > #### Example config
 >
 > ```ini
 > [model]
-> @architectures = "spacy.MishWindowEncoder.v1"
+> @architectures = "spacy.MishWindowEncoder.v2"
 > width = 64
 > window_size = 1
 > depth = 4

@@ -251,19 +252,19 @@ and residual connections.
 > [model]
 > @architectures = "spacy.TorchBiLSTMEncoder.v1"
 > width = 64
-> window_size = 1
-> depth = 4
+> depth = 2
+> dropout = 0.0
 > ```

 Encode context using bidirectional LSTM layers. Requires
 [PyTorch](https://pytorch.org).

 | Name        | Description |
-| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `width`     | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. ~~int~~ |
-| `window_size` | The number of words to concatenate around each token to construct the convolution. Recommended value is `1`. ~~int~~ |
-| `depth`       | The number of convolutional layers. Recommended value is `4`. ~~int~~ |
+| `depth`     | The number of recurrent layers, for instance `depth=2` results in stacking two LSTMs together. ~~int~~ |
+| `dropout`   | Creates a Dropout layer on the outputs of each LSTM layer except the last layer. Set to 0.0 to disable this functionality. ~~float~~ |
 | **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ |

 ### spacy.StaticVectors.v1 {#StaticVectors}

@@ -426,6 +427,71 @@ one component.
 | `grad_factor` | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. ~~float~~ |
 | **CREATES**   | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |

+## Pretraining architectures {#pretrain source="spacy/ml/models/multi_task.py"}
+
+The spacy `pretrain` command lets you initialize a `Tok2Vec` layer in your
+pipeline with information from raw text. To this end, additional layers are
+added to build a network for a temporary task that forces the `Tok2Vec` layer to
+learn something about sentence structure and word cooccurrence statistics. Two
+pretraining objectives are available, both of which are variants of the cloze
+task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced for
+BERT.
+
+For more information, see the section on
+[pretraining](/usage/embeddings-transformers#pretraining).
+
+### spacy.PretrainVectors.v1 {#pretrain_vectors}
+
+> #### Example config
+>
+> ```ini
+> [pretraining]
+> component = "tok2vec"
+> ...
+>
+> [pretraining.objective]
+> @architectures = "spacy.PretrainVectors.v1"
+> maxout_pieces = 3
+> hidden_size = 300
+> loss = "cosine"
+> ```
+
+Predict the word's vector from a static embeddings table as pretraining
+objective for a Tok2Vec layer.
+
+| Name            | Description |
+| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
+| `hidden_size`   | Size of the hidden layer of the model. ~~int~~ |
+| `loss`          | The loss function can be either "cosine" or "L2". We typically recommend to use "cosine". ~~str~~ |
+| **CREATES**     | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
+
+### spacy.PretrainCharacters.v1 {#pretrain_chars}
+
+> #### Example config
+>
+> ```ini
+> [pretraining]
+> component = "tok2vec"
+> ...
+>
+> [pretraining.objective]
+> @architectures = "spacy.PretrainCharacters.v1"
+> maxout_pieces = 3
+> hidden_size = 300
+> n_characters = 4
+> ```
+
+Predict some number of leading and trailing UTF-8 bytes as pretraining objective
+for a Tok2Vec layer.
+
+| Name            | Description |
+| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ |
+| `hidden_size`   | Size of the hidden layer of the model. ~~int~~ |
+| `n_characters`  | The window of characters - e.g. if `n_characters = 2`, the model will try to predict the first two and last two characters of the word. ~~int~~ |
+| **CREATES**     | A callable function that can create the Model, given the `vocab` of the pipeline and the `tok2vec` layer to pretrain. ~~Callable[[Vocab, Model], Model]~~ |
+
 ## Parser & NER architectures {#parser}

 ### spacy.TransitionBasedParser.v2 {#TransitionBasedParser source="spacy/ml/models/parser.py"}

@@ -534,7 +600,7 @@ specific data and challenge.
 > no_output_layer = false
 >
 > [model.tok2vec]
-> @architectures = "spacy.Tok2Vec.v1"
+> @architectures = "spacy.Tok2Vec.v2"
 >
 > [model.tok2vec.embed]
 > @architectures = "spacy.MultiHashEmbed.v1"

@@ -544,7 +610,7 @@ specific data and challenge.
 > include_static_vectors = false
 >
 > [model.tok2vec.encode]
-> @architectures = "spacy.MaxoutWindowEncoder.v1"
+> @architectures = "spacy.MaxoutWindowEncoder.v2"
 > width = ${model.tok2vec.embed.width}
 > window_size = 1
 > maxout_pieces = 3
@ -61,20 +61,27 @@ markup to copy-paste into
|
||||||
[GitHub issues](https://github.com/explosion/spaCy/issues).
|
[GitHub issues](https://github.com/explosion/spaCy/issues).
|
||||||
|
|
||||||
```cli
|
```cli
|
||||||
$ python -m spacy info [--markdown] [--silent]
|
$ python -m spacy info [--markdown] [--silent] [--exclude]
|
||||||
```
|
```
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```cli
|
||||||
|
> $ python -m spacy info en_core_web_lg --markdown
|
||||||
|
> ```
|
||||||
|
|
||||||
```cli
|
```cli
|
||||||
$ python -m spacy info [model] [--markdown] [--silent]
|
$ python -m spacy info [model] [--markdown] [--silent] [--exclude]
|
||||||
```
|
```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| ------------------------------------------------ | ----------------------------------------------------------------------------------------- |
|
| ------------------------------------------------ | --------------------------------------------------------------------------------------------- |
|
||||||
| `model` | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ |
|
| `model` | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ |
|
||||||
| `--markdown`, `-md` | Print information as Markdown. ~~bool (flag)~~ |
|
| `--markdown`, `-md` | Print information as Markdown. ~~bool (flag)~~ |
|
||||||
| `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | Don't print anything, just return the values. ~~bool (flag)~~ |
|
| `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | Don't print anything, just return the values. ~~bool (flag)~~ |
|
||||||
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
|
| `--exclude`, `-e` | Comma-separated keys to exclude from the print-out. Defaults to `"labels"`. ~~Optional[str]~~ |
|
||||||
| **PRINTS** | Information about your spaCy installation. |
|
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
|
||||||
|
| **PRINTS** | Information about your spaCy installation. |
|
||||||
|
|
||||||
## validate {#validate new="2" tag="command"}
|
## validate {#validate new="2" tag="command"}
|
||||||
|
|
||||||
|
@ -121,7 +128,7 @@ customize those settings in your config file later.
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
```cli
|
```cli
|
||||||
$ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [--gpu] [--pretraining]
|
$ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [--gpu] [--pretraining] [--force]
|
||||||
```
|
```
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
|
@ -132,6 +139,7 @@ $ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [
|
||||||
| `--optimize`, `-o` | `"efficiency"` or `"accuracy"`. Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters. Defaults to `"efficiency"`. ~~str (option)~~ |
|
| `--optimize`, `-o` | `"efficiency"` or `"accuracy"`. Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters. Defaults to `"efficiency"`. ~~str (option)~~ |
|
||||||
| `--gpu`, `-G` | Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters. ~~bool (flag)~~ |
|
| `--gpu`, `-G` | Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters. ~~bool (flag)~~ |
|
||||||
| `--pretraining`, `-pt` | Include config for pretraining (with [`spacy pretrain`](/api/cli#pretrain)). Defaults to `False`. ~~bool (flag)~~ |
|
| `--pretraining`, `-pt` | Include config for pretraining (with [`spacy pretrain`](/api/cli#pretrain)). Defaults to `False`. ~~bool (flag)~~ |
|
||||||
|
| `--force`, `-f` | Force overwriting the output file if it already exists. ~~bool (flag)~~ |
|
||||||
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
|
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
|
||||||
| **CREATES** | The config file for training. |
|
| **CREATES** | The config file for training. |
|
||||||
|
|
||||||
|
@ -783,6 +791,12 @@ in the section `[paths]`.

</Infobox>

> #### Example
>
> ```cli
> $ python -m spacy train config.cfg --output ./output --paths.train ./train --paths.dev ./dev
> ```

```cli
$ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id] [overrides]
```

@ -801,15 +815,16 @@ $ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id]

## pretrain {#pretrain new="2.1" tag="command,experimental"}

Pretrain the "token to vector" ([`Tok2vec`](/api/tok2vec)) layer of pipeline
components on raw text, using an approximate language-modeling objective.
Specifically, we load pretrained vectors, and train a component like a CNN,
BiLSTM, etc. to predict vectors which match the pretrained ones. The weights are
saved to a directory after each epoch. You can then include a **path to one of
these pretrained weights files** in your
[training config](/usage/training#config) as the `init_tok2vec` setting when you
train your pipeline. This technique may be especially helpful if you have little
labelled data. See the usage docs on
[pretraining](/usage/embeddings-transformers#pretraining) for more info. To read
the raw text, a [`JsonlCorpus`](/api/top-level#jsonlcorpus) is typically used.

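To make the raw-text input concrete, the hedged sketch below writes a tiny `data.jsonl` file in the newline-delimited JSON format that `JsonlCorpus` reads, one object per line with a `"text"` key. The sentences are placeholders, and the file name simply matches the example command further down.

```python
# Sketch: write a minimal JSONL file of raw texts for `spacy pretrain`.
# One JSON object per line with a "text" key; the sentences are placeholders.
import json

texts = [
    "Can I ask where you work now and what you do?",
    "They may be old, but they can still dance.",
]
with open("data.jsonl", "w", encoding="utf8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}) + "\n")
```
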
<Infobox title="Changed in v3.0" variant="warning">

@ -823,6 +838,12 @@ auto-generated by setting `--pretraining` on

</Infobox>

> #### Example
>
> ```cli
> $ python -m spacy pretrain config.cfg ./output_pretrain --paths.raw_text ./data.jsonl
> ```

```cli
$ python -m spacy pretrain [config_path] [output_dir] [--code] [--resume-path] [--epoch-resume] [--gpu-id] [overrides]
```

@ -94,7 +94,7 @@ Defines the `nlp` object, its tokenizer and

>
> [components.textcat.model]
> @architectures = "spacy.TextCatBOW.v1"
> exclusive_classes = true
> ngram_size = 1
> no_output_layer = false
> ```

@ -148,7 +148,7 @@ This section defines a **dictionary** mapping of string keys to functions. Each

function takes an `nlp` object and yields [`Example`](/api/example) objects. By
default, the two keys `train` and `dev` are specified and each refers to a
[`Corpus`](/api/top-level#Corpus). When pretraining, an additional `pretrain`
section is added that defaults to a [`JsonlCorpus`](/api/top-level#jsonlcorpus).
You can also register custom functions that return a callable.

| Name | Description |

454 website/docs/api/multilabel_textcategorizer.md Normal file

@ -0,0 +1,454 @@

---
title: Multi-label TextCategorizer
tag: class
source: spacy/pipeline/textcat_multilabel.py
new: 3
teaser: 'Pipeline component for multi-label text classification'
api_base_class: /api/pipe
api_string_name: textcat_multilabel
api_trainable: true
---

The text categorizer predicts **categories over a whole document**. It
learns non-mutually exclusive labels, which means that zero or more labels
may be true per document.

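As a rough end-to-end sketch (the labels, texts and number of update steps are invented for illustration), the snippet below adds the component to a blank pipeline, trains it on a handful of examples and reads the per-label scores from `doc.cats`:

```python
# Minimal sketch: train "textcat_multilabel" on toy data and inspect doc.cats.
# Labels, texts and the number of epochs are illustrative only.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("textcat_multilabel")

train_data = [
    ("I loved it, please get back to me today", {"cats": {"POSITIVE": 1.0, "URGENT": 1.0}}),
    ("It was fine, no rush at all", {"cats": {"POSITIVE": 1.0, "URGENT": 0.0}}),
    ("Terrible experience, whenever you have time", {"cats": {"POSITIVE": 0.0, "URGENT": 0.0}}),
]
examples = [Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data]

# Labels are inferred from the sample passed to initialize
optimizer = nlp.initialize(lambda: examples)
for _ in range(20):
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

doc = nlp("Great product, but I need an answer today")
print(doc.cats)  # independent scores per label; several can be close to 1.0
```

Because the labels are not mutually exclusive, the scores in `doc.cats` are independent and do not need to sum to one.
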
## Config and implementation {#config}

The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
[`config.cfg` for training](/usage/training#config). See the
[model architectures](/api/architectures) documentation for details on the
architectures and their arguments and hyperparameters.

> #### Example
>
> ```python
> from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL
> config = {
>    "threshold": 0.5,
>    "model": DEFAULT_MULTI_TEXTCAT_MODEL,
> }
> nlp.add_pipe("textcat_multilabel", config=config)
> ```

| Setting | Description |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |

```python
%%GITHUB_SPACY/spacy/pipeline/textcat_multilabel.py
```

## MultiLabel_TextCategorizer.\_\_init\_\_ {#init tag="method"}

> #### Example
>
> ```python
> # Construction via add_pipe with default model
> textcat = nlp.add_pipe("textcat_multilabel")
>
> # Construction via add_pipe with custom model
> config = {"model": {"@architectures": "my_textcat"}}
> textcat = nlp.add_pipe("textcat_multilabel", config=config)
>
> # Construction from class
> from spacy.pipeline import MultiLabel_TextCategorizer
> textcat = MultiLabel_TextCategorizer(nlp.vocab, model, threshold=0.5)
> ```

Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#create_pipe).

| Name | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
| _keyword-only_ | |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |

## MultiLabel_TextCategorizer.\_\_call\_\_ {#call tag="method"}

Apply the pipe to one document. The document is modified in place, and returned.
This usually happens under the hood when the `nlp` object is called on a text
and all pipeline components are applied to the `Doc` in order. Both
[`__call__`](/api/multilabel_textcategorizer#call) and [`pipe`](/api/multilabel_textcategorizer#pipe)
delegate to the [`predict`](/api/multilabel_textcategorizer#predict) and
[`set_annotations`](/api/multilabel_textcategorizer#set_annotations) methods.

> #### Example
>
> ```python
> doc = nlp("This is a sentence.")
> textcat = nlp.add_pipe("textcat_multilabel")
> # This usually happens under the hood
> processed = textcat(doc)
> ```

| Name | Description |
| ----------- | -------------------------------- |
| `doc` | The document to process. ~~Doc~~ |
| **RETURNS** | The processed document. ~~Doc~~ |

## MultiLabel_TextCategorizer.pipe {#pipe tag="method"}

Apply the pipe to a stream of documents. This usually happens under the hood
when the `nlp` object is called on a text and all pipeline components are
applied to the `Doc` in order. Both [`__call__`](/api/multilabel_textcategorizer#call) and
[`pipe`](/api/multilabel_textcategorizer#pipe) delegate to the
[`predict`](/api/multilabel_textcategorizer#predict) and
[`set_annotations`](/api/multilabel_textcategorizer#set_annotations) methods.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> for doc in textcat.pipe(docs, batch_size=50):
>     pass
> ```

| Name | Description |
| -------------- | ------------------------------------------------------------- |
| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
| _keyword-only_ | |
| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
| **YIELDS** | The processed documents in order. ~~Doc~~ |

## MultiLabel_TextCategorizer.initialize {#initialize tag="method" new="3"}

Initialize the component for training. `get_examples` should be a function that
returns an iterable of [`Example`](/api/example) objects. The data examples are
used to **initialize the model** of the component and can either be the full
training data or a representative sample. Initialization includes validating the
network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data. This method is typically called
by [`Language.initialize`](/api/language#initialize) and lets you customize
arguments it receives via the
[`[initialize.components]`](/api/data-formats#config-initialize) block in the
config.

<Infobox variant="warning" title="Changed in v3.0" id="begin_training">

This method was previously called `begin_training`.

</Infobox>

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.initialize(lambda: [], nlp=nlp)
> ```
>
> ```ini
> ### config.cfg
> [initialize.components.textcat_multilabel]
>
> [initialize.components.textcat_multilabel.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/textcat.json"
> ```

| Name | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
| _keyword-only_ | |
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |

## MultiLabel_TextCategorizer.predict {#predict tag="method"}

Apply the component's model to a batch of [`Doc`](/api/doc) objects without
modifying them.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict([doc1, doc2])
> ```

| Name | Description |
| ----------- | ------------------------------------------- |
| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
| **RETURNS** | The model's prediction for each document. |

## MultiLabel_TextCategorizer.set_annotations {#set_annotations tag="method"}

Modify a batch of [`Doc`](/api/doc) objects using pre-computed scores.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict(docs)
> textcat.set_annotations(docs, scores)
> ```

| Name | Description |
| -------- | --------------------------------------------------------------------- |
| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
| `scores` | The scores to set, produced by `MultiLabel_TextCategorizer.predict`. |

## MultiLabel_TextCategorizer.update {#update tag="method"}

Learn from a batch of [`Example`](/api/example) objects containing the
predictions and gold-standard annotations, and update the component's model.
Delegates to [`predict`](/api/multilabel_textcategorizer#predict) and
[`get_loss`](/api/multilabel_textcategorizer#get_loss).

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = nlp.initialize()
> losses = textcat.update(examples, sgd=optimizer)
> ```

| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |

## MultiLabel_TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"}

Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model to try to address
the "catastrophic forgetting" problem. This feature is experimental.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = nlp.resume_training()
> losses = textcat.rehearse(examples, sgd=optimizer)
> ```

| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |

## MultiLabel_TextCategorizer.get_loss {#get_loss tag="method"}

Find the loss and gradient of loss for the batch of documents and their
predicted scores.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict([eg.predicted for eg in examples])
> loss, d_loss = textcat.get_loss(examples, scores)
> ```

| Name | Description |
| ----------- | --------------------------------------------------------------------------- |
| `examples` | The batch of examples. ~~Iterable[Example]~~ |
| `scores` | Scores representing the model's predictions. |
| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |

## MultiLabel_TextCategorizer.score {#score tag="method" new="3"}

Score a batch of examples.

> #### Example
>
> ```python
> scores = textcat.score(examples)
> ```

| Name | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------- |
| `examples` | The examples to score. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |

## MultiLabel_TextCategorizer.create_optimizer {#create_optimizer tag="method"}

Create an optimizer for the pipeline component.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = textcat.create_optimizer()
> ```

| Name | Description |
| ----------- | ---------------------------- |
| **RETURNS** | The optimizer. ~~Optimizer~~ |

## MultiLabel_TextCategorizer.use_params {#use_params tag="method, contextmanager"}

Modify the pipe's model to use the given parameter values.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> with textcat.use_params(optimizer.averages):
>     textcat.to_disk("/best_model")
> ```

| Name | Description |
| -------- | -------------------------------------------------- |
| `params` | The parameter values to use in the model. ~~dict~~ |

## MultiLabel_TextCategorizer.add_label {#add_label tag="method"}

Add a new label to the pipe. Raises an error if the output dimension is already
set, or if the model has already been fully [initialized](#initialize). Note
that you don't have to call this method if you provide a **representative data
sample** to the [`initialize`](#initialize) method. In this case, all labels
found in the sample will be automatically added to the model, and the output
dimension will be [inferred](/usage/layers-architectures#thinc-shape-inference)
automatically.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.add_label("MY_LABEL")
> ```

| Name | Description |
| ----------- | ----------------------------------------------------------- |
| `label` | The label to add. ~~str~~ |
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |

## MultiLabel_TextCategorizer.to_disk {#to_disk tag="method"}

Serialize the pipe to disk.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.to_disk("/path/to/textcat")
> ```

| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |

## MultiLabel_TextCategorizer.from_disk {#from_disk tag="method"}

Load the pipe from disk. Modifies the object in place and returns it.

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.from_disk("/path/to/textcat")
> ```

| Name | Description |
| -------------- | ----------------------------------------------------------------------------------------------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The modified `MultiLabel_TextCategorizer` object. ~~MultiLabel_TextCategorizer~~ |

## MultiLabel_TextCategorizer.to_bytes {#to_bytes tag="method"}

> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat_bytes = textcat.to_bytes()
> ```

Serialize the pipe to a bytestring.

| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------- |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The serialized form of the `MultiLabel_TextCategorizer` object. ~~bytes~~ |

## MultiLabel_TextCategorizer.from_bytes {#from_bytes tag="method"}

Load the pipe from a bytestring. Modifies the object in place and returns it.

> #### Example
>
> ```python
> textcat_bytes = textcat.to_bytes()
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.from_bytes(textcat_bytes)
> ```

| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------- |
| `bytes_data` | The data to load from. ~~bytes~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The `MultiLabel_TextCategorizer` object. ~~MultiLabel_TextCategorizer~~ |

## MultiLabel_TextCategorizer.labels {#labels tag="property"}

The labels currently added to the component.

> #### Example
>
> ```python
> textcat.add_label("MY_LABEL")
> assert "MY_LABEL" in textcat.labels
> ```

| Name | Description |
| ----------- | ------------------------------------------------------ |
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

## MultiLabel_TextCategorizer.label_data {#label_data tag="property" new="3"}

The labels currently added to the component and their internal meta information.
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
[`MultiLabel_TextCategorizer.initialize`](/api/multilabel_textcategorizer#initialize) to initialize
the model with a pre-defined label set.

> #### Example
>
> ```python
> labels = textcat.label_data
> textcat.initialize(lambda: [], nlp=nlp, labels=labels)
> ```

| Name | Description |
| ----------- | ---------------------------------------------------------- |
| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |

## Serialization fields {#serialization-fields}

During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.

> #### Example
>
> ```python
> data = textcat.to_disk("/path", exclude=["vocab"])
> ```

| Name | Description |
| ------- | -------------------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |

@ -3,17 +3,15 @@ title: TextCategorizer

tag: class
source: spacy/pipeline/textcat.py
new: 2
teaser: 'Pipeline component for single-label text classification'
api_base_class: /api/pipe
api_string_name: textcat
api_trainable: true
---

The text categorizer predicts **categories over a whole document**. It can learn
one or more labels, and the labels are mutually exclusive - there is exactly one
true label per document.

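By contrast with the multi-label component, a hedged sketch of single-label training data looks like this; exactly one category is `1.0` per example, and the labels and texts are invented:

```python
# Sketch: mutually exclusive annotations for the single-label "textcat".
# Exactly one category is 1.0 per example; labels, texts and epochs are illustrative.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("textcat")

train_data = [
    ("I love this", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("I hate this", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
examples = [Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data]

optimizer = nlp.initialize(lambda: examples)
for _ in range(20):
    nlp.update(examples, sgd=optimizer)

print(nlp("I really love it").cats)  # scores over the labels, roughly summing to one
```
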
## Config and implementation {#config}

@ -27,10 +25,10 @@ architectures and their arguments and hyperparameters.

> #### Example
>
> ```python
> from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
> config = {
>    "threshold": 0.5,
>    "model": DEFAULT_SINGLE_TEXTCAT_MODEL,
> }
> nlp.add_pipe("textcat", config=config)
> ```

@ -280,7 +278,6 @@ Score a batch of examples.

| ---------------- | -------------------------------------------------------------------------------------------------------------------- |
| `examples` | The examples to score. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |

## TextCategorizer.create_optimizer {#create_optimizer tag="method"}

@ -129,13 +129,13 @@ the entity recognizer, use a

factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"

[components.ner]
factory = "ner"

@ -161,13 +161,13 @@ factory = "ner"

@architectures = "spacy.TransitionBasedParser.v1"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
```

<!-- TODO: Once rehearsal is tested, mention it here. -->

@ -713,34 +713,39 @@ layer = "tok2vec"

#### Pretraining objectives {#pretraining-details}

> ```ini
> ### Characters objective
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
> n_characters = 4
> ```
>
> ```ini
> ### Vectors objective
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
> loss = "cosine"
> ```

Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.

- [`PretrainCharacters`](/api/architectures#pretrain_chars): The `"characters"`
  objective asks the model to predict some number of leading and trailing UTF-8
  bytes for the words. For instance, if you set `n_characters = 2`, the model
  will try to predict the first two and last two characters of the word.
- [`PretrainVectors`](/api/architectures#pretrain_vectors): The `"vectors"`
  objective asks the model to predict the word's vector, from a static
  embeddings table. This requires a word vectors model to be trained and loaded.
  The vectors objective can optimize either a cosine or an L2 loss. We've
  generally found cosine loss to perform better.

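As a hedged illustration of switching between the two objectives programmatically, the sketch below loads an existing config that already contains a `[pretraining]` block (for example one generated with `--pretraining`) and replaces the objective with the characters variant, using the same keys as the example above; the file names are placeholders.

```python
# Sketch: swap the pretraining objective in an existing config.
# Assumes "config.cfg" was generated with --pretraining, so a [pretraining]
# block exists. File names and values are illustrative.
from spacy.util import load_config

config = load_config("config.cfg")
config["pretraining"]["objective"] = {
    "@architectures": "spacy.PretrainCharacters.v1",
    "maxout_pieces": 3,
    "hidden_size": 300,
    "n_characters": 4,
}
config.to_disk("config_chars.cfg")
```
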
These pretraining objectives use a trick that we term **language modelling with
approximate outputs (LMAO)**. The motivation for the trick is that predicting an

@ -134,7 +134,7 @@ labels = []

nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"

@ -144,7 +144,7 @@ attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]

include_static_vectors = false

[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = ${components.textcat.model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3

@ -152,7 +152,7 @@ depth = 2

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
```

@ -170,7 +170,7 @@ labels = []

[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null

@ -201,14 +201,14 @@ tokens, and their combination forms a typical

factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
# ...

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
# ...
```

@ -224,7 +224,7 @@ architecture:

# ...

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
# ...
```

@ -716,7 +716,7 @@ that we want to classify as being related or not. As these candidate pairs are

typically formed within one document, this function takes a [`Doc`](/api/doc) as
input and outputs a `List` of `Span` tuples. For instance, the following
implementation takes any two entities from the same document, as long as they
are within a **maximum distance** (in number of tokens) of each other:

> #### config.cfg (excerpt)
>

@ -742,7 +742,7 @@ def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]

    return get_candidates
```

This function is added to the [`@misc` registry](/api/top-level#registry) so we
can refer to it from the config, and easily swap it out for any other candidate
generation function.

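Since the full function body is elided by the hunk above, here is a hedged sketch of what such a registered candidate generator can look like; the registry name `"rel_instance_generator.v1"` and the pairing logic are illustrative, not necessarily the project's actual implementation.

```python
# Illustrative sketch of a registered candidate generator (names are hypothetical).
from typing import Callable, List, Tuple

import spacy
from spacy.tokens import Doc, Span


@spacy.registry.misc("rel_instance_generator.v1")  # hypothetical registry name
def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
    def get_candidates(doc: Doc) -> List[Tuple[Span, Span]]:
        candidates = []
        for ent1 in doc.ents:
            for ent2 in doc.ents:
                # Pair two distinct entities that are close enough to each other
                if ent1 != ent2 and abs(ent2.start - ent1.start) <= max_length:
                    candidates.append((ent1, ent2))
        return candidates

    return get_candidates
```
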
@ -1060,7 +1060,7 @@ In this example we assume a custom function `read_custom_data` which loads or

generates texts with relevant text classification annotations. Then, small
lexical variations of the input text are created before generating the final
[`Example`](/api/example) objects. The `@spacy.registry.readers` decorator lets
you register the function creating the custom reader in the `readers`
[registry](/api/top-level#registry) and assign it a string name, so it can be
used in your config. All arguments on the registered function become available
as **config settings** – in this case, `source`.

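The surrounding code is not shown in this hunk, so the following is only a hedged sketch of such a registered reader; the registry name `"corpus_variants.v1"`, the stub `read_custom_data` loader and the lowercasing "variation" are stand-ins for the real implementation.

```python
# Hedged sketch of a custom corpus reader; names and data are illustrative.
from typing import Callable, Iterable

import spacy
from spacy.language import Language
from spacy.training import Example


def read_custom_data(source: str):
    # Stand-in for the custom loader assumed in the docs: yields (text, cats) pairs.
    yield "This is amazing", {"POSITIVE": 1.0, "NEGATIVE": 0.0}


@spacy.registry.readers("corpus_variants.v1")  # hypothetical registry name
def stream_data(source: str) -> Callable[[Language], Iterable[Example]]:
    def reader(nlp: Language) -> Iterable[Example]:
        for text, cats in read_custom_data(source):
            # Create a small lexical variation of each input text
            for variant in (text, text.lower()):
                doc = nlp.make_doc(variant)
                yield Example.from_dict(doc, {"cats": cats})

    return reader
```
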