Merge branch 'master' into spacy.io

Ines Montani 2021-03-10 12:22:21 +11:00
commit c32cbac14f
29 changed files with 898 additions and 687 deletions

@ -1,21 +0,0 @@
---
name: "\U000023F3 Installation Problem"
about: Do you have problems installing spaCy, and none of the suggestions in the docs
and other issues helped?
---
<!-- Before submitting an issue, make sure to check the docs and closed issues to see if any of the solutions work for you. Installation problems can often be related to Python environment issues and problems with compilation. -->
## How to reproduce the problem
<!-- Include the details of how the problem occurred. Which command did you run to install spaCy? Did you come across an error? What else did you try? -->
```bash
# copy-paste the error message here
```
## Your Environment
<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here.-->
* Operating System:
* Python Version Used:
* spaCy Version Used:
* Environment Information:

106
.github/contributors/jankrepl.md vendored Normal file
@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jan Krepl |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2021-03-09 |
| GitHub username | jankrepl |
| Website (optional) | |

@ -1,6 +1,6 @@
# fmt: off # fmt: off
__title__ = "spacy" __title__ = "spacy"
__version__ = "3.0.3" __version__ = "3.0.4"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download" __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects" __projects__ = "https://github.com/explosion/projects"

@ -147,6 +147,11 @@ class Warnings:
"will be included in the results. For better results, token " "will be included in the results. For better results, token "
"patterns should return matches that are each exactly one token " "patterns should return matches that are each exactly one token "
"long.") "long.")
W111 = ("Jupyter notebook detected: if using `prefer_gpu()` or "
"`require_gpu()`, include it in the same cell right before "
"`spacy.load()` to ensure that the model is loaded on the correct "
"device. More information: "
"http://spacy.io/usage/v3#jupyter-notebook-gpu")
@add_codes @add_codes
@ -487,13 +492,20 @@ class Errors:
E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.") E202 = ("Unsupported alignment mode '{mode}'. Supported modes: {modes}.")
# New errors added in v3.x # New errors added in v3.x
E874 = ("Could not initialize the tok2vec model from component "
"'{component}' and layer '{layer}'.")
E875 = ("To use the PretrainVectors objective, make sure that static vectors are loaded. "
"In the config, these are defined by the initialize.vectors setting.")
E879 = ("Unexpected type for 'spans' data. Provide a dictionary mapping keys to " E879 = ("Unexpected type for 'spans' data. Provide a dictionary mapping keys to "
"a list of spans, with each span represented by a tuple (start_char, end_char). " "a list of spans, with each span represented by a tuple (start_char, end_char). "
"The tuple can be optionally extended with a label and a KB ID.") "The tuple can be optionally extended with a label and a KB ID.")
E880 = ("The 'wandb' library could not be found - did you install it? " E880 = ("The 'wandb' library could not be found - did you install it? "
"Alternatively, specify the 'ConsoleLogger' in the 'training.logger' " "Alternatively, specify the 'ConsoleLogger' in the 'training.logger' "
"config section, instead of the 'WandbLogger'.") "config section, instead of the 'WandbLogger'.")
E884 = ("The pipeline could not be initialized because the vectors "
"could not be found at '{vectors}'. If your pipeline was already "
"initialized/trained before, call 'resume_training' instead of 'initialize', "
"or initialize only the components that are new.")
E885 = ("entity_linker.set_kb received an invalid 'kb_loader' argument: expected " E885 = ("entity_linker.set_kb received an invalid 'kb_loader' argument: expected "
"a callable function, but got: {arg_type}") "a callable function, but got: {arg_type}")
E886 = ("Can't replace {name} -> {tok2vec} listeners: path '{path}' not " E886 = ("Can't replace {name} -> {tok2vec} listeners: path '{path}' not "
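The E884 entry above is a format-string template with a `{vectors}` placeholder; the `language.py` hunk in this commit fills it in when `init_vocab` raises an `IOError`. The wrap-and-re-raise pattern can be sketched in plain Python as follows. This is a minimal illustration, not spaCy's implementation: `load_vectors` and `initialize` are hypothetical stand-ins, and only the message text is taken from the diff.

```python
# Message template copied from the E884 entry in the diff above.
E884 = (
    "The pipeline could not be initialized because the vectors "
    "could not be found at '{vectors}'. If your pipeline was already "
    "initialized/trained before, call 'resume_training' instead of 'initialize', "
    "or initialize only the components that are new."
)


def load_vectors(path):
    # Hypothetical stand-in for the real loader: it always fails,
    # the way a missing vectors directory would.
    raise IOError(f"No such file or directory: {path}")


def initialize(vectors_path):
    # Hypothetical stand-in for the initialization step that loads vectors.
    try:
        load_vectors(vectors_path)
    except IOError:
        # Replace the low-level error with one that names the path
        # and suggests what to do next.
        raise IOError(E884.format(vectors=vectors_path))


try:
    initialize("missing/vectors")
except IOError as err:
    message = str(err)
    print(message)
```

The caller still receives an `IOError`, so existing `except IOError` handlers keep working; only the message changes.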

@ -22,6 +22,7 @@ from .training.initialize import init_vocab, init_tok2vec
from .scorer import Scorer from .scorer import Scorer
from .util import registry, SimpleFrozenList, _pipe, raise_error from .util import registry, SimpleFrozenList, _pipe, raise_error
from .util import SimpleFrozenDict, combine_score_weights, CONFIG_SECTION_ORDER from .util import SimpleFrozenDict, combine_score_weights, CONFIG_SECTION_ORDER
from .util import warn_if_jupyter_cupy
from .lang.tokenizer_exceptions import URL_MATCH, BASE_EXCEPTIONS from .lang.tokenizer_exceptions import URL_MATCH, BASE_EXCEPTIONS
from .lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES from .lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
from .lang.punctuation import TOKENIZER_INFIXES from .lang.punctuation import TOKENIZER_INFIXES
@ -1219,13 +1220,12 @@ class Language:
before_init = I["before_init"] before_init = I["before_init"]
if before_init is not None: if before_init is not None:
before_init(self) before_init(self)
try:
init_vocab( init_vocab(
self, data=I["vocab_data"], lookups=I["lookups"], vectors=I["vectors"] self, data=I["vocab_data"], lookups=I["lookups"], vectors=I["vectors"]
) )
pretrain_cfg = config.get("pretraining") except IOError:
if pretrain_cfg: raise IOError(Errors.E884.format(vectors=I["vectors"]))
P = registry.resolve(pretrain_cfg, schema=ConfigSchemaPretrain)
init_tok2vec(self, P, I)
if self.vocab.vectors.data.shape[1] >= 1: if self.vocab.vectors.data.shape[1] >= 1:
ops = get_current_ops() ops = get_current_ops()
self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data) self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data)
@ -1244,6 +1244,10 @@ class Language:
proc.initialize, p_settings, section="components", name=name proc.initialize, p_settings, section="components", name=name
) )
proc.initialize(get_examples, nlp=self, **p_settings) proc.initialize(get_examples, nlp=self, **p_settings)
pretrain_cfg = config.get("pretraining")
if pretrain_cfg:
P = registry.resolve(pretrain_cfg, schema=ConfigSchemaPretrain)
init_tok2vec(self, P, I)
self._link_components() self._link_components()
self._optimizer = sgd self._optimizer = sgd
if sgd is not None: if sgd is not None:
@ -1592,6 +1596,7 @@ class Language:
# using the nlp.config with all defaults. # using the nlp.config with all defaults.
config = util.copy_config(config) config = util.copy_config(config)
orig_pipeline = config.pop("components", {}) orig_pipeline = config.pop("components", {})
orig_pretraining = config.pop("pretraining", None)
config["components"] = {} config["components"] = {}
if auto_fill: if auto_fill:
filled = registry.fill(config, validate=validate, schema=ConfigSchema) filled = registry.fill(config, validate=validate, schema=ConfigSchema)
@ -1599,6 +1604,9 @@ class Language:
filled = config filled = config
filled["components"] = orig_pipeline filled["components"] = orig_pipeline
config["components"] = orig_pipeline config["components"] = orig_pipeline
if orig_pretraining is not None:
filled["pretraining"] = orig_pretraining
config["pretraining"] = orig_pretraining
resolved_nlp = registry.resolve( resolved_nlp = registry.resolve(
filled["nlp"], validate=validate, schema=ConfigSchemaNlp filled["nlp"], validate=validate, schema=ConfigSchemaNlp
) )
@ -1615,6 +1623,10 @@ class Language:
or lang_cls is not cls or lang_cls is not cls
): ):
raise ValueError(Errors.E943.format(value=type(lang_cls))) raise ValueError(Errors.E943.format(value=type(lang_cls)))
# Warn about require_gpu usage in jupyter notebook
warn_if_jupyter_cupy()
# Note that we don't load vectors here, instead they get loaded explicitly # Note that we don't load vectors here, instead they get loaded explicitly
# inside stuff like the spacy train function. If we loaded them here, # inside stuff like the spacy train function. If we loaded them here,
# then we would load them twice at runtime: once when we make from config, # then we would load them twice at runtime: once when we make from config,
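The `from_config` hunks above set the `pretraining` block aside before the config is filled and put it back afterwards, the same way `components` is already handled. A minimal sketch of that pop-and-restore pattern with plain dictionaries follows. `fill_defaults` is a hypothetical stand-in for `registry.fill`, and the default it adds is for illustration only.

```python
import copy


def fill_defaults(config):
    # Hypothetical stand-in for registry.fill: returns a copy of the
    # config with one default value added to the "nlp" section.
    filled = copy.deepcopy(config)
    filled.setdefault("nlp", {}).setdefault("batch_size", 1000)
    return filled


def from_config_sketch(config):
    config = copy.deepcopy(config)
    # Set both sections aside so the fill step never sees them.
    orig_pipeline = config.pop("components", {})
    orig_pretraining = config.pop("pretraining", None)
    config["components"] = {}
    filled = fill_defaults(config)
    filled["components"] = orig_pipeline
    # Restore the block only if the user actually provided one.
    if orig_pretraining is not None:
        filled["pretraining"] = orig_pretraining
    return filled


with_pretraining = from_config_sketch(
    {"nlp": {"lang": "en"}, "pretraining": {"max_epochs": 5}}
)
without_pretraining = from_config_sketch({"nlp": {"lang": "en"}})
print(with_pretraining)
print(without_pretraining)
```

Using `None` as the pop default keeps "no pretraining block" distinct from "empty pretraining block", so a config without the section does not gain one.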

@ -21,6 +21,8 @@ def create_pretrain_vectors(
maxout_pieces: int, hidden_size: int, loss: str maxout_pieces: int, hidden_size: int, loss: str
) -> Callable[["Vocab", Model], Model]: ) -> Callable[["Vocab", Model], Model]:
def create_vectors_objective(vocab: "Vocab", tok2vec: Model) -> Model: def create_vectors_objective(vocab: "Vocab", tok2vec: Model) -> Model:
if vocab.vectors.data.shape[1] == 0:
raise ValueError(Errors.E875)
model = build_cloze_multi_task_model( model = build_cloze_multi_task_model(
vocab, tok2vec, hidden_size=hidden_size, maxout_pieces=maxout_pieces vocab, tok2vec, hidden_size=hidden_size, maxout_pieces=maxout_pieces
) )
@ -134,7 +136,7 @@ def build_cloze_characters_multi_task_model(
) -> Model: ) -> Model:
output_layer = chain( output_layer = chain(
list2array(), list2array(),
Maxout(hidden_size, nP=maxout_pieces), Maxout(nO=hidden_size, nP=maxout_pieces),
LayerNorm(nI=hidden_size), LayerNorm(nI=hidden_size),
MultiSoftmax([256] * nr_char, nI=hidden_size), MultiSoftmax([256] * nr_char, nI=hidden_size),
) )

@ -195,7 +195,7 @@ class EntityRuler(Pipe):
all_labels.add(label) all_labels.add(label)
else: else:
all_labels.add(l) all_labels.add(l)
return tuple(all_labels) return tuple(sorted(all_labels))
def initialize( def initialize(
self, self,
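The one-line change above wraps the label set in `sorted()` before converting it to a tuple. Iteration order over a Python set of strings is not guaranteed to be stable across runs, so sorting makes the returned tuple independent of insertion order and hash seed. A small sketch, where `collect_labels` is a hypothetical helper and not the `EntityRuler` property itself:

```python
def collect_labels(pattern_labels):
    # Gather unique labels in a set, as the EntityRuler code above does.
    all_labels = set()
    for label in pattern_labels:
        all_labels.add(label)
    # Sorting gives the same tuple no matter how the set was built.
    return tuple(sorted(all_labels))


labels_a = collect_labels(["PERSON", "ORG", "GPE", "ORG"])
labels_b = collect_labels(["GPE", "PERSON", "ORG"])
print(labels_a)
print(labels_a == labels_b)
```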

@ -88,11 +88,9 @@ subword_features = true
def make_textcat( def make_textcat(
nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float
) -> "TextCategorizer": ) -> "TextCategorizer":
"""Create a TextCategorizer compoment. The text categorizer predicts categories """Create a TextCategorizer component. The text categorizer predicts categories
over a whole document. It can learn one or more labels, and the labels can over a whole document. It can learn one or more labels, and the labels are considered
be mutually exclusive (i.e. one true label per doc) or non-mutually exclusive to be mutually exclusive (i.e. one true label per doc).
(i.e. zero or more labels may be true per doc). The multi-label setting is
controlled by the model instance that's provided.
model (Model[List[Doc], List[Floats2d]]): A model instance that predicts model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
scores for each category. scores for each category.
@ -317,9 +315,11 @@ class TextCategorizer(TrainablePipe):
get_examples (Callable[[], Iterable[Example]]): Function that get_examples (Callable[[], Iterable[Example]]): Function that
returns a representative sample of gold-standard Example objects. returns a representative sample of gold-standard Example objects.
nlp (Language): The current nlp object the component is part of. nlp (Language): The current nlp object the component is part of.
labels: The labels to add to the component, typically generated by the labels (Optional[Iterable[str]]): The labels to add to the component, typically generated by the
`init labels` command. If no labels are provided, the get_examples `init labels` command. If no labels are provided, the get_examples
callback is used to extract the labels from the data. callback is used to extract the labels from the data.
positive_label (Optional[str]): The positive label for a binary task with exclusive classes,
`None` otherwise and by default.
DOCS: https://spacy.io/api/textcategorizer#initialize DOCS: https://spacy.io/api/textcategorizer#initialize
""" """
@ -358,13 +358,13 @@ class TextCategorizer(TrainablePipe):
""" """
validate_examples(examples, "TextCategorizer.score") validate_examples(examples, "TextCategorizer.score")
self._validate_categories(examples) self._validate_categories(examples)
kwargs.setdefault("threshold", self.cfg["threshold"])
kwargs.setdefault("positive_label", self.cfg["positive_label"])
return Scorer.score_cats( return Scorer.score_cats(
examples, examples,
"cats", "cats",
labels=self.labels, labels=self.labels,
multi_label=False, multi_label=False,
positive_label=self.cfg["positive_label"],
threshold=self.cfg["threshold"],
**kwargs, **kwargs,
) )
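The hunk above moves `threshold` and `positive_label` out of the explicit keyword arguments and into `kwargs.setdefault(...)`. In plain Python, passing a keyword explicitly while the same key is also present in `**kwargs` raises a `TypeError`, so the old form could not accept a caller-supplied override. The sketch below shows the difference. `score_cats`, `OldComponent` and `NewComponent` are hypothetical stand-ins, not spaCy classes.

```python
def score_cats(examples, threshold=0.5, positive_label=None):
    # Hypothetical stand-in for Scorer.score_cats: echo the settings used.
    return {"threshold": threshold, "positive_label": positive_label}


class OldComponent:
    cfg = {"threshold": 0.5, "positive_label": None}

    def score(self, examples, **kwargs):
        # Old style: fixed keywords plus **kwargs. A caller-supplied
        # "threshold" collides with the explicit keyword.
        return score_cats(
            examples,
            positive_label=self.cfg["positive_label"],
            threshold=self.cfg["threshold"],
            **kwargs,
        )


class NewComponent:
    cfg = {"threshold": 0.5, "positive_label": None}

    def score(self, examples, **kwargs):
        # New style: the component's settings are only defaults.
        kwargs.setdefault("threshold", self.cfg["threshold"])
        kwargs.setdefault("positive_label", self.cfg["positive_label"])
        return score_cats(examples, **kwargs)


try:
    OldComponent().score([], threshold=1.0)
    old_error = None
except TypeError as err:
    old_error = err

overridden = NewComponent().score([], threshold=1.0)
defaults = NewComponent().score([])
print(old_error)
print(overridden, defaults)
```

With `setdefault`, the value from the component's `cfg` is used only when the caller did not pass one.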

@ -88,11 +88,10 @@ subword_features = true
def make_multilabel_textcat( def make_multilabel_textcat(
nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float
) -> "TextCategorizer": ) -> "TextCategorizer":
"""Create a TextCategorizer compoment. The text categorizer predicts categories """Create a TextCategorizer component. The text categorizer predicts categories
over a whole document. It can learn one or more labels, and the labels can over a whole document. It can learn one or more labels, and the labels are considered
be mutually exclusive (i.e. one true label per doc) or non-mutually exclusive to be non-mutually exclusive, which means that there can be zero or more labels
(i.e. zero or more labels may be true per doc). The multi-label setting is per doc).
controlled by the model instance that's provided.
model (Model[List[Doc], List[Floats2d]]): A model instance that predicts model (Model[List[Doc], List[Floats2d]]): A model instance that predicts
scores for each category. scores for each category.
@ -104,7 +103,7 @@ def make_multilabel_textcat(
class MultiLabel_TextCategorizer(TextCategorizer): class MultiLabel_TextCategorizer(TextCategorizer):
"""Pipeline component for multi-label text classification. """Pipeline component for multi-label text classification.
DOCS: https://spacy.io/api/multilabel_textcategorizer DOCS: https://spacy.io/api/textcategorizer
""" """
def __init__( def __init__(
@ -123,7 +122,7 @@ class MultiLabel_TextCategorizer(TextCategorizer):
losses during training. losses during training.
threshold (float): Cutoff to consider a prediction "positive". threshold (float): Cutoff to consider a prediction "positive".
DOCS: https://spacy.io/api/multilabel_textcategorizer#init DOCS: https://spacy.io/api/textcategorizer#init
""" """
self.vocab = vocab self.vocab = vocab
self.model = model self.model = model
@ -149,7 +148,7 @@ class MultiLabel_TextCategorizer(TextCategorizer):
`init labels` command. If no labels are provided, the get_examples `init labels` command. If no labels are provided, the get_examples
callback is used to extract the labels from the data. callback is used to extract the labels from the data.
DOCS: https://spacy.io/api/multilabel_textcategorizer#initialize DOCS: https://spacy.io/api/textcategorizer#initialize
""" """
validate_get_examples(get_examples, "MultiLabel_TextCategorizer.initialize") validate_get_examples(get_examples, "MultiLabel_TextCategorizer.initialize")
if labels is None: if labels is None:
@ -173,15 +172,15 @@ class MultiLabel_TextCategorizer(TextCategorizer):
examples (Iterable[Example]): The examples to score. examples (Iterable[Example]): The examples to score.
RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats. RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.
DOCS: https://spacy.io/api/multilabel_textcategorizer#score DOCS: https://spacy.io/api/textcategorizer#score
""" """
validate_examples(examples, "MultiLabel_TextCategorizer.score") validate_examples(examples, "MultiLabel_TextCategorizer.score")
kwargs.setdefault("threshold", self.cfg["threshold"])
return Scorer.score_cats( return Scorer.score_cats(
examples, examples,
"cats", "cats",
labels=self.labels, labels=self.labels,
multi_label=True, multi_label=True,
threshold=self.cfg["threshold"],
**kwargs, **kwargs,
) )

@ -370,3 +370,51 @@ def test_textcat_evaluation():
assert scores["cats_micro_p"] == 4 / 5 assert scores["cats_micro_p"] == 4 / 5
assert scores["cats_micro_r"] == 4 / 6 assert scores["cats_micro_r"] == 4 / 6
def test_textcat_threshold():
# Ensure the scorer can be called with a different threshold
nlp = English()
nlp.add_pipe("textcat")
train_examples = []
for text, annotations in TRAIN_DATA_SINGLE_LABEL:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
nlp.initialize(get_examples=lambda: train_examples)
# score the model (it's not actually trained but that doesn't matter)
scores = nlp.evaluate(train_examples)
assert 0 <= scores["cats_score"] <= 1
scores = nlp.evaluate(train_examples, scorer_cfg={"threshold": 1.0})
assert scores["cats_f_per_type"]["POSITIVE"]["r"] == 0
scores = nlp.evaluate(train_examples, scorer_cfg={"threshold": 0})
macro_f = scores["cats_score"]
assert scores["cats_f_per_type"]["POSITIVE"]["r"] == 1.0
scores = nlp.evaluate(train_examples, scorer_cfg={"threshold": 0, "positive_label": "POSITIVE"})
pos_f = scores["cats_score"]
assert scores["cats_f_per_type"]["POSITIVE"]["r"] == 1.0
assert pos_f > macro_f
def test_textcat_multi_threshold():
# Ensure the scorer can be called with a different threshold
nlp = English()
nlp.add_pipe("textcat_multilabel")
train_examples = []
for text, annotations in TRAIN_DATA_SINGLE_LABEL:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
nlp.initialize(get_examples=lambda: train_examples)
# score the model (it's not actually trained but that doesn't matter)
scores = nlp.evaluate(train_examples)
assert 0 <= scores["cats_score"] <= 1
scores = nlp.evaluate(train_examples, scorer_cfg={"threshold": 1.0})
assert scores["cats_f_per_type"]["POSITIVE"]["r"] == 0
scores = nlp.evaluate(train_examples, scorer_cfg={"threshold": 0})
assert scores["cats_f_per_type"]["POSITIVE"]["r"] == 1.0
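The two new tests above rely on the extremes of the threshold: at `1.0` nothing is predicted positive, so recall is 0, and at `0` everything is, so recall is 1. The arithmetic can be sketched without spaCy. This assumes a label counts as predicted when its score is at least the threshold, and that model scores stay strictly below 1.0. `recall_at` is a hypothetical helper and the scores are made up.

```python
def recall_at(threshold, scores, gold):
    # Assumption: a label is predicted positive when score >= threshold.
    predicted = [score >= threshold for score in scores]
    true_pos = sum(1 for p, g in zip(predicted, gold) if p and g)
    false_neg = sum(1 for p, g in zip(predicted, gold) if not p and g)
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


# Made-up scores for one label over four docs, all strictly below 1.0.
scores = [0.3, 0.6, 0.9, 0.45]
gold = [True, False, True, True]

recall_high = recall_at(1.0, scores, gold)  # nothing reaches the threshold
recall_low = recall_at(0.0, scores, gold)   # everything reaches the threshold
print(recall_high, recall_low)
```

This is why the tests can assert exact recall values on an untrained model: the result at either extreme does not depend on what the model has learned.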

@ -293,7 +293,7 @@ def test_serialize_parser(parser_config_string):
def test_config_nlp_roundtrip(): def test_config_nlp_roundtrip():
"""Test that a config prduced by the nlp object passes training config """Test that a config produced by the nlp object passes training config
validation.""" validation."""
nlp = English() nlp = English()
nlp.add_pipe("entity_ruler") nlp.add_pipe("entity_ruler")

@ -4,7 +4,7 @@ from spacy.training import docs_to_json, offsets_to_biluo_tags
from spacy.training.converters import iob_to_docs, conll_ner_to_docs, conllu_to_docs from spacy.training.converters import iob_to_docs, conll_ner_to_docs, conllu_to_docs
from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate
from spacy.lang.nl import Dutch from spacy.lang.nl import Dutch
from spacy.util import ENV_VARS from spacy.util import ENV_VARS, load_model_from_config
from spacy.cli import info from spacy.cli import info
from spacy.cli.init_config import init_config, RECOMMENDATIONS from spacy.cli.init_config import init_config, RECOMMENDATIONS
from spacy.cli._util import validate_project_commands, parse_config_overrides from spacy.cli._util import validate_project_commands, parse_config_overrides
@ -397,10 +397,14 @@ def test_parse_cli_overrides():
"pipeline", [["tagger", "parser", "ner"], [], ["ner", "textcat", "sentencizer"]] "pipeline", [["tagger", "parser", "ner"], [], ["ner", "textcat", "sentencizer"]]
) )
@pytest.mark.parametrize("optimize", ["efficiency", "accuracy"]) @pytest.mark.parametrize("optimize", ["efficiency", "accuracy"])
def test_init_config(lang, pipeline, optimize): @pytest.mark.parametrize("pretraining", [True, False])
def test_init_config(lang, pipeline, optimize, pretraining):
# TODO: add more tests and also check for GPU with transformers # TODO: add more tests and also check for GPU with transformers
config = init_config(lang=lang, pipeline=pipeline, optimize=optimize, gpu=False) config = init_config(lang=lang, pipeline=pipeline, optimize=optimize, pretraining=pretraining, gpu=False)
assert isinstance(config, Config) assert isinstance(config, Config)
if pretraining:
config["paths"]["raw_text"] = "my_data.jsonl"
nlp = load_model_from_config(config, auto_fill=True)
def test_model_recommendations(): def test_model_recommendations():

@ -38,19 +38,59 @@ def doc(nlp):
@pytest.mark.filterwarnings("ignore::UserWarning") @pytest.mark.filterwarnings("ignore::UserWarning")
def test_make_orth_variants(nlp, doc): def test_make_orth_variants(nlp):
single = [ single = [
{"tags": ["NFP"], "variants": ["", "..."]}, {"tags": ["NFP"], "variants": ["", "..."]},
{"tags": [":"], "variants": ["-", "", "", "--", "---", "——"]}, {"tags": [":"], "variants": ["-", "", "", "--", "---", "——"]},
] ]
# fmt: off
words = ["\n\n", "A", "\t", "B", "a", "b", "", "...", "-", "", "", "--", "---", "——"]
tags = ["_SP", "NN", "\t", "NN", "NN", "NN", "NFP", "NFP", ":", ":", ":", ":", ":", ":"]
# fmt: on
spaces = [True] * len(words)
spaces[0] = False
spaces[2] = False
doc = Doc(nlp.vocab, words=words, spaces=spaces, tags=tags)
augmenter = create_orth_variants_augmenter( augmenter = create_orth_variants_augmenter(
level=0.2, lower=0.5, orth_variants={"single": single} level=0.2, lower=0.5, orth_variants={"single": single}
) )
with make_docbin([doc]) as output_file: with make_docbin([doc] * 10) as output_file:
reader = Corpus(output_file, augmenter=augmenter) reader = Corpus(output_file, augmenter=augmenter)
# Due to randomness, only test that it works without errors for now # Due to randomness, only test that it works without errors
list(reader(nlp)) list(reader(nlp))
# check that the following settings lowercase everything
augmenter = create_orth_variants_augmenter(
level=1.0, lower=1.0, orth_variants={"single": single}
)
with make_docbin([doc] * 10) as output_file:
reader = Corpus(output_file, augmenter=augmenter)
for example in reader(nlp):
for token in example.reference:
assert token.text == token.text.lower()
# check that lowercasing is applied without tags
doc = Doc(nlp.vocab, words=words, spaces=[True] * len(words))
augmenter = create_orth_variants_augmenter(
level=1.0, lower=1.0, orth_variants={"single": single}
)
with make_docbin([doc] * 10) as output_file:
reader = Corpus(output_file, augmenter=augmenter)
for example in reader(nlp):
for ex_token, doc_token in zip(example.reference, doc):
assert ex_token.text == doc_token.text.lower()
# check that no lowercasing is applied with lower=0.0
doc = Doc(nlp.vocab, words=words, spaces=[True] * len(words))
augmenter = create_orth_variants_augmenter(
level=1.0, lower=0.0, orth_variants={"single": single}
)
with make_docbin([doc] * 10) as output_file:
reader = Corpus(output_file, augmenter=augmenter)
for example in reader(nlp):
for ex_token, doc_token in zip(example.reference, doc):
assert ex_token.text == doc_token.text
def test_lowercase_augmenter(nlp, doc): def test_lowercase_augmenter(nlp, doc):
augmenter = create_lower_casing_augmenter(level=1.0) augmenter = create_lower_casing_augmenter(level=1.0)
@ -66,6 +106,21 @@ def test_lowercase_augmenter(nlp, doc):
assert ref_ent.text == orig_ent.text.lower() assert ref_ent.text == orig_ent.text.lower()
assert [t.pos_ for t in eg.reference] == [t.pos_ for t in doc] assert [t.pos_ for t in eg.reference] == [t.pos_ for t in doc]
# check that augmentation works when lowercasing leads to different
# predicted tokenization
words = ["A", "B", "CCC."]
doc = Doc(nlp.vocab, words=words)
with make_docbin([doc]) as output_file:
reader = Corpus(output_file, augmenter=augmenter)
corpus = list(reader(nlp))
eg = corpus[0]
assert eg.reference.text == doc.text.lower()
assert eg.predicted.text == doc.text.lower()
assert [t.text for t in eg.reference] == [t.lower() for t in words]
assert [t.text for t in eg.predicted] == [
t.text for t in nlp.make_doc(doc.text.lower())
]
@pytest.mark.filterwarnings("ignore::UserWarning") @pytest.mark.filterwarnings("ignore::UserWarning")
def test_custom_data_augmentation(nlp, doc): def test_custom_data_augmentation(nlp, doc):

@ -0,0 +1,345 @@
from pathlib import Path
import numpy as np
import pytest
import srsly
from spacy.vocab import Vocab
from thinc.api import Config
from ..util import make_tempdir
from ... import util
from ...lang.en import English
from ...training.initialize import init_nlp
from ...training.loop import train
from ...training.pretrain import pretrain
from ...tokens import Doc, DocBin
from ...language import DEFAULT_CONFIG_PRETRAIN_PATH, DEFAULT_CONFIG_PATH
pretrain_string_listener = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 342
depth = 4
window_size = 1
embed_size = 2000
maxout_pieces = 3
subword_features = true
[components.tagger]
factory = "tagger"
[components.tagger.model]
@architectures = "spacy.Tagger.v1"
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.width}
[pretraining]
max_epochs = 5
[training]
max_epochs = 5
"""
pretrain_string_internal = """
[nlp]
lang = "en"
pipeline = ["tagger"]
[components]
[components.tagger]
factory = "tagger"
[components.tagger.model]
@architectures = "spacy.Tagger.v1"
[components.tagger.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 342
depth = 4
window_size = 1
embed_size = 2000
maxout_pieces = 3
subword_features = true
[pretraining]
max_epochs = 5
[training]
max_epochs = 5
"""
pretrain_string_vectors = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger"]
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 342
depth = 4
window_size = 1
embed_size = 2000
maxout_pieces = 3
subword_features = true
[components.tagger]
factory = "tagger"
[components.tagger.model]
@architectures = "spacy.Tagger.v1"
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.width}
[pretraining]
max_epochs = 5
[pretraining.objective]
@architectures = spacy.PretrainVectors.v1
maxout_pieces = 3
hidden_size = 300
loss = cosine
[training]
max_epochs = 5
"""
CHAR_OBJECTIVES = [
{},
{"@architectures": "spacy.PretrainCharacters.v1"},
{
"@architectures": "spacy.PretrainCharacters.v1",
"maxout_pieces": 5,
"hidden_size": 42,
"n_characters": 2,
},
]
VECTOR_OBJECTIVES = [
{
"@architectures": "spacy.PretrainVectors.v1",
"maxout_pieces": 3,
"hidden_size": 300,
"loss": "cosine",
},
{
"@architectures": "spacy.PretrainVectors.v1",
"maxout_pieces": 2,
"hidden_size": 200,
"loss": "L2",
},
]
def test_pretraining_default():
"""Test that pretraining defaults to a character objective"""
config = Config().from_str(pretrain_string_internal)
nlp = util.load_model_from_config(config, auto_fill=True, validate=False)
filled = nlp.config
pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
filled = pretrain_config.merge(filled)
assert "PretrainCharacters" in filled["pretraining"]["objective"]["@architectures"]
@pytest.mark.parametrize("objective", CHAR_OBJECTIVES)
def test_pretraining_tok2vec_characters(objective):
"""Test that pretraining works with the character objective"""
config = Config().from_str(pretrain_string_listener)
config["pretraining"]["objective"] = objective
nlp = util.load_model_from_config(config, auto_fill=True, validate=False)
filled = nlp.config
pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
filled = pretrain_config.merge(filled)
with make_tempdir() as tmp_dir:
file_path = write_sample_jsonl(tmp_dir)
filled["paths"]["raw_text"] = file_path
filled = filled.interpolate()
assert filled["pretraining"]["component"] == "tok2vec"
pretrain(filled, tmp_dir)
assert Path(tmp_dir / "model0.bin").exists()
assert Path(tmp_dir / "model4.bin").exists()
assert not Path(tmp_dir / "model5.bin").exists()
@pytest.mark.parametrize("objective", VECTOR_OBJECTIVES)
def test_pretraining_tok2vec_vectors_fail(objective):
"""Test that pretraining doesn't work with the vectors objective if there are no static vectors"""
config = Config().from_str(pretrain_string_listener)
config["pretraining"]["objective"] = objective
nlp = util.load_model_from_config(config, auto_fill=True, validate=False)
filled = nlp.config
pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
filled = pretrain_config.merge(filled)
with make_tempdir() as tmp_dir:
file_path = write_sample_jsonl(tmp_dir)
filled["paths"]["raw_text"] = file_path
filled = filled.interpolate()
assert filled["initialize"]["vectors"] is None
with pytest.raises(ValueError):
pretrain(filled, tmp_dir)
@pytest.mark.parametrize("objective", VECTOR_OBJECTIVES)
def test_pretraining_tok2vec_vectors(objective):
"""Test that pretraining works with the vectors objective and static vectors defined"""
config = Config().from_str(pretrain_string_listener)
config["pretraining"]["objective"] = objective
nlp = util.load_model_from_config(config, auto_fill=True, validate=False)
filled = nlp.config
pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
filled = pretrain_config.merge(filled)
with make_tempdir() as tmp_dir:
file_path = write_sample_jsonl(tmp_dir)
filled["paths"]["raw_text"] = file_path
nlp_path = write_vectors_model(tmp_dir)
filled["initialize"]["vectors"] = nlp_path
filled = filled.interpolate()
pretrain(filled, tmp_dir)
@pytest.mark.parametrize("config", [pretrain_string_internal, pretrain_string_listener])
def test_pretraining_tagger_tok2vec(config):
"""Test pretraining of the tagger's tok2vec layer (internal or via a listener)"""
# use the parametrized config string instead of always loading the listener config
config = Config().from_str(config)
nlp = util.load_model_from_config(config, auto_fill=True, validate=False)
filled = nlp.config
pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
filled = pretrain_config.merge(filled)
with make_tempdir() as tmp_dir:
file_path = write_sample_jsonl(tmp_dir)
filled["paths"]["raw_text"] = file_path
filled["pretraining"]["component"] = "tagger"
filled["pretraining"]["layer"] = "tok2vec"
filled = filled.interpolate()
pretrain(filled, tmp_dir)
assert Path(tmp_dir / "model0.bin").exists()
assert Path(tmp_dir / "model4.bin").exists()
assert not Path(tmp_dir / "model5.bin").exists()
def test_pretraining_tagger():
"""Test pretraining of the tagger itself will throw an error (not an appropriate tok2vec layer)"""
config = Config().from_str(pretrain_string_internal)
nlp = util.load_model_from_config(config, auto_fill=True, validate=False)
filled = nlp.config
pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
filled = pretrain_config.merge(filled)
with make_tempdir() as tmp_dir:
file_path = write_sample_jsonl(tmp_dir)
filled["paths"]["raw_text"] = file_path
filled["pretraining"]["component"] = "tagger"
filled = filled.interpolate()
with pytest.raises(ValueError):
pretrain(filled, tmp_dir)
def test_pretraining_training():
"""Test that training can use a pretrained Tok2Vec model"""
config = Config().from_str(pretrain_string_internal)
nlp = util.load_model_from_config(config, auto_fill=True, validate=False)
filled = nlp.config
pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
filled = pretrain_config.merge(filled)
train_config = util.load_config(DEFAULT_CONFIG_PATH)
filled = train_config.merge(filled)
with make_tempdir() as tmp_dir:
pretrain_dir = tmp_dir / "pretrain"
pretrain_dir.mkdir()
file_path = write_sample_jsonl(pretrain_dir)
filled["paths"]["raw_text"] = file_path
filled["pretraining"]["component"] = "tagger"
filled["pretraining"]["layer"] = "tok2vec"
train_dir = tmp_dir / "train"
train_dir.mkdir()
train_path, dev_path = write_sample_training(train_dir)
filled["paths"]["train"] = train_path
filled["paths"]["dev"] = dev_path
filled = filled.interpolate()
P = filled["pretraining"]
nlp_base = init_nlp(filled)
model_base = nlp_base.get_pipe(P["component"]).model.get_ref(P["layer"]).get_ref("embed")
embed_base = None
for node in model_base.walk():
if node.name == "hashembed":
embed_base = node
pretrain(filled, pretrain_dir)
pretrained_model = Path(pretrain_dir / "model3.bin")
assert pretrained_model.exists()
filled["initialize"]["init_tok2vec"] = str(pretrained_model)
nlp = init_nlp(filled)
model = nlp.get_pipe(P["component"]).model.get_ref(P["layer"]).get_ref("embed")
embed = None
for node in model.walk():
if node.name == "hashembed":
embed = node
# ensure that the tok2vec weights are actually changed by the pretraining
assert np.any(np.not_equal(embed.get_param("E"), embed_base.get_param("E")))
train(nlp, train_dir)
def write_sample_jsonl(tmp_dir):
data = [
{
"meta": {"id": "1"},
"text": "This is the best TV you'll ever buy!",
"cats": {"pos": 1, "neg": 0},
},
{
"meta": {"id": "2"},
"text": "I wouldn't buy this again.",
"cats": {"pos": 0, "neg": 1},
},
]
file_path = f"{tmp_dir}/text.jsonl"
srsly.write_jsonl(file_path, data)
return file_path
def write_sample_training(tmp_dir):
words = ["The", "players", "start", "."]
tags = ["DT", "NN", "VBZ", "."]
doc = Doc(English().vocab, words=words, tags=tags)
doc_bin = DocBin()
doc_bin.add(doc)
train_path = f"{tmp_dir}/train.spacy"
dev_path = f"{tmp_dir}/dev.spacy"
doc_bin.to_disk(train_path)
doc_bin.to_disk(dev_path)
return train_path, dev_path
def write_vectors_model(tmp_dir):
import numpy
vocab = Vocab()
vector_data = {
"dog": numpy.random.uniform(-1, 1, (300,)),
"cat": numpy.random.uniform(-1, 1, (300,)),
"orange": numpy.random.uniform(-1, 1, (300,))
}
for word, vector in vector_data.items():
vocab.set_vector(word, vector)
nlp_path = tmp_dir / "vectors_model"
nlp = English(vocab)
nlp.to_disk(nlp_path)
return str(nlp_path)

View File

@ -1,12 +1,10 @@
from typing import Callable, Iterator, Dict, List, Tuple, TYPE_CHECKING from typing import Callable, Iterator, Dict, List, Tuple, TYPE_CHECKING
import random import random
import itertools import itertools
import copy
from functools import partial from functools import partial
from pydantic import BaseModel, StrictStr from pydantic import BaseModel, StrictStr
from ..util import registry from ..util import registry
from ..tokens import Doc
from .example import Example from .example import Example
if TYPE_CHECKING: if TYPE_CHECKING:
@ -71,7 +69,7 @@ def lower_casing_augmenter(
else: else:
example_dict = example.to_dict() example_dict = example.to_dict()
doc = nlp.make_doc(example.text.lower()) doc = nlp.make_doc(example.text.lower())
example_dict["token_annotation"]["ORTH"] = [t.lower_ for t in doc] example_dict["token_annotation"]["ORTH"] = [t.lower_ for t in example.reference]
yield example.from_dict(doc, example_dict) yield example.from_dict(doc, example_dict)
@ -88,9 +86,6 @@ def orth_variants_augmenter(
else: else:
raw_text = example.text raw_text = example.text
orig_dict = example.to_dict() orig_dict = example.to_dict()
if not orig_dict["token_annotation"]:
yield example
else:
variant_text, variant_token_annot = make_orth_variants( variant_text, variant_token_annot = make_orth_variants(
nlp, nlp,
raw_text, raw_text,
@ -98,14 +93,8 @@ def orth_variants_augmenter(
orth_variants, orth_variants,
lower=raw_text is not None and random.random() < lower, lower=raw_text is not None and random.random() < lower,
) )
if variant_text:
doc = nlp.make_doc(variant_text)
else:
doc = Doc(nlp.vocab, words=variant_token_annot["ORTH"])
variant_token_annot["ORTH"] = [w.text for w in doc]
variant_token_annot["SPACY"] = [w.whitespace_ for w in doc]
orig_dict["token_annotation"] = variant_token_annot orig_dict["token_annotation"] = variant_token_annot
yield example.from_dict(doc, orig_dict) yield example.from_dict(nlp.make_doc(variant_text), orig_dict)
def make_orth_variants( def make_orth_variants(
@ -116,16 +105,20 @@ def make_orth_variants(
*, *,
lower: bool = False, lower: bool = False,
) -> Tuple[str, Dict[str, List[str]]]: ) -> Tuple[str, Dict[str, List[str]]]:
orig_token_dict = copy.deepcopy(token_dict)
ndsv = orth_variants.get("single", [])
ndpv = orth_variants.get("paired", [])
words = token_dict.get("ORTH", []) words = token_dict.get("ORTH", [])
tags = token_dict.get("TAG", []) tags = token_dict.get("TAG", [])
# keep unmodified if words or tags are not defined # keep unmodified if words are not defined
if words and tags: if not words:
return raw, token_dict
if lower: if lower:
words = [w.lower() for w in words] words = [w.lower() for w in words]
raw = raw.lower()
# if no tags, only lowercase
if not tags:
token_dict["ORTH"] = words
return raw, token_dict
# single variants # single variants
ndsv = orth_variants.get("single", [])
punct_choices = [random.choice(x["variants"]) for x in ndsv] punct_choices = [random.choice(x["variants"]) for x in ndsv]
for word_idx in range(len(words)): for word_idx in range(len(words)):
for punct_idx in range(len(ndsv)): for punct_idx in range(len(ndsv)):
@ -135,6 +128,7 @@ def make_orth_variants(
): ):
words[word_idx] = punct_choices[punct_idx] words[word_idx] = punct_choices[punct_idx]
# paired variants # paired variants
ndpv = orth_variants.get("paired", [])
punct_choices = [random.choice(x["variants"]) for x in ndpv] punct_choices = [random.choice(x["variants"]) for x in ndpv]
for word_idx in range(len(words)): for word_idx in range(len(words)):
for punct_idx in range(len(ndpv)): for punct_idx in range(len(ndpv)):
@ -154,50 +148,10 @@ def make_orth_variants(
pair_idx = pair.index(words[word_idx]) pair_idx = pair.index(words[word_idx])
words[word_idx] = punct_choices[punct_idx][pair_idx] words[word_idx] = punct_choices[punct_idx][pair_idx]
token_dict["ORTH"] = words token_dict["ORTH"] = words
token_dict["TAG"] = tags # construct modified raw text from words and spaces
# modify raw raw = ""
if raw is not None: for orth, spacy in zip(token_dict["ORTH"], token_dict["SPACY"]):
variants = [] raw += orth
for single_variants in ndsv: if spacy:
variants.extend(single_variants["variants"]) raw += " "
for paired_variants in ndpv:
variants.extend(
list(itertools.chain.from_iterable(paired_variants["variants"]))
)
# store variants in reverse length order to be able to prioritize
# longer matches (e.g., "---" before "--")
variants = sorted(variants, key=lambda x: len(x))
variants.reverse()
variant_raw = ""
raw_idx = 0
# add initial whitespace
while raw_idx < len(raw) and raw[raw_idx].isspace():
variant_raw += raw[raw_idx]
raw_idx += 1
for word in words:
match_found = False
# skip whitespace words
if word.isspace():
match_found = True
# add identical word
elif word not in variants and raw[raw_idx:].startswith(word):
variant_raw += word
raw_idx += len(word)
match_found = True
# add variant word
else:
for variant in variants:
if not match_found and raw[raw_idx:].startswith(variant):
raw_idx += len(variant)
variant_raw += word
match_found = True
# something went wrong, abort
# (add a warning message?)
if not match_found:
return raw, orig_token_dict
# add following whitespace
while raw_idx < len(raw) and raw[raw_idx].isspace():
variant_raw += raw[raw_idx]
raw_idx += 1
raw = variant_raw
return raw, token_dict return raw, token_dict
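
The reconstruction step that replaces the old character-matching loop can be sketched in isolation. This is a minimal illustration, assuming `SPACY` holds one boolean per token that marks trailing whitespace; `rebuild_raw` is a hypothetical helper name and not part of spaCy.

```python
from typing import List


def rebuild_raw(words: List[str], spaces: List[bool]) -> str:
    """Join tokens into raw text, adding one space after each flagged token."""
    raw = ""
    for orth, has_space in zip(words, spaces):
        raw += orth
        if has_space:
            raw += " "
    return raw


# Hypothetical token annotation after a quote variant has been swapped in.
words = ["He", "said", "``", "hi", "''", "."]
spaces = [True, True, False, False, False, False]
text = rebuild_raw(words, spaces)
```

Because the text is rebuilt from the tokens, the raw string and the token annotation cannot drift apart, which is presumably why the fallback to `orig_token_dict` is no longer needed.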

View File

@ -9,6 +9,7 @@ import gzip
import zipfile import zipfile
import tqdm import tqdm
from .pretrain import get_tok2vec_ref
from ..lookups import Lookups from ..lookups import Lookups
from ..vectors import Vectors from ..vectors import Vectors
from ..errors import Errors, Warnings from ..errors import Errors, Warnings
@ -147,10 +148,6 @@ def init_tok2vec(
weights_data = None weights_data = None
init_tok2vec = ensure_path(I["init_tok2vec"]) init_tok2vec = ensure_path(I["init_tok2vec"])
if init_tok2vec is not None: if init_tok2vec is not None:
if P["objective"].get("type") == "vectors" and not I["vectors"]:
err = 'need initialize.vectors if pretraining.objective.type is "vectors"'
errors = [{"loc": ["initialize"], "msg": err}]
raise ConfigValidationError(config=nlp.config, errors=errors)
if not init_tok2vec.exists(): if not init_tok2vec.exists():
err = f"can't find pretrained tok2vec: {init_tok2vec}" err = f"can't find pretrained tok2vec: {init_tok2vec}"
errors = [{"loc": ["initialize", "init_tok2vec"], "msg": err}] errors = [{"loc": ["initialize", "init_tok2vec"], "msg": err}]
@ -158,21 +155,9 @@ def init_tok2vec(
with init_tok2vec.open("rb") as file_: with init_tok2vec.open("rb") as file_:
weights_data = file_.read() weights_data = file_.read()
if weights_data is not None: if weights_data is not None:
tok2vec_component = P["component"] layer = get_tok2vec_ref(nlp, P)
if tok2vec_component is None:
desc = (
f"To use pretrained tok2vec weights, [pretraining.component] "
f"needs to specify the component that should load them."
)
err = "component can't be null"
errors = [{"loc": ["pretraining", "component"], "msg": err}]
raise ConfigValidationError(
config=nlp.config["pretraining"], errors=errors, desc=desc
)
layer = nlp.get_pipe(tok2vec_component).model
if P["layer"]:
layer = layer.get_ref(P["layer"])
layer.from_bytes(weights_data) layer.from_bytes(weights_data)
logger.info(f"Loaded pretrained weights from {init_tok2vec}")
return True return True
return False return False

View File

@ -230,7 +230,10 @@ def train_while_improving(
if is_best_checkpoint is not None: if is_best_checkpoint is not None:
losses = {} losses = {}
# Stop if no improvement in `patience` updates (if specified) # Stop if no improvement in `patience` updates (if specified)
best_score, best_step = max(results) # Negate step value so that the earliest best step is chosen for the
# same score, i.e. (1.0, 100) is chosen over (1.0, 200)
best_result = max((r_score, -r_step) for r_score, r_step in results)
best_step = -best_result[1]
if patience and (step - best_step) >= patience: if patience and (step - best_step) >= patience:
break break
# Stop if we've exhausted our max steps (if specified) # Stop if we've exhausted our max steps (if specified)
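
The tie-breaking change above can be checked with plain tuples. The sketch below uses made-up `(score, step)` pairs and a made-up `patience` value; it only illustrates how negating the step changes which entry `max` returns.

```python
# Hypothetical evaluation history: (score, step) pairs.
results = [(0.8, 100), (1.0, 200), (1.0, 300), (0.9, 400)]

# Plain max() breaks score ties in favour of the larger step.
naive_score, naive_step = max(results)

# Negating the step makes the earliest step win among equal scores.
best_score, neg_step = max((score, -step) for score, step in results)
best_step = -neg_step

# With patience measured from the best step, the two choices can disagree.
step, patience = 400, 200
stops_naive = (step - naive_step) >= patience
stops_fixed = (step - best_step) >= patience
```

With these numbers the naive version keeps training because it measures patience from step 300, while the fixed version stops because the score has not improved since step 200.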

View File

@ -6,9 +6,12 @@ from collections import Counter
import srsly import srsly
import time import time
import re import re
from thinc.config import ConfigValidationError
from wasabi import Printer from wasabi import Printer
from .example import Example from .example import Example
from ..errors import Errors
from ..tokens import Doc from ..tokens import Doc
from ..schemas import ConfigSchemaPretrain from ..schemas import ConfigSchemaPretrain
from ..util import registry, load_model_from_config, dot_to_object from ..util import registry, load_model_from_config, dot_to_object
@ -133,12 +136,21 @@ def create_pretraining_model(nlp, pretrain_config):
The actual tok2vec layer is stored as a reference, and only this bit will be The actual tok2vec layer is stored as a reference, and only this bit will be
serialized to file and read back in when calling the 'train' command. serialized to file and read back in when calling the 'train' command.
""" """
with nlp.select_pipes(enable=[]):
nlp.initialize() nlp.initialize()
component = nlp.get_pipe(pretrain_config["component"]) tok2vec = get_tok2vec_ref(nlp, pretrain_config)
if pretrain_config.get("layer"): # If the config referred to a Tok2VecListener, grab the original model instead
tok2vec = component.model.get_ref(pretrain_config["layer"]) if type(tok2vec).__name__ == "Tok2VecListener":
else: original_tok2vec = (
tok2vec = component.model tok2vec.upstream_name if tok2vec.upstream_name != "*" else "tok2vec"
)
tok2vec = nlp.get_pipe(original_tok2vec).model
try:
tok2vec.initialize(X=[nlp.make_doc("Give it a doc to infer shapes")])
except ValueError:
component = pretrain_config["component"]
layer = pretrain_config["layer"]
raise ValueError(Errors.E874.format(component=component, layer=layer))
create_function = pretrain_config["objective"] create_function = pretrain_config["objective"]
model = create_function(nlp.vocab, tok2vec) model = create_function(nlp.vocab, tok2vec)
@ -147,6 +159,24 @@ def create_pretraining_model(nlp, pretrain_config):
return model return model
def get_tok2vec_ref(nlp, pretrain_config):
tok2vec_component = pretrain_config["component"]
if tok2vec_component is None:
desc = (
f"To use pretrained tok2vec weights, [pretraining.component] "
f"needs to specify the component that should load them."
)
err = "component can't be null"
errors = [{"loc": ["pretraining", "component"], "msg": err}]
raise ConfigValidationError(
config=nlp.config["pretraining"], errors=errors, desc=desc
)
layer = nlp.get_pipe(tok2vec_component).model
if pretrain_config["layer"]:
layer = layer.get_ref(pretrain_config["layer"])
return layer
class ProgressTracker: class ProgressTracker:
def __init__(self, frequency=1000000): def __init__(self, frequency=1000000):
self.loss = 0.0 self.loss = 0.0

View File

@ -1500,3 +1500,15 @@ def raise_error(proc_name, proc, docs, e):
def ignore_error(proc_name, proc, docs, e): def ignore_error(proc_name, proc, docs, e):
pass pass
def warn_if_jupyter_cupy():
"""Warn about require_gpu if a jupyter notebook + cupy + mismatched
contextvars vs. thread ops are detected
"""
if is_in_jupyter():
from thinc.backends.cupy_ops import CupyOps
if CupyOps.xp is not None:
from thinc.backends import contextvars_eq_thread_ops
if not contextvars_eq_thread_ops():
warnings.warn(Warnings.W111)

View File

@ -447,6 +447,9 @@ For more information, see the section on
> ```ini > ```ini
> [pretraining] > [pretraining]
> component = "tok2vec" > component = "tok2vec"
>
> [initialize]
> vectors = "en_core_web_lg"
> ... > ...
> >
> [pretraining.objective] > [pretraining.objective]
@ -457,7 +460,9 @@ For more information, see the section on
> ``` > ```
Predict the word's vector from a static embeddings table as pretraining Predict the word's vector from a static embeddings table as pretraining
objective for a Tok2Vec layer. objective for a Tok2Vec layer. To use this objective, make sure that the
`initialize.vectors` section in the config refers to a model with static
vectors.
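
The requirement can be expressed as a small pre-flight check on a config that has already been loaded into a dictionary. This is only a sketch: `vectors_objective_is_usable` is a hypothetical helper, not a spaCy function, and it assumes the config has the nested layout shown in the example above.

```python
from typing import Any, Dict


def vectors_objective_is_usable(config: Dict[str, Any]) -> bool:
    """Return False if the vectors objective is selected without static vectors."""
    objective = config.get("pretraining", {}).get("objective", {})
    uses_vectors = "PretrainVectors" in objective.get("@architectures", "")
    has_vectors = bool(config.get("initialize", {}).get("vectors"))
    return has_vectors or not uses_vectors


# Hypothetical configs, written as plain dictionaries.
ok_config = {
    "pretraining": {"objective": {"@architectures": "spacy.PretrainVectors.v1"}},
    "initialize": {"vectors": "en_core_web_lg"},
}
bad_config = {
    "pretraining": {"objective": {"@architectures": "spacy.PretrainVectors.v1"}},
    "initialize": {"vectors": None},
}
ok = vectors_objective_is_usable(ok_config)
bad = vectors_objective_is_usable(bad_config)
```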
| Name | Description | | Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | | --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
@ -584,6 +589,17 @@ several different built-in architectures. It is recommended to experiment with
different architectures and settings to determine what works best on your different architectures and settings to determine what works best on your
specific data and challenge. specific data and challenge.
<Infobox title="Single-label vs. multi-label classification" variant="warning">
When the architecture for a text classification challenge contains a setting for
`exclusive_classes`, it is important to use the correct value for the correct
pipeline component. The `textcat` component should always be used for
single-label use-cases where `exclusive_classes = true`, while the
`textcat_multilabel` should be used for multi-label settings with
`exclusive_classes = false`.
</Infobox>
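
The difference between the two settings can be illustrated without spaCy. The sketch below assumes the common convention that mutually exclusive classes are normalised with a softmax, while non-exclusive classes are scored independently with a sigmoid; the logits are made up.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Normalise logits so that the scores compete and sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits: List[float]) -> List[float]:
    """Score each label independently in the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


logits = [2.0, 1.0, -1.0]
exclusive = softmax(logits)    # single-label: exactly one class should win
independent = sigmoid(logits)  # multi-label: zero or more labels may apply
```

In the exclusive case the scores form a distribution over the labels, so raising one lowers the others. In the independent case several labels can score highly at once.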
### spacy.TextCatEnsemble.v2 {#TextCatEnsemble} ### spacy.TextCatEnsemble.v2 {#TextCatEnsemble}
> #### Example Config > #### Example Config

View File

@ -1,453 +0,0 @@
---
title: Multi-label TextCategorizer
tag: class
source: spacy/pipeline/textcat_multilabel.py
new: 3
teaser: 'Pipeline component for multi-label text classification'
api_base_class: /api/pipe
api_string_name: textcat_multilabel
api_trainable: true
---
The text categorizer predicts **categories over a whole document**. It
learns non-mutually exclusive labels, which means that zero or more labels
may be true per document.
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
[`config.cfg` for training](/usage/training#config). See the
[model architectures](/api/architectures) documentation for details on the
architectures and their arguments and hyperparameters.
> #### Example
>
> ```python
> from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL
> config = {
> "threshold": 0.5,
> "model": DEFAULT_MULTI_TEXTCAT_MODEL,
> }
> nlp.add_pipe("textcat_multilabel", config=config)
> ```
| Setting | Description |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
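
As a rough illustration of the `threshold` setting, the sketch below turns per-label scores into a set of positive labels. The scores are made up, `positive_labels` is a hypothetical helper, and treating a score equal to the threshold as positive is an assumption rather than documented behaviour.

```python
from typing import Dict, List


def positive_labels(cats: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the labels whose score reaches the threshold, sorted by name."""
    return sorted(label for label, score in cats.items() if score >= threshold)


# Hypothetical scores in the shape of doc.cats for a multi-label model.
cats = {"SPORTS": 0.91, "POLITICS": 0.62, "TECH": 0.08}
default_cut = positive_labels(cats)
strict_cut = positive_labels(cats, threshold=0.95)
```

Since the labels are not mutually exclusive, the result may contain several labels or none at all.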
```python
%%GITHUB_SPACY/spacy/pipeline/textcat_multilabel.py
```
## MultiLabel_TextCategorizer.\_\_init\_\_ {#init tag="method"}
> #### Example
>
> ```python
> # Construction via add_pipe with default model
> textcat = nlp.add_pipe("textcat_multilabel")
>
> # Construction via add_pipe with custom model
> config = {"model": {"@architectures": "my_textcat"}}
> textcat = nlp.add_pipe("textcat_multilabel", config=config)
>
> # Construction from class
> from spacy.pipeline import MultiLabel_TextCategorizer
> textcat = MultiLabel_TextCategorizer(nlp.vocab, model, threshold=0.5)
> ```
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).
| Name | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
| _keyword-only_ | |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
## MultiLabel_TextCategorizer.\_\_call\_\_ {#call tag="method"}
Apply the pipe to one document. The document is modified in place, and returned.
This usually happens under the hood when the `nlp` object is called on a text
and all pipeline components are applied to the `Doc` in order. Both
[`__call__`](/api/multilabel_textcategorizer#call) and [`pipe`](/api/multilabel_textcategorizer#pipe)
delegate to the [`predict`](/api/multilabel_textcategorizer#predict) and
[`set_annotations`](/api/multilabel_textcategorizer#set_annotations) methods.
> #### Example
>
> ```python
> doc = nlp("This is a sentence.")
> textcat = nlp.add_pipe("textcat_multilabel")
> # This usually happens under the hood
> processed = textcat(doc)
> ```
| Name | Description |
| ----------- | -------------------------------- |
| `doc` | The document to process. ~~Doc~~ |
| **RETURNS** | The processed document. ~~Doc~~ |
## MultiLabel_TextCategorizer.pipe {#pipe tag="method"}
Apply the pipe to a stream of documents. This usually happens under the hood
when the `nlp` object is called on a text and all pipeline components are
applied to the `Doc` in order. Both [`__call__`](/api/multilabel_textcategorizer#call) and
[`pipe`](/api/multilabel_textcategorizer#pipe) delegate to the
[`predict`](/api/multilabel_textcategorizer#predict) and
[`set_annotations`](/api/multilabel_textcategorizer#set_annotations) methods.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> for doc in textcat.pipe(docs, batch_size=50):
> pass
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------- |
| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
| _keyword-only_ | |
| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
| **YIELDS** | The processed documents in order. ~~Doc~~ |
## MultiLabel_TextCategorizer.initialize {#initialize tag="method" new="3"}
Initialize the component for training. `get_examples` should be a function that
returns an iterable of [`Example`](/api/example) objects. The data examples are
used to **initialize the model** of the component and can either be the full
training data or a representative sample. Initialization includes validating the
network,
[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
setting up the label scheme based on the data. This method is typically called
by [`Language.initialize`](/api/language#initialize) and lets you customize
arguments it receives via the
[`[initialize.components]`](/api/data-formats#config-initialize) block in the
config.
<Infobox variant="warning" title="Changed in v3.0" id="begin_training">
This method was previously called `begin_training`.
</Infobox>
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.initialize(lambda: [], nlp=nlp)
> ```
>
> ```ini
> ### config.cfg
> [initialize.components.textcat_multilabel]
>
> [initialize.components.textcat_multilabel.labels]
> @readers = "spacy.read_labels.v1"
> path = "corpus/labels/textcat.json"
> ```
| Name | Description |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
| _keyword-only_ | |
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |
## MultiLabel_TextCategorizer.predict {#predict tag="method"}
Apply the component's model to a batch of [`Doc`](/api/doc) objects without
modifying them.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict([doc1, doc2])
> ```
| Name | Description |
| ----------- | ------------------------------------------- |
| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
| **RETURNS** | The model's prediction for each document. |
## MultiLabel_TextCategorizer.set_annotations {#set_annotations tag="method"}
Modify a batch of [`Doc`](/api/doc) objects using pre-computed scores.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict(docs)
> textcat.set_annotations(docs, scores)
> ```
| Name | Description |
| -------- | --------------------------------------------------------- |
| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
| `scores` | The scores to set, produced by `MultiLabel_TextCategorizer.predict`. |
## MultiLabel_TextCategorizer.update {#update tag="method"}
Learn from a batch of [`Example`](/api/example) objects containing the
predictions and gold-standard annotations, and update the component's model.
Delegates to [`predict`](/api/multilabel_textcategorizer#predict) and
[`get_loss`](/api/multilabel_textcategorizer#get_loss).
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = nlp.initialize()
> losses = textcat.update(examples, sgd=optimizer)
> ```
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
## MultiLabel_TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"}
Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the
current model to make predictions similar to an initial model to try to address
the "catastrophic forgetting" problem. This feature is experimental.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = nlp.resume_training()
> losses = textcat.rehearse(examples, sgd=optimizer)
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
## MultiLabel_TextCategorizer.get_loss {#get_loss tag="method"}
Find the loss and gradient of loss for the batch of documents and their
predicted scores.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> scores = textcat.predict([eg.predicted for eg in examples])
> loss, d_loss = textcat.get_loss(examples, scores)
> ```
| Name | Description |
| ----------- | --------------------------------------------------------------------------- |
| `examples` | The batch of examples. ~~Iterable[Example]~~ |
| `scores` | Scores representing the model's predictions. |
| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |
## MultiLabel_TextCategorizer.score {#score tag="method" new="3"}
Score a batch of examples.
> #### Example
>
> ```python
> scores = textcat.score(examples)
> ```
| Name | Description |
| ---------------- | -------------------------------------------------------------------------------------------------------------------- |
| `examples` | The examples to score. ~~Iterable[Example]~~ |
| _keyword-only_ | |
| **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |
## MultiLabel_TextCategorizer.create_optimizer {#create_optimizer tag="method"}
Create an optimizer for the pipeline component.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> optimizer = textcat.create_optimizer()
> ```
| Name | Description |
| ----------- | ---------------------------- |
| **RETURNS** | The optimizer. ~~Optimizer~~ |
## MultiLabel_TextCategorizer.use_params {#use_params tag="method, contextmanager"}
Modify the pipe's model to use the given parameter values.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> with textcat.use_params(optimizer.averages):
> textcat.to_disk("/best_model")
> ```
| Name | Description |
| -------- | -------------------------------------------------- |
| `params` | The parameter values to use in the model. ~~dict~~ |
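
The context-manager pattern behind `use_params` can be sketched in plain Python. `TinyModel` is a hypothetical stand-in, not a spaCy or Thinc class. It only shows the idea of swapping parameter values in for the duration of a `with` block and restoring the originals afterwards.

```python
from contextlib import contextmanager
from typing import Dict, Iterator


class TinyModel:
    """Hypothetical stand-in for a model that stores named parameter values."""

    def __init__(self, params: Dict[str, float]) -> None:
        self.params = dict(params)

    @contextmanager
    def use_params(self, params: Dict[str, float]) -> Iterator[None]:
        backup = dict(self.params)
        self.params.update(params)  # temporarily swap in the given values
        try:
            yield
        finally:
            self.params = backup  # restore the original values on exit


model = TinyModel({"W": 1.0})
with model.use_params({"W": 0.5}):
    inside = model.params["W"]
after = model.params["W"]
print(inside, after)  # 0.5 1.0
```
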
## MultiLabel_TextCategorizer.add_label {#add_label tag="method"}
Add a new label to the pipe. Raises an error if the output dimension is already
set, or if the model has already been fully [initialized](#initialize). Note
that you don't have to call this method if you provide a **representative data
sample** to the [`initialize`](#initialize) method. In this case, all labels
found in the sample will be automatically added to the model, and the output
dimension will be [inferred](/usage/layers-architectures#thinc-shape-inference)
automatically.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.add_label("MY_LABEL")
> ```
| Name | Description |
| ----------- | ----------------------------------------------------------- |
| `label` | The label to add. ~~str~~ |
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
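
The documented return values can be mimicked with a small plain-Python sketch. `LabelStore` is a hypothetical container used only to illustrate the `0`/`1` contract. It does not model the error raised when the output dimension is already set.

```python
from typing import List


class LabelStore:
    """Hypothetical container mimicking the documented return values of add_label."""

    def __init__(self) -> None:
        self.labels: List[str] = []

    def add_label(self, label: str) -> int:
        if label in self.labels:
            return 0  # label already present, nothing added
        self.labels.append(label)
        return 1  # label was newly added


store = LabelStore()
first = store.add_label("MY_LABEL")
second = store.add_label("MY_LABEL")
print(first, second, store.labels)  # 1 0 ['MY_LABEL']
```
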
## MultiLabel_TextCategorizer.to_disk {#to_disk tag="method"}
Serialize the pipe to disk.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.to_disk("/path/to/textcat")
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
## MultiLabel_TextCategorizer.from_disk {#from_disk tag="method"}
Load the pipe from disk. Modifies the object in place and returns it.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.from_disk("/path/to/textcat")
> ```
| Name | Description |
| -------------- | ----------------------------------------------------------------------------------------------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The modified `MultiLabel_TextCategorizer` object. ~~MultiLabel_TextCategorizer~~ |
## MultiLabel_TextCategorizer.to_bytes {#to_bytes tag="method"}
Serialize the pipe to a bytestring.
> #### Example
>
> ```python
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat_bytes = textcat.to_bytes()
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------- |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The serialized form of the `MultiLabel_TextCategorizer` object. ~~bytes~~ |
## MultiLabel_TextCategorizer.from_bytes {#from_bytes tag="method"}
Load the pipe from a bytestring. Modifies the object in place and returns it.
> #### Example
>
> ```python
> textcat_bytes = textcat.to_bytes()
> textcat = nlp.add_pipe("textcat_multilabel")
> textcat.from_bytes(textcat_bytes)
> ```
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------- |
| `bytes_data` | The data to load from. ~~bytes~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The `MultiLabel_TextCategorizer` object. ~~MultiLabel_TextCategorizer~~ |
## MultiLabel_TextCategorizer.labels {#labels tag="property"}
The labels currently added to the component.
> #### Example
>
> ```python
> textcat.add_label("MY_LABEL")
> assert "MY_LABEL" in textcat.labels
> ```
| Name | Description |
| ----------- | ------------------------------------------------------ |
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
## MultiLabel_TextCategorizer.label_data {#label_data tag="property" new="3"}
The labels currently added to the component and their internal meta information.
This is the data generated by [`init labels`](/api/cli#init-labels) and used by
[`MultiLabel_TextCategorizer.initialize`](/api/multilabel_textcategorizer#initialize) to initialize
the model with a pre-defined label set.
> #### Example
>
> ```python
> labels = textcat.label_data
> textcat.initialize(lambda: [], nlp=nlp, labels=labels)
> ```
| Name | Description |
| ----------- | ---------------------------------------------------------- |
| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> textcat.to_disk("/path", exclude=["vocab"])
> ```
| Name | Description |
| ------- | -------------------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |
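
The effect of the `exclude` argument can be pictured as filtering a mapping of named fields before it is written out. The sketch below is plain Python with a hypothetical helper `select_fields` and placeholder byte values. It does not reflect how spaCy stores the fields internally.

```python
from typing import Dict, Iterable


def select_fields(
    fields: Dict[str, bytes], exclude: Iterable[str] = ()
) -> Dict[str, bytes]:
    """Keep every serialization field whose name is not listed in `exclude`."""
    excluded = set(exclude)
    return {name: data for name, data in fields.items() if name not in excluded}


# Placeholder payloads, for illustration only
fields = {"vocab": b"vocab-data", "cfg": b"{}", "model": b"\x00"}
kept = select_fields(fields, exclude=["vocab"])
print(sorted(kept))  # ['cfg', 'model']
```
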


@ -3,15 +3,30 @@ title: TextCategorizer
tag: class tag: class
source: spacy/pipeline/textcat.py source: spacy/pipeline/textcat.py
new: 2 new: 2
teaser: 'Pipeline component for single-label text classification' teaser: 'Pipeline component for text classification'
api_base_class: /api/pipe api_base_class: /api/pipe
api_string_name: textcat api_string_name: textcat
api_trainable: true api_trainable: true
--- ---
The text categorizer predicts **categories over a whole document**. It can learn The text categorizer predicts **categories over a whole document** and comes in
one or more labels, and the labels are mutually exclusive - there is exactly one two flavours: `textcat` and `textcat_multilabel`. When you need to predict
true label per document. exactly one true label per document, use the `textcat` component, which has mutually
exclusive labels. If you want to perform multi-label classification and predict
zero, one or more labels per document, use the `textcat_multilabel` component
instead.
Both components are documented on this page.
<Infobox title="Migration from v2" variant="warning">
In spaCy v2, the `textcat` component could also perform **multi-label
classification**, and even used this setting by default. Since v3.0, the
component `textcat_multilabel` should be used for multi-label classification
instead. The `textcat` component is now used for mutually exclusive classes
only.
</Infobox>
## Config and implementation {#config} ## Config and implementation {#config}
@ -22,7 +37,7 @@ how the component should be configured. You can override its settings via the
[model architectures](/api/architectures) documentation for details on the [model architectures](/api/architectures) documentation for details on the
architectures and their arguments and hyperparameters. architectures and their arguments and hyperparameters.
> #### Example > #### Example (textcat)
> >
> ```python > ```python
> from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL > from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
@ -33,6 +48,17 @@ architectures and their arguments and hyperparameters.
> nlp.add_pipe("textcat", config=config) > nlp.add_pipe("textcat", config=config)
> ``` > ```
> #### Example (textcat_multilabel)
>
> ```python
> from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL
> config = {
> "threshold": 0.5,
> "model": DEFAULT_MULTI_TEXTCAT_MODEL,
> }
> nlp.add_pipe("textcat_multilabel", config=config)
> ```
| Setting | Description | | Setting | Description |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ | | `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
@ -48,6 +74,7 @@ architectures and their arguments and hyperparameters.
> >
> ```python > ```python
> # Construction via add_pipe with default model > # Construction via add_pipe with default model
> # Use 'textcat_multilabel' for multi-label classification
> textcat = nlp.add_pipe("textcat") > textcat = nlp.add_pipe("textcat")
> >
> # Construction via add_pipe with custom model > # Construction via add_pipe with custom model
@ -55,6 +82,7 @@ architectures and their arguments and hyperparameters.
> parser = nlp.add_pipe("textcat", config=config) > parser = nlp.add_pipe("textcat", config=config)
> >
> # Construction from class > # Construction from class
> # Use 'MultiLabel_TextCategorizer' for multi-label classification
> from spacy.pipeline import TextCategorizer > from spacy.pipeline import TextCategorizer
> textcat = TextCategorizer(nlp.vocab, model, threshold=0.5) > textcat = TextCategorizer(nlp.vocab, model, threshold=0.5)
> ``` > ```
@ -161,7 +189,7 @@ This method was previously called `begin_training`.
| _keyword-only_ | | | _keyword-only_ | |
| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | | `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ | | `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~ | | `positive_label` | The positive label for a binary task with exclusive classes, `None` otherwise and by default. This parameter is not available when using the `textcat_multilabel` component. ~~Optional[str]~~ |
## TextCategorizer.predict {#predict tag="method"} ## TextCategorizer.predict {#predict tag="method"}
@ -213,7 +241,7 @@ Delegates to [`predict`](/api/textcategorizer#predict) and
> ``` > ```
| Name | Description | | Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | | -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | | _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ | | `drop` | The dropout rate. ~~float~~ |
@ -274,7 +302,7 @@ Score a batch of examples.
> ``` > ```
| Name | Description | | Name | Description |
| ---------------- | -------------------------------------------------------------------------------------------------------------------- | | -------------- | -------------------------------------------------------------------------------------------------------------------- |
| `examples` | The examples to score. ~~Iterable[Example]~~ | | `examples` | The examples to score. ~~Iterable[Example]~~ |
| _keyword-only_ | | | _keyword-only_ | |
| **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ | | **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |


@ -138,6 +138,14 @@ data has already been allocated on CPU, it will not be moved. Ideally, this
function should be called right after importing spaCy and _before_ loading any function should be called right after importing spaCy and _before_ loading any
pipelines. pipelines.
<Infobox variant="warning" title="Jupyter notebook usage">
In a Jupyter notebook, run `prefer_gpu()` in the same cell as `spacy.load()`
to ensure that the model is loaded on the correct device. See [more
details](/usage/v3#jupyter-notebook-gpu).
</Infobox>
> #### Example > #### Example
> >
> ```python > ```python
@ -158,6 +166,14 @@ if no GPU is available. If data has already been allocated on CPU, it will not
be moved. Ideally, this function should be called right after importing spaCy be moved. Ideally, this function should be called right after importing spaCy
and _before_ loading any pipelines. and _before_ loading any pipelines.
<Infobox variant="warning" title="Jupyter notebook usage">
In a Jupyter notebook, run `require_gpu()` in the same cell as `spacy.load()`
to ensure that the model is loaded on the correct device. See [more
details](/usage/v3#jupyter-notebook-gpu).
</Infobox>
> #### Example > #### Example
> >
> ```python > ```python
@ -177,6 +193,14 @@ Allocate data and perform operations on CPU. If data has already been allocated
on GPU, it will not be moved. Ideally, this function should be called right on GPU, it will not be moved. Ideally, this function should be called right
after importing spaCy and _before_ loading any pipelines. after importing spaCy and _before_ loading any pipelines.
<Infobox variant="warning" title="Jupyter notebook usage">
In a Jupyter notebook, run `require_cpu()` in the same cell as `spacy.load()`
to ensure that the model is loaded on the correct device. See [more
details](/usage/v3#jupyter-notebook-gpu).
</Infobox>
> #### Example > #### Example
> >
> ```python > ```python


@ -224,13 +224,14 @@ available pipeline components and component functions.
> ``` > ```
| String name | Component | Description | | String name | Component | Description |
| ----------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------- | | -------------------- | ---------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech-tags. | | `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech-tags. |
| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. | | `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. |
| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. | | `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. |
| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. | | `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. | | `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. |
| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. | | `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories: exactly one category is predicted per document. |
| `textcat_multilabel` | [`MultiLabel_TextCategorizer`](/api/textcategorizer) | Assign text categories in a multi-label setting: zero, one or more labels per document. |
| `lemmatizer` | [`Lemmatizer`](/api/lemmatizer) | Assign base forms to words. | | `lemmatizer` | [`Lemmatizer`](/api/lemmatizer) | Assign base forms to words. |
| `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. | | `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. |
| `attribute_ruler` | [`AttributeRuler`](/api/attributeruler) | Assign token attribute mappings and rule-based exceptions. | | `attribute_ruler` | [`AttributeRuler`](/api/attributeruler) | Assign token attribute mappings and rule-based exceptions. |
@ -400,8 +401,8 @@ vectors available otherwise, it won't be able to make the same predictions.
> ``` > ```
> >
> By default, sourced components will be updated with your data during training. > By default, sourced components will be updated with your data during training.
> If you want to preserve the component as-is, you can "freeze" it if the pipeline > If you want to preserve the component as-is, you can "freeze" it if the
> is not using a shared `Tok2Vec` layer: > pipeline is not using a shared `Tok2Vec` layer:
> >
> ```ini > ```ini
> [training] > [training]
@ -1244,7 +1245,7 @@ labels = []
# the argument "model" # the argument "model"
[components.textcat.model] [components.textcat.model]
@architectures = "spacy.TextCatBOW.v1" @architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false exclusive_classes = true
ngram_size = 1 ngram_size = 1
no_output_layer = false no_output_layer = false
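
To see how the `exclusive_classes` setting relates to the choice of component, the fragment above can be read with Python's standard-library `configparser`. This is only an illustration: spaCy has its own config system, and the mapping from the flag to a component name is an assumption based on the descriptions of `textcat` and `textcat_multilabel`.

```python
import configparser

# The config fragment from above, embedded as a string for illustration
CONFIG = """
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG)
model_cfg = parser["components.textcat.model"]

# getboolean understands "true"/"false" as written in the fragment
exclusive = model_cfg.getboolean("exclusive_classes")
component = "textcat" if exclusive else "textcat_multilabel"
print(exclusive, component)  # True textcat
```
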


@ -321,13 +321,14 @@ add to your pipeline and customize for your use case:
> ``` > ```
| Name | Description | | Name | Description |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. | | [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. |
| [`Morphologizer`](/api/morphologizer) | Trainable component to predict morphological features. | | [`Morphologizer`](/api/morphologizer) | Trainable component to predict morphological features. |
| [`Lemmatizer`](/api/lemmatizer) | Standalone component for rule-based and lookup lemmatization. | | [`Lemmatizer`](/api/lemmatizer) | Standalone component for rule-based and lookup lemmatization. |
| [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. | | [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
| [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/embeddings-transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). | | [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/embeddings-transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
| [`TrainablePipe`](/api/pipe) | Base class for trainable pipeline components. | | [`TrainablePipe`](/api/pipe) | Base class for trainable pipeline components. |
| [`Multi-label TextCategorizer`](/api/textcategorizer) | Trainable component for multi-label text classification. |
<Infobox title="Details & Documentation" emoji="📖" list> <Infobox title="Details & Documentation" emoji="📖" list>
@ -592,6 +593,10 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
- Various keyword arguments across functions and methods are now explicitly - Various keyword arguments across functions and methods are now explicitly
declared as **keyword-only** arguments. Those arguments are documented declared as **keyword-only** arguments. Those arguments are documented
accordingly across the API reference using the <Tag>keyword-only</Tag> tag. accordingly across the API reference using the <Tag>keyword-only</Tag> tag.
- The `textcat` pipeline component is now only applicable for classification of
mutually exclusive classes - i.e. one predicted class per input sentence or
document. To perform multi-label classification, use the new
`textcat_multilabel` component instead.
### Removed or renamed API {#incompat-removed} ### Removed or renamed API {#incompat-removed}
@ -1174,3 +1179,15 @@ This means that spaCy knows how to initialize `my_component`, even if your
package isn't imported. package isn't imported.
</Infobox> </Infobox>
#### Using GPUs in Jupyter notebooks {#jupyter-notebook-gpu}
In Jupyter notebooks, run [`prefer_gpu`](/api/top-level#spacy.prefer_gpu),
[`require_gpu`](/api/top-level#spacy.require_gpu) or
[`require_cpu`](/api/top-level#spacy.require_cpu) in the same cell as
[`spacy.load`](/api/top-level#spacy.load) to ensure that the model is loaded on the correct device.
Due to a bug related to `contextvars` (see the [bug
report](https://github.com/ipython/ipython/issues/11565)), the GPU settings may
not be preserved correctly across cells, resulting in models being loaded on
the wrong device or only partially on GPU.


@ -9,6 +9,7 @@ import { htmlToReact } from '../components/util'
const DEFAULT_LANG = 'en' const DEFAULT_LANG = 'en'
const DEFAULT_HARDWARE = 'cpu' const DEFAULT_HARDWARE = 'cpu'
const DEFAULT_OPT = 'efficiency' const DEFAULT_OPT = 'efficiency'
const DEFAULT_TEXTCAT_EXCLUSIVE = true
const COMPONENTS = ['tagger', 'parser', 'ner', 'textcat'] const COMPONENTS = ['tagger', 'parser', 'ner', 'textcat']
const COMMENT = `# This is an auto-generated partial config. To use it with 'spacy train' const COMMENT = `# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings: # you can run spacy init fill-config to auto-fill all default settings:
@ -27,6 +28,19 @@ const DATA = [
options: COMPONENTS.map(id => ({ id, title: id })), options: COMPONENTS.map(id => ({ id, title: id })),
multiple: true, multiple: true,
}, },
{
id: 'textcat',
title: 'Text Classification',
multiple: true,
options: [
{
id: 'exclusive',
title: 'exclusive categories',
checked: DEFAULT_TEXTCAT_EXCLUSIVE,
help: 'only one label can apply',
},
],
},
{ {
id: 'hardware', id: 'hardware',
title: 'Hardware', title: 'Hardware',
@ -49,14 +63,28 @@ const DATA = [
export default function QuickstartTraining({ id, title, download = 'base_config.cfg' }) { export default function QuickstartTraining({ id, title, download = 'base_config.cfg' }) {
const [lang, setLang] = useState(DEFAULT_LANG) const [lang, setLang] = useState(DEFAULT_LANG)
const [_components, _setComponents] = useState([])
const [components, setComponents] = useState([]) const [components, setComponents] = useState([])
const [[hardware], setHardware] = useState([DEFAULT_HARDWARE]) const [[hardware], setHardware] = useState([DEFAULT_HARDWARE])
const [[optimize], setOptimize] = useState([DEFAULT_OPT]) const [[optimize], setOptimize] = useState([DEFAULT_OPT])
const [textcatExclusive, setTextcatExclusive] = useState(DEFAULT_TEXTCAT_EXCLUSIVE)
function updateComponents(value, isExclusive) {
_setComponents(value)
const updated = value.map(c => (c === 'textcat' && !isExclusive ? 'textcat_multilabel' : c))
setComponents(updated)
}
const setters = { const setters = {
lang: setLang, lang: setLang,
components: setComponents, components: v => updateComponents(v, textcatExclusive),
hardware: setHardware, hardware: setHardware,
optimize: setOptimize, optimize: setOptimize,
textcat: v => {
const isExclusive = v.includes('exclusive')
setTextcatExclusive(isExclusive)
updateComponents(_components, isExclusive)
},
} }
const reco = GENERATOR_DATA[lang] || GENERATOR_DATA.__default__ const reco = GENERATOR_DATA[lang] || GENERATOR_DATA.__default__
const content = generator({ const content = generator({
@ -78,20 +106,24 @@ export default function QuickstartTraining({ id, title, download = 'base_config.
<StaticQuery <StaticQuery
query={query} query={query}
render={({ site }) => { render={({ site }) => {
let data = DATA
const langs = site.siteMetadata.languages const langs = site.siteMetadata.languages
DATA[0].dropdown = langs data[0].dropdown = langs
.map(({ name, code }) => ({ .map(({ name, code }) => ({
id: code, id: code,
title: name, title: name,
})) }))
.sort((a, b) => a.title.localeCompare(b.title)) .sort((a, b) => a.title.localeCompare(b.title))
if (!_components.includes('textcat')) {
data = data.filter(({ id }) => id !== 'textcat')
}
return ( return (
<Quickstart <Quickstart
id="quickstart-widget" id="quickstart-widget"
Container="div" Container="div"
download={download} download={download}
rawContent={rawContent} rawContent={rawContent}
data={DATA} data={data}
title={title} title={title}
id={id} id={id}
setters={setters} setters={setters}