Merge branch 'develop' of https://github.com/explosion/spaCy into develop

Matthew Honnibal 2020-09-14 21:04:49 +02:00
commit adf0bab23a
25 changed files with 206 additions and 116 deletions

View File

@@ -4,17 +4,19 @@
spaCy is a library for advanced Natural Language Processing in Python and
Cython. It's built on the very latest research, and was designed from day one to
be used in real products. spaCy comes with
[pretrained statistical models](https://spacy.io/models) and word vectors, and
currently supports tokenization for **60+ languages**. It features
be used in real products.
spaCy comes with
[pretrained pipelines](https://spacy.io/models) and vectors, and
currently supports tokenization for **59+ languages**. It features
state-of-the-art speed, convolutional **neural network models** for tagging,
parsing and **named entity recognition** and easy **deep learning** integration.
It's commercial open-source software, released under the MIT license.
parsing, **named entity recognition**, **text classification** and more, multi-task learning with pretrained **transformers** like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management.
spaCy is commercial open-source software, released under the MIT license.
💫 **Version 2.3 out now!**
[Check out the release notes here.](https://github.com/explosion/spaCy/releases)
[![Azure Pipelines](<https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build+(3.x)>)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
[![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square&logo=github)](https://github.com/explosion/spaCy/releases)
[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy/)
[![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/spacy)
@@ -31,7 +33,7 @@ It's commercial open-source software, released under the MIT license.
| --------------- | -------------------------------------------------------------- |
| [spaCy 101] | New to spaCy? Here's everything you need to know! |
| [Usage Guides] | How to use spaCy and its features. |
| [New in v2.3] | New features, backwards incompatibilities and migration guide. |
| [New in v3.0] | New features, backwards incompatibilities and migration guide. |
| [API Reference] | The detailed reference for spaCy's API. |
| [Models] | Download statistical language models for spaCy. |
| [Universe] | Libraries, extensions, demos, books and courses. |
@@ -39,7 +41,7 @@ It's commercial open-source software, released under the MIT license.
| [Contribute] | How to contribute to the spaCy project and code base. |
[spacy 101]: https://spacy.io/usage/spacy-101
[new in v2.3]: https://spacy.io/usage/v2-3
[new in v3.0]: https://spacy.io/usage/v3
[usage guides]: https://spacy.io/usage/
[api reference]: https://spacy.io/api/
[models]: https://spacy.io/models
@@ -56,34 +58,29 @@ be able to provide individual support via email. We also believe that help is
much more valuable if it's shared publicly, so that more people can benefit from
it.
| Type | Platforms |
| ------------------------ | ------------------------------------------------------ |
| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
| 🎁 **Feature Requests** | [GitHub Issue Tracker] |
| 👩‍💻 **Usage Questions** | [Stack Overflow] · [Gitter Chat] · [Reddit User Group] |
| 🗯 **General Discussion** | [Gitter Chat] · [Reddit User Group] |
| Type | Platforms |
| ----------------------- | ---------------------- |
| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
| 🎁 **Feature Requests** | [GitHub Issue Tracker] |
| 👩‍💻 **Usage Questions** | [Stack Overflow] |
[github issue tracker]: https://github.com/explosion/spaCy/issues
[stack overflow]: https://stackoverflow.com/questions/tagged/spacy
[gitter chat]: https://gitter.im/explosion/spaCy
[reddit user group]: https://www.reddit.com/r/spacynlp
## Features
- Non-destructive **tokenization**
- **Named entity** recognition
- Support for **50+ languages**
- pretrained [statistical models](https://spacy.io/models) and word vectors
- Support for **59+ languages**
- **Trained pipelines**
- Multi-task learning with pretrained **transformers** like BERT
- Pretrained **word vectors**
- State-of-the-art speed
- Easy **deep learning** integration
- Part-of-speech tagging
- Labelled dependency parsing
- Syntax-driven sentence segmentation
- Production-ready **training system**
- Linguistically-motivated **tokenization**
- Components for named **entity recognition**, part-of-speech-tagging, dependency parsing, sentence segmentation, **text classification**, lemmatization, morphological analysis, entity linking and more
- Easily extensible with **custom components** and attributes
- Support for custom models in **PyTorch**, **TensorFlow** and other frameworks
- Built in **visualizers** for syntax and NER
- Convenient string-to-hash mapping
- Export to numpy data arrays
- Efficient binary serialization
- Easy **model packaging** and deployment
- Easy **model packaging**, deployment and workflow management
- Robust, rigorously evaluated accuracy
📖 **For more details, see the
@@ -102,13 +99,6 @@ For detailed installation instructions, see the
[pip]: https://pypi.org/project/spacy/
[conda]: https://anaconda.org/conda-forge/spacy
> ⚠️ **Important note for Python 3.8:** We can't yet ship pre-compiled binary
> wheels for spaCy that work on Python 3.8, as we're still waiting for our CI
> providers and other tooling to support it. This means that in order to run
> spaCy on Python 3.8, you'll need [a compiler installed](#source) and compile
> the library and its Cython dependencies locally. If this is causing problems
> for you, the easiest solution is to **use Python 3.7** in the meantime.
### pip
Using pip, spaCy releases are available as source packages and binary wheels (as
@@ -164,26 +154,26 @@ If you've trained your own models, keep in mind that your training and runtime
inputs must match. After updating spaCy, we recommend **retraining your models**
with the new version.
📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
[migration guide](https://spacy.io/usage/v2#migrating).**
📖 **For details on upgrading from spaCy 2.x to spaCy 3.x, see the
[migration guide](https://spacy.io/usage/v3#migrating).**
## Download models
As of v1.7.0, models for spaCy can be installed as **Python packages**. This
Trained pipelines for spaCy can be installed as **Python packages**. This
means that they're a component of your application, just like any other module.
Models can be installed using spaCy's `download` command, or manually by
pointing pip to a path or URL.
| Documentation | |
| ---------------------- | ------------------------------------------------------------- |
| [Available Models] | Detailed model descriptions, accuracy figures and benchmarks. |
| [Models Documentation] | Detailed usage instructions. |
| Documentation | |
| ---------------------- | ---------------------------------------------------------------- |
| [Available Pipelines] | Detailed pipeline descriptions, accuracy figures and benchmarks. |
| [Models Documentation] | Detailed usage instructions. |
[available models]: https://spacy.io/models
[available pipelines]: https://spacy.io/models
[models documentation]: https://spacy.io/docs/usage/models
```bash
# download best-matching version of specific model for your spaCy installation
# Download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_sm
# pip install .tar.gz archive from path or URL

View File

@@ -89,7 +89,6 @@ def train(
nlp, config = util.load_model_from_config(config)
if config["training"]["vectors"] is not None:
util.load_vectors_into_model(nlp, config["training"]["vectors"])
verify_config(nlp)
raw_text, tag_map, morph_rules, weights_data = load_from_paths(config)
T_cfg = config["training"]
optimizer = T_cfg["optimizer"]
@@ -108,6 +107,8 @@ def train(
nlp.resume_training(sgd=optimizer)
with nlp.select_pipes(disable=[*frozen_components, *resume_components]):
nlp.begin_training(lambda: train_corpus(nlp), sgd=optimizer)
# Verify the config after calling 'begin_training' to ensure labels are properly initialized
verify_config(nlp)
if tag_map:
# Replace tag map with provided mapping
@@ -401,7 +402,7 @@ def verify_cli_args(config_path: Path, output_path: Optional[Path] = None) -> No
def verify_config(nlp: Language) -> None:
"""Perform additional checks based on the config and loaded nlp object."""
"""Perform additional checks based on the config, loaded nlp object and training data."""
# TODO: maybe we should validate based on the actual components, the list
# in config["nlp"]["pipeline"] instead?
for pipe_config in nlp.config["components"].values():
@@ -415,18 +416,13 @@ def verify_textcat_config(nlp: Language, pipe_config: Dict[str, Any]) -> None:
# if 'positive_label' is provided: double check whether it's in the data and
# the task is binary
if pipe_config.get("positive_label"):
textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
textcat_labels = nlp.get_pipe("textcat").labels
pos_label = pipe_config.get("positive_label")
if pos_label not in textcat_labels:
msg.fail(
f"The textcat's 'positive_label' config setting '{pos_label}' "
f"does not match any label in the training data.",
exits=1,
raise ValueError(
Errors.E920.format(pos_label=pos_label, labels=textcat_labels)
)
if len(textcat_labels) != 2:
msg.fail(
f"A textcat 'positive_label' '{pos_label}' was "
f"provided for training data that does not appear to be a "
f"binary classification problem with two labels.",
exits=1,
if len(list(textcat_labels)) != 2:
raise ValueError(
Errors.E919.format(pos_label=pos_label, labels=textcat_labels)
)
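
The reordering above matters because the textcat's labels are only filled in once `begin_training` has seen the data (or explicit `labels` in the config). A minimal sketch of the check that `verify_textcat_config` now performs — simplified, with hypothetical names, not the exact spaCy implementation:

```python
from typing import Iterable, Optional


def check_positive_label(labels: Iterable[str], positive_label: Optional[str]) -> None:
    # Simplified restatement of the two checks in the diff above:
    # the positive label must exist, and the task must be binary.
    labels = list(labels)
    if positive_label is None:
        return
    if positive_label not in labels:
        # corresponds to Errors.E920 in this commit
        raise ValueError(f"positive_label '{positive_label}' not found in {labels}")
    if len(labels) != 2:
        # corresponds to Errors.E919 in this commit
        raise ValueError(f"positive_label '{positive_label}' set for a non-binary task: {labels}")
```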

View File

@@ -480,6 +480,11 @@ class Errors:
E201 = ("Span index out of range.")
# TODO: fix numbering after merging develop into master
E919 = ("A textcat 'positive_label' '{pos_label}' was provided for training "
"data that does not appear to be a binary classification problem "
"with two labels. Labels found: {labels}")
E920 = ("The textcat's 'positive_label' config setting '{pos_label}' "
"does not match any label in the training data. Labels found: {labels}")
E921 = ("The method 'set_output' can only be called on components that have "
"a Model with a 'resize_output' attribute. Otherwise, the output "
"layer can not be dynamically changed.")
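
For illustration, this is how the new error codes are meant to be raised, following the call sites in `verify_textcat_config` above (the labels here are made up):

```python
from spacy.errors import Errors

labels = ("SOME", "THING")
pos_label = "POS"
if pos_label not in labels:
    raise ValueError(Errors.E920.format(pos_label=pos_label, labels=labels))
```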

View File

@@ -56,7 +56,12 @@ subword_features = true
@Language.factory(
"textcat",
assigns=["doc.cats"],
default_config={"labels": [], "threshold": 0.5, "model": DEFAULT_TEXTCAT_MODEL},
default_config={
"labels": [],
"threshold": 0.5,
"positive_label": None,
"model": DEFAULT_TEXTCAT_MODEL,
},
scores=[
"cats_score",
"cats_score_desc",
@@ -74,8 +79,9 @@ def make_textcat(
nlp: Language,
name: str,
model: Model[List[Doc], List[Floats2d]],
labels: Iterable[str],
labels: List[str],
threshold: float,
positive_label: Optional[str],
) -> "TextCategorizer":
"""Create a TextCategorizer compoment. The text categorizer predicts categories
over a whole document. It can learn one or more labels, and the labels can
@@ -88,8 +94,16 @@ def make_textcat(
labels (list): A list of categories to learn. If empty, the model infers the
categories from the data.
threshold (float): Cutoff to consider a prediction "positive".
positive_label (Optional[str]): The positive label for a binary task with exclusive classes, None otherwise.
"""
return TextCategorizer(nlp.vocab, model, name, labels=labels, threshold=threshold)
return TextCategorizer(
nlp.vocab,
model,
name,
labels=labels,
threshold=threshold,
positive_label=positive_label,
)
class TextCategorizer(Pipe):
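
A hedged usage sketch for the extended factory signature, mirroring the tests added later in this commit (label names are illustrative):

```python
from spacy.lang.en import English

nlp = English()
textcat = nlp.add_pipe(
    "textcat",
    config={"labels": ["POS", "NEG"], "threshold": 0.5, "positive_label": "POS"},
)
assert textcat.labels == ("POS", "NEG")
```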
@@ -104,8 +118,9 @@ class TextCategorizer(Pipe):
model: Model,
name: str = "textcat",
*,
labels: Iterable[str],
labels: List[str],
threshold: float,
positive_label: Optional[str],
) -> None:
"""Initialize a text categorizer.
@@ -113,8 +128,9 @@ class TextCategorizer(Pipe):
model (thinc.api.Model): The Thinc Model powering the pipeline component.
name (str): The component instance name, used to add entries to the
losses during training.
labels (Iterable[str]): The labels to use.
labels (List[str]): The labels to use.
threshold (float): Cutoff to consider a prediction "positive".
positive_label (Optional[str]): The positive label for a binary task with exclusive classes, None otherwise.
DOCS: https://nightly.spacy.io/api/textcategorizer#init
"""
@@ -122,7 +138,11 @@ class TextCategorizer(Pipe):
self.model = model
self.name = name
self._rehearsal_model = None
cfg = {"labels": labels, "threshold": threshold}
cfg = {
"labels": labels,
"threshold": threshold,
"positive_label": positive_label,
}
self.cfg = dict(cfg)
@property
@@ -131,10 +151,10 @@ class TextCategorizer(Pipe):
DOCS: https://nightly.spacy.io/api/textcategorizer#labels
"""
return tuple(self.cfg.setdefault("labels", []))
return tuple(self.cfg["labels"])
@labels.setter
def labels(self, value: Iterable[str]) -> None:
def labels(self, value: List[str]) -> None:
self.cfg["labels"] = tuple(value)
def pipe(self, stream: Iterable[Doc], *, batch_size: int = 128) -> Iterator[Doc]:
@@ -353,17 +373,10 @@ class TextCategorizer(Pipe):
sgd = self.create_optimizer()
return sgd
def score(
self,
examples: Iterable[Example],
*,
positive_label: Optional[str] = None,
**kwargs,
) -> Dict[str, Any]:
def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
"""Score a batch of examples.
examples (Iterable[Example]): The examples to score.
positive_label (str): Optional positive label.
RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.
DOCS: https://nightly.spacy.io/api/textcategorizer#score
@@ -374,7 +387,7 @@ class TextCategorizer(Pipe):
"cats",
labels=self.labels,
multi_label=self.model.attrs["multi_label"],
positive_label=positive_label,
positive_label=self.cfg["positive_label"],
threshold=self.cfg["threshold"],
**kwargs,
)
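
Because `positive_label` is now read from the component's `cfg`, evaluation no longer needs a `scorer_cfg` override. A sketch under the nightly v3 API used elsewhere in this diff (training data and label names are illustrative):

```python
from spacy.lang.en import English
from spacy.training import Example

nlp = English()
nlp.add_pipe(
    "textcat",
    config={"model": {"exclusive_classes": True}, "positive_label": "POSITIVE"},
)
examples = [
    Example.from_dict(
        nlp.make_doc(text), {"cats": {"POSITIVE": pos, "NEGATIVE": 1.0 - pos}}
    )
    for text, pos in [("I love it", 1.0), ("I hate it", 0.0)]
]
optimizer = nlp.begin_training(lambda: examples)
for _ in range(5):
    nlp.update(examples, sgd=optimizer)
scores = nlp.evaluate(examples)  # picks up positive_label from cfg
```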

View File

@@ -10,6 +10,7 @@ from spacy.tokens import Doc
from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL
from ..util import make_tempdir
from ...cli.train import verify_textcat_config
from ...training import Example
@@ -130,7 +131,10 @@ def test_overfitting_IO():
fix_random_seed(0)
nlp = English()
# Set exclusive labels
textcat = nlp.add_pipe("textcat", config={"model": {"exclusive_classes": True}})
textcat = nlp.add_pipe(
"textcat",
config={"model": {"exclusive_classes": True}, "positive_label": "POSITIVE"},
)
train_examples = []
for text, annotations in TRAIN_DATA:
train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
@@ -159,7 +163,7 @@ def test_overfitting_IO():
assert cats2["POSITIVE"] + cats2["NEGATIVE"] == pytest.approx(1.0, 0.001)
# Test scoring
scores = nlp.evaluate(train_examples, scorer_cfg={"positive_label": "POSITIVE"})
scores = nlp.evaluate(train_examples)
assert scores["cats_micro_f"] == 1.0
assert scores["cats_score"] == 1.0
assert "cats_score_desc" in scores
@@ -194,3 +198,29 @@ def test_textcat_configs(textcat_config):
for i in range(5):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
def test_positive_class():
nlp = English()
pipe_config = {"positive_label": "POS", "labels": ["POS", "NEG"]}
textcat = nlp.add_pipe("textcat", config=pipe_config)
assert textcat.labels == ("POS", "NEG")
verify_textcat_config(nlp, pipe_config)
def test_positive_class_not_present():
nlp = English()
pipe_config = {"positive_label": "POS", "labels": ["SOME", "THING"]}
textcat = nlp.add_pipe("textcat", config=pipe_config)
assert textcat.labels == ("SOME", "THING")
with pytest.raises(ValueError):
verify_textcat_config(nlp, pipe_config)
def test_positive_class_not_binary():
nlp = English()
pipe_config = {"positive_label": "POS", "labels": ["SOME", "THING", "POS"]}
textcat = nlp.add_pipe("textcat", config=pipe_config)
assert textcat.labels == ("SOME", "THING", "POS")
with pytest.raises(ValueError):
verify_textcat_config(nlp, pipe_config)

View File

@@ -136,7 +136,7 @@ def test_serialize_textcat_empty(en_vocab):
# See issue #1105
cfg = {"model": DEFAULT_TEXTCAT_MODEL}
model = registry.make_from_config(cfg, validate=True)["model"]
textcat = TextCategorizer(en_vocab, model, labels=["ENTITY", "ACTION", "MODIFIER"], threshold=0.5)
textcat = TextCategorizer(en_vocab, model, labels=["ENTITY", "ACTION", "MODIFIER"], threshold=0.5, positive_label=None)
textcat.to_bytes(exclude=["vocab"])

View File

@@ -630,3 +630,49 @@ In addition to the native markdown elements, you can use the components
├── gatsby-node.js # Node-specific hooks for Gatsby
└── package.json # package settings and dependencies
```
## Editorial {#editorial}
- "spaCy" should always be spelled with a lowercase "s" and a capital "C",
unless it specifically refers to the Python package or Python import `spacy`
(in which case it should be formatted as code).
- ✅ spaCy is a library for advanced NLP in Python.
- ❌ Spacy is a library for advanced NLP in Python.
- ✅ First, you need to install the `spacy` package from pip.
- Mentions of code, like function names, classes, variable names etc. in inline
text should be formatted as `code`.
- ✅ "Calling the `nlp` object on a text returns a `Doc`."
- Objects that have pages in the [API docs](/api) should be linked, for
example, [`Doc`](/api/doc) or [`Language.to_disk`](/api/language#to_disk). The
mentions should still be formatted as code within the link. Links pointing to
the API docs will automatically receive a little icon. However, if a paragraph
includes many references to the API, the links can easily get messy. In that
case, we typically only link the first mention of an object and not any
subsequent ones.
- ✅ The [`Span`](/api/span) and [`Token`](/api/token) objects are views of a
[`Doc`](/api/doc). [`Span.as_doc`](/api/span#as_doc) creates a `Doc` object
from a `Span`.
- ❌ The [`Span`](/api/span) and [`Token`](/api/token) objects are views of a
[`Doc`](/api/doc). [`Span.as_doc`](/api/span#as_doc) creates a
[`Doc`](/api/doc) object from a [`Span`](/api/span).
- Other things we format as code are: references to trained pipeline packages
like `en_core_web_sm` or file names like `code.py` or `meta.json`.
- ✅ After training, the `config.cfg` is saved to disk.
- [Type annotations](#type-annotations) are a special type of code formatting,
expressed by wrapping the text in `~~` instead of backticks. The result looks
like this: ~~List[Doc]~~. All references to known types will be linked
automatically.
- ✅ The model has the input type ~~List[Doc]~~ and it outputs a
~~List[Array2d]~~.
- We try to keep links meaningful but short.
- ✅ For details, see the usage guide on
[training with custom code](/usage/training#custom-code).
- ❌ For details, see
[the usage guide on training with custom code](/usage/training#custom-code).
- ❌ For details, see the usage guide on training with custom code
[here](/usage/training#custom-code).

View File

@@ -183,7 +183,7 @@ will be overwritten.
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `match_id` | An ID for the patterns. ~~str~~ |
| `patterns` | A list of match patterns. A pattern consists of a list of dicts, where each dict describes a token in the tree. ~~List[List[Dict[str, Union[str, Dict]]]]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[DependencyMatcher, Doc, int, List[Tuple], Any]]~~ |
## DependencyMatcher.get {#get tag="method"}

View File

@@ -217,7 +217,7 @@ model. Delegates to [`predict`](/api/dependencyparser#predict) and
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |

View File

@@ -85,7 +85,7 @@ providing custom registered functions.
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model~~ |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `kb_loader` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. ~~Callable[[Vocab], KnowledgeBase]~~ |
| `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ |
@@ -218,7 +218,7 @@ pipe's entity linking model and context encoder. Delegates to
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |

View File

@@ -206,7 +206,7 @@ model. Delegates to [`predict`](/api/entityrecognizer#predict) and
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |

View File

@@ -255,7 +255,7 @@ Get all patterns that were added to the entity ruler.
| Name | Description |
| ----------------- | --------------------------------------------------------------------------------------------------------------------- |
| `matcher` | The underlying matcher used to process token patterns. ~~Matcher~~ | |
| `matcher` | The underlying matcher used to process token patterns. ~~Matcher~~ |
| `phrase_matcher` | The underlying phrase matcher, used to process phrase patterns. ~~PhraseMatcher~~ |
| `token_patterns` | The token patterns present in the entity ruler, keyed by label. ~~Dict[str, List[Dict[str, Union[str, List[dict]]]]~~ |
| `phrase_patterns` | The phrase patterns present in the entity ruler, keyed by label. ~~Dict[str, List[Doc]]~~ |

View File

@@ -81,7 +81,7 @@ shortcut for this and instantiate the component using its string name and
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `model` | **Not yet implemented:** The model to use. ~~Model~~ |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| mode | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`. ~~str~~ |
| lookups | A lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. Defaults to `None`. ~~Optional[Lookups]~~ |
| overwrite | Whether to overwrite existing lemmas. ~~bool~~ |

View File

@@ -139,7 +139,7 @@ setting up the label scheme based on the data.
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| **RETURNS** | The optimizer. ~~Optimizer~~ |
@@ -196,7 +196,7 @@ Delegates to [`predict`](/api/morphologizer#predict) and
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |

View File

@@ -150,9 +150,9 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")]
| Name | Description |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `match_id` | str | An ID for the thing you're matching. ~~str~~ |
| `match_id` | An ID for the thing you're matching. ~~str~~ |
| `docs` | `Doc` objects of the phrases to match. ~~List[Doc]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple], Any]]~~ |
## PhraseMatcher.remove {#remove tag="method" new="2.2"}

View File

@@ -187,7 +187,7 @@ predictions and gold-standard annotations, and update the component's model.
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -211,7 +211,7 @@ the "catastrophic forgetting" problem. This feature is experimental.
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |

View File

@@ -192,7 +192,7 @@ Delegates to [`predict`](/api/sentencerecognizer#predict) and
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -216,7 +216,7 @@ the "catastrophic forgetting" problem. This feature is experimental.
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |

View File

@@ -53,7 +53,7 @@ Initialize the sentencizer.
| Name | Description |
| -------------- | ----------------------------------------------------------------------------------------------------------------------- |
| _keyword-only_ | | |
| _keyword-only_ | |
| `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. ~~Optional[List[str]]~~ |
```python

View File

@@ -190,7 +190,7 @@ Delegates to [`predict`](/api/tagger#predict) and
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -214,7 +214,7 @@ the "catastrophic forgetting" problem. This feature is experimental.
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |

View File

@@ -36,11 +36,12 @@ architectures and their arguments and hyperparameters.
> nlp.add_pipe("textcat", config=config)
> ```
| Setting | Description |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `labels` | A list of categories to learn. If empty, the model infers the categories from the data. Defaults to `[]`. ~~Iterable[str]~~ |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
| Setting | Description |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `labels` | A list of categories to learn. If empty, the model infers the categories from the data. Defaults to `[]`. ~~Iterable[str]~~ |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~ |
| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
```python
%%GITHUB_SPACY/spacy/pipeline/textcat.py
@@ -60,21 +61,22 @@
>
> # Construction from class
> from spacy.pipeline import TextCategorizer
> textcat = TextCategorizer(nlp.vocab, model, labels=[], threshold=0.5)
> textcat = TextCategorizer(nlp.vocab, model, labels=[], threshold=0.5, positive_label="POS")
> ```
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#create_pipe).
| Name | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
| _keyword-only_ | |
| `labels` | The labels to use. ~~Iterable[str]~~ |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
| Name | Description |
| ---------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
| _keyword-only_ | |
| `labels` | The labels to use. ~~Iterable[str]~~ |
| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise. ~~Optional[str]~~ |
## TextCategorizer.\_\_call\_\_ {#call tag="method"}
@@ -201,7 +203,7 @@ Delegates to [`predict`](/api/textcategorizer#predict) and
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -225,7 +227,7 @@ the "catastrophic forgetting" problem. This feature is experimental.
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
@@ -263,7 +265,7 @@ Score a batch of examples.
| Name | Description |
| ---------------- | -------------------------------------------------------------------------------------------------------------------- |
| `examples` | The examples to score. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `positive_label` | Optional positive label. ~~Optional[str]~~ |
| **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |

View File

@@ -144,7 +144,7 @@ setting up the label scheme based on the data.
| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
| **RETURNS** | The optimizer. ~~Optimizer~~ |
@@ -200,7 +200,7 @@ Delegates to [`predict`](/api/tok2vec#predict).
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
| _keyword-only_ | | |
| _keyword-only_ | |
| `drop` | The dropout rate. ~~float~~ |
| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |

View File

@@ -11,6 +11,7 @@ menu:
- ['Setup & Installation', 'setup']
- ['Markdown Reference', 'markdown']
- ['Project Structure', 'structure']
- ['Editorial', 'editorial']
sidebar:
- label: Styleguide
items:

View File

@@ -27,6 +27,7 @@ const Quickstart = ({
hidePrompts,
small,
codeLang,
Container = Section,
children,
}) => {
const contentRef = useRef()
@@ -83,7 +84,7 @@
}, [data, initialized])
return !data.length ? null : (
<Section id={id}>
<Container id={id}>
<div className={classNames(classes.root, { [classes.hidePrompts]: !!hidePrompts })}>
{title && (
<H2 className={classes.title} name={id}>
@@ -249,7 +250,7 @@
</pre>
{showCopy && <textarea ref={copyAreaRef} className={classes.copyArea} rows={1} />}
</div>
</Section>
</Container>
)
}

View File

@@ -41,3 +41,7 @@
&:before
content: ""
.ul .ul &
text-indent: initial
margin-left: -20px

View File

@@ -87,6 +87,8 @@ export default function QuickstartTraining({ id, title, download = 'base_config.
.sort((a, b) => a.title.localeCompare(b.title))
return (
<Quickstart
id="quickstart-widget"
Container="div"
download={download}
rawContent={content}
data={DATA}