Mirror of https://github.com/explosion/spaCy.git (synced 2024-12-25 09:26:27 +03:00)

Merge branch 'develop' of https://github.com/explosion/spaCy into develop

This commit is contained in: adf0bab23a

README.md | 78 lines changed
@@ -4,17 +4,19 @@
 spaCy is a library for advanced Natural Language Processing in Python and
 Cython. It's built on the very latest research, and was designed from day one to
-be used in real products. spaCy comes with
-[pretrained statistical models](https://spacy.io/models) and word vectors, and
-currently supports tokenization for **60+ languages**. It features
+be used in real products.
+
+spaCy comes with
+[pretrained pipelines](https://spacy.io/models) and vectors, and
+currently supports tokenization for **59+ languages**. It features
 state-of-the-art speed, convolutional **neural network models** for tagging,
-parsing and **named entity recognition** and easy **deep learning** integration.
-It's commercial open-source software, released under the MIT license.
+parsing, **named entity recognition**, **text classification** and more, multi-task learning with pretrained **transformers** like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management.
+spaCy is commercial open-source software, released under the MIT license.

 💫 **Version 2.3 out now!**
 [Check out the release notes here.](https://github.com/explosion/spaCy/releases)

 [![Azure Pipelines](<https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build+(3.x)>)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
 [![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
 [![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square&logo=github)](https://github.com/explosion/spaCy/releases)
 [![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy/)
 [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/spacy)
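For orientation, a minimal usage sketch of the library described above — not taken from the diff itself, and assuming a trained pipeline such as `en_core_web_sm` has already been installed:

```python
import spacy

# Assumes `python -m spacy download en_core_web_sm` has been run beforehand.
nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy was developed by Explosion in Berlin.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```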
@@ -31,7 +33,7 @@ It's commercial open-source software, released under the MIT license.
 | --------------- | -------------------------------------------------------------- |
 | [spaCy 101] | New to spaCy? Here's everything you need to know! |
 | [Usage Guides] | How to use spaCy and its features. |
-| [New in v2.3] | New features, backwards incompatibilities and migration guide. |
+| [New in v3.0] | New features, backwards incompatibilities and migration guide. |
 | [API Reference] | The detailed reference for spaCy's API. |
 | [Models] | Download statistical language models for spaCy. |
 | [Universe] | Libraries, extensions, demos, books and courses. |
@@ -39,7 +41,7 @@ It's commercial open-source software, released under the MIT license.
 | [Contribute] | How to contribute to the spaCy project and code base. |

 [spacy 101]: https://spacy.io/usage/spacy-101
-[new in v2.3]: https://spacy.io/usage/v2-3
+[new in v3.0]: https://spacy.io/usage/v3
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
@@ -56,34 +58,29 @@ be able to provide individual support via email. We also believe that help is
 much more valuable if it's shared publicly, so that more people can benefit from
 it.

-| Type | Platforms |
-| ------------------------ | ------------------------------------------------------ |
-| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
-| 🎁 **Feature Requests** | [GitHub Issue Tracker] |
-| 👩💻 **Usage Questions** | [Stack Overflow] · [Gitter Chat] · [Reddit User Group] |
-| 🗯 **General Discussion** | [Gitter Chat] · [Reddit User Group] |
+| Type | Platforms |
+| ----------------------- | ---------------------- |
+| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
+| 🎁 **Feature Requests** | [GitHub Issue Tracker] |
+| 👩💻 **Usage Questions** | [Stack Overflow] |

 [github issue tracker]: https://github.com/explosion/spaCy/issues
 [stack overflow]: https://stackoverflow.com/questions/tagged/spacy
-[gitter chat]: https://gitter.im/explosion/spaCy
-[reddit user group]: https://www.reddit.com/r/spacynlp

 ## Features

-- Non-destructive **tokenization**
-- **Named entity** recognition
-- Support for **50+ languages**
-- pretrained [statistical models](https://spacy.io/models) and word vectors
+- Support for **59+ languages**
+- **Trained pipelines**
+- Multi-task learning with pretrained **transformers** like BERT
+- Pretrained **word vectors**
 - State-of-the-art speed
-- Easy **deep learning** integration
-- Part-of-speech tagging
-- Labelled dependency parsing
-- Syntax-driven sentence segmentation
+- Production-ready **training system**
+- Linguistically-motivated **tokenization**
+- Components for named **entity recognition**, part-of-speech-tagging, dependency parsing, sentence segmentation, **text classification**, lemmatization, morphological analysis, entity linking and more
+- Easily extensible with **custom components** and attributes
+- Support for custom models in **PyTorch**, **TensorFlow** and other frameworks
 - Built in **visualizers** for syntax and NER
-- Convenient string-to-hash mapping
-- Export to numpy data arrays
-- Efficient binary serialization
-- Easy **model packaging** and deployment
+- Easy **model packaging**, deployment and workflow management
 - Robust, rigorously evaluated accuracy

 📖 **For more details, see the
@@ -102,13 +99,6 @@ For detailed installation instructions, see the
 [pip]: https://pypi.org/project/spacy/
 [conda]: https://anaconda.org/conda-forge/spacy

-> ⚠️ **Important note for Python 3.8:** We can't yet ship pre-compiled binary
-> wheels for spaCy that work on Python 3.8, as we're still waiting for our CI
-> providers and other tooling to support it. This means that in order to run
-> spaCy on Python 3.8, you'll need [a compiler installed](#source) and compile
-> the library and its Cython dependencies locally. If this is causing problems
-> for you, the easiest solution is to **use Python 3.7** in the meantime.
-
 ### pip

 Using pip, spaCy releases are available as source packages and binary wheels (as
@@ -164,26 +154,26 @@ If you've trained your own models, keep in mind that your training and runtime
 inputs must match. After updating spaCy, we recommend **retraining your models**
 with the new version.

-📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
-[migration guide](https://spacy.io/usage/v2#migrating).**
+📖 **For details on upgrading from spaCy 2.x to spaCy 3.x, see the
+[migration guide](https://spacy.io/usage/v3#migrating).**

 ## Download models

-As of v1.7.0, models for spaCy can be installed as **Python packages**. This
+Trained pipelines for spaCy can be installed as **Python packages**. This
 means that they're a component of your application, just like any other module.
 Models can be installed using spaCy's `download` command, or manually by
 pointing pip to a path or URL.

-| Documentation | |
-| ---------------------- | ------------------------------------------------------------- |
-| [Available Models] | Detailed model descriptions, accuracy figures and benchmarks. |
-| [Models Documentation] | Detailed usage instructions. |
+| Documentation | |
+| ---------------------- | ---------------------------------------------------------------- |
+| [Available Pipelines] | Detailed pipeline descriptions, accuracy figures and benchmarks. |
+| [Models Documentation] | Detailed usage instructions. |

-[available models]: https://spacy.io/models
+[available pipelines]: https://spacy.io/models
 [models documentation]: https://spacy.io/docs/usage/models

 ```bash
-# download best-matching version of specific model for your spaCy installation
+# Download best-matching version of specific model for your spaCy installation
 python -m spacy download en_core_web_sm

 # pip install .tar.gz archive from path or URL
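Once a package has been downloaded, it can be loaded either by name or via its importable module. A short sketch (not part of the diff; assumes `en_core_web_sm` is installed):

```python
import spacy

# Load by package name after `python -m spacy download en_core_web_sm`.
nlp = spacy.load("en_core_web_sm")

# Or import the installed package directly and call its load() entry point.
import en_core_web_sm
nlp = en_core_web_sm.load()
```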
@@ -89,7 +89,6 @@ def train(
     nlp, config = util.load_model_from_config(config)
     if config["training"]["vectors"] is not None:
         util.load_vectors_into_model(nlp, config["training"]["vectors"])
-    verify_config(nlp)
     raw_text, tag_map, morph_rules, weights_data = load_from_paths(config)
     T_cfg = config["training"]
     optimizer = T_cfg["optimizer"]
@@ -108,6 +107,8 @@ def train(
             nlp.resume_training(sgd=optimizer)
     with nlp.select_pipes(disable=[*frozen_components, *resume_components]):
         nlp.begin_training(lambda: train_corpus(nlp), sgd=optimizer)
+    # Verify the config after calling 'begin_training' to ensure labels are properly initialized
+    verify_config(nlp)

     if tag_map:
         # Replace tag map with provided mapping
@@ -401,7 +402,7 @@ def verify_cli_args(config_path: Path, output_path: Optional[Path] = None) -> No


 def verify_config(nlp: Language) -> None:
-    """Perform additional checks based on the config and loaded nlp object."""
+    """Perform additional checks based on the config, loaded nlp object and training data."""
     # TODO: maybe we should validate based on the actual components, the list
     # in config["nlp"]["pipeline"] instead?
     for pipe_config in nlp.config["components"].values():
@@ -415,18 +416,13 @@ def verify_textcat_config(nlp: Language, pipe_config: Dict[str, Any]) -> None:
     # if 'positive_label' is provided: double check whether it's in the data and
     # the task is binary
     if pipe_config.get("positive_label"):
-        textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
+        textcat_labels = nlp.get_pipe("textcat").labels
         pos_label = pipe_config.get("positive_label")
         if pos_label not in textcat_labels:
-            msg.fail(
-                f"The textcat's 'positive_label' config setting '{pos_label}' "
-                f"does not match any label in the training data.",
-                exits=1,
+            raise ValueError(
+                Errors.E920.format(pos_label=pos_label, labels=textcat_labels)
             )
-        if len(textcat_labels) != 2:
-            msg.fail(
-                f"A textcat 'positive_label' '{pos_label}' was "
-                f"provided for training data that does not appear to be a "
-                f"binary classification problem with two labels.",
-                exits=1,
+        if len(list(textcat_labels)) != 2:
+            raise ValueError(
+                Errors.E919.format(pos_label=pos_label, labels=textcat_labels)
             )
@@ -480,6 +480,11 @@ class Errors:
     E201 = ("Span index out of range.")

     # TODO: fix numbering after merging develop into master
+    E919 = ("A textcat 'positive_label' '{pos_label}' was provided for training "
+            "data that does not appear to be a binary classification problem "
+            "with two labels. Labels found: {labels}")
+    E920 = ("The textcat's 'positive_label' config setting '{pos_label}' "
+            "does not match any label in the training data. Labels found: {labels}")
    E921 = ("The method 'set_output' can only be called on components that have "
            "a Model with a 'resize_output' attribute. Otherwise, the output "
            "layer can not be dynamically changed.")
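The two new error codes back the stricter validation in `verify_textcat_config` above: an unknown `positive_label` or a non-binary label set now raises `ValueError` rather than printing a CLI failure message. A rough sketch of what that looks like from the caller's side, assuming the spaCy v3 nightly API in this diff (label names are illustrative):

```python
import spacy
from spacy.cli.train import verify_textcat_config

nlp = spacy.blank("en")
pipe_config = {"positive_label": "POS", "labels": ["SOME", "THING"]}
nlp.add_pipe("textcat", config=pipe_config)
try:
    verify_textcat_config(nlp, pipe_config)   # "POS" is not among the labels
except ValueError as err:
    print(err)                                # message built from Errors.E920
```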
@@ -56,7 +56,12 @@ subword_features = true
 @Language.factory(
     "textcat",
     assigns=["doc.cats"],
-    default_config={"labels": [], "threshold": 0.5, "model": DEFAULT_TEXTCAT_MODEL},
+    default_config={
+        "labels": [],
+        "threshold": 0.5,
+        "positive_label": None,
+        "model": DEFAULT_TEXTCAT_MODEL,
+    },
     scores=[
         "cats_score",
         "cats_score_desc",
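With `positive_label` now part of the factory defaults, a binary textcat can be configured entirely through `nlp.add_pipe`. A hedged sketch (the `POS`/`NEG` label names are illustrative, not from the diff):

```python
import spacy

nlp = spacy.blank("en")
# The new "positive_label" default (None) can be overridden in the pipe config;
# it is stored in the component's cfg and picked up again at scoring time.
textcat = nlp.add_pipe(
    "textcat",
    config={"labels": ["POS", "NEG"], "threshold": 0.5, "positive_label": "POS"},
)
print(textcat.labels)  # ("POS", "NEG")
```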
@@ -74,8 +79,9 @@ def make_textcat(
     nlp: Language,
     name: str,
     model: Model[List[Doc], List[Floats2d]],
-    labels: Iterable[str],
+    labels: List[str],
     threshold: float,
+    positive_label: Optional[str],
 ) -> "TextCategorizer":
     """Create a TextCategorizer compoment. The text categorizer predicts categories
     over a whole document. It can learn one or more labels, and the labels can
@@ -88,8 +94,16 @@ def make_textcat(
     labels (list): A list of categories to learn. If empty, the model infers the
         categories from the data.
     threshold (float): Cutoff to consider a prediction "positive".
+    positive_label (Optional[str]): The positive label for a binary task with exclusive classes, None otherwise.
     """
-    return TextCategorizer(nlp.vocab, model, name, labels=labels, threshold=threshold)
+    return TextCategorizer(
+        nlp.vocab,
+        model,
+        name,
+        labels=labels,
+        threshold=threshold,
+        positive_label=positive_label,
+    )


 class TextCategorizer(Pipe):
@@ -104,8 +118,9 @@ class TextCategorizer(Pipe):
         model: Model,
         name: str = "textcat",
         *,
-        labels: Iterable[str],
+        labels: List[str],
         threshold: float,
+        positive_label: Optional[str],
     ) -> None:
         """Initialize a text categorizer.

@@ -113,8 +128,9 @@ class TextCategorizer(Pipe):
         model (thinc.api.Model): The Thinc Model powering the pipeline component.
         name (str): The component instance name, used to add entries to the
             losses during training.
-        labels (Iterable[str]): The labels to use.
+        labels (List[str]): The labels to use.
         threshold (float): Cutoff to consider a prediction "positive".
+        positive_label (Optional[str]): The positive label for a binary task with exclusive classes, None otherwise.

         DOCS: https://nightly.spacy.io/api/textcategorizer#init
         """
@@ -122,7 +138,11 @@ class TextCategorizer(Pipe):
         self.model = model
         self.name = name
         self._rehearsal_model = None
-        cfg = {"labels": labels, "threshold": threshold}
+        cfg = {
+            "labels": labels,
+            "threshold": threshold,
+            "positive_label": positive_label,
+        }
         self.cfg = dict(cfg)

     @property
@@ -131,10 +151,10 @@ class TextCategorizer(Pipe):

         DOCS: https://nightly.spacy.io/api/textcategorizer#labels
         """
-        return tuple(self.cfg.setdefault("labels", []))
+        return tuple(self.cfg["labels"])

     @labels.setter
-    def labels(self, value: Iterable[str]) -> None:
+    def labels(self, value: List[str]) -> None:
         self.cfg["labels"] = tuple(value)

     def pipe(self, stream: Iterable[Doc], *, batch_size: int = 128) -> Iterator[Doc]:
@@ -353,17 +373,10 @@ class TextCategorizer(Pipe):
             sgd = self.create_optimizer()
         return sgd

-    def score(
-        self,
-        examples: Iterable[Example],
-        *,
-        positive_label: Optional[str] = None,
-        **kwargs,
-    ) -> Dict[str, Any]:
+    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
         """Score a batch of examples.

         examples (Iterable[Example]): The examples to score.
-        positive_label (str): Optional positive label.
         RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.

         DOCS: https://nightly.spacy.io/api/textcategorizer#score
@@ -374,7 +387,7 @@ class TextCategorizer(Pipe):
             "cats",
             labels=self.labels,
             multi_label=self.model.attrs["multi_label"],
-            positive_label=positive_label,
+            positive_label=self.cfg["positive_label"],
             threshold=self.cfg["threshold"],
             **kwargs,
         )
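Because `score` now reads `positive_label` straight from `self.cfg`, callers no longer pass it through `scorer_cfg`. A minimal sketch of the resulting call site, assuming the spaCy v3 nightly behaviour in this diff (labels and texts are illustrative):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("textcat", config={"labels": ["POS", "NEG"], "positive_label": "POS"})
examples = [
    Example.from_dict(nlp.make_doc("great"), {"cats": {"POS": 1.0, "NEG": 0.0}}),
    Example.from_dict(nlp.make_doc("awful"), {"cats": {"POS": 0.0, "NEG": 1.0}}),
]
nlp.begin_training(lambda: examples)
scores = nlp.evaluate(examples)      # scorer_cfg={"positive_label": ...} no longer needed
print(scores["cats_score"])          # F-score for the configured positive label
```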
@@ -10,6 +10,7 @@ from spacy.tokens import Doc
 from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL

 from ..util import make_tempdir
+from ...cli.train import verify_textcat_config
 from ...training import Example


@@ -130,7 +131,10 @@ def test_overfitting_IO():
     fix_random_seed(0)
     nlp = English()
     # Set exclusive labels
-    textcat = nlp.add_pipe("textcat", config={"model": {"exclusive_classes": True}})
+    textcat = nlp.add_pipe(
+        "textcat",
+        config={"model": {"exclusive_classes": True}, "positive_label": "POSITIVE"},
+    )
     train_examples = []
     for text, annotations in TRAIN_DATA:
         train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
@@ -159,7 +163,7 @@ def test_overfitting_IO():
     assert cats2["POSITIVE"] + cats2["NEGATIVE"] == pytest.approx(1.0, 0.001)

     # Test scoring
-    scores = nlp.evaluate(train_examples, scorer_cfg={"positive_label": "POSITIVE"})
+    scores = nlp.evaluate(train_examples)
     assert scores["cats_micro_f"] == 1.0
     assert scores["cats_score"] == 1.0
     assert "cats_score_desc" in scores
@@ -194,3 +198,29 @@ def test_textcat_configs(textcat_config):
     for i in range(5):
         losses = {}
         nlp.update(train_examples, sgd=optimizer, losses=losses)
+
+
+def test_positive_class():
+    nlp = English()
+    pipe_config = {"positive_label": "POS", "labels": ["POS", "NEG"]}
+    textcat = nlp.add_pipe("textcat", config=pipe_config)
+    assert textcat.labels == ("POS", "NEG")
+    verify_textcat_config(nlp, pipe_config)
+
+
+def test_positive_class_not_present():
+    nlp = English()
+    pipe_config = {"positive_label": "POS", "labels": ["SOME", "THING"]}
+    textcat = nlp.add_pipe("textcat", config=pipe_config)
+    assert textcat.labels == ("SOME", "THING")
+    with pytest.raises(ValueError):
+        verify_textcat_config(nlp, pipe_config)
+
+
+def test_positive_class_not_binary():
+    nlp = English()
+    pipe_config = {"positive_label": "POS", "labels": ["SOME", "THING", "POS"]}
+    textcat = nlp.add_pipe("textcat", config=pipe_config)
+    assert textcat.labels == ("SOME", "THING", "POS")
+    with pytest.raises(ValueError):
+        verify_textcat_config(nlp, pipe_config)
@@ -136,7 +136,7 @@ def test_serialize_textcat_empty(en_vocab):
     # See issue #1105
     cfg = {"model": DEFAULT_TEXTCAT_MODEL}
     model = registry.make_from_config(cfg, validate=True)["model"]
-    textcat = TextCategorizer(en_vocab, model, labels=["ENTITY", "ACTION", "MODIFIER"], threshold=0.5)
+    textcat = TextCategorizer(en_vocab, model, labels=["ENTITY", "ACTION", "MODIFIER"], threshold=0.5, positive_label=None)
     textcat.to_bytes(exclude=["vocab"])

@@ -630,3 +630,49 @@ In addition to the native markdown elements, you can use the components
 ├── gatsby-node.js # Node-specific hooks for Gatsby
 └── package.json # package settings and dependencies
 ```
+
+## Editorial {#editorial}
+
+- "spaCy" should always be spelled with a lowercase "s" and a capital "C",
+  unless it specifically refers to the Python package or Python import `spacy`
+  (in which case it should be formatted as code).
+  - ✅ spaCy is a library for advanced NLP in Python.
+  - ❌ Spacy is a library for advanced NLP in Python.
+  - ✅ First, you need to install the `spacy` package from pip.
+- Mentions of code, like function names, classes, variable names etc. in inline
+  text should be formatted as `code`.
+  - ✅ "Calling the `nlp` object on a text returns a `Doc`."
+- Objects that have pages in the [API docs](/api) should be linked – for
+  example, [`Doc`](/api/doc) or [`Language.to_disk`](/api/language#to_disk). The
+  mentions should still be formatted as code within the link. Links pointing to
+  the API docs will automatically receive a little icon. However, if a paragraph
+  includes many references to the API, the links can easily get messy. In that
+  case, we typically only link the first mention of an object and not any
+  subsequent ones.
+  - ✅ The [`Span`](/api/span) and [`Token`](/api/token) objects are views of a
+    [`Doc`](/api/doc). [`Span.as_doc`](/api/span#as_doc) creates a `Doc` object
+    from a `Span`.
+  - ❌ The [`Span`](/api/span) and [`Token`](/api/token) objects are views of a
+    [`Doc`](/api/doc). [`Span.as_doc`](/api/span#as_doc) creates a
+    [`Doc`](/api/doc) object from a [`Span`](/api/span).
+
+* Other things we format as code are: references to trained pipeline packages
+  like `en_core_web_sm` or file names like `code.py` or `meta.json`.
+
+  - ✅ After training, the `config.cfg` is saved to disk.
+
+* [Type annotations](#type-annotations) are a special type of code formatting,
+  expressed by wrapping the text in `~~` instead of backticks. The result looks
+  like this: ~~List[Doc]~~. All references to known types will be linked
+  automatically.
+
+  - ✅ The model has the input type ~~List[Doc]~~ and it outputs a
+    ~~List[Array2d]~~.
+
+* We try to keep links meaningful but short.
+  - ✅ For details, see the usage guide on
+    [training with custom code](/usage/training#custom-code).
+  - ❌ For details, see
+    [the usage guide on training with custom code](/usage/training#custom-code).
+  - ❌ For details, see the usage guide on training with custom code
+    [here](/usage/training#custom-code).
@@ -183,7 +183,7 @@ will be overwritten.
 | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `match_id` | An ID for the patterns. ~~str~~ |
 | `patterns` | A list of match patterns. A pattern consists of a list of dicts, where each dict describes a token in the tree. ~~List[List[Dict[str, Union[str, Dict]]]]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[DependencyMatcher, Doc, int, List[Tuple], Any]]~~ |

 ## DependencyMatcher.get {#get tag="method"}
@@ -217,7 +217,7 @@ model. Delegates to [`predict`](/api/dependencyparser#predict) and
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -85,7 +85,7 @@ providing custom registered functions.
 | `vocab` | The shared vocabulary. ~~Vocab~~ |
 | `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model~~ |
 | `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `kb_loader` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. ~~Callable[[Vocab], KnowledgeBase]~~ |
 | `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
 | `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ |

@@ -218,7 +218,7 @@ pipe's entity linking model and context encoder. Delegates to
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -206,7 +206,7 @@ model. Delegates to [`predict`](/api/entityrecognizer#predict) and
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -255,7 +255,7 @@ Get all patterns that were added to the entity ruler.

 | Name | Description |
 | ----------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `matcher` | The underlying matcher used to process token patterns. ~~Matcher~~ | |
+| `matcher` | The underlying matcher used to process token patterns. ~~Matcher~~ |
 | `phrase_matcher` | The underlying phrase matcher, used to process phrase patterns. ~~PhraseMatcher~~ |
 | `token_patterns` | The token patterns present in the entity ruler, keyed by label. ~~Dict[str, List[Dict[str, Union[str, List[dict]]]]~~ |
 | `phrase_patterns` | The phrase patterns present in the entity ruler, keyed by label. ~~Dict[str, List[Doc]]~~ |
@@ -81,7 +81,7 @@ shortcut for this and instantiate the component using its string name and
 | `vocab` | The shared vocabulary. ~~Vocab~~ |
 | `model` | **Not yet implemented:** The model to use. ~~Model~~ |
 | `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | mode | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`. ~~str~~ |
 | lookups | A lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. Defaults to `None`. ~~Optional[Lookups]~~ |
 | overwrite | Whether to overwrite existing lemmas. ~~bool~ |
@@ -139,7 +139,7 @@ setting up the label scheme based on the data.
 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
 | `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
 | **RETURNS** | The optimizer. ~~Optimizer~~ |

@@ -196,7 +196,7 @@ Delegates to [`predict`](/api/morphologizer#predict) and
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -150,9 +150,9 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")]

 | Name | Description |
 | -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `match_id` | str | An ID for the thing you're matching. ~~str~~ |
+| `match_id` | An ID for the thing you're matching. ~~str~~ | |
 | `docs` | `Doc` objects of the phrases to match. ~~List[Doc]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple], Any]]~~ |

 ## PhraseMatcher.remove {#remove tag="method" new="2.2"}
@@ -187,7 +187,7 @@ predictions and gold-standard annotations, and update the component's model.
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |

@@ -211,7 +211,7 @@ the "catastrophic forgetting" problem. This feature is experimental.
 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------ |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
 | `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
@@ -192,7 +192,7 @@ Delegates to [`predict`](/api/sentencerecognizer#predict) and
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |

@@ -216,7 +216,7 @@ the "catastrophic forgetting" problem. This feature is experimental.
 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------ |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
 | `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
@@ -53,7 +53,7 @@ Initialize the sentencizer.

 | Name | Description |
 | -------------- | ----------------------------------------------------------------------------------------------------------------------- |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. ~~Optional[List[str]]~~ |

 ```python
@@ -190,7 +190,7 @@ Delegates to [`predict`](/api/tagger#predict) and
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |

@@ -214,7 +214,7 @@ the "catastrophic forgetting" problem. This feature is experimental.
 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------ |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
 | `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
@@ -36,11 +36,12 @@ architectures and their arguments and hyperparameters.
 > nlp.add_pipe("textcat", config=config)
 > ```

-| Setting | Description |
-| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `labels` | A list of categories to learn. If empty, the model infers the categories from the data. Defaults to `[]`. ~~Iterable[str]~~ |
-| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
-| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
+| Setting | Description |
+| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `labels` | A list of categories to learn. If empty, the model infers the categories from the data. Defaults to `[]`. ~~Iterable[str]~~ |
+| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
+| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~ |
+| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |

 ```python
 %%GITHUB_SPACY/spacy/pipeline/textcat.py
@@ -60,21 +61,22 @@ architectures and their arguments and hyperparameters.
 >
 > # Construction from class
 > from spacy.pipeline import TextCategorizer
-> textcat = TextCategorizer(nlp.vocab, model, labels=[], threshold=0.5)
+> textcat = TextCategorizer(nlp.vocab, model, labels=[], threshold=0.5, positive_label="POS")
 > ```

 Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#create_pipe).

-| Name | Description |
-| -------------- | -------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | The shared vocabulary. ~~Vocab~~ |
-| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
-| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
-| _keyword-only_ | |
-| `labels` | The labels to use. ~~Iterable[str]~~ |
-| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
+| Name | Description |
+| ---------------- | -------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | The shared vocabulary. ~~Vocab~~ |
+| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
+| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
+| _keyword-only_ | |
+| `labels` | The labels to use. ~~Iterable[str]~~ |
+| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
+| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise. ~~Optional[str]~~ |

 ## TextCategorizer.\_\_call\_\_ {#call tag="method"}
@@ -201,7 +203,7 @@ Delegates to [`predict`](/api/textcategorizer#predict) and
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |

@@ -225,7 +227,7 @@ the "catastrophic forgetting" problem. This feature is experimental.
 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------ |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
 | `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |

@@ -263,7 +265,7 @@ Score a batch of examples.
 | Name | Description |
 | ---------------- | -------------------------------------------------------------------------------------------------------------------- |
 | `examples` | The examples to score. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `positive_label` | Optional positive label. ~~Optional[str]~~ |
 | **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |

@@ -144,7 +144,7 @@ setting up the label scheme based on the data.
 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
 | `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
 | **RETURNS** | The optimizer. ~~Optimizer~~ |

@@ -200,7 +200,7 @@ Delegates to [`predict`](/api/tok2vec#predict).
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -11,6 +11,7 @@ menu:
   - ['Setup & Installation', 'setup']
   - ['Markdown Reference', 'markdown']
   - ['Project Structure', 'structure']
+  - ['Editorial', 'editorial']
 sidebar:
   - label: Styleguide
     items:
@@ -27,6 +27,7 @@ const Quickstart = ({
     hidePrompts,
     small,
     codeLang,
+    Container = Section,
     children,
 }) => {
     const contentRef = useRef()

@@ -83,7 +84,7 @@ const Quickstart = ({
     }, [data, initialized])

     return !data.length ? null : (
-        <Section id={id}>
+        <Container id={id}>
             <div className={classNames(classes.root, { [classes.hidePrompts]: !!hidePrompts })}>
                 {title && (
                     <H2 className={classes.title} name={id}>

@@ -249,7 +250,7 @@ const Quickstart = ({
                 </pre>
                 {showCopy && <textarea ref={copyAreaRef} className={classes.copyArea} rows={1} />}
             </div>
-        </Section>
+        </Container>
     )
 }

@@ -41,3 +41,7 @@

     &:before
         content: ""
+
+    .ul .ul &
+        text-indent: initial
+        margin-left: -20px
@@ -87,6 +87,8 @@ export default function QuickstartTraining({ id, title, download = 'base_config.
         .sort((a, b) => a.title.localeCompare(b.title))
     return (
         <Quickstart
+            id="quickstart-widget"
+            Container="div"
             download={download}
             rawContent={content}
             data={DATA}