diff --git a/README.md b/README.md
index cef2a1fdd..d23051af0 100644
--- a/README.md
+++ b/README.md
@@ -4,17 +4,19 @@
 spaCy is a library for advanced Natural Language Processing in Python and
 Cython. It's built on the very latest research, and was designed from day one to
-be used in real products. spaCy comes with
-[pretrained statistical models](https://spacy.io/models) and word vectors, and
-currently supports tokenization for **60+ languages**. It features
+be used in real products.
+
+spaCy comes with
+[pretrained pipelines](https://spacy.io/models) and vectors, and
+currently supports tokenization for **59+ languages**. It features
 state-of-the-art speed, convolutional **neural network models** for tagging,
-parsing and **named entity recognition** and easy **deep learning** integration.
-It's commercial open-source software, released under the MIT license.
+parsing, **named entity recognition**, **text classification** and more, multi-task learning with pretrained **transformers** like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management.
+spaCy is commercial open-source software, released under the MIT license.

 💫 **Version 2.3 out now!**
 [Check out the release notes here.](https://github.com/explosion/spaCy/releases)

-[![Azure Pipelines]()](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
+[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
 [![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square&logo=github)](https://github.com/explosion/spaCy/releases)
 [![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy/)
 [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/spacy)
@@ -31,7 +33,7 @@ It's commercial open-source software, released under the MIT license.
 | --------------- | -------------------------------------------------------------- |
 | [spaCy 101] | New to spaCy? Here's everything you need to know! |
 | [Usage Guides] | How to use spaCy and its features. |
-| [New in v2.3] | New features, backwards incompatibilities and migration guide. |
+| [New in v3.0] | New features, backwards incompatibilities and migration guide. |
 | [API Reference] | The detailed reference for spaCy's API. |
 | [Models] | Download statistical language models for spaCy. |
 | [Universe] | Libraries, extensions, demos, books and courses. |
@@ -39,7 +41,7 @@ It's commercial open-source software, released under the MIT license.
 | [Contribute] | How to contribute to the spaCy project and code base. |

 [spacy 101]: https://spacy.io/usage/spacy-101
-[new in v2.3]: https://spacy.io/usage/v2-3
+[new in v3.0]: https://spacy.io/usage/v3
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
@@ -56,34 +58,29 @@ be able to provide individual support via email. We also believe that help is
 much more valuable if it's shared publicly, so that more people can benefit
 from it.
-| Type | Platforms |
-| ------------------------ | ------------------------------------------------------ |
-| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
-| 🎁 **Feature Requests** | [GitHub Issue Tracker] |
-| 👩‍💻 **Usage Questions** | [Stack Overflow] · [Gitter Chat] · [Reddit User Group] |
-| 🗯 **General Discussion** | [Gitter Chat] · [Reddit User Group] |
+| Type | Platforms |
+| ----------------------- | ---------------------- |
+| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
+| 🎁 **Feature Requests** | [GitHub Issue Tracker] |
+| 👩‍💻 **Usage Questions** | [Stack Overflow] |

 [github issue tracker]: https://github.com/explosion/spaCy/issues
 [stack overflow]: https://stackoverflow.com/questions/tagged/spacy
-[gitter chat]: https://gitter.im/explosion/spaCy
-[reddit user group]: https://www.reddit.com/r/spacynlp

 ## Features

-- Non-destructive **tokenization**
-- **Named entity** recognition
-- Support for **50+ languages**
-- pretrained [statistical models](https://spacy.io/models) and word vectors
+- Support for **59+ languages**
+- **Trained pipelines**
+- Multi-task learning with pretrained **transformers** like BERT
+- Pretrained **word vectors**
 - State-of-the-art speed
-- Easy **deep learning** integration
-- Part-of-speech tagging
-- Labelled dependency parsing
-- Syntax-driven sentence segmentation
+- Production-ready **training system**
+- Linguistically-motivated **tokenization**
+- Components for **named entity recognition**, part-of-speech tagging, dependency parsing, sentence segmentation, **text classification**, lemmatization, morphological analysis, entity linking and more
+- Easily extensible with **custom components** and attributes
+- Support for custom models in **PyTorch**, **TensorFlow** and other frameworks
 - Built in **visualizers** for syntax and NER
-- Convenient string-to-hash mapping
-- Export to numpy data arrays
-- Efficient binary serialization
-- Easy **model packaging** and deployment
+- Easy **model packaging**, deployment and workflow management
 - Robust, rigorously evaluated accuracy

 📖 **For more details, see the
@@ -102,13 +99,6 @@ For detailed installation instructions, see the
 [pip]: https://pypi.org/project/spacy/
 [conda]: https://anaconda.org/conda-forge/spacy

-> ⚠️ **Important note for Python 3.8:** We can't yet ship pre-compiled binary
-> wheels for spaCy that work on Python 3.8, as we're still waiting for our CI
-> providers and other tooling to support it. This means that in order to run
-> spaCy on Python 3.8, you'll need [a compiler installed](#source) and compile
-> the library and its Cython dependencies locally. If this is causing problems
-> for you, the easiest solution is to **use Python 3.7** in the meantime.
-
 ### pip

 Using pip, spaCy releases are available as source packages and binary wheels (as
@@ -164,26 +154,26 @@ If you've trained your own models, keep in mind that your training and runtime
 inputs must match. After updating spaCy, we recommend **retraining your models**
 with the new version.

-📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
-[migration guide](https://spacy.io/usage/v2#migrating).**
+📖 **For details on upgrading from spaCy 2.x to spaCy 3.x, see the
+[migration guide](https://spacy.io/usage/v3#migrating).**

 ## Download models

-As of v1.7.0, models for spaCy can be installed as **Python packages**. This
+Trained pipelines for spaCy can be installed as **Python packages**. This
 means that they're a component of your application, just like any other module.
 Models can be installed using spaCy's `download` command, or manually by
 pointing pip to a path or URL.

-| Documentation | |
-| ---------------------- | ------------------------------------------------------------- |
-| [Available Models] | Detailed model descriptions, accuracy figures and benchmarks. |
-| [Models Documentation] | Detailed usage instructions. |
+| Documentation | |
+| ---------------------- | ---------------------------------------------------------------- |
+| [Available Pipelines] | Detailed pipeline descriptions, accuracy figures and benchmarks. |
+| [Models Documentation] | Detailed usage instructions. |

-[available models]: https://spacy.io/models
+[available pipelines]: https://spacy.io/models
 [models documentation]: https://spacy.io/docs/usage/models

 ```bash
-# download best-matching version of specific model for your spaCy installation
+# Download best-matching version of specific model for your spaCy installation
 python -m spacy download en_core_web_sm

 # pip install .tar.gz archive from path or URL
diff --git a/spacy/cli/train.py b/spacy/cli/train.py
index 0bc493e56..ae4a8455e 100644
--- a/spacy/cli/train.py
+++ b/spacy/cli/train.py
@@ -89,7 +89,6 @@ def train(
     nlp, config = util.load_model_from_config(config)
     if config["training"]["vectors"] is not None:
         util.load_vectors_into_model(nlp, config["training"]["vectors"])
-    verify_config(nlp)
     raw_text, tag_map, morph_rules, weights_data = load_from_paths(config)
     T_cfg = config["training"]
     optimizer = T_cfg["optimizer"]
@@ -108,6 +107,8 @@ def train(
         nlp.resume_training(sgd=optimizer)
     with nlp.select_pipes(disable=[*frozen_components, *resume_components]):
         nlp.begin_training(lambda: train_corpus(nlp), sgd=optimizer)
+    # Verify the config after calling 'begin_training' to ensure labels are properly initialized
+    verify_config(nlp)

     if tag_map:
         # Replace tag map with provided mapping
@@ -401,7 +402,7 @@ def verify_cli_args(config_path: Path, output_path: Optional[Path] = None) -> No

 def verify_config(nlp: Language) -> None:
-    """Perform additional checks based on the config and loaded nlp object."""
+    """Perform additional checks based on the config, loaded nlp object and training data."""
     # TODO: maybe we should validate based on the actual components, the list
     # in config["nlp"]["pipeline"] instead?
     for pipe_config in nlp.config["components"].values():
@@ -415,18 +416,13 @@ def verify_textcat_config(nlp: Language, pipe_config: Dict[str, Any]) -> None:
     # if 'positive_label' is provided: double check whether it's in the data and
     # the task is binary
     if pipe_config.get("positive_label"):
-        textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", [])
+        textcat_labels = nlp.get_pipe("textcat").labels
         pos_label = pipe_config.get("positive_label")
         if pos_label not in textcat_labels:
-            msg.fail(
-                f"The textcat's 'positive_label' config setting '{pos_label}' "
-                f"does not match any label in the training data.",
-                exits=1,
+            raise ValueError(
+                Errors.E920.format(pos_label=pos_label, labels=textcat_labels)
             )
-        if len(textcat_labels) != 2:
-            msg.fail(
-                f"A textcat 'positive_label' '{pos_label}' was "
-                f"provided for training data that does not appear to be a "
-                f"binary classification problem with two labels.",
-                exits=1,
+        if len(list(textcat_labels)) != 2:
+            raise ValueError(
+                Errors.E919.format(pos_label=pos_label, labels=textcat_labels)
             )
diff --git a/spacy/errors.py b/spacy/errors.py
index 8f95609a6..f857bea52 100644
--- a/spacy/errors.py
+++ b/spacy/errors.py
@@ -480,6 +480,11 @@ class Errors:
     E201 = ("Span index out of range.")

     # TODO: fix numbering after merging develop into master
+    E919 = ("A textcat 'positive_label' '{pos_label}' was provided for training "
+            "data that does not appear to be a binary classification problem "
+            "with two labels. Labels found: {labels}")
+    E920 = ("The textcat's 'positive_label' config setting '{pos_label}' "
+            "does not match any label in the training data. Labels found: {labels}")
     E921 = ("The method 'set_output' can only be called on components that have "
            "a Model with a 'resize_output' attribute. Otherwise, the output "
            "layer can not be dynamically changed.")
diff --git a/spacy/pipeline/textcat.py b/spacy/pipeline/textcat.py
index 22d1de08f..3f6250680 100644
--- a/spacy/pipeline/textcat.py
+++ b/spacy/pipeline/textcat.py
@@ -56,7 +56,12 @@ subword_features = true
 @Language.factory(
     "textcat",
     assigns=["doc.cats"],
-    default_config={"labels": [], "threshold": 0.5, "model": DEFAULT_TEXTCAT_MODEL},
+    default_config={
+        "labels": [],
+        "threshold": 0.5,
+        "positive_label": None,
+        "model": DEFAULT_TEXTCAT_MODEL,
+    },
     scores=[
         "cats_score",
         "cats_score_desc",
@@ -74,8 +79,9 @@ def make_textcat(
     nlp: Language,
     name: str,
     model: Model[List[Doc], List[Floats2d]],
-    labels: Iterable[str],
+    labels: List[str],
     threshold: float,
+    positive_label: Optional[str],
 ) -> "TextCategorizer":
     """Create a TextCategorizer component. The text categorizer predicts categories
     over a whole document. It can learn one or more labels, and the labels can
@@ -88,8 +94,16 @@ def make_textcat(
     labels (list): A list of categories to learn. If empty, the model infers the
         categories from the data.
     threshold (float): Cutoff to consider a prediction "positive".
+    positive_label (Optional[str]): The positive label for a binary task with exclusive classes, None otherwise.
     """
-    return TextCategorizer(nlp.vocab, model, name, labels=labels, threshold=threshold)
+    return TextCategorizer(
+        nlp.vocab,
+        model,
+        name,
+        labels=labels,
+        threshold=threshold,
+        positive_label=positive_label,
+    )


 class TextCategorizer(Pipe):
@@ -104,8 +118,9 @@ class TextCategorizer(Pipe):
         model: Model,
         name: str = "textcat",
         *,
-        labels: Iterable[str],
+        labels: List[str],
         threshold: float,
+        positive_label: Optional[str],
     ) -> None:
         """Initialize a text categorizer.
@@ -113,8 +128,9 @@ class TextCategorizer(Pipe):
         model (thinc.api.Model): The Thinc Model powering the pipeline component.
         name (str): The component instance name, used to add entries to the
             losses during training.
-        labels (Iterable[str]): The labels to use.
+        labels (List[str]): The labels to use.
         threshold (float): Cutoff to consider a prediction "positive".
+        positive_label (Optional[str]): The positive label for a binary task with exclusive classes, None otherwise.

         DOCS: https://nightly.spacy.io/api/textcategorizer#init
         """
@@ -122,7 +138,11 @@ class TextCategorizer(Pipe):
         self.model = model
         self.name = name
         self._rehearsal_model = None
-        cfg = {"labels": labels, "threshold": threshold}
+        cfg = {
+            "labels": labels,
+            "threshold": threshold,
+            "positive_label": positive_label,
+        }
         self.cfg = dict(cfg)

     @property
@@ -131,10 +151,10 @@ class TextCategorizer(Pipe):

         DOCS: https://nightly.spacy.io/api/textcategorizer#labels
         """
-        return tuple(self.cfg.setdefault("labels", []))
+        return tuple(self.cfg["labels"])

     @labels.setter
-    def labels(self, value: Iterable[str]) -> None:
+    def labels(self, value: List[str]) -> None:
         self.cfg["labels"] = tuple(value)

     def pipe(self, stream: Iterable[Doc], *, batch_size: int = 128) -> Iterator[Doc]:
@@ -353,17 +373,10 @@ class TextCategorizer(Pipe):
             sgd = self.create_optimizer()
         return sgd

-    def score(
-        self,
-        examples: Iterable[Example],
-        *,
-        positive_label: Optional[str] = None,
-        **kwargs,
-    ) -> Dict[str, Any]:
+    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
         """Score a batch of examples.

         examples (Iterable[Example]): The examples to score.
-        positive_label (str): Optional positive label.
         RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats.

         DOCS: https://nightly.spacy.io/api/textcategorizer#score
@@ -374,7 +387,7 @@ class TextCategorizer(Pipe):
             "cats",
             labels=self.labels,
             multi_label=self.model.attrs["multi_label"],
-            positive_label=positive_label,
+            positive_label=self.cfg["positive_label"],
             threshold=self.cfg["threshold"],
             **kwargs,
         )
diff --git a/spacy/tests/pipeline/test_textcat.py b/spacy/tests/pipeline/test_textcat.py
index d12a7211a..99b5132ca 100644
--- a/spacy/tests/pipeline/test_textcat.py
+++ b/spacy/tests/pipeline/test_textcat.py
@@ -10,6 +10,7 @@ from spacy.tokens import Doc
 from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL

 from ..util import make_tempdir
+from ...cli.train import verify_textcat_config
 from ...training import Example


@@ -130,7 +131,10 @@ def test_overfitting_IO():
     fix_random_seed(0)
     nlp = English()
     # Set exclusive labels
-    textcat = nlp.add_pipe("textcat", config={"model": {"exclusive_classes": True}})
+    textcat = nlp.add_pipe(
+        "textcat",
+        config={"model": {"exclusive_classes": True}, "positive_label": "POSITIVE"},
+    )
     train_examples = []
     for text, annotations in TRAIN_DATA:
         train_examples.append(Example.from_dict(nlp.make_doc(text), annotations))
@@ -159,7 +163,7 @@ def test_overfitting_IO():
         assert cats2["POSITIVE"] + cats2["NEGATIVE"] == pytest.approx(1.0, 0.001)

     # Test scoring
-    scores = nlp.evaluate(train_examples, scorer_cfg={"positive_label": "POSITIVE"})
+    scores = nlp.evaluate(train_examples)
     assert scores["cats_micro_f"] == 1.0
     assert scores["cats_score"] == 1.0
     assert "cats_score_desc" in scores
@@ -194,3 +198,29 @@ def test_textcat_configs(textcat_config):
     for i in range(5):
         losses = {}
         nlp.update(train_examples, sgd=optimizer, losses=losses)
+
+
+def test_positive_class():
+    nlp = English()
+    pipe_config = {"positive_label": "POS", "labels": ["POS", "NEG"]}
"NEG"]} + textcat = nlp.add_pipe("textcat", config=pipe_config) + assert textcat.labels == ("POS", "NEG") + verify_textcat_config(nlp, pipe_config) + + +def test_positive_class_not_present(): + nlp = English() + pipe_config = {"positive_label": "POS", "labels": ["SOME", "THING"]} + textcat = nlp.add_pipe("textcat", config=pipe_config) + assert textcat.labels == ("SOME", "THING") + with pytest.raises(ValueError): + verify_textcat_config(nlp, pipe_config) + + +def test_positive_class_not_binary(): + nlp = English() + pipe_config = {"positive_label": "POS", "labels": ["SOME", "THING", "POS"]} + textcat = nlp.add_pipe("textcat", config=pipe_config) + assert textcat.labels == ("SOME", "THING", "POS") + with pytest.raises(ValueError): + verify_textcat_config(nlp, pipe_config) diff --git a/spacy/tests/serialize/test_serialize_pipeline.py b/spacy/tests/serialize/test_serialize_pipeline.py index e621aebd8..eedad31e0 100644 --- a/spacy/tests/serialize/test_serialize_pipeline.py +++ b/spacy/tests/serialize/test_serialize_pipeline.py @@ -136,7 +136,7 @@ def test_serialize_textcat_empty(en_vocab): # See issue #1105 cfg = {"model": DEFAULT_TEXTCAT_MODEL} model = registry.make_from_config(cfg, validate=True)["model"] - textcat = TextCategorizer(en_vocab, model, labels=["ENTITY", "ACTION", "MODIFIER"], threshold=0.5) + textcat = TextCategorizer(en_vocab, model, labels=["ENTITY", "ACTION", "MODIFIER"], threshold=0.5, positive_label=None) textcat.to_bytes(exclude=["vocab"]) diff --git a/website/README.md b/website/README.md index 825d13c65..076032d92 100644 --- a/website/README.md +++ b/website/README.md @@ -630,3 +630,49 @@ In addition to the native markdown elements, you can use the components โ”œโ”€โ”€ gatsby-node.js # Node-specific hooks for Gatsby โ””โ”€โ”€ package.json # package settings and dependencies ``` + +## Editorial {#editorial} + +- "spaCy" should always be spelled with a lowercase "s" and a capital "C", + unless it specifically refers to the Python package or Python import `spacy` + (in which case it should be formatted as code). + - โœ… spaCy is a library for advanced NLP in Python. + - โŒ Spacy is a library for advanced NLP in Python. + - โœ… First, you need to install the `spacy` package from pip. +- Mentions of code, like function names, classes, variable names etc. in inline + text should be formatted as `code`. + - โœ… "Calling the `nlp` object on a text returns a `Doc`." +- Objects that have pages in the [API docs](/api) should be linked โ€“ for + example, [`Doc`](/api/doc) or [`Language.to_disk`](/api/language#to_disk). The + mentions should still be formatted as code within the link. Links pointing to + the API docs will automatically receive a little icon. However, if a paragraph + includes many references to the API, the links can easily get messy. In that + case, we typically only link the first mention of an object and not any + subsequent ones. + - โœ… The [`Span`](/api/span) and [`Token`](/api/token) objects are views of a + [`Doc`](/api/doc). [`Span.as_doc`](/api/span#as_doc) creates a `Doc` object + from a `Span`. + - โŒ The [`Span`](/api/span) and [`Token`](/api/token) objects are views of a + [`Doc`](/api/doc). [`Span.as_doc`](/api/span#as_doc) creates a + [`Doc`](/api/doc) object from a [`Span`](/api/span). + +* Other things we format as code are: references to trained pipeline packages + like `en_core_web_sm` or file names like `code.py` or `meta.json`. + + - โœ… After training, the `config.cfg` is saved to disk. 
+
+* [Type annotations](#type-annotations) are a special type of code formatting,
+  expressed by wrapping the text in `~~` instead of backticks. The result looks
+  like this: ~~List[Doc]~~. All references to known types will be linked
+  automatically.
+
+  - ✅ The model has the input type ~~List[Doc]~~ and it outputs a
+    ~~List[Array2d]~~.
+
+* We try to keep links meaningful but short.
+  - ✅ For details, see the usage guide on
+    [training with custom code](/usage/training#custom-code).
+  - ❌ For details, see
+    [the usage guide on training with custom code](/usage/training#custom-code).
+  - ❌ For details, see the usage guide on training with custom code
+    [here](/usage/training#custom-code).
diff --git a/website/docs/api/dependencymatcher.md b/website/docs/api/dependencymatcher.md
index c90a715d9..356adcda7 100644
--- a/website/docs/api/dependencymatcher.md
+++ b/website/docs/api/dependencymatcher.md
@@ -183,7 +183,7 @@ will be overwritten.
 | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `match_id` | An ID for the patterns. ~~str~~ |
 | `patterns` | A list of match patterns. A pattern consists of a list of dicts, where each dict describes a token in the tree. ~~List[List[Dict[str, Union[str, Dict]]]]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[DependencyMatcher, Doc, int, List[Tuple], Any]]~~ |

 ## DependencyMatcher.get {#get tag="method"}
diff --git a/website/docs/api/dependencyparser.md b/website/docs/api/dependencyparser.md
index 674812567..8af4455d3 100644
--- a/website/docs/api/dependencyparser.md
+++ b/website/docs/api/dependencyparser.md
@@ -217,7 +217,7 @@ model. Delegates to [`predict`](/api/dependencyparser#predict) and
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
diff --git a/website/docs/api/entitylinker.md b/website/docs/api/entitylinker.md
index a9d45d68e..9cb35b487 100644
--- a/website/docs/api/entitylinker.md
+++ b/website/docs/api/entitylinker.md
@@ -85,7 +85,7 @@ providing custom registered functions.
 | `vocab` | The shared vocabulary. ~~Vocab~~ |
 | `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model~~ |
 | `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `kb_loader` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. ~~Callable[[Vocab], KnowledgeBase]~~ |
 | `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
 | `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ |
@@ -218,7 +218,7 @@ pipe's entity linking model and context encoder. Delegates to
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md
index 1420aa1a7..8af73f44b 100644
--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -206,7 +206,7 @@ model. Delegates to [`predict`](/api/entityrecognizer#predict) and
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
diff --git a/website/docs/api/entityruler.md b/website/docs/api/entityruler.md
index a6934eeef..7be44bc95 100644
--- a/website/docs/api/entityruler.md
+++ b/website/docs/api/entityruler.md
@@ -255,7 +255,7 @@ Get all patterns that were added to the entity ruler.
 | Name | Description |
 | ----------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `matcher` | The underlying matcher used to process token patterns. ~~Matcher~~ | |
+| `matcher` | The underlying matcher used to process token patterns. ~~Matcher~~ |
 | `phrase_matcher` | The underlying phrase matcher, used to process phrase patterns. ~~PhraseMatcher~~ |
 | `token_patterns` | The token patterns present in the entity ruler, keyed by label. ~~Dict[str, List[Dict[str, Union[str, List[dict]]]]~~ |
 | `phrase_patterns` | The phrase patterns present in the entity ruler, keyed by label. ~~Dict[str, List[Doc]]~~ |
diff --git a/website/docs/api/lemmatizer.md b/website/docs/api/lemmatizer.md
index 486410907..f9978dcf9 100644
--- a/website/docs/api/lemmatizer.md
+++ b/website/docs/api/lemmatizer.md
@@ -81,7 +81,7 @@ shortcut for this and instantiate the component using its string name and
 | `vocab` | The shared vocabulary. ~~Vocab~~ |
 | `model` | **Not yet implemented:** The model to use. ~~Model~~ |
 | `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | mode | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`. ~~str~~ |
 | lookups | A lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. Defaults to `None`. ~~Optional[Lookups]~~ |
 | overwrite | Whether to overwrite existing lemmas. ~~bool~~ |
diff --git a/website/docs/api/morphologizer.md b/website/docs/api/morphologizer.md
index f2b2f9cc0..e1a166474 100644
--- a/website/docs/api/morphologizer.md
+++ b/website/docs/api/morphologizer.md
@@ -139,7 +139,7 @@ setting up the label scheme based on the data.
 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
 | `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
 | **RETURNS** | The optimizer. ~~Optimizer~~ |
@@ -196,7 +196,7 @@ Delegates to [`predict`](/api/morphologizer#predict) and
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
diff --git a/website/docs/api/phrasematcher.md b/website/docs/api/phrasematcher.md
index 39e3a298b..47bbdcf6a 100644
--- a/website/docs/api/phrasematcher.md
+++ b/website/docs/api/phrasematcher.md
@@ -150,9 +150,9 @@ patterns = [nlp("health care reform"), nlp("healthcare reform")]
 | Name | Description |
 | -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `match_id` | str | An ID for the thing you're matching. ~~str~~ |
+| `match_id` | An ID for the thing you're matching. ~~str~~ |
 | `docs` | `Doc` objects of the phrases to match. ~~List[Doc]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple], Any]]~~ |

 ## PhraseMatcher.remove {#remove tag="method" new="2.2"}
diff --git a/website/docs/api/pipe.md b/website/docs/api/pipe.md
index c8d61a5a9..e4e1e97f1 100644
--- a/website/docs/api/pipe.md
+++ b/website/docs/api/pipe.md
@@ -187,7 +187,7 @@ predictions and gold-standard annotations, and update the component's model.
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -211,7 +211,7 @@ the "catastrophic forgetting" problem. This feature is experimental.
 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------ |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
 | `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
diff --git a/website/docs/api/sentencerecognizer.md b/website/docs/api/sentencerecognizer.md
index ca19327bb..acf94fb8e 100644
--- a/website/docs/api/sentencerecognizer.md
+++ b/website/docs/api/sentencerecognizer.md
@@ -192,7 +192,7 @@ Delegates to [`predict`](/api/sentencerecognizer#predict) and
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -216,7 +216,7 @@ the "catastrophic forgetting" problem. This feature is experimental.
 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------ |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
 | `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
diff --git a/website/docs/api/sentencizer.md b/website/docs/api/sentencizer.md
index c435acdcb..ae31e4ddf 100644
--- a/website/docs/api/sentencizer.md
+++ b/website/docs/api/sentencizer.md
@@ -53,7 +53,7 @@ Initialize the sentencizer.

 | Name | Description |
 | -------------- | ----------------------------------------------------------------------------------------------------------------------- |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. ~~Optional[List[str]]~~ |

 ```python
diff --git a/website/docs/api/tagger.md b/website/docs/api/tagger.md
index d83a77357..d428d376e 100644
--- a/website/docs/api/tagger.md
+++ b/website/docs/api/tagger.md
@@ -190,7 +190,7 @@ Delegates to [`predict`](/api/tagger#predict) and
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -214,7 +214,7 @@ the "catastrophic forgetting" problem. This feature is experimental.
 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------ |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
 | `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
diff --git a/website/docs/api/textcategorizer.md b/website/docs/api/textcategorizer.md
index 9bdc6324f..b68039094 100644
--- a/website/docs/api/textcategorizer.md
+++ b/website/docs/api/textcategorizer.md
@@ -36,11 +36,12 @@ architectures and their arguments and hyperparameters.
 > nlp.add_pipe("textcat", config=config)
 > ```

-| Setting | Description |
-| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `labels` | A list of categories to learn. If empty, the model infers the categories from the data. Defaults to `[]`. ~~Iterable[str]~~ |
-| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
-| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |
+| Setting | Description |
+| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `labels` | A list of categories to learn. If empty, the model infers the categories from the data. Defaults to `[]`. ~~Iterable[str]~~ |
+| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
+| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~ |
+| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). ~~Model[List[Doc], List[Floats2d]]~~ |

 ```python
 %%GITHUB_SPACY/spacy/pipeline/textcat.py
@@ -60,21 +61,22 @@ architectures and their arguments and hyperparameters.
 >
 > # Construction from class
 > from spacy.pipeline import TextCategorizer
-> textcat = TextCategorizer(nlp.vocab, model, labels=[], threshold=0.5)
+> textcat = TextCategorizer(nlp.vocab, model, labels=[], threshold=0.5, positive_label="POS")
 > ```

 Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#create_pipe).

-| Name | Description |
-| -------------- | -------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | The shared vocabulary. ~~Vocab~~ |
-| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
-| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
-| _keyword-only_ | |
-| `labels` | The labels to use. ~~Iterable[str]~~ |
-| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
+| Name | Description |
+| ---------------- | -------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | The shared vocabulary. ~~Vocab~~ |
+| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
+| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
+| _keyword-only_ | |
+| `labels` | The labels to use. ~~Iterable[str]~~ |
+| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ |
+| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise. ~~Optional[str]~~ |

 ## TextCategorizer.\_\_call\_\_ {#call tag="method"}
@@ -201,7 +203,7 @@ Delegates to [`predict`](/api/textcategorizer#predict) and
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
@@ -225,7 +227,7 @@ the "catastrophic forgetting" problem. This feature is experimental.
 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------ |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
 | `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
@@ -263,7 +265,7 @@ Score a batch of examples.
 | Name | Description |
 | ---------------- | -------------------------------------------------------------------------------------------------------------------- |
 | `examples` | The examples to score. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `positive_label` | Optional positive label. ~~Optional[str]~~ |
 | **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ |
diff --git a/website/docs/api/tok2vec.md b/website/docs/api/tok2vec.md
index 6f13a17a5..5c7214edc 100644
--- a/website/docs/api/tok2vec.md
+++ b/website/docs/api/tok2vec.md
@@ -144,7 +144,7 @@ setting up the label scheme based on the data.
 | Name | Description |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
 | `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `pipeline` | Optional list of pipeline components that this component is part of. ~~Optional[List[Tuple[str, Callable[[Doc], Doc]]]]~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
 | **RETURNS** | The optimizer. ~~Optimizer~~ |
@@ -200,7 +200,7 @@ Delegates to [`predict`](/api/tok2vec#predict).
 | Name | Description |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
-| _keyword-only_ | | |
+| _keyword-only_ | |
 | `drop` | The dropout rate. ~~float~~ |
 | `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
 | `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
diff --git a/website/docs/styleguide.md b/website/docs/styleguide.md
index 4d8aa8748..ed6f9d99b 100644
--- a/website/docs/styleguide.md
+++ b/website/docs/styleguide.md
@@ -11,6 +11,7 @@ menu:
     - ['Setup & Installation', 'setup']
     - ['Markdown Reference', 'markdown']
     - ['Project Structure', 'structure']
+    - ['Editorial', 'editorial']
 sidebar:
   - label: Styleguide
     items:
diff --git a/website/src/components/quickstart.js b/website/src/components/quickstart.js
index 6a335d4a0..64f828c2f 100644
--- a/website/src/components/quickstart.js
+++ b/website/src/components/quickstart.js
@@ -27,6 +27,7 @@ const Quickstart = ({
     hidePrompts,
     small,
     codeLang,
+    Container = Section,
     children,
 }) => {
     const contentRef = useRef()
@@ -83,7 +84,7 @@ const Quickstart = ({
     }, [data, initialized])

     return !data.length ? null : (
-        <Section id={id}>
+        <Container id={id}>
             {title && (
@@ -249,7 +250,7 @@ const Quickstart = ({
             {showCopy &&
-        </Section>
+        </Container>
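
To make the net effect of this patch concrete: `positive_label` moves out of `TextCategorizer.score` and into the component's config, where the scorer picks it up automatically. Below is a minimal sketch of the new workflow, assembled from the tests in this diff — the example texts, the `POS`/`NEG` label names and the five-iteration training loop are illustrative assumptions, not part of the patch:

```python
import spacy
from spacy.training import Example

# Hypothetical binary training data with two exclusive labels.
TRAIN_DATA = [
    ("I love this", {"cats": {"POS": 1.0, "NEG": 0.0}}),
    ("I hate this", {"cats": {"POS": 0.0, "NEG": 1.0}}),
]

nlp = spacy.blank("en")
# 'positive_label' is now part of the component config, alongside 'labels'
# and 'threshold', instead of being threaded through scorer_cfg at
# evaluation time. The partial model config is merged with the defaults,
# as in the updated test_overfitting_IO test.
nlp.add_pipe(
    "textcat",
    config={
        "labels": ["POS", "NEG"],
        "positive_label": "POS",
        "model": {"exclusive_classes": True},
    },
)

examples = [
    Example.from_dict(nlp.make_doc(text), annots) for text, annots in TRAIN_DATA
]
optimizer = nlp.begin_training(lambda: examples)
for _ in range(5):
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

# Scoring reads the positive label from the component's cfg, so the old
# scorer_cfg={"positive_label": ...} argument is no longer needed.
scores = nlp.evaluate(examples)
print(scores["cats_score"])
```

Because the label check in `verify_config` now runs after `begin_training`, a `positive_label` that is missing from the data (E920) or attached to a non-binary problem (E919) raises a `ValueError` against the fully initialized label set rather than failing on a possibly empty `cfg["labels"]`.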