Merge branch 'develop' into spacy.io

2025-08-24 05:54:55 +03:00 · 2019-02-24 13:12:26 +01:00 · 2019-02-24 13:12:26 +01:00 · 8458379cf5
commit 8458379cf5
parent f34d6281d6 3ef4da3503
5 changed files with 131 additions and 100 deletions
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ spaCy is a library for advanced Natural Language Processing in Python and
 Cython. It's built on the very latest research, and was designed from day one
 to be used in real products. spaCy comes with
 [pre-trained statistical models](https://spacy.io/models) and word vectors, and
-currently supports tokenization for **30+ languages**. It features the
+currently supports tokenization for **45+ languages**. It features the
 **fastest syntactic parser** in the world, convolutional
 **neural network models** for tagging, parsing and **named entity recognition**
 and easy **deep learning** integration. It's commercial open-source software,
@@ -20,29 +20,30 @@ released under the MIT license.
 [![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.python.org/pypi/spacy)
 [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square)](https://anaconda.org/conda-forge/spacy)
 [![Python wheels](https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white)](https://github.com/explosion/wheelwright/releases)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)
 [![spaCy on Twitter](https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow)](https://twitter.com/spacy_io)

 ## 📖 Documentation

-| Documentation |  |
-| --- | --- |
-| [spaCy 101] | New to spaCy? Here's everything you need to know!
-| [Usage Guides] | How to use spaCy and its features. |
-| [New in v2.0] | New features, backwards incompatibilities and migration guide. |
-| [API Reference] | The detailed reference for spaCy's API. |
-| [Models] | Download statistical language models for spaCy. |
-| [Universe] | Libraries, extensions, demos, books and courses. |
-| [Changelog] | Changes and version history. |
-| [Contribute] | How to contribute to the spaCy project and code base. |
+| Documentation   |                                                                |
+| --------------- | -------------------------------------------------------------- |
+| [spaCy 101]     | New to spaCy? Here's everything you need to know!              |
+| [Usage Guides]  | How to use spaCy and its features.                             |
+| [New in v2.1]   | New features, backwards incompatibilities and migration guide. |
+| [API Reference] | The detailed reference for spaCy's API.                        |
+| [Models]        | Download statistical language models for spaCy.                |
+| [Universe]      | Libraries, extensions, demos, books and courses.               |
+| [Changelog]     | Changes and version history.                                   |
+| [Contribute]    | How to contribute to the spaCy project and code base.          |

-[spaCy 101]: https://spacy.io/usage/spacy-101
-[New in v2.0]: https://spacy.io/usage/v2#migrating
-[Usage Guides]: https://spacy.io/usage/
-[API Reference]: https://spacy.io/api/
-[Models]: https://spacy.io/models
-[Universe]: https://spacy.io/universe
-[Changelog]: https://spacy.io/usage/#changelog
-[Contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
+[spacy 101]: https://spacy.io/usage/spacy-101
+[new in v2.1]: https://spacy.io/usage/v2-1
+[usage guides]: https://spacy.io/usage/
+[api reference]: https://spacy.io/api/
+[models]: https://spacy.io/models
+[universe]: https://spacy.io/universe
+[changelog]: https://spacy.io/usage/#changelog
+[contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md

 ## 💬 Where to ask questions

@@ -51,35 +52,38 @@ and [@ines](https://github.com/ines). Please understand that we won't be able
 to provide individual support via email. We also believe that help is much more
 valuable if it's shared publicly, so that more people can benefit from it.

-* **Bug Reports**: [GitHub Issue Tracker]
-* **Usage Questions**: [Stack Overflow] · [Gitter Chat] · [Reddit User Group]
-* **General Discussion**: [Gitter Chat] · [Reddit User Group]
+| Type                     | Platforms                                              |
+| ------------------------ | ------------------------------------------------------ |
+| 🚨**Bug Reports**        | [GitHub Issue Tracker]                                 |
+| 🎁 **Feature Requests**  | [GitHub Issue Tracker]                                 |
+| 👩‍💻**Usage Questions**    | [Stack Overflow] · [Gitter Chat] · [Reddit User Group] |
+| 🗯 **General Discussion** | [Gitter Chat] · [Reddit User Group]                    |

-[GitHub Issue Tracker]: https://github.com/explosion/spaCy/issues
-[Stack Overflow]: http://stackoverflow.com/questions/tagged/spacy
-[Gitter Chat]: https://gitter.im/explosion/spaCy
-[Reddit User Group]: https://www.reddit.com/r/spacynlp
+[github issue tracker]: https://github.com/explosion/spaCy/issues
+[stack overflow]: http://stackoverflow.com/questions/tagged/spacy
+[gitter chat]: https://gitter.im/explosion/spaCy
+[reddit user group]: https://www.reddit.com/r/spacynlp

 ## Features

-* **Fastest syntactic parser** in the world
-* **Named entity** recognition
-* Non-destructive **tokenization**
-* Support for **30+ languages**
-* Pre-trained [statistical models](https://spacy.io/models) and word vectors
-* Easy **deep learning** integration
-* Part-of-speech tagging
-* Labelled dependency parsing
-* Syntax-driven sentence segmentation
-* Built in **visualizers** for syntax and NER
-* Convenient string-to-hash mapping
-* Export to numpy data arrays
-* Efficient binary serialization
-* Easy **model packaging** and deployment
-* State-of-the-art speed
-* Robust, rigorously evaluated accuracy
+-   **Fastest syntactic parser** in the world
+-   **Named entity** recognition
+-   Non-destructive **tokenization**
+-   Support for **45+ languages**
+-   Pre-trained [statistical models](https://spacy.io/models) and word vectors
+-   Easy **deep learning** integration
+-   Part-of-speech tagging
+-   Labelled dependency parsing
+-   Syntax-driven sentence segmentation
+-   Built in **visualizers** for syntax and NER
+-   Convenient string-to-hash mapping
+-   Export to numpy data arrays
+-   Efficient binary serialization
+-   Easy **model packaging** and deployment
+-   State-of-the-art speed
+-   Robust, rigorously evaluated accuracy

-📖  **For more details, see the
+📖 **For more details, see the
 [facts, figures and benchmarks](https://spacy.io/usage/facts-figures).**

 ## Install spaCy
@@ -87,9 +91,9 @@ valuable if it's shared publicly, so that more people can benefit from it.
 For detailed installation instructions, see the
 [documentation](https://spacy.io/usage).

-* **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
-* **Python version**: Python 2.7, 3.4+ (only 64 bit)
-* **Package managers**: [pip] · [conda] (via `conda-forge`)
+-   **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
+-   **Python version**: Python 2.7, 3.4+ (only 64 bit)
+-   **Package managers**: [pip] · [conda] (via `conda-forge`)

 [pip]: https://pypi.python.org/pypi/spacy
 [conda]: https://anaconda.org/conda-forge/spacy
@@ -142,7 +146,7 @@ If you've trained your own models, keep in mind that your training and runtime
 inputs must match. After updating spaCy, we recommend **retraining your models**
 with the new version.

-📖  **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
+📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
 [migration guide](https://spacy.io/usage/v2#migrating).**

 ## Download models
@@ -152,13 +156,13 @@ This means that they're a component of your application, just like any
 other module. Models can be installed using spaCy's `download` command,
 or manually by pointing pip to a path or URL.

-| Documentation |  |
-| --- | --- |
-| [Available Models] | Detailed model descriptions, accuracy figures and benchmarks. |
-| [Models Documentation] | Detailed usage instructions. |
+| Documentation          |                                                               |
+| ---------------------- | ------------------------------------------------------------- |
+| [Available Models]     | Detailed model descriptions, accuracy figures and benchmarks. |
+| [Models Documentation] | Detailed usage instructions.                                  |

-[Available Models]: https://spacy.io/models
-[Models Documentation]: https://spacy.io/docs/usage/models
+[available models]: https://spacy.io/models
+[models documentation]: https://spacy.io/docs/usage/models

 ```bash
 # out-of-the-box: download best-matching default model
@@ -261,7 +265,7 @@ VS 2010 (Python 3.4) and VS 2015 (Python 3.5).

 ## Run tests

-spaCy comes with an [extensive test suite](spacy/tests).  In order to run the
+spaCy comes with an [extensive test suite](spacy/tests). In order to run the
 tests, you'll usually want to clone the repository and build spaCy from source.
 This will also install the required development dependencies and test utilities
 defined in the `requirements.txt`.
--- a/website/docs/api/dependencyparser.md
+++ b/website/docs/api/dependencyparser.md
@@ -37,17 +37,19 @@ shortcut for this and instantiate the component using its string name and
 > parser.from_disk("/path/to/model")
 > ```

-| Name        | Type                           | Description                                                                                                                                           |
-| ----------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab`     | `Vocab`                        | The shared vocabulary.                                                                                                                                |
-| `model`     | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg`     | -                              | Configuration parameters.                                                                                                                             |
-| **RETURNS** | `DependencyParser`             | The newly constructed object.                                                                                                                         |
+| Name        | Type                          | Description                                                                                                                                           |
+| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab`     | `Vocab`                       | The shared vocabulary.                                                                                                                                |
+| `model`     | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `**cfg`     | -                             | Configuration parameters.                                                                                                                             |
+| **RETURNS** | `DependencyParser`            | The newly constructed object.                                                                                                                         |

 ## DependencyParser.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/dependencyparser#call) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/dependencyparser#call) and
 [`pipe`](/api/dependencyparser#pipe) delegate to the
 [`predict`](/api/dependencyparser#predict) and
 [`set_annotations`](/api/dependencyparser#set_annotations) methods.
@@ -57,6 +59,7 @@ Both [`__call__`](/api/dependencyparser#call) and
 > ```python
 > parser = DependencyParser(nlp.vocab)
 > doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
 > processed = parser(doc)
 > ```

@@ -82,11 +85,11 @@ Apply the pipe to a stream of documents. Both
 >     pass
 > ```

-| Name         | Type     | Description                                                                                                    |
-| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
-| `stream`     | iterable | A stream of documents.                                                                                         |
-| `batch_size` | int      | The number of texts to buffer. Defaults to `128`.                                                              |
-| **YIELDS**   | `Doc`    | Processed documents in the order of the original text.                                                         |
+| Name         | Type     | Description                                            |
+| ------------ | -------- | ------------------------------------------------------ |
+| `stream`     | iterable | A stream of documents.                                 |
+| `batch_size` | int      | The number of texts to buffer. Defaults to `128`.      |
+| **YIELDS**   | `Doc`    | Processed documents in the order of the original text. |

 ## DependencyParser.predict {#predict tag="method"}

--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -37,17 +37,19 @@ shortcut for this and instantiate the component using its string name and
 > ner.from_disk("/path/to/model")
 > ```

-| Name        | Type                           | Description                                                                                                                                           |
-| ----------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab`     | `Vocab`                        | The shared vocabulary.                                                                                                                                |
-| `model`     | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg`     | -                              | Configuration parameters.                                                                                                                             |
-| **RETURNS** | `EntityRecognizer`             | The newly constructed object.                                                                                                                         |
+| Name        | Type                          | Description                                                                                                                                           |
+| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab`     | `Vocab`                       | The shared vocabulary.                                                                                                                                |
+| `model`     | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `**cfg`     | -                             | Configuration parameters.                                                                                                                             |
+| **RETURNS** | `EntityRecognizer`            | The newly constructed object.                                                                                                                         |

 ## EntityRecognizer.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/entityrecognizer#call) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/entityrecognizer#call) and
 [`pipe`](/api/entityrecognizer#pipe) delegate to the
 [`predict`](/api/entityrecognizer#predict) and
 [`set_annotations`](/api/entityrecognizer#set_annotations) methods.
@@ -57,6 +59,7 @@ Both [`__call__`](/api/entityrecognizer#call) and
 > ```python
 > ner = EntityRecognizer(nlp.vocab)
 > doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
 > processed = ner(doc)
 > ```

@@ -82,11 +85,11 @@ Apply the pipe to a stream of documents. Both
 >     pass
 > ```

-| Name         | Type     | Description                                                                                                    |
-| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
-| `stream`     | iterable | A stream of documents.                                                                                         |
-| `batch_size` | int      | The number of texts to buffer. Defaults to `128`.                                                              |
-| **YIELDS**   | `Doc`    | Processed documents in the order of the original text.                                                         |
+| Name         | Type     | Description                                            |
+| ------------ | -------- | ------------------------------------------------------ |
+| `stream`     | iterable | A stream of documents.                                 |
+| `batch_size` | int      | The number of texts to buffer. Defaults to `128`.      |
+| **YIELDS**   | `Doc`    | Processed documents in the order of the original text. |

 ## EntityRecognizer.predict {#predict tag="method"}

--- a/website/docs/api/tagger.md
+++ b/website/docs/api/tagger.md
@@ -37,18 +37,20 @@ shortcut for this and instantiate the component using its string name and
 > tagger.from_disk("/path/to/model")
 > ```

-| Name        | Type                           | Description                                                                                                                                           |
-| ----------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab`     | `Vocab`                        | The shared vocabulary.                                                                                                                                |
-| `model`     | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg`     | -                              | Configuration parameters.                                                                                                                             |
-| **RETURNS** | `Tagger`                       | The newly constructed object.                                                                                                                         |
+| Name        | Type                          | Description                                                                                                                                           |
+| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab`     | `Vocab`                       | The shared vocabulary.                                                                                                                                |
+| `model`     | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `**cfg`     | -                             | Configuration parameters.                                                                                                                             |
+| **RETURNS** | `Tagger`                      | The newly constructed object.                                                                                                                         |

 ## Tagger.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/tagger#call) and [`pipe`](/api/tagger#pipe) delegate to
-the [`predict`](/api/tagger#predict) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/tagger#call) and [`pipe`](/api/tagger#pipe) delegate to the
+[`predict`](/api/tagger#predict) and
 [`set_annotations`](/api/tagger#set_annotations) methods.

 > #### Example
@@ -56,6 +58,7 @@ the [`predict`](/api/tagger#predict) and
 > ```python
 > tagger = Tagger(nlp.vocab)
 > doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
 > processed = tagger(doc)
 > ```

@@ -79,11 +82,11 @@ Apply the pipe to a stream of documents. Both [`__call__`](/api/tagger#call) and
 >     pass
 > ```

-| Name         | Type     | Description                                                                                                    |
-| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
-| `stream`     | iterable | A stream of documents.                                                                                         |
-| `batch_size` | int      | The number of texts to buffer. Defaults to `128`.                                                              |
-| **YIELDS**   | `Doc`    | Processed documents in the order of the original text.                                                         |
+| Name         | Type     | Description                                            |
+| ------------ | -------- | ------------------------------------------------------ |
+| `stream`     | iterable | A stream of documents.                                 |
+| `batch_size` | int      | The number of texts to buffer. Defaults to `128`.      |
+| **YIELDS**   | `Doc`    | Processed documents in the order of the original text. |

 ## Tagger.predict {#predict tag="method"}

--- a/website/docs/api/textcategorizer.md
+++ b/website/docs/api/textcategorizer.md
@@ -31,6 +31,7 @@ shortcut for this and instantiate the component using its string name and
 > ```python
 > # Construction via create_pipe
 > textcat = nlp.create_pipe("textcat")
+> textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})
 >
 > # Construction from class
 > from spacy.pipeline import TextCategorizer
@@ -38,19 +39,35 @@ shortcut for this and instantiate the component using its string name and
 > textcat.from_disk("/path/to/model")
 > ```

-| Name        | Type                           | Description                                                                                                                                           |
-| ----------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab`     | `Vocab`                        | The shared vocabulary.                                                                                                                                |
-| `model`     | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg`     | -                              | Configuration parameters.                                                                                                                             |
-| **RETURNS** | `TextCategorizer`              | The newly constructed object.                                                                                                                         |
+| Name                | Type                          | Description                                                                                                                                           |
+| ------------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab`             | `Vocab`                       | The shared vocabulary.                                                                                                                                |
+| `model`             | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `exclusive_classes` | bool                          | Make categories mutually exclusive. Defaults to `False`.                                                                                              |
+| `architecture`      | unicode                       | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`.                                                 |
+| **RETURNS**         | `TextCategorizer`             | The newly constructed object.                                                                                                                         |
+
+### Architectures {#architectures new="2.1"}
+
+Text classification models can be used to solve a wide variety of problems.
+Differences in text length, number of labels, difficulty, and runtime
+performance constraints mean that no single algorithm performs well on all types
+of problems. To handle a wider variety of problems, the `TextCategorizer` object
+allows configuration of its model architecture, using the `architecture` keyword
+argument.
+
+| Name           | Description                                                                                                                                              |
+| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `"ensemble"`   | **Default:** Stacked ensemble of a unigram bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. |
+| `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network.       |

 ## TextCategorizer.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/textcategorizer#call) and
-[`pipe`](/api/textcategorizer#pipe) delegate to the
-[`predict`](/api/textcategorizer#predict) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/textcategorizer#call) and [`pipe`](/api/textcategorizer#pipe)
+delegate to the [`predict`](/api/textcategorizer#predict) and
 [`set_annotations`](/api/textcategorizer#set_annotations) methods.

 > #### Example
@@ -58,6 +75,7 @@ Both [`__call__`](/api/textcategorizer#call) and
 > ```python
 > textcat = TextCategorizer(nlp.vocab)
 > doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
 > processed = textcat(doc)
 > ```
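
The `__call__` hunks in this diff all describe the same pattern: `__call__` and `pipe` both delegate to `predict` and `set_annotations`, and the document is modified in place and returned. A minimal sketch of that delegation in plain Python follows. `ToyDoc`, `ToyComponent` and the `_process` helper are hypothetical names used only for illustration; none of them is part of spaCy, and the real components compute their predictions with a statistical model rather than a word count.

```python
class ToyDoc:
    """Hypothetical stand-in for a spaCy Doc; not part of spaCy."""

    def __init__(self, text):
        self.text = text
        self.annotations = None


class ToyComponent:
    """Hypothetical pipeline component illustrating the delegation pattern."""

    def predict(self, docs):
        # Compute one score per document without modifying the documents.
        return [len(doc.text.split()) for doc in docs]

    def set_annotations(self, docs, scores):
        # Write the precomputed scores onto the documents in place.
        for doc, score in zip(docs, scores):
            doc.annotations = score

    def _process(self, batch):
        # Shared helper (an assumption of this sketch): predict, then annotate.
        scores = self.predict(batch)
        self.set_annotations(batch, scores)
        return batch

    def __call__(self, doc):
        # Single-document path: the same doc object is modified and returned.
        return self._process([doc])[0]

    def pipe(self, stream, batch_size=128):
        # Streaming path: buffer up to batch_size docs, then reuse the same
        # two methods, yielding docs in their original order.
        batch = []
        for doc in stream:
            batch.append(doc)
            if len(batch) == batch_size:
                yield from self._process(batch)
                batch = []
        if batch:
            yield from self._process(batch)


component = ToyComponent()
doc = ToyDoc("This is a sentence.")
processed = component(doc)
```

Keeping `predict` free of side effects and isolating the write in `set_annotations` is what lets the single-document and streaming paths share one implementation in this sketch.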