Merge branch 'develop' into spacy.io

Ines Montani 2019-02-24 13:12:26 +01:00
commit 8458379cf5
5 changed files with 131 additions and 100 deletions

README.md

@@ -6,7 +6,7 @@ spaCy is a library for advanced Natural Language Processing in Python and
 Cython. It's built on the very latest research, and was designed from day one
 to be used in real products. spaCy comes with
 [pre-trained statistical models](https://spacy.io/models) and word vectors, and
-currently supports tokenization for **30+ languages**. It features the
+currently supports tokenization for **45+ languages**. It features the
 **fastest syntactic parser** in the world, convolutional
 **neural network models** for tagging, parsing and **named entity recognition**
 and easy **deep learning** integration. It's commercial open-source software,
@@ -20,29 +20,30 @@ released under the MIT license.
 [![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.python.org/pypi/spacy)
 [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square)](https://anaconda.org/conda-forge/spacy)
 [![Python wheels](https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white)](https://github.com/explosion/wheelwright/releases)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)
 [![spaCy on Twitter](https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow)](https://twitter.com/spacy_io)

 ## 📖 Documentation

 | Documentation   |                                                                 |
-| --- | --- |
+| --------------- | -------------------------------------------------------------- |
 | [spaCy 101]     | New to spaCy? Here's everything you need to know!               |
 | [Usage Guides]  | How to use spaCy and its features.                              |
-| [New in v2.0]   | New features, backwards incompatibilities and migration guide.  |
+| [New in v2.1]   | New features, backwards incompatibilities and migration guide.  |
 | [API Reference] | The detailed reference for spaCy's API.                         |
 | [Models]        | Download statistical language models for spaCy.                 |
 | [Universe]      | Libraries, extensions, demos, books and courses.                |
 | [Changelog]     | Changes and version history.                                    |
 | [Contribute]    | How to contribute to the spaCy project and code base.           |

 [spacy 101]: https://spacy.io/usage/spacy-101
-[New in v2.0]: https://spacy.io/usage/v2#migrating
+[new in v2.1]: https://spacy.io/usage/v2-1
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
 [universe]: https://spacy.io/universe
 [changelog]: https://spacy.io/usage/#changelog
 [contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md

 ## 💬 Where to ask questions
@@ -51,35 +52,38 @@ and [@ines](https://github.com/ines). Please understand that we won't be able
 to provide individual support via email. We also believe that help is much more
 valuable if it's shared publicly, so that more people can benefit from it.

-* **Bug Reports**: [GitHub Issue Tracker]
-* **Usage Questions**: [Stack Overflow] · [Gitter Chat] · [Reddit User Group]
-* **General Discussion**: [Gitter Chat] · [Reddit User Group]
+| Type                      | Platforms                                              |
+| ------------------------- | ------------------------------------------------------ |
+| 🚨 **Bug Reports**        | [GitHub Issue Tracker]                                 |
+| 🎁 **Feature Requests**   | [GitHub Issue Tracker]                                 |
+| 👩‍💻 **Usage Questions**    | [Stack Overflow] · [Gitter Chat] · [Reddit User Group] |
+| 🗯 **General Discussion**  | [Gitter Chat] · [Reddit User Group]                    |

 [github issue tracker]: https://github.com/explosion/spaCy/issues
 [stack overflow]: http://stackoverflow.com/questions/tagged/spacy
 [gitter chat]: https://gitter.im/explosion/spaCy
 [reddit user group]: https://www.reddit.com/r/spacynlp

 ## Features

 - **Fastest syntactic parser** in the world
 - **Named entity** recognition
 - Non-destructive **tokenization**
-* Support for **30+ languages**
+- Support for **45+ languages**
 - Pre-trained [statistical models](https://spacy.io/models) and word vectors
 - Easy **deep learning** integration
 - Part-of-speech tagging
 - Labelled dependency parsing
 - Syntax-driven sentence segmentation
 - Built in **visualizers** for syntax and NER
 - Convenient string-to-hash mapping
 - Export to numpy data arrays
 - Efficient binary serialization
 - Easy **model packaging** and deployment
 - State-of-the-art speed
 - Robust, rigorously evaluated accuracy

 📖 **For more details, see the
 [facts, figures and benchmarks](https://spacy.io/usage/facts-figures).**

 ## Install spaCy
@@ -87,9 +91,9 @@ valuable if it's shared publicly, so that more people can benefit from it.
 For detailed installation instructions, see the
 [documentation](https://spacy.io/usage).

-* **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
-* **Python version**: Python 2.7, 3.4+ (only 64 bit)
-* **Package managers**: [pip] · [conda] (via `conda-forge`)
+- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
+- **Python version**: Python 2.7, 3.4+ (only 64 bit)
+- **Package managers**: [pip] · [conda] (via `conda-forge`)

 [pip]: https://pypi.python.org/pypi/spacy
 [conda]: https://anaconda.org/conda-forge/spacy
@@ -142,7 +146,7 @@ If you've trained your own models, keep in mind that your training and runtime
 inputs must match. After updating spaCy, we recommend **retraining your models**
 with the new version.

 📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
 [migration guide](https://spacy.io/usage/v2#migrating).**

 ## Download models
@@ -152,13 +156,13 @@ This means that they're a component of your application, just like any
 other module. Models can be installed using spaCy's `download` command,
 or manually by pointing pip to a path or URL.

 | Documentation          |                                                                |
-| --- | --- |
+| ---------------------- | -------------------------------------------------------------- |
 | [Available Models]     | Detailed model descriptions, accuracy figures and benchmarks.  |
 | [Models Documentation] | Detailed usage instructions.                                   |

 [available models]: https://spacy.io/models
 [models documentation]: https://spacy.io/docs/usage/models

 ```bash
 # out-of-the-box: download best-matching default model
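For context on what follows a download, here is a minimal sketch of loading and using an installed model from Python; it assumes the small English model `en_core_web_sm` has already been installed via the `download` command, and the model name is illustrative.

```python
import spacy

# A minimal sketch, assuming `en_core_web_sm` has been installed,
# e.g. via spaCy's `download` command shown above.
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is a sentence.")
# Print each token with its coarse-grained part-of-speech tag.
print([(token.text, token.pos_) for token in doc])
```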
@@ -261,7 +265,7 @@ VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
 ## Run tests

 spaCy comes with an [extensive test suite](spacy/tests). In order to run the
 tests, you'll usually want to clone the repository and build spaCy from source.
 This will also install the required development dependencies and test utilities
 defined in the `requirements.txt`.
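As a hedged sketch of that last step, the suite can also be invoked programmatically once the development dependencies are in place; the path and flag below are assumptions for illustration, not part of the diff.

```python
# A minimal sketch: run the test suite from Python, assuming pytest
# and the dependencies from requirements.txt are installed, and that
# this is executed from the root of a spaCy source checkout.
import pytest

pytest.main(["spacy/tests", "--verbose"])
```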

website/docs/api/dependencyparser.md

@@ -37,17 +37,19 @@ shortcut for this and instantiate the component using its string name and
 > parser.from_disk("/path/to/model")
 > ```

 | Name        | Type                           | Description |
 | ----------- | ------------------------------ | ----------- |
 | `vocab`     | `Vocab`                        | The shared vocabulary. |
-| `model`     | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `model`     | `thinc.neural.Model` / `True`  | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
 | `**cfg`     | -                              | Configuration parameters. |
 | **RETURNS** | `DependencyParser`             | The newly constructed object. |

 ## DependencyParser.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/dependencyparser#call) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/dependencyparser#call) and
 [`pipe`](/api/dependencyparser#pipe) delegate to the
 [`predict`](/api/dependencyparser#predict) and
 [`set_annotations`](/api/dependencyparser#set_annotations) methods.
@@ -57,6 +59,7 @@ Both [`__call__`](/api/dependencyparser#call) and
 > ```python
 > parser = DependencyParser(nlp.vocab)
 > doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
 > processed = parser(doc)
 > ```
@@ -82,11 +85,11 @@ Apply the pipe to a stream of documents. Both
 > pass
 > ```

 | Name         | Type     | Description |
-| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
+| ------------ | -------- | ------------------------------------------------------ |
 | `stream`     | iterable | A stream of documents. |
 | `batch_size` | int      | The number of texts to buffer. Defaults to `128`. |
 | **YIELDS**   | `Doc`    | Processed documents in the order of the original text. |
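To make the `pipe` signature above concrete, here is a hedged sketch of streaming pre-tokenized `Doc` objects through the parser; it assumes a pre-trained English model such as `en_core_web_sm` is installed, and the texts are illustrative.

```python
import spacy

# A minimal sketch, assuming a pre-trained English model is installed.
nlp = spacy.load("en_core_web_sm")
parser = nlp.get_pipe("parser")

texts = [u"This is a sentence.", u"Here is another one."]
# Tokenize first, then stream the Docs through the parser.
for doc in parser.pipe((nlp.make_doc(text) for text in texts), batch_size=128):
    print([(token.text, token.dep_) for token in doc])
```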
 ## DependencyParser.predict {#predict tag="method"}

website/docs/api/entityrecognizer.md

@@ -37,17 +37,19 @@ shortcut for this and instantiate the component using its string name and
 > ner.from_disk("/path/to/model")
 > ```

 | Name        | Type                           | Description |
 | ----------- | ------------------------------ | ----------- |
 | `vocab`     | `Vocab`                        | The shared vocabulary. |
-| `model`     | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `model`     | `thinc.neural.Model` / `True`  | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
 | `**cfg`     | -                              | Configuration parameters. |
 | **RETURNS** | `EntityRecognizer`             | The newly constructed object. |

 ## EntityRecognizer.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/entityrecognizer#call) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/entityrecognizer#call) and
 [`pipe`](/api/entityrecognizer#pipe) delegate to the
 [`predict`](/api/entityrecognizer#predict) and
 [`set_annotations`](/api/entityrecognizer#set_annotations) methods.
@@ -57,6 +59,7 @@ Both [`__call__`](/api/entityrecognizer#call) and
 > ```python
 > ner = EntityRecognizer(nlp.vocab)
 > doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
 > processed = ner(doc)
 > ```
@@ -82,11 +85,11 @@ Apply the pipe to a stream of documents. Both
 > pass
 > ```

 | Name         | Type     | Description |
-| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
+| ------------ | -------- | ------------------------------------------------------ |
 | `stream`     | iterable | A stream of documents. |
 | `batch_size` | int      | The number of texts to buffer. Defaults to `128`. |
 | **YIELDS**   | `Doc`    | Processed documents in the order of the original text. |
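As a hedged illustration of what the recognizer produces when it runs as part of a pipeline, a short sketch; it assumes the small English model is installed, and the example sentence is illustrative.

```python
import spacy

# A minimal sketch, assuming `en_core_web_sm` is installed. The entity
# recognizer in the pipeline sets doc.ents when the text is processed.
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)
```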
 ## EntityRecognizer.predict {#predict tag="method"}

website/docs/api/tagger.md

@@ -37,18 +37,20 @@ shortcut for this and instantiate the component using its string name and
 > tagger.from_disk("/path/to/model")
 > ```

 | Name        | Type                           | Description |
 | ----------- | ------------------------------ | ----------- |
 | `vocab`     | `Vocab`                        | The shared vocabulary. |
-| `model`     | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `model`     | `thinc.neural.Model` / `True`  | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
 | `**cfg`     | -                              | Configuration parameters. |
 | **RETURNS** | `Tagger`                       | The newly constructed object. |

 ## Tagger.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/tagger#call) and [`pipe`](/api/tagger#pipe) delegate to
-the [`predict`](/api/tagger#predict) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/tagger#call) and [`pipe`](/api/tagger#pipe) delegate to the
+[`predict`](/api/tagger#predict) and
 [`set_annotations`](/api/tagger#set_annotations) methods.

 > #### Example
@@ -56,6 +58,7 @@ the [`predict`](/api/tagger#predict) and
 > ```python
 > tagger = Tagger(nlp.vocab)
 > doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
 > processed = tagger(doc)
 > ```
@@ -79,11 +82,11 @@ Apply the pipe to a stream of documents. Both [`__call__`](/api/tagger#call) and
 > pass
 > ```

 | Name         | Type     | Description |
-| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
+| ------------ | -------- | ------------------------------------------------------ |
 | `stream`     | iterable | A stream of documents. |
 | `batch_size` | int      | The number of texts to buffer. Defaults to `128`. |
 | **YIELDS**   | `Doc`    | Processed documents in the order of the original text. |
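For a hedged picture of the tagger's output when it runs inside a pipeline, assuming the small English model is installed:

```python
import spacy

# A minimal sketch, assuming `en_core_web_sm` is installed. The tagger
# sets both fine-grained (token.tag_) and coarse-grained (token.pos_) tags.
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is a sentence.")
for token in doc:
    print(token.text, token.tag_, token.pos_)
```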
 ## Tagger.predict {#predict tag="method"}

website/docs/api/textcategorizer.md

@@ -31,6 +31,7 @@ shortcut for this and instantiate the component using its string name and
 > ```python
 > # Construction via create_pipe
 > textcat = nlp.create_pipe("textcat")
+> textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})
 >
 > # Construction from class
 > from spacy.pipeline import TextCategorizer
@@ -38,19 +39,35 @@ shortcut for this and instantiate the component using its string name and
 > textcat.from_disk("/path/to/model")
 > ```

 | Name                | Type                           | Description |
 | ------------------- | ------------------------------ | ----------- |
 | `vocab`             | `Vocab`                        | The shared vocabulary. |
-| `model`             | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg`             | -                              | Configuration parameters. |
+| `model`             | `thinc.neural.Model` / `True`  | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `exclusive_classes` | bool                           | Make categories mutually exclusive. Defaults to `False`. |
+| `architecture`      | unicode                        | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. |
 | **RETURNS**         | `TextCategorizer`              | The newly constructed object. |

+### Architectures {#architectures new="2.1"}
+
+Text classification models can be used to solve a wide variety of problems.
+Differences in text length, number of labels, difficulty, and runtime
+performance constraints mean that no single algorithm performs well on all types
+of problems. To handle a wider variety of problems, the `TextCategorizer` object
+allows configuration of its model architecture, using the `architecture` keyword
+argument.
+
+| Name           | Description |
+| -------------- | ----------- |
+| `"ensemble"`   | **Default:** Stacked ensemble of a unigram bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. |
+| `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. |
 ## TextCategorizer.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/textcategorizer#call) and
-[`pipe`](/api/textcategorizer#pipe) delegate to the
-[`predict`](/api/textcategorizer#predict) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/textcategorizer#call) and [`pipe`](/api/textcategorizer#pipe)
+delegate to the [`predict`](/api/textcategorizer#predict) and
 [`set_annotations`](/api/textcategorizer#set_annotations) methods.

 > #### Example
@@ -58,6 +75,7 @@ Both [`__call__`](/api/textcategorizer#call) and
 > ```python
 > textcat = TextCategorizer(nlp.vocab)
 > doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
 > processed = textcat(doc)
 > ```