Merge branch 'develop' into spacy.io

Ines Montani 2019-02-24 13:12:26 +01:00
commit 8458379cf5
5 changed files with 131 additions and 100 deletions

112
README.md

@ -6,7 +6,7 @@ spaCy is a library for advanced Natural Language Processing in Python and
Cython. It's built on the very latest research, and was designed from day one
to be used in real products. spaCy comes with
[pre-trained statistical models](https://spacy.io/models) and word vectors, and
currently supports tokenization for **30+ languages**. It features the
currently supports tokenization for **45+ languages**. It features the
**fastest syntactic parser** in the world, convolutional
**neural network models** for tagging, parsing and **named entity recognition**
and easy **deep learning** integration. It's commercial open-source software,
@ -20,29 +20,30 @@ released under the MIT license.
[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.python.org/pypi/spacy)
[![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square)](https://anaconda.org/conda-forge/spacy)
[![Python wheels](https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white)](https://github.com/explosion/wheelwright/releases)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)
[![spaCy on Twitter](https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow)](https://twitter.com/spacy_io)
## 📖 Documentation
| Documentation | |
| --- | --- |
| [spaCy 101] | New to spaCy? Here's everything you need to know!
| [Usage Guides] | How to use spaCy and its features. |
| [New in v2.0] | New features, backwards incompatibilities and migration guide. |
| [API Reference] | The detailed reference for spaCy's API. |
| [Models] | Download statistical language models for spaCy. |
| [Universe] | Libraries, extensions, demos, books and courses. |
| [Changelog] | Changes and version history. |
| [Contribute] | How to contribute to the spaCy project and code base. |
| Documentation | |
| --------------- | -------------------------------------------------------------- |
| [spaCy 101] | New to spaCy? Here's everything you need to know! |
| [Usage Guides] | How to use spaCy and its features. |
| [New in v2.1] | New features, backwards incompatibilities and migration guide. |
| [API Reference] | The detailed reference for spaCy's API. |
| [Models] | Download statistical language models for spaCy. |
| [Universe] | Libraries, extensions, demos, books and courses. |
| [Changelog] | Changes and version history. |
| [Contribute] | How to contribute to the spaCy project and code base. |
[spaCy 101]: https://spacy.io/usage/spacy-101
[New in v2.0]: https://spacy.io/usage/v2#migrating
[Usage Guides]: https://spacy.io/usage/
[API Reference]: https://spacy.io/api/
[Models]: https://spacy.io/models
[Universe]: https://spacy.io/universe
[Changelog]: https://spacy.io/usage/#changelog
[Contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
[spacy 101]: https://spacy.io/usage/spacy-101
[new in v2.1]: https://spacy.io/usage/v2-1
[usage guides]: https://spacy.io/usage/
[api reference]: https://spacy.io/api/
[models]: https://spacy.io/models
[universe]: https://spacy.io/universe
[changelog]: https://spacy.io/usage/#changelog
[contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
## 💬 Where to ask questions
@ -51,35 +52,38 @@ and [@ines](https://github.com/ines). Please understand that we won't be able
to provide individual support via email. We also believe that help is much more
valuable if it's shared publicly, so that more people can benefit from it.
* **Bug Reports**: [GitHub Issue Tracker]
* **Usage Questions**: [Stack Overflow] · [Gitter Chat] · [Reddit User Group]
* **General Discussion**: [Gitter Chat] · [Reddit User Group]
| Type | Platforms |
| ------------------------ | ------------------------------------------------------ |
| 🚨**Bug Reports** | [GitHub Issue Tracker] |
| 🎁 **Feature Requests** | [GitHub Issue Tracker] |
| 👩‍💻**Usage Questions** | [Stack Overflow] · [Gitter Chat] · [Reddit User Group] |
| 🗯 **General Discussion** | [Gitter Chat] · [Reddit User Group] |
[GitHub Issue Tracker]: https://github.com/explosion/spaCy/issues
[Stack Overflow]: http://stackoverflow.com/questions/tagged/spacy
[Gitter Chat]: https://gitter.im/explosion/spaCy
[Reddit User Group]: https://www.reddit.com/r/spacynlp
[github issue tracker]: https://github.com/explosion/spaCy/issues
[stack overflow]: http://stackoverflow.com/questions/tagged/spacy
[gitter chat]: https://gitter.im/explosion/spaCy
[reddit user group]: https://www.reddit.com/r/spacynlp
## Features
* **Fastest syntactic parser** in the world
* **Named entity** recognition
* Non-destructive **tokenization**
* Support for **30+ languages**
* Pre-trained [statistical models](https://spacy.io/models) and word vectors
* Easy **deep learning** integration
* Part-of-speech tagging
* Labelled dependency parsing
* Syntax-driven sentence segmentation
* Built in **visualizers** for syntax and NER
* Convenient string-to-hash mapping
* Export to numpy data arrays
* Efficient binary serialization
* Easy **model packaging** and deployment
* State-of-the-art speed
* Robust, rigorously evaluated accuracy
- **Fastest syntactic parser** in the world
- **Named entity** recognition
- Non-destructive **tokenization**
- Support for **45+ languages**
- Pre-trained [statistical models](https://spacy.io/models) and word vectors
- Easy **deep learning** integration
- Part-of-speech tagging
- Labelled dependency parsing
- Syntax-driven sentence segmentation
- Built in **visualizers** for syntax and NER
- Convenient string-to-hash mapping
- Export to numpy data arrays
- Efficient binary serialization
- Easy **model packaging** and deployment
- State-of-the-art speed
- Robust, rigorously evaluated accuracy
📖 **For more details, see the
📖 **For more details, see the
[facts, figures and benchmarks](https://spacy.io/usage/facts-figures).**
## Install spaCy
@ -87,9 +91,9 @@ valuable if it's shared publicly, so that more people can benefit from it.
For detailed installation instructions, see the
[documentation](https://spacy.io/usage).
* **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
* **Python version**: Python 2.7, 3.4+ (only 64 bit)
* **Package managers**: [pip] · [conda] (via `conda-forge`)
- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
- **Python version**: Python 2.7, 3.4+ (only 64 bit)
- **Package managers**: [pip] · [conda] (via `conda-forge`)
[pip]: https://pypi.python.org/pypi/spacy
[conda]: https://anaconda.org/conda-forge/spacy
@ -142,7 +146,7 @@ If you've trained your own models, keep in mind that your training and runtime
inputs must match. After updating spaCy, we recommend **retraining your models**
with the new version.
📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
[migration guide](https://spacy.io/usage/v2#migrating).**
## Download models
@ -152,13 +156,13 @@ This means that they're a component of your application, just like any
other module. Models can be installed using spaCy's `download` command,
or manually by pointing pip to a path or URL.
| Documentation | |
| --- | --- |
| [Available Models] | Detailed model descriptions, accuracy figures and benchmarks. |
| [Models Documentation] | Detailed usage instructions. |
| Documentation | |
| ---------------------- | ------------------------------------------------------------- |
| [Available Models] | Detailed model descriptions, accuracy figures and benchmarks. |
| [Models Documentation] | Detailed usage instructions. |
[Available Models]: https://spacy.io/models
[Models Documentation]: https://spacy.io/docs/usage/models
[available models]: https://spacy.io/models
[models documentation]: https://spacy.io/docs/usage/models
```bash
# out-of-the-box: download best-matching default model
@ -261,7 +265,7 @@ VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
## Run tests
spaCy comes with an [extensive test suite](spacy/tests). In order to run the
spaCy comes with an [extensive test suite](spacy/tests). In order to run the
tests, you'll usually want to clone the repository and build spaCy from source.
This will also install the required development dependencies and test utilities
defined in the `requirements.txt`.


@ -37,17 +37,19 @@ shortcut for this and instantiate the component using its string name and
> parser.from_disk("/path/to/model")
> ```
| Name | Type | Description |
| ----------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `DependencyParser` | The newly constructed object. |
| Name | Type | Description |
| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `DependencyParser` | The newly constructed object. |
## DependencyParser.\_\_call\_\_ {#call tag="method"}
Apply the pipe to one document. The document is modified in place, and returned.
Both [`__call__`](/api/dependencyparser#call) and
This usually happens under the hood when you call the `nlp` object on a text and
all pipeline components are applied to the `Doc` in order. Both
[`__call__`](/api/dependencyparser#call) and
[`pipe`](/api/dependencyparser#pipe) delegate to the
[`predict`](/api/dependencyparser#predict) and
[`set_annotations`](/api/dependencyparser#set_annotations) methods.
@ -57,6 +59,7 @@ Both [`__call__`](/api/dependencyparser#call) and
> ```python
> parser = DependencyParser(nlp.vocab)
> doc = nlp(u"This is a sentence.")
> # This usually happens under the hood
> processed = parser(doc)
> ```
@ -82,11 +85,11 @@ Apply the pipe to a stream of documents. Both
> pass
> ```
| Name | Type | Description |
| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
| `stream` | iterable | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
| Name | Type | Description |
| ------------ | -------- | ------------------------------------------------------ |
| `stream` | iterable | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
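
The call structure described in this file (both `__call__` and `pipe` delegating to `predict` and `set_annotations`, with `pipe` buffering `batch_size` documents at a time) can be sketched without spaCy. This is a minimal illustration under stated assumptions: `ToyDoc`, `ToyPipe`, and the whitespace-count "prediction" are hypothetical stand-ins, not the library's implementation.

```python
from itertools import islice


class ToyDoc:
    """Hypothetical stand-in for a Doc: text plus one annotation slot."""

    def __init__(self, text):
        self.text = text
        self.n_tokens = None


class ToyPipe:
    """Hypothetical component that mirrors the documented delegation."""

    def predict(self, docs):
        # Score a batch of documents without modifying them.
        return [len(doc.text.split()) for doc in docs]

    def set_annotations(self, docs, scores):
        # Write the precomputed scores onto the documents, in place.
        for doc, score in zip(docs, scores):
            doc.n_tokens = score

    def __call__(self, doc):
        # One document is treated as a batch of one, then returned.
        self.set_annotations([doc], self.predict([doc]))
        return doc

    def pipe(self, stream, batch_size=128):
        # Buffer up to batch_size documents, annotate them, and yield
        # them in the order they arrived.
        stream = iter(stream)
        while True:
            batch = list(islice(stream, batch_size))
            if not batch:
                break
            self.set_annotations(batch, self.predict(batch))
            yield from batch


toy = ToyPipe()
single = toy(ToyDoc("This is a sentence."))
streamed = list(toy.pipe((ToyDoc(t) for t in ["a b", "c", "d e f"]), batch_size=2))
```

Keeping `predict` free of side effects and confining all mutation to `set_annotations` is what lets the single-document and streaming paths share the same two methods.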
## DependencyParser.predict {#predict tag="method"}


@ -37,17 +37,19 @@ shortcut for this and instantiate the component using its string name and
> ner.from_disk("/path/to/model")
> ```
| Name | Type | Description |
| ----------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `EntityRecognizer` | The newly constructed object. |
| Name | Type | Description |
| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `EntityRecognizer` | The newly constructed object. |
## EntityRecognizer.\_\_call\_\_ {#call tag="method"}
Apply the pipe to one document. The document is modified in place, and returned.
Both [`__call__`](/api/entityrecognizer#call) and
This usually happens under the hood when you call the `nlp` object on a text and
all pipeline components are applied to the `Doc` in order. Both
[`__call__`](/api/entityrecognizer#call) and
[`pipe`](/api/entityrecognizer#pipe) delegate to the
[`predict`](/api/entityrecognizer#predict) and
[`set_annotations`](/api/entityrecognizer#set_annotations) methods.
@ -57,6 +59,7 @@ Both [`__call__`](/api/entityrecognizer#call) and
> ```python
> ner = EntityRecognizer(nlp.vocab)
> doc = nlp(u"This is a sentence.")
> # This usually happens under the hood
> processed = ner(doc)
> ```
@ -82,11 +85,11 @@ Apply the pipe to a stream of documents. Both
> pass
> ```
| Name | Type | Description |
| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
| `stream` | iterable | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
| Name | Type | Description |
| ------------ | -------- | ------------------------------------------------------ |
| `stream` | iterable | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
## EntityRecognizer.predict {#predict tag="method"}


@ -37,18 +37,20 @@ shortcut for this and instantiate the component using its string name and
> tagger.from_disk("/path/to/model")
> ```
| Name | Type | Description |
| ----------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `Tagger` | The newly constructed object. |
| Name | Type | Description |
| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `Tagger` | The newly constructed object. |
## Tagger.\_\_call\_\_ {#call tag="method"}
Apply the pipe to one document. The document is modified in place, and returned.
Both [`__call__`](/api/tagger#call) and [`pipe`](/api/tagger#pipe) delegate to
the [`predict`](/api/tagger#predict) and
This usually happens under the hood when you call the `nlp` object on a text and
all pipeline components are applied to the `Doc` in order. Both
[`__call__`](/api/tagger#call) and [`pipe`](/api/tagger#pipe) delegate to the
[`predict`](/api/tagger#predict) and
[`set_annotations`](/api/tagger#set_annotations) methods.
> #### Example
@ -56,6 +58,7 @@ the [`predict`](/api/tagger#predict) and
> ```python
> tagger = Tagger(nlp.vocab)
> doc = nlp(u"This is a sentence.")
> # This usually happens under the hood
> processed = tagger(doc)
> ```
@ -79,11 +82,11 @@ Apply the pipe to a stream of documents. Both [`__call__`](/api/tagger#call) and
> pass
> ```
| Name | Type | Description |
| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
| `stream` | iterable | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
| Name | Type | Description |
| ------------ | -------- | ------------------------------------------------------ |
| `stream` | iterable | A stream of documents. |
| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
## Tagger.predict {#predict tag="method"}


@ -31,6 +31,7 @@ shortcut for this and instantiate the component using its string name and
> ```python
> # Construction via create_pipe
> textcat = nlp.create_pipe("textcat")
> textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})
>
> # Construction from class
> from spacy.pipeline import TextCategorizer
@ -38,19 +39,35 @@ shortcut for this and instantiate the component using its string name and
> textcat.from_disk("/path/to/model")
> ```
| Name | Type | Description |
| ----------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `**cfg` | - | Configuration parameters. |
| **RETURNS** | `TextCategorizer` | The newly constructed object. |
| Name | Type | Description |
| ------------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `exclusive_classes` | bool | Make categories mutually exclusive. Defaults to `False`. |
| `architecture` | unicode | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. |
| **RETURNS** | `TextCategorizer` | The newly constructed object. |
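
To make the effect of `exclusive_classes` concrete, here is a minimal sketch that assumes the conventional reading of "mutually exclusive": scores that compete via a softmax, versus one independent sigmoid per label. The function names and the raw scores are hypothetical, and the diff does not state that the library computes its output exactly this way.

```python
import math


def exclusive_scores(logits):
    # Softmax: scores compete and sum to 1, so one label dominates.
    shift = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - shift) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def independent_scores(logits):
    # Sigmoid per label: each score lies in (0, 1) regardless of the others.
    return {label: 1.0 / (1.0 + math.exp(-value)) for label, value in logits.items()}


# Made-up raw scores for three hypothetical labels.
logits = {"POSITIVE": 2.0, "NEGATIVE": 0.5, "NEUTRAL": -1.0}
exclusive = exclusive_scores(logits)
independent = independent_scores(logits)
```

With independent scores, several labels can be likely at once, which suits multi-label tagging; exclusive scores suit tasks where each text has exactly one category.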
### Architectures {#architectures new="2.1"}
Text classification models can be used to solve a wide variety of problems.
Differences in text length, number of labels, difficulty, and runtime
performance constraints mean that no single algorithm performs well on all types
of problems. To handle a wider variety of problems, the `TextCategorizer` object
allows configuration of its model architecture, using the `architecture` keyword
argument.
| Name | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `"ensemble"` | **Default:** Stacked ensemble of a unigram bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. |
| `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. |
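
The mean pooling step mentioned for `"simple_cnn"` can be illustrated in a few lines. This sketch covers only the pooling: the CNN that would produce the token vectors and the feed-forward network that would consume the result are omitted, and `mean_pool` and the vectors are made up for illustration.

```python
def mean_pool(token_vectors):
    # Average each dimension across tokens: (n_tokens, width) -> (width,).
    n_tokens = len(token_vectors)
    return [sum(column) / n_tokens for column in zip(*token_vectors)]


# Two hypothetical token vectors of width 3.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]
doc_features = mean_pool(token_vectors)
```

The pooled vector has the same width no matter how many tokens the text contains, which is what allows a fixed-size network to classify texts of different lengths.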
## TextCategorizer.\_\_call\_\_ {#call tag="method"}
Apply the pipe to one document. The document is modified in place, and returned.
Both [`__call__`](/api/textcategorizer#call) and
[`pipe`](/api/textcategorizer#pipe) delegate to the
[`predict`](/api/textcategorizer#predict) and
This usually happens under the hood when you call the `nlp` object on a text and
all pipeline components are applied to the `Doc` in order. Both
[`__call__`](/api/textcategorizer#call) and [`pipe`](/api/textcategorizer#pipe)
delegate to the [`predict`](/api/textcategorizer#predict) and
[`set_annotations`](/api/textcategorizer#set_annotations) methods.
> #### Example
@ -58,6 +75,7 @@ Both [`__call__`](/api/textcategorizer#call) and
> ```python
> textcat = TextCategorizer(nlp.vocab)
> doc = nlp(u"This is a sentence.")
> # This usually happens under the hood
> processed = textcat(doc)
> ```