Merge branch 'develop' into spacy.io

Ines Montani 2019-02-24 13:12:26 +01:00
commit 8458379cf5
5 changed files with 131 additions and 100 deletions

README.md

@@ -6,7 +6,7 @@ spaCy is a library for advanced Natural Language Processing in Python and
 Cython. It's built on the very latest research, and was designed from day one
 to be used in real products. spaCy comes with
 [pre-trained statistical models](https://spacy.io/models) and word vectors, and
-currently supports tokenization for **30+ languages**. It features the
+currently supports tokenization for **45+ languages**. It features the
 **fastest syntactic parser** in the world, convolutional
 **neural network models** for tagging, parsing and **named entity recognition**
 and easy **deep learning** integration. It's commercial open-source software,
@@ -20,29 +20,30 @@ released under the MIT license.
 [![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.python.org/pypi/spacy)
 [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square)](https://anaconda.org/conda-forge/spacy)
 [![Python wheels](https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white)](https://github.com/explosion/wheelwright/releases)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)
 [![spaCy on Twitter](https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow)](https://twitter.com/spacy_io)

 ## 📖 Documentation

 | Documentation   |                                                                 |
-| --- | --- |
+| --------------- | -------------------------------------------------------------- |
 | [spaCy 101]     | New to spaCy? Here's everything you need to know!               |
 | [Usage Guides]  | How to use spaCy and its features.                              |
-| [New in v2.0]   | New features, backwards incompatibilities and migration guide.  |
+| [New in v2.1]   | New features, backwards incompatibilities and migration guide.  |
 | [API Reference] | The detailed reference for spaCy's API.                         |
 | [Models]        | Download statistical language models for spaCy.                 |
 | [Universe]      | Libraries, extensions, demos, books and courses.                |
 | [Changelog]     | Changes and version history.                                    |
 | [Contribute]    | How to contribute to the spaCy project and code base.           |

 [spacy 101]: https://spacy.io/usage/spacy-101
-[New in v2.0]: https://spacy.io/usage/v2#migrating
+[new in v2.1]: https://spacy.io/usage/v2-1
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
 [universe]: https://spacy.io/universe
 [changelog]: https://spacy.io/usage/#changelog
 [contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md

 ## 💬 Where to ask questions
@@ -51,35 +52,38 @@ and [@ines](https://github.com/ines). Please understand that we won't be able
 to provide individual support via email. We also believe that help is much more
 valuable if it's shared publicly, so that more people can benefit from it.

-* **Bug Reports**: [GitHub Issue Tracker]
-* **Usage Questions**: [Stack Overflow] · [Gitter Chat] · [Reddit User Group]
-* **General Discussion**: [Gitter Chat] · [Reddit User Group]
+| Type                      | Platforms                                              |
+| ------------------------- | ------------------------------------------------------ |
+| 🚨 **Bug Reports**        | [GitHub Issue Tracker]                                 |
+| 🎁 **Feature Requests**   | [GitHub Issue Tracker]                                 |
+| 👩‍💻 **Usage Questions**    | [Stack Overflow] · [Gitter Chat] · [Reddit User Group] |
+| 🗯 **General Discussion**  | [Gitter Chat] · [Reddit User Group]                    |

 [github issue tracker]: https://github.com/explosion/spaCy/issues
 [stack overflow]: http://stackoverflow.com/questions/tagged/spacy
 [gitter chat]: https://gitter.im/explosion/spaCy
 [reddit user group]: https://www.reddit.com/r/spacynlp

 ## Features

 - **Fastest syntactic parser** in the world
 - **Named entity** recognition
 - Non-destructive **tokenization**
-* Support for **30+ languages**
+- Support for **45+ languages**
 - Pre-trained [statistical models](https://spacy.io/models) and word vectors
 - Easy **deep learning** integration
 - Part-of-speech tagging
 - Labelled dependency parsing
 - Syntax-driven sentence segmentation
 - Built in **visualizers** for syntax and NER
 - Convenient string-to-hash mapping
 - Export to numpy data arrays
 - Efficient binary serialization
 - Easy **model packaging** and deployment
 - State-of-the-art speed
 - Robust, rigorously evaluated accuracy

 📖 **For more details, see the
 [facts, figures and benchmarks](https://spacy.io/usage/facts-figures).**

 ## Install spaCy
@@ -87,9 +91,9 @@ valuable if it's shared publicly, so that more people can benefit from it.
 For detailed installation instructions, see the
 [documentation](https://spacy.io/usage).

-* **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
-* **Python version**: Python 2.7, 3.4+ (only 64 bit)
-* **Package managers**: [pip] · [conda] (via `conda-forge`)
+- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
+- **Python version**: Python 2.7, 3.4+ (only 64 bit)
+- **Package managers**: [pip] · [conda] (via `conda-forge`)

 [pip]: https://pypi.python.org/pypi/spacy
 [conda]: https://anaconda.org/conda-forge/spacy
@@ -142,7 +146,7 @@ If you've trained your own models, keep in mind that your training and runtime
 inputs must match. After updating spaCy, we recommend **retraining your models**
 with the new version.

 📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
 [migration guide](https://spacy.io/usage/v2#migrating).**

 ## Download models
@@ -152,13 +156,13 @@ This means that they're a component of your application, just like any
 other module. Models can be installed using spaCy's `download` command,
 or manually by pointing pip to a path or URL.

 | Documentation          |                                                                |
-| --- | --- |
+| ---------------------- | -------------------------------------------------------------- |
 | [Available Models]     | Detailed model descriptions, accuracy figures and benchmarks.  |
 | [Models Documentation] | Detailed usage instructions.                                   |

 [available models]: https://spacy.io/models
 [models documentation]: https://spacy.io/docs/usage/models

 ```bash
 # out-of-the-box: download best-matching default model
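For context on what follows a download, here is a minimal sketch of loading and using an installed model from Python; it assumes the small English model `en_core_web_sm` has already been installed via the `download` command, and the model name is illustrative.

```python
import spacy

# A minimal sketch, assuming `en_core_web_sm` has been installed,
# e.g. via spaCy's `download` command shown above.
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is a sentence.")
# Print each token with its coarse-grained part-of-speech tag.
print([(token.text, token.pos_) for token in doc])
```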
@@ -261,7 +265,7 @@ VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
 ## Run tests

 spaCy comes with an [extensive test suite](spacy/tests). In order to run the
 tests, you'll usually want to clone the repository and build spaCy from source.
 This will also install the required development dependencies and test utilities
 defined in the `requirements.txt`.
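As a hedged sketch of that last step, the suite can also be invoked programmatically once the development dependencies are in place; the path and flag below are assumptions for illustration, not part of the diff.

```python
# A minimal sketch: run the test suite from Python, assuming pytest
# and the dependencies from requirements.txt are installed, and that
# this is executed from the root of a spaCy source checkout.
import pytest

pytest.main(["spacy/tests", "--verbose"])
```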

website/docs/api/dependencyparser.md

@@ -37,17 +37,19 @@ shortcut for this and instantiate the component using its string name and
 > parser.from_disk("/path/to/model")
 > ```

 | Name        | Type                           | Description |
 | ----------- | ------------------------------ | ----------- |
 | `vocab`     | `Vocab`                        | The shared vocabulary. |
-| `model`     | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `model`     | `thinc.neural.Model` / `True`  | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
 | `**cfg`     | -                              | Configuration parameters. |
 | **RETURNS** | `DependencyParser`             | The newly constructed object. |

 ## DependencyParser.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/dependencyparser#call) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/dependencyparser#call) and
 [`pipe`](/api/dependencyparser#pipe) delegate to the
 [`predict`](/api/dependencyparser#predict) and
 [`set_annotations`](/api/dependencyparser#set_annotations) methods.
@@ -57,6 +59,7 @@ Both [`__call__`](/api/dependencyparser#call) and
 > ```python
 > parser = DependencyParser(nlp.vocab)
 > doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
 > processed = parser(doc)
 > ```
@@ -82,11 +85,11 @@ Apply the pipe to a stream of documents. Both
 > pass
 > ```

 | Name         | Type     | Description |
-| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
+| ------------ | -------- | ------------------------------------------------------ |
 | `stream`     | iterable | A stream of documents. |
 | `batch_size` | int      | The number of texts to buffer. Defaults to `128`. |
 | **YIELDS**   | `Doc`    | Processed documents in the order of the original text. |
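To make the `pipe` signature above concrete, here is a hedged sketch of streaming pre-tokenized `Doc` objects through the parser; it assumes a pre-trained English model such as `en_core_web_sm` is installed, and the texts are illustrative.

```python
import spacy

# A minimal sketch, assuming a pre-trained English model is installed.
nlp = spacy.load("en_core_web_sm")
parser = nlp.get_pipe("parser")

texts = [u"This is a sentence.", u"Here is another one."]
# Tokenize first, then stream the Docs through the parser.
for doc in parser.pipe((nlp.make_doc(text) for text in texts), batch_size=128):
    print([(token.text, token.dep_) for token in doc])
```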
 ## DependencyParser.predict {#predict tag="method"}

website/docs/api/entityrecognizer.md

@@ -37,17 +37,19 @@ shortcut for this and instantiate the component using its string name and
 > ner.from_disk("/path/to/model")
 > ```

 | Name        | Type                           | Description |
 | ----------- | ------------------------------ | ----------- |
 | `vocab`     | `Vocab`                        | The shared vocabulary. |
-| `model`     | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `model`     | `thinc.neural.Model` / `True`  | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
 | `**cfg`     | -                              | Configuration parameters. |
 | **RETURNS** | `EntityRecognizer`             | The newly constructed object. |

 ## EntityRecognizer.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/entityrecognizer#call) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/entityrecognizer#call) and
 [`pipe`](/api/entityrecognizer#pipe) delegate to the
 [`predict`](/api/entityrecognizer#predict) and
 [`set_annotations`](/api/entityrecognizer#set_annotations) methods.
@@ -57,6 +59,7 @@ Both [`__call__`](/api/entityrecognizer#call) and
 > ```python
 > ner = EntityRecognizer(nlp.vocab)
 > doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
 > processed = ner(doc)
 > ```
@@ -82,11 +85,11 @@ Apply the pipe to a stream of documents. Both
 > pass
 > ```

 | Name         | Type     | Description |
-| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
+| ------------ | -------- | ------------------------------------------------------ |
 | `stream`     | iterable | A stream of documents. |
 | `batch_size` | int      | The number of texts to buffer. Defaults to `128`. |
 | **YIELDS**   | `Doc`    | Processed documents in the order of the original text. |
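As a hedged illustration of what the recognizer produces when it runs as part of a pipeline, a short sketch; it assumes the small English model is installed, and the example sentence is illustrative.

```python
import spacy

# A minimal sketch, assuming `en_core_web_sm` is installed. The entity
# recognizer in the pipeline sets doc.ents when the text is processed.
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)
```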
 ## EntityRecognizer.predict {#predict tag="method"}

website/docs/api/tagger.md

@@ -37,18 +37,20 @@ shortcut for this and instantiate the component using its string name and
 > tagger.from_disk("/path/to/model")
 > ```

 | Name        | Type                           | Description |
 | ----------- | ------------------------------ | ----------- |
 | `vocab`     | `Vocab`                        | The shared vocabulary. |
-| `model`     | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `model`     | `thinc.neural.Model` / `True`  | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
 | `**cfg`     | -                              | Configuration parameters. |
 | **RETURNS** | `Tagger`                       | The newly constructed object. |

 ## Tagger.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/tagger#call) and [`pipe`](/api/tagger#pipe) delegate to
-the [`predict`](/api/tagger#predict) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/tagger#call) and [`pipe`](/api/tagger#pipe) delegate to the
+[`predict`](/api/tagger#predict) and
 [`set_annotations`](/api/tagger#set_annotations) methods.

 > #### Example
@@ -56,6 +58,7 @@ the [`predict`](/api/tagger#predict) and
 > ```python
 > tagger = Tagger(nlp.vocab)
 > doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
 > processed = tagger(doc)
 > ```
@@ -79,11 +82,11 @@ Apply the pipe to a stream of documents. Both [`__call__`](/api/tagger#call) and
 > pass
 > ```

 | Name         | Type     | Description |
-| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
+| ------------ | -------- | ------------------------------------------------------ |
 | `stream`     | iterable | A stream of documents. |
 | `batch_size` | int      | The number of texts to buffer. Defaults to `128`. |
 | **YIELDS**   | `Doc`    | Processed documents in the order of the original text. |
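For a hedged picture of the tagger's output when it runs inside a pipeline, assuming the small English model is installed:

```python
import spacy

# A minimal sketch, assuming `en_core_web_sm` is installed. The tagger
# sets both fine-grained (token.tag_) and coarse-grained (token.pos_) tags.
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is a sentence.")
for token in doc:
    print(token.text, token.tag_, token.pos_)
```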
 ## Tagger.predict {#predict tag="method"}

website/docs/api/textcategorizer.md

@@ -31,6 +31,7 @@ shortcut for this and instantiate the component using its string name and
 > ```python
 > # Construction via create_pipe
 > textcat = nlp.create_pipe("textcat")
+> textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})
 >
 > # Construction from class
 > from spacy.pipeline import TextCategorizer
@@ -38,19 +39,35 @@ shortcut for this and instantiate the component using its string name and
 > textcat.from_disk("/path/to/model")
 > ```

 | Name                | Type                           | Description |
 | ------------------- | ------------------------------ | ----------- |
 | `vocab`             | `Vocab`                        | The shared vocabulary. |
-| `model`             | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg`             | -                              | Configuration parameters. |
+| `model`             | `thinc.neural.Model` / `True`  | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `exclusive_classes` | bool                           | Make categories mutually exclusive. Defaults to `False`. |
+| `architecture`      | unicode                        | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. |
 | **RETURNS**         | `TextCategorizer`              | The newly constructed object. |

+### Architectures {#architectures new="2.1"}
+
+Text classification models can be used to solve a wide variety of problems.
+Differences in text length, number of labels, difficulty, and runtime
+performance constraints mean that no single algorithm performs well on all types
+of problems. To handle a wider variety of problems, the `TextCategorizer` object
+allows configuration of its model architecture, using the `architecture` keyword
+argument.
+
+| Name           | Description |
+| -------------- | ----------- |
+| `"ensemble"`   | **Default:** Stacked ensemble of a unigram bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. |
+| `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. |
 ## TextCategorizer.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/textcategorizer#call) and
-[`pipe`](/api/textcategorizer#pipe) delegate to the
-[`predict`](/api/textcategorizer#predict) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/textcategorizer#call) and [`pipe`](/api/textcategorizer#pipe)
+delegate to the [`predict`](/api/textcategorizer#predict) and
 [`set_annotations`](/api/textcategorizer#set_annotations) methods.

 > #### Example
@@ -58,6 +75,7 @@ Both [`__call__`](/api/textcategorizer#call) and
 > ```python
 > textcat = TextCategorizer(nlp.vocab)
 > doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
 > processed = textcat(doc)
 > ```