From c03cb1cc63ebc6427ad8bc82d308e466aa377563 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Sun, 24 Feb 2019 13:11:49 +0100
Subject: [PATCH 1/3] Improve built-in component API docs

---
 website/docs/api/dependencyparser.md | 27 ++++++++++++++------------
 website/docs/api/entityrecognizer.md | 27 ++++++++++++++------------
 website/docs/api/tagger.md           | 29 +++++++++++++++-------------
 website/docs/api/textcategorizer.md  |  8 +++++---
 4 files changed, 51 insertions(+), 40 deletions(-)

diff --git a/website/docs/api/dependencyparser.md b/website/docs/api/dependencyparser.md
index 84ea707f0..b08e6139a 100644
--- a/website/docs/api/dependencyparser.md
+++ b/website/docs/api/dependencyparser.md
@@ -37,17 +37,19 @@ shortcut for this and instantiate the component using its string name and
> parser.from_disk("/path/to/model")
> ```

-| Name | Type | Description |
-| ----------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg` | - | Configuration parameters. |
-| **RETURNS** | `DependencyParser` | The newly constructed object. |
+| Name | Type | Description |
+| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `DependencyParser` | The newly constructed object. |

## DependencyParser.\_\_call\_\_ {#call tag="method"}

Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/dependencyparser#call) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/dependencyparser#call) and
 [`pipe`](/api/dependencyparser#pipe) delegate to the
 [`predict`](/api/dependencyparser#predict) and
 [`set_annotations`](/api/dependencyparser#set_annotations) methods.

@@ -57,6 +59,7 @@ Both [`__call__`](/api/dependencyparser#call) and

> #### Example
>
> ```python
> parser = DependencyParser(nlp.vocab)
> doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
> processed = parser(doc)
> ```

@@ -82,11 +85,11 @@ Apply the pipe to a stream of documents. Both
>     pass
> ```

-| Name | Type | Description |
-| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
-| `stream` | iterable | A stream of documents. |
-| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
-| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
+| Name | Type | Description |
+| ------------ | -------- | ------------------------------------------------------ |
+| `stream` | iterable | A stream of documents. |
+| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
+| **YIELDS** | `Doc` | Processed documents in the order of the original text. |

## DependencyParser.predict {#predict tag="method"}

diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md
index e24d2b408..43de2c15c 100644
--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -37,17 +37,19 @@ shortcut for this and instantiate the component using its string name and
> ner.from_disk("/path/to/model")
> ```

-| Name | Type | Description |
-| ----------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg` | - | Configuration parameters. |
-| **RETURNS** | `EntityRecognizer` | The newly constructed object. |
+| Name | Type | Description |
+| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `EntityRecognizer` | The newly constructed object. |

## EntityRecognizer.\_\_call\_\_ {#call tag="method"}

Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/entityrecognizer#call) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/entityrecognizer#call) and
 [`pipe`](/api/entityrecognizer#pipe) delegate to the
 [`predict`](/api/entityrecognizer#predict) and
 [`set_annotations`](/api/entityrecognizer#set_annotations) methods.

@@ -57,6 +59,7 @@ Both [`__call__`](/api/entityrecognizer#call) and

> #### Example
>
> ```python
> ner = EntityRecognizer(nlp.vocab)
> doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
> processed = ner(doc)
> ```

@@ -82,11 +85,11 @@ Apply the pipe to a stream of documents. Both
>     pass
> ```

-| Name | Type | Description |
-| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
-| `stream` | iterable | A stream of documents. |
-| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
-| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
+| Name | Type | Description |
+| ------------ | -------- | ------------------------------------------------------ |
+| `stream` | iterable | A stream of documents. |
+| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
+| **YIELDS** | `Doc` | Processed documents in the order of the original text. |

## EntityRecognizer.predict {#predict tag="method"}

diff --git a/website/docs/api/tagger.md b/website/docs/api/tagger.md
index e2d7c257f..fccb7cfd0 100644
--- a/website/docs/api/tagger.md
+++ b/website/docs/api/tagger.md
@@ -37,18 +37,20 @@ shortcut for this and instantiate the component using its string name and
> tagger.from_disk("/path/to/model")
> ```

-| Name | Type | Description |
-| ----------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg` | - | Configuration parameters. |
-| **RETURNS** | `Tagger` | The newly constructed object. |
+| Name | Type | Description |
+| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `**cfg` | - | Configuration parameters. |
+| **RETURNS** | `Tagger` | The newly constructed object. |

## Tagger.\_\_call\_\_ {#call tag="method"}

Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/tagger#call) and [`pipe`](/api/tagger#pipe) delegate to
-the [`predict`](/api/tagger#predict) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/tagger#call) and [`pipe`](/api/tagger#pipe) delegate to the
+[`predict`](/api/tagger#predict) and
 [`set_annotations`](/api/tagger#set_annotations) methods.

@@ -56,6 +58,7 @@ the [`predict`](/api/tagger#predict) and

> #### Example
>
> ```python
> tagger = Tagger(nlp.vocab)
> doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
> processed = tagger(doc)
> ```

@@ -79,11 +82,11 @@ Apply the pipe to a stream of documents. Both [`__call__`](/api/tagger#call) and
>     pass
> ```

-| Name | Type | Description |
-| ------------ | -------- | -------------------------------------------------------------------------------------------------------------- |
-| `stream` | iterable | A stream of documents. |
-| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
-| **YIELDS** | `Doc` | Processed documents in the order of the original text. |
+| Name | Type | Description |
+| ------------ | -------- | ------------------------------------------------------ |
+| `stream` | iterable | A stream of documents. |
+| `batch_size` | int | The number of texts to buffer. Defaults to `128`. |
+| **YIELDS** | `Doc` | Processed documents in the order of the original text. |

## Tagger.predict {#predict tag="method"}

diff --git a/website/docs/api/textcategorizer.md b/website/docs/api/textcategorizer.md
index a1fa4c763..cdb826c44 100644
--- a/website/docs/api/textcategorizer.md
+++ b/website/docs/api/textcategorizer.md
@@ -48,9 +48,10 @@ shortcut for this and instantiate the component using its string name and
## TextCategorizer.\_\_call\_\_ {#call tag="method"}

Apply the pipe to one document. The document is modified in place, and returned.
-Both [`__call__`](/api/textcategorizer#call) and
-[`pipe`](/api/textcategorizer#pipe) delegate to the
-[`predict`](/api/textcategorizer#predict) and
+This usually happens under the hood when you call the `nlp` object on a text and
+all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/textcategorizer#call) and [`pipe`](/api/textcategorizer#pipe)
+delegate to the [`predict`](/api/textcategorizer#predict) and
 [`set_annotations`](/api/textcategorizer#set_annotations) methods.

@@ -58,6 +59,7 @@ Both [`__call__`](/api/textcategorizer#call) and

> #### Example
>
> ```python
> textcat = TextCategorizer(nlp.vocab)
> doc = nlp(u"This is a sentence.")
+> # This usually happens under the hood
> processed = textcat(doc)
> ```

From 46ec5cdccc7fe61ad3c319ddf9ef5b587d97f44c Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Sun, 24 Feb 2019 13:11:57 +0100
Subject: [PATCH 2/3] Update TextCategorizer docs

---
 website/docs/api/textcategorizer.md | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/website/docs/api/textcategorizer.md b/website/docs/api/textcategorizer.md
index cdb826c44..f26a89098 100644
--- a/website/docs/api/textcategorizer.md
+++ b/website/docs/api/textcategorizer.md
@@ -31,6 +31,7 @@ shortcut for this and instantiate the component using its string name and
> ```python
> # Construction via create_pipe
> textcat = nlp.create_pipe("textcat")
+> textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})
>
> # Construction from class
> from spacy.pipeline import TextCategorizer
> textcat = TextCategorizer(nlp.vocab)
@@ -38,12 +39,27 @@ shortcut for this and instantiate the component using its string name and
> textcat.from_disk("/path/to/model")
> ```

-| Name | Type | Description |
-| ----------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `thinc.neural.Model` or `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
-| `**cfg` | - | Configuration parameters. |
-| **RETURNS** | `TextCategorizer` | The newly constructed object. |
+| Name | Type | Description |
+| ------------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
+| `exclusive_classes` | bool | Make categories mutually exclusive. Defaults to `False`. |
+| `architecture` | unicode | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. |
+| **RETURNS** | `TextCategorizer` | The newly constructed object. |
+
+### Architectures {#architectures new="2.1"}
+
+Text classification models can be used to solve a wide variety of problems.
+Differences in text length, number of labels, difficulty, and runtime
+performance constraints mean that no single algorithm performs well on all
+types of problems. To handle a wider variety of problems, the `TextCategorizer`
+object allows configuration of its model architecture, using the `architecture`
+keyword argument.
+
+| Name | Description |
+| -------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `"ensemble"` | **Default:** Stacked ensemble of a unigram bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. |
+| `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. |
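+
+For example, a classifier with mutually exclusive categories can opt into the
+leaner CNN model at construction time. This is a minimal sketch that uses only
+the `config` keys documented above:
+
+> #### Example
+>
+> ```python
+> # Sketch: select the "simple_cnn" architecture when creating the pipe
+> textcat = nlp.create_pipe(
+>     "textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"}
+> )
+> nlp.add_pipe(textcat)
+> ```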

## TextCategorizer.\_\_call\_\_ {#call tag="method"}

From 3ef4da35039f9d458972e688681cc667eb39c6a1 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Sun, 24 Feb 2019 13:12:13 +0100
Subject: [PATCH 3/3] Update and auto-format README [ci skip]

---
 README.md | 112 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 58 insertions(+), 54 deletions(-)

diff --git a/README.md b/README.md
index 0c5d0ba59..c9e28ee94 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@

spaCy is a library for advanced Natural Language Processing in Python and
Cython. It's built on the very latest research, and was designed from day one
to be used in real products. spaCy comes with
[pre-trained statistical models](https://spacy.io/models) and word vectors, and
currently supports tokenization for **45+ languages**. It features the
**fastest syntactic parser** in the world, convolutional **neural network
models** for tagging, parsing and **named entity recognition** and easy
**deep learning** integration. It's commercial open-source software,
released under the MIT license.

[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.python.org/pypi/spacy)
[![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square)](https://anaconda.org/conda-forge/spacy)
[![Python wheels](https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white)](https://github.com/explosion/wheelwright/releases)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)
[![spaCy on Twitter](https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow)](https://twitter.com/spacy_io)

## 📖 Documentation

| Documentation   |                                                                 |
| --------------- | --------------------------------------------------------------- |
| [spaCy 101]     | New to spaCy? Here's everything you need to know!               |
| [Usage Guides]  | How to use spaCy and its features.                              |
| [New in v2.1]   | New features, backwards incompatibilities and migration guide.  |
| [API Reference] | The detailed reference for spaCy's API.                         |
| [Models]        | Download statistical language models for spaCy.                 |
| [Universe]      | Libraries, extensions, demos, books and courses.                |
| [Changelog]     | Changes and version history.                                    |
| [Contribute]    | How to contribute to the spaCy project and code base.           |

[spacy 101]: https://spacy.io/usage/spacy-101
[new in v2.1]: https://spacy.io/usage/v2-1
[usage guides]: https://spacy.io/usage/
[api reference]: https://spacy.io/api/
[models]: https://spacy.io/models
[universe]: https://spacy.io/universe
[changelog]: https://spacy.io/usage/#changelog
[contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md

## 💬 Where to ask questions

The spaCy project is maintained by [@honnibal](https://github.com/honnibal)
and [@ines](https://github.com/ines). Please understand that we won't be able
to provide individual support via email. We also believe that help is much more
valuable if it's shared publicly, so that more people can benefit from it.

| Type                     | Platforms                                              |
| ------------------------ | ------------------------------------------------------ |
| 🚨 **Bug Reports**        | [GitHub Issue Tracker]                                 |
| 🎁 **Feature Requests**   | [GitHub Issue Tracker]                                 |
| 👩‍💻 **Usage Questions**    | [Stack Overflow] · [Gitter Chat] · [Reddit User Group] |
| 🗯 **General Discussion**  | [Gitter Chat] · [Reddit User Group]                    |

[github issue tracker]: https://github.com/explosion/spaCy/issues
[stack overflow]: http://stackoverflow.com/questions/tagged/spacy
[gitter chat]: https://gitter.im/explosion/spaCy
[reddit user group]: https://www.reddit.com/r/spacynlp

## Features

- **Fastest syntactic parser** in the world
- **Named entity** recognition
- Non-destructive **tokenization**
- Support for **45+ languages**
- Pre-trained [statistical models](https://spacy.io/models) and word vectors
- Easy **deep learning** integration
- Part-of-speech tagging
- Labelled dependency parsing
- Syntax-driven sentence segmentation
- Built-in **visualizers** for syntax and NER
- Convenient string-to-hash mapping
- Export to numpy data arrays
- Efficient binary serialization
- Easy **model packaging** and deployment
- State-of-the-art speed
- Robust, rigorously evaluated accuracy

📖 **For more details, see the
[facts, figures and benchmarks](https://spacy.io/usage/facts-figures).**

## Install spaCy

For detailed installation instructions, see the
[documentation](https://spacy.io/usage).

- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
- **Python version**: Python 2.7, 3.4+ (only 64 bit)
- **Package managers**: [pip] · [conda] (via `conda-forge`)

[pip]: https://pypi.python.org/pypi/spacy
[conda]: https://anaconda.org/conda-forge/spacy
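A minimal sketch of both routes (the statistical models are installed
separately, see "Download models" below):

```bash
pip install spacy
# or, from the conda-forge channel
conda install -c conda-forge spacy
```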
If you've trained your own models, keep in mind that your training and runtime
inputs must match. After updating spaCy, we recommend **retraining your models**
with the new version.
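A minimal sketch of that update step (`spacy validate` reports which of the
installed models are compatible with the new spaCy version):

```bash
pip install -U spacy
python -m spacy validate
```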
📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
[migration guide](https://spacy.io/usage/v2#migrating).**

## Download models

As of v1.7.0, models for spaCy can be installed as **Python packages**. This
means that they're a component of your application, just like any other module.
Models can be installed using spaCy's `download` command, or manually by
pointing pip to a path or URL.

| Documentation          |                                                                |
| ---------------------- | -------------------------------------------------------------- |
| [Available Models]     | Detailed model descriptions, accuracy figures and benchmarks.  |
| [Models Documentation] | Detailed usage instructions.                                   |

[available models]: https://spacy.io/models
[models documentation]: https://spacy.io/docs/usage/models

```bash
# out-of-the-box: download best-matching default model
python -m spacy download en
```

VS 2010 (Python 3.4) and VS 2015 (Python 3.5).

## Run tests

spaCy comes with an [extensive test suite](spacy/tests). In order to run the
tests, you'll usually want to clone the repository and build spaCy from source.
This will also install the required development dependencies and test utilities
defined in the `requirements.txt`.
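A minimal sketch of that workflow, assuming you're inside the cloned repository
with spaCy built from source:

```bash
# install the development dependencies and test utilities
python -m pip install -r requirements.txt
# run the test suite against the spacy package directory
python -m pytest spacy
```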