spaCy/spacy/pipeline
Matthew Honnibal 6c783f8045 Bug fixes and options for TextCategorizer (#3472)
* Fix code for bag-of-words feature extraction

The _ml.py module had two copies of a function to extract unigram
bag-of-words features, one of which had a bug that set the values to 0.
A third function extracted bigram features. Replace all three with a
single function that supports arbitrary ngram sizes and lets the caller
choose which token attribute is used (e.g. ORTH or LOWER).

* Support 'bow' architecture for TextCategorizer

This allows efficient ngram bag-of-words models, which are a better
choice when the classifier needs to run quickly, especially on long
texts. Pass architecture="bow" to use it. The extra arguments ngram_size
and attr are also available, e.g. ngram_size=2 means that both unigram
and bigram features will be extracted.
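
To make the meaning of ngram_size and attr concrete, here is a minimal
pure-Python sketch of ngram bag-of-words counting. It is an illustration
only, not the function in _ml.py: the name extract_ngrams and its
signature are hypothetical, plain strings stand in for tokens, and
str.lower stands in for the LOWER attribute.

```python
from collections import Counter


def extract_ngrams(tokens, ngram_size=1, attr=str.lower):
    """Count every n-gram of length 1..ngram_size (hypothetical helper).

    `attr` is a callable that maps a token to the key used for counting;
    it plays the role of choosing ORTH vs. LOWER in the description above.
    """
    keys = [attr(token) for token in tokens]
    counts = Counter()
    # ngram_size=2 covers n=1 (unigrams) and n=2 (bigrams).
    for n in range(1, ngram_size + 1):
        for start in range(len(keys) - n + 1):
            counts[tuple(keys[start:start + n])] += 1
    return counts


features = extract_ngrams(["The", "cat", "saw", "the", "cat"], ngram_size=2)
```

With ngram_size=2 the result holds counts for both single tokens and
adjacent pairs. Passing attr=lambda t: t would keep the original casing,
which is roughly what selecting ORTH instead of LOWER amounts to.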

* Fix size limits in train_textcat example

* Explain architectures better in docs
2019-03-23 16:44:44 +01:00
__init__.py 💫 Add better and serializable sentencizer (#3471) 2019-03-23 15:45:02 +01:00
entityruler.py 💫 Add better and serializable sentencizer (#3471) 2019-03-23 15:45:02 +01:00
functions.py Tidy up and improve docs and docstrings (#3370) 2019-03-08 11:42:26 +01:00
hooks.py 💫 Add better and serializable sentencizer (#3471) 2019-03-23 15:45:02 +01:00
pipes.pyx Bug fixes and options for TextCategorizer (#3472) 2019-03-23 16:44:44 +01:00