Commit Graph

1580 Commits

Author SHA1 Message Date
Ines Montani
f31876154d Adjust formatting [ci skip] 2019-10-25 11:19:46 +02:00
Kabir Khan
93640373c7 Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513)
* Update entityruler.py

* Making ent_id resolution 2x faster and adding docs

* Fixing newlines in docstrings

* Fixing newlines in docstrings
2019-10-25 11:16:42 +02:00
adrianeboyd
1b0bbe4b76 Update tag maps and docs for English and German (#4501)
* Update English tag_map

Update English tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/en-penn-uposf.html

* Update German tag_map

Update German tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/de-stts-uposf.html

* Add missing Tiger dependencies to glossary

* Add quotes to definition of TO

* Update POS/TAG tables in docs

Update POS/TAG tables for English and German docs using current
information generated from the tag_maps and GLOSSARY.

* Update warning that -PRON- is specific to English

* Revert docs to default JSON output with convert

* Revert "Revert docs to default JSON output with convert"

This reverts commit 6b78c048f1.
2019-10-24 12:56:05 +02:00
adrianeboyd
8516e9d53b Support train dict format as JSONL (#4471)
* Support train dict format as JSONL

* Add (overly simple) check for dict vs. tuple to read JSONL lines as
either train dicts or train tuples

* Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()`
and `GoldCorpus.train_tuples`

* Revert docs to default JSON output with convert
2019-10-23 16:01:44 +02:00
adrianeboyd
7fc39f124c Fix logic in rules+model entity example [ci skip] (#4510) 2019-10-23 14:41:21 +02:00
Ines Montani
388ea03065 Update universe.json [ci skip] 2019-10-22 14:54:47 +02:00
Kabir Khan
8a7a30ea1d Add cookiecutter-spacy-fastapi to spacy universe (#4498) 2019-10-22 14:50:40 +02:00
Ines Montani
4659435573 Fix argument type in PhraseMatcher.add docs (closes #4496) [ci skip] 2019-10-22 14:37:30 +02:00
Julin S
3ee15fce0d Update information about Rasa (#4492)
Rasa has been updated and rasa core and rasa nlu have been merged.
2019-10-22 14:32:31 +02:00
Ines Montani
b2f88e2060 Fix formatting [ci skip] 2019-10-21 12:26:07 +02:00
adrianeboyd
3195a8f170 Add Entity Linking to menu (#4489) 2019-10-21 12:17:30 +02:00
Pepe Berba
7772d5d3c5 Update vocab.get_vector docs to include features on Fasttext ngram (#4464)
* Update `vocab.get_vector`

* Added contrib agreement
2019-10-20 01:28:18 +02:00
Ghola
258eb9e064 Misspelling on Lemmatizer Example #4406 (#4449)
Removing extra o in the lookups = Loookups()
2019-10-16 23:23:15 +02:00
Anastassia
4a77d03ff7 Fix documentation for the docs_to_json function (#4456) 2019-10-16 23:17:58 +02:00
Ines Montani
5cbe21700b Only show label scheme if not empty [ci skip] 2019-10-08 15:52:59 +02:00
Ines Montani
8f76d6c9ef Update transformer model details [ci skip] 2019-10-08 15:39:38 +02:00
Ines Montani
573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
Ines Montani
e65dffd80b Clarify serialization of extension attributes (closes #4377) [ci skip] 2019-10-05 11:58:00 +02:00
Ines Montani
e7ddc6f662 Add conda install for lookups [ci skip] 2019-10-03 17:52:53 +02:00
Sofie Van Landeghem
4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ines Montani
ce1d441de5 Add docs for Vectors.most_similar [ci skip] 2019-10-03 14:29:47 +02:00
Ines Montani
80cf385f65 Update v2-2.md [ci skip] 2019-10-02 16:58:21 +02:00
Ines Montani
12a941d841 Update binder version [ci skip] 2019-10-02 16:47:01 +02:00
Ines Montani
b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Ines Montani
475e3188ce Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip] 2019-10-01 21:59:50 +02:00
Ines Montani
0dd127bb00 Update v2-2.md [ci skip] 2019-10-01 21:37:06 +02:00
Ines Montani
cf65a80f36 Refactor lemmatizer and data table integration (#4353)
* Move test

* Allow default in Lookups.get_table

* Start with blank tables in Lookups.from_bytes

* Refactor lemmatizer to hold instance of Lookups

* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need

* Update tests and docs

* Fix more tests

* Fix lemmatizer

* Upgrade pytest to try and fix weird CI errors

* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
Ines Montani
bc7e7db208 Fix wording [ci skip] 2019-10-01 14:20:44 +02:00
Ines Montani
2a3a4565cd Update infobox [ci skip] 2019-10-01 14:19:34 +02:00
Ines Montani
66aa0d479f Update v2.2 page [ci skip] 2019-10-01 14:11:05 +02:00
Ines Montani
a8a1800f2a Update lemma data documentation [ci skip] 2019-10-01 13:22:13 +02:00
Ines Montani
932ad9cb91 Fix typos and formatting [ci skip] 2019-10-01 12:30:04 +02:00
Ines Montani
ca0b20ae8b Make prereleases less verbose [ci skip] 2019-10-01 12:29:14 +02:00
Ines Montani
61263e2fbc Update universe.json [ci skip] 2019-09-30 13:49:44 +02:00
Ines Montani
71bd040834 Update models.js [ci skip] 2019-09-30 12:01:09 +02:00
Ines Montani
3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
Ines Montani
3bd4da068e Fix link [ci skip] 2019-09-29 17:30:38 +02:00
Ines Montani
089f44cc56 Update serialization docs [ci skip] 2019-09-29 17:11:13 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Ines Montani
10742d3219 Update v2 docs [ci skip] 2019-09-28 15:57:22 +02:00
Ines Montani
a2815f6643 Fix model table display [ci skip] 2019-09-28 14:23:03 +02:00
Ines Montani
129670283e Pass meta labels through correctly [ci skip] 2019-09-28 14:08:33 +02:00
Ines Montani
f8d1e2f214 Update CLI docs [ci skip] 2019-09-28 13:12:30 +02:00
Ines Montani
59beab8405 Update v2-2.md [ci skip] 2019-09-27 18:10:43 +02:00
Ines Montani
685e4b2554 Update v2-2.md [ci skip] 2019-09-27 16:35:01 +02:00
Ines Montani
aad66d9bb9 Document PhraseMatcher.remove [ci skip] 2019-09-27 16:34:53 +02:00
Ines Montani
3624153591 Update languages.json [ci skip] 2019-09-27 15:15:41 +02:00
Ajinkya Kale
975aebd7e4 typo fix for wordnet_annotator (#4326) 2019-09-27 11:52:53 +02:00
Ines Montani
eb0649e38e Fix tag [ci skip] 2019-09-26 16:22:33 +02:00
Ines Montani
da9a869d3f Update vectors name docs [ci skip] 2019-09-26 16:21:32 +02:00
Em Zhan
aafa091541 Fix typo in documentation (#4322)
* Fix typo 'probj' instead of 'pobj'

* Add spaCy contributor agreement for zqianem
2019-09-25 19:42:18 +02:00
Matthew Honnibal
92ed4dc5e0
Allow vectors name to be set in init-model (#4321)
* Allow vectors name to be specified in init-model

* Document --vectors-name argument to init-model

* Update website/docs/api/cli.md

Co-Authored-By: Ines Montani <ines@ines.io>
2019-09-25 13:11:00 +02:00
Eric Semeniuc
09816f8323 update sense2vec version (#4320) 2019-09-25 12:17:54 +02:00
Sofie Van Landeghem
42340740e3 update neuralcoref example (#4317) 2019-09-24 10:47:17 +02:00
Ines Montani
197406de1d Update v2-2.md [ci skip] 2019-09-19 14:33:58 +02:00
Ines Montani
ddc09b08ed Update v2-2.md [ci skip] 2019-09-19 00:58:30 +02:00
Matthew Honnibal
e2047576c4 Fix merge conflict 2019-09-18 21:42:11 +02:00
Matthew Honnibal
46c02d25b1 Merge changes to test_ner 2019-09-18 21:41:24 +02:00
Ines Montani
d84763727c Remove unused setting [ci skip] 2019-09-18 21:24:14 +02:00
Ines Montani
9c940eab94 Update version in examples [ci skip] 2019-09-18 21:23:26 +02:00
Ines Montani
f873548f6c Add backwards incompatibility [ci skip] 2019-09-18 21:21:48 +02:00
Ines Montani
6ebdc5f7d2 Update download docs [ci skip] 2019-09-18 21:21:39 +02:00
Ines Montani
dd1810f05a Update DocBin and add docs 2019-09-18 20:23:21 +02:00
Ines Montani
d62690b3ba Update examples 2019-09-18 19:57:36 +02:00
Ines Montani
bd435faddd Add note about usage docs [ci skip] 2019-09-18 19:56:43 +02:00
Matthew Honnibal
931e96b6c7 DocPallet->DocBin in docs 2019-09-18 15:17:26 +02:00
Matthew Honnibal
f537cbeacc Update v2-2 docs 2019-09-18 14:07:55 +02:00
Ines Montani
c922f8e8b0 Fix sources rendering [ci skip] 2019-09-18 12:09:21 +02:00
Ines Montani
ea2a686cf7 Support new model sources format [ci skip] 2019-09-18 11:42:45 +02:00
Ines Montani
ee15fdfe88 Fix wording [ci skip] 2019-09-17 14:59:42 +02:00
Ines Montani
f566e69f38 Fix --vectors-loc docs (closes #4270) 2019-09-17 14:59:12 +02:00
Ines Montani
25c2b4b9a5 Improve init-model docs (see #4137) 2019-09-17 14:51:44 +02:00
Ines Montani
198b7e9789 Auto-format [ci skip] 2019-09-17 14:48:35 +02:00
adrianeboyd
b5d999e510 Add textcat to train CLI (#4226)
* Add doc.cats to spacy.gold at the paragraph level

Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.

* `spacy.gold.docs_to_json()` writes `docs.cats`

* `GoldCorpus` reads in cats in each `GoldParse`

* Update instances of gold_tuples to handle cats

Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.

* Add textcat to train CLI

* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
  * For binary exclusive classes with provided label: F1 for label
  * For 2+ exclusive classes: F1 macro average
  * For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI

* Fix handling of unset arguments and config params

Fix handling of unset arguments and model confiug parameters in Scorer
initialization.

* Temporarily add sklearn requirement

* Remove sklearn version number

* Improve Scorer handling of models without textcats

* Fixing Scorer handling of models without textcats

* Update Scorer output for python 2.7

* Modify inf in Scorer for python 2.7

* Auto-format

Also make small adjustments to make auto-formatting with black easier and produce nicer results

* Move error message to Errors

* Update documentation

* Add cats to annotation JSON format [ci skip]

* Fix tpl flag and docs [ci skip]

* Switch to internal roc_auc_score

Switch to internal `roc_auc_score()` adapted from scikit-learn.

* Add AUCROCScore tests and improve errors/warnings

* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors

* Make reduced roc_auc_score functions private

Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.

* Check that data corresponds with multilabel flag

Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.

* Add textcat score to early stopping check

* Add more checks to debug-data for textcat

* Add example training data for textcat

* Add more checks to textcat train CLI

* Check configuration when extending base model
* Fix typos

* Update textcat example data

* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-15 22:31:31 +02:00
Ines Montani
bab9976d9a
💫 Adjust Table API and add docs (#4289)
* Adjust Table API and add docs

* Add attributes and update description [ci skip]

* Use strings.get_string_id instead of hash_string

* Fix table method calls

* Make orth arg in Lemmatizer.lookup optional

Fall back to string, which is now handled by Table.__contains__ out-of-the-box

* Fix method name

* Auto-format
2019-09-15 22:08:13 +02:00
Ines Montani
23e28e2844 Merge branch 'master' into develop 2019-09-15 17:57:09 +02:00
Ines Montani
57f4c088be Use full model name in quickstart install [ci skip] 2019-09-15 17:56:54 +02:00
Ines Montani
c7e4ea7154 Update examples and languages.json [ci skip] 2019-09-15 17:56:40 +02:00
Ines Montani
16c2522791 Merge branch 'master' into develop 2019-09-14 16:42:01 +02:00
Ines Montani
86befc80bf WIP: Add v2.2 page [ci skip] 2019-09-14 16:41:48 +02:00
Ines Montani
04d36d2471 Remove unused link [ci skip] 2019-09-14 16:41:19 +02:00
Ines Montani
76d26a3d5e Update site.json [ci skip] 2019-09-14 16:32:24 +02:00
Ines Montani
fe87ccc8d1 Update languages.json [ci skip] 2019-09-14 16:23:50 +02:00
Ines Montani
5c8b5e68ec Fix docs consistency [ci skip] 2019-09-14 16:23:37 +02:00
Ines Montani
bbf7337eaf Update adding languages docs [ci skip] 2019-09-14 15:32:15 +02:00
Ines Montani
3126dd0904 Tidy up and auto-format [ci skip] 2019-09-14 12:58:06 +02:00
Ines Montani
3c3658ef9f Merge branch 'master' into develop 2019-09-12 18:03:01 +02:00
Ines Montani
03809b82b7 Support label schemes in model directory 2019-09-12 18:01:46 +02:00
Sofie Van Landeghem
9be4d1c105 Allow copying of user_data in as_doc (#4282)
* Allow copying the user_data with as_doc + unit test

* add option to docs

* add typing

* import fix

* workaround to avoid bool clashing ...

* bint instead of bool
2019-09-12 17:08:14 +02:00
Ines Montani
ff51fba96a Update lemmaitzer docs [ci skip] 2019-09-12 16:26:33 +02:00
Ines Montani
25b2b3ff45 Remove LEMMA from exception examples [ci skip] 2019-09-12 16:26:27 +02:00
Ines Montani
82c16b7943 Remove u-strings and fix formatting [ci skip] 2019-09-12 16:11:15 +02:00
Ines Montani
38037d6816 Update landing [ci skip] 2019-09-12 15:33:39 +02:00
Ines Montani
a31e9e1cd5 Update training docs [ci skip] 2019-09-12 15:32:39 +02:00
Ines Montani
b544dcb3c5 Document debug-data [ci skip] 2019-09-12 15:26:20 +02:00
Ines Montani
72274e83f2 Ensure accordion label is left-aligned [ci skip] 2019-09-12 15:24:17 +02:00
Ines Montani
c0a4cab178 Update "Adding languages" docs [ci skip] 2019-09-12 14:53:06 +02:00
Ines Montani
10257f3131 Document Lookups [ci skip] 2019-09-12 14:00:14 +02:00
Ines Montani
aa4ff0baa1 Auto-format [ci skip] 2019-09-12 13:05:53 +02:00
Ines Montani
625ce2db8e Update Language docs [ci skip] 2019-09-12 13:03:38 +02:00