Commit Graph

1722 Commits

Author SHA1 Message Date
Bram Vanroy
718704022a Changes to spacy_conll in universe (#4914)
* Update information on spacy_conll

* Typo fix
2020-01-16 01:56:39 +01:00
Preston Badeer
b216ff43c9 Update vectors-similarity.md (#4889)
These links are broken on the website, due to quotes around the URLs.
2020-01-08 16:49:40 +01:00
Geoffrey Gordon Ashbrook
53929138d7 remove extra word typo (#4875)
"let you find you"
2020-01-06 12:37:42 +01:00
Ines Montani
400257a802 Update index.md [ci skip] 2020-01-04 01:52:18 +01:00
Ivan Echevarria
ef13e0c038 Add n_process to Language.pipe documentation (#4842) [ci skip]
* Add n_process to documentation

* Auto-format and add default [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Ines Montani
1b838d1313 Divide models into core and starters [ci skip] 2019-12-21 14:10:22 +01:00
Sofie Van Landeghem
8ebbb85117 Documentation for PhraseMatcher constructor (#4826)
* add max_length as argument for init PhraseMatcher

* improve error message too
2019-12-20 23:00:04 +01:00
Ines Montani
c466e02466 Update universe [ci skip] 2019-12-13 15:57:39 +01:00
Thiago Lages de Alencar
a067ded495 Update doc.md (#4796) 2019-12-11 18:21:40 +01:00
Tclack88
ab8dc2732c Update token.md (#4767)
* Update token.md

documentation is confusing: A '?' is a right punct, but '¿' is a left punct

* Update token.md

add quotations around parentheses in `is_left_punct` and `is_right_punct` for clarrification, ensuring the question mark that follows is not percieved as an example of left and right punctuation

* Move quotes into code block [ci skip]
2019-12-06 19:22:02 +01:00
Ines Montani
bf611ebca7 Document jsonl option on converter [ci skip] 2019-12-06 19:17:45 +01:00
Nicolai Bjerre Pedersen
de5453cdcb Fix link to user hooks in docs (#4778)
* Fix link to user hooks in docs

* Update mr_bjerre.md

Mistake in contributor agreement

* Apparently hard to get it right (wrong name of sca)
2019-12-06 19:17:12 +01:00
Ines Montani
cbacb0f1a4 Update shape docs and examples (resolves #4615) [ci skip] 2019-11-23 17:16:55 +01:00
Paul O'Leary McCann
f0e3e606a6 Replace python-mecab3 with fugashi for Japanese (#4621)
* Switch from mecab-python3 to fugashi

mecab-python3 has been the best MeCab binding for a long time but it's
not very actively maintained, and since it's based on old SWIG code
distributed with MeCab there's a limit to how effectively it can be
maintained.

Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not
based on the old SWIG code it's easier to keep it current and make small
deviations from the MeCab C/C++ API where that makes sense.

* Change mecab-python3 to fugashi in setup.cfg

* Change "mecab tags" to "unidic tags"

The tags come from MeCab, but the tag schema is specified by Unidic, so
it's more proper to refer to it that way.

* Update conftest

* Add fugashi link to external deps list for Japanese
2019-11-23 14:31:04 +01:00
Ines Montani
a6200bc424 Update scorer.md [ci skip] 2019-11-21 17:02:43 +01:00
richardpaulhudson
8d06386e1e Update to Holmes Universe entry (#4679)
* Updated Universe entry for Holmes

* Correction

* Updated model name

* Updated wording
2019-11-21 16:23:24 +01:00
Ines Montani
235fe6fe3b Auto-format [ci skip] 2019-11-20 13:14:58 +01:00
adrianeboyd
2c876eb672 Add tokenizer explain() debugging method (#4596)
* Expose tokenizer rules as a property

Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)

Add tests and update Tokenizer API docs.

* Update Hungarian punctuation to remove empty string

Update Hungarian punctuation definitions so that `_units` does not match
an empty string.

* Use _load_special_tokenization consistently

Use `_load_special_tokenization()` and have it to handle `None` checks.

* Fix precedence of `token_match` vs. special cases

Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.

* Add `make_debug_doc()` to the Tokenizer

Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.

Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.

* Update tokenization usage docs

Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.

* Revert "Update Hungarian punctuation to remove empty string"

This reverts commit f0a577f7a5.

* Rework `make_debug_doc()` as `explain()`

Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.

* Handle cases with bad tokenizer patterns

Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.

* Remove unused displacy image

* Add tokenizer.explain() to usage docs
2019-11-20 13:07:25 +01:00
Ines Montani
e8b9cee6fd Make example consistent with model (closes #4587) [ci skip] 2019-11-18 12:41:48 +01:00
Ines Montani
e01a1a237f Auto-format [ci skip] 2019-11-18 12:41:31 +01:00
adrianeboyd
62e00fd9da Update tokenization usage docs (#4666)
Update pseudo-code and algorithm description to correspond to current
tokenizer behavior.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.
2019-11-18 12:35:13 +01:00
Ines Montani
5adcb352e9 Adjust order of docs sections [ci skip] 2019-11-17 16:08:56 +01:00
Ines Montani
e30d08410a
Add CI for Python 3.8 (#4479)
* Add 3.8 classifier

* Update azure-pipelines.yml

* Remove 3.8 warning from docs [ci skip]
2019-11-15 01:13:48 +01:00
f11r
877971860e Fix assert in sentencizer documentation. (#4639) 2019-11-13 15:24:14 +01:00
Ines Montani
9d5ff177c4 Work around Markdown rendering issue surfaced in #4600 [ci skip] 2019-11-11 17:12:08 +01:00
adrianeboyd
0f8678c0b1 Fix DocBin.merge() example (#4599) 2019-11-07 11:26:48 +01:00
walterhenry
5563c42ef5 Fixed typo: Added space between "recognize" and "various" (#4600) 2019-11-06 23:06:36 +01:00
Ines Montani
828ef27a32 Add warnings about 3.8 (resolves #4593) [ci skip] 2019-11-05 18:30:11 +01:00
Ines Montani
4b95587ad4 Update universe.json [ci skip] 2019-11-04 13:55:55 +01:00
Yash Patadia
0c396aeed4 add dframcy to universe.json (#4580) 2019-11-04 13:53:23 +01:00
Ines Montani
59358d9b71
Remove box-decoration-break from entities in displacy (#4564) 2019-10-31 15:09:43 +01:00
Ines Montani
4e1de85e43 Update syntax iterators [ci skip] 2019-10-30 14:31:40 +01:00
Ines Montani
726c5dd306 Update universe.json [ci skip] 2019-10-30 13:29:00 +01:00
Neel Kamath
6c036ab57d Add "spaCy Server" to spaCy Universe (#4553)
* Add "spaCy Server" to spaCy Universe

* Accept the spaCy Contributor Agreement
2019-10-30 13:20:46 +01:00
Nipun Sadvilkar
2a5e71232b project: pySBD - Python Sentence Boundary Disambiguation (#4455)
*   project: pySBD - Python Sentence Boundary Disambiguation

* 📝  Update links and description

* 🐛  Fix missing comma

* Update universe.json

pysbd as a spacy component through entrypoints

* 🚨  Fix universe.json

* 📝  Update code_example
2019-10-30 12:13:29 +01:00
Matthew Honnibal
d5509e0989 Support Mish activation (requires Thinc 7.3) (#4536)
* Add arch for MishWindowEncoder

* Support mish in tok2vec and conv window >=2

* Pass new tok2vec settings from parser

* Syntax error

* Fix tok2vec setting

* Fix registration of MishWindowEncoder

* Fix receptive field setting

* Fix mish arch

* Pass more options from parser

* Support more tok2vec options in pretrain

* Require thinc 7.3

* Add docs [ci skip]

* Require thinc 7.3.0.dev0 to run CI

* Run black

* Fix typo

* Update Thinc version


Co-authored-by: Ines Montani <ines@ines.io>
2019-10-28 15:16:33 +01:00
Ines Montani
1180304449 Update languages.json [ci skip] 2019-10-26 13:51:42 +02:00
Ines Montani
cfffdba7b1 Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522)
* Implement new API for {Phrase}Matcher.add (backwards-compatible)

* Update docs

* Also update DependencyMatcher.add

* Update internals

* Rewrite tests to use new API

* Add basic check for common mistake

Raise error with suggestion if user likely passed in a pattern instead of a list of patterns

* Fix typo [ci skip]
2019-10-25 22:21:08 +02:00
Ines Montani
d2da117114 Also support passing list to Language.disable_pipes (#4521)
* Also support passing list to Language.disable_pipes

* Adjust internals
2019-10-25 16:19:08 +02:00
Ines Montani
493be8e9db Update new version identifier [ci skip] 2019-10-25 11:42:49 +02:00
Ines Montani
2abf1028cb Update docs [ci skip] 2019-10-25 11:27:00 +02:00
Ines Montani
f31876154d Adjust formatting [ci skip] 2019-10-25 11:19:46 +02:00
Kabir Khan
93640373c7 Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513)
* Update entityruler.py

* Making ent_id resolution 2x faster and adding docs

* Fixing newlines in docstrings

* Fixing newlines in docstrings
2019-10-25 11:16:42 +02:00
adrianeboyd
1b0bbe4b76 Update tag maps and docs for English and German (#4501)
* Update English tag_map

Update English tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/en-penn-uposf.html

* Update German tag_map

Update German tag_map based on this conversion table:
https://universaldependencies.org/tagset-conversion/de-stts-uposf.html

* Add missing Tiger dependencies to glossary

* Add quotes to definition of TO

* Update POS/TAG tables in docs

Update POS/TAG tables for English and German docs using current
information generated from the tag_maps and GLOSSARY.

* Update warning that -PRON- is specific to English

* Revert docs to default JSON output with convert

* Revert "Revert docs to default JSON output with convert"

This reverts commit 6b78c048f1.
2019-10-24 12:56:05 +02:00
adrianeboyd
8516e9d53b Support train dict format as JSONL (#4471)
* Support train dict format as JSONL

* Add (overly simple) check for dict vs. tuple to read JSONL lines as
either train dicts or train tuples

* Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()`
and `GoldCorpus.train_tuples`

* Revert docs to default JSON output with convert
2019-10-23 16:01:44 +02:00
adrianeboyd
7fc39f124c Fix logic in rules+model entity example [ci skip] (#4510) 2019-10-23 14:41:21 +02:00
Ines Montani
388ea03065 Update universe.json [ci skip] 2019-10-22 14:54:47 +02:00
Kabir Khan
8a7a30ea1d Add cookiecutter-spacy-fastapi to spacy universe (#4498) 2019-10-22 14:50:40 +02:00
Ines Montani
4659435573 Fix argument type in PhraseMatcher.add docs (closes #4496) [ci skip] 2019-10-22 14:37:30 +02:00
Julin S
3ee15fce0d Update information about Rasa (#4492)
Rasa has been updated and rasa core and rasa nlu have been merged.
2019-10-22 14:32:31 +02:00
Ines Montani
b2f88e2060 Fix formatting [ci skip] 2019-10-21 12:26:07 +02:00
adrianeboyd
3195a8f170 Add Entity Linking to menu (#4489) 2019-10-21 12:17:30 +02:00
Pepe Berba
7772d5d3c5 Update vocab.get_vector docs to include features on Fasttext ngram (#4464)
* Update `vocab.get_vector`

* Added contrib agreement
2019-10-20 01:28:18 +02:00
Ghola
258eb9e064 Misspelling on Lemmatizer Example #4406 (#4449)
Removing extra o in the lookups = Loookups()
2019-10-16 23:23:15 +02:00
Anastassia
4a77d03ff7 Fix documentation for the docs_to_json function (#4456) 2019-10-16 23:17:58 +02:00
Ines Montani
5cbe21700b Only show label scheme if not empty [ci skip] 2019-10-08 15:52:59 +02:00
Ines Montani
8f76d6c9ef Update transformer model details [ci skip] 2019-10-08 15:39:38 +02:00
Ines Montani
573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
Ines Montani
e65dffd80b Clarify serialization of extension attributes (closes #4377) [ci skip] 2019-10-05 11:58:00 +02:00
Ines Montani
e7ddc6f662 Add conda install for lookups [ci skip] 2019-10-03 17:52:53 +02:00
Sofie Van Landeghem
4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ines Montani
ce1d441de5 Add docs for Vectors.most_similar [ci skip] 2019-10-03 14:29:47 +02:00
Ines Montani
80cf385f65 Update v2-2.md [ci skip] 2019-10-02 16:58:21 +02:00
Ines Montani
12a941d841 Update binder version [ci skip] 2019-10-02 16:47:01 +02:00
Ines Montani
b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Ines Montani
475e3188ce Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip] 2019-10-01 21:59:50 +02:00
Ines Montani
0dd127bb00 Update v2-2.md [ci skip] 2019-10-01 21:37:06 +02:00
Ines Montani
cf65a80f36 Refactor lemmatizer and data table integration (#4353)
* Move test

* Allow default in Lookups.get_table

* Start with blank tables in Lookups.from_bytes

* Refactor lemmatizer to hold instance of Lookups

* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need

* Update tests and docs

* Fix more tests

* Fix lemmatizer

* Upgrade pytest to try and fix weird CI errors

* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
Ines Montani
bc7e7db208 Fix wording [ci skip] 2019-10-01 14:20:44 +02:00
Ines Montani
2a3a4565cd Update infobox [ci skip] 2019-10-01 14:19:34 +02:00
Ines Montani
66aa0d479f Update v2.2 page [ci skip] 2019-10-01 14:11:05 +02:00
Ines Montani
a8a1800f2a Update lemma data documentation [ci skip] 2019-10-01 13:22:13 +02:00
Ines Montani
932ad9cb91 Fix typos and formatting [ci skip] 2019-10-01 12:30:04 +02:00
Ines Montani
ca0b20ae8b Make prereleases less verbose [ci skip] 2019-10-01 12:29:14 +02:00
Ines Montani
61263e2fbc Update universe.json [ci skip] 2019-09-30 13:49:44 +02:00
Ines Montani
71bd040834 Update models.js [ci skip] 2019-09-30 12:01:09 +02:00
Ines Montani
3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
Ines Montani
3bd4da068e Fix link [ci skip] 2019-09-29 17:30:38 +02:00
Ines Montani
089f44cc56 Update serialization docs [ci skip] 2019-09-29 17:11:13 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Ines Montani
10742d3219 Update v2 docs [ci skip] 2019-09-28 15:57:22 +02:00
Ines Montani
a2815f6643 Fix model table display [ci skip] 2019-09-28 14:23:03 +02:00
Ines Montani
129670283e Pass meta labels through correctly [ci skip] 2019-09-28 14:08:33 +02:00
Ines Montani
f8d1e2f214 Update CLI docs [ci skip] 2019-09-28 13:12:30 +02:00
Ines Montani
59beab8405 Update v2-2.md [ci skip] 2019-09-27 18:10:43 +02:00
Ines Montani
685e4b2554 Update v2-2.md [ci skip] 2019-09-27 16:35:01 +02:00
Ines Montani
aad66d9bb9 Document PhraseMatcher.remove [ci skip] 2019-09-27 16:34:53 +02:00
Ines Montani
3624153591 Update languages.json [ci skip] 2019-09-27 15:15:41 +02:00
Ajinkya Kale
975aebd7e4 typo fix for wordnet_annotator (#4326) 2019-09-27 11:52:53 +02:00
Ines Montani
eb0649e38e Fix tag [ci skip] 2019-09-26 16:22:33 +02:00
Ines Montani
da9a869d3f Update vectors name docs [ci skip] 2019-09-26 16:21:32 +02:00
Em Zhan
aafa091541 Fix typo in documentation (#4322)
* Fix typo 'probj' instead of 'pobj'

* Add spaCy contributor agreement for zqianem
2019-09-25 19:42:18 +02:00
Matthew Honnibal
92ed4dc5e0
Allow vectors name to be set in init-model (#4321)
* Allow vectors name to be specified in init-model

* Document --vectors-name argument to init-model

* Update website/docs/api/cli.md

Co-Authored-By: Ines Montani <ines@ines.io>
2019-09-25 13:11:00 +02:00
Eric Semeniuc
09816f8323 update sense2vec version (#4320) 2019-09-25 12:17:54 +02:00
Sofie Van Landeghem
42340740e3 update neuralcoref example (#4317) 2019-09-24 10:47:17 +02:00
Ines Montani
197406de1d Update v2-2.md [ci skip] 2019-09-19 14:33:58 +02:00
Ines Montani
ddc09b08ed Update v2-2.md [ci skip] 2019-09-19 00:58:30 +02:00
Matthew Honnibal
e2047576c4 Fix merge conflict 2019-09-18 21:42:11 +02:00
Matthew Honnibal
46c02d25b1 Merge changes to test_ner 2019-09-18 21:41:24 +02:00
Ines Montani
d84763727c Remove unused setting [ci skip] 2019-09-18 21:24:14 +02:00
Ines Montani
9c940eab94 Update version in examples [ci skip] 2019-09-18 21:23:26 +02:00
Ines Montani
f873548f6c Add backwards incompatibility [ci skip] 2019-09-18 21:21:48 +02:00
Ines Montani
6ebdc5f7d2 Update download docs [ci skip] 2019-09-18 21:21:39 +02:00
Ines Montani
dd1810f05a Update DocBin and add docs 2019-09-18 20:23:21 +02:00
Ines Montani
d62690b3ba Update examples 2019-09-18 19:57:36 +02:00
Ines Montani
bd435faddd Add note about usage docs [ci skip] 2019-09-18 19:56:43 +02:00
Matthew Honnibal
931e96b6c7 DocPallet->DocBin in docs 2019-09-18 15:17:26 +02:00
Matthew Honnibal
f537cbeacc Update v2-2 docs 2019-09-18 14:07:55 +02:00
Ines Montani
c922f8e8b0 Fix sources rendering [ci skip] 2019-09-18 12:09:21 +02:00
Ines Montani
ea2a686cf7 Support new model sources format [ci skip] 2019-09-18 11:42:45 +02:00
Ines Montani
ee15fdfe88 Fix wording [ci skip] 2019-09-17 14:59:42 +02:00
Ines Montani
f566e69f38 Fix --vectors-loc docs (closes #4270) 2019-09-17 14:59:12 +02:00
Ines Montani
25c2b4b9a5 Improve init-model docs (see #4137) 2019-09-17 14:51:44 +02:00
Ines Montani
198b7e9789 Auto-format [ci skip] 2019-09-17 14:48:35 +02:00
adrianeboyd
b5d999e510 Add textcat to train CLI (#4226)
* Add doc.cats to spacy.gold at the paragraph level

Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.

* `spacy.gold.docs_to_json()` writes `docs.cats`

* `GoldCorpus` reads in cats in each `GoldParse`

* Update instances of gold_tuples to handle cats

Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.

* Add textcat to train CLI

* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
  * For binary exclusive classes with provided label: F1 for label
  * For 2+ exclusive classes: F1 macro average
  * For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI

* Fix handling of unset arguments and config params

Fix handling of unset arguments and model confiug parameters in Scorer
initialization.

* Temporarily add sklearn requirement

* Remove sklearn version number

* Improve Scorer handling of models without textcats

* Fixing Scorer handling of models without textcats

* Update Scorer output for python 2.7

* Modify inf in Scorer for python 2.7

* Auto-format

Also make small adjustments to make auto-formatting with black easier and produce nicer results

* Move error message to Errors

* Update documentation

* Add cats to annotation JSON format [ci skip]

* Fix tpl flag and docs [ci skip]

* Switch to internal roc_auc_score

Switch to internal `roc_auc_score()` adapted from scikit-learn.

* Add AUCROCScore tests and improve errors/warnings

* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors

* Make reduced roc_auc_score functions private

Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.

* Check that data corresponds with multilabel flag

Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.

* Add textcat score to early stopping check

* Add more checks to debug-data for textcat

* Add example training data for textcat

* Add more checks to textcat train CLI

* Check configuration when extending base model
* Fix typos

* Update textcat example data

* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-15 22:31:31 +02:00
Ines Montani
bab9976d9a
💫 Adjust Table API and add docs (#4289)
* Adjust Table API and add docs

* Add attributes and update description [ci skip]

* Use strings.get_string_id instead of hash_string

* Fix table method calls

* Make orth arg in Lemmatizer.lookup optional

Fall back to string, which is now handled by Table.__contains__ out-of-the-box

* Fix method name

* Auto-format
2019-09-15 22:08:13 +02:00
Ines Montani
23e28e2844 Merge branch 'master' into develop 2019-09-15 17:57:09 +02:00
Ines Montani
57f4c088be Use full model name in quickstart install [ci skip] 2019-09-15 17:56:54 +02:00
Ines Montani
c7e4ea7154 Update examples and languages.json [ci skip] 2019-09-15 17:56:40 +02:00
Ines Montani
16c2522791 Merge branch 'master' into develop 2019-09-14 16:42:01 +02:00
Ines Montani
86befc80bf WIP: Add v2.2 page [ci skip] 2019-09-14 16:41:48 +02:00
Ines Montani
04d36d2471 Remove unused link [ci skip] 2019-09-14 16:41:19 +02:00
Ines Montani
76d26a3d5e Update site.json [ci skip] 2019-09-14 16:32:24 +02:00
Ines Montani
fe87ccc8d1 Update languages.json [ci skip] 2019-09-14 16:23:50 +02:00
Ines Montani
5c8b5e68ec Fix docs consistency [ci skip] 2019-09-14 16:23:37 +02:00
Ines Montani
bbf7337eaf Update adding languages docs [ci skip] 2019-09-14 15:32:15 +02:00
Ines Montani
3126dd0904 Tidy up and auto-format [ci skip] 2019-09-14 12:58:06 +02:00
Ines Montani
3c3658ef9f Merge branch 'master' into develop 2019-09-12 18:03:01 +02:00
Ines Montani
03809b82b7 Support label schemes in model directory 2019-09-12 18:01:46 +02:00
Sofie Van Landeghem
9be4d1c105 Allow copying of user_data in as_doc (#4282)
* Allow copying the user_data with as_doc + unit test

* add option to docs

* add typing

* import fix

* workaround to avoid bool clashing ...

* bint instead of bool
2019-09-12 17:08:14 +02:00
Ines Montani
ff51fba96a Update lemmaitzer docs [ci skip] 2019-09-12 16:26:33 +02:00
Ines Montani
25b2b3ff45 Remove LEMMA from exception examples [ci skip] 2019-09-12 16:26:27 +02:00
Ines Montani
82c16b7943 Remove u-strings and fix formatting [ci skip] 2019-09-12 16:11:15 +02:00
Ines Montani
38037d6816 Update landing [ci skip] 2019-09-12 15:33:39 +02:00
Ines Montani
a31e9e1cd5 Update training docs [ci skip] 2019-09-12 15:32:39 +02:00
Ines Montani
b544dcb3c5 Document debug-data [ci skip] 2019-09-12 15:26:20 +02:00
Ines Montani
72274e83f2 Ensure accordion label is left-aligned [ci skip] 2019-09-12 15:24:17 +02:00
Ines Montani
c0a4cab178 Update "Adding languages" docs [ci skip] 2019-09-12 14:53:06 +02:00
Ines Montani
10257f3131 Document Lookups [ci skip] 2019-09-12 14:00:14 +02:00
Ines Montani
aa4ff0baa1 Auto-format [ci skip] 2019-09-12 13:05:53 +02:00
Ines Montani
625ce2db8e Update Language docs [ci skip] 2019-09-12 13:03:38 +02:00
Ines Montani
cb41a33d14 Update displaCy API docs [ci skip] 2019-09-12 12:59:20 +02:00
Ines Montani
e7c20ad1d2 Update colors entry points docs [ci skip] 2019-09-12 12:59:10 +02:00
Ines Montani
7b59a919e6 Update entry points docs [ci skip] 2019-09-12 12:52:06 +02:00
Sofie Van Landeghem
0b4b4f1819 Documentation for Entity Linking (#4065)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bool's instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustements to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* typo fix

* add candidate API to kb documentation

* update API sidebar with EntityLinker and KnowledgeBase

* remove EL from 101 docs

* remove entity linker from 101 pipelines / rephrase

* custom el model instead of existing model

* set version to 2.2 for EL functionality

* update documentation for 2 CLI scripts
2019-09-12 11:38:34 +02:00
Sofie Van Landeghem
53a9ca45c9 Docs: bufsize instead of buffsize (#4247) 2019-09-06 11:11:54 +02:00
Sofie Van Landeghem
6b012cebff Make pos/tag distinction more clear in docs (#4246)
* make distinction between tag and pos more prominent in docs

* out of the 101
2019-09-06 10:31:21 +02:00
Ines Montani
232a029de6 Send referrer for internal links [ci skip] 2019-09-05 10:41:46 +02:00
Ines Montani
2f31f96fce Update languages.json [ci skip] 2019-09-04 18:15:42 +02:00
Ines Montani
2245e95e2d Update languages.json [ci skip] 2019-09-04 17:11:40 +02:00
adrianeboyd
82159b5c19 Updates/bugfixes for NER/IOB converters (#4186)
* Updates/bugfixes for NER/IOB converters

* Converter formats `ner` and `iob` use autodetect to choose a converter if
  possible

* `iob2json` is reverted to handle sentence-per-line data like
  `word1|pos1|ent1 word2|pos2|ent2`

  * Fix bug in `merge_sentences()` so the second sentence in each batch isn't
    skipped

* `conll_ner2json` is made more general so it can handle more formats with
  whitespace-separated columns

  * Supports all formats where the first column is the token and the final
    column is the IOB tag; if present, the second column is the POS tag

  * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O`
    separates documents

  * Add option for segmenting sentences (new flag `-s`)

  * Parser-based sentence segmentation with a provided model, otherwise with
    sentencizer (new option `-b` to specify model)

  * Can group sentences into documents with `n_sents` as long as sentence
    segmentation is available

  * Only applies automatic segmentation when there are no existing delimiters
    in the data

* Provide info about settings applied during conversion with warnings and
  suggestions if settings conflict or might not be not optimal.

* Add tests for common formats

* Add '(default)' back to docs for -c auto

* Add document count back to output

* Revert changes to converter output message

* Use explicit tabs in convert CLI test data

* Adjust/add messages for n_sents=1 default

* Add sample NER data to training examples

* Update README

* Add links in docs to example NER data

* Define msg within converters
2019-08-29 12:04:01 +02:00
Ines Montani
b91425f803 Update universe.json [ci skip] 2019-08-28 13:45:06 +02:00
Ines Montani
aedae8b4c5 Update universe.json [ci skip] 2019-08-28 11:59:06 +02:00
Björn Böing
bae0455f91 Fix visualizer options linking for displaCy. (#4202) 2019-08-27 14:04:28 +02:00
Ines Montani
8114933f01 Fix universe.json [ci skip] 2019-08-27 12:13:42 +02:00
Ines Montani
48385552c6 Update languages.json [ci skip] 2019-08-27 11:52:51 +02:00
yanaiela
5d7bc26735 new universe project - the numeric fused-head (#4192)
* new universe project

* Update website/meta/universe.json

Co-Authored-By: Ines Montani <ines@ines.io>

* Update website/meta/universe.json

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-25 17:25:28 +02:00
Christos Aridas
61f5c007a0 DOC Fix pipeline functions examples (#4189) 2019-08-23 19:15:32 +02:00
Ines Montani
b072c13017 Update universe with videos [ci skip] 2019-08-21 21:35:37 +02:00
Pavle Vidanović
4fe9329bfb Serbian language code update "rs" -> "sr" (#4159)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix
2019-08-21 19:57:37 +02:00
adrianeboyd
8fe7bdd0fa Improve token pattern checking without validation (#4105)
* Fix typo in rule-based matching docs

* Improve token pattern checking without validation

Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.

Addresses #4070 (also related: #4063, #4100).

* Check whether top-level attributes in patterns and attr for PhraseMatcher are
  in token pattern schema

* Check whether attribute value types are supported in general (as opposed to
  per attribute with full validation)

* Report various internal error types (OverflowError, AttributeError, KeyError)
  as ValueError with standard error messages

* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
  LEMMA, and DEP

* Add error messages with relevant details on how to use validate=True or nlp()
  instead of nlp.make_doc()

* Support attr=TEXT for PhraseMatcher

* Add NORM to schema

* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler

* Remove unnecessary .keys()

* Rephrase error messages

* Add another type check to Matcher

Add another type check to Matcher for more understandable error messages
in some rare cases.

* Support phrase_matcher_attr=TEXT for EntityRuler

* Don't use spacy.errors in examples and bin scripts

* Fix error code

* Auto-format

Also try get Azure pipelines to finally start a build :(

* Update errors.py


Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2019-08-21 14:00:37 +02:00
Ines Montani
3134a9b6e0 Add section on expanding regex match to token boundaries (see #4158) [ci skip] 2019-08-21 12:53:31 +02:00
Ines Montani
072860fcd0 Auto-format [ci skip] 2019-08-20 14:46:41 +02:00
Andrei-Marius Avram
199589228e Added RONEC to spaCy Universe (#4151)
* Added RONEC to spaCy Universe

* Added contributor file

* Corrected date from .github/contributors/avramandrei.md

* Convert tabs to spaces

* Remove duplicate keys

Can only have one GitHub link unfortunately

* Also add models category

* Adjust ID

This is used to generate the URL, so a simpler string is better
2019-08-20 14:46:07 +02:00
Ines Montani
fe230c8776 Fix typo [ci skip] 2019-08-20 13:02:05 +02:00
Daniel Bourke
b0a28fd0de fix PhraseMatcher link typo (#4150)
/api/phtasematcher -> /api/phrasematcher
2019-08-20 13:01:43 +02:00
Ines Montani
ce4c3e5204 Document force flag on set_extension (closes #4148) 2019-08-19 19:22:07 +02:00
Ines Montani
66aba2d676 Improve regex matching docs [ci skip] 2019-08-19 13:59:41 +02:00
Sofie Van Landeghem
cc66f47893 Make enabling/disabling jupyter mode more explicit (#4144)
* make enabling/disabling jupyter mode more explicit

* markup fix
2019-08-19 11:53:34 +02:00
Ines Montani
e520eb3f6c Make visualized NER examples more clear (closes #4104) [ci skip] 2019-08-18 16:29:29 +02:00
Jeno
91441f169c Update universe.json to include negspacy (#4132) 2019-08-16 17:48:17 +02:00
Jeno Pizarro
2e6e0321dd Update universe.json to include negspacy 2019-08-16 10:24:09 -04:00
Ines Montani
1362f793cf Improve docs on phrase pattern attributes (closes #4100) [ci skip] 2019-08-11 11:13:49 +02:00
Ines Montani
1f4d8bf77e Update universe.json [ci skip] 2019-08-09 17:42:37 +02:00
ICLR&D
87e40b17a0 Add entry for Blackstone in universe.json (#4101)
* Add entry for Blackstone in universe.json

Add an entry for the Blackstone project. Checked JSON is valid.

* Create ICLRandD.md

* Fix indentation (tabs to spaces)

It looks like during validation, the JSON file automatically changed spaces to tabs. This caused the diff to show *everything* as changed, which is obviously not true. This hopefully fixes that.

* Try to fix formatting for diff

* Fix diff


Co-authored-by: Ines Montani <ines@ines.io>
2019-08-09 17:16:51 +02:00
Ines Montani
a2ac2e873f Update Binder version [ci skip] 2019-08-08 13:03:45 +02:00
Ines Montani
3e60afacf9 Add Serbian to languages [ci skip] 2019-08-07 13:38:25 +02:00
Ines Montani
1dc28a9ecb Update Binder version [ci skip] 2019-08-07 13:38:12 +02:00
Ines Montani
8b4a0fabbb Adjust docs example [ci skip] 2019-08-07 00:46:47 +02:00
adrianeboyd
69aca7d839 Add validate option to EntityRuler (#4089)
* Add validate option to EntityRuler

* Add validate to EntityRuler, passed to Matcher and PhraseMatcher

* Add validate to usage and API docs

* Update website/docs/usage/rule-based-matching.md

Co-Authored-By: Ines Montani <ines@ines.io>

* Update website/docs/usage/rule-based-matching.md

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-07 00:40:53 +02:00
Ines Montani
4ae320e5c2 Use consistent casing for entity ruler patterns (see #4063) [ci skip] 2019-08-06 12:20:22 +02:00
Ines Montani
223bde5cf6 Improve docs on matcher attributes [ci skip] (closes #4063) 2019-08-06 12:13:42 +02:00
Ines Montani
2bfae0b167 Auto-format 2019-08-06 12:13:31 +02:00
Ines Montani
7f3212e2f5
💫 Sync branches (#4084) [ci skip]
* Update from master

* Re-added Universe readme (#3688) (closes #3680)

* Fix typo

* Add version tag to `--base-model` argument (closes #3720)

* fixing regex matcher examples (#3708) (#3719)

* Improve Token.prob and Lexeme.prob docs (resolves #3701)

* Fix DependencyParser.predict docs (resolves #3561)

* Update languages.json


Co-authored-by: Bram Vanroy <Bram.Vanroy@UGent.be>
Co-authored-by: Aaron Kub <aaronkub@gmail.com>
2019-08-05 14:32:54 +02:00
Ines Montani
0f740fad1a Update universe.json [ci skip] 2019-08-05 14:30:07 +02:00
Ines Montani
0f76e0022d Update .tensor docs [ci skip] 2019-08-01 18:37:09 +02:00
Ines Montani
3072eb28c2 Support and render Markdown in model meta [ci skip] 2019-08-01 18:33:10 +02:00
Björn Böing
a83c0add2e Add links to tokenizer API docs to refer relevant information. (#4064)
* Add links to tokenizer API docs to refer relevant information.

* Add suggested changes

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-01 14:28:38 +02:00
Ejar
2cdf7d39e7 Corrected imported fucntion (#4062)
The example showed an incorrected import
2019-08-01 12:43:36 +02:00
Mohammed Daudali
23ec07debd Correct typo for AllenAI url on homepage (#4050)
* Typo fix for AllenAI url

Changed incorrect home page url for AllenAI from appenai.org to allenai.org

* Sign contributor agreement

* Change date format
2019-07-31 00:16:33 +02:00
Ines Montani
fcd2f7f656 Fix version introducing Span.ents (closes #4045) [ci skip] 2019-07-30 10:32:33 +02:00
Ines Montani
fc69da0acb
💫 Support simple training format in nlp.evaluate and add tests (#4033)
* Support simple training format in nlp.evaluate and add tests

* Update docs [ci skip]
2019-07-27 17:30:18 +02:00
Ines Montani
bd39e5e630 Add "Processing text" section [ci skip] 2019-07-25 17:38:03 +02:00
Ines Montani
a5e3d2f318 Improve section on disabling pipes [ci skip] 2019-07-25 14:25:34 +02:00
Ines Montani
02e444ec7c Add section on special tokenizer component [ci skip] 2019-07-25 14:25:03 +02:00
Ines Montani
1fa6d6ba55 Improve consistency of docs examples [ci skip] 2019-07-25 14:24:56 +02:00
adrianeboyd
784a5f4284 Update GoldParse attributes in API docs (#4023)
* add `words`
* update name of entity list to `ner`

I think it might be a bit more consistent to have `ner` named `entities`
or `ents` (and `ents` is actually set somewhere to `None`, which is a
bit confusing), but it looks like renaming it would be a non-trivial
decision.
2019-07-25 12:14:02 +02:00
Adriane Boyd
6c5044ed2a Update annotation docs for German
- minor formatting fixes
- remove STTS tags not used in Tiger
- update list of dependency relations to match tiger2dep
2019-07-22 11:59:03 +02:00
adrianeboyd
d2c474cbb7 Fix initial example in EntityRuler API docs (#3999) 2019-07-22 11:18:55 +02:00
Ines Montani
1167c303a0 Fix typos [ci skip] 2019-07-19 13:08:18 +02:00