430 Commits

Author SHA1 Message Date
Ines Montani
235fe6fe3b Auto-format [ci skip] 2019-11-20 13:14:58 +01:00
adrianeboyd
2c876eb672 Add tokenizer explain() debugging method (#4596)
* Expose tokenizer rules as a property

Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)

Add tests and update Tokenizer API docs.

* Update Hungarian punctuation to remove empty string

Update Hungarian punctuation definitions so that `_units` does not match
an empty string.

* Use _load_special_tokenization consistently

Use `_load_special_tokenization()` and have it handle `None` checks.

* Fix precedence of `token_match` vs. special cases

Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.

* Add `make_debug_doc()` to the Tokenizer

Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.

Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.

* Update tokenization usage docs

Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.

* Revert "Update Hungarian punctuation to remove empty string"

This reverts commit f0a577f7a5.

* Rework `make_debug_doc()` as `explain()`

Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.

* Handle cases with bad tokenizer patterns

Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.

* Remove unused displacy image

* Add tokenizer.explain() to usage docs
2019-11-20 13:07:25 +01:00
Ines Montani
e8b9cee6fd Make example consistent with model (closes #4587) [ci skip] 2019-11-18 12:41:48 +01:00
Ines Montani
e01a1a237f Auto-format [ci skip] 2019-11-18 12:41:31 +01:00
adrianeboyd
62e00fd9da Update tokenization usage docs (#4666)
Update pseudo-code and algorithm description to correspond to current
tokenizer behavior.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.
2019-11-18 12:35:13 +01:00
Ines Montani
5adcb352e9 Adjust order of docs sections [ci skip] 2019-11-17 16:08:56 +01:00
Ines Montani
e30d08410a Add CI for Python 3.8 (#4479)
* Add 3.8 classifier

* Update azure-pipelines.yml

* Remove 3.8 warning from docs [ci skip]
2019-11-15 01:13:48 +01:00
Ines Montani
9d5ff177c4 Work around Markdown rendering issue surfaced in #4600 [ci skip] 2019-11-11 17:12:08 +01:00
walterhenry
5563c42ef5 Fixed typo: Added space between "recognize" and "various" (#4600) 2019-11-06 23:06:36 +01:00
Ines Montani
828ef27a32 Add warnings about 3.8 (resolves #4593) [ci skip] 2019-11-05 18:30:11 +01:00
Ines Montani
4e1de85e43 Update syntax iterators [ci skip] 2019-10-30 14:31:40 +01:00
Ines Montani
493be8e9db Update new version identifier [ci skip] 2019-10-25 11:42:49 +02:00
Ines Montani
f31876154d Adjust formatting [ci skip] 2019-10-25 11:19:46 +02:00
Kabir Khan
93640373c7 Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513)
* Update entityruler.py

* Making ent_id resolution 2x faster and adding docs

* Fixing newlines in docstrings

* Fixing newlines in docstrings
2019-10-25 11:16:42 +02:00
adrianeboyd
7fc39f124c Fix logic in rules+model entity example [ci skip] (#4510) 2019-10-23 14:41:21 +02:00
adrianeboyd
3195a8f170 Add Entity Linking to menu (#4489) 2019-10-21 12:17:30 +02:00
Ines Montani
573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
Ines Montani
e65dffd80b Clarify serialization of extension attributes (closes #4377) [ci skip] 2019-10-05 11:58:00 +02:00
Sofie Van Landeghem
4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ines Montani
80cf385f65 Update v2-2.md [ci skip] 2019-10-02 16:58:21 +02:00
Ines Montani
b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Ines Montani
475e3188ce Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip] 2019-10-01 21:59:50 +02:00
Ines Montani
0dd127bb00 Update v2-2.md [ci skip] 2019-10-01 21:37:06 +02:00
Ines Montani
bc7e7db208 Fix wording [ci skip] 2019-10-01 14:20:44 +02:00
Ines Montani
2a3a4565cd Update infobox [ci skip] 2019-10-01 14:19:34 +02:00
Ines Montani
66aa0d479f Update v2.2 page [ci skip] 2019-10-01 14:11:05 +02:00
Ines Montani
a8a1800f2a Update lemma data documentation [ci skip] 2019-10-01 13:22:13 +02:00
Ines Montani
932ad9cb91 Fix typos and formatting [ci skip] 2019-10-01 12:30:04 +02:00
Ines Montani
3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
Ines Montani
3bd4da068e Fix link [ci skip] 2019-09-29 17:30:38 +02:00
Ines Montani
089f44cc56 Update serialization docs [ci skip] 2019-09-29 17:11:13 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Ines Montani
10742d3219 Update v2 docs [ci skip] 2019-09-28 15:57:22 +02:00
Ines Montani
59beab8405 Update v2-2.md [ci skip] 2019-09-27 18:10:43 +02:00
Ines Montani
685e4b2554 Update v2-2.md [ci skip] 2019-09-27 16:35:01 +02:00
Em Zhan
aafa091541 Fix typo in documentation (#4322)
* Fix typo 'probj' instead of 'pobj'

* Add spaCy contributor agreement for zqianem
2019-09-25 19:42:18 +02:00
Ines Montani
197406de1d Update v2-2.md [ci skip] 2019-09-19 14:33:58 +02:00
Ines Montani
ddc09b08ed Update v2-2.md [ci skip] 2019-09-19 00:58:30 +02:00
Ines Montani
9c940eab94 Update version in examples [ci skip] 2019-09-18 21:23:26 +02:00
Ines Montani
f873548f6c Add backwards incompatibility [ci skip] 2019-09-18 21:21:48 +02:00
Ines Montani
dd1810f05a Update DocBin and add docs 2019-09-18 20:23:21 +02:00
Ines Montani
d62690b3ba Update examples 2019-09-18 19:57:36 +02:00
Matthew Honnibal
931e96b6c7 DocPallet->DocBin in docs 2019-09-18 15:17:26 +02:00
Matthew Honnibal
f537cbeacc Update v2-2 docs 2019-09-18 14:07:55 +02:00
Ines Montani
16c2522791 Merge branch 'master' into develop 2019-09-14 16:42:01 +02:00
Ines Montani
86befc80bf WIP: Add v2.2 page [ci skip] 2019-09-14 16:41:48 +02:00
Ines Montani
04d36d2471 Remove unused link [ci skip] 2019-09-14 16:41:19 +02:00
Ines Montani
5c8b5e68ec Fix docs consistency [ci skip] 2019-09-14 16:23:37 +02:00
Ines Montani
bbf7337eaf Update adding languages docs [ci skip] 2019-09-14 15:32:15 +02:00
Ines Montani
25b2b3ff45 Remove LEMMA from exception examples [ci skip] 2019-09-12 16:26:27 +02:00