spaCy/spacy/tests/lang
adrianeboyd 2c876eb672 Add tokenizer explain() debugging method (#4596)
* Expose tokenizer rules as a property

Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)

Add tests and update Tokenizer API docs.

* Update Hungarian punctuation to remove empty string

Update Hungarian punctuation definitions so that `_units` does not match
an empty string.

* Use _load_special_tokenization consistently

Use `_load_special_tokenization()` and have it handle `None` checks.
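
The pattern described in the two items above (a `rules` property whose setter resets the cache and delegates to a loader that tolerates `None`) can be sketched in plain Python. This is a minimal illustration, not spaCy's Cython implementation: `ToyTokenizer`, `_rules`, and `_cache` are hypothetical names, and only `rules` and `_load_special_tokenization()` come from the commit message.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer with special-case rules."""

    def __init__(self, rules=None):
        self._rules = {}
        self._cache = {}
        self.rules = rules  # goes through the property setter below

    @property
    def rules(self):
        return self._rules

    @rules.setter
    def rules(self, rules):
        # Reset both the stored rules and any cached tokenizations,
        # then reload. The None check lives in the loader, so callers
        # never need to guard against it themselves.
        self._rules = {}
        self._cache = {}
        self._load_special_tokenization(rules)

    def _load_special_tokenization(self, rules):
        if rules is None:
            return
        for chunk, pieces in rules.items():
            self._rules[chunk] = list(pieces)


tok = ToyTokenizer({"don't": ["do", "n't"]})
tok._cache["stale"] = ["stale"]
tok.rules = None  # clears rules and cache without raising
```

Keeping the `None` handling in one place means the constructor, the setter, and any deserialization path can all call the loader unconditionally.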

* Fix precedence of `token_match` vs. special cases

Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.

* Add `make_debug_doc()` to the Tokenizer

Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.

Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.

* Update tokenization usage docs

Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.

* Revert "Update Hungarian punctuation to remove empty string"

This reverts commit f0a577f7a5.

* Rework `make_debug_doc()` as `explain()`

Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.
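
As a rough illustration of the return shape, the following is a toy, pure-Python sketch of an `explain()`-style function that yields `(pattern_string, token_string)` tuples. The rules (`SPECIAL_CASES`, `PREFIX_SEARCH`, `SUFFIX_SEARCH`, `INFIX_FINDITER`, `TOKEN_MATCH`) and the label strings are assumptions chosen for the example, not spaCy's language data or its actual implementation. It also reflects the precedence described earlier: special cases are checked first, and `token_match` is consulted only once no more affixes can be split off, before infixes are split.

```python
import re

# Hypothetical toy rules for illustration; not spaCy's language data.
SPECIAL_CASES = {"don't": ["do", "n't"]}
PREFIX_SEARCH = re.compile(r'^[("]').search
SUFFIX_SEARCH = re.compile(r'[)".,!]$').search
INFIX_FINDITER = re.compile(r"-").finditer
TOKEN_MATCH = re.compile(r"^https?://\S+$").match


def explain(text):
    """Return (pattern_string, token_string) tuples for a toy tokenizer."""
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            # Special cases take precedence over everything else.
            if substring in SPECIAL_CASES:
                pieces = SPECIAL_CASES[substring]
                tokens.extend(
                    ("SPECIAL-%d" % (i + 1), piece)
                    for i, piece in enumerate(pieces)
                )
                break
            prefix = PREFIX_SEARCH(substring)
            if prefix:
                tokens.append(("PREFIX", prefix.group()))
                substring = substring[prefix.end():]
                continue
            suffix = SUFFIX_SEARCH(substring)
            if suffix:
                # Suffixes are collected and emitted after the core tokens.
                suffixes.append(("SUFFIX", suffix.group()))
                substring = substring[:suffix.start()]
                continue
            # No affixes left: try token_match before splitting on infixes.
            if TOKEN_MATCH(substring):
                tokens.append(("TOKEN_MATCH", substring))
                break
            start = 0
            for infix in INFIX_FINDITER(substring):
                if infix.start() > start:
                    tokens.append(("TOKEN", substring[start:infix.start()]))
                tokens.append(("INFIX", infix.group()))
                start = infix.end()
            if start < len(substring):
                tokens.append(("TOKEN", substring[start:]))
            break
        tokens.extend(reversed(suffixes))
    return tokens


print(explain("(well-known)"))
# [('PREFIX', '('), ('TOKEN', 'well'), ('INFIX', '-'), ('TOKEN', 'known'), ('SUFFIX', ')')]
```

With these toy rules, `explain("http://a-b.com")` yields a single `TOKEN_MATCH` entry, because the URL pattern is tried before the `-` infix is split.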

* Handle cases with bad tokenizer patterns

Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.
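
To see why an empty match is a problem, consider a loop that keeps stripping prefixes while the pattern matches: a pattern that can match the empty string matches forever without consuming input. The sketch below shows one way to guard against that by stopping on zero-width matches. `strip_prefixes` and both example patterns are hypothetical and only illustrate the idea; they are not taken from the spaCy source.

```python
import re


def strip_prefixes(substring, prefix_search):
    """Peel prefixes off `substring`, stopping on zero-width matches."""
    prefixes = []
    while substring:
        match = prefix_search(substring)
        # A match that consumes nothing would never shrink `substring`,
        # so treat it the same as "no match" and stop.
        if match is None or match.end() == 0:
            break
        prefixes.append(substring[:match.end()])
        substring = substring[match.end():]
    return prefixes, substring


good_pattern = re.compile(r"^\(").search   # always consumes one character
bad_pattern = re.compile(r"^\(*").search   # can match the empty string

print(strip_prefixes("((a", good_pattern))  # (['(', '('], 'a')
print(strip_prefixes("abc", bad_pattern))   # ([], 'abc')
```

Without the `match.end() == 0` check, the second call would loop indefinitely, since `bad_pattern` always returns a match object.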

* Remove unused displacy image

* Add tokenizer.explain() to usage docs
2019-11-20 13:07:25 +01:00
ar Revert #4334 2019-09-29 17:32:12 +02:00
bn Revert #4334 2019-09-29 17:32:12 +02:00
ca Revert #4334 2019-09-29 17:32:12 +02:00
da Move lookup tables out of the core library (#4346) 2019-10-01 00:01:27 +02:00
de Move lookup tables out of the core library (#4346) 2019-10-01 00:01:27 +02:00
el Revert #4334 2019-09-29 17:32:12 +02:00
en Add tokenizer explain() debugging method (#4596) 2019-11-20 13:07:25 +01:00
es Revert #4334 2019-09-29 17:32:12 +02:00
fi Revert #4334 2019-09-29 17:32:12 +02:00
fr Move lookup tables out of the core library (#4346) 2019-10-01 00:01:27 +02:00
ga Revert #4334 2019-09-29 17:32:12 +02:00
he Revert #4334 2019-09-29 17:32:12 +02:00
hu Revert #4334 2019-09-29 17:32:12 +02:00
id Revert #4334 2019-09-29 17:32:12 +02:00
it Revert #4334 2019-09-29 17:32:12 +02:00
ja Revert #4334 2019-09-29 17:32:12 +02:00
ko Revert #4334 2019-09-29 17:32:12 +02:00
lb Fix basic language support for Luxembourgish (by adding punctuation.py) (#4648) 2019-11-15 16:16:47 +01:00
lt Move lookup tables out of the core library (#4346) 2019-10-01 00:01:27 +02:00
nb Revert #4334 2019-09-29 17:32:12 +02:00
nl Move lookup tables out of the core library (#4346) 2019-10-01 00:01:27 +02:00
pl Revert #4334 2019-09-29 17:32:12 +02:00
pt Revert #4334 2019-09-29 17:32:12 +02:00
ro Move lookup tables out of the core library (#4346) 2019-10-01 00:01:27 +02:00
ru Revert #4334 2019-09-29 17:32:12 +02:00
sr Move lookup tables out of the core library (#4346) 2019-10-01 00:01:27 +02:00
sv Tidy up and auto-format [ci skip] 2019-10-24 16:20:48 +02:00
th Revert #4334 2019-09-29 17:32:12 +02:00
tr Move lookup tables out of the core library (#4346) 2019-10-01 00:01:27 +02:00
tt Revert #4334 2019-09-29 17:32:12 +02:00
uk Revert #4334 2019-09-29 17:32:12 +02:00
ur Revert #4334 2019-09-29 17:32:12 +02:00
zh Rework Chinese language initialization and tokenization (#4619) 2019-11-11 14:23:21 +01:00
__init__.py Revert #4334 2019-09-29 17:32:12 +02:00
test_attrs.py Revert #4334 2019-09-29 17:32:12 +02:00
test_initialize.py Revert #4334 2019-09-29 17:32:12 +02:00