spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-18 08:01:58 +03:00

adrianeboyd 2c876eb672 Add tokenizer explain() debugging method (#4596)	2019-11-20 13:07:25 +01:00

* Expose tokenizer rules as a property

  Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.)

  Add tests and update Tokenizer API docs.

* Update Hungarian punctuation to remove empty string

  Update Hungarian punctuation definitions so that `_units` does not match an empty string.

* Use _load_special_tokenization consistently

  Use `_load_special_tokenization()` and have it handle `None` checks.

* Fix precedence of `token_match` vs. special cases

  Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split.

* Add `make_debug_doc()` to the Tokenizer

  Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs.

  Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported.

* Update tokenization usage docs

  Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage.

  Add more examples for customizing tokenizers while preserving the existing defaults.

  Minor edits / clarifications.

* Revert "Update Hungarian punctuation to remove empty string"

  This reverts commit `f0a577f7a5`.

* Rework `make_debug_doc()` as `explain()`

  Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work.

* Handle cases with bad tokenizer patterns

  Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns.

* Remove unused displacy image

* Add tokenizer.explain() to usage docs
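
The commit message above describes `explain()` as returning a list of `(pattern_string, token_string)` tuples, with special cases taking precedence and with a guard against patterns that match empty prefixes or suffixes. The following is a minimal, self-contained sketch of that idea in plain Python. It is not spaCy's implementation: the function `explain`, the rule labels (`"SPECIAL"`, `"PREFIX"`, `"SUFFIX"`, `"TOKEN"`), the regular expressions, and the `SPECIAL_CASES` table are all hypothetical and chosen only for illustration. Ignoring an empty match is one possible way to avoid the hang; the source does not say how the real method handles it.

```python
import re

# Hypothetical toy rules for illustration; these are NOT spaCy's real patterns.
PREFIX_RE = re.compile(r'''^[\("']''')
SUFFIX_RE = re.compile(r'''[\)"'.,!?]$''')
SPECIAL_CASES = {"don't": ["do", "n't"]}


def explain(text, prefix_re=PREFIX_RE, suffix_re=SUFFIX_RE):
    """Return (rule_label, token_string) tuples for a whitespace-split text."""
    explained = []
    for substring in text.split():
        suffixes = []
        while substring:
            # Special cases are checked first, so they take precedence.
            if substring in SPECIAL_CASES:
                explained.extend(("SPECIAL", part) for part in SPECIAL_CASES[substring])
                break
            prefix = prefix_re.search(substring)
            # An empty match would never shrink the substring, so it is
            # treated as "no match" to keep the loop from running forever.
            if prefix and prefix.group():
                explained.append(("PREFIX", prefix.group()))
                substring = substring[prefix.end():]
                continue
            suffix = suffix_re.search(substring)
            if suffix and suffix.group():
                suffixes.append(("SUFFIX", suffix.group()))
                substring = substring[:suffix.start()]
                continue
            # Nothing left to split off: the remainder is a plain token.
            explained.append(("TOKEN", substring))
            break
        # Suffixes were collected from the outside in; emit them in text order.
        explained.extend(reversed(suffixes))
    return explained


if __name__ == "__main__":
    for label, token in explain('"don\'t" (hello!)'):
        print(label, token, sep="\t")
```

Returning plain tuples rather than a document object keeps the output easy to compare in tests: joining the second element of each tuple reproduces the non-whitespace text of the input.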
ar	Revert #4334	2019-09-29 17:32:12 +02:00
bn	Revert #4334	2019-09-29 17:32:12 +02:00
ca	Revert #4334	2019-09-29 17:32:12 +02:00
da	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
de	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
el	Revert #4334	2019-09-29 17:32:12 +02:00
en	Add tokenizer explain() debugging method (#4596)	2019-11-20 13:07:25 +01:00
es	Revert #4334	2019-09-29 17:32:12 +02:00
fi	Revert #4334	2019-09-29 17:32:12 +02:00
fr	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
ga	Revert #4334	2019-09-29 17:32:12 +02:00
he	Revert #4334	2019-09-29 17:32:12 +02:00
hu	Revert #4334	2019-09-29 17:32:12 +02:00
id	Revert #4334	2019-09-29 17:32:12 +02:00
it	Revert #4334	2019-09-29 17:32:12 +02:00
ja	Revert #4334	2019-09-29 17:32:12 +02:00
ko	Revert #4334	2019-09-29 17:32:12 +02:00
lb	Fix basic language support for Luxembourgish (by adding punctuation.py) (#4648)	2019-11-15 16:16:47 +01:00
lt	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
nb	Revert #4334	2019-09-29 17:32:12 +02:00
nl	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
pl	Revert #4334	2019-09-29 17:32:12 +02:00
pt	Revert #4334	2019-09-29 17:32:12 +02:00
ro	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
ru	Revert #4334	2019-09-29 17:32:12 +02:00
sr	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
sv	Tidy up and auto-format [ci skip]	2019-10-24 16:20:48 +02:00
th	Revert #4334	2019-09-29 17:32:12 +02:00
tr	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
tt	Revert #4334	2019-09-29 17:32:12 +02:00
uk	Revert #4334	2019-09-29 17:32:12 +02:00
ur	Revert #4334	2019-09-29 17:32:12 +02:00
zh	Rework Chinese language initialization and tokenization (#4619)	2019-11-11 14:23:21 +01:00
__init__.py	Revert #4334	2019-09-29 17:32:12 +02:00
test_attrs.py	Revert #4334	2019-09-29 17:32:12 +02:00
test_initialize.py	Revert #4334	2019-09-29 17:32:12 +02:00