mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-24 00:46:28 +03:00
2c876eb672
* Expose tokenizer rules as a property
Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)
Add tests and update Tokenizer API docs.
* Update Hungarian punctuation to remove empty string
Update Hungarian punctuation definitions so that `_units` does not match
an empty string.
* Use _load_special_tokenization consistently
Use `_load_special_tokenization()` and have it to handle `None` checks.
* Fix precedence of `token_match` vs. special cases
Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.
* Add `make_debug_doc()` to the Tokenizer
Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.
Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.
* Update tokenization usage docs
Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* Revert "Update Hungarian punctuation to remove empty string"
This reverts commit
|
||
---|---|---|
.. | ||
ar | ||
bn | ||
ca | ||
da | ||
de | ||
el | ||
en | ||
es | ||
fi | ||
fr | ||
ga | ||
he | ||
hu | ||
id | ||
it | ||
ja | ||
ko | ||
lb | ||
lt | ||
nb | ||
nl | ||
pl | ||
pt | ||
ro | ||
ru | ||
sr | ||
sv | ||
th | ||
tr | ||
tt | ||
uk | ||
ur | ||
zh | ||
__init__.py | ||
test_attrs.py | ||
test_initialize.py |