Mirror of https://github.com/explosion/spaCy.git (synced 2025-11-09 04:17:53 +03:00)
* Expose tokenizer rules as a property
Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)
Add tests and update Tokenizer API docs.
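
A minimal sketch of the pattern described above: a `rules` property whose setter discards cached results, so that stale tokenizations are not reused after the rules change. `ToyTokenizer` and its attributes are hypothetical stand-ins written for illustration; this is not spaCy's implementation.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer with special-case rules."""

    def __init__(self, rules=None):
        self._rules = {}
        self._cache = {}
        self.rules = rules  # routed through the setter below

    @property
    def rules(self):
        return self._rules

    @rules.setter
    def rules(self, rules):
        # Reset both the rules and the cache. Resetting the cache may be
        # more than strictly needed, but it can never serve stale results.
        self._rules = {}
        self._cache = {}
        if rules is not None:  # the None check lives in one place
            for string, substrings in rules.items():
                self._rules[string] = list(substrings)

    def __call__(self, text):
        tokens = []
        for chunk in text.split():
            if chunk not in self._cache:
                self._cache[chunk] = self._rules.get(chunk, [chunk])
            tokens.extend(self._cache[chunk])
        return tokens


tok = ToyTokenizer()
before = tok("don't stop")            # caches "don't" as a single token
tok.rules = {"don't": ["do", "n't"]}  # setter clears the cache
after = tok("don't stop")
print(before)  # ["don't", 'stop']
print(after)   # ['do', "n't", 'stop']
```

Without the cache reset in the setter, the second call would return the cached single token for `don't` and ignore the new rule.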
* Update Hungarian punctuation to remove empty string
Update Hungarian punctuation definitions so that `_units` does not match
an empty string.
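
To illustrate the kind of problem this addresses, the sketch below shows how an empty alternative in a regular expression makes the whole pattern match the empty string. The unit lists are hypothetical and are not the actual Hungarian `_units` definition.

```python
import re

# Hypothetical unit alternations; the trailing "|" adds an empty alternative.
units_with_empty = "km|kg|"
units_fixed = "km|kg"

matches_empty_before = re.fullmatch(f"(?:{units_with_empty})", "") is not None
matches_empty_after = re.fullmatch(f"(?:{units_fixed})", "") is not None

print(matches_empty_before)  # True
print(matches_empty_after)   # False
```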
* Use _load_special_tokenization consistently
Use `_load_special_tokenization()` and have it handle `None` checks.
* Fix precedence of `token_match` vs. special cases
Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.
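
The precedence described above can be sketched with a simplified, pure-Python tokenizer loop. The function `tokenize`, its parameters, and the regular expressions are assumptions made for illustration; spaCy's actual tokenizer is more involved.

```python
import re


def tokenize(text, special_cases, token_match, prefix_re, suffix_re, infix_re):
    """Toy tokenizer: special cases win over token_match, and token_match
    is consulted only right before infix splitting."""
    tokens = []
    for substring in text.split():
        suffixes = []
        # Strip affixes. Only a special case stops this loop early;
        # token_match is deliberately not checked here.
        while substring and substring not in special_cases:
            m = prefix_re.match(substring)
            if m:
                tokens.append(m.group())
                substring = substring[m.end():]
                continue
            m = suffix_re.search(substring)
            if m:
                suffixes.insert(0, m.group())
                substring = substring[:m.start()]
                continue
            break
        if substring in special_cases:
            tokens.extend(special_cases[substring])
        elif substring and token_match(substring):
            tokens.append(substring)
        elif substring:
            # The capturing group keeps the infix itself as a token.
            tokens.extend(p for p in infix_re.split(substring) if p)
        tokens.extend(suffixes)
    return tokens


tokens = tokenize(
    "(don't can't well-known)",
    special_cases={"don't": ["do", "n't"]},
    token_match=re.compile(r"^\w+'\w+$").match,  # would also match "don't"
    prefix_re=re.compile(r'[("]'),
    suffix_re=re.compile(r'[)".,!]$'),
    infix_re=re.compile(r"(-|')"),
)
print(tokens)  # ['(', 'do', "n't", "can't", 'well', '-', 'known', ')']
```

Here `don't` matches both the special case and `token_match`, and the special case wins. `can't` has no special case, so `token_match` keeps it whole instead of letting the infix pattern split it at the apostrophe.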
* Add `make_debug_doc()` to the Tokenizer
Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.
Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.
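
The comparison in that test can be sketched as follows. The two tokenizer callables are hypothetical stand-ins built from regular expressions (one emits whitespace tokens, the other does not); they are not `nlp.tokenizer()` or `make_debug_doc()`.

```python
import re


def find_mismatches(tokenize, debug_tokenize, sentences):
    """Return the sentences whose non-whitespace tokens differ."""
    mismatches = []
    for sentence in sentences:
        expected = [t for t in tokenize(sentence) if not t.isspace()]
        debug = [t for t in debug_tokenize(sentence) if not t.isspace()]
        if expected != debug:
            mismatches.append(sentence)
    return mismatches


def with_whitespace(text):
    # Stand-in tokenizer that keeps runs of whitespace as tokens.
    return re.findall(r"\s+|\w+|[^\w\s]", text)


def without_whitespace(text):
    # Stand-in debug tokenizer that drops whitespace entirely.
    return re.findall(r"\w+|[^\w\s]", text)


sentences = ["Hello,  world!", "One two\tthree."]
mismatches = find_mismatches(with_whitespace, without_whitespace, sentences)
print(mismatches)  # []
```

Filtering whitespace tokens before comparing lets the two implementations disagree about how whitespace is represented while still being required to agree on every other token.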
* Update tokenization usage docs
Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* Revert "Update Hungarian punctuation to remove empty string"
This reverts commit
| | | |
|---|---|---|
| ar | ||
| bn | ||
| ca | ||
| da | ||
| de | ||
| el | ||
| en | ||
| es | ||
| fi | ||
| fr | ||
| ga | ||
| he | ||
| hu | ||
| id | ||
| it | ||
| ja | ||
| ko | ||
| lb | ||
| lt | ||
| nb | ||
| nl | ||
| pl | ||
| pt | ||
| ro | ||
| ru | ||
| sr | ||
| sv | ||
| th | ||
| tr | ||
| tt | ||
| uk | ||
| ur | ||
| zh | ||
| __init__.py | ||
| test_attrs.py | ||
| test_initialize.py | ||