mirror of https://github.com/explosion/spaCy.git (synced 2025-01-25 00:34:20 +03:00)
2c876eb672
* Expose tokenizer rules as a property
Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)
Add tests and update Tokenizer API docs.
* Update Hungarian punctuation to remove empty string
Update Hungarian punctuation definitions so that `_units` does not match
an empty string.
* Use _load_special_tokenization consistently
Use `_load_special_tokenization()` and have it handle `None` checks.
* Fix precedence of `token_match` vs. special cases
Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.
* Add `make_debug_doc()` to the Tokenizer
Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.
Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.
* Update tokenization usage docs
Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* Revert "Update Hungarian punctuation to remove empty string"
This reverts commit
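
The precedence described in the commit message (special cases win over `token_match`, and `token_match` is consulted only before infixes are split) can be illustrated with a minimal sketch. This is not spaCy's implementation: `ToyTokenizer`, `PREFIXES`, `SUFFIXES`, and `INFIX` are hypothetical names invented for the example, and the affix handling is deliberately simplified. The `rules` setter also resets a cache, mirroring the behavior the first bullet mentions.

```python
import re

# Hypothetical, simplified affix definitions (not spaCy's real ones).
PREFIXES = "(\"'"
SUFFIXES = ")\"'.,!?"
INFIX = re.compile(r"([-~])")


class ToyTokenizer:
    """Toy stand-in for a rule-based tokenizer; an illustration only."""

    def __init__(self, rules=None, token_match=None):
        self._cache = {}
        self._rules = {}
        self.token_match = token_match
        self.rules = rules  # goes through the setter below

    @property
    def rules(self):
        return self._rules

    @rules.setter
    def rules(self, rules):
        # Cached results may depend on the old special cases, so drop them.
        self._cache = {}
        self._rules = dict(rules) if rules is not None else {}

    def __call__(self, text):
        tokens = []
        for chunk in text.split():
            if chunk not in self._cache:
                self._cache[chunk] = self._tokenize_chunk(chunk)
            tokens.extend(self._cache[chunk])
        return tokens

    def _tokenize_chunk(self, chunk):
        prefixes, suffixes = [], []
        # Strip affixes; token_match is deliberately NOT checked here.
        while chunk:
            if chunk in self._rules:
                break
            if chunk[0] in PREFIXES:
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIXES:
                suffixes.append(chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        if not chunk:
            middle = []
        elif chunk in self._rules:
            # Special cases take precedence over token_match.
            middle = list(self._rules[chunk])
        elif self.token_match is not None and self.token_match(chunk):
            # token_match is checked only before infixes are split.
            middle = [chunk]
        else:
            middle = [t for t in INFIX.split(chunk) if t]
        return prefixes + middle + suffixes[::-1]


tok = ToyTokenizer(
    rules={"e-mail": ["e", "-", "mail"]},
    token_match=re.compile(r"^[a-z]+-[a-z]+$").match,
)
print(tok("(e-mail) well-known up~down."))
# ['(', 'e', '-', 'mail', ')', 'well-known', 'up', '~', 'down', '.']
```

Here `e-mail` matches both a special case and `token_match`, and the special case wins; `well-known` has no special case, so `token_match` keeps it whole; `up~down` matches neither and is split on the infix. Assigning a new dictionary to `tok.rules` clears the cache, so later calls reflect the new special cases.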
| Name |
|---|
| 101 |
| _benchmarks-choi.md |
| adding-languages.md |
| examples.md |
| facts-figures.md |
| index.md |
| linguistic-features.md |
| models.md |
| processing-pipelines.md |
| rule-based-matching.md |
| saving-loading.md |
| spacy-101.md |
| training.md |
| v2-1.md |
| v2-2.md |
| v2.md |
| vectors-similarity.md |
| visualizers.md |