mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-26 01:46:28 +03:00
2c876eb672
* Expose tokenizer rules as a property
Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)
Add tests and update Tokenizer API docs.
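The property-plus-cache-reset pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions, not spaCy's Cython implementation: `ToyTokenizer`, `_rules`, and `_cache` are hypothetical names, and the rule format (a dict mapping a string to its list of tokens) is simplified.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer; not spaCy's actual class."""

    def __init__(self, rules=None):
        self._rules = {}
        self._cache = {}
        self.rules = rules  # routed through the property setter below

    @property
    def rules(self):
        # Read access mirrors the other exposed properties.
        return self._rules

    @rules.setter
    def rules(self, rules):
        # Replace the special-case rules (treating None as "no rules")
        # and drop cached results, which may reflect the old rules.
        self._rules = dict(rules) if rules is not None else {}
        self._cache.clear()

    def __call__(self, text):
        tokens = []
        for chunk in text.split():
            if chunk not in self._cache:
                # Special-case rule if one exists, else the chunk itself.
                self._cache[chunk] = self._rules.get(chunk, [chunk])
            tokens.extend(self._cache[chunk])
        return tokens


tok = ToyTokenizer({"don't": ["do", "n't"]})
before = tok("I don't know")  # uses the special case and fills the cache
tok.rules = {}                # setter clears the cache
after = tok("I don't know")   # the stale cached split is gone
```

Without the cache reset in the setter, the second call would still return the split produced under the old rules.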
* Update Hungarian punctuation to remove empty string
Update Hungarian punctuation definitions so that `_units` does not match
an empty string.
* Use _load_special_tokenization consistently
Use `_load_special_tokenization()` and have it handle `None` checks.
* Fix precedence of `token_match` vs. special cases
Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.
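The resulting order of checks can be illustrated with a toy, pure-Python tokenizer for a single whitespace-delimited chunk. This is a simplified sketch, not the actual spaCy code: the rule tables (`SPECIAL_CASES`, `PREFIX_RE`, `SUFFIX_RE`, `INFIX_RE`), the `token_match` pattern, and the function `tokenize_chunk` are all hypothetical and chosen only to make the precedence visible.

```python
import re

# Hypothetical rule tables, chosen only for illustration.
SPECIAL_CASES = {"a-b": ["a", "-b"]}
token_match = re.compile(r"^[a-z]+-[a-z]+$").match
PREFIX_RE = re.compile(r'^[(\["]')
SUFFIX_RE = re.compile(r'[)\]",.!?]$')
INFIX_RE = re.compile(r"(-)")


def tokenize_chunk(chunk):
    prefixes, suffixes = [], []
    # Affix loop: stops at a special case; token_match is NOT consulted here.
    while chunk and chunk not in SPECIAL_CASES:
        prefix = PREFIX_RE.search(chunk)
        if prefix:
            prefixes.append(prefix.group())
            chunk = chunk[prefix.end():]
            continue
        suffix = SUFFIX_RE.search(chunk)
        if suffix:
            suffixes.append(suffix.group())
            chunk = chunk[:suffix.start()]
            continue
        break
    if chunk in SPECIAL_CASES:
        # Special cases win, even if token_match would also match.
        middle = list(SPECIAL_CASES[chunk])
    elif chunk and token_match(chunk):
        # token_match is checked only here, just before infix splitting.
        middle = [chunk]
    else:
        middle = [piece for piece in INFIX_RE.split(chunk) if piece]
    # Suffixes were collected outermost-first, so reverse them.
    return prefixes + middle + suffixes[::-1]


special = tokenize_chunk("a-b")     # matches both; the special case wins
matched = tokenize_chunk("(c-d).")  # affixes stripped, then token_match
split = tokenize_chunk("X-Y")       # neither applies; infix is split
```

In this sketch `"a-b"` satisfies both the special-case table and `token_match`, so it shows which one takes precedence; `"X-Y"` satisfies neither and falls through to infix splitting.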
* Add `make_debug_doc()` to the Tokenizer
Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.
Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.
* Update tokenization usage docs
Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* Revert "Update Hungarian punctuation to remove empty string"
This reverts commit

File | ||
---|---|---|
annotation.md | ||
cli.md | ||
cython-classes.md | ||
cython-structs.md | ||
cython.md | ||
dependencyparser.md | ||
doc.md | ||
docbin.md | ||
entitylinker.md | ||
entityrecognizer.md | ||
entityruler.md | ||
goldcorpus.md | ||
goldparse.md | ||
index.md | ||
kb.md | ||
language.md | ||
lemmatizer.md | ||
lexeme.md | ||
lookups.md | ||
matcher.md | ||
phrasematcher.md | ||
pipeline-functions.md | ||
scorer.md | ||
sentencizer.md | ||
span.md | ||
stringstore.md | ||
tagger.md | ||
textcategorizer.md | ||
token.md | ||
tokenizer.md | ||
top-level.md | ||
vectors.md | ||
vocab.md |