Mirror of https://github.com/explosion/spaCy.git (synced 2025-12-12 12:44:29 +03:00)
* Expose tokenizer rules as a property
Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)
Add tests and update Tokenizer API docs.
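A minimal usage sketch, assuming the spaCy v2.x API surface (the `rules` property and its setter are what this change exposes; the edited entry is purely illustrative):

```python
import spacy

nlp = spacy.blank("en")
rules = nlp.tokenizer.rules    # special-case rules: ORTH string -> list of token attrs
rules.pop("don't", None)       # hypothetical edit: drop one special case
nlp.tokenizer.rules = rules    # assignment goes through the new setter,
                               # which also resets the tokenizer cache
```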
* Update Hungarian punctuation to remove empty string
Update Hungarian punctuation definitions so that `_units` does not match
an empty string.
* Use _load_special_tokenization consistently
Use `_load_special_tokenization()` and have it handle `None` checks.
* Fix precedence of `token_match` vs. special cases
Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.
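A hedged sketch of the new precedence, using an illustrative pattern and special case (not taken from the PR itself). With this change the special case wins, while `token_match` still applies where no special case matches:

```python
import re
import spacy

nlp = spacy.blank("en")

# Illustrative token_match: keep "ABC-123"-style identifiers as single tokens.
nlp.tokenizer.token_match = re.compile(r"^[A-Z]+-\d+$").match

# Special case for one specific identifier; it now takes precedence over token_match.
nlp.tokenizer.add_special_case(
    "ABC-123", [{"ORTH": "ABC"}, {"ORTH": "-"}, {"ORTH": "123"}]
)

print([t.text for t in nlp("ABC-123 XYZ-9")])
# expected: ['ABC', '-', '123', 'XYZ-9']
```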
* Add `make_debug_doc()` to the Tokenizer
Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.
Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.
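A sketch of that slow test, assuming `make_debug_doc()` (the method added here) returns a `Doc`-like sequence of tokens; the two-language parametrize list below only stands in for the "all languages with importable `examples.sentences`" loop:

```python
import importlib

import pytest
from spacy.util import get_lang_class


@pytest.mark.slow
@pytest.mark.parametrize("lang", ["en", "de"])
def test_tokenizer_matches_debug_doc(lang):
    nlp = get_lang_class(lang)()
    sentences = importlib.import_module("spacy.lang.%s.examples" % lang).sentences
    for sentence in sentences:
        tokens = [t.text for t in nlp.tokenizer(sentence) if not t.text.isspace()]
        debug = [t.text for t in nlp.tokenizer.make_debug_doc(sentence)
                 if not t.text.isspace()]
        assert tokens == debug
```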
* Update tokenization usage docs
Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
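A hedged sketch of one such customization: extend the default suffix rules rather than replace them, so the stock behaviour is preserved (the added suffix pattern is illustrative):

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")

# Keep the default suffix patterns and add one custom pattern on top.
suffixes = list(nlp.Defaults.suffixes) + [r"-+$"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
```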
* Revert "Update Hungarian punctuation to remove empty string"
This reverts commit