* Mark most Hungarian tokenizer test cases as slow
Mark most Hungarian tokenizer test cases as slow to reduce the runtime
of the test suite in ordinary usage:
* for normal tests: run default tests plus 10% of the detailed tests
* for slow tests: run all tests
* Rework to mark individual tests as slow
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
* expand serialization test for custom token attribute
* add failing test for issue 4849
* define ENT_ID as attr and use in doc serialization
* fix few typos
* Adding Support for Yoruba
* test text
* Updated test string.
* Fixing encoding declaration.
* Adding encoding to stop_words.py
* Added contributor agreement and removed iranlowo.
* Added removed test files and removed iranlowo to keep project bare.
* Returned CONTRIBUTING.md to default state.
* Added delted conftest entries
* Tidy up and auto-format
* Revert CONTRIBUTING.md
Co-authored-by: Ines Montani <ines@ines.io>
* Include Doc.cats in to_bytes()
* Include Doc.cats in DocBin serialization
* Add tests for serialization of cats
Test serialization of cats for Doc and DocBin.
* Enable lex_attrs on Finnish
* Copy the Danish tokenizer rules to Finnish
Specifically, don't break hyphenated compound words
* Contributor agreement
* A new file for Finnish tokenizer rules instead of including the Danish ones
- added some tests for tokenization issues
- fixed some issues with tokenization of words with hyphen infix
- rewrote the "tokenizer_exceptions.py" file (stemming from the German version)
* Restructure Sentencizer to follow Pipe API
Restructure Sentencizer to follow Pipe API so that it can be scored with
`nlp.evaluate()`.
* Add Sentencizer pipe() test
Iterate over lr_edges until all heads are within the current sentence.
Instead of iterating over them for a fixed number of iterations, check
whether the sentence boundaries are correct for the heads and stop when
all are correct. Stop after a maximum of 10 iterations, providing a
warning in this case since the sentence boundaries may not be correct.
* Switch from mecab-python3 to fugashi
mecab-python3 has been the best MeCab binding for a long time but it's
not very actively maintained, and since it's based on old SWIG code
distributed with MeCab there's a limit to how effectively it can be
maintained.
Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not
based on the old SWIG code it's easier to keep it current and make small
deviations from the MeCab C/C++ API where that makes sense.
* Change mecab-python3 to fugashi in setup.cfg
* Change "mecab tags" to "unidic tags"
The tags come from MeCab, but the tag schema is specified by Unidic, so
it's more proper to refer to it that way.
* Update conftest
* Add fugashi link to external deps list for Japanese
* Detect more empty matches in tokenizer.explain()
* Include a few languages in explain non-slow tests
Mark a few languages in tokenizer.explain() tests as not slow so they're
run by default.
* Expose tokenizer rules as a property
Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)
Add tests and update Tokenizer API docs.
* Update Hungarian punctuation to remove empty string
Update Hungarian punctuation definitions so that `_units` does not match
an empty string.
* Use _load_special_tokenization consistently
Use `_load_special_tokenization()` and have it to handle `None` checks.
* Fix precedence of `token_match` vs. special cases
Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.
* Add `make_debug_doc()` to the Tokenizer
Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.
Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.
* Update tokenization usage docs
Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* Revert "Update Hungarian punctuation to remove empty string"
This reverts commit f0a577f7a5.
* Rework `make_debug_doc()` as `explain()`
Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.
* Handle cases with bad tokenizer patterns
Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.
* Remove unused displacy image
* Add tokenizer.explain() to usage docs
* Rework Chinese language initialization
* Create a `ChineseTokenizer` class
* Modify jieba post-processing to handle whitespace correctly
* Modify non-jieba character tokenization to handle whitespace correctly
* Add a `create_tokenizer()` method to `ChineseDefaults`
* Load lexical attributes
* Update Chinese tag_map for UD v2
* Add very basic Chinese tests
* Test tokenization with and without jieba
* Test `like_num` attribute
* Fix try_jieba_import()
* Fix zh code formatting
* Xfail new tokenization test
* Put new alignment behind feature flag
* Move USE_ALIGN to top of the file [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
The `Matcher` in `merge_subtokens()` returns all possible subsequences
of `subtok`, so for sequences of two or more subtoks it's necessary to
filter the matches so that the retokenizer is only merging the longest
matches with no overlapping spans.
* trying to fix script - not succesful yet
* match pop() with extend() to avoid changing the data
* few more pop-extend fixes
* reinsert deleted print statement
* fix print statement
* add last tested version
* append instead of extend
* add in few comments
* quick fix for 4402 + unit test
* fixing number of docs (not counting cats)
* more fixes
* fix len
* print tmp file instead of using data from examples dir
* print tmp file instead of using data from examples dir (2)
* Add work in progress
* Update analysis helpers and component decorator
* Fix porting of docstrings for Python 2
* Fix docstring stuff on Python 2
* Support meta factories when loading model
* Put auto pipeline analysis behind flag for now
* Analyse pipes on remove_pipe and replace_pipe
* Move analysis to root for now
Try to find a better place for it, but it needs to go for now to avoid circular imports
* Simplify decorator
Don't return a wrapped class and instead just write to the object
* Update existing components and factories
* Add condition in factory for classes vs. functions
* Add missing from_nlp classmethods
* Add "retokenizes" to printed overview
* Update assigns/requires declarations of builtins
* Only return data if no_print is enabled
* Use multiline table for overview
* Don't support Span
* Rewrite errors/warnings and move them to spacy.errors