To fix the slow tokenizer URL (#4374) and allow `token_match` to take
priority over prefixes and suffixes by default, introduce a new
tokenizer option for a token match pattern that's applied after prefixes
and suffixes but before infixes.
* Initialize lower flag explicitly
* Handle whitespace words from GoldParse correctly when creating raw
text with orth variants
* Return the text with original casing if anything goes wrong
* `debug-data`: determine coverage of provided vectors
* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization
* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file
* `train`:
* if training on GPU, only run evaluation/timing on CPU in the first
iteration
* if training is aborted, exit with a non-0 exit status
* simplify creation of KB by skipping dim reduction
* small fixes to train EL example script
* add KB creation and NEL training example scripts to example section
* update descriptions of example scripts in the documentation
* moving wiki_entity_linking folder from bin to projects
* remove test for wiki NEL functionality that is being moved
Reconstruction of the original PR #4697 by @MiniLau.
Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
Improve GoldParse NER alignment by including all cases where the start
and end of the NER span can be aligned, regardless of internal
tokenization differences.
To do this, convert BILUO tags to character offsets, check start/end
alignment with `doc.char_span()`, and assign the BILUO tags for the
aligned spans. Alignment for `O/-` tags is handled through the
one-to-one and multi alignments.
Modify jieba install message to instruct the user to use
`ChineseDefaults.use_jieba = False` so that it's possible to load
pkuseg-only models without jieba installed.
* Add pkuseg and serialization support for Chinese
Add support for pkuseg alongside jieba
* Specify model through `Language` meta:
* split on characters (if no word segmentation packages are installed)
```
Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}})
```
* jieba (remains the default tokenizer if installed)
```
Chinese()
Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit
```
* pkuseg
```
Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}})
```
* The new tokenizer setting `require_pkuseg` is used to override
`use_jieba` default, which is intended for models that provide a pkuseg
model:
```
nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}})
nlp = Chinese() # has `use_jieba` as `True` by default
nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer
```
Add support for serialization of tokenizer settings and pkuseg model, if
loaded
* Add sorting for `Language.to_bytes()` serialization of `Language.meta`
so that the (emptied, but still present) tokenizer metadata is in a
consistent position in the serialized data
Extend tests to cover all three tokenizer configurations and
serialization
* Fix from_disk and tests without jieba or pkuseg
* Load cfg first and only show error if `use_pkuseg`
* Fix blank/default initialization in serialization tests
* Explicitly initialize jieba's cache on init
* Add serialization for pkuseg pre/postprocessors
* Reformat pkuseg install message
* Matcher support for Span, as well as Doc #5056
* Removes an import unused
* Signed contributors agreement
* Code optimization and better test
* Add error message for bad Matcher call argument
* Fix merging
* Use max(uint64) for OOV lexeme rank
* Add test for default OOV rank
* Revert back to thinc==7.4.0
Requiring the updated version of thinc was unnecessary.
* Define OOV_RANK in one place
Define OOV_RANK in one place in `util`.
* Fix formatting [ci skip]
* Switch to external definitions of max(uint64)
Switch to external defintions of max(uint64) and confirm that they are
equal.
* Add Doc init from list of words and text
Add an option to initialize a `Doc` from a text and list of words where
the words may or may not include all whitespace tokens. If the text and
words are mismatched, raise an error.
* Fix error code
* Remove all whitespace before aligning words/text
* Move words/text init to util function
* Update error message
* Rename to get_words_and_spaces
* Fix formatting
* Fixed typo in cli warning
Fixed a typo in the warning for the provision of exactly two labels, which have not been designated as binary, to textcat.
* Create and signed contributor form
* Use inline flags in token_match patterns
Use inline flags in `token_match` patterns so that serializing does not
lose the flag information.
* Modify inline flag
* Modify inline flag
* Modify Vector.resize to work with cupy
Modify `Vectors.resize` to work with cupy. Modify behavior when resizing
to a different vector dimension so that individual vectors are truncated
or extended with zeros instead of having the original values filled into
the new shape without regard for the original axes.
* Update spacy/tests/vocab_vectors/test_vectors.py
Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Omit per_type scores from model-best calculations
The addition of per_type scores to the included metrics (#4911) causes
errors when they're compared while determining the best model, so omit
them for this `max()` comparison.
* Add default speed data for interrupted train CLI
Add better speed meta defaults so that an interrupted iteration still
produces a best model.
Co-authored-by: Ines Montani <ines@ines.io>
UD_Danish-DDT has (as far as I can tell) hallucinated periods after
abbreviations, so the changes are an artifact of the corpus and not due
to anything meaningful about Danish tokenization.
* Revert changes to priority of `token_match` so that it has priority
over all other tokenizer patterns
* Add lookahead and potentially slow lookbehind back to the default URL
pattern
* Expand character classes in URL pattern to improve matching around
lookaheads and lookbehinds related to #4882
* Revert changes to Hungarian tokenizer
* Revert (xfail) several URL tests to their status before #4374
* Update `tokenizer.explain()` and docs accordingly
* merge_entities sets the vector in the vocab for the merged token
* add unit test
* import unicode_literals
* move code to _merge function
* only set vector if vocab has non-zero vectors
* Improve token head verification
Improve the verification for valid token heads when heads are set:
* in `Token.head`: heads come from the same document
* in `Doc.from_array()`: head indices are within the bounds of the
document
* Improve error message
* Fix model-final/model-best meta
* include speed and accuracy from final iteration
* combine with speeds from base model if necessary
* Include token_acc metric for all components
* add lemma option to displacy 'dep' visualiser
* more compact list comprehension
* add option to doc
* fix test and add lemmas to util.get_doc
* fix capital
* remove lemma from get_doc
* cleanup
* Fix german stop words
Two stop words ("einige" and "einigen") are sticking together.
Remove three nouns that may serve as stop words in a specific context (e.g. religious or news) but are not applicable for general use.
* Create Jan-711.md
* Fix ent_ids and labels properties when id attribute used in patterns
* use set for labels
* sort end_ids for comparison in entity_ruler tests
* fixing entity_ruler ent_ids test
* add to set
* Run make_doc optimistically if using phrase matcher patterns.
* remove unused coveragerc I was testing with
* format
* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.
* Removing old add_patterns function
* Fixing spacing
* Make sure token_patterns loaded as well, before generator was being emptied in from_disk
* Sync Span __eq__ and __hash__
Use the same tuple for `__eq__` and `__hash__`, including all attributes
except `vector` and `vector_norm`.
* Update entity comparison in tests
Update `assert_docs_equal()` test util to compare `Span` properties for
ents rather than `Span` objects.
Modify flag settings so that `DEP` is not sufficient to set `is_parsed`
and only run `set_children_from_heads()` if `HEAD` is provided.
Then the combination `[SENT_START, DEP]` will set deps and not clobber
sent starts with a lot of one-word sentences.
* Rename `tag_map.py` to `tag_map_fine.py` to indicate that it's not the
default tag map
* Remove duplicate generic UD tag map and load `../tag_map.py` instead
* don't split on a colon. Colon is used to attach suffixes for abbreviations
* tokenize on any of LIST_HYPHENS (except a single hyphen), not just on --
* simplify infix rules by merging similar rules
* Add correct stopwords for Slovak language
* Add SNK Tags
* Disable formatting lint for TAGS
* Add example sentences for Slovak language
* Add slovak numerals in base form
* Add lex_attrs to sk init
* Add contributor agreement
* Fix ent_ids and labels properties when id attribute used in patterns
* use set for labels
* sort end_ids for comparison in entity_ruler tests
* fixing entity_ruler ent_ids test
* add to set
Improve train CLI with a provided base model so that you can:
* add a new component
* extend an existing component
* replace an existing component
When the final model and best model are saved, reenable any disabled
components and merge the meta information to include the full pipeline
and accuracy information for all components in the base model plus the
newly added components if needed.
* Mark most Hungarian tokenizer test cases as slow
Mark most Hungarian tokenizer test cases as slow to reduce the runtime
of the test suite in ordinary usage:
* for normal tests: run default tests plus 10% of the detailed tests
* for slow tests: run all tests
* Rework to mark individual tests as slow
* move nlp processing for el pipe to batch training instead of preprocessing
* adding dev eval back in, and limit in articles instead of entities
* use pipe whenever possible
* few more small doc changes
* access dev data through generator
* tqdm description
* small fixes
* update documentation
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
* expand serialization test for custom token attribute
* add failing test for issue 4849
* define ENT_ID as attr and use in doc serialization
* fix few typos
* Adding Support for Yoruba
* test text
* Updated test string.
* Fixing encoding declaration.
* Adding encoding to stop_words.py
* Added contributor agreement and removed iranlowo.
* Added removed test files and removed iranlowo to keep project bare.
* Returned CONTRIBUTING.md to default state.
* Added delted conftest entries
* Tidy up and auto-format
* Revert CONTRIBUTING.md
Co-authored-by: Ines Montani <ines@ines.io>
* Include Doc.cats in to_bytes()
* Include Doc.cats in DocBin serialization
* Add tests for serialization of cats
Test serialization of cats for Doc and DocBin.
* Enable lex_attrs on Finnish
* Copy the Danish tokenizer rules to Finnish
Specifically, don't break hyphenated compound words
* Contributor agreement
* A new file for Finnish tokenizer rules instead of including the Danish ones
- added some tests for tokenization issues
- fixed some issues with tokenization of words with hyphen infix
- rewrote the "tokenizer_exceptions.py" file (stemming from the German version)
* Restructure Sentencizer to follow Pipe API
Restructure Sentencizer to follow Pipe API so that it can be scored with
`nlp.evaluate()`.
* Add Sentencizer pipe() test
Iterate over lr_edges until all heads are within the current sentence.
Instead of iterating over them for a fixed number of iterations, check
whether the sentence boundaries are correct for the heads and stop when
all are correct. Stop after a maximum of 10 iterations, providing a
warning in this case since the sentence boundaries may not be correct.
* Switch from mecab-python3 to fugashi
mecab-python3 has been the best MeCab binding for a long time but it's
not very actively maintained, and since it's based on old SWIG code
distributed with MeCab there's a limit to how effectively it can be
maintained.
Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not
based on the old SWIG code it's easier to keep it current and make small
deviations from the MeCab C/C++ API where that makes sense.
* Change mecab-python3 to fugashi in setup.cfg
* Change "mecab tags" to "unidic tags"
The tags come from MeCab, but the tag schema is specified by Unidic, so
it's more proper to refer to it that way.
* Update conftest
* Add fugashi link to external deps list for Japanese
* Detect more empty matches in tokenizer.explain()
* Include a few languages in explain non-slow tests
Mark a few languages in tokenizer.explain() tests as not slow so they're
run by default.
* Expose tokenizer rules as a property
Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)
Add tests and update Tokenizer API docs.
* Update Hungarian punctuation to remove empty string
Update Hungarian punctuation definitions so that `_units` does not match
an empty string.
* Use _load_special_tokenization consistently
Use `_load_special_tokenization()` and have it to handle `None` checks.
* Fix precedence of `token_match` vs. special cases
Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.
* Add `make_debug_doc()` to the Tokenizer
Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.
Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.
* Update tokenization usage docs
Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* Revert "Update Hungarian punctuation to remove empty string"
This reverts commit f0a577f7a5.
* Rework `make_debug_doc()` as `explain()`
Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.
* Handle cases with bad tokenizer patterns
Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.
* Remove unused displacy image
* Add tokenizer.explain() to usage docs
* Rework Chinese language initialization
* Create a `ChineseTokenizer` class
* Modify jieba post-processing to handle whitespace correctly
* Modify non-jieba character tokenization to handle whitespace correctly
* Add a `create_tokenizer()` method to `ChineseDefaults`
* Load lexical attributes
* Update Chinese tag_map for UD v2
* Add very basic Chinese tests
* Test tokenization with and without jieba
* Test `like_num` attribute
* Fix try_jieba_import()
* Fix zh code formatting