* Expose tokenizer rules as a property
Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)
Add tests and update Tokenizer API docs.
* Update Hungarian punctuation to remove empty string
Update Hungarian punctuation definitions so that `_units` does not match
an empty string.
* Use _load_special_tokenization consistently
Use `_load_special_tokenization()` and have it to handle `None` checks.
* Fix precedence of `token_match` vs. special cases
Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.
* Add `make_debug_doc()` to the Tokenizer
Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.
Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.
* Update tokenization usage docs
Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* Revert "Update Hungarian punctuation to remove empty string"
This reverts commit f0a577f7a5.
* Rework `make_debug_doc()` as `explain()`
Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.
* Handle cases with bad tokenizer patterns
Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.
* Remove unused displacy image
* Add tokenizer.explain() to usage docs
Update pseudo-code and algorithm description to correspond to current
tokenizer behavior.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* document token ent_kb_id
* document span kb_id
* update pipeline documentation
* prior and context weights as bool's instead
* entitylinker api documentation
* drop for both models
* finish entitylinker documentation
* small fixes
* documentation for KB
* candidate documentation
* links to api pages in code
* small fix
* frequency examples as counts for consistency
* consistent documentation about tensors returned by predict
* add entity linking to usage 101
* add entity linking infobox and KB section to 101
* entity-linking in linguistic features
* small typo corrections
* training example and docs for entity_linker
* predefined nlp and kb
* revert back to similarity encodings for simplicity (for now)
* set prior probabilities to 0 when excluded
* code clean up
* bugfix: deleting kb ID from tokens when entities were removed
* refactor train el example to use either model or vocab
* pretrain_kb example for example kb generation
* add to training docs for KB + EL example scripts
* small fixes
* error numbering
* ensure the language of vocab and nlp stay consistent across serialization
* equality with =
* avoid conflict in errors file
* add error 151
* final adjustements to the train scripts - consistency
* update of goldparse documentation
* small corrections
* push commit
* typo fix
* add candidate API to kb documentation
* update API sidebar with EntityLinker and KnowledgeBase
* remove EL from 101 docs
* remove entity linker from 101 pipelines / rephrase
* custom el model instead of existing model
* set version to 2.2 for EL functionality
* update documentation for 2 CLI scripts
* Fix typo in rule-based matching docs
* Improve token pattern checking without validation
Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.
Addresses #4070 (also related: #4063, #4100).
* Check whether top-level attributes in patterns and attr for PhraseMatcher are
in token pattern schema
* Check whether attribute value types are supported in general (as opposed to
per attribute with full validation)
* Report various internal error types (OverflowError, AttributeError, KeyError)
as ValueError with standard error messages
* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
LEMMA, and DEP
* Add error messages with relevant details on how to use validate=True or nlp()
instead of nlp.make_doc()
* Support attr=TEXT for PhraseMatcher
* Add NORM to schema
* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler
* Remove unnecessary .keys()
* Rephrase error messages
* Add another type check to Matcher
Add another type check to Matcher for more understandable error messages
in some rare cases.
* Support phrase_matcher_attr=TEXT for EntityRuler
* Don't use spacy.errors in examples and bin scripts
* Fix error code
* Auto-format
Also try get Azure pipelines to finally start a build :(
* Update errors.py
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Changed learning rate by its param name.
I've been searching for a while how the parameter learning rate was named, with `beta1` and `beta2` its easy as they are marked as code, but learning rate wasn't. I think writing the actual parameter name would be helpful.
* Signing SCA