* Refactor Chinese tokenizer configuration
Refactor `ChineseTokenizer` configuration so that it uses a single
`segmenter` setting to choose between character segmentation, jieba, and
pkuseg.
* replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting
`segmenter` with the supported values: `char`, `jieba`, `pkuseg`
* make the default segmenter plain character segmentation `char` (no
additional libraries required)
* Fix Chinese serialization test to use char default
* Warn if attempting to customize other segmenter
Add a warning if `Chinese.pkuseg_update_user_dict` is called when
another segmenter is selected.
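
The move from three boolean flags to a single `segmenter` setting, together with the new warning, can be sketched roughly as follows. This is an illustration only, not the code from this change: `migrate_segmenter_config` and `update_user_dict` are hypothetical helper names, and the precedence between the old flags when several are set is an assumption.

```python
import warnings

SUPPORTED_SEGMENTERS = ("char", "jieba", "pkuseg")


def migrate_segmenter_config(old_cfg):
    """Map the old boolean flags onto one `segmenter` value (hypothetical helper)."""
    if "segmenter" in old_cfg:
        segmenter = old_cfg["segmenter"]
    elif old_cfg.get("require_pkuseg"):
        # Assumption: require_pkuseg overrides the other flags.
        segmenter = "pkuseg"
    elif old_cfg.get("use_jieba"):
        segmenter = "jieba"
    elif old_cfg.get("use_pkuseg"):
        segmenter = "pkuseg"
    else:
        # New default: plain character segmentation, no extra libraries.
        segmenter = "char"
    if segmenter not in SUPPORTED_SEGMENTERS:
        raise ValueError(f"Unsupported segmenter: {segmenter!r}")
    return {"segmenter": segmenter}


def update_user_dict(cfg, words):
    """Warn rather than silently do nothing when pkuseg is not selected."""
    if cfg["segmenter"] != "pkuseg":
        warnings.warn(
            f"Cannot update the pkuseg user dict: segmenter is {cfg['segmenter']!r}"
        )
        return False
    # A real implementation would hand `words` to pkuseg here.
    return True
```

Collapsing the flags into one value removes the ambiguous combinations (for example `use_jieba` and `use_pkuseg` both set) that the old configuration allowed.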
* Update website models for v2.3.0
* Add docs for Chinese word segmentation
* Tighten up Chinese docs section
* Merge branch 'master' into docs/v2.3.0 [ci skip]
* Auto-format and update version
* Update matcher.md
* Update languages and sorting
* Typo in landing page
* Infobox about token_match behavior
* Add meta and basic docs for Japanese
* POS -> TAG in models table
* Add info about lookups for normalization
* Updates to API docs for v2.3
* Update adding norm exceptions for adding languages
* Add --omit-extra-lookups to CLI API docs
* Add initial draft of "What's New in v2.3"
* Add new in v2.3 tags to Chinese and Japanese sections
* Add tokenizer to migration section
* Add new in v2.3 flags to init-model
* Typo
* More what's new in v2.3
Co-authored-by: Ines Montani <ines@ines.io>
* make disable_pipes deprecated in favour of the new toggle_pipes
* rewrite disable_pipes statements
* update documentation
* remove bin/wiki_entity_linking folder
* one more fix
* remove deprecated link to documentation
* few more doc fixes
* add note about name change to the docs
* restore original disable_pipes
* small fixes
* fix typo
* fix error number to W096
* rename to select_pipes
* also make changes to the documentation
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* simplify creation of KB by skipping dim reduction
* small fixes to train EL example script
* add KB creation and NEL training example scripts to example section
* update descriptions of example scripts in the documentation
* moving wiki_entity_linking folder from bin to projects
* remove test for wiki NEL functionality that is being moved
* The embedding visualization link is broken
The first link seems reasonable for now, unless someone has an updated embedding visualization they want to share.
* contributor agreement
* Update Mlawrence95.md
* Update website/docs/usage/examples.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Revert changes to priority of `token_match` so that it has priority
over all other tokenizer patterns
* Add lookahead and potentially slow lookbehind back to the default URL
pattern
* Expand character classes in URL pattern to improve matching around
lookaheads and lookbehinds related to #4882
* Revert changes to Hungarian tokenizer
* Revert (xfail) several URL tests to their status before #4374
* Update `tokenizer.explain()` and docs accordingly
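
To illustrate what it means for `token_match` to take priority over all other tokenizer patterns, here is a toy sketch. It is not the real tokenizer: the `tokenize` function and its single punctuation-suffix rule are simplified assumptions made for this example.

```python
import re

# Stand-in for the "other" tokenizer patterns: trailing punctuation is split off.
SUFFIX_RE = re.compile(r"^(.*?)([.,!?]*)$")


def tokenize(text, token_match=None):
    """Toy whitespace tokenizer in which `token_match` is checked first."""
    tokens = []
    for chunk in text.split():
        # token_match wins: a matching chunk is kept whole and never split.
        if token_match is not None and token_match(chunk):
            tokens.append(chunk)
            continue
        stem, punct = SUFFIX_RE.match(chunk).groups()
        if stem:
            tokens.append(stem)
        tokens.extend(punct)  # one token per punctuation character
    return tokens


abbreviations = re.compile(r"^(?:e\.g\.|i\.e\.)$").match
print(tokenize("see e.g. this.", token_match=abbreviations))
print(tokenize("see e.g. this."))
```

With `token_match` supplied, `e.g.` survives as one token; without it, the suffix rule splits off the final period.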
* Fix ent_ids and labels properties when id attribute used in patterns
* use set for labels
* sort ent_ids for comparison in entity_ruler tests
* fixing entity_ruler ent_ids test
* add to set
* Run make_doc optimistically if using phrase matcher patterns.
* remove unused coveragerc I was testing with
* format
* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.
* Removing old add_patterns function
* Fixing spacing
* Make sure token_patterns are loaded as well; previously the generator was being emptied in from_disk
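
The generator problem mentioned in the last item can be reproduced in isolation. The sketch below assumes patterns are dicts with a `"pattern"` key that is a string for phrase patterns and a list for token patterns; `split_patterns` is a hypothetical helper, not the code from this change.

```python
def split_patterns(patterns):
    """Separate token patterns (lists) from phrase patterns (strings)."""
    # Materialize first: a generator can only be consumed once, so a second
    # pass over it would silently yield nothing.
    patterns = list(patterns)
    phrase_patterns = [p for p in patterns if isinstance(p["pattern"], str)]
    token_patterns = [p for p in patterns if isinstance(p["pattern"], list)]
    return token_patterns, phrase_patterns


def read_patterns():
    """Stand-in for a lazy reader that yields patterns one at a time."""
    yield {"label": "ORG", "pattern": "Apple"}
    yield {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]}


token_patterns, phrase_patterns = split_patterns(read_patterns())
print(len(token_patterns), len(phrase_patterns))
```

Without the `list(...)` call, the first comprehension would exhaust the generator and the token patterns would come back empty.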