* Initialize lower flag explicitly
* Handle whitespace words from GoldParse correctly when creating raw
text with orth variants
* Return the text with original casing if anything goes wrong
* `debug-data`: determine coverage of provided vectors
* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization
* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file
* `train`:
  * if training on GPU, only run evaluation/timing on CPU in the first
    iteration
  * if training is aborted, exit with a non-zero exit status
* simplify creation of KB by skipping dim reduction
* small fixes to train EL example script
* add KB creation and NEL training example scripts to example section
* update descriptions of example scripts in the documentation
* move `wiki_entity_linking` folder from `bin` to `projects`
* remove test for wiki NEL functionality that is being moved
Reconstruction of the original PR #4697 by @MiniLau.
Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
Improve GoldParse NER alignment by including all cases where the start
and end of the NER span can be aligned, regardless of internal
tokenization differences.
To do this, convert BILUO tags to character offsets, check start/end
alignment with `doc.char_span()`, and assign the BILUO tags for the
aligned spans. Alignment for `O/-` tags is handled through the
one-to-one and multi alignments.
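The alignment idea above can be sketched in plain Python. This is a minimal illustration, not the code from the commit: the helper names (`token_offsets`, `biluo_to_char_spans`, `align_entities`) are hypothetical, dictionary lookups over token boundaries stand in for `doc.char_span()`, and tokens outside an aligned span are simply left as `O` rather than going through the one-to-one and multi alignments.

```python
def token_offsets(words, spaces):
    """Return (start_char, end_char) for each token, given trailing-space flags."""
    offsets = []
    pos = 0
    for word, space in zip(words, spaces):
        offsets.append((pos, pos + len(word)))
        pos += len(word) + (1 if space else 0)
    return offsets


def biluo_to_char_spans(words, spaces, tags):
    """Convert BILUO tags over the gold tokens to (start_char, end_char, label)."""
    spans = []
    start = None
    for (tok_start, tok_end), tag in zip(token_offsets(words, spaces), tags):
        if tag.startswith("U-"):
            spans.append((tok_start, tok_end, tag[2:]))
        elif tag.startswith("B-"):
            start = tok_start
        elif tag.startswith("L-") and start is not None:
            spans.append((start, tok_end, tag[2:]))
            start = None
    return spans


def align_entities(char_spans, words, spaces):
    """Assign BILUO tags on another tokenization wherever span edges line up."""
    offsets = token_offsets(words, spaces)
    start_to_token = {start: i for i, (start, _) in enumerate(offsets)}
    end_to_token = {end: i for i, (_, end) in enumerate(offsets)}
    tags = ["O"] * len(words)  # simplification: non-entity tokens default to "O"
    for start_char, end_char, label in char_spans:
        first = start_to_token.get(start_char)
        last = end_to_token.get(end_char)
        if first is None or last is None or first > last:
            continue  # span edges fall inside a token: cannot be aligned
        if first == last:
            tags[first] = "U-" + label
        else:
            tags[first] = "B-" + label
            for i in range(first + 1, last):
                tags[i] = "I-" + label
            tags[last] = "L-" + label
    return tags


# Gold tokenization keeps "Smith-Jones" whole; the other one splits it.
gold_words = ["Mr.", "John", "Smith-Jones", "spoke"]
gold_spaces = [True, True, True, False]
gold_tags = ["O", "B-PERSON", "L-PERSON", "O"]

other_words = ["Mr.", "John", "Smith", "-", "Jones", "spoke"]
other_spaces = [True, True, False, False, True, False]

char_spans = biluo_to_char_spans(gold_words, gold_spaces, gold_tags)
aligned = align_entities(char_spans, other_words, other_spaces)
```

The entity is still recovered even though the two tokenizations disagree inside the span, because only its start and end offsets need to coincide with token boundaries.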
* The embedding visualization link is broken
The first link seems reasonable for now, unless someone has an updated embedding visualization they want to share.
* contributor agreement
* Update Mlawrence95.md
* Update website/docs/usage/examples.md
Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Previously, pipelines with shared tok2vec weights would call the
tok2vec backprop callback multiple times, once for each pipeline
component. This caused errors for PyTorch, and was inefficient.
Instead, accumulate the gradient for all but one component, and just
call the callback once.
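A minimal sketch of this accumulate-then-call-once pattern follows. It is an illustration under assumptions rather than the actual change: `SharedBackprop` and `receive` are hypothetical names, and gradients are plain lists of floats instead of arrays.

```python
class SharedBackprop:
    """Collect gradients from several components and backprop only once."""

    def __init__(self, backprop, n_listeners):
        self.backprop = backprop        # the shared tok2vec backprop callback
        self.n_listeners = n_listeners  # number of components sharing the weights
        self._grad = None
        self._received = 0

    def receive(self, d_output):
        # Accumulate the incoming gradient element-wise.
        if self._grad is None:
            self._grad = list(d_output)
        else:
            self._grad = [a + b for a, b in zip(self._grad, d_output)]
        self._received += 1
        # Only the last component triggers the real callback.
        if self._received == self.n_listeners:
            total, self._grad, self._received = self._grad, None, 0
            return self.backprop(total)
        return None


calls = []
shared = SharedBackprop(lambda grad: calls.append(grad), n_listeners=3)
shared.receive([1.0, 2.0])
shared.receive([0.5, 0.5])
calls_before_last = list(calls)  # still empty: nothing has been backpropagated
shared.receive([1.0, 1.0])       # third component: callback runs once
```

After the third call, `calls` holds a single summed gradient, so the underlying callback has run exactly once per update.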
Modify jieba install message to instruct the user to use
`ChineseDefaults.use_jieba = False` so that it's possible to load
pkuseg-only models without jieba installed.
* Add pkuseg and serialization support for Chinese
Add support for pkuseg alongside jieba
* Specify model through `Language` meta:
* split on characters (if no word segmentation packages are installed)
```
Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}})
```
* jieba (remains the default tokenizer if installed)
```
Chinese()
Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit
```
* pkuseg
```
Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}})
```
* The new tokenizer setting `require_pkuseg` overrides the `use_jieba`
default and is intended for models that provide a pkuseg model:
```
nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}})
nlp = Chinese() # has `use_jieba` as `True` by default
nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer
```
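The precedence between these settings can be summarized with a small sketch. The function `choose_segmenter` is hypothetical and only mirrors the behavior described above; it assumes `use_jieba` defaults to `True` and ignores whether jieba or pkuseg is actually installed.

```python
def choose_segmenter(config):
    """Return which segmenter the described settings would select."""
    use_jieba = config.get("use_jieba", True)     # assumed default, per the notes above
    use_pkuseg = config.get("use_pkuseg", False)
    if config.get("require_pkuseg", False):
        # `require_pkuseg` wins over the `use_jieba` default
        use_jieba = False
        use_pkuseg = True
    if use_jieba:
        return "jieba"
    if use_pkuseg:
        return "pkuseg"
    return "char"  # fall back to splitting on characters


default_choice = choose_segmenter({})
char_choice = choose_segmenter({"use_jieba": False, "use_pkuseg": False})
pkuseg_choice = choose_segmenter(
    {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}
)
required_choice = choose_segmenter(
    {"pkuseg_model": "default", "use_jieba": True, "require_pkuseg": True}
)
```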
Add support for serialization of tokenizer settings and pkuseg model, if
loaded
* Add sorting for `Language.to_bytes()` serialization of `Language.meta`
so that the (emptied, but still present) tokenizer metadata is in a
consistent position in the serialized data
Extend tests to cover all three tokenizer configurations and
serialization
* Fix from_disk and tests without jieba or pkuseg
* Load cfg first and only show the error if `use_pkuseg` is set
* Fix blank/default initialization in serialization tests
* Explicitly initialize jieba's cache on init
* Add serialization for pkuseg pre/postprocessors
* Reformat pkuseg install message