spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-08 09:41:11 +03:00

adrianeboyd f7471abd82 Add pkuseg and serialization support for Chinese (#5308)	2020-04-18 17:01:53 +02:00

* Add pkuseg and serialization support for Chinese

Add support for pkuseg alongside jieba

* Specify model through `Language` meta:

  * split on characters (if no word segmentation packages are installed)

  ```
  Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}})
  ```

  * jieba (remains the default tokenizer if installed)

  ```
  Chinese()
  Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit
  ```

  * pkuseg

  ```
  Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}})
  ```

* The new tokenizer setting `require_pkuseg` is used to override `use_jieba` default, which is intended for models that provide a pkuseg model:

```
nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}})
nlp = Chinese() # has `use_jieba` as `True` by default
nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer
```

Add support for serialization of tokenizer settings and pkuseg model, if loaded

* Add sorting for `Language.to_bytes()` serialization of `Language.meta` so that the (emptied, but still present) tokenizer metadata is in a consistent position in the serialized data

Extend tests to cover all three tokenizer configurations and serialization

* Fix from_disk and tests without jieba or pkuseg

* Load cfg first and only show error if `use_pkuseg`

* Fix blank/default initialization in serialization tests

* Explicitly initialize jieba's cache on init

* Add serialization for pkuseg pre/postprocessors

* Reformat pkuseg install message

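The precedence among the three tokenizer settings described in the commit message above can be illustrated with a small stand-alone sketch. It does not import spaCy and is not the library's implementation: `choose_segmenter`, `split_on_characters`, and the `jieba_installed` parameter are hypothetical names introduced here, and the only behavior assumed is what the message states (jieba is the default when installed, `require_pkuseg` overrides `use_jieba`, and character splitting is the fallback).

```python
def choose_segmenter(config, jieba_installed=True):
    """Return "jieba", "pkuseg", or "char" for a tokenizer config dict.

    Hypothetical helper: it mirrors the precedence described in the commit
    message, not spaCy's actual code.
    """
    # Assumption: `use_jieba` defaults to True only when jieba is available.
    use_jieba = config.get("use_jieba", jieba_installed)
    use_pkuseg = config.get("use_pkuseg", False)

    # `require_pkuseg` overrides the `use_jieba` default.
    if config.get("require_pkuseg", False):
        use_jieba = False
        use_pkuseg = True

    if use_jieba:
        return "jieba"
    if use_pkuseg:
        return "pkuseg"
    # Neither segmenter is enabled: fall back to splitting on characters.
    return "char"


def split_on_characters(text):
    """Fallback segmentation sketch: each character becomes its own token."""
    return list(text)
```

With these definitions, an empty config selects jieba, `{"use_jieba": False, "use_pkuseg": False}` selects character splitting, and a config carrying `"require_pkuseg": True` selects pkuseg even though `use_jieba` would otherwise default to `True`.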
ar	Revert #4334	2019-09-29 17:32:12 +02:00
bn	Revert #4334	2019-09-29 17:32:12 +02:00
ca	Revert #4334	2019-09-29 17:32:12 +02:00
da	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
de	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
el	Revert #4334	2019-09-29 17:32:12 +02:00
en	Add tokenizer explain() debugging method (#4596)	2019-11-20 13:07:25 +01:00
es	Revert #4334	2019-09-29 17:32:12 +02:00
eu	Add __init__.py to eu and hy tests (#5278)	2020-04-08 20:03:06 +02:00
fi	add two abbreviations and some additional unit tests (#5040)	2020-02-22 14:12:32 +01:00
fr	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
ga	Revert #4334	2019-09-29 17:32:12 +02:00
he	Revert #4334	2019-09-29 17:32:12 +02:00
hu	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
hy	Add __init__.py to eu and hy tests (#5278)	2020-04-08 20:03:06 +02:00
id	Revert #4334	2019-09-29 17:32:12 +02:00
it	Revert #4334	2019-09-29 17:32:12 +02:00
ja	Revert #4334	2019-09-29 17:32:12 +02:00
ko	Revert #4334	2019-09-29 17:32:12 +02:00
lb	Tidy up and auto-format	2019-12-21 19:04:17 +01:00
lt	Improve Lithuanian tokenization (#5205)	2020-03-25 11:28:12 +01:00
nb	Revert #4334	2019-09-29 17:32:12 +02:00
nl	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
pl	Revert #4334	2019-09-29 17:32:12 +02:00
pt	Revert #4334	2019-09-29 17:32:12 +02:00
ro	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
ru	Revert #4334	2019-09-29 17:32:12 +02:00
sr	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
sv	Tidy up and auto-format [ci skip]	2019-10-24 16:20:48 +02:00
th	Revert #4334	2019-09-29 17:32:12 +02:00
tr	Move lookup tables out of the core library (#4346)	2019-10-01 00:01:27 +02:00
tt	Add trailing whitespace to multiline test text (#4877)	2020-01-06 14:58:59 +01:00
uk	Revert #4334	2019-09-29 17:32:12 +02:00
ur	Revert #4334	2019-09-29 17:32:12 +02:00
yo	Adding support for Yoruba Language (#4614)	2019-12-21 14:11:50 +01:00
zh	Add pkuseg and serialization support for Chinese (#5308)	2020-04-18 17:01:53 +02:00
__init__.py	Revert #4334	2019-09-29 17:32:12 +02:00
test_attrs.py	Tidy up and auto-format	2019-12-21 19:04:17 +01:00
test_initialize.py	Adding support for Yoruba Language (#4614)	2019-12-21 14:11:50 +01:00