spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-17 12:40:46 +03:00

adrianeboyd f7471abd82 Add pkuseg and serialization support for Chinese (#5308)	2020-04-18 17:01:53 +02:00

* Add pkuseg and serialization support for Chinese

Add support for pkuseg alongside jieba

* Specify model through `Language` meta:
  * split on characters (if no word segmentation packages are installed)

  ```
  Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}})
  ```

  * jieba (remains the default tokenizer if installed)

  ```
  Chinese()
  Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit
  ```

  * pkuseg

  ```
  Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}})
  ```

* The new tokenizer setting `require_pkuseg` is used to override `use_jieba` default, which is intended for models that provide a pkuseg model:

  ```
  nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}})
  nlp = Chinese() # has `use_jieba` as `True` by default
  nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer
  ```

Add support for serialization of tokenizer settings and pkuseg model, if loaded

* Add sorting for `Language.to_bytes()` serialization of `Language.meta` so that the (emptied, but still present) tokenizer metadata is in a consistent position in the serialized data

Extend tests to cover all three tokenizer configurations and serialization

* Fix from_disk and tests without jieba or pkuseg

* Load cfg first and only show error if `use_pkuseg`

* Fix blank/default initialization in serialization tests

* Explicitly initialize jieba's cache on init

* Add serialization for pkuseg pre/postprocessors

* Reformat pkuseg install message
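The precedence among the three tokenizer flags described in this commit message can be illustrated with a small standalone sketch. It does not import spaCy and is not the code from the commit: `pick_segmenter` and the `DEFAULTS` dict are hypothetical names introduced here, and the default values are assumptions read off the message (`use_jieba` is `True` by default, and `require_pkuseg` overrides it).

```python
# Hypothetical defaults, inferred from the commit message only.
DEFAULTS = {"use_jieba": True, "use_pkuseg": False, "require_pkuseg": False}


def pick_segmenter(config=None):
    """Return "jieba", "pkuseg" or "char" for a tokenizer config dict.

    Illustrative helper, not part of spaCy's API.
    """
    cfg = dict(DEFAULTS)
    cfg.update(config or {})  # user or deserialized settings win over defaults
    if cfg["require_pkuseg"]:
        # `require_pkuseg` overrides the `use_jieba` default.
        return "pkuseg"
    if cfg["use_jieba"]:
        return "jieba"
    if cfg["use_pkuseg"]:
        return "pkuseg"
    # Neither segmenter requested: fall back to splitting on characters.
    return "char"


print(pick_segmenter({"use_jieba": False, "use_pkuseg": False}))
print(pick_segmenter())
print(pick_segmenter({"pkuseg_model": "default", "require_pkuseg": True}))
```

The last call mirrors the `from_bytes` case in the message: the loaded settings carry `require_pkuseg` while `use_jieba` is still `True` from the defaults, and pkuseg is selected anyway.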
cli	Use max(uint64) for OOV lexeme rank (#5303)	2020-04-15 13:49:47 +02:00
data	Make spacy/data a package	2017-03-18 20:04:22 +01:00
displacy	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
lang	Add pkuseg and serialization support for Chinese (#5308)	2020-04-18 17:01:53 +02:00
matcher	Matcher support for Span as well as Doc (#5113)	2020-04-15 13:51:33 +02:00
ml	Replace function registries with catalogue (#4584)	2019-11-07 11:45:22 +01:00
pipeline	Add ideographic stops to sentencizer (#5263)	2020-04-08 12:58:39 +02:00
syntax	prevent updating cfg if the Model was already defined (#5078)	2020-03-03 13:58:56 +01:00
tests	Add pkuseg and serialization support for Chinese (#5308)	2020-04-18 17:01:53 +02:00
tokens	additional information if doc is empty	2020-03-09 18:08:18 +01:00
__init__.pxd	* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.	2014-10-24 02:23:42 +11:00
__init__.py	Replace function registries with catalogue (#4584)	2019-11-07 11:45:22 +01:00
__main__.py	Use latest wasabi	2019-11-04 02:38:45 +01:00
_align.pyx	Fixes typos (#4843)	2019-12-29 14:24:13 +01:00
_ml.py	Use max(uint64) for OOV lexeme rank (#5303)	2020-04-15 13:49:47 +02:00
about.py	Set version to v2.2.4	2020-03-12 11:30:41 +01:00
analysis.py	Support span._. in component decorator attrs (#4555)	2019-10-30 17:19:36 +01:00
attrs.pxd	make idx available via to_array (#5030)	2020-02-22 14:13:06 +01:00
attrs.pyx	make idx available via to_array (#5030)	2020-02-22 14:13:06 +01:00
compat.py	Replace function registries with catalogue (#4584)	2019-11-07 11:45:22 +01:00
errors.py	Matcher support for Span as well as Doc (#5113)	2020-04-15 13:51:33 +02:00
glossary.py	Update tag maps and docs for English and German (#4501)	2019-10-24 12:56:05 +02:00
gold.pxd	Merge changes from master	2019-08-21 14:18:52 +02:00
gold.pyx	Initialize all values in a2b/b2a in new align (#5063)	2020-02-27 18:43:00 +01:00
kb.pxd	rename entity frequency	2019-07-19 17:40:28 +02:00
kb.pyx	More robust set entities method in KB (#4794)	2019-12-13 10:45:29 +01:00
language.py	Add pkuseg and serialization support for Chinese (#5308)	2020-04-18 17:01:53 +02:00
lemmatizer.py	Remove duplicated branch in if/else-if statement (#5234)	2020-04-02 14:47:42 +02:00
lexeme.pxd	Use max(uint64) for OOV lexeme rank (#5303)	2020-04-15 13:49:47 +02:00
lexeme.pyx	Use max(uint64) for OOV lexeme rank (#5303)	2020-04-15 13:49:47 +02:00
lookups.py	Refactor lemmatizer and data table integration (#4353)	2019-10-01 21:36:03 +02:00
morphology.pxd	annotate kb_id through ents in doc	2019-03-22 11:36:44 +01:00
morphology.pyx	Improve Morphology errors (#4314)	2019-09-21 14:37:06 +02:00
parts_of_speech.pxd	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
parts_of_speech.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
scorer.py	Fix GoldParse init when token count differs (#5191)	2020-03-26 10:46:23 +01:00
strings.pxd	Try to fix StringStore clean up (see #1506)	2017-11-11 03:11:27 +03:00
strings.pyx	Merge branch 'master' into feature/lemmatizer	2019-03-16 13:44:22 +01:00
structs.pxd	Replace Entity/MatchStruct with SpanC (#4459)	2019-10-18 11:01:47 +02:00
symbols.pxd	make idx available via to_array (#5030)	2020-02-22 14:13:06 +01:00
symbols.pyx	make idx available via to_array (#5030)	2020-02-22 14:13:06 +01:00
tokenizer.pxd	Flush tokenizer cache when necessary (#4258)	2019-09-08 20:52:46 +02:00
tokenizer.pyx	Use inline flags in token_match patterns (#5257)	2020-04-06 13:19:04 +02:00
typedefs.pxd	Work on changing StringStore to return hashes.	2017-05-28 12:36:27 +02:00
typedefs.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
util.py	Use max(uint64) for OOV lexeme rank (#5303)	2020-04-15 13:49:47 +02:00
vectors.pyx	Raise error for inplace resize with new vector dim (#5228)	2020-04-02 10:43:13 +02:00
vocab.pxd	💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)	2019-08-22 14:21:32 +02:00
vocab.pyx	Use max(uint64) for OOV lexeme rank (#5303)	2020-04-15 13:49:47 +02:00