spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-16 20:20:41 +03:00


Paul O'Leary McCann 756b66b7c0	Reduce size of language data (#4141)	2019-08-20 14:54:11 +02:00

* Move Turkish lemmas to a json file

  Rather than a large dict in Python source, the data is now a big json file. This includes a method for loading the json file, falling back to a compressed file, and an update to MANIFEST.in that excludes json in the spacy/lang directory.

  This focuses on Turkish specifically because it has the most language data in core.

* Transition all lemmatizer.py files to json

  This covers all lemmatizer.py files of a significant size (>500k or so). Small files were left alone.

  None of the affected files have logic, so this was pretty straightforward.

  One unusual thing is that the lemma data for Urdu doesn't seem to be used anywhere. That may require further investigation.

* Move large lang data to json for fr/nb/nl/sv

  These are the languages that use a lemmatizer directory (rather than a single file) and are larger than English.

  For most of these languages there were many language data files, in which case only the large ones (>500k or so) were converted to json. It may or may not be a good idea to migrate the remaining Python files to json in the future.

* Fix id lemmas.json

  The contents of this file were originally just copied from the Python source, but that used single quotes, so it had to be properly converted to json first.

* Add .json.gz to gitignore

  This covers the json.gz files built as part of distribution.

* Add language data gzip to build process

  Currently this gzip data on every build; it works, but it should be changed to only gzip when the source file has been updated.

* Remove Danish lemmatizer.py

  Missed this when I added the json.

* Update to match latest explosion/srsly#9

  The way gzipped json is loaded/saved in srsly changed a bit.

* Only compress language data if necessary

  If a .json.gz file exists and is newer than the corresponding json file, it's not recompressed.

* Move en/el language data to json

  This only affected files >500kb, which was nouns for both languages and the generic lookup table for English.

* Remove empty files in Norwegian tokenizer

  It's unclear why, but the Norwegian (nb) tokenizer had empty files for adj/adv/noun/verb lemmas. This may have been a result of copying the structure of the English lemmatizer.

  This removed the files, but still creates the empty sets in the lemmatizer. That may not actually be necessary.

* Remove dubious entries in English lookup.json

  " furthest" and " skilled" - both prefixed with a space - were in the English lookup table. That seems obviously wrong so I have removed them.

* Fix small issues with en/fr lemmatizers

  The en tokenizer was including the removed _nouns.py file, so that's removed.

  The fr tokenizer is unusual in that it has a lemmatizer directory with both __init__.py and lemmatizer.py. lemmatizer.py had not been converted to load the json language data, so that was fixed.

* Auto-format

* Auto-format

* Update srsly pin

* Consistently use pathlib paths
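
The commit message above describes two behaviours: loading a json file with a fallback to its compressed copy, and skipping recompression when the .json.gz file is already up to date. A minimal sketch of both follows, using only the Python standard library. The function names `gzip_if_stale` and `load_language_data` are hypothetical and chosen for illustration; the commit itself relies on srsly for gzipped json, and this is not the code it added to spaCy.

```python
import gzip
import json
from pathlib import Path


def gzip_if_stale(json_path):
    """Write <name>.json.gz next to json_path unless a current copy exists.

    Returns True if the file was (re)compressed, False if it was skipped.
    Hypothetical helper, not part of spaCy or srsly.
    """
    json_path = Path(json_path)
    gz_path = json_path.with_name(json_path.name + ".gz")
    # Assumption: "up to date" means the .gz modification time is at least
    # as recent as that of the source json file.
    if gz_path.exists() and gz_path.stat().st_mtime >= json_path.stat().st_mtime:
        return False
    with json_path.open("rb") as src, gzip.open(gz_path, "wb") as dst:
        dst.write(src.read())
    return True


def load_language_data(json_path):
    """Load json_path, falling back to json_path + '.gz' if it is missing.

    Hypothetical helper, not part of spaCy or srsly.
    """
    json_path = Path(json_path)
    if json_path.exists():
        with json_path.open("r", encoding="utf8") as f:
            return json.load(f)
    gz_path = json_path.with_name(json_path.name + ".gz")
    with gzip.open(gz_path, "rt", encoding="utf8") as f:
        return json.load(f)
```

In a source checkout the plain json file would be read directly, while a distribution that ships only the .json.gz files would take the fallback path.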
cli	Tidy up and auto-format	2019-08-18 15:09:16 +02:00
data	Make spacy/data a package	2017-03-18 20:04:22 +01:00
displacy	Custom entity render (#4117)	2019-08-16 18:39:25 +02:00
lang	Reduce size of language data (#4141)	2019-08-20 14:54:11 +02:00
matcher	💫 Fix issue #3839: Incorrect entity IDs from Matcher with operators (#3949)	2019-07-11 12:55:11 +02:00
pipeline	CLI scripts for entity linking (wikipedia & generic) (#4091)	2019-08-13 15:38:59 +02:00
syntax	💫 Improve error message when model.from_bytes() dies (#4014)	2019-07-24 11:27:34 +02:00
tests	Tidy up and auto-format	2019-08-18 15:09:16 +02:00
tokens	CLI scripts for entity linking (wikipedia & generic) (#4091)	2019-08-13 15:38:59 +02:00
__init__.pxd	* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.	2014-10-24 02:23:42 +11:00
__init__.py	Fix formatting (hopefully also restarts build properly)	2019-03-20 09:55:45 +01:00
__main__.py	Update __main__.py	2019-03-20 09:43:26 +01:00
_align.pyx	Improve alignment around quotes	2018-08-16 01:04:34 +02:00
_ml.py	Tidy up and auto-format	2019-08-18 15:09:16 +02:00
about.py	Set version to v2.1.8	2019-08-07 13:53:58 +02:00
attrs.pxd	Fix attrs alignment	2019-07-12 17:59:47 +02:00
attrs.pyx	ensure Span.as_doc keeps the entity links + unit test	2019-06-25 15:28:51 +02:00
compat.py	Fix symlink creation to show error message on failure (#3589) (resolves #3307))	2019-04-16 11:58:31 +02:00
errors.py	Tidy up and auto-format	2019-08-18 15:09:16 +02:00
glossary.py	Update glossary.py to match information found in documentation (#3704) (closes ##3679)	2019-05-10 14:23:20 +02:00
gold.pxd	fixes in kb and gold	2019-07-17 17:18:26 +02:00
gold.pyx	WIP: Extending debug-data (#4114)	2019-08-16 10:52:46 +02:00
kb.pxd	rename entity frequency	2019-07-19 17:40:28 +02:00
kb.pyx	CLI scripts for entity linking (wikipedia & generic) (#4091)	2019-08-13 15:38:59 +02:00
language.py	Raise error if annotation dict in simple training style has unexpected keys #4074 (#4079)	2019-08-06 11:01:25 +02:00
lemmatizer.py	Fix inconsistant lemmatizer issue #3484 (#3646)	2019-05-04 18:16:03 +02:00
lexeme.pxd	💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325)	2019-02-24 21:13:51 +01:00
lexeme.pyx	Tidy up property code style (#3391)	2019-03-11 15:59:09 +01:00
morphology.pxd	annotate kb_id through ents in doc	2019-03-22 11:36:44 +01:00
morphology.pyx	Fix issue #3551: Upper case lemmas	2019-04-16 12:27:15 +02:00
parts_of_speech.pxd	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
parts_of_speech.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
scorer.py	Tidy up and auto-format	2019-08-18 15:09:16 +02:00
strings.pxd	Try to fix StringStore clean up (see #1506)	2017-11-11 03:11:27 +03:00
strings.pyx	💫 Make serialization methods consistent (#3385)	2019-03-10 19:16:45 +01:00
structs.pxd	rename entity frequency	2019-07-19 17:40:28 +02:00
symbols.pxd	Fix symbol alignment	2019-07-12 17:48:38 +02:00
symbols.pyx	ensure Span.as_doc keeps the entity links + unit test	2019-06-25 15:28:51 +02:00
tokenizer.pxd	Disable tokenizer cache for special-cases. Fixes #1250	2017-10-24 16:08:05 +02:00
tokenizer.pyx	tokenizer doc fix	2019-07-15 11:19:34 +02:00
typedefs.pxd	Work on changing StringStore to return hashes.	2017-05-28 12:36:27 +02:00
typedefs.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
util.py	Reduce size of language data (#4141)	2019-08-20 14:54:11 +02:00
vectors.pyx	Update Vectors.find docs [ci skip]	2019-03-16 17:10:57 +01:00
vocab.pxd	💫 Small efficiency fixes to tokenizer (#2587)	2018-07-24 23:35:54 +02:00
vocab.pyx	Tidy up property code style (#3391)	2019-03-11 15:59:09 +01:00