spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-09-21 03:19:13 +03:00

Author	SHA1	Message	Date
Ines Montani	2ed49404e3	Improve setup.py and call into Cython directly (#4952) * Improve setup.py and call into Cython directly * Add numpy to setup_requires * Improve clean helper * Update setup.cfg * Try if it builds without pyproject.toml * Update MANIFEST.in	2020-02-11 17:46:18 -05:00
Ines Montani	db55577c45	Drop Python 2.7 and 3.5 (#4828) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Revert lookups.md * Revert top-level.md * Small adjustments and docs [ci skip]	2019-12-22 01:53:56 +01:00
adrianeboyd	0c9640ced3	Replace old gold alignment with new gold alignment (#4710) Replace old gold alignment that allowed for some noise in the alignment between raw and orth with the new simpler alignment that requires that the raw and orth strings are identical except for whitespace and capitalization. * Replace old alignment with new alignment, removing `_align.pyx` and its tests * Remove all quote normalizations * Enable test for new align * Modify test case for quote normalization	2019-11-25 23:13:26 +01:00
Ines Montani	e0cf4796a5	Move lookup tables out of the core library (#4346) * Add default to util.get_entry_point * Tidy up entry points * Read lookups from entry points * Remove lookup tables and related tests * Add lookups install option * Remove lemmatizer tests * Remove logic to process language data files * Update setup.cfg	2019-10-01 00:01:27 +02:00
Ines Montani	ba186299e1	Tidy up and modernize setup and config (#4344) * Tidy up and modernize setup and config * Update setup.cfg * Re-add pyproject.toml * Delete .flake8 * Move static meta from about to setup.cfg * Update setup.cfg Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>	2019-09-30 20:10:55 +02:00
Matthew Honnibal	84837c1680	Use include_package_data in setup.py	2019-09-30 14:56:44 +02:00
Matthew Honnibal	b6ec291bde	Require preshed 3.0.2	2019-09-28 22:23:24 +02:00
Matthew Honnibal	4c383ab77e	Require newer preshed	2019-09-28 22:08:05 +02:00
Matthew Honnibal	96dd143a18	Install json.gz files	2019-09-28 16:35:39 +02:00
Ines Montani	80d554f2e2	Remove unsupported version [ci skip]	2019-09-19 01:14:42 +02:00
Ines Montani	7e3ac2cd41	Merge branch 'master' into develop	2019-09-12 15:35:25 +02:00
Ines Montani	0760c41393	Change st_ctime to st_mtime	2019-09-12 15:35:01 +02:00
Matthew Honnibal	c181a94e75	Require thinc 7.1.1	2019-09-10 20:12:24 +02:00
Matthew Honnibal	28741ff5db	Require preshed v3.0.0	2019-09-10 19:13:07 +02:00
Matthew Honnibal	4e2f07a655	Merge branch 'develop' into feature/lemmatizer	2019-08-25 21:03:25 +02:00
Matthew Honnibal	b8edc8dffb	Require thinc 7.1	2019-08-25 14:54:09 +02:00
Matthew Honnibal	c308cf3e3e	Merge branch 'master' into feature/lemmatizer	2019-08-25 13:52:27 +02:00
Matthew Honnibal	f9075a6fd1	Update to blis 0.4 and thinc 7.1	2019-08-25 13:50:47 +02:00
Wannaphong Phatthiyaphaibun	d53c3fcbc1	Add Thai Language tokenizers (#4191) Add th (pythainlp)	2019-08-25 11:35:21 +02:00
Matthew Honnibal	bcd08f20af	Merge changes from master	2019-08-21 14:18:52 +02:00
Paul O'Leary McCann	756b66b7c0	Reduce size of language data (#4141) * Move Turkish lemmas to a json file Rather than a large dict in Python source, the data is now a big json file. This includes a method for loading the json file, falling back to a compressed file, and an update to MANIFEST.in that excludes json in the spacy/lang directory. This focuses on Turkish specifically because it has the most language data in core. * Transition all lemmatizer.py files to json This covers all lemmatizer.py files of a significant size (>500k or so). Small files were left alone. None of the affected files have logic, so this was pretty straightforward. One unusual thing is that the lemma data for Urdu doesn't seem to be used anywhere. That may require further investigation. * Move large lang data to json for fr/nb/nl/sv These are the languages that use a lemmatizer directory (rather than a single file) and are larger than English. For most of these languages there were many language data files, in which case only the large ones (>500k or so) were converted to json. It may or may not be a good idea to migrate the remaining Python files to json in the future. * Fix id lemmas.json The contents of this file were originally just copied from the Python source, but that used single quotes, so it had to be properly converted to json first. * Add .json.gz to gitignore This covers the json.gz files built as part of distribution. * Add language data gzip to build process Currently this gzips data on every build; it works, but it should be changed to only gzip when the source file has been updated. * Remove Danish lemmatizer.py Missed this when I added the json. * Update to match latest explosion/srsly#9 The way gzipped json is loaded/saved in srsly changed a bit. * Only compress language data if necessary If a .json.gz file exists and is newer than the corresponding json file, it's not recompressed. * Move en/el language data to json This only affected files >500kb, which was nouns for both languages and the generic lookup table for English. * Remove empty files in Norwegian tokenizer It's unclear why, but the Norwegian (nb) tokenizer had empty files for adj/adv/noun/verb lemmas. This may have been a result of copying the structure of the English lemmatizer. This removed the files, but still creates the empty sets in the lemmatizer. That may not actually be necessary. * Remove dubious entries in English lookup.json " furthest" and " skilled" - both prefixed with a space - were in the English lookup table. That seems obviously wrong so I have removed them. * Fix small issues with en/fr lemmatizers The en tokenizer was including the removed _nouns.py file, so that's removed. The fr tokenizer is unusual in that it has a lemmatizer directory with both __init__.py and lemmatizer.py. lemmatizer.py had not been converted to load the json language data, so that was fixed. * Auto-format * Auto-format * Update srsly pin * Consistently use pathlib paths	2019-08-20 14:54:11 +02:00
Ines Montani	123929b58b	Update Thinc version pin	2019-07-12 00:15:35 +02:00
Ines Montani	cda9fc3dae	Update Thinc version pin	2019-07-11 15:53:13 +02:00
cedar101	58f06e6180	Korean support (#3901) * start lang/ko * add test codes * using natto-py * add test_ko_tokenizer_full_tags() * spaCy contributor agreement * external dependency for ko * collections.namedtuple for python version < 3.5 * case fix * tuple unpacking * add jongseong(final consonant) * apply mecab option * Remove Pipfile for now Co-authored-by: Ines Montani <ines@ines.io>	2019-07-09 22:23:16 +02:00
Ines Montani	5d6b4bb3bd	Update srsly pin	2019-06-07 11:14:32 +02:00
Ines Montani	a7fd42d937	Make jsonschema dependency optional (#3784)	2019-05-30 14:34:58 +02:00
Ines Montani	a8416c46f7	Use string name in setup.py Hopefully this will trick GitHub's parser into recognising it as a Python package and show us the dependents / "used by" statistics 🤞	2019-05-28 17:11:39 +02:00
Ines Montani	04658ebbb2	Relax jsonschema pin (closes #3628)	2019-05-03 11:58:58 +02:00
svlandeg	12d4caf341	Merge remote-tracking branch 'upstream/master' into feature/el-framework	2019-03-22 13:44:36 +01:00
Ines Montani	c81923ee30	Update wasabi pin	2019-03-22 13:31:58 +01:00
svlandeg	cf34113250	very minimal KB functionality working	2019-03-22 11:36:44 +01:00
Matthew Honnibal	02d7b41893	Fix GPU installation. Closes #3437	2019-03-20 00:59:27 +01:00
Matthew Honnibal	932d7dde1c	Fix compile error	2019-03-07 14:34:54 +01:00
Matthew Honnibal	ef3110a444	Fix compile error	2019-03-07 10:45:55 +01:00
Matthew Honnibal	fc1cc4c529	Move morphologizer under spacy/pipes	2019-03-07 01:36:26 +01:00
Matthew Honnibal	3993f41cc4	Update morphology branch from develop	2019-03-07 00:14:43 +01:00
Ines Montani	55bb570f51	Add [ja] to extras_require	2019-02-25 09:37:05 +01:00
Matthew Honnibal	55bb3cc482	Require thinc 7.0.2	2019-02-23 13:10:09 +01:00
Matthew Honnibal	808ae7521b	Require thinc 7.0.1	2019-02-16 17:29:57 +01:00
Matthew Honnibal	eea3001b98	Depend on thinc 7.0.1.dev2	2019-02-16 17:02:30 +01:00
Matthew Honnibal	f456b673d4	Require thinc 7.0.1.dev1	2019-02-16 16:22:26 +01:00
Matthew Honnibal	11e826ac3b	Require thinc v7.0.1.dev0	2019-02-16 15:47:02 +01:00
Matthew Honnibal	4c49f5f7b0	Update Thinc dependency	2019-02-15 12:39:08 +01:00
Matthew Honnibal	bed956c698	Drop regex dependency	2019-02-13 23:08:22 +11:00
Ines Montani	a9f8d17632	💫 Break up large pipeline.pyx (#3246) * Break up large pipeline.pyx * Merge some components back together * Fix typo	2019-02-10 12:14:51 +01:00
Ines Montani	5d0b60999d	Merge branch 'master' into develop	2019-02-07 20:54:07 +01:00
Ines Montani	1ea4df459d	💫 Break up large matcher.pyx (#3236) * Break up large matcher.pyx * Remove unused function	2019-02-07 19:42:25 +11:00
Paul Ganssle	021d04069a	Build metadata modernization - pyproject.toml and python_requires (#3167) * Added pyproject.toml This adds the build requirements metadata to the repo, which can be used with any build tools that implement PEP 517 and PEP 518 (e.g. pip, tox). It is no longer necessary to have the build dependencies installed when installing from source. * Add python_requires for 2.7, 3.4+ This directive specifies in the build metadata which version of CPython is supported by this version of spaCy, which pip will take into account when determining what version to download. This will allow you to safely drop old versions of Python without `pip install spaCy` breaking for those versions. * Add Python 3.7 to the trove classifiers	2019-01-16 17:42:09 +01:00
Mathieu Morey	f07b577fbd	Support CUDA 10 (#3126) * ENH support CUDA 10 * Update _instructions.jade	2019-01-09 03:10:45 +01:00
Matthew Honnibal	b7ce85a6f3	Fix packaging of json schemas	2018-12-19 13:54:02 +01:00


481 Commits