spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-10-04 10:56:45 +03:00

Author	SHA1	Message	Date
Ines Montani	2f45bd94c0	Auto-formatting	2019-02-12 18:30:11 +01:00
Ines Montani	0184a95340	Merge branch 'master' into develop	2019-02-12 18:29:24 +01:00
Akhilesh	a78db10941	add kannada support (#3264) * add kannada support * add few more stop words * add support for Kannada Language	2019-02-12 18:28:39 +01:00
Ines Montani	b589b945db	Fix PhraseMatcher pickling and length (resolves #3248) (#3252)	2019-02-12 18:27:54 +01:00
Ines Montani	483dddc9bc	💫 Add token match pattern validation via JSON schemas (#3244) * Add custom MatchPatternError * Improve validators and add validation option to Matcher * Adjust formatting * Never validate in Matcher within PhraseMatcher If we do decide to make validate default to True, the PhraseMatcher's Matcher shouldn't ever validate. Here, we create the patterns automatically anyways (and it's currently unclear whether the validation has performance impacts at a very large scale).	2019-02-13 01:47:26 +11:00
Ines Montani	ad2a514cdf	Show warning if phrase pattern Doc was overprocessed (#3255) In most cases, the PhraseMatcher will match on the verbatim token text or as of v2.1, sometimes the lowercase text. This means that we only need a tokenized Doc, without any other attributes. If phrase patterns are created by processing large terminology lists with the full `nlp` object, this easily can make things a lot slower, because all components will be applied, even if we don't actually need the attributes they set (like part-of-speech tags, dependency labels). The warning message also includes a suggestion to use nlp.make_doc or nlp.tokenizer.pipe for even faster processing. For now, the validation has to be enabled explicitly by setting validate=True.	2019-02-13 01:45:31 +11:00
Matthew Honnibal	6ec834dc72	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-02-13 01:14:44 +11:00
Matthew Honnibal	43fa039d96	xfail regression test for model labels	2019-02-13 01:14:26 +11:00
Matthew Honnibal	bc300d4e31	Add test for issue 3209	2019-02-13 01:13:01 +11:00
Ines Montani	34a3cc26a9	Add xfailing test for reverse pattern (see #1971)	2019-02-12 14:49:59 +01:00
Ines Montani	fe39fd4d13	Make warning tests more explicit	2019-02-10 14:02:19 +01:00
Ines Montani	a9f8d17632	💫 Break up large pipeline.pyx (#3246) * Break up large pipeline.pyx * Merge some components back together * Fix typo	2019-02-10 12:14:51 +01:00
Ines Montani	e7593b791e	Fix import	2019-02-08 20:50:52 +01:00
Ines Montani	0754b848fe	Actually xfail test for #1971	2019-02-08 20:50:35 +01:00
Ines Montani	414a69b736	Add xfailing test (see #1971, #2675, #2671)	2019-02-08 20:50:01 +01:00
Ines Montani	ea07f3022e	Only run noun chunks iterator in Span if available (closes #3199 )	2019-02-08 18:33:16 +01:00
Ines Montani	ff36b14cb2	Fix whitespace	2019-02-08 18:31:31 +01:00
Ines Montani	f4ce7bb7e9	Fix typo and deprecation message (resolves #3195 ) [ci skip]	2019-02-08 18:09:23 +01:00
Ines Montani	694139aad3	Fix formatting [ci skip]	2019-02-08 16:32:36 +01:00
Ines Montani	2898768757	Remove unused attribute [ci skip]	2019-02-08 16:31:30 +01:00
Ines Montani	586c56fc6c	Tidy up regression tests	2019-02-08 15:51:13 +01:00
Ines Montani	25602c794c	Tidy up and fix small bugs and typos	2019-02-08 14:14:49 +01:00
Ines Montani	9e652afa4b	Merge branch 'master' into develop	2019-02-08 13:28:09 +01:00
Björn Lennartsson	647f0140c7	Fixed tag map for Swedish Talbanken (#3186)	2019-02-08 14:28:59 +11:00
Stanisław Giziński	1448ad100c	Improved polish tokenizer and stop words. (#2974) * Improved stop words list * Removed some wrong stop words form list * Improved stop words list * Removed some wrong stop words form list * Improved Polish Tokenizer (#38) * Add tests for polish tokenizer * Add polish tokenizer exceptions * Don't split any words containing hyphens * Fix test case with wrong model answer * Remove commented out line of code until better solution is found * Add source srx' license * Rename exception_list.py to match spaCy conventionality * Add a brief explanation of where the exception list comes from * Add newline after reach exception * Rename COPYING.txt to LICENSE * Delete old files * Add header to the license * Agreements signed * Stanisław Giziński agreement * Krzysztof Kowalczyk - signed agreement * Mateusz Olko agreement * Add DoomCoder's contributor agreement * Improve like number checking in polish lang * like num tests added * all from SI system added * Final licence and removed splitting exceptions * Added polish stop words to LEX_ATTRA * Add encoding info to pl tokenizer exceptions	2019-02-08 14:27:21 +11:00
Ines Montani	402d133c90	Add Ukrainian unicode	2019-02-07 21:11:58 +01:00
Ines Montani	e2d93e4852	Merge branch 'master' into develop	2019-02-07 21:10:08 +01:00
Ines Montani	2499da97e8	Format	2019-02-07 21:07:02 +01:00
Julia Makogon	b41d64825a	Ukrainian language added. Small fixes in Russian (#3241) * Classes for Ukrainian; small fix in Russian. * Contributor agreement	2019-02-07 21:05:11 +01:00
Ines Montani	77efee0295	Auto-format	2019-02-07 21:00:04 +01:00
Ines Montani	5d0b60999d	Merge branch 'master' into develop	2019-02-07 20:54:07 +01:00
Matthew Honnibal	dbeebfa3a2	Set version to v2.1.0a7.dev1	2019-02-08 01:54:01 +11:00
Ines Montani	338d659bd0	Store JSON schemas in Python and tidy up (#3235)	2019-02-07 19:44:31 +11:00
Ines Montani	1ea4df459d	💫 Break up large matcher.pyx (#3236) * Break up large matcher.pyx * Remove unused function	2019-02-07 19:42:25 +11:00
Ines Montani	a9bf5d9fd8	Add xfailing test for set value with operator [ci skip]	2019-02-06 13:40:11 +01:00
Ines Montani	e51a238b3f	Auto-format	2019-02-06 13:32:18 +01:00
Ines Montani	f25bd9f5e4	Add gold.spans_from_biluo_tags helper (#3227)	2019-02-06 21:50:26 +11:00
Ines Montani	5e16490d9d	Fix default argument in TextCategorizer.Model (resolves #3221 )	2019-02-05 12:33:47 +01:00
Ines Montani	89ad095900	Fix whitespace	2019-02-05 12:32:20 +01:00
Sofie	9745b0d523	Improve Italian & Urdu tokenization accuracy (#3228) ## Description 1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), bringing F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added unit test to check this behaviour. 2. Added specific Urdu punctuation character as suffix, improving F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added unit test to check this behaviour. ### Types of change Enhancement of Italian & Urdu tokenization ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-02-04 22:39:25 +01:00
Sofie	a3efa3e8d9	Improve Catalan tokenization accuracy (#3225) * small hyphen clean up for French * catalan infix similar to french	2019-02-04 20:37:19 +11:00
Ines Montani	e00680a33a	Remove unused outdated file	2019-02-01 11:39:48 +01:00
Matthew Honnibal	27e3f98cae	Set version to v2.1.0a7.dev0	2019-02-01 18:06:34 +11:00
Sofie	46dfe773e1	Replacing regex library with re to increase tokenization speed (#3218) * replace unicode categories with raw list of code points * simplifying ranges * fixing variable length quotes * removing redundant regular expression * small cleanup of regexp notations * quotes and alpha as ranges instead of alterations * removed most regexp dependencies and features * exponential backtracking - unit tests * rewrote expression with pathological backtracking * disabling double hyphen tests for now * test additional variants of repeating punctuation * remove regex and redundant backslashes from load_reddit script * small typo fixes * disable double punctuation test for russian * clean up old comments * format block code * final cleanup * naming consistency * french strings as unicode for python 2 support * french regular expression case insensitive	2019-02-01 18:05:22 +11:00
Amandine Périnet	d570e75dbb	Improving the French lookup dictionnary for ambiguous words (#3185) * modifying FR lookup to remove ambiguity and adding lookup vocab to FR files * modifying FR lookup to remove ambiguity and adding lookup vocab to FR files * updating the contributor agreement for amperinet	2019-01-31 23:53:45 +01:00
Ines Montani	e9a6dbe4f3	Don't check for Jupyter in global scope and fix check (#3213) Resolves #3208. Prevent interactions with other libraries (pandas) that also access `get_ipython().config` and its parameters. See #3208 for details. I don't fully understand why this happens, but in spaCy, we can at least make sure we avoid calling into this method. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-31 23:49:13 +01:00
Amandine Périnet	b34bc9d2e9	add small fix for French lemmatizer (#3206)	2019-01-31 23:44:10 +01:00
Loghi	5ca8e2b269	Tamil (#3194) * Tamil language support stop wors, examples and numerical attribite supports added Contributor agreement signed * Create Loghijiaha.md Added contributor agreement * Update CONTRIBUTOR_AGREEMENT.md Adjusted contributor_agreement.md * Norm exceptions added	2019-01-27 06:02:04 +01:00
foufaster	8bd85fd9d5	Fix french lemmatization (#3180)	2019-01-27 06:01:30 +01:00
Sofie	66016ac289	Batch UD evaluation script (#3174) * running UD eval * printing timing of tokenizer: tokens per second * timing of default English model * structured output and parameterization to compare different runs * additional flag to allow evaluation without parsing info * printing verbose log of errors for manual inspection * printing over- and undersegmented cases (and combo's) * add under and oversegmented numbers to Score and structured output * print high-freq over/under segmented words and word shapes * printing examples as part of the structured output * print the results to file * batch run of different models and treebanks per language * cleaning up code * commandline script to process all languages in spaCy & UD * heuristic to remove blinded corpora and option to run one single best per language * pathlib instead of os for file paths	2019-01-27 06:01:02 +01:00


5595 Commits