spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-17 08:16:04 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	80b94313b6	💫 Fix interaction of lemmatizer and tokenizer exceptions (#3388) Closes #2203. Closes #3268. Lemmas set from outside the `Morphology` class were being overwritten. The result was especially confusing when deserialising, as it meant some lemmas could change when storing and retrieving a `Doc` object. This PR applies two fixes: 1) When we go to set the lemma in the `Morphology` class, first check whether a lemma is already set. If so, don't overwrite. 2) When we load with `doc.from_array()`, take care to apply the `TAG` field first. This allows other fields to overwrite the `TAG`-implied properties, if they're provided explicitly (e.g. the `LEMMA`). ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-11 01:31:21 +01:00
Ines Montani	8f45ff3dc2	Adjust formatting [ci skip]	2019-03-11 00:47:41 +01:00
Ines Montani	7ba3a5d95c	💫 Make serialization methods consistent (#3385) * Make serialization methods consistent: use the exclude keyword argument instead of randomly named keyword arguments, and add deprecation handling * Update docs and add section on serialization fields	2019-03-10 19:16:45 +01:00
Ines Montani	67e38690d4	Un-xfail passing tests and tidy up	2019-03-10 18:42:16 +01:00
Matthew Honnibal	27dd820753	Fix vocab deserialization when loading already present lexemes (#3383) * Fix vocab deserialization bug. Closes #2153 * Un-xfail test for #2153	2019-03-10 17:21:19 +01:00
Matthew Honnibal	61e5ce02a4	Add xfailing test for #2153	2019-03-10 16:36:29 +01:00
Matthew Honnibal	8a6272f842	Un-xfail test	2019-03-10 15:51:15 +01:00
Ines Montani	0426689db8	💫 Improve Doc.to_json and add Doc.is_nered (#3381 ) * Use default return instead of else * Add Doc.is_nered to indicate if entities have been set * Add properties in Doc.to_json if they were set, not if they're available This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.	2019-03-10 15:24:34 +01:00
Ines Montani	7984543953	Add xfailing test for to_array/from_array string attrs	2019-03-10 15:08:15 +01:00
Ines Montani	6bbf4ea309	Simplify tests and avoid tokenizing	2019-03-10 15:05:56 +01:00
Matthew Honnibal	a5b1f6dcec	Fix NER when preset entities cross sentence boundaries (#3379) 💫 Fix NER when preset entities cross sentence boundaries	2019-03-10 14:53:03 +01:00
Matthew Honnibal	231bc7bb7b	Add xfailing test for #3345	2019-03-10 13:00:15 +01:00
Ines Montani	96b91a8898	Fix noqa [ci skip]	2019-03-07 12:25:00 +01:00
Ines Montani	533b580c19	Add test for stray print statements in languages (see #3342)	2019-02-27 16:04:30 +01:00
Ines Montani	9b62639d19	Auto-format [ci skip]	2019-02-27 14:24:55 +01:00
Matthew Honnibal	f1d77eb140	💫 Improve handling of missing NER tags (closes #2603) (#3341) * Improve handling of missing NER tags GoldParse can accept missing NER tags, if entities is provided in BILUO format (rather than as spans). Missing tags can be provided as None values. Fix bug that occurred when first tag was a None value. Closes #2603. * Document specification of missing NER tags.	2019-02-27 12:06:32 +01:00
Ines Montani	e359bdd0e3	Auto-format	2019-02-27 11:56:45 +01:00
Matthew Honnibal	4a3371acd5	Make doc[0].is_sent_start == True (closes #2869) (#3340) * Make doc[0] have sent_start True. Closes #2869 * Document that doc[0].is_sent_start defaults True.	2019-02-27 11:17:17 +01:00
Matthew Honnibal	2d3ce89b78	Improve matcher tests re issue #3328	2019-02-27 10:25:56 +01:00
Matthew Honnibal	8d6954e0e7	Fix matcher bug #3328	2019-02-27 10:25:39 +01:00
Ines Montani	aadf586789	Add xfailing test for #3331	2019-02-25 22:33:30 +01:00
Ines Montani	f135d663f7	Update conftest.py	2019-02-25 15:55:29 +01:00
Ines Montani	76ce8b2662	Merge branch 'master' into develop	2019-02-25 15:54:55 +01:00
Julia Makogon	f1c3108d52	Fixing pymorphy2 dependency issue (#3329) (closes #3327) * Classes for Ukrainian; small fix in Russian. * Contributor agreement * pymorphy2 initialization split for ru and uk (#3327) * stop-words fixed * Unit-tests updated	2019-02-25 15:48:17 +01:00
Ines Montani	1a735e0f1f	Add regression test for #3328	2019-02-25 10:12:58 +01:00
Ines Montani	62b558ab72	💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325) * Fix formatting and whitespace * Add support for lexical attributes (closes #2390) * Document lexical attribute setting during retokenization * Assign variable outside of nested loop	2019-02-24 21:13:51 +01:00
Ines Montani	a48deb4081	Merge regression tests	2019-02-24 21:03:39 +01:00
Ines Montani	8f6c193a4d	Delete _test_issue1622.py	2019-02-24 20:33:31 +01:00
Ines Montani	c8e967c78d	Try include previously segfaulting test	2019-02-24 20:32:46 +01:00
Ines Montani	328b589deb	Merge regression tests	2019-02-24 20:31:38 +01:00
Ines Montani	3bc53905cc	Remove print statements from test	2019-02-24 20:31:15 +01:00
Ines Montani	1ae0df3da9	Un-x-fail passing test	2019-02-24 20:24:15 +01:00
Ines Montani	399a5803d0	Tidy up tests [ci skip]	2019-02-24 19:02:16 +01:00
Ines Montani	df19e2bff6	💫 Allow setting of custom attributes during retokenization (closes #3314) (#3324) ## Description This PR adds the ability to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attributes that have both a getter and a setter implemented. ```python Token.set_extension('is_musician', default=False) doc = nlp("I like David Bowie.") with doc.retokenize() as retokenizer: attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}} retokenizer.merge(doc[2:4], attrs=attrs) assert doc[2].text == "David Bowie" assert doc[2].lemma_ == "David Bowie" assert doc[2]._.is_musician ``` ### Types of change enhancement ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-02-24 18:38:47 +01:00
Ines Montani	d8f69d592f	Tidy up retokenizer tests	2019-02-24 14:14:11 +01:00
Ines Montani	723e27cb8c	Tidy up tests	2019-02-24 14:11:23 +01:00
Ines Montani	80bdcb99c5	Fix escaping of HTML in displacy ENT (closes #2728)	2019-02-21 14:30:39 +01:00
Matthew Honnibal	c5f947f194	Fix regex deprecation warnings	2019-02-21 11:56:47 +01:00
Matthew Honnibal	80195bc2d1	Fix issue #3288 (#3308)	2019-02-21 09:48:53 +01:00
Matthew Honnibal	a137e8b418	Fix Pipe.to_bytes() when model uninitialized Closes #3289	2019-02-21 09:42:02 +01:00
Sofie	9a478b6db8	Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293) * splitting up latin unicode interval * removing hyphen as infix for French * adding failing test for issue 1235 * test for issue #3002 which now works * partial fix for issue #2070 * keep the hyphen as infix for French (as it was) * restore french expressions with hyphen as infix (as it was) * added succeeding unit test for Issue #2656 * Fix issue #2822 with custom Italian exception * Fix issue #2926 by allowing numbers right before infix / * remove duplicate * remove xfail for Issue #2179 fixed by Matt * adjust documentation and remove reference to regex lib	2019-02-20 22:10:13 +01:00
Matthew Honnibal	0d1ca15b13	💫 Fix bugs in matcher extensions. Closes #1971 (#3301 ) * Fix matching on extension attrs and predicates * Fix detection of match_id when using extension attributes. The match ID is stored as the last entry in the pattern. We were checking for this with nr_attr == 0, which didn't account for extension attributes. * Fix handling of predicates. The wrong count was being passed through, so even patterns that didn't have a predicate were being checked. * Fix regex pattern * Fix matcher set value test	2019-02-20 21:30:39 +01:00
Ines Montani	3b667787a9	Add xfailing test for #3289	2019-02-18 16:45:04 +01:00
Ines Montani	91f260f2c4	Add another test for #1971	2019-02-18 13:36:20 +01:00
Ines Montani	f30aac324c	Update test_issue1971.py	2019-02-18 13:36:15 +01:00
Ines Montani	8fa26ca97e	Fix tensor shape in test for #3288	2019-02-18 11:01:54 +01:00
Ines Montani	c32290557f	Add xfailing test for #3288	2019-02-18 10:59:31 +01:00
Ines Montani	3af0b2dd1c	Add xfailing test for #1971 [ci skip]	2019-02-17 13:04:47 +01:00
Ines Montani	1e252b129c	Auto-format	2019-02-17 12:22:07 +01:00
Matthew Honnibal	92b6bd2977	Refinements to retokenize.split() function (#3282 ) * Change retokenize.split() API for heads * Pass lists as values for attrs in split * Fix test_doc_split filename * Add error for mismatched tokens after split * Raise error if new tokens don't match text * Fix doc test * Fix error * Move deps under attrs * Fix split tests * Fix retokenize.split	2019-02-15 17:32:31 +01:00
Ines Montani	1aa57690dc	Add xfailing test for orth mismatch in retokenizer.split	2019-02-15 13:55:04 +01:00
Ines Montani	819768483f	Add xfailing test for out-of-bounds heads	2019-02-15 13:09:07 +01:00
Ines Montani	d8051e89ca	Tidy up tests	2019-02-15 12:56:51 +01:00
Ines Montani	c31a9dabd5	💫 Add en/em dash to prefixes and suffixes (#3281) * Auto-format * Add en/em dash to prefixes and suffixes	2019-02-15 10:29:59 +01:00
Ines Montani	5651a0d052	💫 Replace {Doc,Span}.merge with Doc.retokenize (#3280) * Add deprecation warning to Doc.merge and Span.merge * Replace {Doc,Span}.merge with Doc.retokenize	2019-02-15 10:29:44 +01:00
Ines Montani	f146121092	💫 Make handling of [Pipe].labels consistent (#3273 ) * Make handling of [Pipe].labels consistent * Un-xfail passing test * Update spacy/pipeline/pipes.pyx Co-Authored-By: ines <ines@ines.io> * Update spacy/pipeline/pipes.pyx Co-Authored-By: ines <ines@ines.io> * Update spacy/tests/pipeline/test_pipe_methods.py Co-Authored-By: ines <ines@ines.io> * Update spacy/pipeline/pipes.pyx Co-Authored-By: ines <ines@ines.io> * Move error message to spacy.errors * Fix textcat labels and test * Make EntityRuler.labels return tuple as well	2019-02-15 06:03:19 +11:00
Ines Montani	3d577b77c6	Auto-formatting	2019-02-14 19:56:38 +01:00
Ines Montani	e104e47c21	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-02-14 15:35:34 +01:00
Ines Montani	0cd01a8c5e	Merge branch 'master' into develop	2019-02-14 15:35:20 +01:00
Ines Montani	2e31921d0a	💫 Add base Language classes for more languages (#3276) * Add base classes for more languages * Add test for language class initialization Make sure each language can be initialized; otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded	2019-02-15 01:31:19 +11:00
Grivaz	39815513e2	Add split one token into several (resolves #2838 ) (#3253 ) * Add split one token into several (resolves #2838) * Improve error message for token splitting * Make retokenizer.split() tests use a Token object Change retokenizer.split() to use a Token object, instead of an index. * Pass Token into retokenize.split() Tweak retokenize.split() API so that we pass the `Token` object, not the index. * Fix token.idx in retokenize.split() * Test that token.idx is correct after split * Fix token.idx for split tokens * Fix retokenize.split() * Fix retokenize.split * Fix retokenize.split() test	2019-02-15 01:27:13 +11:00
Ines Montani	743ecf728c	Tidy up conftest	2019-02-14 13:27:13 +01:00
Ines Montani	4d2438f985	Tidy up and auto-format	2019-02-13 15:29:08 +01:00
Ines Montani	fbf9f1edf1	Also raise error in Span.__reduce__	2019-02-13 13:22:05 +01:00
Ines Montani	2d0c3c73f4	Raise better error if token is pickled (resolves #2833) (#3267)	2019-02-13 11:27:04 +01:00
Ines Montani	b589b945db	Fix PhraseMatcher pickling and length (resolves #3248) (#3252)	2019-02-12 18:27:54 +01:00
Ines Montani	483dddc9bc	💫 Add token match pattern validation via JSON schemas (#3244 ) * Add custom MatchPatternError * Improve validators and add validation option to Matcher * Adjust formatting * Never validate in Matcher within PhraseMatcher If we do decide to make validate default to True, the PhraseMatcher's Matcher shouldn't ever validate. Here, we create the patterns automatically anyways (and it's currently unclear whether the validation has performance impacts at a very large scale).	2019-02-13 01:47:26 +11:00
Ines Montani	ad2a514cdf	Show warning if phrase pattern Doc was overprocessed (#3255) In most cases, the PhraseMatcher will match on the verbatim token text or, as of v2.1, sometimes the lowercase text. This means that we only need a tokenized Doc, without any other attributes. If phrase patterns are created by processing large terminology lists with the full `nlp` object, this can easily make things a lot slower, because all components will be applied, even if we don't actually need the attributes they set (like part-of-speech tags, dependency labels). The warning message also includes a suggestion to use nlp.make_doc or nlp.tokenizer.pipe for even faster processing. For now, the validation has to be enabled explicitly by setting validate=True.	2019-02-13 01:45:31 +11:00
Matthew Honnibal	6ec834dc72	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-02-13 01:14:44 +11:00
Matthew Honnibal	43fa039d96	xfail regression test for model labels	2019-02-13 01:14:26 +11:00
Matthew Honnibal	bc300d4e31	Add test for issue 3209	2019-02-13 01:13:01 +11:00
Ines Montani	34a3cc26a9	Add xfailing test for reverse pattern (see #1971 )	2019-02-12 14:49:59 +01:00
Ines Montani	fe39fd4d13	Make warning tests more explicit	2019-02-10 14:02:19 +01:00
Ines Montani	e7593b791e	Fix import	2019-02-08 20:50:52 +01:00
Ines Montani	0754b848fe	Actually xfail test for #1971	2019-02-08 20:50:35 +01:00
Ines Montani	414a69b736	Add xfailing test (see #1971 , #2675 , #2671 )	2019-02-08 20:50:01 +01:00
Ines Montani	ea07f3022e	Only run noun chunks iterator in Span if available (closes #3199 )	2019-02-08 18:33:16 +01:00
Ines Montani	586c56fc6c	Tidy up regression tests	2019-02-08 15:51:13 +01:00
Ines Montani	25602c794c	Tidy up and fix small bugs and typos	2019-02-08 14:14:49 +01:00
Ines Montani	9e652afa4b	Merge branch 'master' into develop	2019-02-08 13:28:09 +01:00
Stanisław Giziński	1448ad100c	Improved Polish tokenizer and stop words. (#2974) * Improved stop words list * Removed some wrong stop words from list * Improved Polish Tokenizer (#38) * Add tests for Polish tokenizer * Add Polish tokenizer exceptions * Don't split any words containing hyphens * Fix test case with wrong model answer * Remove commented out line of code until better solution is found * Add source srx' license * Rename exception_list.py to match spaCy conventions * Add a brief explanation of where the exception list comes from * Add newline after each exception * Rename COPYING.txt to LICENSE * Delete old files * Add header to the license * Agreements signed * Stanisław Giziński agreement * Krzysztof Kowalczyk - signed agreement * Mateusz Olko agreement * Add DoomCoder's contributor agreement * Improve like number checking in Polish lang * like num tests added * all from SI system added * Final licence and removed splitting exceptions * Added Polish stop words to LEX_ATTRA * Add encoding info to pl tokenizer exceptions	2019-02-08 14:27:21 +11:00
Ines Montani	e2d93e4852	Merge branch 'master' into develop	2019-02-07 21:10:08 +01:00
Julia Makogon	b41d64825a	Ukrainian language added. Small fixes in Russian (#3241) * Classes for Ukrainian; small fix in Russian. * Contributor agreement	2019-02-07 21:05:11 +01:00
Ines Montani	5d0b60999d	Merge branch 'master' into develop	2019-02-07 20:54:07 +01:00
Ines Montani	338d659bd0	Store JSON schemas in Python and tidy up (#3235)	2019-02-07 19:44:31 +11:00
Ines Montani	a9bf5d9fd8	Add xfailing test for set value with operator [ci skip]	2019-02-06 13:40:11 +01:00
Ines Montani	e51a238b3f	Auto-format	2019-02-06 13:32:18 +01:00
Ines Montani	f25bd9f5e4	Add gold.spans_from_biluo_tags helper (#3227)	2019-02-06 21:50:26 +11:00
Sofie	9745b0d523	Improve Italian & Urdu tokenization accuracy (#3228 ) ## Description 1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), bringing F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added unit test to check this behaviour. 2. Added specific Urdu punctuation character as suffix, improving F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added unit test to check this behaviour. ### Types of change Enhancement of Italian & Urdu tokenization ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-02-04 22:39:25 +01:00
Sofie	a3efa3e8d9	Improve Catalan tokenization accuracy (#3225) * small hyphen clean up for French * catalan infix similar to french	2019-02-04 20:37:19 +11:00
Sofie	46dfe773e1	Replacing regex library with re to increase tokenization speed (#3218) * replace unicode categories with raw list of code points * simplifying ranges * fixing variable length quotes * removing redundant regular expression * small cleanup of regexp notations * quotes and alpha as ranges instead of alternations * removed most regexp dependencies and features * exponential backtracking - unit tests * rewrote expression with pathological backtracking * disabling double hyphen tests for now * test additional variants of repeating punctuation * remove regex and redundant backslashes from load_reddit script * small typo fixes * disable double punctuation test for russian * clean up old comments * format block code * final cleanup * naming consistency * french strings as unicode for python 2 support * french regular expression case insensitive	2019-02-01 18:05:22 +11:00
foufaster	8bd85fd9d5	Fix french lemmatization (#3180)	2019-01-27 06:01:30 +01:00
Matthew Honnibal	77ddcf7381	💫 Update matcher engine for regex and extensions (#3173 ) * Update matcher engine for regex and extensions Add support for matching over arbitrary Python predicate functions, and arbitrary Python attribute getters. This will allow matching over regex patterns, and allow supporting extension attributes. The results of the Python predicate functions are cached, so that we don't call the same predicate function twice for the same token. The extension attributes are fetched into an array for each token in the doc. This should minimise the performance impact of the new features. We still need to wire up these features to the patterns, and test it all. * Work on wiring up extra attributes in matcher * Work on tests for extra matcher attrs * Add support for extension attrs to matcher * Test extension attribute matching * Work on implementing predicate-based match patterns * Get predicates working for set membership * Add test for set membership * Make extensions+predicates work * Test matcher extensions * Cache predicate results better in Matcher * Remove print statement in matcher test * Use srsly to get key for predicates	2019-01-21 13:23:15 +01:00
Björn Lennartsson	b892b446cc	Updates to Swedish Language (#3164) * Added the same punctuation rules as the Danish language. * Added abbreviations and also the possibility to have capitalized abbreviations on some. Added a few specific cases too * Added test for long texts in Swedish * Added morph rules, infixes and suffixes to __init__.py for Swedish * Added some tests for prefixes, infixes and suffixes * Added tests for lemma * Renamed files to follow convention * [sv] Removed ambiguous abbreviations * Added more tests for tokenizer exceptions * Added test for problem with punctuation in issue #2578 * Contributor agreement * Removed faulty lemmatization of 'jag' ('I') as it was lemmatized to 'jaga' ('hunt')	2019-01-16 13:45:50 +01:00
Álvaro Abella Bascarán	e03e1eee92	Bugfix/get lca matrix (#3110 ) This PR adds a test for an untested case of `Span.get_lca_matrix`, and fixes a bug for that scenario, which I introduced in [this PR](https://github.com/explosion/spaCy/pull/3089) (sorry!). ## Description The previous implementation of get_lca_matrix was failing for the case `doc[j:k].get_lca_matrix()` where `j > 0`. A test has been added for this case and the bug has been fixed. ### Types of change Bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-06 19:07:50 +01:00
Matthew Honnibal	3c09d3d986	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-30 15:49:57 +01:00
Matthew Honnibal	bf20252ae0	Update test for #3012	2018-12-30 15:46:46 +01:00
Matthew Honnibal	63b7accd74	💫 Make span.as_doc() return a copy, not a view. Closes #1537 (#3107 ) Initially span.as_doc() was designed to return a view of the span's contents, as a Doc object. This was a nice idea, but it fails due to the token.idx property, which refers to the character offset within the string. In a span, the idx of the first token might not be 0. Because this data is different, we can't have a view --- it'll be inconsistent. This patch changes span.as_doc() to instead return a copy. The docs are updated accordingly. Closes #1537 * Update test for span.as_doc() * Make span.as_doc() return a copy. Closes #1537 * Document change to Span.as_doc()	2018-12-30 15:17:46 +01:00
Matthew Honnibal	72e4d3782a	Resize doc.tensor when merging spans. Closes #1963 (#3106 ) The doc.retokenize() context manager wasn't resizing doc.tensor, leading to a mismatch between the number of tokens in the doc and the number of rows in the tensor. We fix this by deleting rows from the tensor. Merged spans are represented by the vector of their last token. * Add test for resizing doc.tensor when merging * Add test for resizing doc.tensor when merging. Closes #1963 * Update get_lca_matrix test for develop * Fix retokenize if tensor unset	2018-12-30 15:17:17 +01:00
Matthew Honnibal	3d64eb4a74	Update get_lca_matrix test for develop	2018-12-30 14:28:07 +01:00
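
Commit 72e4d3782a above describes the tensor fix in words: rows are deleted from `doc.tensor` when spans are merged, and each merged span keeps the vector of its last token. The following is a minimal plain-Python sketch of that rule only, not spaCy's actual implementation. The helper name `merge_tensor_rows` and the list-of-lists stand-in for the tensor are assumptions made for illustration.

```python
def merge_tensor_rows(rows, spans):
    """Return a copy of `rows` with each merged span collapsed to one row.

    rows:  one vector per token, in token order (hypothetical stand-in
           for doc.tensor)
    spans: non-overlapping (start, end) token ranges, end exclusive
    """
    drop = set()
    for start, end in spans:
        # Drop every row of the span except the last token's row,
        # which then represents the merged token.
        drop.update(range(start, end - 1))
    return [row for i, row in enumerate(rows) if i not in drop]


# Five tokens ("I", "like", "David", "Bowie", "."); merge tokens 2 and 3.
rows = [[0.0], [1.0], [2.0], [3.0], [4.0]]
merged = merge_tensor_rows(rows, [(2, 4)])
```

After the merge, the number of rows matches the new token count (four), which is the mismatch the commit message says was previously left unresolved.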


1230 Commits