spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-08 09:41:11 +03:00

Author	SHA1	Message	Date
Ines Montani	75f3234404	💫 Refactor test suite (#2568) ## Description Related issues: #2379 (should be fixed by separating model tests) * total execution time down from > 300 seconds to under 60 seconds 🎉 * removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure * changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version) * merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyways) * tidied up and rewrote existing tests wherever possible ### Todo - [ ] move tests to `/tests` and adjust CI commands accordingly - [x] move model test suite from internal repo to `spacy-models` - [x] ~~investigate why `pipeline/test_textcat.py` is flakey~~ - [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted - [ ] update documentation on how to run tests ### Types of change enhancement, tests ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-24 23:38:44 +02:00
Matthew Honnibal	82277f63a3	💫 Small efficiency fixes to tokenizer (#2587) This patch improves tokenizer speed by about 10%, and reduces memory usage in the `Vocab` by removing a redundant index. The `vocab._by_orth` and `vocab._by_hash` were indexed on different data in v1, but in v2 the orth and the hash are identical. The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This checks whether a chunk we're tokenizing triggers a special-case rule. If it does, then we avoid caching within the chunk. This check led to incorrectly rejecting some chunks from the cache. With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104 words per second. Prior to this patch, we had 465,764 words per second. Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to: * Fix the variable-length lookarounds in the suffix, infix and `token_match` rules * Improve the performance of the `token_match` regex * Switch back from the `regex` library to the `re` library. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-24 23:35:54 +02:00
ines	3c30d1763c	Merge branch 'master' into develop	2018-07-21 15:34:18 +02:00
Matthew Honnibal	1a1c7304cf	Set version to 2.0.12.dev1	2018-07-21 13:08:01 +02:00
ines	1ea881c80b	Allow ignoring warnings and only overwrite if set explicitly	2018-07-20 22:50:19 +02:00
Matthew Honnibal	e0caf3ae8c	Fix msgpack for new version	2018-07-20 17:32:00 +02:00
Matthew Honnibal	899f1cf442	Add regression test for issue 2179	2018-07-20 17:15:44 +02:00
Matthew Honnibal	9db77fd914	Fix deserialization for msgpack	2018-07-20 14:11:09 +02:00
katarkor	5ca853bee0	changed tag_map, morph_rules, lemmatizer for Norwegian (#2565) * changed tag_map, morph_rules, lemmatizer for Norwegian * Move unicode declaration up Hopefully fixes test failure on Python 2 * Update CONTRIBUTOR_AGREEMENT.md * Move unicode declarations Hopefully fixes test this time * Revert "Merge remote-tracking branch 'origin/patch-1'" This reverts commit `f5ccd5dd0d`, reversing changes made to `dd07e180ea`. * Update contributor agreement [ci skip]	2018-07-19 19:38:24 +02:00
Ines Montani	e7b075565d	💫 Rule-based NER component (#2513) * Add helper function for reading in JSONL * Add rule-based NER component * Fix whitespace * Add component to factories * Add tests * Add option to disable indent on json_dumps compat Otherwise, reading JSONL back in line by line won't work * Fix error code	2018-07-18 19:43:16 +02:00
ines	d84b13e02c	Merge branch 'master' into develop	2018-07-18 18:57:00 +02:00
Ole Henrik Skogstrøm	6e2930a4a2	Conll(u)-bio converter (#2525) * Started simple conllxbiluo converter * Fix missing BIO to BILUO conversion	2018-07-18 18:55:42 +02:00
ines	02aefe7cc0	Merge branch 'master' into develop	2018-07-18 18:52:59 +02:00
Ioannis Daras	6ed18412d0	Greek language optimizations (#2558) * Greek language optimizations * Add encoding on files containing greek words * Add encoding on files containing greek words	2018-07-18 18:51:38 +02:00
ines	80e7485630	Merge branch 'master' into develop	2018-07-18 17:28:47 +02:00
Paul O'Leary McCann	61ef0739b8	Add Japanese stop words. (#2549) List created by taking the 2000 top words from a Wikipedia dump and removing everything that wasn't hiragana. Tried going through kanji words and deciding what to keep but there were too many obvious non-stopwords (東京 was in the top 500) and many other words where it wasn't clear if they should be included or not.	2018-07-17 10:12:48 +02:00
Tero K	f35980f865	Enhancement/lang fi examples (#2547) * Added a file with examples in finnish * added contributor agreement	2018-07-15 09:50:27 +02:00
Paul O'Leary McCann	1987f3f784	Add Japanese lemmas (#2543) This info was already available from Mecab, forgot to add it before.	2018-07-13 10:55:14 +02:00
ines	3a321e79ac	Merge branch 'master' into develop	2018-07-10 13:49:08 +02:00
Eleni170	6042723535	Add support for Greek language (#2535) * Add contributor agreement * Support for Greek language * Fix missing el_tokenizer	2018-07-10 13:48:38 +02:00
Stefan Schweter	3dfc7f86be	lemmatizer: correct lemma for Rang (#2537) ## Description This PR corrects the German lemma form for the word "Rang". Initially, the lemma form was "ringen", which is not correct, because it refers to the verb ("ringen") and not to the noun ("Rang"). ### Types of change The lemma form for "Rang" is corrected to "Rang", see also the [Duden](https://www.duden.de/rechtschreibung/Rang) entry. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-10 13:11:19 +02:00
ines	fd6207426a	Merge branch 'master' into develop	2018-07-09 18:05:10 +02:00
Duygu Altinok	00b9a58558	German lemmatizer additions (#2529) * lemma of was-> was * added new pairs issue @2486 * added article tests	2018-07-09 11:10:15 +02:00
Ole Henrik Skogstrøm	c21efea9bb	Add sent property to token (#2521) * Add sent property to token * Refactored and cleaned up copy paste errors.	2018-07-06 15:54:15 +02:00
ines	38e07ade4c	Add test for custom tokenizer serialization (resolves #2494)	2018-07-06 12:40:51 +02:00
ines	c2581f9172	Tidy up tokenizer test	2018-07-06 12:40:28 +02:00
Matthew Honnibal	43dcaa473e	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-07-06 12:36:42 +02:00
Matthew Honnibal	6c8d627733	Fix tokenizer deserialization	2018-07-06 12:36:33 +02:00
ines	c001d46153	Tidy up	2018-07-06 12:33:42 +02:00
Matthew Honnibal	63f5651f8d	Fix tokenizer serialization	2018-07-06 12:32:11 +02:00
Matthew Honnibal	e1569fda4e	Fix compile error in matcher	2018-07-06 12:29:23 +02:00
Matthew Honnibal	f5b2076700	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-07-06 12:23:14 +02:00
Matthew Honnibal	1a2f61725c	Fix tokenizer serialization	2018-07-06 12:23:04 +02:00
ines	9e09477b2f	Remove unused import	2018-07-06 12:18:17 +02:00
ines	26f04a6ac3	Fix Matcher tests and add test for any token with operator	2018-07-06 12:17:50 +02:00
Matthew Honnibal	f5703b7a91	Clean up unused stuff in matcher	2018-07-06 12:16:44 +02:00
Matthew Honnibal	08c362d541	Suppress compiler warning about unreachable code	2018-07-06 11:31:22 +02:00
Matthew Honnibal	8ae1bec8bf	Fix init_model	2018-07-05 14:02:06 +02:00
Matthew Honnibal	7b09a4ca49	Fix lemmatization	2018-07-05 13:56:02 +02:00
Matthew Honnibal	ec41ceb383	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-07-05 13:49:42 +02:00
Matthew Honnibal	4eb3405df7	Fix lemmatizer ordering, re Issue #1387	2018-07-05 13:49:29 +02:00
ines	63666af328	Merge branch 'master' into develop	2018-07-04 14:52:25 +02:00
ines	8feb7cfe2d	Remove model dependency from French lemmatizer tests	2018-07-04 14:46:45 +02:00
kleinay	a82c3153ad	fix issue #2452 - displacy arrow direction is always forward (#2506) (closes #2452) Referring to #2452, fixing displacy arrow directions to match the input. ## Description The fix is simply replacing `direction is 'left'` with `direction == 'left'` to cover the case where `direction` is a `str` and not a `unicode`. ### Types of change bug fix ## Checklist - [ ] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-04 14:12:08 +02:00
Bùi Trung Chí	9af46b4f1b	Fix loading tokenizer with custom prefix search (#2495) * Add contributor agreement * Fix loading tokenizer with custom prefix search	2018-07-04 12:56:07 +02:00
Matthew Honnibal	dee8bdb900	Fix init-model for npz vectors	2018-07-04 02:29:48 +02:00
Matthew Honnibal	59d655e8d0	Fix model init from jsonl	2018-07-04 01:30:40 +02:00
Matthew Honnibal	1e38bea6e9	Save vectors init	2018-07-03 23:55:04 +02:00
Matthew Honnibal	6692833887	Fix init_model	2018-07-03 23:24:11 +02:00
Matthew Honnibal	4a38a26cb5	Fix init_model	2018-07-03 22:57:11 +02:00
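
Commit a82c3153ad above replaces `direction is 'left'` with `direction == 'left'`. The sketch below illustrates why that matters in general: `is` compares object identity, while `==` compares values, so an equal string that happens to be a different object can fail an `is` check. The function names and the `LEFT` constant are hypothetical and are not taken from displaCy's code.

```python
LEFT = "left"  # hypothetical constant standing in for the literal 'left'


def is_left_by_identity(direction):
    # Identity check: true only when both names refer to the same object.
    return direction is LEFT


def is_left_by_equality(direction):
    # Value check: true whenever the two strings have the same content.
    return direction == LEFT


# Build an equal string at runtime, as input parsed from a file or request
# might be. Whether it is the same object as LEFT depends on the interpreter.
runtime_value = "".join(["le", "ft"])

print(is_left_by_equality(runtime_value))  # True
print(is_left_by_identity(runtime_value))  # not guaranteed; do not rely on it
```

Because the result of the identity check depends on interpreter details such as string interning, comparing strings with `==` is the safer choice.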

5216 Commits
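
Commit 6e2930a4a2 in the log above mentions a BIO to BILUO conversion. As a rough illustration of what such a conversion involves, here is a minimal sketch that assumes well-formed BIO input; the function name `bio_to_biluo` is hypothetical and this is not the converter's actual implementation. A single-token entity becomes `U-`, and the final token of a multi-token entity becomes `L-`.

```python
def bio_to_biluo(tags):
    """Convert a list of well-formed BIO tags to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The entity continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label
        if prefix == "B":
            # B- starts a longer entity; a lone B- is a unit-length entity.
            biluo.append(("B-" if continues else "U-") + label)
        else:
            # I- stays inside the entity unless it is the last token.
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


print(bio_to_biluo(["B-PER", "I-PER", "O", "B-LOC"]))
# ['B-PER', 'L-PER', 'O', 'U-LOC']
```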