* Exceptions for single-letter words ending a sentence
Sentences ending in "i." (as in "... peka i.") or "m." (as in "... än 2000 m.") should have the final single-letter word and the period tokenized as two separate tokens (see the test sketch after this list).
* Add test
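A sketch of the kind of test this adds, assuming spaCy's usual `sv_tokenizer` test fixture:

```python
import pytest

# Sentence-final single-letter words followed by a period should be split
# into two tokens rather than kept as an abbreviation-style token.
@pytest.mark.parametrize("text,expected_tokens", [
    ("peka i.", ["peka", "i", "."]),
    ("2000 m.", ["2000", "m", "."]),
])
def test_sv_tokenizer_single_letter_at_sentence_end(sv_tokenizer, text, expected_tokens):
    tokens = sv_tokenizer(text)
    assert [token.text for token in tokens] == expected_tokens
```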
## Description
Related issues: #2379 (should be fixed by separating model tests)
* **total execution time down from > 300 seconds to under 60 seconds** 🎉
* removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure
* changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version; see the example after this list)
* merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyway)
* tidied up and rewrote existing tests wherever possible
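For example, the import change looks like this (the module is illustrative):

```python
# Before: relative import, which only works while the tests live inside the package
#   from ..tokens import Doc
# After: absolute import, so the suite always exercises the installed spacy
from spacy.tokens import Doc
```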
### Todo
- [ ] move tests to `/tests` and adjust CI commands accordingly
- [x] move model test suite from internal repo to `spacy-models`
- [x] ~~investigate why `pipeline/test_textcat.py` is flaky~~
- [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted
- [ ] update documentation on how to run tests
### Types of change
enhancement, tests
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
This patch improves tokenizer speed by about 10% and reduces memory usage in the `Vocab` by removing a redundant index. The `vocab._by_orth` and `vocab._by_hash` tables indexed different data in v1, but in v2 the orth and the hash are identical.
The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This flag tracks whether a chunk we're tokenizing triggers a special-case rule; if it does, we avoid caching within the chunk. Because the flag was uninitialized, the check incorrectly rejected some chunks from the cache.
With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104 words per second, up from 465,764 words per second before this patch.
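For reference, a rough sketch of how a throughput figure like this can be measured (the `texts` placeholder stands in for the IMDB train documents):

```python
import time

import spacy

nlp = spacy.load("en_core_web_md")
texts = ["This movie was great."] * 1000  # placeholder for the IMDB train texts

start = time.time()
n_words = sum(len(doc) for doc in nlp.tokenizer.pipe(texts))
print("%.0f words per second" % (n_words / (time.time() - start)))
```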
Before switching to the `regex` library and supporting more languages, the tokenizer ran at 1.3m words per second. In order to recover the missing speed, we need to:
* Fix the variable-length lookarounds in the suffix, infix and `token_match` rules (see the sketch after this list)
* Improve the performance of the `token_match` regex
* Switch back from the `regex` library to the `re` library.
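On the first point: the `regex` package supports variable-length lookbehind, but `re` rejects it, so those rules have to be rewritten before we can switch back. The pattern below is purely illustrative:

```python
import re

import regex

pattern = r"(?<=\d{1,4})\."  # variable-length lookbehind

print(regex.search(pattern, "2018."))  # the regex package handles this
try:
    re.search(pattern, "2018.")
except re.error as err:
    print("re rejects it:", err)  # "look-behind requires fixed-width pattern"
```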
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
This jargon is not offensive but emotionally colored as funny due to its deviation from the norm, for various reasons: imitating a dialect, deliberately incorrect spelling emphasizing its low colloquial register, obsolete forms, foreign borrowings with native inflections, etc.
Dmitry Briukhanov, Linguist & Pythonist
* Add helper function for reading in JSONL (see the sketch after this list)
* Add rule-based NER component
* Fix whitespace
* Add component to factories
* Add tests
* Add option to disable indentation in the `json_dumps` compat helper
Otherwise, reading the JSONL back in line by line won't work
* Fix error code
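A minimal sketch of what the JSONL reader described above might look like (the actual helper in the PR may differ):

```python
import json

def read_jsonl(path):
    """Yield one parsed JSON object per line, skipping blank lines."""
    with open(path, "r", encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

This is also why `json_dumps` needs an option to disable indentation: each record has to stay on a single line for line-by-line reading to work.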
List created by taking the top 2,000 words from a Wikipedia dump and removing everything that wasn't hiragana. I tried going through the kanji words and deciding what to keep, but there were too many obvious non-stopwords (東京 was in the top 500) and many other words where it wasn't clear whether they should be included or not.
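The filtering step might look something like this, with `top_words` standing in for the frequency list from the dump:

```python
import re

# Keep only entries written entirely in hiragana (Unicode block U+3040-U+309F)
hiragana_only = re.compile(r"^[\u3040-\u309f]+$")

top_words = ["の", "は", "東京", "です"]  # placeholder for the top 2,000 words
stop_words = [word for word in top_words if hiragana_only.match(word)]
print(stop_words)  # 東京 is filtered out
```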
<!--- Provide a general summary of your changes in the title. -->
## Description
This PR corrects the German lemma form for the word "Rang". Initially, the lemma was "ringen", which is incorrect because it refers to the verb ("ringen") rather than the noun ("Rang"); the correct lemma is "Rang", see also the [Duden](https://www.duden.de/rechtschreibung/Rang) entry.
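A quick way to sanity-check the fix, assuming a German model that ships this lookup table is installed (the model name is an example):

```python
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Er hat einen hohen Rang.")
print([(token.text, token.lemma_) for token in doc])
# "Rang" should now lemmatize to "Rang", not "ringen"
```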
### Types of change
Bug fix
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
<!--- Provide a general summary of your changes in the title. -->
Referring to #2452, this fixes the displaCy arrow directions to match the input.
## Description
The fix simply replaces `direction is 'left'` with `direction == 'left'`, to cover the case where `direction` is a `str` and not a `unicode`. Since `is` compares object identity rather than equality, the old check only succeeded when both operands happened to be the same interned string object.
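A quick demonstration of why the identity check is unreliable:

```python
a = "left"
b = "".join(["le", "ft"])  # equal value, but a distinct string object

print(a == b)  # True: compares values
print(a is b)  # False: compares identities, so the arrow check could fail
```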
### Types of change
bug fix
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Pass through "silent" kwarg to the wrapper in the spacy module init.
reference issue #2196
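A hypothetical sketch of the change (the wrapper name and signatures are assumptions; the actual code in `spacy/__init__.py` may differ):

```python
from spacy import cli

# Before, the wrapper silently dropped the kwarg:
#   def info(model=None, markdown=False):
#       return cli.info(model, markdown)
# After, "silent" is forwarded to the CLI implementation:
def info(model=None, markdown=False, silent=False):
    return cli.info(model, markdown, silent)
```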
* contributor agreement