spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-16 23:21:58 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	1987f3f784	Add Japanese lemmas (#2543 ) This info was already available from Mecab, forgot to add it before.	2018-07-13 10:55:14 +02:00
ines	3a321e79ac	Merge branch 'master' into develop	2018-07-10 13:49:08 +02:00
Eleni170	6042723535	Add support for Greek language (#2535 ) * Add contributor agreement * Support for Greek language * Fix missing el_tokenizer	2018-07-10 13:48:38 +02:00
Stefan Schweter	3dfc7f86be	lemmatizer: correct lemma for Rang (#2537 ) ## Description This PR corrects the German lemma form for the word "Rang". Initially, the lemma form was "ringen", which is not correct, because it refers to the verb ("ringen") and not to the noun ("Rang"). ### Types of change The lemma form for "Rang" is corrected to "Rang", see also the [Duden](https://www.duden.de/rechtschreibung/Rang) entry. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-10 13:11:19 +02:00
ines	fd6207426a	Merge branch 'master' into develop	2018-07-09 18:05:10 +02:00
Duygu Altinok	00b9a58558	German lemmatizer additions (#2529 ) * lemma of was-> was * added new pairs issue @2486 * added article tests	2018-07-09 11:10:15 +02:00
Ole Henrik Skogstrøm	c21efea9bb	Add sent property to token (#2521 ) * Add sent property to token * Refactored and cleaned up copy paste errors.	2018-07-06 15:54:15 +02:00
ines	38e07ade4c	Add test for custom tokenizer serialization (resolves #2494 )	2018-07-06 12:40:51 +02:00
ines	c2581f9172	Tidy up tokenizer test	2018-07-06 12:40:28 +02:00
Matthew Honnibal	43dcaa473e	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-07-06 12:36:42 +02:00
Matthew Honnibal	6c8d627733	Fix tokenizer deserialization	2018-07-06 12:36:33 +02:00
ines	c001d46153	Tidy up	2018-07-06 12:33:42 +02:00
Matthew Honnibal	63f5651f8d	Fix tokenizer serialization	2018-07-06 12:32:11 +02:00
Matthew Honnibal	e1569fda4e	Fix compile error in matcher	2018-07-06 12:29:23 +02:00
Matthew Honnibal	f5b2076700	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-07-06 12:23:14 +02:00
Matthew Honnibal	1a2f61725c	Fix tokenizer serialization	2018-07-06 12:23:04 +02:00
ines	9e09477b2f	Remove unused import	2018-07-06 12:18:17 +02:00
ines	26f04a6ac3	Fix Matcher tests and add test for any token with operator	2018-07-06 12:17:50 +02:00
Matthew Honnibal	f5703b7a91	Clean up unused stuff in matcher	2018-07-06 12:16:44 +02:00
Matthew Honnibal	08c362d541	Suppress compiler warning about unreachable code	2018-07-06 11:31:22 +02:00
Matthew Honnibal	8ae1bec8bf	Fix init_model	2018-07-05 14:02:06 +02:00
Matthew Honnibal	7b09a4ca49	Fix lemmatization	2018-07-05 13:56:02 +02:00
Matthew Honnibal	ec41ceb383	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-07-05 13:49:42 +02:00
Matthew Honnibal	4eb3405df7	Fix lemmatizer ordering, re Issue #1387	2018-07-05 13:49:29 +02:00
ines	63666af328	Merge branch 'master' into develop	2018-07-04 14:52:25 +02:00
ines	8feb7cfe2d	Remove model dependency from French lemmatizer tests	2018-07-04 14:46:45 +02:00
kleinay	a82c3153ad	fix issue #2452 - displacy arrow direction is always forward (#2506 ) (closes #2452 ) Referring #2452, fixing displacy arrow directions to match the input. ## Description The fix is simply replacing `direction is 'left'` with `direction == 'left'` to include the case `direction` is a `str` and not a `unicode`. ### Types of change bug fix ## Checklist - [ ] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-07-04 14:12:08 +02:00
Bùi Trung Chí	9af46b4f1b	Fix loading tokenizer with custom prefix search (#2495 ) * Add contributor agreement * Fix loading tokenizer with custom prefix search	2018-07-04 12:56:07 +02:00
Matthew Honnibal	dee8bdb900	Fix init-model for npz vectors	2018-07-04 02:29:48 +02:00
Matthew Honnibal	59d655e8d0	Fix model init from jsonl	2018-07-04 01:30:40 +02:00
Matthew Honnibal	1e38bea6e9	Save vectors init	2018-07-03 23:55:04 +02:00
Matthew Honnibal	6692833887	Fix init_model	2018-07-03 23:24:11 +02:00
Matthew Honnibal	4a38a26cb5	Fix init_model	2018-07-03 22:57:11 +02:00
Matthew Honnibal	019d09e3c3	Fix init model	2018-07-03 22:16:44 +02:00
Matthew Honnibal	2543f8c93a	Support .npz vectors in init-model command	2018-07-03 21:42:16 +02:00
Matthew Honnibal	86aad11939	Fix init_model arg	2018-07-03 17:00:42 +02:00
Matthew Honnibal	eff42d36e3	Fix init model command	2018-07-03 16:32:23 +02:00
Matthew Honnibal	97487122ea	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-07-03 15:44:37 +02:00
Matthew Honnibal	6a89faf12e	Add support for jsonl-formatted lexical attributes to init-model command.	2018-07-03 12:22:56 +02:00
Matthew Honnibal	2ec2192000	Revert #1389 : Don't overrule rules when lemma exception is present	2018-06-29 19:43:02 +02:00
Matthew Honnibal	01ace9734d	Make pipeline work on empty docs	2018-06-29 19:21:38 +02:00
Matthew Honnibal	a1b05048d0	Fix tagger when doc is empty	2018-06-29 16:05:40 +02:00
Matthew Honnibal	3786942ff1	Fix tagger when docs are empty	2018-06-29 15:13:45 +02:00
ines	526be40823	Add test for `46d8a66`	2018-06-29 14:33:12 +02:00
ines	f08c871adf	Fix typo in Language.from_disk	2018-06-29 14:32:16 +02:00
Matthew Honnibal	46d8a66fef	Fix tokenizer serialization if token_match is None	2018-06-29 14:24:46 +02:00
Matthew Honnibal	e0860bcfb3	Fix bug when docs are empty	2018-06-29 13:56:29 +02:00
Matthew Honnibal	a4d2b0c293	Fix bug when docs are empty	2018-06-29 13:44:25 +02:00
Matthew Honnibal	c83fccfe2a	Fix output of best model	2018-06-25 23:05:56 +02:00
Matthew Honnibal	5a65418c40	Fix handling of unseen labels in tagger	2018-06-25 22:28:59 +02:00
Matthew Honnibal	5b56aad4c2	Fix handling of unseen labels in tagger	2018-06-25 22:24:54 +02:00
Matthew Honnibal	3aabf621a3	Fix handling of unknown tags in tagger update	2018-06-25 22:01:02 +02:00
Matthew Honnibal	69c900f003	Fix init-model if no vectors provided	2018-06-25 18:26:02 +02:00
Matthew Honnibal	664f89327a	Fix init-model if no vectors provided	2018-06-25 17:58:45 +02:00
Matthew Honnibal	c4698f5712	Don't collate model unless training succeeds	2018-06-25 16:36:42 +02:00
Ole Henrik Skogstrøm	d16cb6bee6	Accept Span to displacy render (#2478 ) (closes #2477 ) * Add Span to displacy render * Fix span support, errors and add tests	2018-06-25 14:55:16 +02:00
Matthew Honnibal	24dfbb8a28	Fix model collation	2018-06-25 14:35:24 +02:00
Matthew Honnibal	62237755a4	Import shutil	2018-06-25 13:40:17 +02:00
Matthew Honnibal	a040fca99e	Import json into cli.train	2018-06-25 11:50:37 +02:00
Matthew Honnibal	2c703d99c2	Fix collation of best models	2018-06-25 01:21:34 +02:00
Matthew Honnibal	9d6a1c57f2	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-06-24 23:40:06 +02:00
Matthew Honnibal	2c80b7c013	Collate best model after training	2018-06-24 23:39:52 +02:00
Muhammad Irfan	f33c703066	Add Urdu Language Support (#2430 ) * added Urdu language support. * added Urdu language tests. * modified conftest.py for Urdu language support. * added spacy contributor agreement.	2018-06-22 11:14:03 +02:00
himkt	14d9007efd	fix wrong indexing (#2416 ) * fix wrong indexing * add agreement	2018-06-19 10:20:57 +02:00
Aliia E	428bae66b5	Add Tatar Language Support (#2444 ) * add Tatar lang support * add Tatar letters * add Tatar tests * sign contributor agreement * sign contributor agreement [x] * remove comments from Language class * remove all template comments	2018-06-19 10:17:53 +02:00
Cory Hurst	446f5ec41b	Silent keyword in info function in init (#2459 ) * Pass through "silent" kwarg to the wrapper in the spacy module init. reference issue #2196 * Pass through "silent" kwarg to the wrapper in the spacy module init. reference issue #2196 * contributor agreement	2018-06-18 12:24:21 +02:00
ines	778e5f4da3	Merge branch 'master' into develop	2018-06-11 00:38:04 +02:00
himkt	57311d5d47	replace janome with mecab in the documentation and the test (#2415 ) * Add links to Reddit data (see #2401) * replace janome with mecab in the documentation and the test * add the assignment	2018-06-11 00:33:13 +02:00
Nour Shalabi	a169b79092	Additions to Arabic stop words. (#2422 ) * Additions to Arabic stop words. * Create nourshalabi.md	2018-06-08 02:33:23 +02:00
ines	a0017e4909	Merge branch 'master' into develop	2018-05-30 14:10:47 +02:00
ines	b8ef9c1000	Fix model names in conftest (see #2379 )	2018-05-30 14:10:20 +02:00
ines	4a62486340	Merge branch 'master' into develop	2018-05-30 13:01:01 +02:00
Maciej	c7d53348d7	Fix bug in CLI iob and ner converter (#2392 ) (fixes #2385 ) * issue_2385 add tests for iob_to_biluo converter function * issue_2385 fix and modify iob_to_biluo function to accept either iob or biluo tags in cli.converter * issue_2385 add test to fix b char bug * add contributor agreement * fill contributor agreement	2018-05-30 12:28:44 +02:00
ines	3c3a175018	Merge branch 'master' into develop	2018-05-28 18:37:09 +02:00
ansgar-t	9732988951	escape html in displacy.render (#2378 ) (closes #2361 ) ## Description Fix for issue #2361 : replace &, <, >, " with &amp; , &lt; , &gt; , &quot; before rendering svg ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. (As discussed in the comments to #2361) - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-05-28 18:36:41 +02:00
ines	f7103babd9	Only overwrite warnings filter if set explicitly (resolves #2369 ) This way, pre-defined warning filters are respected and users are still able to use the fine-grained warning settings if they like.	2018-05-26 18:44:15 +02:00
ines	330c039106	Merge branch 'master' into develop	2018-05-26 18:30:52 +02:00
James Messinger	4515e96e90	Better formatting for `spacy train` CLI (#2357 ) * Better formatting for `spacy train` CLI Changed to use fixed-spaces rather than tabs to align table headers and data. ### Before: ``` Itn. P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token % 0 4618.857 2910.004 76.172 79.645 67.987 88.732 88.261 100.000 4436.9 6376.4 1 4671.972 3764.812 74.481 78.046 62.374 82.680 88.377 100.000 4672.2 6227.1 2 4742.756 3673.473 71.994 77.380 63.966 84.494 90.620 100.000 4298.0 5983.9 ``` ### After: ``` Itn. Dep Loss NER Loss UAS NER P. NER R. NER F. Tag % Token % CPU WPS GPU WPS 0 4618.857 2910.004 76.172 79.645 67.987 88.732 88.261 100.000 4436.9 6376.4 1 4671.972 3764.812 74.481 78.046 62.374 82.680 88.377 100.000 4672.2 6227.1 2 4742.756 3673.473 71.994 77.380 63.966 84.494 90.620 100.000 4298.0 5983.9 ``` * Added contributor file	2018-05-25 13:08:45 +02:00
Aristo Rinjuang	432ede04af	adding more words and rephrasing (#2351 ) * adding more words and rephrasing * adding a contributor * tokenizer bugs solved	2018-05-24 11:40:57 +02:00
Jani Monoses	ec62cadf4c	Updates to Romanian support (#2354 ) * Add back Romanian in conftest * Romanian lex_attr * More tokenizer exceptions for Romanian * Add tests for some Romanian tokenizer exceptions	2018-05-24 11:40:00 +02:00
Matthew Honnibal	5d281cf302	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-05-22 20:50:59 +02:00
Matthew Honnibal	ce458c2428	Fix spacy requirement constraint in package template	2018-05-22 20:50:46 +02:00
Ines Montani	862da5e793	Support pipeline factories via entry points (#2348 )	2018-05-22 18:29:45 +02:00
Matthew Honnibal	d5af38f80c	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-05-21 17:42:55 +02:00
Matthew Honnibal	ee33de8652	Fix unpickling of NER parser	2018-05-21 17:42:40 +02:00
ines	f9dbcac8e4	Merge branch 'master' into develop	2018-05-21 02:29:29 +02:00
cclauss	f7dcaa1f6b	Simplify is_config() and normalize_string_keys() (#2305 ) * Simplify is_config() and normalize_string_keys() * Use __in__ to avoid the nested _ands_ and _ors_. * Dict comprehension directly tracks with the doc string * Keep more basic loop in normalize_string_keys * Whitespace	2018-05-21 01:54:35 +02:00
Ines Montani	cae4457c38	💫 Add .similarity warnings for no vectors and option to exclude warnings (#2197 ) * Add logic to filter out warning IDs via environment variable Usage: SPACY_WARNING_EXCLUDE=W001,W007 * Add warnings for empty vectors * Add warning if no word vectors are used in .similarity methods For example, if only tensors are available in small models – should hopefully clear up some confusion around this * Capture warnings in tests * Rename SPACY_WARNING_EXCLUDE to SPACY_WARNING_IGNORE	2018-05-21 01:22:38 +02:00
Matthew Honnibal	b096b22c20	Merge pull request #2247 from skrcode/1480 1480 - Implement Fast-Text vectors with subword features	2018-05-21 01:16:21 +02:00
Matthew Honnibal	f3b4f6a4ec	Merge setup.py	2018-05-20 23:21:00 +02:00
Ines Montani	d4cc736b7c	💫 Improve model downloads: check for existing install, customise pip and use requests library again (#2346 ) * Go back to using requests instead of urllib (closes #2320) Fewer dependencies are good, but this one was simply causing too many other problems around SSL verification and Python 2/3 compatibility. requests is a popular enough package that it's okay for spaCy to depend on it – and this will hopefully make model downloads less flakey. * Only download model if not installed (see #1456) Use #egg=model==version to allow pip to check for existing installations. The download is only started if no installation matching the package/version is found. Fixes a long-standing inconvenience. * Pass additional options to pip when installing model (resolves #1456) Treat all additional arguments passed to the download command as pip options to allow user to customise the command. For example: python -m spacy download en --user * Add CLI option to enable installing model package dependencies * Revert "Add CLI option to enable installing model package dependencies" This reverts commit `9336ffe695`. * Update documentation	2018-05-20 20:26:56 +02:00
Matthew Honnibal	3eb446e0a5	Require thinc 6.11.1 and prepare for release to spacy-nightly	2018-05-20 19:00:34 +02:00
Matthew Honnibal	bdc23dd8c1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-05-20 18:59:24 +02:00
ines	5401c55c75	Merge branch 'master' into develop	2018-05-20 16:49:40 +02:00
ines	b59e3b157f	Don't require attrs argument in Doc.retokenize and allow both ints and unicode (resolves #2304 )	2018-05-20 15:15:37 +02:00
ines	5768df4f09	Add SimpleFrozenDict util to use as default function argument	2018-05-20 15:13:37 +02:00
Matthew Honnibal	7431e9c87f	Fix parser for GPU	2018-05-19 17:24:34 +00:00
Matthew Honnibal	401213fb1f	Only warn about unnamed vectors if non-zero sized.	2018-05-19 18:51:55 +02:00
Matthew Honnibal	74d5c625b3	Use rising beam update prob	2018-05-16 20:11:59 +02:00
Matthew Honnibal	544ae7f1db	Merge branch 'develop' into feature/refactor-parser	2018-05-16 02:06:49 +02:00
Matthew Honnibal	d1b27fe5aa	Revert "Improve dynamic oracle when values are missing in parse" This reverts commit `f56bd4736b`.	2018-05-16 00:31:52 +02:00
Matthew Honnibal	83acaa0358	Add missing name attribute for parser	2018-05-15 19:01:53 +02:00
Matthew Honnibal	f328c195ca	Fix size limits in training data	2018-05-15 19:01:41 +02:00
Matthew Honnibal	8446b35ce0	Fix parser model loading	2018-05-15 18:43:46 +02:00
Matthew Honnibal	dc1a479fbd	Merge branch 'develop' into feature/refactor-parser	2018-05-15 18:39:21 +02:00
Matthew Honnibal	546dd99cdf	Merge master into develop -- mostly Arabic and website	2018-05-15 18:14:28 +02:00
Matthew Honnibal	5664ab7e6c	Revert hacks to tests	2018-05-15 18:00:09 +02:00
Matthew Honnibal	7b9195657b	Restore beam_density argument for parser beam	2018-05-15 17:55:11 +02:00
Matthew Honnibal	581d318971	Fix conftest	2018-05-15 00:54:45 +02:00
Tahar Zanouda	00417794d3	Add Arabic language (#2314 ) * added support for Arabic lang * added Arabic language support * updated conftest	2018-05-15 00:27:19 +02:00
Jani Monoses	0e08e49e87	Lemmatizer ro (#2319 ) * Add Romanian lemmatizer lookup table. Adapted from http://www.lexiconista.com/datasets/lemmatization/ by replacing cedillas with commas (ș and ț). The original dataset is licensed under the Open Database License. * Fix one blatant issue in the Romanian lemmatizer * Romanian examples file * Add ro_tokenizer in conftest * Add Romanian lemmatizer test	2018-05-12 15:20:04 +02:00
Matthew Honnibal	887631ca25	Disable some tests to figure out why CI fails	2018-05-10 16:42:01 +02:00
Matthew Honnibal	902a172cb7	Disable some tests to figure out why CI fails	2018-05-10 16:30:07 +02:00
Matthew Honnibal	614d45ea58	Set a more aggressive threshold on the max violn update	2018-05-10 15:38:24 +02:00
Matthew Honnibal	8e8724b55b	Default to beam_update_prob 1	2018-05-10 15:38:02 +02:00
Jani Monoses	42b34832e4	Update Romanian stopword list (#2316 ) * Contributor agreement for janimo * Update Romanian stopword list Include the correct spellings of all the words already in the repo that are using cedillas (ş and ţ) instead of commas (ș and ț). Add another unrelated spelling fix. See https://github.com/stopwords-iso/stopwords-ro/pull/1 and https://github.com/stopwords-iso/stopwords-ro/pull/2	2018-05-10 12:16:56 +02:00
Lucas Abbade	be7fdc59d1	Update lex_attrs.py (#2307 ) * Update lex_attrs.py Fixed spelling mistakes of some numbers (according to Brazilian Portuguese). * Update lex_attrs.py As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese. I will advise however, that the two are separated in the future. Brazilian Portuguese is a very different language from the original one, although most of the writing is unified, the way people talk in both countries is radically different. Keeping both languages as one may lead to bigger issues in the future, especially when it comes to spell checking.	2018-05-09 20:49:31 +02:00
mauryaland	5368ba028a	Update stop_words.py for French language (#2310 ) * Add contraction forms of some common stopwords All the stopwords added contain the apostrophe" ' "or " ’ ". * Adds contributor agreement mauryaland * Update mauryaland.md	2018-05-09 12:04:38 +02:00
Matthew Honnibal	a61fd60681	Fix error in beam gradient calculation	2018-05-09 02:44:09 +02:00
Matthew Honnibal	a6ae1ee6f7	Don't modify Token in global scope	2018-05-09 00:43:00 +02:00
Matthew Honnibal	f94f721f40	Avoid importing fused token symbol in ud-run-test, until that's added	2018-05-09 00:28:03 +02:00
Matthew Honnibal	659ec5b975	Avoid importing fused token symbol in ud-run-test, until that's added	2018-05-08 19:40:33 +02:00
Matthew Honnibal	4cb0494bef	Bug fixes to beam search after refactor	2018-05-08 13:48:50 +02:00
Matthew Honnibal	5ed71973b3	Add a keyword argument sink to GoldParse	2018-05-08 13:48:32 +02:00
Matthew Honnibal	8cfe326f87	Avoid relying on final gold check in beam search	2018-05-08 13:48:19 +02:00
Matthew Honnibal	fc4dd49b77	Support oracle segmentation in ud-train CLI command	2018-05-08 13:47:45 +02:00
Matthew Honnibal	c49e44349a	Fix beam parsing	2018-05-08 02:53:24 +02:00
Matthew Honnibal	99649d114d	Fix parser	2018-05-08 00:27:26 +02:00
Matthew Honnibal	8a82367a9d	Fix beam search after refactor	2018-05-08 00:20:33 +02:00
Matthew Honnibal	5a0f26be0c	Readd beam search after refactor	2018-05-08 00:19:52 +02:00
ines	7a3599c21a	Fix formatting and consistency	2018-05-07 23:02:11 +02:00
Matthew Honnibal	36b2c9bdd5	Fix refactored parser	2018-05-07 18:58:09 +02:00
Matthew Honnibal	bde3be1ad1	Fix refactored parser	2018-05-07 18:31:04 +02:00
Matthew Honnibal	01c4e13b02	Update test	2018-05-07 16:59:52 +02:00
Matthew Honnibal	f6cdafc00e	Fix refactored parser	2018-05-07 16:59:38 +02:00
Matthew Honnibal	f56bd4736b	Improve dynamic oracle when values are missing in parse	2018-05-07 15:53:18 +02:00
Matthew Honnibal	eddc0e0c74	Set gold.sent_starts in ud_train	2018-05-07 15:52:47 +02:00
Matthew Honnibal	bf19f22340	Allow gold.sent_starts to be set from Python	2018-05-07 15:51:34 +02:00
Matthew Honnibal	7f163442e6	Work on refactoring greedy parser	2018-05-07 15:45:52 +02:00
Douglas Knox	9b49a40f4e	Test and fix for Issue #2219 (#2272 ) Test and fix for Issue #2219: Token.similarity() failed if single letter	2018-05-03 18:40:46 +02:00
Paul O'Leary McCann	bd72fbf09c	Port Japanese mecab tokenizer from v1 (#2036 ) * Port Japanese mecab tokenizer from v1 This brings the Mecab-based Japanese tokenization introduced in #1246 to spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag information from Mecab is stored in a token extension. A tag map is also included. As a reminder, Mecab is required because Universal Dependencies are based on Unidic tags, and Janome doesn't support Unidic. Things to check: 1. Is this the right way to use a token extension? 2. What's the right way to implement a JapaneseTagger? The approach in #1246 relied on `tag_from_strings` which is just gone now. I guess the best thing is to just try training spaCy's default Tagger? -POLM * Add tagging/make_doc and tests	2018-05-03 18:38:26 +02:00
G.Pruvost	cc8e804648	#2211 - Support for ssl certs config on download command (#2212 ) * Add support for SSL/Certs customization on download CLI * Add a note on SSL options for the 'download' CLI in the README * Add contributor agreement	2018-05-03 18:37:02 +02:00
Jens Dahl Møllerhøj	b9290397fb	rename SP to _SP (#2289 )	2018-05-03 18:33:49 +02:00
Matthew Honnibal	a8e70a4187	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-05-03 14:02:10 +02:00
Matthew Honnibal	c0e596283b	Set version to 2.1.0a0	2018-05-03 14:00:11 +02:00
Matthew Honnibal	8cd06cc763	Try to fix root-outside-sentence bug	2018-05-02 14:39:48 +00:00
Matthew Honnibal	acebd01033	Set children from heads in finalize doc	2018-05-02 14:19:22 +00:00
Matthew Honnibal	569440a6db	Don't normalize gradient by batch size	2018-05-02 08:42:10 +02:00
Matthew Honnibal	281e29cbcd	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-05-02 01:36:23 +00:00
Matthew Honnibal	2338e8c7fc	Update develop from master	2018-05-02 01:36:12 +00:00
Matthew Honnibal	9d147e12c4	Merge remote-tracking branch 'origin/master' into develop	2018-05-01 18:18:51 +02:00
Matthew Honnibal	6d0fe67b72	Constrain subtok label to adjacent tokens	2018-05-01 17:34:27 +02:00
Matthew Honnibal	8f21953fc5	Constrain subtok to adjacent words	2018-05-01 17:29:00 +02:00
Matthew Honnibal	b43bfd3524	Fix arc-eager oracle tests	2018-05-01 16:16:14 +02:00
Matthew Honnibal	31ed64e9b0	Fix textcat test	2018-05-01 15:18:39 +02:00
Matthew Honnibal	548bdff943	Update default Adam settings	2018-05-01 15:18:20 +02:00
Matthew Honnibal	adbb1f7533	Add better arc-eager oracle tests	2018-05-01 15:14:55 +02:00
Matthew Honnibal	697bcaa34f	Add some methods to ArcEager that make testing easier	2018-05-01 15:13:14 +02:00
Mr Roboto	6f5ccda19c	Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False (#2230 ) * Fixes issue #2228 * Adds a new contributor	2018-05-01 13:40:22 +02:00
Matthew Honnibal	d44bb45c72	Fix scoring if tokenization changes	2018-05-01 01:33:20 +02:00
Matthew Honnibal	2b26c007cd	Revert "Disable batch size compounding in ud-train" This reverts commit `8a120fb455`.	2018-04-29 14:09:02 +00:00
Matthew Honnibal	723b328062	Add script to run UD test	2018-04-29 15:50:25 +02:00
Matthew Honnibal	17af6aa3a4	Update ud_train script	2018-04-29 15:49:32 +02:00
Matthew Honnibal	5de8a36537	Fix arc_eager is_nonproj_tree	2018-04-29 15:49:11 +02:00
Matthew Honnibal	5260268f70	Fix textcat after merge	2018-04-29 15:48:53 +02:00
Matthew Honnibal	ad3d56c3ba	Fix compile error in matcher	2018-04-29 15:48:34 +02:00
Matthew Honnibal	a8bc947fd4	Fix Token.set_extension	2018-04-29 15:48:19 +02:00
Matthew Honnibal	2c4a6d66fa	Merge master into develop. Big merge, many conflicts -- need to review	2018-04-29 14:49:26 +02:00
ines	3c80f69ff5	Return data in cli.info and add silent option (resolves #2196 )	2018-04-29 01:59:44 +02:00
ines	1c6d77610c	Add remove_extension method on Doc, Token and Span (closes #2242 )	2018-04-28 23:33:09 +02:00
ines	abdb853ebf	Simplify underscore tests	2018-04-28 23:30:33 +02:00
ines	6fb6371670	Add collapse_phrases option to displacy (closes #2266 )	2018-04-28 23:06:50 +02:00
Robin Linderborg	1f9904ef12	fixes #2238 (#2241 ) * Remove erroneous lemma lookup år > åra in Swedish * Add contributors agreement * Add contrib agreement to correct directory * Revert change to CONTRIBUTOR_AGREEMENT	2018-04-28 14:55:22 +02:00
Robin Linderborg	d01f503b54	Remove incorrect lemma lookup gäng->gänga (#2252 ) * Remove incorrect lemma lookup gäng->gänga In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread". * Add contrib agreement to correct directory * Revert change to CONTRIBUTOR_AGREEMENT	2018-04-28 14:54:41 +02:00
Suraj Krishnan Rajan	69d041148f	Implement Fast-Text vectors with subword features	2018-04-21 01:34:14 +05:30
ines	686225eadd	Fix Spanish noun_chunks (resolves #2210 ) Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets	2018-04-18 18:44:01 -04:00
ines	9632595fb4	Use correct, non-deprecated merge syntax (resolves #2226 )	2018-04-18 18:28:28 -04:00
Suraj Rajan	5957f15227	Fixed typos for #2222,#2223 (#2233 ) (closes #2222 , closes #2223 )	2018-04-18 14:55:26 -07:00
Matthew Honnibal	97851d2c4e	Increment version to v2.0.12.dev0	2018-04-10 22:20:16 +02:00
Matthew Honnibal	ed39c75a92	Merge branch 'master' of https://github.com/explosion/spaCy	2018-04-10 22:19:40 +02:00
Matthew Honnibal	3836199a83	Fix loading of models when custom vectors are added	2018-04-10 22:19:20 +02:00
ines	0299d5fac8	Update argument annotations and formatting	2018-04-10 21:45:11 +02:00
ines	49b1e48bf5	Fix syntax error	2018-04-10 21:44:59 +02:00
ines	70052e46e9	Fix formatting [ci skip]	2018-04-10 21:42:46 +02:00
Matthew Honnibal	0ddb152be0	Improve error message when reading vectors	2018-04-10 21:26:50 +02:00
Matthew Honnibal	db50ac524e	Support zipped vector files in init-model	2018-04-10 21:21:00 +02:00
ines	270fcfd925	Fix typo in package command message (closes #2200 )	2018-04-10 19:14:31 +02:00
ines	24d8bf348d	Revert "Add support for .zip to init_model" This reverts commit `7ee880a0ad`.	2018-04-10 19:08:06 +02:00
Matthew Honnibal	7ee880a0ad	Add support for .zip to init_model	2018-04-10 14:30:04 +00:00
ines	5ecb274764	Fix indentation error and set Doc.is_tagged correctly	2018-04-10 16:14:52 +02:00
ines	987ee27af7	Return Doc in noun chunks merger component if Doc is not parsed	2018-04-09 14:51:02 +02:00
Xiaoquan Kong	e2f13ec722	bugfix: `Doc.noun_chunks` call `Doc.noun_chunks_iterator` without checking (closes #2194 )	2018-04-08 23:44:05 +02:00
Jens Dahl Møllerhøj	e5055e3cf6	Add Danish lemmatizer (#2184 ) * add danish lemmatizer * fill contributor agreement	2018-04-07 19:07:28 +02:00
ines	bccbf538ef	Revert "Check if spaCy has compiled correctly and show error message" This reverts commit `3463ded7cf`.	2018-04-06 15:49:44 +02:00
ines	fb4eda6616	Merge branch 'master' of https://github.com/explosion/spaCy	2018-04-06 00:38:48 +02:00
Matthew Honnibal	0c7fab4443	Set version to 2.0.11	2018-04-04 11:19:11 +02:00
Matthew Honnibal	a350be0601	Fix vector-name loading fix	2018-04-04 01:31:25 +02:00
Matthew Honnibal	21047bde52	Fix syntax error in italian lemmatizer	2018-04-03 23:13:22 +02:00
Matthew Honnibal	81f4005f3d	Fix loading models with pretrained vectors	2018-04-03 23:11:48 +02:00
ines	3463ded7cf	Check if spaCy has compiled correctly and show error message	2018-04-03 22:18:47 +02:00
Matthew Honnibal	96b612873b	Add hyper-parameter to control whether parser makes a beam update	2018-04-03 22:02:56 +02:00
ines	e5f47cd82d	Update errors	2018-04-03 21:40:29 +02:00
Matthew Honnibal	f7e6313b43	Increment version to v2.0.11.dev0	2018-04-03 20:58:47 +02:00
ines	10462816bc	Fix tests for Python 2	2018-04-03 18:51:31 +02:00
ines	62b4b527d7	Don't raise error if set_extension has getter and setter (closes #2177 ) Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.	2018-04-03 18:30:17 +02:00
ines	ee3082ad29	Fix whitespace	2018-04-03 18:29:53 +02:00
Ines Montani	3141e04822	💫 New system for error messages and warnings (#2163 ) * Add spacy.errors module * Update deprecation and user warnings * Replace errors and asserts with new error message system * Remove redundant asserts * Fix whitespace * Add messages for print/util.prints statements * Fix typo * Fix typos * Move CLI messages to spacy.cli._messages * Add decorator to display error code with message An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc. * Remove unused link in spacy.about * Update errors for invalid pipeline components * Improve error for unknown factories * Add displaCy warnings * Update formatting consistency * Move error message to spacy.errors * Update errors and check if doc returned by component is None	2018-04-03 15:50:31 +02:00
Matthew Honnibal	abf8b16d71	Add doc.retokenize() context manager (#2172 ) This patch takes a step towards #1487 by introducing the doc.retokenize() context manager, to handle merging spans, and soon splitting tokens. The idea is to do merging and splitting like this: with doc.retokenize() as retokenizer: for start, end, label in matches: retokenizer.merge(doc[start : end], attrs={'ent_type': label}) The retokenizer accumulates the merge requests, and applies them together at the end of the block. This will allow retokenization to be more efficient, and much less error prone. A retokenizer.split() function will then be added, to handle splitting a single token into multiple tokens. These methods take `Span` and `Token` objects; if the user wants to go directly from offsets, they can append to the .merges and .splits lists on the retokenizer. The doc.merge() method's behaviour remains unchanged, so this patch should be 100% backwards compatible (modulo bugs). Internally, doc.merge() fixes up the arguments (to handle the various deprecated styles), opens the retokenizer, and makes the single merge. We can later start making deprecation warnings on direct calls to doc.merge(), to migrate people to use of the retokenize context manager.	2018-04-03 14:10:35 +02:00
Matthew Honnibal	8a120fb455	Disable batch size compounding in ud-train	2018-04-01 08:45:00 +00:00
Matthew Honnibal	98165e43a7	Sometimes update beam with greedy oracle	2018-04-01 08:44:35 +00:00
Suraj Rajan	1cdbb7c97c	[2032] - Changed python set to cpp stl set (#2170 ) Changed python set to cpp stl set #2032 ## Description Changed python set to cpp stl set. CPP stl set works better due to the logarithmic run time of its methods. Finding minimum in the cpp set is done in constant time as opposed to the worst case linear runtime of python set. Operations such as find,count,insert,delete are also done in either constant and logarithmic time thus making cpp set a better option to manage vectors. Reference : http://www.cplusplus.com/reference/set/set/ ### Types of change Enhancement for `Vectors` for faster initialising of word vectors(fasttext)	2018-03-31 13:28:25 +02:00
Matthew Honnibal	f3b7c5e537	Fix syntax error	2018-03-29 21:50:32 +02:00
Matthew Honnibal	23afa6429f	Add input length error, to address #1826	2018-03-29 21:45:26 +02:00
Ines Montani	a609a1ca29	Merge pull request #2152 from explosion/feature/tidy-up-dependencies 💫 Tidy up dependencies	2018-03-29 14:35:09 +02:00
Viet Trung Tran	ea2af94cd9	Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155 ) * support for Vietnamese * Contributor Agreement for adding Vietnamese support on spaCy	2018-03-29 12:19:51 +02:00
ines	e6979bdbbd	Merge branch 'feature/tidy-up-dependencies' of https://github.com/explosion/spaCy into feature/tidy-up-dependencies	2018-03-29 00:19:37 +02:00
ines	83146458a2	Fix urllib for Python 3	2018-03-29 00:19:33 +02:00
Matthew Honnibal	8308bbc617	Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts	2018-03-29 00:14:55 +02:00
Matthew Honnibal	b5098079d8	Fix error on urllib	2018-03-29 00:08:16 +02:00
Ines Montani	0de599b16b	Merge pull request #2159 from explosion/feature/fix-merged-entity-iob (resolves #1554 , resolves #1752 ) 💫 Fix token.ent_iob after doc.merge(), and ensure consistency in doc.ents	2018-03-28 23:10:00 +02:00
Ines Montani	98e9cda677	Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660 ) 💫 Fix loading of multiple vector models	2018-03-28 23:08:24 +02:00
Matthew Honnibal	a7c5ae2beb	Avoid forcing a name on empty vectors, and remove print statement	2018-03-28 21:08:58 +02:00
ines	3eb67bbe4b	Allow entity types with dashes (resolves #1967 )	2018-03-28 20:51:26 +02:00
Matthew Honnibal	cf5fcf0546	Update serialization test	2018-03-28 20:12:53 +02:00
Matthew Honnibal	4555e3e251	Don't assume pretrained_vectors cfg set in build_tagger	2018-03-28 20:12:45 +02:00
Matthew Honnibal	0b375d50c8	Fix ent_iob tags in doc.merge to avoid inconsistent sequences	2018-03-28 18:39:03 +02:00
Matthew Honnibal	95fa89c4b8	Update doc.ents test	2018-03-28 18:39:03 +02:00
Matthew Honnibal	e807f88410	Resolve merge when cherry-picking ent iob patches from develop	2018-03-28 18:38:13 +02:00
Matthew Honnibal	99fbc7db33	Improve error message when entity sequence is inconsistent	2018-03-28 18:36:53 +02:00
Matthew Honnibal	cbd2794be0	Add test for ent_iob during span merge	2018-03-28 18:36:53 +02:00
Matthew Honnibal	f8dd905a24	Warn and fallback if vectors have no name	2018-03-28 18:24:53 +02:00
Matthew Honnibal	fd9e259414	Add test for #1660	2018-03-28 18:22:51 +02:00
Matthew Honnibal	bc4afa9881	Remove print statement	2018-03-28 17:48:37 +02:00
Matthew Honnibal	79dc241caa	Set pretrained_vectors in parser cfg	2018-03-28 17:35:07 +02:00
Matthew Honnibal	17c3e7efa2	Add message noting vectors	2018-03-28 16:33:43 +02:00
Matthew Honnibal	9bf6e93b3e	Set pretrained_vectors in begin_training	2018-03-28 16:32:41 +02:00
Matthew Honnibal	95a9615221	Fix loading of multiple pre-trained vectors This patch addresses #1660, which was caused by keying all pre-trained vectors with the same ID when telling Thinc how to refer to them. This meant that if multiple models were loaded that had pre-trained vectors, errors or incorrect behaviour resulted. The vectors class now includes a .name attribute, which defaults to: {nlp.meta['lang']_nlp.meta['name']}.vectors The vectors name is set in the cfg of the pipeline components under the key pretrained_vectors. This replaces the previous cfg key pretrained_dims. In order to make existing models compatible with this change, we check for the pretrained_dims key when loading models in from_disk and from_bytes, and add the cfg key pretrained_vectors if we find it.	2018-03-28 16:02:59 +02:00
ines	7fbc9e5874	Replace requests with urllib	2018-03-28 12:46:07 +02:00
ines	da1f200362	Add compat helpers for urllib	2018-03-28 12:45:53 +02:00
ines	ac88c72c9a	Fix ftfy workaround and remove old import	2018-03-28 12:14:28 +02:00
ines	ce6071ca89	Remove ftfy dependency and update docs	2018-03-28 12:09:42 +02:00
Matthew Honnibal	070b6c6495	Remove dependency on ftfy	2018-03-28 12:07:02 +02:00
ines	6d2c85f428	Drop six and related hacks as a dependency	2018-03-28 10:45:25 +02:00
ines	9e83513004	Add position of invalid token to error message	2018-03-27 23:56:59 +02:00
ines	11c4735ccf	Fix issue in Italian lemmatizer data (resolves #2050 )	2018-03-27 23:55:22 +02:00
Matthew Honnibal	6a961928b2	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-03-27 21:01:48 +00:00
Matthew Honnibal	b7136cb094	Support zipped vector files in init-model	2018-03-27 21:01:18 +00:00
ines	693971dd8f	Improve error message if token text is empty string (see #2101 )	2018-03-27 22:25:40 +02:00
ines	0c829e6605	Fix whitespace	2018-03-27 22:20:59 +02:00
Matthew Honnibal	de9fd091ac	Fix #2014 : token.pos_ not writeable	2018-03-27 21:21:11 +02:00
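
The message of commit abf8b16d71 above describes a retokenizer that accumulates merge requests and applies them together at the end of the `with` block. The following is a minimal sketch of that accumulate-then-apply pattern on a plain Python list of strings. It is not spaCy's implementation: `ToyRetokenizer`, the module-level `retokenize()` function and the `(start, end)` offset interface are hypothetical names chosen for illustration, and the sketch assumes the queued ranges do not overlap.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical stand-in for a retokenizer; not spaCy's implementation."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.merges = []  # queued (start, end) half-open ranges

    def merge(self, start, end):
        # Only record the request; the token list is not modified yet.
        self.merges.append((start, end))

    def apply(self):
        # Apply from right to left so that earlier offsets stay valid.
        # Assumption: the queued ranges do not overlap.
        for start, end in sorted(self.merges, reverse=True):
            self.tokens[start:end] = [" ".join(self.tokens[start:end])]
        self.merges.clear()


@contextmanager
def retokenize(tokens):
    retokenizer = ToyRetokenizer(tokens)
    yield retokenizer
    # Reached only if the block finished without raising.
    retokenizer.apply()


tokens = ["I", "like", "New", "York", "and", "San", "Francisco"]
with retokenize(tokens) as retokenizer:
    retokenizer.merge(2, 4)  # "New York"
    retokenizer.merge(5, 7)  # "San Francisco"

print(tokens)
```

Because both merges are deferred until the block exits, the offsets passed to `merge()` all refer to the original token positions, which is one reason batching can be less error prone than merging one span at a time.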

5399 Commits