spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-11 04:08:09 +03:00

Author	SHA1	Message	Date
himkt	14d9007efd	fix wrong indexing (#2416 ) * fix wrong indexing * add agreement	2018-06-19 10:20:57 +02:00
Aliia E	428bae66b5	Add Tatar Language Support (#2444 ) * add Tatar lang support * add Tatar letters * add Tatar tests * sign contributor agreement * sign contributor agreement [x] * remove comments from Language class * remove all template comments	2018-06-19 10:17:53 +02:00
Cory Hurst	446f5ec41b	Silent keyword in info function in init (#2459 ) * Pass through "silent" kwarg to the wrapper in the spacy module init. reference issue #2196 * Pass through "silent" kwarg to the wrapper in the spacy module init. reference issue #2196 * contributor agreement	2018-06-18 12:24:21 +02:00
Nour Shalabi	a169b79092	Additions to Arabic stop words. (#2422 ) * Additions to Arabic stop words. * Create nourshalabi.md	2018-06-08 02:33:23 +02:00
ines	b8ef9c1000	Fix model names in conftest (see #2379 )	2018-05-30 14:10:20 +02:00
Maciej	c7d53348d7	Fix bug in CLI iob and ner converter (#2392 ) (fixes #2385 ) * issue_2385 add tests for iob_to_biluo converter function * issue_2385 fix and modify iob_to_biluo function to accept either iob or biluo tags in cli.converter * issue_2385 add test to fix b char bug * add contributor agreement * fill contributor agreement	2018-05-30 12:28:44 +02:00
ansgar-t	9732988951	escape html in displacy.render (#2378 ) (closes #2361 ) ## Description Fix for issue #2361 : replace &, <, >, " with &amp; , &lt; , &gt; , &quot; before rendering svg ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [ ] I ran the tests, and all new and existing tests passed. (As discussed in the comments to #2361) - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-05-28 18:36:41 +02:00
James Messinger	4515e96e90	Better formatting for `spacy train` CLI (#2357 ) * Better formatting for `spacy train` CLI Changed to use fixed-spaces rather than tabs to align table headers and data. ### Before: ``` Itn. P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token % 0 4618.857 2910.004 76.172 79.645 67.987 88.732 88.261 100.000 4436.9 6376.4 1 4671.972 3764.812 74.481 78.046 62.374 82.680 88.377 100.000 4672.2 6227.1 2 4742.756 3673.473 71.994 77.380 63.966 84.494 90.620 100.000 4298.0 5983.9 ``` ### After: ``` Itn. Dep Loss NER Loss UAS NER P. NER R. NER F. Tag % Token % CPU WPS GPU WPS 0 4618.857 2910.004 76.172 79.645 67.987 88.732 88.261 100.000 4436.9 6376.4 1 4671.972 3764.812 74.481 78.046 62.374 82.680 88.377 100.000 4672.2 6227.1 2 4742.756 3673.473 71.994 77.380 63.966 84.494 90.620 100.000 4298.0 5983.9 ``` * Added contributor file	2018-05-25 13:08:45 +02:00
Aristo Rinjuang	432ede04af	adding more words and rephrasing (#2351 ) * adding more words and rephrasing * adding a contributor * tokenizer bugs solved	2018-05-24 11:40:57 +02:00
Jani Monoses	ec62cadf4c	Updates to Romanian support (#2354 ) * Add back Romanian in conftest * Romanian lex_attr * More tokenizer exceptions for Romanian * Add tests for some Romanian tokenizer exceptions	2018-05-24 11:40:00 +02:00
cclauss	f7dcaa1f6b	Simplify is_config() and normalize_string_keys() (#2305 ) * Simplify is_config() and normalize_string_keys() * Use __in__ to avoid the nested _ands_ and _ors_. * Dict comprehension directly tracks with the doc string * Keep more basic loop in normalize_string_keys * Whitespace	2018-05-21 01:54:35 +02:00
Ines Montani	d4cc736b7c	💫 Improve model downloads: check for existing install, customise pip and use requests library again (#2346 ) * Go back to using requests instead of urllib (closes #2320) Fewer dependencies are good, but this one was simply causing too many other problems around SSL verification and Python 2/3 compatibility. requests is a popular enough package that it's okay for spaCy to depend on it – and this will hopefully make model downloads less flakey. * Only download model if not installed (see #1456) Use #egg=model==version to allow pip to check for existing installations. The download is only started if no installation matching the package/version is found. Fixes a long-standing inconvenience. * Pass additional options to pip when installing model (resolves #1456) Treat all additional arguments passed to the download command as pip options to allow user to customise the command. For example: python -m spacy download en --user * Add CLI option to enable installing model package dependencies * Revert "Add CLI option to enable installing model package dependencies" This reverts commit `9336ffe695`. * Update documentation	2018-05-20 20:26:56 +02:00
ines	b59e3b157f	Don't require attrs argument in Doc.retokenize and allow both ints and unicode (resolves #2304 )	2018-05-20 15:15:37 +02:00
ines	5768df4f09	Add SimpleFrozenDict util to use as default function argument	2018-05-20 15:13:37 +02:00
Matthew Honnibal	581d318971	Fix conftest	2018-05-15 00:54:45 +02:00
Tahar Zanouda	00417794d3	Add Arabic language (#2314 ) * added support for Arabic lang * added Arabic language support * updated conftest	2018-05-15 00:27:19 +02:00
Jani Monoses	0e08e49e87	Lemmatizer ro (#2319 ) * Add Romanian lemmatizer lookup table. Adapted from http://www.lexiconista.com/datasets/lemmatization/ by replacing cedillas with commas (ș and ț). The original dataset is licensed under the Open Database License. * Fix one blatant issue in the Romanian lemmatizer * Romanian examples file * Add ro_tokenizer in conftest * Add Romanian lemmatizer test	2018-05-12 15:20:04 +02:00
Jani Monoses	42b34832e4	Update Romanian stopword list (#2316 ) * Contributor agreement for janimo * Update Romanian stopword list Include the correct spellings of all the words already in the repo that are using cedillas (ş and ţ) instead of commas (ș and ț). Add another unrelated spelling fix. See https://github.com/stopwords-iso/stopwords-ro/pull/1 and https://github.com/stopwords-iso/stopwords-ro/pull/2	2018-05-10 12:16:56 +02:00
Lucas Abbade	be7fdc59d1	Update lex_attrs.py (#2307 ) * Update lex_attrs.py Fixed spelling mistakes of some numbers (according to Brazilian Portuguese). * Update lex_attrs.py As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese. I will advise however, that the two are separated in the future. Brazilian Portuguese is a very different language from the original one, although most of the writing is unified, the way people talk in both countries is radically different. Keeping both languages as one may lead to bigger issues in the future, especially when it comes to spell checking.	2018-05-09 20:49:31 +02:00
mauryaland	5368ba028a	Update stop_words.py for French language (#2310 ) * Add contraction forms of some common stopwords All the stopwords added contain the apostrophe" ' "or " ’ ". * Adds contributor agreement mauryaland * Update mauryaland.md	2018-05-09 12:04:38 +02:00
ines	7a3599c21a	Fix formatting and consistency	2018-05-07 23:02:11 +02:00
Douglas Knox	9b49a40f4e	Test and fix for Issue #2219 (#2272 ) Test and fix for Issue #2219: Token.similarity() failed if single letter	2018-05-03 18:40:46 +02:00
Paul O'Leary McCann	bd72fbf09c	Port Japanese mecab tokenizer from v1 (#2036 ) * Port Japanese mecab tokenizer from v1 This brings the Mecab-based Japanese tokenization introduced in #1246 to spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag information from Mecab is stored in a token extension. A tag map is also included. As a reminder, Mecab is required because Universal Dependencies are based on Unidic tags, and Janome doesn't support Unidic. Things to check: 1. Is this the right way to use a token extension? 2. What's the right way to implement a JapaneseTagger? The approach in #1246 relied on `tag_from_strings` which is just gone now. I guess the best thing is to just try training spaCy's default Tagger? -POLM * Add tagging/make_doc and tests	2018-05-03 18:38:26 +02:00
G.Pruvost	cc8e804648	#2211 - Support for ssl certs config on download command (#2212 ) * Add support for SSL/Certs customization on download CLI * Add a note on SSL options for the 'download' CLI in the README * Add contributor agreement	2018-05-03 18:37:02 +02:00
Jens Dahl Møllerhøj	b9290397fb	rename SP to _SP (#2289 )	2018-05-03 18:33:49 +02:00
Mr Roboto	6f5ccda19c	Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False (#2230 ) * Fixes issue #2228 * Adds a new contributor	2018-05-01 13:40:22 +02:00
ines	3c80f69ff5	Return data in cli.info and add silent option (resolves #2196 )	2018-04-29 01:59:44 +02:00
ines	1c6d77610c	Add remove_extension method on Doc, Token and Span (closes #2242 )	2018-04-28 23:33:09 +02:00
ines	abdb853ebf	Simplify underscore tests	2018-04-28 23:30:33 +02:00
ines	6fb6371670	Add collapse_phrases option to displacy (closes #2266 )	2018-04-28 23:06:50 +02:00
Robin Linderborg	1f9904ef12	fixes #2238 (#2241 ) * Remove erroneous lemma lookup år > åra in Swedish * Add contributors agreement * Add contrib agreement to correct directory * Revert change to CONTRIBUTOR_AGREEMENT	2018-04-28 14:55:22 +02:00
Robin Linderborg	d01f503b54	Remove incorrect lemma lookup gäng->gänga (#2252 ) * Remove incorrect lemma lookup gäng->gänga In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread". * Add contrib agreement to correct directory * Revert change to CONTRIBUTOR_AGREEMENT	2018-04-28 14:54:41 +02:00
ines	686225eadd	Fix Spanish noun_chunks (resolves #2210 ) Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets	2018-04-18 18:44:01 -04:00
ines	9632595fb4	Use correct, non-deprecated merge syntax (resolves #2226 )	2018-04-18 18:28:28 -04:00
Suraj Rajan	5957f15227	Fixed typos for #2222,#2223 (#2233 ) (closes #2222 , closes #2223 )	2018-04-18 14:55:26 -07:00
Matthew Honnibal	97851d2c4e	Increment version to v2.0.12.dev0	2018-04-10 22:20:16 +02:00
Matthew Honnibal	ed39c75a92	Merge branch 'master' of https://github.com/explosion/spaCy	2018-04-10 22:19:40 +02:00
Matthew Honnibal	3836199a83	Fix loading of models when custom vectors are added	2018-04-10 22:19:20 +02:00
ines	0299d5fac8	Update argument annotations and formatting	2018-04-10 21:45:11 +02:00
ines	49b1e48bf5	Fix syntax error	2018-04-10 21:44:59 +02:00
ines	70052e46e9	Fix formatting [ci skip]	2018-04-10 21:42:46 +02:00
Matthew Honnibal	0ddb152be0	Improve error message when reading vectors	2018-04-10 21:26:50 +02:00
Matthew Honnibal	db50ac524e	Support zipped vector files in init-model	2018-04-10 21:21:00 +02:00
ines	270fcfd925	Fix typo in package command message (closes #2200 )	2018-04-10 19:14:31 +02:00
ines	24d8bf348d	Revert "Add support for .zip to init_model" This reverts commit `7ee880a0ad`.	2018-04-10 19:08:06 +02:00
Matthew Honnibal	7ee880a0ad	Add support for .zip to init_model	2018-04-10 14:30:04 +00:00
ines	5ecb274764	Fix indentation error and set Doc.is_tagged correctly	2018-04-10 16:14:52 +02:00
ines	987ee27af7	Return Doc from noun chunks merger component if Doc is not parsed	2018-04-09 14:51:02 +02:00
Xiaoquan Kong	e2f13ec722	bugfix: `Doc.noun_chunks` call `Doc.noun_chunks_iterator` without checking (closes #2194 )	2018-04-08 23:44:05 +02:00
Jens Dahl Møllerhøj	e5055e3cf6	Add Danish lemmatizer (#2184 ) * add danish lemmatizer * fill contributor agreement	2018-04-07 19:07:28 +02:00
ines	bccbf538ef	Revert "Check if spaCy has compiled correctly and show error message" This reverts commit `3463ded7cf`.	2018-04-06 15:49:44 +02:00
ines	fb4eda6616	Merge branch 'master' of https://github.com/explosion/spaCy	2018-04-06 00:38:48 +02:00
Matthew Honnibal	0c7fab4443	Set version to 2.0.11	2018-04-04 11:19:11 +02:00
Matthew Honnibal	a350be0601	Fix vector-name loading fix	2018-04-04 01:31:25 +02:00
Matthew Honnibal	21047bde52	Fix syntax error in italian lemmatizer	2018-04-03 23:13:22 +02:00
Matthew Honnibal	81f4005f3d	Fix loading models with pretrained vectors	2018-04-03 23:11:48 +02:00
ines	3463ded7cf	Check if spaCy has compiled correctly and show error message	2018-04-03 22:18:47 +02:00
Matthew Honnibal	96b612873b	Add hyper-parameter to control whether parser makes a beam update	2018-04-03 22:02:56 +02:00
ines	e5f47cd82d	Update errors	2018-04-03 21:40:29 +02:00
Matthew Honnibal	f7e6313b43	Increment version to v2.0.11.dev0	2018-04-03 20:58:47 +02:00
ines	10462816bc	Fix tests for Python 2	2018-04-03 18:51:31 +02:00
ines	62b4b527d7	Don't raise error if set_extension has getter and setter (closes #2177 ) Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.	2018-04-03 18:30:17 +02:00
ines	ee3082ad29	Fix whitespace	2018-04-03 18:29:53 +02:00
Ines Montani	3141e04822	💫 New system for error messages and warnings (#2163 ) * Add spacy.errors module * Update deprecation and user warnings * Replace errors and asserts with new error message system * Remove redundant asserts * Fix whitespace * Add messages for print/util.prints statements * Fix typo * Fix typos * Move CLI messages to spacy.cli._messages * Add decorator to display error code with message An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc. * Remove unused link in spacy.about * Update errors for invalid pipeline components * Improve error for unknown factories * Add displaCy warnings * Update formatting consistency * Move error message to spacy.errors * Update errors and check if doc returned by component is None	2018-04-03 15:50:31 +02:00
Matthew Honnibal	abf8b16d71	Add doc.retokenize() context manager (#2172 ) This patch takes a step towards #1487 by introducing the doc.retokenize() context manager, to handle merging spans, and soon splitting tokens. The idea is to do merging and splitting like this: with doc.retokenize() as retokenizer: for start, end, label in matches: retokenizer.merge(doc[start : end], attrs={'ent_type': label}) The retokenizer accumulates the merge requests, and applies them together at the end of the block. This will allow retokenization to be more efficient, and much less error prone. A retokenizer.split() function will then be added, to handle splitting a single token into multiple tokens. These methods take `Span` and `Token` objects; if the user wants to go directly from offsets, they can append to the .merges and .splits lists on the retokenizer. The doc.merge() method's behaviour remains unchanged, so this patch should be 100% backwards compatible (modulo bugs). Internally, doc.merge() fixes up the arguments (to handle the various deprecated styles), opens the retokenizer, and makes the single merge. We can later start making deprecation warnings on direct calls to doc.merge(), to migrate people to use of the retokenize context manager.	2018-04-03 14:10:35 +02:00
Suraj Rajan	1cdbb7c97c	[2032] - Changed python set to cpp stl set (#2170 ) Changed python set to cpp stl set #2032 ## Description Changed python set to cpp stl set. CPP stl set works better due to the logarithmic run time of its methods. Finding minimum in the cpp set is done in constant time as opposed to the worst case linear runtime of python set. Operations such as find,count,insert,delete are also done in either constant and logarithmic time thus making cpp set a better option to manage vectors. Reference : http://www.cplusplus.com/reference/set/set/ ### Types of change Enhancement for `Vectors` for faster initialising of word vectors(fasttext)	2018-03-31 13:28:25 +02:00
Matthew Honnibal	f3b7c5e537	Fix syntax error	2018-03-29 21:50:32 +02:00
Matthew Honnibal	23afa6429f	Add input length error, to address #1826	2018-03-29 21:45:26 +02:00
Ines Montani	a609a1ca29	Merge pull request #2152 from explosion/feature/tidy-up-dependencies 💫 Tidy up dependencies	2018-03-29 14:35:09 +02:00
Viet Trung Tran	ea2af94cd9	Add support for Vietnamese in spaCy by leveraging Pyvi, an external Vietnamese tokenizer (#2155 ) * support for Vietnamese * Contributor Agreement for adding Vietnamese support on spaCy	2018-03-29 12:19:51 +02:00
ines	e6979bdbbd	Merge branch 'feature/tidy-up-dependencies' of https://github.com/explosion/spaCy into feature/tidy-up-dependencies	2018-03-29 00:19:37 +02:00
ines	83146458a2	Fix urllib for Python 3	2018-03-29 00:19:33 +02:00
Matthew Honnibal	8308bbc617	Get msgpack and msgpack_numpy via Thinc, to avoid potential version conflicts	2018-03-29 00:14:55 +02:00
Matthew Honnibal	b5098079d8	Fix error on urllib	2018-03-29 00:08:16 +02:00
Ines Montani	0de599b16b	Merge pull request #2159 from explosion/feature/fix-merged-entity-iob (resolves #1554 , resolves #1752 ) 💫 Fix token.ent_iob after doc.merge(), and ensure consistency in doc.ents	2018-03-28 23:10:00 +02:00
Ines Montani	98e9cda677	Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660 ) 💫 Fix loading of multiple vector models	2018-03-28 23:08:24 +02:00
Matthew Honnibal	a7c5ae2beb	Avoid forcing a name on empty vectors, and remove print statement	2018-03-28 21:08:58 +02:00
ines	3eb67bbe4b	Allow entity types with dashes (resolves #1967 )	2018-03-28 20:51:26 +02:00
Matthew Honnibal	cf5fcf0546	Update serialization test	2018-03-28 20:12:53 +02:00
Matthew Honnibal	4555e3e251	Don't assume pretrained_vectors cfg set in build_tagger	2018-03-28 20:12:45 +02:00
Matthew Honnibal	0b375d50c8	Fix ent_iob tags in doc.merge to avoid inconsistent sequences	2018-03-28 18:39:03 +02:00
Matthew Honnibal	95fa89c4b8	Update doc.ents test	2018-03-28 18:39:03 +02:00
Matthew Honnibal	e807f88410	Resolve merge when cherry-picking ent iob patches from develop	2018-03-28 18:38:13 +02:00
Matthew Honnibal	99fbc7db33	Improve error message when entity sequence is inconsistent	2018-03-28 18:36:53 +02:00
Matthew Honnibal	cbd2794be0	Add test for ent_iob during span merge	2018-03-28 18:36:53 +02:00
Matthew Honnibal	f8dd905a24	Warn and fallback if vectors have no name	2018-03-28 18:24:53 +02:00
Matthew Honnibal	fd9e259414	Add test for #1660	2018-03-28 18:22:51 +02:00
Matthew Honnibal	bc4afa9881	Remove print statement	2018-03-28 17:48:37 +02:00
Matthew Honnibal	79dc241caa	Set pretrained_vectors in parser cfg	2018-03-28 17:35:07 +02:00
Matthew Honnibal	17c3e7efa2	Add message noting vectors	2018-03-28 16:33:43 +02:00
Matthew Honnibal	9bf6e93b3e	Set pretrained_vectors in begin_training	2018-03-28 16:32:41 +02:00
Matthew Honnibal	95a9615221	Fix loading of multiple pre-trained vectors This patch addresses #1660, which was caused by keying all pre-trained vectors with the same ID when telling Thinc how to refer to them. This meant that if multiple models were loaded that had pre-trained vectors, errors or incorrect behaviour resulted. The vectors class now includes a .name attribute, which defaults to: {nlp.meta['lang']_nlp.meta['name']}.vectors The vectors name is set in the cfg of the pipeline components under the key pretrained_vectors. This replaces the previous cfg key pretrained_dims. In order to make existing models compatible with this change, we check for the pretrained_dims key when loading models in from_disk and from_bytes, and add the cfg key pretrained_vectors if we find it.	2018-03-28 16:02:59 +02:00
ines	7fbc9e5874	Replace requests with urllib	2018-03-28 12:46:07 +02:00
ines	da1f200362	Add compat helpers for urllib	2018-03-28 12:45:53 +02:00
ines	ac88c72c9a	Fix ftfy workaround and remove old import	2018-03-28 12:14:28 +02:00
ines	ce6071ca89	Remove ftfy dependency and update docs	2018-03-28 12:09:42 +02:00
Matthew Honnibal	070b6c6495	Remove dependency on ftfy	2018-03-28 12:07:02 +02:00
ines	6d2c85f428	Drop six and related hacks as a dependency	2018-03-28 10:45:25 +02:00
ines	9e83513004	Add position of invalid token to error message	2018-03-27 23:56:59 +02:00
ines	11c4735ccf	Fix issue in Italian lemmatizer data (resolves #2050 )	2018-03-27 23:55:22 +02:00
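
The doc.retokenize() commit (abf8b16d71) above describes a context manager that accumulates merge requests and applies them together at the end of the block. The following is a minimal sketch of that deferred-merge pattern on a plain Python list of strings. It is not spaCy's implementation: ToyRetokenizer, retokenize(tokens) and the (start, end) offset form of merge() are hypothetical names chosen for illustration, and the sketch assumes the queued ranges do not overlap.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical stand-in that only collects merge requests."""

    def __init__(self):
        # Queued (start, end) token offsets, end exclusive.
        self.merges = []

    def merge(self, start, end):
        # Nothing is modified here; the request is only recorded.
        self.merges.append((start, end))


@contextmanager
def retokenize(tokens):
    """Yield a collector, then apply all queued merges when the block exits."""
    retokenizer = ToyRetokenizer()
    yield retokenizer
    # Apply from right to left so that earlier offsets stay valid
    # while the list shrinks (assumes non-overlapping ranges).
    for start, end in sorted(retokenizer.merges, reverse=True):
        tokens[start:end] = [" ".join(tokens[start:end])]


tokens = ["I", "like", "New", "York", "and", "San", "Francisco"]
with retokenize(tokens) as retokenizer:
    retokenizer.merge(2, 4)  # "New York"
    retokenizer.merge(5, 7)  # "San Francisco"

print(tokens)
```

Because the list is untouched until the block exits, every offset passed to merge() refers to the original tokenization, which is the property the commit message credits with making retokenization less error prone.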


4947 Commits
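
The converter fix listed above (c7d53348d7) mentions an iob_to_biluo function that accepts either IOB or BILUO tags. As a rough illustration of what such a conversion involves, here is a self-contained sketch. It is written from the general definition of the two tagging schemes, not copied from spaCy's code, so the function body below is an assumption about one reasonable way to do it.

```python
def iob_to_biluo(tags):
    """Convert IOB tags to BILUO; tags already in BILUO pass through unchanged."""
    out = []
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix not in ("B", "I"):
            # "O", or a tag that is already "L-..." / "U-...".
            out.append(tag)
            continue
        # Look one tag ahead to see whether the entity continues.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        next_prefix, _, next_label = next_tag.partition("-")
        continues = next_prefix in ("I", "L") and next_label == label
        if prefix == "B":
            # A "B" with no continuation is a single-token entity.
            out.append(tag if continues else "U-" + label)
        else:
            # An "I" with no continuation is the last token of its entity.
            out.append(tag if continues else "L-" + label)
    return out


print(iob_to_biluo(["O", "B-PER", "I-PER", "O", "B-LOC"]))
```

Using partition("-") rather than split("-") keeps labels that themselves contain a dash intact, since only the first dash separates the prefix from the label.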