spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 10:26:35 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	8e8724b55b	Default to beam_update_prob 1	2018-05-10 15:38:02 +02:00
Jani Monoses	42b34832e4	Update Romanian stopword list (#2316) * Contributor agreement for janimo * Update Romanian stopword list Include the correct spellings of all the words already in the repo that are using cedillas (ş and ţ) instead of commas (ș and ț). Add another unrelated spelling fix. See https://github.com/stopwords-iso/stopwords-ro/pull/1 and https://github.com/stopwords-iso/stopwords-ro/pull/2	2018-05-10 12:16:56 +02:00
Lucas Abbade	be7fdc59d1	Update lex_attrs.py (#2307) * Update lex_attrs.py Fixed spelling mistakes of some numbers (according to Brazilian Portuguese). * Update lex_attrs.py As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese. I will advise however, that the two are separated in the future. Brazilian Portuguese is a very different language from the original one, although most of the writing is unified, the way people talk in both countries is radically different. Keeping both languages as one may lead to bigger issues in the future, especially when it comes to spell checking.	2018-05-09 20:49:31 +02:00
mauryaland	5368ba028a	Update stop_words.py for French language (#2310) * Add contraction forms of some common stopwords All the stopwords added contain the apostrophe "'" or "’". * Adds contributor agreement mauryaland * Update mauryaland.md	2018-05-09 12:04:38 +02:00
Matthew Honnibal	a61fd60681	Fix error in beam gradient calculation	2018-05-09 02:44:09 +02:00
Matthew Honnibal	a6ae1ee6f7	Don't modify Token in global scope	2018-05-09 00:43:00 +02:00
Matthew Honnibal	f94f721f40	Avoid importing fused token symbol in ud-run-test, until that's added	2018-05-09 00:28:03 +02:00
Matthew Honnibal	659ec5b975	Avoid importing fused token symbol in ud-run-test, until that's added	2018-05-08 19:40:33 +02:00
Matthew Honnibal	4cb0494bef	Bug fixes to beam search after refactor	2018-05-08 13:48:50 +02:00
Matthew Honnibal	5ed71973b3	Add a keyword argument sink to GoldParse	2018-05-08 13:48:32 +02:00
Matthew Honnibal	8cfe326f87	Avoid relying on final gold check in beam search	2018-05-08 13:48:19 +02:00
Matthew Honnibal	fc4dd49b77	Support oracle segmentation in ud-train CLI command	2018-05-08 13:47:45 +02:00
Matthew Honnibal	c49e44349a	Fix beam parsing	2018-05-08 02:53:24 +02:00
Matthew Honnibal	99649d114d	Fix parser	2018-05-08 00:27:26 +02:00
Matthew Honnibal	8a82367a9d	Fix beam search after refactor	2018-05-08 00:20:33 +02:00
Matthew Honnibal	5a0f26be0c	Readd beam search after refactor	2018-05-08 00:19:52 +02:00
ines	7a3599c21a	Fix formatting and consistency	2018-05-07 23:02:11 +02:00
Matthew Honnibal	36b2c9bdd5	Fix refactored parser	2018-05-07 18:58:09 +02:00
Matthew Honnibal	bde3be1ad1	Fix refactored parser	2018-05-07 18:31:04 +02:00
Matthew Honnibal	01c4e13b02	Update test	2018-05-07 16:59:52 +02:00
Matthew Honnibal	f6cdafc00e	Fix refactored parser	2018-05-07 16:59:38 +02:00
Matthew Honnibal	f56bd4736b	Improve dynamic oracle when values are missing in parse	2018-05-07 15:53:18 +02:00
Matthew Honnibal	eddc0e0c74	Set gold.sent_starts in ud_train	2018-05-07 15:52:47 +02:00
Matthew Honnibal	bf19f22340	Allow gold.sent_starts to be set from Python	2018-05-07 15:51:34 +02:00
Matthew Honnibal	7f163442e6	Work on refactoring greedy parser	2018-05-07 15:45:52 +02:00
Douglas Knox	9b49a40f4e	Test and fix for Issue #2219 (#2272) Test and fix for Issue #2219: Token.similarity() failed if single letter	2018-05-03 18:40:46 +02:00
Paul O'Leary McCann	bd72fbf09c	Port Japanese mecab tokenizer from v1 (#2036) * Port Japanese mecab tokenizer from v1 This brings the Mecab-based Japanese tokenization introduced in #1246 to spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag information from Mecab is stored in a token extension. A tag map is also included. As a reminder, Mecab is required because Universal Dependencies are based on Unidic tags, and Janome doesn't support Unidic. Things to check: 1. Is this the right way to use a token extension? 2. What's the right way to implement a JapaneseTagger? The approach in #1246 relied on `tag_from_strings` which is just gone now. I guess the best thing is to just try training spaCy's default Tagger? -POLM * Add tagging/make_doc and tests	2018-05-03 18:38:26 +02:00
G.Pruvost	cc8e804648	#2211 - Support for ssl certs config on download command (#2212) * Add support for SSL/Certs customization on download CLI * Add a note on SSL options for the 'download' CLI in the README * Add contributor agreement	2018-05-03 18:37:02 +02:00
Jens Dahl Møllerhøj	b9290397fb	rename SP to _SP (#2289)	2018-05-03 18:33:49 +02:00
Matthew Honnibal	a8e70a4187	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-05-03 14:02:10 +02:00
Matthew Honnibal	c0e596283b	Set version to 2.1.0a0	2018-05-03 14:00:11 +02:00
Matthew Honnibal	8cd06cc763	Try to fix root-outside-sentence bug	2018-05-02 14:39:48 +00:00
Matthew Honnibal	acebd01033	Set children from heads in finalize doc	2018-05-02 14:19:22 +00:00
Matthew Honnibal	569440a6db	Don't normalize gradient by batch size	2018-05-02 08:42:10 +02:00
Matthew Honnibal	281e29cbcd	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-05-02 01:36:23 +00:00
Matthew Honnibal	2338e8c7fc	Update develop from master	2018-05-02 01:36:12 +00:00
Matthew Honnibal	9d147e12c4	Merge remote-tracking branch 'origin/master' into develop	2018-05-01 18:18:51 +02:00
Matthew Honnibal	6d0fe67b72	Constrain subtok label to adjacent tokens	2018-05-01 17:34:27 +02:00
Matthew Honnibal	8f21953fc5	Constrain subtok to adjacent words	2018-05-01 17:29:00 +02:00
Matthew Honnibal	b43bfd3524	Fix arc-eager oracle tests	2018-05-01 16:16:14 +02:00
Matthew Honnibal	31ed64e9b0	Fix textcat test	2018-05-01 15:18:39 +02:00
Matthew Honnibal	548bdff943	Update default Adam settings	2018-05-01 15:18:20 +02:00
Matthew Honnibal	adbb1f7533	Add better arc-eager oracle tests	2018-05-01 15:14:55 +02:00
Matthew Honnibal	697bcaa34f	Add some methods to ArcEager that make testing easier	2018-05-01 15:13:14 +02:00
Mr Roboto	6f5ccda19c	Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False (#2230) * Fixes issue #2228 * Adds a new contributor	2018-05-01 13:40:22 +02:00
Matthew Honnibal	d44bb45c72	Fix scoring if tokenization changes	2018-05-01 01:33:20 +02:00
Matthew Honnibal	2b26c007cd	Revert "Disable batch size compounding in ud-train" This reverts commit `8a120fb455`.	2018-04-29 14:09:02 +00:00
Matthew Honnibal	723b328062	Add script to run UD test	2018-04-29 15:50:25 +02:00
Matthew Honnibal	17af6aa3a4	Update ud_train script	2018-04-29 15:49:32 +02:00
Matthew Honnibal	5de8a36537	Fix arc_eager is_nonproj_tree	2018-04-29 15:49:11 +02:00
Matthew Honnibal	5260268f70	Fix textcat after merge	2018-04-29 15:48:53 +02:00
Matthew Honnibal	ad3d56c3ba	Fix compile error in matcher	2018-04-29 15:48:34 +02:00
Matthew Honnibal	a8bc947fd4	Fix Token.set_extension	2018-04-29 15:48:19 +02:00
Matthew Honnibal	2c4a6d66fa	Merge master into develop. Big merge, many conflicts -- need to review	2018-04-29 14:49:26 +02:00
ines	3c80f69ff5	Return data in cli.info and add silent option (resolves #2196)	2018-04-29 01:59:44 +02:00
ines	1c6d77610c	Add remove_extension method on Doc, Token and Span (closes #2242)	2018-04-28 23:33:09 +02:00
ines	abdb853ebf	Simplify underscore tests	2018-04-28 23:30:33 +02:00
ines	6fb6371670	Add collapse_phrases option to displacy (closes #2266)	2018-04-28 23:06:50 +02:00
Robin Linderborg	1f9904ef12	fixes #2238 (#2241) * Remove erroneous lemma lookup år > åra in Swedish * Add contributors agreement * Add contrib agreement to correct directory * Revert change to CONTRIBUTOR_AGREEMENT	2018-04-28 14:55:22 +02:00
Robin Linderborg	d01f503b54	Remove incorrect lemma lookup gäng->gänga (#2252) * Remove incorrect lemma lookup gäng->gänga In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread". * Add contrib agreement to correct directory * Revert change to CONTRIBUTOR_AGREEMENT	2018-04-28 14:54:41 +02:00
Suraj Krishnan Rajan	69d041148f	Implement Fast-Text vectors with subword features	2018-04-21 01:34:14 +05:30
ines	686225eadd	Fix Spanish noun_chunks (resolves #2210) Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets	2018-04-18 18:44:01 -04:00
ines	9632595fb4	Use correct, non-deprecated merge syntax (resolves #2226)	2018-04-18 18:28:28 -04:00
Suraj Rajan	5957f15227	Fixed typos for #2222, #2223 (#2233) (closes #2222, closes #2223)	2018-04-18 14:55:26 -07:00
Matthew Honnibal	97851d2c4e	Increment version to v2.0.12.dev0	2018-04-10 22:20:16 +02:00
Matthew Honnibal	ed39c75a92	Merge branch 'master' of https://github.com/explosion/spaCy	2018-04-10 22:19:40 +02:00
Matthew Honnibal	3836199a83	Fix loading of models when custom vectors are added	2018-04-10 22:19:20 +02:00
ines	0299d5fac8	Update argument annotations and formatting	2018-04-10 21:45:11 +02:00
ines	49b1e48bf5	Fix syntax error	2018-04-10 21:44:59 +02:00
ines	70052e46e9	Fix formatting [ci skip]	2018-04-10 21:42:46 +02:00
Matthew Honnibal	0ddb152be0	Improve error message when reading vectors	2018-04-10 21:26:50 +02:00
Matthew Honnibal	db50ac524e	Support zipped vector files in init-model	2018-04-10 21:21:00 +02:00
ines	270fcfd925	Fix typo in package command message (closes #2200)	2018-04-10 19:14:31 +02:00
ines	24d8bf348d	Revert "Add support for .zip to init_model" This reverts commit `7ee880a0ad`.	2018-04-10 19:08:06 +02:00
Matthew Honnibal	7ee880a0ad	Add support for .zip to init_model	2018-04-10 14:30:04 +00:00
ines	5ecb274764	Fix indentation error and set Doc.is_tagged correctly	2018-04-10 16:14:52 +02:00
ines	987ee27af7	Return Doc in noun chunks merger component if Doc is not parsed	2018-04-09 14:51:02 +02:00
Xiaoquan Kong	e2f13ec722	bugfix: `Doc.noun_chunks` call `Doc.noun_chunks_iterator` without checking (closes #2194)	2018-04-08 23:44:05 +02:00
Jens Dahl Møllerhøj	e5055e3cf6	Add Danish lemmatizer (#2184) * add danish lemmatizer * fill contributor agreement	2018-04-07 19:07:28 +02:00
ines	bccbf538ef	Revert "Check if spaCy has compiled correctly and show error message" This reverts commit `3463ded7cf`.	2018-04-06 15:49:44 +02:00
ines	fb4eda6616	Merge branch 'master' of https://github.com/explosion/spaCy	2018-04-06 00:38:48 +02:00
Matthew Honnibal	0c7fab4443	Set version to 2.0.11	2018-04-04 11:19:11 +02:00
Matthew Honnibal	a350be0601	Fix vector-name loading fix	2018-04-04 01:31:25 +02:00
Matthew Honnibal	21047bde52	Fix syntax error in italian lemmatizer	2018-04-03 23:13:22 +02:00
Matthew Honnibal	81f4005f3d	Fix loading models with pretrained vectors	2018-04-03 23:11:48 +02:00
ines	3463ded7cf	Check if spaCy has compiled correctly and show error message	2018-04-03 22:18:47 +02:00
Matthew Honnibal	96b612873b	Add hyper-parameter to control whether parser makes a beam update	2018-04-03 22:02:56 +02:00
ines	e5f47cd82d	Update errors	2018-04-03 21:40:29 +02:00
Matthew Honnibal	f7e6313b43	Increment version to v2.0.11.dev0	2018-04-03 20:58:47 +02:00
ines	10462816bc	Fix tests for Python 2	2018-04-03 18:51:31 +02:00
ines	62b4b527d7	Don't raise error if set_extension has getter and setter (closes #2177) Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.	2018-04-03 18:30:17 +02:00
ines	ee3082ad29	Fix whitespace	2018-04-03 18:29:53 +02:00
Ines Montani	3141e04822	💫 New system for error messages and warnings (#2163) * Add spacy.errors module * Update deprecation and user warnings * Replace errors and asserts with new error message system * Remove redundant asserts * Fix whitespace * Add messages for print/util.prints statements * Fix typo * Fix typos * Move CLI messages to spacy.cli._messages * Add decorator to display error code with message An implementation like this is nice because it only modifies the string when it's retrieved from the containing class, so we don't have to worry about manipulating tracebacks etc. * Remove unused link in spacy.about * Update errors for invalid pipeline components * Improve error for unknown factories * Add displaCy warnings * Update formatting consistency * Move error message to spacy.errors * Update errors and check if doc returned by component is None	2018-04-03 15:50:31 +02:00
Matthew Honnibal	abf8b16d71	Add doc.retokenize() context manager (#2172) This patch takes a step towards #1487 by introducing the doc.retokenize() context manager, to handle merging spans, and soon splitting tokens. The idea is to do merging and splitting like this: with doc.retokenize() as retokenizer: for start, end, label in matches: retokenizer.merge(doc[start : end], attrs={'ent_type': label}) The retokenizer accumulates the merge requests, and applies them together at the end of the block. This will allow retokenization to be more efficient, and much less error prone. A retokenizer.split() function will then be added, to handle splitting a single token into multiple tokens. These methods take `Span` and `Token` objects; if the user wants to go directly from offsets, they can append to the .merges and .splits lists on the retokenizer. The doc.merge() method's behaviour remains unchanged, so this patch should be 100% backwards compatible (modulo bugs). Internally, doc.merge() fixes up the arguments (to handle the various deprecated styles), opens the retokenizer, and makes the single merge. We can later start making deprecation warnings on direct calls to doc.merge(), to migrate people to use of the retokenize context manager.	2018-04-03 14:10:35 +02:00
Matthew Honnibal	8a120fb455	Disable batch size compounding in ud-train	2018-04-01 08:45:00 +00:00
Matthew Honnibal	98165e43a7	Sometimes update beam with greedy oracle	2018-04-01 08:44:35 +00:00
Suraj Rajan	1cdbb7c97c	[2032] - Changed python set to cpp stl set (#2170) Changed python set to cpp stl set #2032 ## Description Changed python set to cpp stl set. CPP stl set works better due to the logarithmic run time of its methods. Finding minimum in the cpp set is done in constant time as opposed to the worst case linear runtime of python set. Operations such as find, count, insert, delete are also done in either constant and logarithmic time thus making cpp set a better option to manage vectors. Reference : http://www.cplusplus.com/reference/set/set/ ### Types of change Enhancement for `Vectors` for faster initialising of word vectors (fasttext)	2018-03-31 13:28:25 +02:00
Matthew Honnibal	f3b7c5e537	Fix syntax error	2018-03-29 21:50:32 +02:00
Matthew Honnibal	23afa6429f	Add input length error, to address #1826	2018-03-29 21:45:26 +02:00
Ines Montani	a609a1ca29	Merge pull request #2152 from explosion/feature/tidy-up-dependencies 💫 Tidy up dependencies	2018-03-29 14:35:09 +02:00
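
The message for commit abf8b16d71 above describes a retokenizer that accumulates merge requests and applies them together when the `with` block ends. The pattern can be sketched on a plain list of strings. This is a toy illustration only, not spaCy's implementation: `ToyRetokenizer`, the free function `retokenize`, and the offset-based `merge(start, end)` signature are all hypothetical names made up for this sketch, and overlapping merges are not handled.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical stand-in: collects merge requests, applies them later."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.merges = []  # (start, end) offsets, end exclusive

    def merge(self, start, end):
        # Only record the request; the token list is not touched yet.
        self.merges.append((start, end))

    def apply(self):
        # Work right to left so earlier offsets stay valid after each merge.
        for start, end in sorted(self.merges, reverse=True):
            self.tokens[start:end] = [" ".join(self.tokens[start:end])]
        self.merges.clear()


@contextmanager
def retokenize(tokens):
    retokenizer = ToyRetokenizer(tokens)
    yield retokenizer
    # Reached only if the block finished without raising.
    retokenizer.apply()


tokens = ["I", "like", "New", "York", "and", "San", "Francisco"]
with retokenize(tokens) as retokenizer:
    retokenizer.merge(2, 4)
    retokenizer.merge(5, 7)
    pending = list(tokens)  # snapshot taken inside the block: still unmerged

print(tokens)
```

Deferring the work to the end of the block means every request can be expressed in the original offsets, which is one plausible reading of why the commit message calls the approach less error prone.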

5135 Commits