spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 10:26:35 +03:00

Author	SHA1	Message	Date
ines	5768df4f09	Add SimpleFrozenDict util to use as default function argument	2018-05-20 15:13:37 +02:00
Matthew Honnibal	7431e9c87f	Fix parser for GPU	2018-05-19 17:24:34 +00:00
Matthew Honnibal	401213fb1f	Only warn about unnamed vectors if non-zero sized.	2018-05-19 18:51:55 +02:00
Matthew Honnibal	74d5c625b3	Use rising beam update prob	2018-05-16 20:11:59 +02:00
Matthew Honnibal	544ae7f1db	Merge branch 'develop' into feature/refactor-parser	2018-05-16 02:06:49 +02:00
Matthew Honnibal	d1b27fe5aa	Revert "Improve dynamic oracle when values are missing in parse" This reverts commit `f56bd4736b`.	2018-05-16 00:31:52 +02:00
Matthew Honnibal	83acaa0358	Add missing name attribute for parser	2018-05-15 19:01:53 +02:00
Matthew Honnibal	f328c195ca	Fix size limits in training data	2018-05-15 19:01:41 +02:00
Matthew Honnibal	8446b35ce0	Fix parser model loading	2018-05-15 18:43:46 +02:00
Matthew Honnibal	dc1a479fbd	Merge branch 'develop' into feature/refactor-parser	2018-05-15 18:39:21 +02:00
Matthew Honnibal	546dd99cdf	Merge master into develop -- mostly Arabic and website	2018-05-15 18:14:28 +02:00
Matthew Honnibal	5664ab7e6c	Revert hacks to tests	2018-05-15 18:00:09 +02:00
Matthew Honnibal	7b9195657b	Restore beam_density argument for parser beam	2018-05-15 17:55:11 +02:00
Matthew Honnibal	581d318971	Fix conftest	2018-05-15 00:54:45 +02:00
Tahar Zanouda	00417794d3	Add Arabic language (#2314) * added support for Arabic lang * added Arabic language support * updated conftest	2018-05-15 00:27:19 +02:00
Jani Monoses	0e08e49e87	Lemmatizer ro (#2319) * Add Romanian lemmatizer lookup table. Adapted from http://www.lexiconista.com/datasets/lemmatization/ by replacing cedillas with commas (ș and ț). The original dataset is licensed under the Open Database License. * Fix one blatant issue in the Romanian lemmatizer * Romanian examples file * Add ro_tokenizer in conftest * Add Romanian lemmatizer test	2018-05-12 15:20:04 +02:00
Matthew Honnibal	887631ca25	Disable some tests to figure out why CI fails	2018-05-10 16:42:01 +02:00
Matthew Honnibal	902a172cb7	Disable some tests to figure out why CI fails	2018-05-10 16:30:07 +02:00
Matthew Honnibal	614d45ea58	Set a more aggressive threshold on the max violn update	2018-05-10 15:38:24 +02:00
Matthew Honnibal	8e8724b55b	Default to beam_update_prob 1	2018-05-10 15:38:02 +02:00
Jani Monoses	42b34832e4	Update Romanian stopword list (#2316) * Contributor agreement for janimo * Update Romanian stopword list Include the correct spellings of all the words already in the repo that are using cedillas (ş and ţ) instead of commas (ș and ț). Add another unrelated spelling fix. See https://github.com/stopwords-iso/stopwords-ro/pull/1 and https://github.com/stopwords-iso/stopwords-ro/pull/2	2018-05-10 12:16:56 +02:00
Lucas Abbade	be7fdc59d1	Update lex_attrs.py (#2307) * Update lex_attrs.py Fixed spelling mistakes of some numbers (according to Brazilian Portuguese). * Update lex_attrs.py As requested, I've included the correct spelling for both Brazilian Portuguese and Portuguese Portuguese. I will advise however, that the two are separated in the future. Brazilian Portuguese is a very different language from the original one, although most of the writing is unified, the way people talk in both countries is radically different. Keeping both languages as one may lead to bigger issues in the future, especially when it comes to spell checking.	2018-05-09 20:49:31 +02:00
mauryaland	5368ba028a	Update stop_words.py for French language (#2310) * Add contraction forms of some common stopwords All the stopwords added contain the apostrophe " ' " or " ’ ". * Adds contributor agreement mauryaland * Update mauryaland.md	2018-05-09 12:04:38 +02:00
Matthew Honnibal	a61fd60681	Fix error in beam gradient calculation	2018-05-09 02:44:09 +02:00
Matthew Honnibal	a6ae1ee6f7	Don't modify Token in global scope	2018-05-09 00:43:00 +02:00
Matthew Honnibal	f94f721f40	Avoid importing fused token symbol in ud-run-test, until that's added	2018-05-09 00:28:03 +02:00
Matthew Honnibal	659ec5b975	Avoid importing fused token symbol in ud-run-test, until that's added	2018-05-08 19:40:33 +02:00
Matthew Honnibal	4cb0494bef	Bug fixes to beam search after refactor	2018-05-08 13:48:50 +02:00
Matthew Honnibal	5ed71973b3	Add a keyword argument sink to GoldParse	2018-05-08 13:48:32 +02:00
Matthew Honnibal	8cfe326f87	Avoid relying on final gold check in beam search	2018-05-08 13:48:19 +02:00
Matthew Honnibal	fc4dd49b77	Support oracle segmentation in ud-train CLI command	2018-05-08 13:47:45 +02:00
Matthew Honnibal	c49e44349a	Fix beam parsing	2018-05-08 02:53:24 +02:00
Matthew Honnibal	99649d114d	Fix parser	2018-05-08 00:27:26 +02:00
Matthew Honnibal	8a82367a9d	Fix beam search after refactor	2018-05-08 00:20:33 +02:00
Matthew Honnibal	5a0f26be0c	Readd beam search after refactor	2018-05-08 00:19:52 +02:00
ines	7a3599c21a	Fix formatting and consistency	2018-05-07 23:02:11 +02:00
Matthew Honnibal	36b2c9bdd5	Fix refactored parser	2018-05-07 18:58:09 +02:00
Matthew Honnibal	bde3be1ad1	Fix refactored parser	2018-05-07 18:31:04 +02:00
Matthew Honnibal	01c4e13b02	Update test	2018-05-07 16:59:52 +02:00
Matthew Honnibal	f6cdafc00e	Fix refactored parser	2018-05-07 16:59:38 +02:00
Matthew Honnibal	f56bd4736b	Improve dynamic oracle when values are missing in parse	2018-05-07 15:53:18 +02:00
Matthew Honnibal	eddc0e0c74	Set gold.sent_starts in ud_train	2018-05-07 15:52:47 +02:00
Matthew Honnibal	bf19f22340	Allow gold.sent_starts to be set from Python	2018-05-07 15:51:34 +02:00
Matthew Honnibal	7f163442e6	Work on refactoring greedy parser	2018-05-07 15:45:52 +02:00
Douglas Knox	9b49a40f4e	Test and fix for Issue #2219 (#2272) Test and fix for Issue #2219: Token.similarity() failed if single letter	2018-05-03 18:40:46 +02:00
Paul O'Leary McCann	bd72fbf09c	Port Japanese mecab tokenizer from v1 (#2036) * Port Japanese mecab tokenizer from v1 This brings the Mecab-based Japanese tokenization introduced in #1246 to spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag information from Mecab is stored in a token extension. A tag map is also included. As a reminder, Mecab is required because Universal Dependencies are based on Unidic tags, and Janome doesn't support Unidic. Things to check: 1. Is this the right way to use a token extension? 2. What's the right way to implement a JapaneseTagger? The approach in #1246 relied on `tag_from_strings` which is just gone now. I guess the best thing is to just try training spaCy's default Tagger? -POLM * Add tagging/make_doc and tests	2018-05-03 18:38:26 +02:00
G.Pruvost	cc8e804648	#2211 - Support for ssl certs config on download command (#2212) * Add support for SSL/Certs customization on download CLI * Add a note on SSL options for the 'download' CLI in the README * Add contributor agreement	2018-05-03 18:37:02 +02:00
Jens Dahl Møllerhøj	b9290397fb	rename SP to _SP (#2289)	2018-05-03 18:33:49 +02:00
Matthew Honnibal	a8e70a4187	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-05-03 14:02:10 +02:00
Matthew Honnibal	c0e596283b	Set version to 2.1.0a0	2018-05-03 14:00:11 +02:00
Matthew Honnibal	8cd06cc763	Try to fix root-outside-sentence bug	2018-05-02 14:39:48 +00:00
Matthew Honnibal	acebd01033	Set children from heads in finalize doc	2018-05-02 14:19:22 +00:00
Matthew Honnibal	569440a6db	Don't normalize gradient by batch size	2018-05-02 08:42:10 +02:00
Matthew Honnibal	281e29cbcd	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-05-02 01:36:23 +00:00
Matthew Honnibal	2338e8c7fc	Update develop from master	2018-05-02 01:36:12 +00:00
Matthew Honnibal	9d147e12c4	Merge remote-tracking branch 'origin/master' into develop	2018-05-01 18:18:51 +02:00
Matthew Honnibal	6d0fe67b72	Constrain subtok label to adjacent tokens	2018-05-01 17:34:27 +02:00
Matthew Honnibal	8f21953fc5	Constrain subtok to adjacent words	2018-05-01 17:29:00 +02:00
Matthew Honnibal	b43bfd3524	Fix arc-eager oracle tests	2018-05-01 16:16:14 +02:00
Matthew Honnibal	31ed64e9b0	Fix textcat test	2018-05-01 15:18:39 +02:00
Matthew Honnibal	548bdff943	Update default Adam settings	2018-05-01 15:18:20 +02:00
Matthew Honnibal	adbb1f7533	Add better arc-eager oracle tests	2018-05-01 15:14:55 +02:00
Matthew Honnibal	697bcaa34f	Add some methods to ArcEager that make testing easier	2018-05-01 15:13:14 +02:00
Mr Roboto	6f5ccda19c	Addresses Issue #2228 - Deserialization fails when using tensor=False or sentiment=False (#2230) * Fixes issue #2228 * Adds a new contributor	2018-05-01 13:40:22 +02:00
Matthew Honnibal	d44bb45c72	Fix scoring if tokenization changes	2018-05-01 01:33:20 +02:00
Matthew Honnibal	2b26c007cd	Revert "Disable batch size compounding in ud-train" This reverts commit `8a120fb455`.	2018-04-29 14:09:02 +00:00
Matthew Honnibal	723b328062	Add script to run UD test	2018-04-29 15:50:25 +02:00
Matthew Honnibal	17af6aa3a4	Update ud_train script	2018-04-29 15:49:32 +02:00
Matthew Honnibal	5de8a36537	Fix arc_eager is_nonproj_tree	2018-04-29 15:49:11 +02:00
Matthew Honnibal	5260268f70	Fix textcat after merge	2018-04-29 15:48:53 +02:00
Matthew Honnibal	ad3d56c3ba	Fix compile error in matcher	2018-04-29 15:48:34 +02:00
Matthew Honnibal	a8bc947fd4	Fix Token.set_extension	2018-04-29 15:48:19 +02:00
Matthew Honnibal	2c4a6d66fa	Merge master into develop. Big merge, many conflicts -- need to review	2018-04-29 14:49:26 +02:00
ines	3c80f69ff5	Return data in cli.info and add silent option (resolves #2196)	2018-04-29 01:59:44 +02:00
ines	1c6d77610c	Add remove_extension method on Doc, Token and Span (closes #2242)	2018-04-28 23:33:09 +02:00
ines	abdb853ebf	Simplify underscore tests	2018-04-28 23:30:33 +02:00
ines	6fb6371670	Add collapse_phrases option to displacy (closes #2266)	2018-04-28 23:06:50 +02:00
Robin Linderborg	1f9904ef12	fixes #2238 (#2241) * Remove erroneous lemma lookup år > åra in Swedish * Add contributors agreement * Add contrib agreement to correct directory * Revert change to CONTRIBUTOR_AGREEMENT	2018-04-28 14:55:22 +02:00
Robin Linderborg	d01f503b54	Remove incorrect lemma lookup gäng->gänga (#2252) * Remove incorrect lemma lookup gäng->gänga In modern Swedish, "gäng" is mostly associated with "gang" or "group of people". The removed lemma lookup lemmatized it to the verb "thread". * Add contrib agreement to correct directory * Revert change to CONTRIBUTOR_AGREEMENT	2018-04-28 14:54:41 +02:00
Suraj Krishnan Rajan	69d041148f	Implement Fast-Text vectors with subword features	2018-04-21 01:34:14 +05:30
ines	686225eadd	Fix Spanish noun_chunks (resolves #2210) Make sure 'NP' label is added to StringStore and move noun_bounds helper into a closure to allow reusing label sets	2018-04-18 18:44:01 -04:00
ines	9632595fb4	Use correct, non-deprecated merge syntax (resolves #2226)	2018-04-18 18:28:28 -04:00
Suraj Rajan	5957f15227	Fixed typos for #2222, #2223 (#2233) (closes #2222, closes #2223)	2018-04-18 14:55:26 -07:00
Matthew Honnibal	97851d2c4e	Increment version to v2.0.12.dev0	2018-04-10 22:20:16 +02:00
Matthew Honnibal	ed39c75a92	Merge branch 'master' of https://github.com/explosion/spaCy	2018-04-10 22:19:40 +02:00
Matthew Honnibal	3836199a83	Fix loading of models when custom vectors are added	2018-04-10 22:19:20 +02:00
ines	0299d5fac8	Update argument annotations and formatting	2018-04-10 21:45:11 +02:00
ines	49b1e48bf5	Fix syntax error	2018-04-10 21:44:59 +02:00
ines	70052e46e9	Fix formatting [ci skip]	2018-04-10 21:42:46 +02:00
Matthew Honnibal	0ddb152be0	Improve error message when reading vectors	2018-04-10 21:26:50 +02:00
Matthew Honnibal	db50ac524e	Support zipped vector files in init-model	2018-04-10 21:21:00 +02:00
ines	270fcfd925	Fix typo in package command message (closes #2200)	2018-04-10 19:14:31 +02:00
ines	24d8bf348d	Revert "Add support for .zip to init_model" This reverts commit `7ee880a0ad`.	2018-04-10 19:08:06 +02:00
Matthew Honnibal	7ee880a0ad	Add support for .zip to init_model	2018-04-10 14:30:04 +00:00
ines	5ecb274764	Fix indentation error and set Doc.is_tagged correctly	2018-04-10 16:14:52 +02:00
ines	987ee27af7	Return Doc if noun chunks merger component if Doc is not parsed	2018-04-09 14:51:02 +02:00
Xiaoquan Kong	e2f13ec722	bugfix: `Doc.noun_chunks` call `Doc.noun_chunks_iterator` without checking (closes #2194)	2018-04-08 23:44:05 +02:00
Jens Dahl Møllerhøj	e5055e3cf6	Add Danish lemmatizer (#2184) * add danish lemmatizer * fill contributor agreement	2018-04-07 19:07:28 +02:00
ines	bccbf538ef	Revert "Check if spaCy has compiled correctly and show error message" This reverts commit `3463ded7cf`.	2018-04-06 15:49:44 +02:00
ines	fb4eda6616	Merge branch 'master' of https://github.com/explosion/spaCy	2018-04-06 00:38:48 +02:00

5154 Commits