spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-03-29 06:14:13 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	dbc276e3b2	Fix 'toupper()' -> 'upper()'	2017-10-20 13:02:13 +02:00
Matthew Honnibal	7a46792376	Fix compile error Closures not allowed in cpdef	2017-10-20 11:53:47 +02:00
Matthew Honnibal	658536b5ce	Fix to_array compile error	2017-10-20 11:35:10 +02:00
Matthew Honnibal	c0799430a7	Make small changes to Doc.to_array * Change type-check logic to 'hasattr' (Python type-checking is brittle) * Small 'house style' edits, mostly making code more terse.	2017-10-20 11:17:00 +02:00
Ramanan Balakrishnan	5941aa96a1	Support strings for attribute list in doc.to_array	2017-10-20 11:59:34 +05:30
Ramanan Balakrishnan	b47b4e2654	Support single value for attribute list in doc.to_scalar conversion	2017-10-18 14:43:47 +05:30
Matthew Honnibal	cd9378c8f1	Merge pull request #1423 from yuukos/master Fixed Russian tokenizer	2017-10-16 11:45:53 +02:00
yuukos	92931a2efd	Merge branch 'russian_language'	2017-10-16 13:46:28 +07:00
yuukos	241d19a3e6	fixed Russian Tokenizer - added trailing space flags for tokens	2017-10-16 13:37:05 +07:00
Paul O'Leary McCann	71ae8013ec	[ja] Use user_details instead of a wrapper class Instead of using a JapaneseDoc wrapper class to store Mecab output, stash it in `user_data`. -POLM	2017-10-16 00:24:34 +09:00
Paul O'Leary McCann	43eedf73f2	[ja] Stash tokenizer output for speed Before this commit, the Mecab tokenizer had to be called twice when creating a Doc- once during tokenization and once during tagging. This creates a JapaneseDoc wrapper class for Doc that stashes the parsed tokenizer output to remove redundant processing. -POLM	2017-10-15 23:33:25 +09:00
yuukos	6fb9d75bd2	fixed test with creating tokenizer	2017-10-13 15:51:03 +07:00
yuukos	a229b6e0de	added tests for Russian language added tests of creating Russian Language instance and Russian tokenizer	2017-10-13 14:04:37 +07:00
yuukos	622b6d6270	updated Russian tokenizer moved the trying to import pymorph into __init__	2017-10-13 13:57:29 +07:00
yuukos	f81dd284eb	updated spacy/__init__.py registered russian language via set_lang_class	2017-10-12 22:28:34 +07:00
yuukos	7b9491679f	added russian language support	2017-10-12 22:24:20 +07:00
Raphaël Bournhonesque	3452d6ce52	Resolve issue #1078 by simplifying URL pattern - avoid catastrophic backtracking - reduce character range of host name, domain name and TLD identifier	2017-10-11 11:24:00 +02:00
Matthew Honnibal	331d338b8b	Merge pull request #1246 from polm/ja-pos-tagger [wip] Sample implementation of Japanese Tagger (ref #1214)	2017-10-09 04:00:53 +02:00
Orion Montoya	b0d271809d	Unit test for lemmatizer exceptions -- copied from regression test for #1387	2017-10-05 10:49:28 -04:00
Orion Montoya	ffb50d21a0	Lemmatizer honors exceptions: Fix #1387	2017-10-05 10:49:02 -04:00
Orion Montoya	e81a608173	Regression test for lemmatizer exceptions -- demonstrate issue #1387	2017-10-05 10:47:48 -04:00
Matthew Honnibal	eb72eae258	Merge pull request #1364 from Destygo/master Fixed NER model loading bug	2017-09-29 12:29:43 +02:00
Vincent Genty	259ed027af	Fixed NER model loading bug	2017-09-26 15:46:04 +02:00
Ines Montani	361211fe26	Merge pull request #1342 from wannaphongcom/master Add Thai language	2017-09-26 15:40:55 +02:00
Yam	923c4c2fb2	Update punctuation.py add `……`	2017-09-22 09:50:46 +08:00
Wannaphong Phatthiyaphaibun	1abf472068	add th test	2017-09-21 12:56:58 +07:00
Wannaphong Phatthiyaphaibun	39bb5690f0	update th	2017-09-21 00:36:02 +07:00
Wannaphong Phatthiyaphaibun	44291f6697	add thai	2017-09-20 23:26:34 +07:00
Yam	978b24ccd4	Update punctuation.py In Chinese, `~` and `——` is hyphens, `·` is intermittent symbol	2017-09-20 23:02:22 +08:00
Yu-chun Huang	188b439b25	Add Chinese punctuation Add Chinese punctuation.	2017-09-19 16:58:42 +08:00
Yu-chun Huang	1f1f35dcd0	Add Chinese punctuation Add Chinese punctuation.	2017-09-19 16:57:24 +08:00
Yu-chun Huang	7692b8c071	Update __init__.py Set the "cut_all" parameter to False, or jieba will return ALL POSSIBLE word segmentations.	2017-09-12 16:23:47 +08:00
Matthew Honnibal	ddaff6ca56	Merge pull request #1287 from IamJeffG/feature/1226-more-complete-noun-chunks Capture more noun chunks	2017-09-08 07:59:10 +02:00
Matthew Honnibal	45029a550e	Fix customized-tokenizer tests	2017-09-04 20:13:13 +02:00
Matthew Honnibal	34c585396a	Merge pull request #1294 from Vimos/master Fix issue #1292 and add test case for the Assertion Error	2017-09-04 19:20:40 +02:00
Matthew Honnibal	c68f188eb0	Fix error on test	2017-09-04 18:59:36 +02:00
Matthew Honnibal	e8a26ebfab	Add efficiency note to new get_lca_matrix() method	2017-09-04 15:43:52 +02:00
Eric Zhao	d61c117081	Lowest common ancestor matrix for spans and docs Added functionality for spans and docs to get lowest common ancestor matrix by simply calling: doc.get_lca_matrix() or doc[:3].get_lca_matrix(). Corresponding unit tests were also added under spacy/tests/doc and spacy/tests/spans. Designed to address: https://github.com/explosion/spaCy/issues/969.	2017-09-03 12:22:19 -07:00
Matthew Honnibal	9bffcaa73d	Update test to make it slightly more direct The `nlp` container should be unnecessary here. If so, we can test the tokenizer class just a little more directly.	2017-09-01 21:16:56 +02:00
Vimos Tan	a6d9fb5bb6	fix issue #1292	2017-08-30 14:49:14 +08:00
Paul O'Leary McCann	8b3e1f7b5b	Handle out-of-vocab words Wasn't handling words out of the tokenizer dictionary vocabulary properly. This adds a fix and test for that. -POLM	2017-08-29 23:58:42 +09:00
Jeffrey Gerard	884ba168a8	Capture more noun chunks	2017-08-23 21:18:53 -07:00
Paul O'Leary McCann	95050201ce	Add importorskip for Japanese fixture	2017-08-22 21:30:59 +09:00
Paul O'Leary McCann	bcf2b9b4f5	Update tagger & tokenizer tests Tagger is now parametrized and has two sentences with more tag coverage. The tokenizer tests are updated to reflect differences in tokenization between IPAdic and Unidic. -POLM	2017-08-22 00:03:11 +09:00
Paul O'Leary McCann	adfd987316	Update the TAG_MAP	2017-08-22 00:02:55 +09:00
Paul O'Leary McCann	53e17296e9	Fix pronoun handling Missed this case earlier. 連体詞 have three classes for UD purposes: - その -> DET - それ -> PRON - 同じ -> ADJ -POLM	2017-08-22 00:01:49 +09:00
Paul O'Leary McCann	c435f748d7	Put Mecab import in utility function	2017-08-22 00:01:28 +09:00
ines	dcff10abe9	Add regression test for #1281	2017-08-21 16:11:47 +02:00
ines	edc596d9a7	Add missing tokenizer exceptions (resolves #1281)	2017-08-21 16:11:36 +02:00
Paul O'Leary McCann	234a8a7591	Change default tag for 動詞,非自立可能 Example of this is いる in these sentences: 彼はそこにいる。# should be VERB 彼は底に立っている。# should be AUX Unclear which case is more numerous - need to check a large corpus - but in keeping with the other ambiguous tags, this is mapped to the "dominant" or first part of the tag. -POLM	2017-08-21 00:21:45 +09:00


2907 Commits
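
Commit d61c117081 above adds `doc.get_lca_matrix()`, described as a lowest common ancestor matrix for spans and docs. As a rough illustration of the idea only (not the code from that commit), the sketch below computes such a matrix in pure Python. It assumes the parse is given as a hypothetical `heads` list, where `heads[i]` is the index of token `i`'s head and a root token points to itself, and it assumes `-1` marks pairs with no common ancestor. The function names are made up for this example.

```python
from typing import List


def path_to_root(heads: List[int], i: int) -> List[int]:
    """Return token i followed by its ancestors, ending at its root."""
    path = [i]
    while heads[path[-1]] != path[-1]:
        if len(path) > len(heads):
            # A well-formed tree can never have a path this long.
            raise ValueError("heads contains a cycle")
        path.append(heads[path[-1]])
    return path


def lca_matrix(heads: List[int]) -> List[List[int]]:
    """Build an n x n matrix whose [i][j] entry is the index of the lowest
    common ancestor of tokens i and j, or -1 if they share no ancestor."""
    n = len(heads)
    paths = [path_to_root(heads, i) for i in range(n)]
    path_sets = [set(p) for p in paths]
    matrix = [[-1] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Walk upward from i; the first node also on j's path is the LCA.
            for node in paths[i]:
                if node in path_sets[j]:
                    matrix[i][j] = node
                    break
    return matrix


# "the cat sat": the -> cat, cat -> sat, sat is the root.
example = lca_matrix([1, 2, 2])
# Two separate trees: tokens 0 and 1 share a root, token 2 stands alone.
disjoint = lca_matrix([0, 0, 2])
```

This sketch visits every token pair and walks one ancestor path per pair, so its cost grows at least quadratically with the number of tokens. It favours readability over speed.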