Commit Graph

2901 Commits

Author SHA1 Message Date
Matthew Honnibal
cd9378c8f1 Merge pull request #1423 from yuukos/master
Fixed Russian tokenizer
2017-10-16 11:45:53 +02:00
yuukos
92931a2efd Merge branch 'russian_language' 2017-10-16 13:46:28 +07:00
yuukos
241d19a3e6 fixed Russian tokenizer
- added trailing-space flags for tokens
2017-10-16 13:37:05 +07:00
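Note: a minimal sketch of the trailing-space handling described in the commit above, using spaCy's public Doc API rather than the commit's actual code; the example words are invented.

    from spacy.tokens import Doc
    from spacy.vocab import Vocab

    words = ["Привет", ",", "мир", "!"]
    spaces = [False, True, False, False]   # True if a space follows the token
    doc = Doc(Vocab(), words=words, spaces=spaces)
    assert doc.text == "Привет, мир!"      # original text is reconstructed losslessly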
Paul O'Leary McCann
71ae8013ec [ja] Use user_data instead of a wrapper class
Instead of using a JapaneseDoc wrapper class to store Mecab output,
stash it in `user_data`. -POLM
2017-10-16 00:24:34 +09:00
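Note: a rough sketch of the pattern described above, stashing the external tokenizer's analysis in `Doc.user_data` instead of wrapping the Doc; the key name and the sample output are illustrative only.

    from spacy.tokens import Doc
    from spacy.vocab import Vocab

    # hypothetical Mecab-style (surface, tag) pairs standing in for real output
    mecab_output = [("すもも", "名詞,普通名詞"), ("も", "助詞,係助詞")]

    doc = Doc(Vocab(), words=[surface for surface, _ in mecab_output])
    doc.user_data["mecab_details"] = mecab_output   # stash instead of wrapping
    # later components (e.g. the tagger) read it back without re-running Mecab
    details = doc.user_data["mecab_details"]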
Paul O'Leary McCann
43eedf73f2 [ja] Stash tokenizer output for speed
Before this commit, the Mecab tokenizer had to be called twice when
creating a Doc: once during tokenization and once during tagging. This
creates a JapaneseDoc wrapper class for Doc that stashes the parsed
tokenizer output to remove redundant processing. -POLM
2017-10-15 23:33:25 +09:00
yuukos
6fb9d75bd2 fixed the test for creating the tokenizer 2017-10-13 15:51:03 +07:00
yuukos
a229b6e0de added tests for the Russian language
added tests for creating a Russian Language instance and the Russian tokenizer
2017-10-13 14:04:37 +07:00
yuukos
622b6d6270 updated Russian tokenizer
moved the attempt to import pymorph into __init__
2017-10-13 13:57:29 +07:00
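Note: the usual shape of a deferred optional-dependency import, illustrative only; the commit says "pymorph", so the pymorphy2 package name is an assumption here.

    class RussianTokenizer(object):
        def __init__(self, nlp=None):
            try:
                from pymorphy2 import MorphAnalyzer   # optional dependency
            except ImportError:
                raise ImportError(
                    "The Russian tokenizer requires pymorphy2: "
                    "pip install pymorphy2"
                )
            self._morph = MorphAnalyzer()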
yuukos
f81dd284eb updated spacy/__init__.py
registered the Russian language via set_lang_class
2017-10-12 22:28:34 +07:00
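Note: roughly what registering a language class looks like; a sketch, not the actual diff to spacy/__init__.py, and the import path assumes the class lives at spacy.lang.ru.

    from spacy import util
    from spacy.lang.ru import Russian    # assumed location of the new class

    util.set_lang_class("ru", Russian)   # register under the "ru" language code
    nlp = util.get_lang_class("ru")()    # later lookups by code now resolve to Russian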
yuukos
7b9491679f added Russian language support 2017-10-12 22:24:20 +07:00
Raphaël Bournhonesque
3452d6ce52 Resolve issue #1078 by simplifying URL pattern
- avoid catastrophic backtracking
- reduce character range of host name, domain name and TLD identifier
2017-10-11 11:24:00 +02:00
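Note: a toy illustration of catastrophic backtracking, not spaCy's actual URL regex. Nested, overlapping quantifiers force the engine to try exponentially many splits before it can report a non-match; simplifying the pattern and bounding the character ranges (as the commit does for the host, domain and TLD parts) keeps matching fast.

    import re

    pathological = re.compile(r"^(a+)+$")
    # pathological.match("a" * 30 + "b")   # effectively hangs: ~2**30 ways to backtrack

    simplified = re.compile(r"^a+$")       # matches the same strings, no nested quantifier
    assert simplified.match("a" * 30 + "b") is None   # fails immediately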
Matthew Honnibal
331d338b8b Merge pull request #1246 from polm/ja-pos-tagger
[wip] Sample implementation of Japanese Tagger (ref #1214)
2017-10-09 04:00:53 +02:00
Orion Montoya
b0d271809d Unit test for lemmatizer exceptions -- copied from regression test for #1387 2017-10-05 10:49:28 -04:00
Orion Montoya
ffb50d21a0 Lemmatizer honors exceptions: Fix #1387 2017-10-05 10:49:02 -04:00
Orion Montoya
e81a608173 Regression test for lemmatizer exceptions -- demonstrate issue #1387 2017-10-05 10:47:48 -04:00
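Note: a hedged sketch of the general idea behind the fix, not spaCy's internal lemmatizer code: the exceptions table is consulted before any suffix rules are applied.

    def lemmatize(string, exceptions, rules):
        if string in exceptions:
            return exceptions[string]            # an exception wins outright
        forms = []
        for old, new in rules:                   # otherwise fall back to suffix rules
            if string.endswith(old):
                forms.append(string[:len(string) - len(old)] + new)
        return forms or [string]

    assert lemmatize("feet", {"feet": ["foot"]}, [("s", "")]) == ["foot"]
    assert lemmatize("cats", {"feet": ["foot"]}, [("s", "")]) == ["cat"]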
Matthew Honnibal
eb72eae258 Merge pull request #1364 from Destygo/master
Fixed NER model loading bug
2017-09-29 12:29:43 +02:00
Vincent Genty
259ed027af Fixed NER model loading bug 2017-09-26 15:46:04 +02:00
Ines Montani
361211fe26 Merge pull request #1342 from wannaphongcom/master
Add Thai language
2017-09-26 15:40:55 +02:00
Yam
923c4c2fb2 Update punctuation.py
add `……`
2017-09-22 09:50:46 +08:00
Wannaphong Phatthiyaphaibun
1abf472068 add th test 2017-09-21 12:56:58 +07:00
Wannaphong Phatthiyaphaibun
39bb5690f0 update th 2017-09-21 00:36:02 +07:00
Wannaphong Phatthiyaphaibun
44291f6697 add thai 2017-09-20 23:26:34 +07:00
Yam
978b24ccd4 Update punctuation.py
In Chinese, `~` and `——` are hyphens,
and `·` is an interpunct (middle dot)
2017-09-20 23:02:22 +08:00
Yu-chun Huang
188b439b25 Add Chinese punctuation
2017-09-19 16:58:42 +08:00
Yu-chun Huang
1f1f35dcd0 Add Chinese punctuation
2017-09-19 16:57:24 +08:00
Yu-chun Huang
7692b8c071 Update __init__.py
Set the "cut_all" parameter to False, or jieba will return ALL POSSIBLE word segmentations.
2017-09-12 16:23:47 +08:00
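Note: what the flag changes, using jieba's own API; the example sentence is the one from jieba's documentation.

    import jieba

    text = "我来到北京清华大学"
    print(list(jieba.cut(text, cut_all=True)))    # every possible segment, overlapping
    print(list(jieba.cut(text, cut_all=False)))   # one non-overlapping segmentation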
Matthew Honnibal
ddaff6ca56 Merge pull request #1287 from IamJeffG/feature/1226-more-complete-noun-chunks
Capture more noun chunks
2017-09-08 07:59:10 +02:00
Matthew Honnibal
45029a550e Fix customized-tokenizer tests 2017-09-04 20:13:13 +02:00
Matthew Honnibal
34c585396a Merge pull request #1294 from Vimos/master
Fix issue #1292 and add a test case for the AssertionError
2017-09-04 19:20:40 +02:00
Matthew Honnibal
c68f188eb0 Fix error on test 2017-09-04 18:59:36 +02:00
Matthew Honnibal
e8a26ebfab Add efficiency note to new get_lca_matrix() method 2017-09-04 15:43:52 +02:00
Eric Zhao
d61c117081 Lowest common ancestor matrix for spans and docs
Added functionality for spans and docs to get the lowest common ancestor
matrix by simply calling doc.get_lca_matrix() or
doc[:3].get_lca_matrix().
Corresponding unit tests were also added under spacy/tests/doc and
spacy/tests/spans.
Designed to address: https://github.com/explosion/spaCy/issues/969.
2017-09-03 12:22:19 -07:00
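Note: a usage sketch for the new method; it assumes an English model such as en_core_web_sm is installed, and the exact matrix values depend on the parse.

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I like New York")
    lca = doc.get_lca_matrix()
    # lca[i][j] is the index of the lowest common ancestor of tokens i and j
    # in the dependency tree, and lca[i][i] == i; spans work the same way:
    span_lca = doc[2:4].get_lca_matrix()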
Matthew Honnibal
9bffcaa73d Update test to make it slightly more direct
The `nlp` container should be unnecessary here. If so, we can test the tokenizer class just a little more directly.
2017-09-01 21:16:56 +02:00
Vimos Tan
a6d9fb5bb6 fix issue #1292 2017-08-30 14:49:14 +08:00
Paul O'Leary McCann
8b3e1f7b5b Handle out-of-vocab words
Words outside the tokenizer dictionary's vocabulary weren't being handled
properly. This adds a fix and a test for that. -POLM
2017-08-29 23:58:42 +09:00
Jeffrey Gerard
884ba168a8 Capture more noun chunks 2017-08-23 21:18:53 -07:00
Paul O'Leary McCann
95050201ce Add importorskip for Japanese fixture 2017-08-22 21:30:59 +09:00
Paul O'Leary McCann
bcf2b9b4f5 Update tagger & tokenizer tests
The tagger test is now parametrized and has two sentences with more tag coverage.

The tokenizer tests are updated to reflect differences in tokenization
between IPAdic and Unidic. -POLM
2017-08-22 00:03:11 +09:00
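Note: roughly the shape of a parametrized tagger test as described above; the fixture name and the expected tags are invented for illustration.

    import pytest

    TAG_CASES = [
        ("すもももももももものうち",
         ["NOUN", "ADP", "NOUN", "ADP", "NOUN", "ADP", "NOUN"]),
    ]

    @pytest.mark.parametrize("text,expected_pos", TAG_CASES)
    def test_ja_tagger(ja_nlp, text, expected_pos):
        doc = ja_nlp(text)                  # hypothetical Japanese pipeline fixture
        assert [t.pos_ for t in doc] == expected_pos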
Paul O'Leary McCann
adfd987316 Update the TAG_MAP 2017-08-22 00:02:55 +09:00
Paul O'Leary McCann
53e17296e9 Fix pronoun handling
Missed this case earlier.

連体詞 have three classes for UD purposes:

- その ("that", adnominal) -> DET
- それ ("that", pronoun) -> PRON
- 同じ ("same") -> ADJ

-POLM
2017-08-22 00:01:49 +09:00
Paul O'Leary McCann
c435f748d7 Put Mecab import in utility function 2017-08-22 00:01:28 +09:00
ines
dcff10abe9 Add regression test for #1281 2017-08-21 16:11:47 +02:00
ines
edc596d9a7 Add missing tokenizer exceptions (resolves #1281) 2017-08-21 16:11:36 +02:00
Paul O'Leary McCann
234a8a7591 Change default tag for 動詞,非自立可能
An example of this is いる in these sentences:

    彼はそこにいる。# いる as the main verb ("He is there."): should be VERB
    彼は底に立っている。# いる as part of 〜ている ("He is standing at the bottom."): should be AUX

Unclear which case is more numerous - need to check a large corpus - but
in keeping with the other ambiguous tags, this is mapped to the
"dominant" or first part of the tag. -POLM
2017-08-21 00:21:45 +09:00
Paul O'Leary McCann
6e9e686568 Sample implementation of Japanese Tagger (ref #1214)
This is far from complete but it should be enough to check some things.

1. Mecab transition. Janome doesn't support Unidic, only IPAdic, but the UD
tag mappings are based on Unidic. This swaps Janome out for Mecab to get
around that.

2. Raw tag extension. A simple tag map can't meet the specifications for
UD tag mappings, so this adds an extra field to ambiguous cases. For
this demo it just deals with the simplest case, which only needs to look
at the literal token. (In reality it may be necessary to look at the
whole sentence, but that's another issue.)

3. General code structure. Seems nobody else has implemented a custom
Tagger yet, so still not sure this is the correct way to pass the
vocabulary around, for example.

Any feedback would be greatly appreciated. -POLM
2017-08-08 01:27:15 +09:00
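Note: a sketch of the "look at the literal token" disambiguation described in point 2, using the 連体詞 cases from the commits above; the data and function name are illustrative only.

    ADNOMINAL_MAP = {"その": "DET", "それ": "PRON", "同じ": "ADJ"}

    def resolve_rentaishi(surface, default="DET"):
        """Pick a UD tag for a 連体詞 from the surface form alone."""
        return ADNOMINAL_MAP.get(surface, default)

    assert resolve_rentaishi("同じ") == "ADJ"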
Delirious Lettuce
d3b03f0544 Fix typos:
  * `auxillary` -> `auxiliary`
  * `consistute` -> `constitute`
  * `earlist` -> `earliest`
  * `prefered` -> `preferred`
  * `direcory` -> `directory`
  * `reuseable` -> `reusable`
  * `idiosyncracies` -> `idiosyncrasies`
  * `enviroment` -> `environment`
  * `unecessary` -> `unnecessary`
  * `yesteday` -> `yesterday`
  * `resouces` -> `resources`
2017-08-06 21:31:39 -06:00
Matthew Honnibal
d51d55bba6 Increment version 2017-07-22 15:43:16 +02:00
Matthew Honnibal
796b2f4c1b Remove print statements in tests 2017-07-22 15:42:38 +02:00
Matthew Honnibal
4b2e5e59ed Add flush_cache method to tokenizer, to fix #1061
The tokenizer caches output for common chunks, for efficiency. This
cache must be invalidated when the tokenizer rules change, e.g. when a new
special-case rule is introduced; failing to do so is what was causing #1061.

When the cache is flushed, we free the intermediate token chunks.
I *think* this is safe --- but if we start getting segfaults, this patch
is to blame. The resolution would be to simply not free those bits of
memory. They'll be freed when the tokenizer exits anyway.
2017-07-22 15:06:50 +02:00
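Note: a sketch of the situation this commit fixes, written against current spaCy API names; only the flush_cache method itself is taken from the commit message.

    import spacy
    from spacy.symbols import ORTH

    nlp = spacy.blank("en")
    nlp.tokenizer("don't panic")     # the first call caches the split of "don't"
    nlp.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "n't"}])
    # Before this commit the stale cache entry meant the new rule was ignored
    # (#1061); the flush_cache method added here clears the cache so the rule
    # takes effect on the next call.
    doc = nlp.tokenizer("don't panic")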
Matthew Honnibal
23a55b40ca Default to English noun chunks iterator if no lang set 2017-07-22 14:15:25 +02:00