Commit Graph

2876 Commits

Author SHA1 Message Date
Ines Montani
361211fe26 Merge pull request #1342 from wannaphongcom/master
Add Thai language
2017-09-26 15:40:55 +02:00
Yam
923c4c2fb2 Update punctuation.py
add `……`
2017-09-22 09:50:46 +08:00
Wannaphong Phatthiyaphaibun
1abf472068 add th test 2017-09-21 12:56:58 +07:00
Wannaphong Phatthiyaphaibun
39bb5690f0 update th 2017-09-21 00:36:02 +07:00
Wannaphong Phatthiyaphaibun
44291f6697 add thai 2017-09-20 23:26:34 +07:00
Yam
978b24ccd4 Update punctuation.py
In Chinese, `~` and `——` are hyphens,
and `·` is an interpunct (a separator mark)
2017-09-20 23:02:22 +08:00
Yu-chun Huang
188b439b25 Add Chinese punctuation
Add Chinese punctuation.
2017-09-19 16:58:42 +08:00
Yu-chun Huang
1f1f35dcd0 Add Chinese punctuation
Add Chinese punctuation.
2017-09-19 16:57:24 +08:00
Yu-chun Huang
7692b8c071 Update __init__.py
Set the "cut_all" parameter to False, or jieba will return ALL POSSIBLE word segmentations.
2017-09-12 16:23:47 +08:00
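A minimal illustration of why this matters (the sample text is arbitrary, not from the commit):

```python
# jieba's two modes: cut_all=True returns every overlapping segmentation,
# which is not what a tokenizer wants; cut_all=False returns one segmentation.
import jieba

text = "我来到北京清华大学"
print(list(jieba.cut(text, cut_all=True)))   # all possible word segmentations
print(list(jieba.cut(text, cut_all=False)))  # default "accurate" mode
```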
Matthew Honnibal
ddaff6ca56 Merge pull request #1287 from IamJeffG/feature/1226-more-complete-noun-chunks
Capture more noun chunks
2017-09-08 07:59:10 +02:00
Matthew Honnibal
45029a550e Fix customized-tokenizer tests 2017-09-04 20:13:13 +02:00
Matthew Honnibal
34c585396a Merge pull request #1294 from Vimos/master
Fix issue #1292 and add test case for the Assertion Error
2017-09-04 19:20:40 +02:00
Matthew Honnibal
c68f188eb0 Fix error on test 2017-09-04 18:59:36 +02:00
Matthew Honnibal
e8a26ebfab Add efficiency note to new get_lca_matrix() method 2017-09-04 15:43:52 +02:00
Eric Zhao
d61c117081 Lowest common ancestor matrix for spans and docs
Added functionality for spans and docs to get lowest common ancestor
matrix by simply calling: doc.get_lca_matrix() or
doc[:3].get_lca_matrix().
Corresponding unit tests were also added under spacy/tests/doc and
spacy/tests/spans.
Designed to address: https://github.com/explosion/spaCy/issues/969.
2017-09-03 12:22:19 -07:00
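A sketch of the new API as described above (the model name is an assumption; any pipeline with a parser works):

```python
import spacy

nlp = spacy.load("en_core_web_sm")       # assumed model with a dependency parser
doc = nlp("The quick brown fox jumps over the lazy dog")

lca = doc.get_lca_matrix()               # len(doc) x len(doc) matrix of token indices
print(lca[1][3])                         # index of the lowest common ancestor of tokens 1 and 3

span_lca = doc[:4].get_lca_matrix()      # the Span variant, restricted to doc[:4]
```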
Matthew Honnibal
9bffcaa73d Update test to make it slightly more direct
The `nlp` container should be unnecessary here. If so, we can test the tokenizer class just a little more directly.
2017-09-01 21:16:56 +02:00
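For context, a tokenizer can be exercised without building a full `nlp` pipeline; a rough sketch (import path assumes the v2-style `spacy.lang` layout):

```python
from spacy.lang.en import English

tokenizer = English.Defaults.create_tokenizer()   # standalone tokenizer, no pipeline
tokens = tokenizer("I can't do this.")
assert [t.text for t in tokens] == ["I", "ca", "n't", "do", "this", "."]
```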
Vimos Tan
a6d9fb5bb6 fix issue #1292 2017-08-30 14:49:14 +08:00
Jeffrey Gerard
884ba168a8 Capture more noun chunks 2017-08-23 21:18:53 -07:00
ines
dcff10abe9 Add regression test for #1281 2017-08-21 16:11:47 +02:00
ines
edc596d9a7 Add missing tokenizer exceptions (resolves #1281) 2017-08-21 16:11:36 +02:00
Delirious Lettuce
d3b03f0544 Fix typos:
  * `auxillary` -> `auxiliary`
  * `consistute` -> `constitute`
  * `earlist` -> `earliest`
  * `prefered` -> `preferred`
  * `direcory` -> `directory`
  * `reuseable` -> `reusable`
  * `idiosyncracies` -> `idiosyncrasies`
  * `enviroment` -> `environment`
  * `unecessary` -> `unnecessary`
  * `yesteday` -> `yesterday`
  * `resouces` -> `resources`
2017-08-06 21:31:39 -06:00
Matthew Honnibal
d51d55bba6 Increment version 2017-07-22 15:43:16 +02:00
Matthew Honnibal
796b2f4c1b Remove print statements in tests 2017-07-22 15:42:38 +02:00
Matthew Honnibal
4b2e5e59ed Add flush_cache method to tokenizer, to fix #1061
The tokenizer caches output for common chunks, for efficiency. This
cache must be invalidated when the tokenizer rules change, e.g. when a new
special-case rule is introduced. That's what was causing #1061.

When the cache is flushed, we free the intermediate token chunks.
I *think* this is safe --- but if we start getting segfaults, this patch
is to blame. The resolution would be to simply not free those bits of
memory. They'll be freed when the tokenizer exits anyway.
2017-07-22 15:06:50 +02:00
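A hedged sketch of the scenario behind #1061 (the word and attributes follow the usual docs-style example, not the commit itself):

```python
from spacy.attrs import ORTH
from spacy.lang.en import English

nlp = English()
print([t.text for t in nlp("gimme that")])   # ["gimme", "that"] -- result is now cached

# Before this fix, adding a special-case rule after the cache was warm had no
# effect, because the stale cached tokenization was returned unchanged.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])   # ["gim", "me", "that"] once the cache is flushed
```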
Matthew Honnibal
23a55b40ca Default to English noun chunks iterator if no lang set 2017-07-22 14:15:25 +02:00
Matthew Honnibal
9750a0128c Fix Span.noun_chunks. Closes #1207 2017-07-22 14:14:57 +02:00
Matthew Honnibal
d9b85675d7 Rename regression test 2017-07-22 14:14:35 +02:00
Matthew Honnibal
dfbc7e49de Add test for Issue #1207 2017-07-22 14:14:01 +02:00
Matthew Honnibal
0ae3807d7d Fix gaps in Lexeme API. Closes #1031 2017-07-22 13:53:48 +02:00
Matthew Honnibal
83e1b5f1e3 Merge branch 'master' of https://github.com/explosion/spaCy 2017-07-22 13:45:35 +02:00
Matthew Honnibal
45f6961ae0 Add __version__ symbol in __init__.py 2017-07-22 13:45:21 +02:00
Matthew Honnibal
8b9c4c5e1c Add missing SP symbol to tag map, re #1052 2017-07-22 13:44:17 +02:00
Ines Montani
9af04ea11f Merge pull request #1161 from AlexisEidelman/patch-1
French NUM_WORDS and ORDINAL_WORDS
2017-07-22 13:40:46 +02:00
Matthew Honnibal
44dd247e73 Merge branch 'master' of https://github.com/explosion/spaCy 2017-07-22 13:35:30 +02:00
Matthew Honnibal
94267ec50f Fix merge conflict in printer 2017-07-22 13:35:15 +02:00
Ines Montani
c7708dc736 Merge pull request #1177 from swierh/master
Dutch NUM_WORDS and ORDINAL_WORDS
2017-07-22 13:35:08 +02:00
Matthew Honnibal
5916d46ba8 Avoid use of deepcopy in printer 2017-07-22 13:34:01 +02:00
Ines Montani
9eca6503c1 Merge pull request #1157 from polm/master
Add basic Japanese Tokenizer Test
2017-07-10 13:07:11 +02:00
Paul O'Leary McCann
bc87b815cc Add comment clarifying what LANGUAGES does 2017-07-09 16:28:55 +09:00
Paul O'Leary McCann
04e6a65188 Remove Japanese from LANGUAGES
LANGUAGES is a list of languages whose tokenizers get run through a
variety of generic tests. Since the generic tests don't check the JA
fixture, they blow up when janome can't be found. -POLM
2017-07-09 16:23:26 +09:00
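The guard added a little later for the JA fixture looks roughly like this (fixture name and import path are assumptions):

```python
import pytest

@pytest.fixture
def ja_tokenizer():
    pytest.importorskip("janome")          # skip JA tests instead of erroring when janome is missing
    from spacy.ja import Japanese          # assumed import path for this era of spaCy
    return Japanese.Defaults.create_tokenizer()
```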
Swier
29720150f9 fix import of stop words in language data 2017-07-05 14:08:04 +02:00
Swier
f377c9c952 Rename stop_words.py to word_sets.py 2017-07-05 14:06:28 +02:00
Swier
5357874bf7 add Dutch numbers and ordinals 2017-07-05 14:03:30 +02:00
gispk47
669bd14213 Update __init__.py
Remove empty strings returned by jieba.cut; they can't be pushed into the list of tokens and trigger an assertion error
2017-07-01 13:12:00 +08:00
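A hedged sketch of the fix described above, filtering jieba's output before building the Doc (function and variable names are illustrative):

```python
import jieba
from spacy.tokens import Doc

def make_doc(vocab, text):
    # Drop empty strings from jieba's output so every entry is a real token.
    words = [w for w in jieba.cut(text, cut_all=False) if w]
    return Doc(vocab, words=words, spaces=[False] * len(words))
```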
Paul O'Leary McCann
c336193392 Parametrize and extend Japanese tokenizer tests 2017-06-29 00:09:40 +09:00
Paul O'Leary McCann
30a34ebb6e Add importorskip for janome 2017-06-29 00:09:20 +09:00
Alexis
1b3a5d87ba French NUM_WORDS and ORDINAL_WORDS 2017-06-28 14:11:20 +02:00
Paul O'Leary McCann
e56fea14eb Add basic Japanese tokenizer test 2017-06-28 01:24:25 +09:00
Paul O'Leary McCann
84041a2bb5 Make create_tokenizer work with Japanese 2017-06-28 01:18:05 +09:00
György Orosz
fa26041da6 Fixed typo in cli/package.py 2017-06-07 16:19:08 +02:00