spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-08 08:19:45 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	331d338b8b	Merge pull request #1246 from polm/ja-pos-tagger [wip] Sample implementation of Japanese Tagger (ref #1214)	2017-10-09 04:00:53 +02:00
Orion Montoya	b0d271809d	Unit test for lemmatizer exceptions -- copied from regression test for #1387	2017-10-05 10:49:28 -04:00
Orion Montoya	e81a608173	Regression test for lemmatizer exceptions -- demonstrate issue #1387	2017-10-05 10:47:48 -04:00
Wannaphong Phatthiyaphaibun	1abf472068	add th test	2017-09-21 12:56:58 +07:00
Matthew Honnibal	ddaff6ca56	Merge pull request #1287 from IamJeffG/feature/1226-more-complete-noun-chunks Capture more noun chunks	2017-09-08 07:59:10 +02:00
Matthew Honnibal	45029a550e	Fix customized-tokenizer tests	2017-09-04 20:13:13 +02:00
Matthew Honnibal	34c585396a	Merge pull request #1294 from Vimos/master Fix issue #1292 and add test case for the Assertion Error	2017-09-04 19:20:40 +02:00
Matthew Honnibal	c68f188eb0	Fix error on test	2017-09-04 18:59:36 +02:00
Eric Zhao	d61c117081	Lowest common ancestor matrix for spans and docs Added functionality for spans and docs to get lowest common ancestor matrix by simply calling: doc.get_lca_matrix() or doc[:3].get_lca_matrix(). Corresponding unit tests were also added under spacy/tests/doc and spacy/tests/spans. Designed to address: https://github.com/explosion/spaCy/issues/969.	2017-09-03 12:22:19 -07:00
Matthew Honnibal	9bffcaa73d	Update test to make it slightly more direct The `nlp` container should be unnecessary here. If so, we can test the tokenizer class just a little more directly.	2017-09-01 21:16:56 +02:00
Vimos Tan	a6d9fb5bb6	fix issue #1292	2017-08-30 14:49:14 +08:00
Paul O'Leary McCann	8b3e1f7b5b	Handle out-of-vocab words Wasn't handling words out of the tokenizer dictionary vocabulary properly. This adds a fix and test for that. -POLM	2017-08-29 23:58:42 +09:00
Jeffrey Gerard	884ba168a8	Capture more noun chunks	2017-08-23 21:18:53 -07:00
Paul O'Leary McCann	95050201ce	Add importorskip for Japanese fixture	2017-08-22 21:30:59 +09:00
Paul O'Leary McCann	bcf2b9b4f5	Update tagger & tokenizer tests Tagger is now parametrized and has two sentences with more tag coverage. The tokenizer tests are updated to reflect differences in tokenization between IPAdic and Unidic. -POLM	2017-08-22 00:03:11 +09:00
ines	dcff10abe9	Add regression test for #1281	2017-08-21 16:11:47 +02:00
Paul O'Leary McCann	6e9e686568	Sample implementation of Japanese Tagger (ref #1214) This is far from complete but it should be enough to check some things. 1. Mecab transition. Janome doesn't support Unidic, only IPAdic, but UD tag mappings are based on Unidic. This swaps out Janome for Mecab to get around that. 2. Raw tag extension. A simple tag map can't meet the specifications for UD tag mappings, so this adds an extra field to ambiguous cases. For this demo it just deals with the simplest case, which only needs to look at the literal token. (In reality it may be necessary to look at the whole sentence, but that's another issue.) 3. General code structure. Seems nobody else has implemented a custom Tagger yet, so still not sure this is the correct way to pass the vocabulary around, for example. Any feedback would be greatly appreciated. -POLM	2017-08-08 01:27:15 +09:00
Matthew Honnibal	796b2f4c1b	Remove print statements in tests	2017-07-22 15:42:38 +02:00
Matthew Honnibal	4b2e5e59ed	Add flush_cache method to tokenizer, to fix #1061 The tokenizer caches output for common chunks, for efficiency. This cache must be invalidated when the tokenizer rules change, e.g. when a new special-case rule is introduced. That's what was causing #1061. When the cache is flushed, we free the intermediate token chunks. I think this is safe --- but if we start getting segfaults, this patch is to blame. The resolution would be to simply not free those bits of memory. They'll be freed when the tokenizer exits anyway.	2017-07-22 15:06:50 +02:00
Matthew Honnibal	d9b85675d7	Rename regression test	2017-07-22 14:14:35 +02:00
Matthew Honnibal	dfbc7e49de	Add test for Issue #1207	2017-07-22 14:14:01 +02:00
Matthew Honnibal	0ae3807d7d	Fix gaps in Lexeme API. Closes #1031	2017-07-22 13:53:48 +02:00
Paul O'Leary McCann	bc87b815cc	Add comment clarifying what LANGUAGES does	2017-07-09 16:28:55 +09:00
Paul O'Leary McCann	04e6a65188	Remove Japanese from LANGUAGES LANGUAGES is a list of languages whose tokenizers get run through a variety of generic tests. Since the generic tests don't check the JA fixture, it blows up when it can't find janome. -POLM	2017-07-09 16:23:26 +09:00
Paul O'Leary McCann	c336193392	Parametrize and extend Japanese tokenizer tests	2017-06-29 00:09:40 +09:00
Paul O'Leary McCann	30a34ebb6e	Add importorskip for janome	2017-06-29 00:09:20 +09:00
Paul O'Leary McCann	e56fea14eb	Add basic Japanese tokenizer test	2017-06-28 01:24:25 +09:00
ines	6e1dbc608e	Fix parse_tree test	2017-05-13 12:34:20 +02:00
Matthew Honnibal	ad590feaa8	Fix test, which imported English incorrectly	2017-05-13 11:36:19 +02:00
Matthew Honnibal	b2540d2379	Merge Kengz's tree_print patch	2017-05-13 03:18:49 +02:00
Ines Montani	7da9cefd25	Merge pull request #1022 from luvogels/master Initial support for Norwegian Bokmål	2017-04-27 11:16:06 +02:00
luvogels	d12a0b6431	Hooked up tokenizer tests	2017-04-26 23:21:41 +02:00
luvogels	8de59ce3b9	Added tokenizer tests	2017-04-26 19:10:18 +02:00
Matthew Honnibal	4d98511db7	Make Span hashable. Closes #1019	2017-04-26 19:01:05 +02:00
Matthew Honnibal	24c4c51f13	Try to make test999 less flakey	2017-04-26 18:42:06 +02:00
Matthew Honnibal	c4be9c36fe	Fix unicode header in tests	2017-04-24 10:09:01 +02:00
Matthew Honnibal	65f10b53e5	Fix test	2017-04-24 00:25:55 +02:00
Matthew Honnibal	70a43858e1	Fix flakey test	2017-04-24 00:06:30 +02:00
Matthew Honnibal	3973af2d15	Make training test less flakey	2017-04-23 22:59:34 +02:00
ines	42305bc519	Remove unnecessary test	2017-04-23 21:21:41 +02:00
ines	012ea594d1	Add file for misc tests	2017-04-23 21:06:51 +02:00
ines	83f66947dc	Rename test_download to test_cli	2017-04-23 21:06:50 +02:00
Matthew Honnibal	874a3cbb07	Add test for Issue #955	2017-04-23 17:57:01 +02:00
Matthew Honnibal	5d8af40445	Add test for Issue #999	2017-04-23 17:06:30 +02:00
Matthew Honnibal	040751ad17	Remove xfail on Test #910	2017-04-23 16:28:55 +02:00
Ben Eyal	e90e8a3f10	Enable test	2017-04-20 02:25:24 +03:00
ines	2bd89e7ade	Tidy up Hebrew tests and test for punctuation (see #995)	2017-04-19 19:28:03 +02:00
ines	13d30b6c01	xfail lemmatizer test that's causing problems (see #546)	2017-04-16 21:18:39 +02:00
ines	0084466a66	Remove unused utf8open util and replace os.path with ensure_path	2017-04-16 20:37:45 +02:00
Matthew Honnibal	1dca7eeb03	Add unicode declaration on new regression test	2017-04-07 18:09:23 +02:00

572 Commits