Commit Graph

572 Commits

Author SHA1 Message Date
Matthew Honnibal
331d338b8b Merge pull request #1246 from polm/ja-pos-tagger
[wip] Sample implementation of Japanese Tagger (ref #1214)
2017-10-09 04:00:53 +02:00
Orion Montoya
b0d271809d Unit test for lemmatizer exceptions -- copied from regression test for #1387 2017-10-05 10:49:28 -04:00
Orion Montoya
e81a608173 Regression test for lemmatizer exceptions -- demonstrate issue #1387 2017-10-05 10:47:48 -04:00
Wannaphong Phatthiyaphaibun
1abf472068 add th test 2017-09-21 12:56:58 +07:00
Matthew Honnibal
ddaff6ca56 Merge pull request #1287 from IamJeffG/feature/1226-more-complete-noun-chunks
Capture more noun chunks
2017-09-08 07:59:10 +02:00
Matthew Honnibal
45029a550e Fix customized-tokenizer tests 2017-09-04 20:13:13 +02:00
Matthew Honnibal
34c585396a Merge pull request #1294 from Vimos/master
Fix issue #1292 and add test case for the Assertion Error
2017-09-04 19:20:40 +02:00
Matthew Honnibal
c68f188eb0 Fix error on test 2017-09-04 18:59:36 +02:00
Eric Zhao
d61c117081 Lowest common ancestor matrix for spans and docs
Added functionality for spans and docs to get lowest common ancestor
matrix by simply calling: doc.get_lca_matrix() or
doc[:3].get_lca_matrix().
Corresponding unit tests were also added under spacy/tests/doc and
spacy/tests/spans.
Designed to address: https://github.com/explosion/spaCy/issues/969.
2017-09-03 12:22:19 -07:00
Matthew Honnibal
9bffcaa73d Update test to make it slightly more direct
The `nlp` container should be unnecessary here. If so, we can test the tokenizer class just a little more directly.
2017-09-01 21:16:56 +02:00
Vimos Tan
a6d9fb5bb6 fix issue #1292 2017-08-30 14:49:14 +08:00
Paul O'Leary McCann
8b3e1f7b5b Handle out-of-vocab words
Wasn't handling words out of the tokenizer dictionary vocabulary
properly. This adds a fix and test for that. -POLM
2017-08-29 23:58:42 +09:00
Jeffrey Gerard
884ba168a8 Capture more noun chunks 2017-08-23 21:18:53 -07:00
Paul O'Leary McCann
95050201ce Add importorskip for Japanese fixture 2017-08-22 21:30:59 +09:00
Paul O'Leary McCann
bcf2b9b4f5 Update tagger & tokenizer tests
Tagger is now parametrized and has two sentences with more tag coverage.

The tokenizer tests are updated to reflect differences in tokenization
between IPAdic and Unidic. -POLM
2017-08-22 00:03:11 +09:00
ines
dcff10abe9 Add regression test for #1281 2017-08-21 16:11:47 +02:00
Paul O'Leary McCann
6e9e686568 Sample implementation of Japanese Tagger (ref #1214)
This is far from complete but it should be enough to check some things.

1. Mecab transition. Janome doesn't support Unidic, only IPAdic, but UD
tag mappings are based on Unidic. This switches out Mecab for Janome to
get around that.

2. Raw tag extension. A simple tag map can't meet the specifications for
UD tag mappings, so this adds an extra field to ambiguous cases. For
this demo it just deals with the simplest case, which only needs to look
at the literal token. (In reality it may be necessary to look at the
whole sentence, but that's another issue.)

3. General code structure. Seems nobody else has implemented a custom
Tagger yet, so still not sure this is the correct way to pass the
vocabulary around, for example.

Any feedback would be greatly appreciated. -POLM
2017-08-08 01:27:15 +09:00
Matthew Honnibal
796b2f4c1b Remove print statements in tests 2017-07-22 15:42:38 +02:00
Matthew Honnibal
4b2e5e59ed Add flush_cache method to tokenizer, to fix #1061
The tokenizer caches output for common chunks, for efficiency. This
cache is be invalidated when the tokenizer rules change, e.g. when a new
special-case rule is introduced. That's what was causing #1061.

When the cache is flushed, we free the intermediate token chunks.
I *think* this is safe --- but if we start getting segfaults, this patch
is to blame. The resolution would be to simply not free those bits of
memory. They'll be freed when the tokenizer exits anyway.
2017-07-22 15:06:50 +02:00
Matthew Honnibal
d9b85675d7 Rename regression test 2017-07-22 14:14:35 +02:00
Matthew Honnibal
dfbc7e49de Add test for Issue #1207 2017-07-22 14:14:01 +02:00
Matthew Honnibal
0ae3807d7d Fix gaps in Lexeme API. Closes #1031 2017-07-22 13:53:48 +02:00
Paul O'Leary McCann
bc87b815cc Add comment clarifying what LANGUAGES does 2017-07-09 16:28:55 +09:00
Paul O'Leary McCann
04e6a65188 Remove Japanese from LANGUAGES
LANGUAGES is a list of languages whose tokenizers get run through a
variety of generic tests. Since the generic tests don't check the JA
fixture, it blows up when it can't find janome. -POLM
2017-07-09 16:23:26 +09:00
Paul O'Leary McCann
c336193392 Parametrize and extend Japanese tokenizer tests 2017-06-29 00:09:40 +09:00
Paul O'Leary McCann
30a34ebb6e Add importorskip for janome 2017-06-29 00:09:20 +09:00
Paul O'Leary McCann
e56fea14eb Add basic Japanese tokenizer test 2017-06-28 01:24:25 +09:00
ines
6e1dbc608e Fix parse_tree test 2017-05-13 12:34:20 +02:00
Matthew Honnibal
ad590feaa8 Fix test, which imported English incorrectly 2017-05-13 11:36:19 +02:00
Matthew Honnibal
b2540d2379 Merge Kengz's tree_print patch 2017-05-13 03:18:49 +02:00
Ines Montani
7da9cefd25 Merge pull request #1022 from luvogels/master
Initial support for Norwegian Bokmål
2017-04-27 11:16:06 +02:00
luvogels
d12a0b6431 Hooked up tokenizer tests 2017-04-26 23:21:41 +02:00
luvogels
8de59ce3b9 Added tokenizer tests 2017-04-26 19:10:18 +02:00
Matthew Honnibal
4d98511db7 Make Span hashable. Closes #1019 2017-04-26 19:01:05 +02:00
Matthew Honnibal
24c4c51f13 Try to make test999 less flakey 2017-04-26 18:42:06 +02:00
Matthew Honnibal
c4be9c36fe Fix unicode header in tests 2017-04-24 10:09:01 +02:00
Matthew Honnibal
65f10b53e5 Fix test 2017-04-24 00:25:55 +02:00
Matthew Honnibal
70a43858e1 Fix flakey test 2017-04-24 00:06:30 +02:00
Matthew Honnibal
3973af2d15 Make training test less flakey 2017-04-23 22:59:34 +02:00
ines
42305bc519 Remove unnecessary test 2017-04-23 21:21:41 +02:00
ines
012ea594d1 Add file for misc tests 2017-04-23 21:06:51 +02:00
ines
83f66947dc Rename test_download to test_cli 2017-04-23 21:06:50 +02:00
Matthew Honnibal
874a3cbb07 Add test for Issue #955 2017-04-23 17:57:01 +02:00
Matthew Honnibal
5d8af40445 Add test for Issue #999 2017-04-23 17:06:30 +02:00
Matthew Honnibal
040751ad17 Remove xfail on Test #910 2017-04-23 16:28:55 +02:00
Ben Eyal
e90e8a3f10 Enable test 2017-04-20 02:25:24 +03:00
ines
2bd89e7ade Tidy up Hebrew tests and test for punctuation (see #995) 2017-04-19 19:28:03 +02:00
ines
13d30b6c01 xfail lemmatizer test that's causing problems (see #546) 2017-04-16 21:18:39 +02:00
ines
0084466a66 Remove unused utf8open util and replace os.path with ensure_path 2017-04-16 20:37:45 +02:00
Matthew Honnibal
1dca7eeb03 Add unicode declaration on new regression test 2017-04-07 18:09:23 +02:00