260 Commits

Author SHA1 Message Date
Jani Monoses
ec62cadf4c Updates to Romanian support (#2354)
* Add back Romanian in conftest

* Romanian lex_attr

* More tokenizer exceptions for Romanian

* Add tests for some Romanian tokenizer exceptions
2018-05-24 11:40:00 +02:00
Tahar Zanouda
00417794d3 Add Arabic language (#2314)
* added support for Arabic lang

* added Arabic language support

* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses
0e08e49e87 Lemmatizer ro (#2319)
* Add Romanian lemmatizer lookup table.

Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).

The original dataset is licensed under the Open Database License.

* Fix one blatant issue in the Romanian lemmatizer

* Romanian examples file

* Add ro_tokenizer in conftest

* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Paul O'Leary McCann
bd72fbf09c Port Japanese mecab tokenizer from v1 (#2036)
* Port Japanese mecab tokenizer from v1

This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.

As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.

Things to check:

1. Is this the right way to use a token extension?

2. What's the right way to implement a JapaneseTagger? The approach in
#1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?

-POLM

* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
Jens Dahl Møllerhøj
e5055e3cf6 Add Danish lemmatizer (#2184)
* add danish lemmatizer

* fill contributor agreement
2018-04-07 19:07:28 +02:00
ines
6d2c85f428 Drop six and related hacks as a dependency 2018-03-28 10:45:25 +02:00
4altinok
471d3c9e23 added lex test for is_currency 2018-02-11 18:50:50 +01:00
Ines Montani
a3dd167d7f Merge branch 'master' into da_ud_tokenization 2017-12-20 21:05:34 +00:00
Søren Lind Kristiansen
15d13efafd Tune Danish tokenizer to more closely match tokenization in Universal Dependencies. 2017-12-20 17:36:52 +01:00
Canbey Bilgili
abe098b255 Adds Turkish Lemmatization 2017-12-01 17:04:32 +03:00
Matthew Honnibal
f9ed9ea529 Merge pull request #1624 from GreenRiverRUS/russian
Add support for Russian
2017-11-29 23:10:01 +01:00
Søren Lind Kristiansen
0ffd27b0f6 Add several Danish alternative spellings 2017-11-27 13:35:41 +01:00
Vadim Mazaev
cacd859dcd Added tag map, fixed tests fails, added more exceptions 2017-11-26 20:54:48 +03:00
Søren Lind Kristiansen
6aa241bcec Add day of month tokenizer exceptions for Danish. 2017-11-24 15:03:24 +01:00
Søren Lind Kristiansen
0c276ed020 Add weekday abbreviations and remove ambiguous month abbreviations for Danish. 2017-11-24 14:43:29 +01:00
Søren Lind Kristiansen
056547e989 Add multiple tokenizer exceptions for Danish. 2017-11-24 11:51:26 +01:00
Søren Lind Kristiansen
8dc265ac0c Add test for tokenization of 'i.' for Danish. 2017-11-24 11:29:37 +01:00
Vadim Mazaev
81314f8659 Fixed tokenizer: added char classes; added first lemmatizer and
tokenizer tests
2017-11-21 22:23:59 +03:00
ines
17849dee4b Fix French test (see #1617) 2017-11-20 13:59:59 +01:00
Matthew Honnibal
63c6ae4191 Fix lemmatizer test 2017-11-06 11:57:06 +01:00
Matthew Honnibal
144a93c2a5 Back-off to tensor for similarity if no vectors 2017-11-03 20:56:33 +01:00
Matthew Honnibal
d6e831bf89 Fix lemmatizer tests 2017-11-03 19:46:34 +01:00
Jim O'Regan
08b0bfd153 merge 2017-10-31 22:55:59 +00:00
Jim O'Regan
00ecfa5417 Ó, not O 2017-10-31 22:54:42 +00:00
Ines Montani
25b1d6cd91 Fix syntax error 2017-10-31 22:36:03 +01:00
Jim O'Regan
fe4b10346a replace example sentence until I get around to adding a punctuation.py 2017-10-31 20:24:53 +00:00
Jim O'Regan
d4a8160c36 change quotes 2017-10-31 15:15:44 +00:00
Jim O'Regan
41dd29e48e merge 2017-10-31 14:07:45 +00:00
Ines Montani
facf77e541 Merge branch 'develop' into support-danish 2017-10-24 11:53:19 +02:00
ines
cd6a29dce7 Port over changes from #1294 2017-10-14 13:28:46 +02:00
ines
38c756fd85 Port over changes from #1287 2017-10-14 13:16:21 +02:00
ines
612224c10d Port over changes from #1157 2017-10-14 13:11:39 +02:00
Matthew Honnibal
cf6da9301a Update lemmatizer test 2017-10-12 22:50:52 +02:00
ines
453c47ca24 Add German lemmatizer tests 2017-10-11 13:27:26 +02:00
Matthew Honnibal
c6cd81f192 Wrap try/except around model saving 2017-10-05 08:14:24 -05:00
Matthew Honnibal
fd4baff475 Update tests 2017-10-05 08:12:27 -05:00
Wannaphong Phatthiyaphaibun
5cba67146c add thai in spacy2 2017-09-26 21:36:27 +07:00
ines
ece30c28a8 Don't split hyphenated words in German
This way, the tokenizer matches the tokenization in German treebanks
2017-09-16 20:40:15 +02:00
Jim O'Regan
187be6d372 copy/paste error 2017-09-11 09:33:17 +01:00
Jim O'Regan
c283e9edfe first stab at test 2017-09-11 08:57:48 +01:00
Matthew Honnibal
d5fbf27335 Fix test 2017-09-04 16:45:11 +02:00
Matthew Honnibal
644d6c9e1a Improve lemmatization tests, re #1296 2017-09-04 15:17:44 +02:00
Jim Geovedi
fbc62a09c7 added {pre,suf,in}fix tests 2017-08-20 13:43:00 +07:00
Jim Geovedi
cc4772cac2 reworks 2017-08-03 13:08:38 +07:00
Jim Geovedi
783f7d8b86 added test set for Indonesian language 2017-07-29 18:21:07 +07:00
mollerhoj
e840077601 Add some basic tests for Danish 2017-07-03 15:49:51 +02:00
ines
cc9c5dc7a3 Fix noun chunks test 2017-06-05 16:39:04 +02:00
ines
a0f4592f0a Update tests 2017-06-05 02:26:13 +02:00
ines
3e105bcd36 Update tests 2017-06-05 02:09:27 +02:00
Matthew Honnibal
58be0e1f6f Update tests 2017-06-04 16:35:06 -05:00
Ines Montani
112c5787eb Merge pull request #1101 from oroszgy/hu_tokenizer_fix
More robust Hungarian tokenizer.
2017-06-04 22:37:51 +02:00
ines
e47eef5e03 Update German tokenizer exceptions and tests 2017-06-03 21:07:44 +02:00
ines
d77c2cc8bb Add tests for English norm exceptions 2017-06-03 20:59:50 +02:00
Gyorgy Orosz
f0c3b09242 More robust Hungarian tokenizer. 2017-05-31 22:28:40 +02:00
ines
20a7003c0d Update model fixtures and reorganise tests 2017-05-29 22:14:31 +02:00
ines
d0c6d4f76d Fix formatting 2017-05-23 11:32:00 +02:00
ines
2c3bdd09b1 Add English test for like_num 2017-05-09 11:06:34 +02:00
ines
22375eafb0 Fix and merge attrs and lex_attrs tests 2017-05-09 11:06:25 +02:00
ines
c714841cc8 Move language-specific tests to tests/lang 2017-05-09 00:02:37 +02:00
ines
3c0f85de8e Remove imports in /lang/__init__.py 2017-05-08 23:58:07 +02:00