Commit Graph

86 Commits

Author SHA1 Message Date
Julia Makogon
f1c3108d52 Fixing pymorphy2 dependency issue (#3329) (closes #3327)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement

* pymorphy2 initialization split for ru and uk (#3327)

* stop-words fixed

* Unit-tests updated
2019-02-25 15:48:17 +01:00
Stanisław Giziński
1448ad100c Improved polish tokenizer and stop words. (#2974)
* Improved stop words list

* Removed some wrong stop words form list

* Improved stop words list

* Removed some wrong stop words form list

* Improved Polish Tokenizer (#38)

* Add tests for polish tokenizer

* Add polish tokenizer exceptions

* Don't split any words containing hyphens

* Fix test case with wrong model answer

* Remove commented out line of code until better solution is found

* Add source srx' license

* Rename exception_list.py to match spaCy conventionality

* Add a brief explanation of where the exception list comes from

* Add newline after reach exception

* Rename COPYING.txt to LICENSE

* Delete old files

* Add header to the license

* Agreements signed

* Stanisław Giziński agreement

* Krzysztof Kowalczyk - signed agreement

* Mateusz Olko agreement

* Add DoomCoder's contributor agreement

* Improve like number checking in polish lang


* like num tests added

* all from SI system added

* Final licence and removed splitting exceptions

* Added polish stop words to LEX_ATTRA

* Add encoding info to pl tokenizer exceptions
2019-02-08 14:27:21 +11:00
Julia Makogon
b41d64825a Ukrainian language added. Small fixes in Russian (#3241)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement
2019-02-07 21:05:11 +01:00
Muhammad Irfan
2e84ec1513 Fixed ISO code for Urdu. (#3073) 2018-12-20 12:28:53 +01:00
Marc Puig
98fe1ab259 Catalan Language Support (#2940)
* Catalan language Support

* Ddding Catalan to documentation
2018-11-26 15:25:47 +01:00
Nathaniel J. Smith
26849874ad When calling getoption() in conftest.py, pass a default option (#2709)
* When calling getoption() in conftest.py, pass a default option

This is necessary to allow testing an installed spacy by running:

  pytest --pyargs spacy

* Add contributor agreement
2018-09-03 09:57:52 +02:00
Matthew Honnibal
6303ce3d0e Try to fix memory error by moving fr_tokenizer to module scope 2018-07-24 20:09:06 +02:00
Matthew Honnibal
b2e9e958b9 Add session scoping to tokenizers to try to fix oom on Appveyor 2018-07-24 19:44:18 +02:00
Paul O'Leary McCann
1987f3f784 Add Japanese lemmas (#2543)
This info was already available from Mecab, forgot to add it before.
2018-07-13 10:55:14 +02:00
Eleni170
6042723535 Add support for Greek language (#2535)
* Add contributor agreement

* Support for Greek language

* Fix missing el_tokenizer
2018-07-10 13:48:38 +02:00
Muhammad Irfan
f33c703066 Add Urdu Language Support (#2430)
* added Urdu language support.

* added Urdu language tests.

* modified conftest.py for Urdu language support.

* added spacy contributor agreement.
2018-06-22 11:14:03 +02:00
Aliia E
428bae66b5 Add Tatar Language Support (#2444)
* add Tatar lang support

* add Tatar letters

* add Tatar tests

* sign contributor agreement

* sign contributor agreement [x]

* remove comments from Language class

* remove all template comments
2018-06-19 10:17:53 +02:00
ines
b8ef9c1000 Fix model names in conftest (see #2379) 2018-05-30 14:10:20 +02:00
Jani Monoses
ec62cadf4c Updates to Romanian support (#2354)
* Add back Romanian in conftest

* Romanian lex_attr

* More tokenizer exceptions for Romanian

* Add tests for some Romanian tokenizer exceptions
2018-05-24 11:40:00 +02:00
Matthew Honnibal
581d318971 Fix conftest 2018-05-15 00:54:45 +02:00
Tahar Zanouda
00417794d3 Add Arabic language (#2314)
* added support for Arabic lang

* added Arabic language support

* updated conftest
2018-05-15 00:27:19 +02:00
Jani Monoses
0e08e49e87 Lemmatizer ro (#2319)
* Add Romanian lemmatizer lookup table.

Adapted from http://www.lexiconista.com/datasets/lemmatization/
by replacing cedillas with commas (ș and ț).

The original dataset is licensed under the Open Database License.

* Fix one blatant issue in the Romanian lemmatizer

* Romanian examples file

* Add ro_tokenizer in conftest

* Add Romanian lemmatizer test
2018-05-12 15:20:04 +02:00
Paul O'Leary McCann
bd72fbf09c Port Japanese mecab tokenizer from v1 (#2036)
* Port Japanese mecab tokenizer from v1

This brings the Mecab-based Japanese tokenization introduced in #1246 to
spaCy v2. There isn't a JapaneseTagger implementation yet, but POS tag
information from Mecab is stored in a token extension. A tag map is also
included.

As a reminder, Mecab is required because Universal Dependencies are
based on Unidic tags, and Janome doesn't support Unidic.

Things to check:

1. Is this the right way to use a token extension?

2. What's the right way to implement a JapaneseTagger? The approach in
 #1246 relied on `tag_from_strings` which is just gone now. I guess the
best thing is to just try training spaCy's default Tagger?

-POLM

* Add tagging/make_doc and tests
2018-05-03 18:38:26 +02:00
Matthew Honnibal
95a9615221 Fix loading of multiple pre-trained vectors
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.

The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']_nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.

In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
2018-03-28 16:02:59 +02:00
Canbey Bilgili
abe098b255 Adds Turkish Lemmatization 2017-12-01 17:04:32 +03:00
Vadim Mazaev
4ba7ddf651 Bugfixies 2017-11-30 12:29:38 +03:00
Vadim Mazaev
53e7c38637 Fixed tests depends on pymorphy2 2017-11-26 21:04:44 +03:00
Vadim Mazaev
cacd859dcd Added tag map, fixed tests fails, added more exceptions 2017-11-26 20:54:48 +03:00
Vadim Mazaev
81314f8659 Fixed tokenizer: added char classes; added first lemmatizer and
tokenizer tests
2017-11-21 22:23:59 +03:00
ines
3af281a334 Update test model name 2017-11-01 23:02:00 +01:00
Jim O'Regan
34ca59691b no idea what is wrong here 2017-10-31 14:50:13 +00:00
Jim O'Regan
41dd29e48e merge 2017-10-31 14:07:45 +00:00
Ines Montani
facf77e541 Merge branch 'develop' into support-danish 2017-10-24 11:53:19 +02:00
ines
612224c10d Port over changes from #1157 2017-10-14 13:11:39 +02:00
ines
9b3f8f9ec3 Fix formatting and add comment on languages 2017-10-14 13:11:18 +02:00
ines
61a503a611 Fix parser test 2017-10-07 00:38:51 +02:00
Wannaphong Phatthiyaphaibun
7b5263ffa4 fix thai test 2017-09-26 23:54:15 +07:00
Wannaphong Phatthiyaphaibun
5cba67146c add thai in spacy2 2017-09-26 21:36:27 +07:00
Jim O'Regan
7de709483b missed adding here 2017-09-11 10:51:21 +01:00
Jim O'Regan
b1b6123867 add ga_tokenizer 2017-09-11 10:31:41 +01:00
Matthew Honnibal
cb4839033c Fix loader for EN tests 2017-09-04 15:19:18 +02:00
Jim Geovedi
713d7c0aa0 added indonesian lang test 2017-08-20 12:17:14 +07:00
mollerhoj
e840077601 Add some basic tests for Danish 2017-07-03 15:49:51 +02:00
ines
a0f4592f0a Update tests 2017-06-05 02:26:13 +02:00
ines
3e105bcd36 Update tests 2017-06-05 02:09:27 +02:00
ines
078232932c Fix tokenizer fixture scope 2017-06-05 01:06:34 +02:00
Matthew Honnibal
55d0621532 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-04 15:53:25 -05:00
Matthew Honnibal
5b9f116aca Update tests 2017-06-04 15:53:17 -05:00
ines
f432bb4b48 Fix fixture scopes 2017-06-04 22:34:31 +02:00
ines
20a7003c0d Update model fixtures and reorganise tests 2017-05-29 22:14:31 +02:00
ines
6e3937efc5 Check for arguments of model markers to specify models to test
Lets user set --models --en for only English models
2017-05-29 22:10:16 +02:00
ines
b462076d80 Merge load_lang_class and get_lang_class 2017-05-14 01:31:10 +02:00
ines
5858857a78 Update languages list in conftest 2017-05-13 15:37:54 +02:00
ines
bd57b611cc Update conftest to lazy load languages 2017-05-09 00:02:21 +02:00
Gregory Howard
c0afcd22bb Merge remote-tracking branch 'remotes/upstream/master' 2017-04-27 14:42:54 +02:00