Commit Graph

640 Commits

Author SHA1 Message Date
Ines Montani
23e28e2844 Merge branch 'master' into develop 2019-09-15 17:57:09 +02:00
Ines Montani
c7e4ea7154 Update examples and languages.json [ci skip] 2019-09-15 17:56:40 +02:00
Ines Montani
16c2522791 Merge branch 'master' into develop 2019-09-14 16:42:01 +02:00
adrianeboyd
bee7961927 Add Kannada, Tamil, and Telugu unicode blocks (#4288)
Add Kannada, Tamil, and Telugu unicode blocks to uncased character
classes so that period is recognized as a suffix during tokenization.

(I'm sure a few symbols in the code blocks should not be ALPHA, but this
is mainly relevant for suffix detection and seems to be an improvement
in practice.)
2019-09-14 14:23:06 +02:00
Ines Montani
3126dd0904 Tidy up and auto-format [ci skip] 2019-09-14 12:58:06 +02:00
Paul O'Leary McCann
29a9e636eb Fix half-width space handling in JA (#4284) (closes #4262)
Before this patch, half-width spaces between words were simply lost in
Japanese text. This wasn't immediately noticeable because much Japanese
text never uses spaces at all.
2019-09-13 16:28:12 +02:00
Paul O'Leary McCann
7d8df69158 Bloom-filter backed Lookup Tables (#4268)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance

* Update docstrings

* Update docstrings and errors

* Update test

* Add Lookups.__len__

* Add serialization methods

* Add Lookups.remove_table

* Use msgpack for serialization to disk

* Fix file exists check

* Try using OrderedDict for everything

* Update .flake8 [ci skip]

* Try fixing serialization

* Update test_lookups.py

* Update test_serialize_vocab_strings.py

* Lookups / Tables now work

This implements the stubs in the Lookups/Table classes. Currently this
is in Cython but with no type declarations, so that could be improved.

* Add lookups to setup.py

* Actually add lookups pyx

The previous commit added the old py file...

* Lookups work-in-progress

* Move from pyx back to py

* Add string based lookups, fix serialization

* Update tests, language/lemmatizer to work with string lookups

There are some outstanding issues here:

- a pickling-related test fails due to the bloom filter
- some custom lemmatizers (fr/nl at least) have issues

More generally, there's a question of how to deal with the case where
you have a string but want to use the lookup table. Currently the table
allows access by string or id, but that's getting pretty awkward.

* Change lemmatizer lookup method to pass (orth, string)

* Fix token lookup

* Fix French lookup

* Fix lt lemmatizer test

* Fix Dutch lemmatizer

* Fix lemmatizer lookup test

This was using a normal dict instead of a Table, so checks for the
string instead of an integer key failed.

* Make uk/nl/ru lemmatizer lookup methods consistent

The mentioned tokenizers all have their own implementation of the
`lookup` method, which accesses a `Lookups` table. The way that was
called in `token.pyx` was changed so this should be updated to have the
same arguments as `lookup` in `lemmatizer.py` (specificially (orth/id,
string)).

Prior to this change tests weren't failing, but there would probably be
issues with normal use of a model. More tests should proably be added.

Additionally, the language-specific `lookup` implementations seem like
they might not be needed, since they handle things like lower-casing
that aren't actually language specific.

* Make recently added Greek method compatible

* Remove redundant class/method

Leftovers from a merge not cleaned up adequately.
2019-09-12 17:26:11 +02:00
Ines Montani
af25323653 Tidy up and auto-format 2019-09-11 14:00:36 +02:00
Ines Montani
e82a8d0d7a Merge branch 'master' into develop 2019-09-11 11:52:38 +02:00
Ines Montani
8f9f48b04c Add GreekLemmatizer.lookup (resolves #4272) 2019-09-11 11:44:40 +02:00
Ines Montani
6279d74c65 Tidy up and auto-format 2019-09-11 11:38:22 +02:00
Matthew Honnibal
7b858ba606 Update from master 2019-09-10 20:14:08 +02:00
adrianeboyd
e367864e59 Update Ukrainian create_lemmatizer kwargs (#4266)
Allow Ukrainian create_lemmatizer to accept lookups kwarg.
2019-09-10 11:14:46 +02:00
adrianeboyd
c32126359a Allow period as suffix following punctuation (#4248)
Addresses rare cases (such as `_MATH_.`, see #1061) where the final
period was not recognized as a suffix following punctuation.
2019-09-09 19:19:22 +02:00
Mihai Gliga
25aecd504f adding Romanian tag_map (#4257)
* adding Romanian tag_map

* added SCA file

* forgotten import
2019-09-09 11:53:09 +02:00
Matthew Honnibal
1653b818c5 Update Lithuanian tag map 2019-09-08 20:57:58 +02:00
Matthew Honnibal
1a65c5b7af Update develop from master 2019-09-08 18:21:41 +02:00
Pavle Vidanović
d03401f532 Lemmatizer lookup dictionary for Serbian and basic tag set adde… (#4251)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix

* Tokenizer exceptions added. Init file updated.

* Norm exceptions and lexical attributes added.

* Examples added.

* Tests added.

* sr_lang examples update.

* Tokenizer exceptions updated. (Serbian)

* Lemmatizer created. Licence included.

* Test updated.

* Tag map basic added.

* tag_map.py file removed since it uses default spacy tags.
2019-09-08 14:19:15 +02:00
Ivan Šarić
b01025dd06 adds Croatian lemma_lookup.json, license file and corresponding tests (#4252) 2019-09-08 13:40:45 +02:00
Bae Yong-Ju
a55f5a744f Fix ValueError exception on empty Korean text. (#4245) 2019-09-06 10:29:40 +02:00
Adriane Boyd
c39c13f26b Add guillemets/chevrons to German orth variants
Add guillemets/chevrons to German orth variants for both German/Austrian
and Swiss conventions.
2019-09-04 20:05:08 +02:00
Adriane Boyd
02babf9317 English tag map without unsupported features/values 2019-08-30 11:29:19 +02:00
Matthew Honnibal
782056d117 Fix morph rules 2019-08-28 16:59:45 +02:00
Matthew Honnibal
6b2ea883ed
Merge pull request #4205 from adrianeboyd/feature/gold-train-orth-variants
Add train_docs() option to add orth variants
2019-08-28 16:54:06 +02:00
Adriane Boyd
47af3f676e Single and paired orth variants for German 2019-08-28 09:19:18 +02:00
Adriane Boyd
56c38484a1 Single and paired orth variants for English 2019-08-28 09:19:18 +02:00
Matthew Honnibal
095c63c6b8 Avoid making prepositions get the tag SCONJ 2019-08-25 21:56:47 +02:00
Matthew Honnibal
c308cf3e3e
Merge branch 'master' into feature/lemmatizer 2019-08-25 13:52:27 +02:00
Ines Montani
5ca7dd0f94
💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance
2019-08-22 14:21:32 +02:00
Ines Montani
a8752a569d Auto-format [ci skip] 2019-08-22 11:44:39 +02:00
Pavle Vidanović
60e10a9f93 Serbian language improvement (#4169)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix

* Tokenizer exceptions added. Init file updated.

* Norm exceptions and lexical attributes added.

* Examples added.

* Tests added.

* sr_lang examples update.

* Tokenizer exceptions updated. (Serbian)
2019-08-22 11:43:07 +02:00
Pavle Vidanović
4fe9329bfb Serbian language code update "rs" -> "sr" (#4159)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix
2019-08-21 19:57:37 +02:00
Matthew Honnibal
bcd08f20af Merge changes from master 2019-08-21 14:18:52 +02:00
Ines Montani
f580302673 Tidy up and auto-format 2019-08-20 17:36:34 +02:00
Paul O'Leary McCann
756b66b7c0 Reduce size of language data (#4141)
* Move Turkish lemmas to a json file

Rather than a large dict in Python source, the data is now a big json
file. This includes a method for loading the json file, falling back to
a compressed file, and an update to MANIFEST.in that excludes json in
the spacy/lang directory.

This focuses on Turkish specifically because it has the most language
data in core.

* Transition all lemmatizer.py files to json

This covers all lemmatizer.py files of a significant size (>500k or so).
Small files were left alone.

None of the affected files have logic, so this was pretty
straightforward.

One unusual thing is that the lemma data for Urdu doesn't seem to be
used anywhere. That may require further investigation.

* Move large lang data to json for fr/nb/nl/sv

These are the languages that use a lemmatizer directory (rather than a
single file) and are larger than English.

For most of these languages there were many language data files, in
which case only the large ones (>500k or so) were converted to json. It
may or may not be a good idea to migrate the remaining Python files to
json in the future.

* Fix id lemmas.json

The contents of this file were originally just copied from the Python
source, but that used single quotes, so it had to be properly converted
to json first.

* Add .json.gz to gitignore

This covers the json.gz files built as part of distribution.

* Add language data gzip to build process

Currently this gzip data on every build; it works, but it should be
changed to only gzip when the source file has been updated.

* Remove Danish lemmatizer.py

Missed this when I added the json.

* Update to match latest explosion/srsly#9

The way gzipped json is loaded/saved in srsly changed a bit.

* Only compress language data if necessary

If a .json.gz file exists and is newer than the corresponding json file,
it's not recompressed.

* Move en/el language data to json

This only affected files >500kb, which was nouns for both languages and
the generic lookup table for English.

* Remove empty files in Norwegian tokenizer

It's unclear why, but the Norwegian (nb) tokenizer had empty files for
adj/adv/noun/verb lemmas. This may have been a result of copying the
structure of the English lemmatizer.

This removed the files, but still creates the empty sets in the
lemmatizer. That may not actually be necessary.

* Remove dubious entries in English lookup.json

" furthest" and " skilled" - both prefixed with a space - were in the
English lookup table. That seems obviously wrong so I have removed them.

* Fix small issues with en/fr lemmatizers

The en tokenizer was including the removed _nouns.py file, so that's
removed.

The fr tokenizer is unusual in that it has a lemmatizer directory with
both __init__.py and lemmatizer.py. lemmatizer.py had not been converted
to load the json language data, so that was fixed.

* Auto-format

* Auto-format

* Update srsly pin

* Consistently use pathlib paths
2019-08-20 14:54:11 +02:00
Ivan Šarić
434f6fa6c1 Issue #1107 - adds examples.py for Croatian language (#4143)
* adds contributor agreement for isaric

* adds examples.py for croatian language
2019-08-18 23:04:41 +02:00
Paul O'Leary McCann
7f82a1fe1b Make the emoticon list a raw string (#4139)
While working on an unrelated task I got warnings about an unsupported
escape sequence (`"\("`) in the tokenizer exceptions. Making the
tokenizer exceptions a raw string makes this warning go away.

The specific string that triggered this is `¯\(ツ)/¯`.
2019-08-18 15:17:13 +02:00
Ines Montani
009280fbc5 Tidy up and auto-format 2019-08-18 15:09:16 +02:00
AJ Rader
2f3648700c Correction of default lemmatizer lookup in English (Issue # 4104) (#4110)
* pytest file for issue4104 established

* edited default lookup english lemmatizer for spun; fixes issue 4102

* eliminated parameterization and sorted dictionary dependnency in issue 4104 test

* added contributor agreement
2019-08-15 11:39:10 +02:00
黎谢鹏
250a54414b update lang/zh (#4103)
* update lang/zh

* update lang/zh
2019-08-12 10:37:48 +02:00
Pavle Vidanović
e1a935d71c Stopwords for Serbian language. (#4078)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated
2019-08-05 10:22:27 +02:00
veer-bains
874bd8c8dd Fixed syntax error in lang/ko when using python 2 (#4082) (closes #4068)
* fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py

* fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py

* Update __init__.py

* Create veer-bains.md

* Update __init__.py

fixed syntax errors in variable datatype assignment when calling spacy.blank("ko") with python 2.7
2019-08-05 10:19:32 +02:00
Muhammad Irfan
d1d30b0442 added missing punctuation following conventions. (#4066) 2019-08-04 13:41:18 +02:00
Bae Yong-Ju
05fbf5d976 Fix error when Korean text contains regexp special characters. (#4022) 2019-07-25 17:53:33 +02:00
Paul O'Leary McCann
c8949ce88a Remove old comment (#4012)
Norwegian used to borrow from French but that doesn't appear to have
been true for a while now, so the comment that was here is no longer
relevant.
2019-07-23 23:10:06 +02:00
BreakBB
3e370cf2ba Add 'Prof.' to Englisch tokenizer_exceptions 2019-07-19 10:00:45 +02:00
Søren Lind Kristiansen
26aee70d95 Make Danish tokenizer split on forward slash 2019-07-12 15:20:42 +02:00
Ines Montani
197cfd7ebc Merge branch 'master' into pr/3948 2019-07-11 12:18:31 +02:00
Ines Montani
0b8406a05c Tidy up and auto-format 2019-07-11 12:02:25 +02:00
yash
815f8d13dd Fix default punctuation rules for hindi text (#3625 explosion) 2019-07-11 15:00:51 +05:30
cedar101
58f06e6180 Korean support (#3901)
* start lang/ko

* add test codes

* using natto-py

* add test_ko_tokenizer_full_tags()

* spaCy contributor agreement

* external dependency for ko

* collections.namedtuple for python version < 3.5

* case fix

* tuple unpacking

* add jongseong(final consonant)

* apply mecab option

* Remove Pipfile for now


Co-authored-by: Ines Montani <ines@ines.io>
2019-07-09 22:23:16 +02:00
Knut O. Hellan
a54f0cfc2b Norwegian tweaks (#3894)
* Norwegian fix

Add support for alternative past tense verb form (vaska).

* Norwegian months

Add all Norwegian months to tokenizer excpetions.

* More Norwegian abbreviations

Add more Norwegian abbreviations to tokenizer_exceptions.

* Contributor agreement khellan

Add signed contributor agreement for khellan (Knut O. Hellan).
2019-07-08 10:28:47 +02:00
Rokas Ramanauskas
61ce126d4c Lithuanian language support (#3895)
* initial LT lang support

* Added more stopwords. Started setting up some basic test environment (not complete)

* Initial morph rules for LT lang

* Closes #1 Adds tokenizer exceptions for Lithuanian

* Closes #5 Punctuation rules. Closes #6 Lexical Attributes

* test: add native examples to basic tests

* feat: add tag map for lt lang

* fix: remove undefined tag attribute 'Definite'

* feat: add lemmatizer for lt lang

* refactor: add new instances to lt lang morph rules; use tags from tag map

* refactor: add morph rules to lt lang defaults

* refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup

* refactor: add capitalized words to lt lang lemmatizer

* refactor: add more num words to lt lang lex attrs

* refactor: update lt lang stop word set

* refactor: add new instances to lt lang tokenizer exceptions

* refactor: remove comments form lt lang init file

* refactor: use function instead of lambda in lt lex lang getter

* refactor: remove conversion to dict in lt init when dict is already provided

* chore: rename lt 'test_basic' to 'test_text'

* feat: add more lt text tests

* feat: add lemmatizer tests

* refactor: remove unused imports, add newline to end of file

* chore: add contributor agreement

* chore: change 'en' to 'lt' in lt example description

* fix: add missing encoding info

* style: add newline to end of file

* refactor: use python2 compatible syntax

* style: reformat code using black
2019-07-08 10:25:22 +02:00
Ines Montani
4f1dae1c6b Update languages and examples (see #1107) 2019-06-26 16:19:17 +02:00
Ines Montani
c833d9b314 Add "v.s." to English tokenizer exceptions (see #3868) 2019-06-20 17:48:45 +02:00
Azagh3l
5accfbb938 Update exemples.py (#3838)
Added missing hyphen and accent.
2019-06-14 09:31:05 +02:00
Ines Montani
aae9034492 Tidy up [ci skip] 2019-06-12 13:38:23 +02:00
Azagh3l
eb3e4263ee Update lex_attrs.py (#3835)
Corrected typos, added french (from France) versions of some numbers.
2019-06-11 10:59:16 +02:00
Germán
86eb817b74 Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810) (closes #3803))
* (#3803) Spanish like_num returning false for number-like token

* (#3803) Spanish like_num now returning True for number-like token
2019-06-02 12:22:57 +02:00
Ujwal Narayan
ed7be3f64c Update norm_exceptions.py (#3778)
* Update norm_exceptions.py

Extended the Currency set to include Franc, Indian Rupee, Bangladeshi Taka, Korean Won, Mexican Dollar, and Egyptian Pound

* Fix formatting [ci skip]
2019-05-27 11:52:52 +02:00
estr4ng7d
604acb6ace Marathi Language Support (#3767)
* Adding Marathi language details and folder to it

* Adding few changes and running tests

* Adding few changes and running tests

* Update __init__.py

mh -> mr

* Rename spacy/lang/mh/__init__.py to spacy/lang/mr/__init__.py

* mh -> mr
2019-05-24 14:29:42 +02:00
Ujwal Narayan
4d550a3055 Enhancing Kannada language Resources (#3755)
* Updated stop_words.py

Added more stopwords

* Create ujwal-narayan.md

Enhancing Kannada language resources
2019-05-20 12:56:10 +02:00
Wannaphong Phatthiyaphaibun
5a14a13f64 fix thai bug (#3693)
fix tokenize for pythainlp
2019-05-10 14:21:34 +02:00
Ines Montani
78cb807a9a Auto-format [ci skip] 2019-05-06 16:58:29 +02:00
Dobita21
f95ecedd83 Add Thai lex_attrs (#3655)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults

* add acronym and some norm exception words

* add lex_attrs

* Add lexical attribute getters into the language defaults

* fix LEX_ATTRS


Co-authored-by: Donut <dobita21@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2019-05-01 12:03:14 +02:00
BreakBB
8952004dfc Update French example sents and add two German stop words (#3662)
* Update french example sentences

* Add 'anderem' and 'ihren' to German stop words
2019-05-01 12:01:35 +02:00
Dobita21
721e1fc86c update norm_exceptions (#3627)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults

* add acronym and some norm exception words
2019-04-23 12:48:03 +02:00
Dobita21
189c90743c Add Thai norm_exceptions (#3612)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults
2019-04-20 12:16:03 +02:00
Omer Celik
531c0869b2 Added Turkish Lira symbol(₺) (#3576)
Added Turkish Lira symbol(₺) 
https://en.wikipedia.org/wiki/Turkish_lira
2019-04-11 11:32:28 +02:00
Ines Montani
145c0b7e88 Tidy up and auto-format 2019-04-09 11:40:19 +02:00
Dobita21
8bf6967eb7 Update Thai stop words (#3545)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability
2019-04-05 12:06:38 +02:00
jeannefukumaru
f67d881b30 fix typos in tag_map flagged by python -m debug-data (#3542)
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.


Co-authored-by: Ines Montani <ines@ines.io>
2019-04-05 12:06:09 +02:00
Jeanne Choo
b6c9807431 Merge remote-tracking branch 'upstream/master' 2019-04-04 14:21:50 +08:00
Jeanne Choo
80e15af76c fixed tag_map.py merge conflict 2019-04-04 14:18:27 +08:00
jeannefukumaru
876ce01567 updated tag map with missing tags 2019-04-03 23:09:11 +08:00
Ines Montani
4faf62d515
Merge pull request #3530 from svlandeg/fix/issue_3521
Allow English stopwords with any type of apostrophe
2019-04-03 14:14:03 +02:00
Yves Peirsman
951825532c Improved Dutch language resources and Dutch lemmatization (#3409)
* Improved Dutch language resources and Dutch lemmatization

* Fix conftest

* Update punctuation.py

* Auto-format

* Format and fix tests

* Remove unused test file

* Re-add deleted test

* removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains

* Cleaner lemmatization files
2019-04-03 14:13:26 +02:00
svlandeg
4ff786e113 addressed all comments by Ines 2019-04-03 13:50:33 +02:00
Kamolsit Mongkolsrisawat
dcc67f3f51 Update Thai tokenizer_exception list (#3529)
* add tokenizer_exceptions word (ก-น) from https://goo.gl/JpJ2qq

* update tokenizer_exceptions word list

* add contributor file
2019-04-03 09:13:36 +02:00
svlandeg
673c81bbb4 unicode string for python 2.7 2019-04-02 13:52:07 +02:00
svlandeg
eca9cc5417 fixing Issue #3521 by adding all hyphen variants for each stopword 2019-04-02 13:24:59 +02:00
jeannefukumaru
6cdb7b2e04 added tag_map for indonesian (#3515)
* added tag_map for indonesian

* changed tag map from .py to .txt to see if tests pass

* added symbols import

* added utf8 encoding flag

* added missing SCONJ symbol

* Auto-format

* Remove unused imports

* Make tag map available in Indonesian defaults
2019-04-01 12:27:48 +02:00
Ines Montani
c23e234d65 Auto-format 2019-04-01 12:11:27 +02:00
Ines Montani
0a0b1087b0 Make tag map available in Indonesian defaults 2019-04-01 11:46:51 +02:00
Ines Montani
5d9212c44c Remove unused imports 2019-04-01 11:46:25 +02:00
Ines Montani
8d6b544632 Auto-format 2019-04-01 11:45:43 +02:00
jeannefukumaru
6567f27849
added missing SCONJ symbol 2019-04-01 17:02:53 +08:00
jeannefukumaru
082a0a2232
added utf8 encoding flag 2019-04-01 16:37:11 +08:00
jeannefukumaru
a741bed7a7
added symbols import 2019-04-01 16:21:06 +08:00
jeannefukumaru
745cf0c914 changed tag map from .py to .txt to see if tests pass 2019-04-01 07:04:50 +08:00
jeannefukumaru
3cc897102f added tag_map for indonesian 2019-04-01 00:00:08 +08:00
Duygu Altinok
5a7bc6b39d Fix/irreg adverbs extension (#3499)
* extended list of irreg adverbs

* added test to exceptions

* fixed typo
2019-03-28 13:23:33 +01:00
Wannaphong Phatthiyaphaibun
297a051992 Update Thai tag map (#3480)
* Update Thai tag map

Update Thai tag map

* Create wannaphongcom.md
2019-03-25 16:53:26 +01:00
Matthew Honnibal
c66bd61e88 Fix lemmas 2019-03-21 14:22:12 +01:00
Matthew Honnibal
04395ffa49 Bring English tag_map in line with UD Treebank
I wrote a small script to read the UD English training data and check
that our tag map and morph rules were resulting in the best POS map.
This hadn't been done for some time, and there have been various changes
to the UD schema since it has been done. After these changes we should
see much better agreement between our POS assignments and the UD POS
tags.
2019-03-21 13:53:44 +01:00
Mehdi Hamoumi
9211f30ee3 Tiny correction in french lookup dictionary (#3427) 2019-03-19 13:00:19 +01:00
Ines Montani
278e9d2eb0 Merge branch 'master' into feature/lemmatizer 2019-03-16 13:44:22 +01:00
Ines Montani
2912ddc9a6 Don't set extension attribute in Japanese (closes #3398) 2019-03-12 13:30:33 +01:00
Ines Montani
cdd418b93e Auto-format [ci skip] 2019-03-11 17:10:50 +01:00
Matthew Honnibal
39a4741e26 Add support for vocab.writing_system property (#3390)
* Add xfail test for vocab.writing_system

* Add vocab.writing_system property

* Set Language.Defaults.writing_system

* Set default writing system

* Remove xfail on test_vocab_writing_system
2019-03-11 15:23:20 +01:00
Ines Montani
ee4f312e89 Add writing_system to ArabicDefaults (experimental) 2019-03-11 14:22:23 +01:00
Ines Montani
ef80cfde6f Fix pickling of Japanese (closes #3191) 2019-03-11 13:34:23 +01:00
Matthew Honnibal
5d25ee52fb Fix English tag map 2019-03-11 01:06:02 +01:00
Matthew Honnibal
7503e1e505 Improve English tag map. Re #593, #3311 2019-03-10 23:50:00 +01:00
Matthew Honnibal
78aba46530 Update feature/lemmatizer from develop 2019-03-10 02:45:33 +01:00
Ines Montani
610fb306bd Revert hyphens 2019-03-09 12:51:53 +01:00
Ines Montani
bbabb6aaae Escape more hyphens 2019-03-09 12:41:05 +01:00
Ines Montani
b8db219850 Auto-format 2019-03-09 12:40:58 +01:00
Ines Montani
a145bfe627 Try escaping hyphens again 2019-03-09 03:06:50 +01:00
Ines Montani
b9c71fc0f0 Fix flags 2019-03-09 02:46:04 +01:00
Ines Montani
ae09b6a6cf Try fixing unicode inconsistencies on Python 2 2019-03-09 02:37:50 +01:00
Ines Montani
d957d7a697 Auto-format 2019-03-09 02:37:41 +01:00
Ines Montani
65402c3d02 Revert "Experiment with escaping hyphens"
This reverts commit 9b42e2d5dd.
2019-03-09 02:13:00 +01:00
Ines Montani
9b42e2d5dd Experiment with escaping hyphens 2019-03-09 02:05:26 +01:00
Matthew Honnibal
00cfadbf63 Fix obsolete data in English tokenizer exceptions 2019-03-07 21:58:16 +01:00
Matthew Honnibal
7afe56a360 Fix morphological features in en tag_map 2019-03-07 21:57:56 +01:00
Matthew Honnibal
3a667833d1 Fix morphological features in de tag_map 2019-03-07 21:57:43 +01:00
Matthew Honnibal
e585b50458 Fix features in English tag map 2019-03-07 18:32:09 +01:00
Matthew Honnibal
3993f41cc4 Update morphology branch from develop 2019-03-07 00:14:43 +01:00
Ines Montani
6bd34e9d54 Expose Japanese stop words (closes #3346) 2019-03-06 14:21:15 +01:00
Ines Montani
85deb96278 Fix whitespace 2019-03-06 14:20:34 +01:00
Ines Montani
23f6ebf0f3 Add missing " (closes #3343) 2019-02-27 16:37:03 +01:00
Ines Montani
48a2046d1c Remove stray print statement (closes #3342) 2019-02-27 15:35:04 +01:00
Ines Montani
07d7c0a1af Fix whitespace 2019-02-27 15:34:21 +01:00
Ines Montani
76ce8b2662 Merge branch 'master' into develop 2019-02-25 15:54:55 +01:00
Julia Makogon
f1c3108d52 Fixing pymorphy2 dependency issue (#3329) (closes #3327)
* Classes for Ukrainian; small fix in Russian.

* Contributor agreement

* pymorphy2 initialization split for ru and uk (#3327)

* stop-words fixed

* Unit-tests updated
2019-02-25 15:48:17 +01:00
Ines Montani
2982f82934 Auto-format 2019-02-24 14:09:15 +01:00
Matthew Honnibal
c5f947f194 Fix regex deprecation warnings 2019-02-21 11:56:47 +01:00
Sofie
9a478b6db8 Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293)
* splitting up latin unicode interval

* removing hyphen as infix for French

* adding failing test for issue 1235

* test for issue #3002 which now works

* partial fix for issue #2070

* keep the hyphen as infix for French (as it was)

* restore french expressions with hyphen as infix (as it was)

* added succeeding unit test for Issue #2656

* Fix issue #2822 with custom Italian exception

* Fix issue #2926 by allowing numbers right before infix /

* splitting up latin unicode interval

* removing hyphen as infix for French

* adding failing test for issue 1235

* test for issue #3002 which now works

* partial fix for issue #2070

* keep the hyphen as infix for French (as it was)

* restore french expressions with hyphen as infix (as it was)

* added succeeding unit test for Issue #2656

* Fix issue #2822 with custom Italian exception

* Fix issue #2926 by allowing numbers right before infix /

* remove duplicate

* remove xfail for Issue #2179 fixed by Matt

* adjust documentation and remove reference to regex lib
2019-02-20 22:10:13 +01:00
Ines Montani
3fdcdec6a0 Merge branch 'master' into develop 2019-02-18 10:03:32 +01:00
Roshni Biswas
e09f1347fa updates for Bengali language (#3286)
* Update morph_rules.py

* contributor agreement for roshni-b

* created example sentences
2019-02-18 10:02:28 +01:00
Ines Montani
043e8186f3 Merge branch 'master' into develop 2019-02-17 17:51:17 +01:00
Marc Puig
51268e9f21 Typo error fixed (#3284) 2019-02-17 17:51:02 +01:00
Ines Montani
19a002bfd3 Merge branch 'master' into develop 2019-02-17 12:22:54 +01:00
Roshni Biswas
e26d923726 Update morph_rules.py (#3283) 2019-02-17 12:21:47 +01:00
Ines Montani
c31a9dabd5 💫 Add en/em dash to prefixes and suffixes (#3281)
* Auto-format

* Add en/em dash to prefixes and suffixes
2019-02-15 10:29:59 +01:00
Ines Montani
2e31921d0a 💫 Add base Language classes for more languages (#3276)
* Add base classes for more languages

* Add test for language class initialization

Make sure language can be initialize – otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded
2019-02-15 01:31:19 +11:00
Ines Montani
106d95b01a Fix typo 2019-02-14 12:26:56 +01:00
Ines Montani
11d6b874db
Update stop_words.py 2019-02-14 12:25:19 +01:00
Ines Montani
4d2438f985 Tidy up and auto-format 2019-02-13 15:29:08 +01:00
Ines Montani
2f45bd94c0 Auto-formatting 2019-02-12 18:30:11 +01:00
Ines Montani
0184a95340 Merge branch 'master' into develop 2019-02-12 18:29:24 +01:00
Akhilesh
a78db10941 add kannada support (#3264)
* add kannada support

* add few more stop words

* add support for Kannada Language
2019-02-12 18:28:39 +01:00
Ines Montani
25602c794c Tidy up and fix small bugs and typos 2019-02-08 14:14:49 +01:00
Ines Montani
9e652afa4b Merge branch 'master' into develop 2019-02-08 13:28:09 +01:00
Björn Lennartsson
647f0140c7 Fixed tag map for Swedish Talbanken (#3186) 2019-02-08 14:28:59 +11:00
Stanisław Giziński
1448ad100c Improved polish tokenizer and stop words. (#2974)
* Improved stop words list

* Removed some wrong stop words form list

* Improved stop words list

* Removed some wrong stop words form list

* Improved Polish Tokenizer (#38)

* Add tests for polish tokenizer

* Add polish tokenizer exceptions

* Don't split any words containing hyphens

* Fix test case with wrong model answer

* Remove commented out line of code until better solution is found

* Add source srx' license

* Rename exception_list.py to match spaCy conventionality

* Add a brief explanation of where the exception list comes from

* Add newline after reach exception

* Rename COPYING.txt to LICENSE

* Delete old files

* Add header to the license

* Agreements signed

* Stanisław Giziński agreement

* Krzysztof Kowalczyk - signed agreement

* Mateusz Olko agreement

* Add DoomCoder's contributor agreement

* Improve like number checking in polish lang


* like num tests added

* all from SI system added

* Final licence and removed splitting exceptions

* Added polish stop words to LEX_ATTRA

* Add encoding info to pl tokenizer exceptions
2019-02-08 14:27:21 +11:00
Ines Montani
402d133c90 Add Ukrainian unicode 2019-02-07 21:11:58 +01:00
Ines Montani
e2d93e4852 Merge branch 'master' into develop 2019-02-07 21:10:08 +01:00
Ines Montani
2499da97e8 Format 2019-02-07 21:07:02 +01:00