Commit Graph

486 Commits

Author SHA1 Message Date
Søren Lind Kristiansen
26aee70d95 Make Danish tokenizer split on forward slash 2019-07-12 15:20:42 +02:00
Ines Montani
197cfd7ebc Merge branch 'master' into pr/3948 2019-07-11 12:18:31 +02:00
Ines Montani
0b8406a05c Tidy up and auto-format 2019-07-11 12:02:25 +02:00
yash
815f8d13dd Fix default punctuation rules for hindi text (#3625 explosion) 2019-07-11 15:00:51 +05:30
cedar101
58f06e6180 Korean support (#3901)
* start lang/ko

* add test codes

* using natto-py

* add test_ko_tokenizer_full_tags()

* spaCy contributor agreement

* external dependency for ko

* collections.namedtuple for python version < 3.5

* case fix

* tuple unpacking

* add jongseong(final consonant)

* apply mecab option

* Remove Pipfile for now


Co-authored-by: Ines Montani <ines@ines.io>
2019-07-09 22:23:16 +02:00
Knut O. Hellan
a54f0cfc2b Norwegian tweaks (#3894)
* Norwegian fix

Add support for alternative past tense verb form (vaska).

* Norwegian months

Add all Norwegian months to tokenizer excpetions.

* More Norwegian abbreviations

Add more Norwegian abbreviations to tokenizer_exceptions.

* Contributor agreement khellan

Add signed contributor agreement for khellan (Knut O. Hellan).
2019-07-08 10:28:47 +02:00
Rokas Ramanauskas
61ce126d4c Lithuanian language support (#3895)
* initial LT lang support

* Added more stopwords. Started setting up some basic test environment (not complete)

* Initial morph rules for LT lang

* Closes #1 Adds tokenizer exceptions for Lithuanian

* Closes #5 Punctuation rules. Closes #6 Lexical Attributes

* test: add native examples to basic tests

* feat: add tag map for lt lang

* fix: remove undefined tag attribute 'Definite'

* feat: add lemmatizer for lt lang

* refactor: add new instances to lt lang morph rules; use tags from tag map

* refactor: add morph rules to lt lang defaults

* refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup

* refactor: add capitalized words to lt lang lemmatizer

* refactor: add more num words to lt lang lex attrs

* refactor: update lt lang stop word set

* refactor: add new instances to lt lang tokenizer exceptions

* refactor: remove comments form lt lang init file

* refactor: use function instead of lambda in lt lex lang getter

* refactor: remove conversion to dict in lt init when dict is already provided

* chore: rename lt 'test_basic' to 'test_text'

* feat: add more lt text tests

* feat: add lemmatizer tests

* refactor: remove unused imports, add newline to end of file

* chore: add contributor agreement

* chore: change 'en' to 'lt' in lt example description

* fix: add missing encoding info

* style: add newline to end of file

* refactor: use python2 compatible syntax

* style: reformat code using black
2019-07-08 10:25:22 +02:00
Ines Montani
4f1dae1c6b Update languages and examples (see #1107) 2019-06-26 16:19:17 +02:00
Ines Montani
c833d9b314 Add "v.s." to English tokenizer exceptions (see #3868) 2019-06-20 17:48:45 +02:00
Azagh3l
5accfbb938 Update exemples.py (#3838)
Added missing hyphen and accent.
2019-06-14 09:31:05 +02:00
Ines Montani
aae9034492 Tidy up [ci skip] 2019-06-12 13:38:23 +02:00
Azagh3l
eb3e4263ee Update lex_attrs.py (#3835)
Corrected typos, added french (from France) versions of some numbers.
2019-06-11 10:59:16 +02:00
Germán
86eb817b74 Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (#3810) (closes #3803))
* (#3803) Spanish like_num returning false for number-like token

* (#3803) Spanish like_num now returning True for number-like token
2019-06-02 12:22:57 +02:00
Ujwal Narayan
ed7be3f64c Update norm_exceptions.py (#3778)
* Update norm_exceptions.py

Extended the Currency set to include Franc, Indian Rupee, Bangladeshi Taka, Korean Won, Mexican Dollar, and Egyptian Pound

* Fix formatting [ci skip]
2019-05-27 11:52:52 +02:00
estr4ng7d
604acb6ace Marathi Language Support (#3767)
* Adding Marathi language details and folder to it

* Adding few changes and running tests

* Adding few changes and running tests

* Update __init__.py

mh -> mr

* Rename spacy/lang/mh/__init__.py to spacy/lang/mr/__init__.py

* mh -> mr
2019-05-24 14:29:42 +02:00
Ujwal Narayan
4d550a3055 Enhancing Kannada language Resources (#3755)
* Updated stop_words.py

Added more stopwords

* Create ujwal-narayan.md

Enhancing Kannada language resources
2019-05-20 12:56:10 +02:00
Wannaphong Phatthiyaphaibun
5a14a13f64 fix thai bug (#3693)
fix tokenize for pythainlp
2019-05-10 14:21:34 +02:00
Ines Montani
78cb807a9a Auto-format [ci skip] 2019-05-06 16:58:29 +02:00
Dobita21
f95ecedd83 Add Thai lex_attrs (#3655)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults

* add acronym and some norm exception words

* add lex_attrs

* Add lexical attribute getters into the language defaults

* fix LEX_ATTRS


Co-authored-by: Donut <dobita21@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2019-05-01 12:03:14 +02:00
BreakBB
8952004dfc Update French example sents and add two German stop words (#3662)
* Update french example sentences

* Add 'anderem' and 'ihren' to German stop words
2019-05-01 12:01:35 +02:00
Dobita21
721e1fc86c update norm_exceptions (#3627)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults

* add acronym and some norm exception words
2019-04-23 12:48:03 +02:00
Dobita21
189c90743c Add Thai norm_exceptions (#3612)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults
2019-04-20 12:16:03 +02:00
Omer Celik
531c0869b2 Added Turkish Lira symbol(₺) (#3576)
Added Turkish Lira symbol(₺) 
https://en.wikipedia.org/wiki/Turkish_lira
2019-04-11 11:32:28 +02:00
Ines Montani
145c0b7e88 Tidy up and auto-format 2019-04-09 11:40:19 +02:00
Dobita21
8bf6967eb7 Update Thai stop words (#3545)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability
2019-04-05 12:06:38 +02:00
jeannefukumaru
f67d881b30 fix typos in tag_map flagged by python -m debug-data (#3542)
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.


Co-authored-by: Ines Montani <ines@ines.io>
2019-04-05 12:06:09 +02:00
Jeanne Choo
b6c9807431 Merge remote-tracking branch 'upstream/master' 2019-04-04 14:21:50 +08:00
Jeanne Choo
80e15af76c fixed tag_map.py merge conflict 2019-04-04 14:18:27 +08:00
jeannefukumaru
876ce01567 updated tag map with missing tags 2019-04-03 23:09:11 +08:00
Ines Montani
4faf62d515
Merge pull request #3530 from svlandeg/fix/issue_3521
Allow English stopwords with any type of apostrophe
2019-04-03 14:14:03 +02:00
Yves Peirsman
951825532c Improved Dutch language resources and Dutch lemmatization (#3409)
* Improved Dutch language resources and Dutch lemmatization

* Fix conftest

* Update punctuation.py

* Auto-format

* Format and fix tests

* Remove unused test file

* Re-add deleted test

* removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains

* Cleaner lemmatization files
2019-04-03 14:13:26 +02:00
svlandeg
4ff786e113 addressed all comments by Ines 2019-04-03 13:50:33 +02:00
Kamolsit Mongkolsrisawat
dcc67f3f51 Update Thai tokenizer_exception list (#3529)
* add tokenizer_exceptions word (ก-น) from https://goo.gl/JpJ2qq

* update tokenizer_exceptions word list

* add contributor file
2019-04-03 09:13:36 +02:00
svlandeg
673c81bbb4 unicode string for python 2.7 2019-04-02 13:52:07 +02:00
svlandeg
eca9cc5417 fixing Issue #3521 by adding all hyphen variants for each stopword 2019-04-02 13:24:59 +02:00
jeannefukumaru
6cdb7b2e04 added tag_map for indonesian (#3515)
* added tag_map for indonesian

* changed tag map from .py to .txt to see if tests pass

* added symbols import

* added utf8 encoding flag

* added missing SCONJ symbol

* Auto-format

* Remove unused imports

* Make tag map available in Indonesian defaults
2019-04-01 12:27:48 +02:00
Ines Montani
c23e234d65 Auto-format 2019-04-01 12:11:27 +02:00
Ines Montani
0a0b1087b0 Make tag map available in Indonesian defaults 2019-04-01 11:46:51 +02:00
Ines Montani
5d9212c44c Remove unused imports 2019-04-01 11:46:25 +02:00
Ines Montani
8d6b544632 Auto-format 2019-04-01 11:45:43 +02:00
jeannefukumaru
6567f27849
added missing SCONJ symbol 2019-04-01 17:02:53 +08:00
jeannefukumaru
082a0a2232
added utf8 encoding flag 2019-04-01 16:37:11 +08:00
jeannefukumaru
a741bed7a7
added symbols import 2019-04-01 16:21:06 +08:00
jeannefukumaru
745cf0c914 changed tag map from .py to .txt to see if tests pass 2019-04-01 07:04:50 +08:00
jeannefukumaru
3cc897102f added tag_map for indonesian 2019-04-01 00:00:08 +08:00
Duygu Altinok
5a7bc6b39d Fix/irreg adverbs extension (#3499)
* extended list of irreg adverbs

* added test to exceptions

* fixed typo
2019-03-28 13:23:33 +01:00
Wannaphong Phatthiyaphaibun
297a051992 Update Thai tag map (#3480)
* Update Thai tag map

Update Thai tag map

* Create wannaphongcom.md
2019-03-25 16:53:26 +01:00
Matthew Honnibal
c66bd61e88 Fix lemmas 2019-03-21 14:22:12 +01:00
Matthew Honnibal
04395ffa49 Bring English tag_map in line with UD Treebank
I wrote a small script to read the UD English training data and check
that our tag map and morph rules were resulting in the best POS map.
This hadn't been done for some time, and there have been various changes
to the UD schema since it has been done. After these changes we should
see much better agreement between our POS assignments and the UD POS
tags.
2019-03-21 13:53:44 +01:00
Mehdi Hamoumi
9211f30ee3 Tiny correction in french lookup dictionary (#3427) 2019-03-19 13:00:19 +01:00