Adriane Boyd
f4008bdb13
Restrict pymorphy2 requirement to pymorphy2 mode ( #8299 )
...
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
2021-06-11 10:19:22 +02:00
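Note on f4008bdb13: the idea of requiring `pymorphy2` only in the `pymorphy2` mode can be sketched as a mode-gated import. This is a minimal illustration, not spaCy's actual code; the helper name `load_analyzer` and the error text are assumptions.

```python
import importlib


def load_analyzer(mode: str = "lookup"):
    """Return a pymorphy2 analyzer only when the mode asks for one.

    Hypothetical helper for illustration; not spaCy's actual code.
    """
    if mode != "pymorphy2":
        # Lookup and other modes never touch the optional dependency.
        return None
    try:
        pymorphy2 = importlib.import_module("pymorphy2")
    except ImportError:
        raise ImportError(
            "The 'pymorphy2' mode requires the pymorphy2 package: "
            "pip install pymorphy2"
        ) from None
    return pymorphy2.MorphAnalyzer()
```

With this shape, `load_analyzer("lookup")` succeeds on a machine without `pymorphy2`, and only `load_analyzer("pymorphy2")` can raise the import error.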
Jean-Hugues Roy
ff5cf3606c
Improvements to French stopwords list ( #7941 )
...
* "y" etc.
Many changes described in pull request
* Update spacy/lang/fr/stop_words.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-02 11:50:49 +02:00
Paul O'Leary McCann
d1a221a374
Add all symbols in Unicode Currency Symbols block ( #8212 )
...
* Add all symbols in Unicode Currency Symbols block
In #8102 it came up that the rupee symbol was treated differently from
the dollar / euro / yen symbols. This adds many symbols not already
included.
* Fix test
* Fix training test
2021-05-31 18:03:40 +10:00
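Note on d1a221a374: the Unicode Currency Symbols block spans U+20A0 to U+20CF, and the symbols in it can be enumerated rather than listed by hand. The sketch below is an illustration only, not the code from the commit; the name `CURRENCY_BLOCK` is an assumption. Code points that the running Python's Unicode database does not yet assign are skipped.

```python
import unicodedata

# U+20A0..U+20CF is the Unicode "Currency Symbols" block.
BLOCK_START, BLOCK_END = 0x20A0, 0x20CF

# Keep only code points classified as currency symbols (category "Sc");
# unassigned code points in the block are dropped.
CURRENCY_BLOCK = "".join(
    chr(cp)
    for cp in range(BLOCK_START, BLOCK_END + 1)
    if unicodedata.category(chr(cp)) == "Sc"
)
```

The rupee sign (U+20B9) and the euro sign (U+20AC) fall inside this block; the dollar and yen signs live in other blocks and would still need to be listed separately.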
Paul O'Leary McCann
bdeaf3a18b
Fix/fix en ordinals ( #8028 )
...
* Fix #8019
"th" is not the only ordinal ending.
* Add some more ordinal tests
2021-05-07 10:26:42 +02:00
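Note on bdeaf3a18b: the point that "th" is not the only ordinal ending can be sketched as a check that accepts all four English suffixes. This is a minimal illustration under that assumption, not the function from the commit; `like_ordinal` is a hypothetical name.

```python
import re

# Digits followed by any English ordinal suffix, not only "th".
_ORDINAL = re.compile(r"^\d+(st|nd|rd|th)$")


def like_ordinal(text: str) -> bool:
    """Return True for tokens such as "1st", "2nd", "3rd", "4th"."""
    return bool(_ORDINAL.match(text.lower()))
```

The sketch checks the shape of the token only; it does not verify that the suffix agrees with the number, so "1th" would also pass.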
Adriane Boyd
31528f62ed
Add / to nb infixes ( #7991 )
2021-05-04 11:00:10 +02:00
Sevdimali
49aed683cc
Azerbaijani language added ( #7911 )
2021-04-28 14:42:02 +02:00
m0canu1
921feee092
Added more exception to the italian language from https://forum.wordr … ( #7246 )
...
* Added more exceptions for the Italian language from https://forum.wordreference.com/threads/le-abbreviazioni-nella-lingua-italiana-abbreviations-in-italian.2464189/
* Remove unnecessary exception
Co-authored-by: Alexandru Mocanu <alexandru.mocanu@augeos.it>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-03-30 10:23:32 +02:00
Adriane Boyd
3bcf74aca7
Rename and update ru pymorphy2 lookup lemmatize
...
* To allow default lookup lemmatization with a blank Russian model,
rename pymorphy2 lookup mode to `pymorphy2_lookup`
* Bug fix: update pymorphy2 lookup lemmatize to return list rather than
string
2021-03-15 11:11:06 +01:00
Adriane Boyd
264862c67a
Fix Ukrainian lemmatizer init ( #7127 )
...
Fix class variable and init for `UkrainianLemmatizer` so that it loads
the `uk` dictionaries rather than having the parent `RussianLemmatizer`
override with the `ru` settings.
2021-02-22 11:05:08 +11:00
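Note on 264862c67a: one general pattern that avoids a parent class overriding a subclass's language settings is to read the language from a class variable inside the shared initializer. The classes below are hypothetical stand-ins, not spaCy's `RussianLemmatizer` or `UkrainianLemmatizer`, and the dictionary string is a placeholder for real loading code.

```python
class RussianStyleLemmatizer:
    # Class-level default; subclasses override this one attribute.
    lang = "ru"

    def __init__(self):
        # Read the language from the class of the instance, so the
        # parent initializer never hard-codes "ru".
        self.dictionary = f"{self.lang}-dict"


class UkrainianStyleLemmatizer(RussianStyleLemmatizer):
    lang = "uk"
```

Here `UkrainianStyleLemmatizer()` ends up with the `uk` placeholder even though it reuses the parent initializer unchanged.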
Boian Tzonev
cca8651fc8
Bulgarian tokenizer exceptions ( #7114 )
...
* [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian
2021-02-19 19:19:19 +01:00
Ines Montani
9ba715ed16
Tidy up and auto-format
2021-02-13 12:55:56 +11:00
Ines Montani
6c450decfc
Fix punctuation settings and add to initialize tests
2021-02-13 11:51:21 +11:00
Shumi
4e514f1ea8
Update stop_words.py
...
I have deleted lines 1 to 5 and the print(STOP_WORDS) statement
2021-02-11 21:30:34 +02:00
Shumi
0d57e84b7b
Update lex_attrs.py
...
I have removed lines 1 to 4
2021-02-11 21:28:23 +02:00
Shumi
37ec67f868
Update examples.py
...
I have removed two lines:
# coding: utf8
from __future__ import unicode_literals
And updated: >>> from spacy.lang.tn.examples import sentences
2021-02-11 21:25:58 +02:00
Shumi
39eeba6760
Update __init__.py
...
Added infixes = TOKENIZER_INFIXES
2021-02-11 21:20:46 +02:00
Shumi
ed3397727e
Delete tag_map.py
...
Deleted the tag map file because it was failing validation. I will add it back later
2021-02-10 20:41:18 +02:00
Shumi
7c8721b1bd
Update tag_map.py
...
Updated tag_map
2021-02-10 20:21:22 +02:00
Shumi
f6be28cfb2
Added files to Setswana Language
...
Add South African Setswana Language
2021-02-10 20:15:13 +02:00
Shumi
24046fef17
South African Setswana language
...
Please accept the addition of the Setswana language
2021-02-10 20:12:33 +02:00
svlandeg
91e72c031e
reformatting
2021-01-30 17:29:33 +01:00
svlandeg
a8d84188f0
add stop words
...
Co-authored-by: tewodrosm <tedmaam2006@gmail.com>
2021-01-30 17:26:49 +01:00
Ines Montani
e6accb3a9e
Tidy up and auto-format
2021-01-30 12:52:33 +11:00
Ines Montani
817b0db521
Fix escape sequence
2021-01-30 12:39:58 +11:00
Ines Montani
bbf080dfe5
Merge pull request #6645 from bittlingmayer/patch-3
2021-01-30 01:26:28 +11:00
Adriane Boyd
bced6309e5
Add full exceptions with spaces
2021-01-29 14:27:22 +01:00
Ines Montani
5ed51c9dd2
Merge pull request #6828 from explosion/master-tmp
2021-01-27 23:05:46 +11:00
Adriane Boyd
d17afb4826
Add Spanish rule-based lemmatizer ( #6833 )
...
* Initial Spanish lemmatizer
* Handle merged verb+pron(s) multi-word tokens
* Use VERB for AUX rule lookup
* Add morph to lemma cache key
* Fix aux lookups, minor refactoring
* Improve verb+pron handling
* Move verb+pron handling into its own method
* Check for exceptions (primarily for se)
* Collect pronouns in the same (not reversed) order
* Only add modified possible lemmas
2021-01-27 19:21:35 +08:00
Ines Montani
615dba9d99
Fix tokenizer exceptions
2021-01-27 22:11:42 +11:00
Ines Montani
e3f8be9a94
Update language data
2021-01-27 13:29:22 +11:00
Ines Montani
230e651ad6
Merge branch 'develop' into master-tmp
2021-01-27 13:26:29 +11:00
Adriane Boyd
71a6350744
Implement overwrite param for all custom lemmatizers ( #6794 )
2021-01-26 14:53:43 +11:00
muratjumashev
2b19ebad59
Remove Kyrgyz chars from char_classes since the Tatar ones already cover them
2021-01-25 00:46:45 +06:00
muratjumashev
53abf759ad
Fix punctuation
2021-01-24 20:54:22 +06:00
muratjumashev
2a2646362b
Fix language subclass
2021-01-23 22:00:50 +06:00
muratjumashev
fe3b5b8ff5
Add kyrgyz to char_classes
2021-01-23 21:53:41 +06:00
muratjumashev
e30bbf5432
Add examples
2021-01-23 21:49:08 +06:00
muratjumashev
2f385385a9
Remove comment
2021-01-23 21:36:28 +06:00
muratjumashev
d53724ba1d
Add lex_attrs
2021-01-23 21:35:25 +06:00
muratjumashev
4418ec2eee
Add punctuation
2021-01-23 21:31:31 +06:00
muratjumashev
101d265778
Add stopwords
2021-01-23 21:25:28 +06:00
muratjumashev
28d06ab860
Add tokenizer_exceptions
2021-01-22 23:08:41 +06:00
Sofie Van Landeghem
fed8f48965
raise NotImplementedError when noun_chunks iterator is not implemented ( #6711 )
...
* raise NotImplementedError when noun_chunks iterator is not implemented
* bring back, fix and document span.noun_chunks
* formatting
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2021-01-17 19:56:05 +08:00
Adriane Boyd
185fc62f4d
Remove unused is_base_form for mk lemmatizer ( #6743 )
...
Remove unimplemented/incorrect is_base_form for Macedonian lemmatizer.
2021-01-17 09:41:35 +01:00
Ines Montani
b0b743597c
Tidy up and auto-format
2021-01-15 11:57:36 +11:00
Adriane Boyd
0c936004d1
Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3
2021-01-14 11:49:58 +01:00
Adriane Boyd
e649242927
Prevent overlapping noun chunks for Spanish ( #6712 )
...
* Prevent overlapping noun chunks in Spanish noun chunk iterator
* Clean up similar code in Danish noun chunk iterator
2021-01-14 17:33:31 +11:00
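Note on e649242927: preventing overlapping noun chunks can be sketched as a single pass that remembers where the last emitted chunk ended and skips any candidate that starts at or before that position. The sketch works on plain `(start, end)` token index pairs with an inclusive end; it is an illustration, not the iterator from the commit, and `non_overlapping` is a hypothetical name.

```python
from typing import Iterable, Iterator, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end inclusive


def non_overlapping(spans: Iterable[Span]) -> Iterator[Span]:
    """Yield spans in order, dropping any that overlap an earlier one."""
    prev_end = -1
    for start, end in spans:
        if start <= prev_end:
            # Candidate begins inside the previous chunk: skip it.
            continue
        prev_end = end
        yield (start, end)
```

The sketch assumes the candidates arrive sorted by start position, so the first chunk that reaches a token keeps it.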
Adriane Boyd
54e8e3c208
Update model-related dependencies ( #6725 )
...
* Update pymorphy2 error messages for Russian and Ukrainian
* Add pymorphy2 to pex
* Update spacy-pkuseg version for pex
2021-01-14 17:29:44 +11:00
Alex Combessie
9cc880014c
Remove questionable French stopwords ( #6310 )
...
* Remove questionable French stopwords
* Create alexcombessie.md
2021-01-08 11:36:22 +11:00
Cristiana S Parada
7a0222f260
Update stop_words.py in Portuguese (a,o,e) ( #6345 )
...
* Update stop_words.py
Added three additional stopwords: "a" and "o", which mean "the", and "e", which means "and"
* Create cristianasp.md
* zero edit to push CI
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-01-08 11:35:38 +11:00