Commit Graph

892 Commits

Author SHA1 Message Date
Adriane Boyd
c5de9b463a
Update custom tokenizer APIs and pickling (#8972)
* Fix incorrect pickling of Japanese and Korean pipelines, which led to
the entire pipeline being reset if pickled

* Enable pickling of Vietnamese tokenizer

* Update tokenizer APIs for Chinese, Japanese, Korean, Thai, and
Vietnamese so that only the `Vocab` is required for initialization
2021-08-19 14:37:47 +02:00
Adriane Boyd
f99d6d5e39
Refactor scoring methods to use registered functions (#8766)
* Add scorer option to components

Add an optional `scorer` parameter to all pipeline components. If a
scoring function is provided, it overrides the default scoring method
for that component.

* Add registered scorers for all components

* Add `scorers` registry
* Move all scoring methods outside of components as independent
  functions and register
* Use the registered scoring methods as defaults in configs and inits

Additional:

* The scoring methods no longer have access to the full component, so
  use settings from `cfg` as default scorer options to handle settings
  such as `labels`, `threshold`, and `positive_label`
* The `attribute_ruler` scoring method no longer has access to the
  patterns, so all scoring methods are called
* Bug fix: `spancat` scoring method is updated to set `allow_overlap` to
  score overlapping spans correctly

* Update Russian lemmatizer to use direct score method

* Check type of cfg in Pipe.score

* Fix check

* Update spacy/pipeline/sentencizer.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove validate_examples from scoring functions

* Use Pipe.labels instead of Pipe.cfg["labels"]

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-10 15:13:39 +02:00
fgaim
ee011ca963
Update Tigrinya ትግርኛ language support (#8900)
* Add missing punctuation for Tigrinya and Amharic

* Fix numeral and ordinal numbers for Tigrinya

 - Amharic was used in many cases
 - Also fixed some typos

* Update Tigrinya stop-words

* Contributor agreement for fgaim

* Fix typo in "ti" lang test

* Remove multi-word entries from numbers and ordinals
2021-08-10 13:55:08 +02:00
Dimitar Ganev
733ffe439d
Improve the stop words and the tokenizer exceptions in Bulgarian language. (#8862)
* Add more stop words and Improve the readability

* Add and categorize the tokenizer exceptions for `bg` lang

* Create syrull.md

* Add references for the additional stop words and tokenizer exc abbrs
2021-08-10 13:44:23 +02:00
Adriane Boyd
81d3a1edb1
Use tokenizer URL_MATCH pattern in LIKE_URL (#8765) 2021-07-27 12:07:01 +02:00
Adriane Boyd
d48c01a6f7
Remove extraneous grc test file (#8768) 2021-07-20 15:51:15 +02:00
explosion-bot
eff3d1088b Auto-format code with black 2021-07-16 08:03:36 +00:00
jmyerston
993b0fab0e
Added ancient Greek language support (#8606)
* Add ancient Greek language support

Initial commit

* Contributor Agreement

* grc tokenizer test added  and files formatted with black, unnecessary import removed

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Commas in lists fixed. __init__py added to test

* Update lex_attrs.py

* Update stop_words.py

* Update stop_words.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-07-15 10:27:17 +02:00
Julien Rossi
e117573822
Adding noun_chunks to the DUTCH language model (nl) (#8529)
*  implement noun_chunks for dutch language

* copy/paste FR and SV syntax iterators to accomodate UD tags
* added tests with dutch text
* signed contributor agreement

* 🐛 fix noun chunks generator

* built from scratch
* define noun chunk as a single Noun-Phrase
* includes some corner cases debugging (incorrect POS tagging)
* test with provided annotated sample (POS, DEP)

*  fix failing test

* CI pipeline did not like the added sample file
* add the sample as a pytest fixture

* Update spacy/lang/nl/syntax_iterators.py

* Update spacy/lang/nl/syntax_iterators.py

Code readability

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/tests/lang/nl/test_noun_chunks.py

correct comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* finalize code

* change "if next_word" into "if next_word is not None"

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-07-14 14:01:02 +02:00
Adriane Boyd
d8805a1073
Fix ru/uk lemmatizer mp with spawn (#8657)
Use an instance variable instead a class variable for the morphological
analzyer so that multiprocessing with spawn is possible.
2021-07-09 15:36:56 +02:00
Adriane Boyd
b8e720fdb9
Fix Azerbaijani init, extend lang init tests (#8656)
* Extend langs in initialize tests

* Fix az init
2021-07-09 15:36:35 +02:00
Adriane Boyd
86d01e9229 Tidy up with flake8: imports, comparisons, etc. 2021-06-28 12:08:15 +02:00
Adriane Boyd
5eeb25f043 Tidy up code 2021-06-28 12:08:15 +02:00
Adriane Boyd
02bac8f269
Fix non-deterministic deduplication in Greek lemmatizer (#8421) 2021-06-17 09:11:01 +02:00
Giovanni Toffoli
19521d525b
Added Italian POS-aware lemmatizer. (#8079)
* Added Italian POS-aware lemmatizer.

Also added the code used to build the lookup tables by POS.

* Create gtoffoli.md

* Add imports and format

* Remove helper script

* Use lemma_lookup instead of lemma_lookup_legacy

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-06-16 11:14:45 +02:00
Antti Ajanki
5a6125c227
[Finnish tokenizer] Handle conjunction contractions (#8105) 2021-06-16 10:56:47 +02:00
Adriane Boyd
5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
Adriane Boyd
b98d216205
Update Catalan language data (#8308)
* Update Catalan language data

Update Catalan language data based on contributions from the Text Mining
Unit at the Barcelona Supercomputing Center:

https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data

* Update tokenizer settings for UD Catalan AnCora

Update for UD Catalan AnCora v2.7 with merged multi-word tokens.

* Update test

* Move prefix patternt to more generic infix pattern

* Clean up
2021-06-11 10:21:22 +02:00
Adriane Boyd
f4008bdb13
Restrict pymorphy2 requirement to pymorphy2 mode (#8299)
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
2021-06-11 10:19:22 +02:00
Jean-Hugues Roy
ff5cf3606c
Improvements to French stopwords list (#7941)
* "y" etc.

Many changes described in pull request

* Update spacy/lang/fr/stop_words.py

* Update spacy/lang/fr/stop_words.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-02 11:50:49 +02:00
Paul O'Leary McCann
d1a221a374
Add all symbols in Unicode Currency Symbols block (#8212)
* Add all symbols in Unicode Currency Symbols block

In #8102 it came up that the rupee symbol was treated different from
dollar / euro / yen symbols. This adds many symbols not already
included.

* Fix test

* Fix training test
2021-05-31 18:03:40 +10:00
Adriane Boyd
1d59fdbd39
Update Vietnamese tokenizer (#8099)
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese

Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
2021-05-17 18:16:20 +10:00
Paul O'Leary McCann
bdeaf3a18b
Fix/fix en ordinals (#8028)
* Fix #8019

"th" is not the only ordinal ending.

* Add some more ordinal tests
2021-05-07 10:26:42 +02:00
Adriane Boyd
31528f62ed
Add / to nb infixes (#7991) 2021-05-04 11:00:10 +02:00
Sevdimali
49aed683cc
Azerbaijani language added (#7911) 2021-04-28 14:42:02 +02:00
Jacopo Farina
c105ed10fd
Remove torino from stop words (#7634)
Torino is the proper name of a city and the token has no other meaning
2021-04-26 16:53:43 +02:00
m0canu1
921feee092
Added more exception to the italian language from https://forum.wordr… (#7246)
* Added more exception to the italian language from https://forum.wordreference.com/threads/le-abbreviazioni-nella-lingua-italiana-abbreviations-in-italian.2464189/

* Remove unnecessary exception

Co-authored-by: Alexandru Mocanu <alexandru.mocanu@augeos.it>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-03-30 10:23:32 +02:00
Adriane Boyd
3bcf74aca7 Rename and update ru pymorphy2 lookup lemmatize
* To allow default lookup lemmatization with a blank Russian model,
rename pymorphy2 lookup mode to `pymorphy2_lookup`

* Bug fix: update pymorphy2 lookup lemmatize to return list rather than
string
2021-03-15 11:11:06 +01:00
Adriane Boyd
264862c67a
Fix Ukrainian lemmatizer init (#7127)
Fix class variable and init for `UkrainianLemmatizer` so that it loads
the `uk` dictionaries rather than having the parent `RussianLemmatizer`
override with the `ru` settings.
2021-02-22 11:05:08 +11:00
Boian Tzonev
cca8651fc8
Bulgarian tokenizer exceptions (#7114)
* [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian

* [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian
2021-02-19 19:19:19 +01:00
Ines Montani
9ba715ed16 Tidy up and auto-format 2021-02-13 12:55:56 +11:00
Ines Montani
6c450decfc Fix punctuation settings and add to initialize tests 2021-02-13 11:51:21 +11:00
Shumi
4e514f1ea8
Update stop_words.py
I have deleted line 1 to 5 and the statement print(STOP_WORDS)
2021-02-11 21:30:34 +02:00
Shumi
0d57e84b7b
Update lex_attrs.py
I have removed line 1 to 4
2021-02-11 21:28:23 +02:00
Shumi
37ec67f868
Update examples.py
I have removed two lines:
# coding: utf8
from __future__ import unicode_literals

And updated: >>> from spacy.lang.tn.examples import sentences
2021-02-11 21:25:58 +02:00
Shumi
39eeba6760
Update __init__.py
Added infixes = TOKENIZER_INFIXES
2021-02-11 21:20:46 +02:00
Shumi
ed3397727e
Delete tag_map.py
Tag map file is deleted. I will add it later because it was failing validations
2021-02-10 20:41:18 +02:00
Shumi
7c8721b1bd
Update tag_map.py
Updated tag_map
2021-02-10 20:21:22 +02:00
Shumi
f6be28cfb2
Added files to Setswana Language
Add South African Setswana Language
2021-02-10 20:15:13 +02:00
Shumi
24046fef17
South African Setswana language
Please accept the additional of Setswana language
2021-02-10 20:12:33 +02:00
svlandeg
91e72c031e reformatting 2021-01-30 17:29:33 +01:00
svlandeg
a8d84188f0 add stop words
Co-authored-by: tewodrosm <tedmaam2006@gmail.com>
2021-01-30 17:26:49 +01:00
Ines Montani
e6accb3a9e Tidy up and auto-format 2021-01-30 12:52:33 +11:00
Ines Montani
817b0db521 Fix escape sequence 2021-01-30 12:39:58 +11:00
Ines Montani
bbf080dfe5
Merge pull request #6645 from bittlingmayer/patch-3 2021-01-30 01:26:28 +11:00
Adriane Boyd
bced6309e5
Add full exceptions with spaces 2021-01-29 14:27:22 +01:00
Ines Montani
5ed51c9dd2
Merge pull request #6828 from explosion/master-tmp 2021-01-27 23:05:46 +11:00
Adriane Boyd
d17afb4826
Add Spanish rule-based lemmatizer (#6833)
* Initial Spanish lemmatizer

* Handle merged verb+pron(s) multi-word tokens

* Use VERB for AUX rule lookup

* Add morph to lemma cache key

* Fix aux lookups, minor refactoring

* Improve verb+pron handling

* Move verb+pron handling into its own method
* Check for exceptions (primarily for se)
* Collect pronouns in the same (not reversed) order

* Only add modified possible lemmas
2021-01-27 19:21:35 +08:00
Ines Montani
615dba9d99 Fix tokenizer exceptions 2021-01-27 22:11:42 +11:00
Ines Montani
e3f8be9a94 Update language data 2021-01-27 13:29:22 +11:00
Ines Montani
230e651ad6 Merge branch 'develop' into master-tmp 2021-01-27 13:26:29 +11:00
Adriane Boyd
71a6350744
Implement overwrite param for all custom lemmatizers (#6794) 2021-01-26 14:53:43 +11:00
muratjumashev
2b19ebad59 Remove Kyrgyz chars fr. char_classes since Tatar ones already cover 2021-01-25 00:46:45 +06:00
muratjumashev
53abf759ad Fix punctuation 2021-01-24 20:54:22 +06:00
muratjumashev
2a2646362b Fix language subclass 2021-01-23 22:00:50 +06:00
muratjumashev
fe3b5b8ff5 Add kyrgyz to char_classes 2021-01-23 21:53:41 +06:00
muratjumashev
e30bbf5432 Add examples 2021-01-23 21:49:08 +06:00
muratjumashev
2f385385a9 Remove comment 2021-01-23 21:36:28 +06:00
muratjumashev
d53724ba1d Add lex_attrs 2021-01-23 21:35:25 +06:00
muratjumashev
4418ec2eee Add punctuation 2021-01-23 21:31:31 +06:00
muratjumashev
101d265778 Add stopwords 2021-01-23 21:25:28 +06:00
muratjumashev
28d06ab860 Add tokenizer_exceptions 2021-01-22 23:08:41 +06:00
Sofie Van Landeghem
fed8f48965
raise NotImplementedError when noun_chunks iterator is not implemented (#6711)
* raise NotImplementedError when noun_chunks iterator is not implemented

* bring back, fix and document span.noun_chunks

* formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2021-01-17 19:56:05 +08:00
Adriane Boyd
185fc62f4d
Remove unused is_base_form for mk lemmatizer (#6743)
Remove unimplemented/incorrect is_base_form for Macedonian lemmatizer.
2021-01-17 09:41:35 +01:00
Ines Montani
b0b743597c Tidy up and auto-format 2021-01-15 11:57:36 +11:00
Adriane Boyd
0c936004d1 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3 2021-01-14 11:49:58 +01:00
Adriane Boyd
e649242927
Prevent overlapping noun chunks for Spanish (#6712)
* Prevent overlapping noun chunks in Spanish noun chunk iterator
* Clean up similar code in Danish noun chunk iterator
2021-01-14 17:33:31 +11:00
Adriane Boyd
54e8e3c208
Update model-related dependencies (#6725)
* Update pymorphy2 error messages for Russian and Ukrainian
* Add pymorphy2 to pex
* Update spacy-pkuseg version for pex
2021-01-14 17:29:44 +11:00
Alex Combessie
9cc880014c
Remove questionable French stopwords (#6310)
* Remove questionable French stopwords

* Create alexcombessie.md
2021-01-08 11:36:22 +11:00
Cristiana S Parada
7a0222f260
Update stop_words.py in Portuguese (a,o,e) (#6345)
* Update stop_words.py

Added three aditional stopwords: "a" and "o" that means "the", and "e" that means "and"

* Create cristianasp.md

* zero edit to push CI

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-01-08 11:35:38 +11:00
Lorena Ciutacu
f11002f1f1
add new Romanian stopwords (#6621)
* add contributor agreement

* update ro stopwords list

* add new stopwords
2021-01-08 11:34:47 +11:00
ophelielacroix
e3222fdec9
Add (noun chunks) syntax iterators for Danish (#6246)
* add syntax iterators for danish

* add test noun chunks for danish syntax iterators

* add contributor agreement

* update da syntax iterators to remove nested chunks

* add tests for da noun chunks

* Fix test

* add missing import
* fix example

* Prevent overlapping noun chunks

Prevent overlapping noun chunks by tracking the end index of the
previous noun chunk span.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-01-07 16:33:00 +11:00
Sofie Van Landeghem
6f7e7d88b9
remove cause without apostrophe from norm exceptions (#6636) 2021-01-06 12:30:30 +08:00
Ines Montani
991669c934 Tidy up and auto-format 2021-01-05 13:41:53 +11:00
Adam Bittlingmayer
f2fe60bacf
Update tokenizer_exceptions.py
See https://github.com/explosion/spaCy/pull/6643
2020-12-29 16:05:11 +04:00
Yosi
cf52510631
Add Amharic አማርኛ Language support (#6583)
* Add Amharic to space

* clean up

* Add some PRON_LEMMA

* add Tigrinya support

* remove text_noun_chunks

* Tigrinya Support

* added some more details for ti

* fix unit test

* add amharic char range

* changes from review

* amharic and tigrinya share same unicode block

* get rid of _amharic/_tigrinya in char_classes

Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>
2020-12-22 16:50:34 +01:00
Ines Montani
1da1568110 Remove tag map 2020-12-09 11:13:49 +11:00
Adriane Boyd
724831b066 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master
* Update Macedonian for v3
* Update Turkish for v3
2020-11-25 11:49:34 +01:00
Daniel Vasic
20d72de986
Added Multext-East V5 tagset for Croatian language (#6248)
* Added Multext-East V5 tagset for Croatian language

* Create danielvasic.md

* Update danielvasic.md

* Update danielvasic.md

* Add tag map to CroatianDefaults

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-11-05 12:19:22 +01:00
Robert Šípek
6069efe57d
Add tag map to cs language (#6284) 2020-11-05 10:13:11 +01:00
Vu Ha
6d465ec52c
add oprd to the list of accepted deps for noun chunking (#6302)
* add oprd to the list of accepted deps for noun chunking

* add SCA
2020-11-05 09:17:35 +01:00
Duygu Altinok
0e55f806dd
Turkish tokenization improvements (#6268)
* added single and paired orth variants

* added token match

* added long text tokenization test

* inverted init

* normalized lemmas to lowercase

* more abbrevs

* tests for ordinals and abbrevs

* separated period abbvrevs to another list

* fiex typo

* added ordinal and abbrev tests

* added number tests for dates

* minor refinement

* added inflected abbrevs regex

* added percentage and inflection

* cosmetics

* added token match

* added url inflection tests

* excluded url tokens from custom pattern

* removed url match import
2020-10-29 09:43:17 +01:00
Adriane Boyd
4299a7f654 Setup / install / quickstart updates
* Add `cuda110` to setup.cfg and quickstart dropdown
* Switch to `pip` for pip-only packages in conda quickstart instructions
* Update zh pkuseg install message with version range and conda
* Remove `zh` from `extras_require` because the default doesn't require
additional packages
2020-10-23 11:27:54 +02:00
Borijan Georgievski
2311192ba1
Include Macedonian language (#6230)
* Include Macedonian language

* Fix indentation at char_classes.py

* Fix indentation at char_classes.py

* Add Macedonian tests, update lex_attrs and char_classes

* Import unicode literals for python 2
2020-10-15 15:55:01 +02:00
Ines Montani
d165af26be Auto-format [ci skip] 2020-10-15 10:08:53 +02:00
Ines Montani
5665a21517 Tidy up 2020-10-15 09:30:32 +02:00
Ines Montani
178760855f Merge branch 'develop' into master-tmp 2020-10-15 09:06:03 +02:00
Ines Montani
7f92a5ee6a
Update spacy/lang/ta/examples.py 2020-10-13 11:03:35 +02:00
Ines Montani
539b0c10da Tidy up and auto-format 2020-10-10 19:14:48 +02:00
Duygu Altinok
80fb1bffc9 Ordinal numbers for Turkish (#6142)
* minor ordinal number addition

* fixed typo

* added corresponding lexical test
2020-10-09 10:13:15 +02:00
Duygu Altinok
2fad279a44 Turkish language syntax iterators (#6191)
* added tr_vocab to config

* basic test

* added syntax iterator to Turkish lang class

* first version for Turkish syntax iter, without flat

* added simple tests with nmod, amod, det

* more tests to amod and nmod

* separated noun chunks and parser test

* rearrangement after nchunk parser separation

* added recursive NPs

* tests with complicated recursive NPs

* tests with conjed NPs

* additional tests for conj NP

* small modification for shaving off conj from NP

* added tests with flat

* more tests with flat

* added examples with flats conjed

* added inner func for flat trick

* corrected parse

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-09 10:10:22 +02:00
Baranitharan
d6037c1860
added sentence 2020-10-08 08:22:58 +05:30
Baranitharan
81afe9b19d
Update examples.py 2020-10-08 08:17:25 +05:30
Wannaphong Phatthiyaphaibun
9fc8392b38
Add Thai tag map (LST20 Corpus) (#6163)
* Add Thai tag map (LST20 Corpus)

By @korakot

* Update tag_map.py

* Update tag_map.py

* Update tag_map.py
2020-10-07 11:12:01 +02:00
Duygu Altinok
7e821c2776
Turkish language syntax iterators (#6191)
* added tr_vocab to config

* basic test

* added syntax iterator to Turkish lang class

* first version for Turkish syntax iter, without flat

* added simple tests with nmod, amod, det

* more tests to amod and nmod

* separated noun chunks and parser test

* rearrangement after nchunk parser separation

* added recursive NPs

* tests with complicated recursive NPs

* tests with conjed NPs

* additional tests for conj NP

* small modification for shaving off conj from NP

* added tests with flat

* more tests with flat

* added examples with flats conjed

* added inner func for flat trick

* corrected parse

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-07 11:07:52 +02:00
Duygu Altinok
2ce6fc2611
Turkish tag map and morph rules addition (#6141)
* feat: added turkish tag map

* feat: morph rules cconj and sconj

* feat: more conjuncts

* feat: added popular postpositions

* feat: added adverbs

* feat: added personal pronouns

* feat: added reflexive pronouns

* minor: corrected case capital

* minor: fixed comma typo

* feat: added indef pronouns

* feat: added dict iter

* fixed comma typo

* updated language class with tag map and morph

* use default tag map instead

* removed tag map
2020-10-07 10:27:36 +02:00
Duygu Altinok
b95a11dd95
Ordinal numbers for Turkish (#6142)
* minor ordinal number addition

* fixed typo

* added corresponding lexical test
2020-10-07 10:25:37 +02:00
Rahul Gupta
1a00bff06d
Hindi: Adds tests for lexical attributes (norm and like_num) (#5829)
* Hindi: Adds tests for lexical attributes (norm and like_num)

* Signs and sdds the contributor agreement

* Add ordinal numbers to be tagged as like_num

* Adds alternate pronunciation for 31 and 39
2020-10-07 10:23:32 +02:00
Nuccy90
c809b2c8e7
Update morph_rules.py (#6102)
* Update morph_rules.py

Added "dig" and "dej" ("you" in accusative form)

* Create Nuccy90.md

* Update Nuccy90.md
2020-10-06 15:14:47 +02:00
Ines Montani
126268ce50 Auto-format [ci skip] 2020-10-05 21:58:18 +02:00