Commit Graph

916 Commits

Author SHA1 Message Date
Denis Bezykornov
7e684ad691
Update Russian tokenizer exceptions (#11753)
* Fix typos, add a couple of new abbreviations, remove nonbreaking spaces

* Remove space from abbreviation

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-11-15 11:37:25 +01:00
Adriane Boyd
7e56701057 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5 2022-10-20 13:38:49 +02:00
Adriane Boyd
fe06e037bc
Fix init for pymorphy2_lookup lemmatizer mode (#11631) 2022-10-12 12:18:39 +02:00
Jacobo Myerston
3e8bc1272f
add punctuation to grc (#11426)
* add punctuation to grc

Add support for special editorial punctuation that is common in ancient Greek texts. Ancient Greek texts, as found in digital and print form, have been largely edited by scholars. Restorations and improvements are normally marked with special characters that need to be handled properly by the tokenizer.

* add unit tests

* simplify regex

* move generic quotes to char classes

* rename unit test

* fix regex

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-27 11:38:56 +02:00
Sofie Van Landeghem
d801cccd38
Merge pull request #11430 from rmitsch/chore/synch-develop
Synch develop with master
2022-09-05 15:07:18 +02:00
github-actions[bot]
71884d0942
Auto-format code with black (#11427)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-02 11:43:20 +02:00
Patrick J. Burns
5ae63b1fbd
Add Latin language support (#11349)
* Add lang folder for la (Latin)

* Add Latin lang classes

* Add minimal tokenizer exceptions

* Add minimal stopwords

* Add minimal lex_attrs

* Update stopwords, tokenizer exceptions

* Add la tests; register la_tokenizer in conftest.py

* Update spacy/lang/la/lex_attrs.py

Remove duplicate form in Latin lex_attrs

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update natto-py version spec (#11222)

* Update natto-py version spec

* Update setup.cfg

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add scorer to textcat API docs config settings (#11263)

* Update docs for pipeline initialize() methods (#11221)

* Update documentation for dependency parser

* Update documentation for trainable_lemmatizer

* Update documentation for entity_linker

* Update documentation for ner

* Update documentation for morphologizer

* Update documentation for senter

* Update documentation for spancat

* Update documentation for tagger

* Update documentation for textcat

* Update documentation for tok2vec

* Run prettier on edited files

* Apply similar changes in transformer docs

* Remove need to say annotated example explicitly

I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.

* Run prettier on transformer docs

* chore: add 'concepCy' to spacy universe (#11255)

* chore: add 'concepCy' to spacy universe

* docs: add 'slogan' to concepCy

* Support full prerelease versions in the compat table (#11228)

* Support full prerelease versions in the compat table

* Fix types

* adding spans to doc_annotation in Example.to_dict (#11261)

* adding spans to doc_annotation in Example.to_dict

* to_dict compatible with from_dict: tuples instead of spans

* use strings for label and kb_id

* Simplify test

* Update data formats docs

Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix regex invalid escape sequences (#11276)

* Add W605 to the errors raised by flake8 in the CI (#11283)

* Clean up automated label-based issue handling (#11284)

* Clean up automated label-based issue handling

1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config

* Use old, longer message

* Fix label name

* Fix Dutch noun chunks to skip overlapping spans (#11275)

* Add test for overlapping noun chunks

* Skip overlapping noun chunks

* Update spacy/tests/lang/nl/test_noun_chunks.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950)

* add in spans example and parse references

* rm autoformatter

* rm extra ents copy

* TypedDict draft

* type fixes

* restore non-documentation files

* docs update

* fix spans example

* fix hyperlinks

* add parse example

* example fix + argument fix

* fix api arg in docs

* fix bad variable replacement

* fix spacing in style

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* fix spacing on table

* fix spacing on table

* rm temp files

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* include span_ruler for default warning filter (#11333)

* Add uk pipelines to website (#11332)

* Check for . in factory names (#11336)

* Make fixes for PR #11349

* Fix roman numeral coverage in #11349

Co-authored-by: Patrick J. Burns <patricks@diyclassics.org>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com>
Co-authored-by: Jules Belveze <32683010+JulesBelveze@users.noreply.github.com>
Co-authored-by: stefawolf <wlf.ste@gmail.com>
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
2022-08-30 14:04:54 +02:00
Paul O'Leary McCann
aafee5e1b7
Fix lookup usage in French/Catalan (fix #11347) (#11382)
* Fix lookup usage (fix #11347)

Before using the lookups table in the French (and Catalan) lemmatizers,
there's a check to see if the current term is in the table. But it's
checking a string against hashes, so it's always false. Also the table
lookup function is designed so you don't have to do that anyway.

* Use the lookup table directly

* Use string, not token
2022-08-29 10:32:38 +02:00
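The bug described in the commit above (a string checked against hashed keys) can be illustrated with a minimal sketch. `HashKeyedTable` and `toy_hash` are hypothetical stand-ins, not spaCy's actual lookups implementation; the sketch only assumes, as the commit message states, that the table is keyed by hashes and that its lookup function accepts strings directly.

```python
import zlib


def toy_hash(text: str) -> int:
    """Hypothetical stand-in for a string-to-hash function."""
    return zlib.crc32(text.encode("utf8"))


class HashKeyedTable:
    """Hypothetical table: keys are stored as hashes, get() accepts strings."""

    def __init__(self, data):
        self._data = {toy_hash(key): value for key, value in data.items()}

    def keys(self):
        return self._data.keys()

    def get(self, key: str, default=None):
        # The lookup hashes the string itself, so callers never need to.
        return self._data.get(toy_hash(key), default)


table = HashKeyedTable({"chevaux": "cheval"})

# Buggy pattern: a string is compared against hashed keys, so this is never true.
found_before = "chevaux" in table.keys()

# Fixed pattern: use the lookup directly, with the input string as the fallback.
lemma = table.get("chevaux", "chevaux")
unknown = table.get("xyz", "xyz")
```

Because the lookup already falls back to a default, the membership check adds nothing even when it is written correctly.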
Adriane Boyd
81874265e9 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5-1 2022-08-24 12:47:42 +02:00
Tobius Saul
c09d2fa25b
Luganda language extension (#10847)
* Luganda language extension

* __init__.py changes

* New enhancements

* Lexical attribute changed

* punctuation and sentence additions

* Remove comment header

* Fix typos, reformat

* reformatted version

* Add tokenizer test

* Remove contractions from stop words

* Format

* Add Luganda to website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 13:09:36 +02:00
Adriane Boyd
5fa8f4faca
Switch ru and uk lemmatizers to pymorphy3 (#11345)
* Switch ru and uk lemmatizers to pymorphy3

* Switch to pymorphy3 in tests
2022-08-22 11:27:14 +02:00
Adriane Boyd
ed4ad309e6
Fix Dutch noun chunks to skip overlapping spans (#11275)
* Add test for overlapping noun chunks

* Skip overlapping noun chunks

* Update spacy/tests/lang/nl/test_noun_chunks.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-08-10 09:49:08 +02:00
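Skipping overlapping spans, as in the noun chunk fix above, can be sketched as follows. This is a simplified illustration, not the code from the commit: `skip_overlapping` is a hypothetical helper, spans are plain `(start, end)` token offsets with an exclusive end, and the candidates are assumed to arrive in document order.

```python
from typing import Iterable, Iterator, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def skip_overlapping(candidates: Iterable[Span]) -> Iterator[Span]:
    """Yield candidates in order, dropping any that overlap an earlier one."""
    prev_end = -1  # index of the last token already covered by a yielded span
    for start, end in candidates:
        if start <= prev_end:
            # This candidate begins inside a span that was already yielded.
            continue
        prev_end = end - 1
        yield (start, end)


chunks = list(skip_overlapping([(0, 3), (2, 5), (5, 7)]))
```

Here `(2, 5)` is dropped because token 2 is already covered by `(0, 3)`, while `(5, 7)` is kept.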
Adriane Boyd
fc4246558b
Fix regex invalid escape sequences (#11276) 2022-08-09 10:59:36 +02:00
Luka Dragar
b64243ed55
Updates to Slovenian language (#11162)
* Added examples for Slovene

* Update spacy/lang/sl/examples.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Corrected a typo in one of the sentences

* Updated support for Slovenian

* Some minor changes to corrections

* Added forint currency

* Corrected HYPHENS_PERMITTED regex and some formatting

* Minor changes

* Un-xfail tokenizer test

* Format

Co-authored-by: Luka Dragar <D20124481@mytudublin.ie>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-05 10:10:18 +02:00
Richard Hudson
a9559e7435
Handle Cyrillic combining diacritics (#10837)
* Handle Russian, Ukrainian and Bulgarian

* Corrections

* Correction

* Correction to comment

* Changes based on review

* Correction

* Reverted irrelevant change in punctuation.py

* Remove unnecessary group

* Reverted accidental change
2022-06-28 15:35:32 +02:00
Adriane Boyd
727ce6d1f5
Remove English exceptions with mismatched features (#10873)
Remove English contraction exceptions with mismatched features that lead
to exceptions like "theses" and "thisre".
2022-06-03 09:44:04 +02:00
github-actions[bot]
e07500369c
Auto-format code with black (#10687)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-04-22 11:24:53 +02:00
Richard Hudson
4b227f4861
Merge pull request #10669 from mgrojo/develop
Fix some issues in Spanish stop-word list and examples
2022-04-19 09:37:34 +02:00
mgr
3d50b1a989 Fix some issues in Spanish examples
- Spelling: nationalities in lowercase, accent.
- Incorrect verb composition
- Untranslated word
2022-04-18 22:12:57 +02:00
mgr
2a2654c756 Remove significant or not very frequent words from stop word list [es]
The list of stop words for Spanish contained many inadequate words, see:

https://github.com/explosion/spaCy/issues/3052#issuecomment-1100760100

Removed words:
- verb forms of 'trabajar' (work) and intentar (try)
- words related to 'empleo' (employment)
- incorrect words: ampleamos, arribaabajo, soyos, paìs
- miscellaneous words due to being too significant or too infrequent:
  actualmente, aproximadamente, antaño, cosas, ejemplo, horas, general,
  pais, principalmente, raras

Added other stop words for completion:
- Spanish one-letter words
- numbers up to twelve

Some reformatting to 79 columns.

When in doubt, the English and German lists have been consulted as good
examples.
2022-04-18 22:04:02 +02:00
Duy Ngo
229ecaf0ea
Add numbers and definitions (#10665) 2022-04-18 12:58:32 +02:00
fonfonx
028cbad05e
Add feminine form of word "one" in French (#10653)
* Add French number

* Add fonfonx.md

* Add feminine ordinal words for French
2022-04-14 10:21:27 +02:00
Yunus Atahan
36d3af3013
Fixed typo in Turkish lang. (#10582)
* added failing test case for the issue.

* Fixed typo.

* fixed typo in test.

* added corrected typo word into test_tr_lex_attrs_capitals as param. Test passes. Also tried and confirmed that test is failing after fixing the typo in the test case I wrote. Deleted the test case for typo.

Co-authored-by: Yunus Atahan <yunus.atahan@trmotor.local>
2022-03-30 13:16:08 +02:00
Luka Dragar
53674bb745
Examples for Slovene (#10539)
* Added examples for Slovene

* Update spacy/lang/sl/examples.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Corrected a typo in one of the sentences

Co-authored-by: Luka Dragar <D20124481@mytudublin.ie>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-28 10:44:10 +02:00
Adriane Boyd
e908a67829
Handle unknown tags in KoreanTokenizer tag map (#10536) 2022-03-24 11:25:36 +01:00
Grey Murav
3ff5a6a5c0
Extend list of _num_words (#10468) 2022-03-16 18:25:42 +01:00
github-actions[bot]
1bbf232074
Auto-format code with black (#10479)
* Auto-format code with black

* Update spacy/lang/hsb/lex_attrs.py

Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-11 12:20:23 +01:00
Adriane Boyd
191e8b31fa
Remove English tokenizer exception May. (#10463) 2022-03-08 14:28:46 +01:00
jnphilipp
5ca0dbae76
Add Lower Sorbian support. (#10431)
* Add basic support for Lower Sorbian.

* Add some tests for dsb.

* Update spacy/lang/dsb/examples.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 16:57:14 +01:00
jnphilipp
7ed7908716
Add Upper Sorbian support. (#10432)
* Add basic support for Upper Sorbian.

* Add tokenizer exceptions and tests.

* Update spacy/lang/hsb/examples.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 16:20:39 +01:00
Adriane Boyd
30030176ee
Update Korean defaults for Tokenizer (#10322)
Update Korean defaults for `Tokenizer` for tokenization following UD
Korean Kaist.
2022-02-21 10:26:19 +01:00
github-actions[bot]
6de84c8757
Auto-format code with black (#10333)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-02-21 09:15:42 +01:00
Grey Murav
aa93b471a1
Extend list of stopwords for ru language (#10313) 2022-02-17 15:51:15 +01:00
Grey Murav
23f06dc37f
Extend list of numbers for ru language (#10280)
* Extended list of numbers for ru language

Extended list of numbers with all forms and cases including short forms, slang variants and roman numerals.

* Update lex_attrs.py

* Update 'like_num' function with percentages

Added support for numbers with percentages, such as 12% and 1.2%, to the 'like_num' function.

* black formatting

Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
2022-02-17 15:50:08 +01:00
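The percentage handling mentioned in the commit above might look roughly like the sketch below. It is an assumption-laden illustration rather than the code from the commit: the `_num_words` set is a tiny placeholder subset, and the prefix and separator handling is only one plausible way to write a `like_num` function.

```python
# Tiny illustrative subset; the real list in the commit is far longer.
_num_words = {"один", "два", "три"}


def like_num(text: str) -> bool:
    """Return True if the text looks like a number (sketch, not spaCy's code)."""
    if text.endswith("%"):
        # Treat "12%" and "1.2%" like their numeric part.
        text = text[:-1]
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _num_words
```

A bare `%` is not treated as a number, because stripping the sign leaves an empty string.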
Grey Murav
a9756963e6
Extend list of abbreviations for ru language (#10282)
* Extend list of abbreviations for ru language

Extended the list of abbreviations for the ru language that may influence tokenization.

* black formatting

Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
2022-02-17 15:48:50 +01:00
Adriane Boyd
da7520a83c
Delay loading of mecab in Korean tokenizer (#10295)
* Delay loading of mecab in Korean tokenizer

Delay loading of mecab until the tokenizer is called the first time so
that it's possible to initialize a blank `ko` pipeline without having
mecab installed, e.g. for use with `spacy init vectors`.

* Move mecab import back to __init__

Move mecab import back to __init__ to warn users at the same point as
before for missing python dependencies.
2022-02-17 11:35:34 +01:00
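The delayed-loading idea in the commit above can be sketched generically. `LazyTokenizer` and `make_backend` are hypothetical names used only for illustration; the sketch does not use mecab and does not reproduce the Korean tokenizer. It only shows the pattern of deferring an expensive backend until the first call.

```python
class LazyTokenizer:
    """Hypothetical tokenizer that builds its backend on first use."""

    def __init__(self, backend_factory):
        # Store only the factory; nothing expensive happens at construction.
        self._backend_factory = backend_factory
        self._backend = None

    @property
    def backend(self):
        if self._backend is None:
            # First access: create the backend and cache it.
            self._backend = self._backend_factory()
        return self._backend

    def __call__(self, text):
        return self.backend(text)


calls = []


def make_backend():
    """Hypothetical factory standing in for an expensive tagger setup."""
    calls.append("loaded")
    return str.split  # whitespace splitting as a placeholder backend


tok = LazyTokenizer(make_backend)
created_without_loading = calls == []
tokens = tok("a b c")
tok("d e")
```

Constructing the object does not touch the backend, so a pipeline can be initialized without it; the factory runs once, on the first call.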
Antti Ajanki
e9c26f2ee9
Add a noun chunker for Finnish (#10214)
with test cases
2022-02-08 08:44:11 +01:00
Sofie Van Landeghem
bc12ecb870
Merge pull request #10185 from martinjack/master
Update Ukrainian tokenizer_exceptions
2022-02-06 16:30:03 +01:00
Sofie Van Landeghem
14513f82da
Merge pull request #10215 from explosion/master
update develop
2022-02-06 13:45:41 +01:00
Evgen Kytonin
fc3d446c71 Update Ukrainian tokenizer_exceptions 2022-02-01 13:24:00 +02:00
Marek Šuppa
f09c799a96
fix: Add missing comma to _eleven_to_beyond (#10166)
* This comma has most probably been left out unintentionally, leading to string concatenation between the two consecutive lines. This issue has been found automatically using a regular expression.
2022-01-30 16:45:06 +09:00
Marek Šuppa
67ecac633f
fix: Add missing comma to examples.py (#10167)
* This comma has most probably been left out unintentionally, leading to string concatenation between the two consecutive lines. This issue has been found automatically using a regular expression.
2022-01-30 16:43:29 +09:00
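The effect described in the two commits above comes from Python's implicit concatenation of adjacent string literals. The word lists below are made up for illustration and are not the contents of the affected files.

```python
# Without the comma, the two adjacent literals are joined into one string.
words_buggy = [
    "eleven",
    "twelve"  # missing comma here
    "thirteen",
]

# With the comma restored, each word is its own list item.
words_fixed = [
    "eleven",
    "twelve",
    "thirteen",
]
```

The buggy list has two items, the second being `"twelvethirteen"`, and Python raises no error or warning, so the mistake is easy to miss.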
pepemedigu
2abd380f2d
Update lex_attrs.py for Spanish with ordinals (#10038)
* Update lex_attrs.py

Add ordinal words

* black formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-20 15:44:13 +01:00
Adriane Boyd
add52935ff
Revert "Bump sudachipy version (#9917)" (#10071)
This reverts commit 58bdd8607b.
2022-01-17 10:38:37 +01:00
Paul O'Leary McCann
58bdd8607b
Bump sudachipy version (#9917)
* Edited Slovenian stop words list (#9707)

* Noun chunks for Italian (#9662)

* added it vocab

* copied portuguese

* added possessive determiner

* added conjed Nps

* added nmoded Nps

* test misc

* more examples

* fixed typo

* fixed parenth

* fixed comma

* comma fix

* added syntax iters

* fix some index problems

* fixed index

* corrected heads for test case

* fixed test case

* fixed determiner gender

* cleaned left over

* added example with apostrophe

* French NP review (#9667)

* adapted from pt

* added basic tests

* added fr vocab

* fixed noun chunks

* more examples

* typo fix

* changed naming

* changed the naming

* typo fix

* Add Japanese kana characters to default exceptions (fix #9693) (#9742)

This includes the main kana, or phonetic characters, used in Japanese.

There are some supplemental kana blocks in Unicode outside the BMP that
could also be included, but because their actual use is rare I omitted
them for now, but maybe they should be added. The omitted blocks are:

- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension

* Remove NER words from stop words in Norwegian (#9820)

Default stop words in Norwegian bokmål (nb) in spaCy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.

Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.

See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831

* Bump sudachipy version

* Update sudachipy versions

* Bump versions

Bumping to the most recent dictionary just to keep things current.
Bumping sudachipy to 5.2 because older versions don't support recent
dictionaries.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
Co-authored-by: Duygu Altinok <duygu@explosion.ai>
Co-authored-by: Haakon Meland Eriksen <haakon.eriksen@far.no>
2022-01-17 08:16:22 +01:00
Duygu Altinok
7ec1452f5f
added elided forms (#9878)
* added elided forms

* rearranged a bit

* rearranged a bit

* added stopword tests

* blacked tests file
2021-12-23 13:41:01 +01:00
Haakon Meland Eriksen
251119455d
Remove NER words from stop words in Norwegian (#9820)
Default stop words in Norwegian bokmål (nb) in spaCy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.

Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.

See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831
2021-12-07 09:45:10 +01:00
Paul O'Leary McCann
b4d526c357
Add Japanese kana characters to default exceptions (fix #9693) (#9742)
This includes the main kana, or phonetic characters, used in Japanese.

There are some supplemental kana blocks in Unicode outside the BMP that
could also be included, but because their actual use is rare I omitted
them for now, but maybe they should be added. The omitted blocks are:

- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension
2021-11-30 23:36:39 +01:00
Duygu Altinok
29f28d1f3e
French NP review (#9667)
* adapted from pt

* added basic tests

* added fr vocab

* fixed noun chunks

* more examples

* typo fix

* changed naming

* changed the naming

* typo fix
2021-11-30 12:19:07 +01:00
Duygu Altinok
25bd9f9d48
Noun chunks for Italian (#9662)
* added it vocab

* copied portuguese

* added possessive determiner

* added conjed Nps

* added nmoded Nps

* test misc

* more examples

* fixed typo

* fixed parenth

* fixed comma

* comma fix

* added syntax iters

* fix some index problems

* fixed index

* corrected heads for test case

* fixed test case

* fixed determiner gender

* cleaned left over

* added example with apostrophe
2021-11-23 16:29:25 +01:00