Commit Graph

196 Commits

Author SHA1 Message Date
Lise
b6e022381d
Feature/nn and fo language extensions (#13116)
* add language extensions for norwegian nynorsk and faroese

* update docstring for nn/examples.py

* use relative imports

* add fo and nn tokenizers to pytest fixtures

* add unittests for fo and nn and fix bug in nn

* remove module docstring from fo/__init__.py

* add comments about example sentences' origin

* add license information to faroese data credit

* format unittests using black

* add __init__ files to test/lang/nn and tests/lang/fo

* fix import order and use relative imports in fo/__nit__.py and nn/__init__.py

* Make the tests a bit more compact

* Add fo and nn to website languages

* Add note about jul.

* Add "jul." as exception

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-11-20 07:49:59 +01:00
Daniël de Kok
e2b70df012
Configure isort to use the Black profile, recursively isort the spacy module (#12721)
* Use isort with Black profile

* isort all the things

* Fix import cycles as a result of import sorting

* Add DOCBIN_ALL_ATTRS type definition

* Add isort to requirements

* Remove isort from build dependencies check

* Typo
2023-06-14 17:48:41 +02:00
Sani
873c16a4df
Malay language support (#12602)
* add malay lang

* fix token len

* black format

* reformat conftest malay

* remove exceptions not exist in dbp

* format code
2023-05-17 12:45:21 +02:00
Adriane Boyd
30d31fd335
Update Russian and Ukrainian lemmatizers (#11811)
* pymorph2 issues #11620, #11626, #11625:
- #11620: pymorphy2_lookup
- #11626: handle multiple forms pointing to the same normal form + handling empty POS tag
- #11625: matching DET that are labelled as PRON by pymorhp2

* Move lemmatizer algorithm changes back into RussianLemmatizer

* Fix uk pymorphy3_lookup mode init

* Move and update tests for ru/uk lookup lemmatizer modes

* Fix typo

* Remove traces of previous behavior for uninflected POS

* Refactor to private generic-looking pymorphy methods

* Remove xfailed uk lemmatizer cases

* Update spacy/lang/ru/lemmatizer.py

Co-authored-by: Richard Hudson <richard@explosion.ai>

Co-authored-by: Dmytro S Lituiev <d.lituiev@gmail.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
2022-11-25 11:12:46 +01:00
Adriane Boyd
7e56701057 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5 2022-10-20 13:38:49 +02:00
Adriane Boyd
fe06e037bc
Fix init for pymorphy2_lookup lemmatizer mode (#11631) 2022-10-12 12:18:39 +02:00
Sofie Van Landeghem
d801cccd38
Merge pull request #11430 from rmitsch/chore/synch-develop
Synch develop with master
2022-09-05 15:07:18 +02:00
github-actions[bot]
71884d0942
Auto-format code with black (#11427)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-09-02 11:43:20 +02:00
Patrick J. Burns
5ae63b1fbd
Add Latin language support (#11349)
* Add lang folder for la (Latin)

* Add Latin lang classes

* Add minimal tokenizer exceptions

* Add minimal stopwords

* Add minimal lex_attrs

* Update stopwords, tokenizer exceptions

* Add la tests; register la_tokenizer in conftest.py

* Update spacy/lang/la/lex_attrs.py

Remove duplicate form in Latin lex_attrs

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update natto-py version spec (#11222)

* Update natto-py version spec

* Update setup.cfg

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add scorer to textcat API docs config settings (#11263)

* Update docs for pipeline initialize() methods (#11221)

* Update documentation for dependency parser

* Update documentation for trainable_lemmatizer

* Update documentation for entity_linker

* Update documentation for ner

* Update documentation for morphologizer

* Update documentation for senter

* Update documentation for spancat

* Update documentation for tagger

* Update documentation for textcat

* Update documentation for tok2vec

* Run prettier on edited files

* Apply similar changes in transformer docs

* Remove need to say annotated example explicitly

I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.

* Run prettier on transformer docs

* chore: add 'concepCy' to spacy universe (#11255)

* chore: add 'concepCy' to spacy universe

* docs: add 'slogan' to concepCy

* Support full prerelease versions in the compat table (#11228)

* Support full prerelease versions in the compat table

* Fix types

* adding spans to doc_annotation in Example.to_dict (#11261)

* adding spans to doc_annotation in Example.to_dict

* to_dict compatible with from_dict: tuples instead of spans

* use strings for label and kb_id

* Simplify test

* Update data formats docs

Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix regex invalid escape sequences (#11276)

* Add W605 to the errors raised by flake8 in the CI (#11283)

* Clean up automated label-based issue handling (#11284)

* Clean up automated label-based issue handline

1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config

* Use old, longer message

* Fix label name

* Fix Dutch noun chunks to skip overlapping spans (#11275)

* Add test for overlapping noun chunks

* Skip overlapping noun chunks

* Update spacy/tests/lang/nl/test_noun_chunks.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950)

* add in spans example and parse references

* rm autoformatter

* rm extra ents copy

* TypedDict draft

* type fixes

* restore non-documentation files

* docs update

* fix spans example

* fix hyperlinks

* add parse example

* example fix + argument fix

* fix api arg in docs

* fix bad variable replacement

* fix spacing in style

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* fix spacing on table

* fix spacing on table

* rm temp files

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* include span_ruler for default warning filter (#11333)

* Add uk pipelines to website (#11332)

* Check for . in factory names (#11336)

* Make fixes for PR #11349

* Fix roman numeral coverage in #11349

Co-authored-by: Patrick J. Burns <patricks@diyclassics.org>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com>
Co-authored-by: Jules Belveze <32683010+JulesBelveze@users.noreply.github.com>
Co-authored-by: stefawolf <wlf.ste@gmail.com>
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
2022-08-30 14:04:54 +02:00
Adriane Boyd
81874265e9 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5-1 2022-08-24 12:47:42 +02:00
Tobius Saul
c09d2fa25b
luganda language extension (#10847)
* luganda language extension

* __init__.py changes

* New enhancements

* Lexical attribute changed

* punctuaction and sentence additions

* Remove comment header

* Fix typos, reformat

* reformated version

* Add tokenizer test

* Remove contractions from stop words

* Format

* Add Luganda to website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 13:09:36 +02:00
Adriane Boyd
5fa8f4faca
Switch ru and uk lemmatizers to pymorphy3 (#11345)
* Switch ru and uk lemmatizers to pymorphy3

* Switch to pymorphy3 in tests
2022-08-22 11:27:14 +02:00
github-actions[bot]
e7fd06bdbe
Auto-format code with black (#11099)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-07-08 18:43:25 +09:00
Madeesh Kannan
d36d66b7ca
Increase test deadline to 30 minutes to prevent spurious test failures (#11070)
* Increase test deadline to 30 minutes to prevent spurious test failures

* Reduce deadline to 2 minutes
2022-07-04 18:37:09 +02:00
Madeesh Kannan
9ba3e1cb2f
Basic tests for the Tamil language (#10629)
* Add basic tests for Tamil (ta)

* Add comment
Remove superfluous condition

* Remove superfluous call to `pipe`
Instantiate new tokenizer for special case
2022-04-07 14:47:37 +02:00
jnphilipp
5ca0dbae76
Add Lower Sorbian support. (#10431)
* Add support basic support for lower sorbian.

* Add some test for dsb.

* Update spacy/lang/dsb/examples.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 16:57:14 +01:00
jnphilipp
7ed7908716
Add Upper Sorbian support. (#10432)
* Add support basic support for upper sorbian.

* Add tokenizer exceptions and tests.

* Update spacy/lang/hsb/examples.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 16:20:39 +01:00
Adriane Boyd
30030176ee
Update Korean defaults for Tokenizer (#10322)
Update Korean defaults for `Tokenizer` for tokenization following UD
Korean Kaist.
2022-02-21 10:26:19 +01:00
Sofie Van Landeghem
58e29776bd
Merge pull request #9777 from explosion/master
Update develop with master
2021-11-30 14:01:23 +01:00
Duygu Altinok
29f28d1f3e
French NP review (#9667)
* adapted from pt

* added basic tests

* added fr vocab

* fixed noun chunks

* more examples

* typo fix

* changed naming

* changed the naming

* typo fix
2021-11-30 12:19:07 +01:00
Richard Hudson
7b134b8fbd
New tests for a number of alpha languages (#9703)
* Added Slovak

* Added Slovenian tests

* Added Estonian tests

* Added Croatian tests

* Added Latvian tests

* Added Icelandic tests

* Added Afrikaans tests

* Added language-independent tests

* Added Kannada tests

* Tidied up

* Added Albanian tests

* Formatted with black

* Added failing tests for anomalies

* Update spacy/tests/lang/af/test_text.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Estonian tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Croatian tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Icelandic tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Latvian tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Slovak tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added context to failing Slovenian tokenizer test

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-11-28 21:59:23 +01:00
Duygu Altinok
25bd9f9d48
Noun chunks for Italian (#9662)
* added it vocab

* copied portuguese

* added possessive determiner

* added conjed Nps

* added nmoded Nps

* test misc

* more examples

* fixed typo

* fixed parenth

* fixed comma

* comma fix

* added syntax iters

* fix some index problems

* fixed index

* corrected heads for test case

* fixed tets case

* fixed determiner gender

* cleaned left over

* added example with apostophe
2021-11-23 16:29:25 +01:00
Duygu Altinok
f0e8c9fe58
Spanish noun chunks review (#9537)
* updated syntax iters

* formatted the code

* added prepositional objects

* code clean up

* eliminated left attached adp

* added es vocab

* added basic tests

* fixed typo

* fixed typo

* list to set

* fixed doc name

* added code for conj

* more tests

* differentiated adjectives and flat

* fixed typo

* added compounds

* more compounds

* tests for compounds

* tests for nominal modifiers

* fixed typo

* fixed typo

* formatted file

* reformatted tests

* fixed typo

* fixed punct typo

* formatted after changes

* added indirect object

* added full sentence examples

* added longer full sentence examples

* fixed sentence length of test

* added passive subj

* added test case by Damian
2021-11-05 00:46:36 +01:00
Duygu Altinok
6e6650307d
Portuguese noun chunks review (#9559)
* added tests

* added pt vocab

* transferred spanish

* added syntax iters

* fixed parenthesis

* added nmod example

* added relative pron

* fixed rel pron

* added rel subclause

* corrected typo

* added more NP chains

* long sentence

* fixed typo

* fixed typo

* fixed typo

* corrected heads

* added passive subj

* added pass subj

* added passive obj

* refinement to rights

* went back to odl

* fixed test

* fixed typo

* fixed typo

* formatted

* Format

* Format test cases

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-11-04 23:55:49 +01:00
Lj Miranda
f1bc655a38
Add initial Tagalog (tl) tests (#9582)
* Add tl_tokenizer to test fixtures

* Add tagalog tests
2021-11-02 08:35:49 +01:00
Ines Montani
ad9f57cbbf Allow conftest.py to run twice for build envs 2021-10-19 15:13:25 +02:00
Ines Montani
1fa7c4e73b Support issue marker via pytest 2021-10-11 13:56:24 +02:00
explosion-bot
eff3d1088b Auto-format code with black 2021-07-16 08:03:36 +00:00
jmyerston
993b0fab0e
Added ancient Greek language support (#8606)
* Add ancient Greek language support

Initial commit

* Contributor Agreement

* grc tokenizer test added  and files formatted with black, unnecessary import removed

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Commas in lists fixed. __init__py added to test

* Update lex_attrs.py

* Update stop_words.py

* Update stop_words.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-07-15 10:27:17 +02:00
Julien Rossi
e117573822
Adding noun_chunks to the DUTCH language model (nl) (#8529)
*  implement noun_chunks for dutch language

* copy/paste FR and SV syntax iterators to accomodate UD tags
* added tests with dutch text
* signed contributor agreement

* 🐛 fix noun chunks generator

* built from scratch
* define noun chunk as a single Noun-Phrase
* includes some corner cases debugging (incorrect POS tagging)
* test with provided annotated sample (POS, DEP)

*  fix failing test

* CI pipeline did not like the added sample file
* add the sample as a pytest fixture

* Update spacy/lang/nl/syntax_iterators.py

* Update spacy/lang/nl/syntax_iterators.py

Code readability

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/tests/lang/nl/test_noun_chunks.py

correct comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* finalize code

* change "if next_word" into "if next_word is not None"

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-07-14 14:01:02 +02:00
Adriane Boyd
5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
Adriane Boyd
f4008bdb13
Restrict pymorphy2 requirement to pymorphy2 mode (#8299)
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
2021-06-11 10:19:22 +02:00
Adriane Boyd
1d59fdbd39
Update Vietnamese tokenizer (#8099)
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese

Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
2021-05-17 18:16:20 +10:00
Boian Tzonev
cca8651fc8
Bulgarian tokenizer exceptions (#7114)
* [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian

* [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian
2021-02-19 19:19:19 +01:00
Ines Montani
80ba9eaf7d Fix test 2021-01-27 21:29:02 +11:00
Ines Montani
230e651ad6 Merge branch 'develop' into master-tmp 2021-01-27 13:26:29 +11:00
muratjumashev
87168eb81f Add tests 2021-01-24 20:56:16 +06:00
Ines Montani
b0b743597c Tidy up and auto-format 2021-01-15 11:57:36 +11:00
Adriane Boyd
0c936004d1 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3 2021-01-14 11:49:58 +01:00
Yosi
cf52510631
Add Amharic አማርኛ Language support (#6583)
* Add Amharic to space

* clean up

* Add some PRON_LEMMA

* add Tigrinya support

* remove text_noun_chunks

* Tigrinya Support

* added some more details for ti

* fix unit test

* add amharic char range

* changes from review

* amharic and tigrinya share same unicode block

* get rid of _amharic/_tigrinya in char_classes

Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>
2020-12-22 16:50:34 +01:00
Adriane Boyd
724831b066 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master
* Update Macedonian for v3
* Update Turkish for v3
2020-11-25 11:49:34 +01:00
Borijan Georgievski
2311192ba1
Include Macedonian language (#6230)
* Include Macedonian language

* Fix indentation at char_classes.py

* Fix indentation at char_classes.py

* Add Macedonian tests, update lex_attrs and char_classes

* Import unicode literals for python 2
2020-10-15 15:55:01 +02:00
Ines Montani
d165af26be Auto-format [ci skip] 2020-10-15 10:08:53 +02:00
Ines Montani
5d62499266 Fix tests 2020-10-15 09:29:15 +02:00
Ines Montani
178760855f Merge branch 'develop' into master-tmp 2020-10-15 09:06:03 +02:00
Ines Montani
539b0c10da Tidy up and auto-format 2020-10-10 19:14:48 +02:00
Duygu Altinok
2fad279a44 Turkish language syntax iterators (#6191)
* added tr_vocab to config

* basic test

* added syntax iterator to Turkish lang class

* first version for Turkish syntax iter, without flat

* added simple tests with nmod, amod, det

* more tests to amod and nmod

* separated noun chunks and parser test

* rearrangement after nchunk parser separation

* added recursive NPs

* tests with complicated recursive NPs

* tests with conjed NPs

* additional tests for conj NP

* small modification for shaving off conj from NP

* added tests with flat

* more tests with flat

* added examples with flats conjed

* added inner func for flat trick

* corrected parse

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-09 10:10:22 +02:00
Duygu Altinok
7e821c2776
Turkish language syntax iterators (#6191)
* added tr_vocab to config

* basic test

* added syntax iterator to Turkish lang class

* first version for Turkish syntax iter, without flat

* added simple tests with nmod, amod, det

* more tests to amod and nmod

* separated noun chunks and parser test

* rearrangement after nchunk parser separation

* added recursive NPs

* tests with complicated recursive NPs

* tests with conjed NPs

* additional tests for conj NP

* small modification for shaving off conj from NP

* added tests with flat

* more tests with flat

* added examples with flats conjed

* added inner func for flat trick

* corrected parse

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-07 11:07:52 +02:00
Rahul Gupta
1a00bff06d
Hindi: Adds tests for lexical attributes (norm and like_num) (#5829)
* Hindi: Adds tests for lexical attributes (norm and like_num)

* Signs and sdds the contributor agreement

* Add ordinal numbers to be tagged as like_num

* Adds alternate pronunciation for 31 and 39
2020-10-07 10:23:32 +02:00
Ines Montani
6abfc2911d
Merge pull request #6203 from adrianeboyd/feature/zh-spacy-pkuseg 2020-10-05 21:35:57 +02:00