This reverts commit 6f314f99c4.
We are reverting this until we can support this normalization more
consistently across vectors, training corpora, and lemmatizer data.
* Use Latin normalization for Serbian attrs
Use Latin normalization for Serbian `NORM`, `PREFIX`, and `SUFFIX`.
* Update NORMs in tokenizer exceptions and related tests
* Add tests for all custom lex attrs
* Remove unused imports
* Add noun chunking to la syntax iterators
* Expand list of numeral, ordinal words
* Expand abbreviations in la tokenizer_exceptions
* Add example sents
* Update spacy/lang/la/syntax_iterators.py
Reorganize la syntax iterators
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Minor updates based on review
* fix call
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add unittest for explosion#12311
* create punctuation.py for swedish
* removed : from infixes in swedish punctuation.py
* allow : as infix if succeeding char is uppercase
* Init
* fix tests
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix test_blank_languages
* Rename xx to mul in docs
* Format _util with black
* prettier formatting
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* pymorph2 issues #11620, #11626, #11625:
- #11620: pymorphy2_lookup
- #11626: handle multiple forms pointing to the same normal form + handling empty POS tag
- #11625: matching DET that are labelled as PRON by pymorhp2
* Move lemmatizer algorithm changes back into RussianLemmatizer
* Fix uk pymorphy3_lookup mode init
* Move and update tests for ru/uk lookup lemmatizer modes
* Fix typo
* Remove traces of previous behavior for uninflected POS
* Refactor to private generic-looking pymorphy methods
* Remove xfailed uk lemmatizer cases
* Update spacy/lang/ru/lemmatizer.py
Co-authored-by: Richard Hudson <richard@explosion.ai>
Co-authored-by: Dmytro S Lituiev <d.lituiev@gmail.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
* add punctuation to grc
Add support for special editorial punctuation that is common in ancient Greek texts. Ancient Greek texts, as found in digital and print form, have been largely edited by scholars. Restorations and improvements are normally marked with special characters that need to be handled properly by the tokenizer.
* add unit tests
* simplify regex
* move generic quotes to char classes
* rename unit test
* fix regex
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add lang folder for la (Latin)
* Add Latin lang classes
* Add minimal tokenizer exceptions
* Add minimal stopwords
* Add minimal lex_attrs
* Update stopwords, tokenizer exceptions
* Add la tests; register la_tokenizer in conftest.py
* Update spacy/lang/la/lex_attrs.py
Remove duplicate form in Latin lex_attrs
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update natto-py version spec (#11222)
* Update natto-py version spec
* Update setup.cfg
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add scorer to textcat API docs config settings (#11263)
* Update docs for pipeline initialize() methods (#11221)
* Update documentation for dependency parser
* Update documentation for trainable_lemmatizer
* Update documentation for entity_linker
* Update documentation for ner
* Update documentation for morphologizer
* Update documentation for senter
* Update documentation for spancat
* Update documentation for tagger
* Update documentation for textcat
* Update documentation for tok2vec
* Run prettier on edited files
* Apply similar changes in transformer docs
* Remove need to say annotated example explicitly
I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.
* Run prettier on transformer docs
* chore: add 'concepCy' to spacy universe (#11255)
* chore: add 'concepCy' to spacy universe
* docs: add 'slogan' to concepCy
* Support full prerelease versions in the compat table (#11228)
* Support full prerelease versions in the compat table
* Fix types
* adding spans to doc_annotation in Example.to_dict (#11261)
* adding spans to doc_annotation in Example.to_dict
* to_dict compatible with from_dict: tuples instead of spans
* use strings for label and kb_id
* Simplify test
* Update data formats docs
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix regex invalid escape sequences (#11276)
* Add W605 to the errors raised by flake8 in the CI (#11283)
* Clean up automated label-based issue handling (#11284)
* Clean up automated label-based issue handline
1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config
* Use old, longer message
* Fix label name
* Fix Dutch noun chunks to skip overlapping spans (#11275)
* Add test for overlapping noun chunks
* Skip overlapping noun chunks
* Update spacy/tests/lang/nl/test_noun_chunks.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950)
* add in spans example and parse references
* rm autoformatter
* rm extra ents copy
* TypedDict draft
* type fixes
* restore non-documentation files
* docs update
* fix spans example
* fix hyperlinks
* add parse example
* example fix + argument fix
* fix api arg in docs
* fix bad variable replacement
* fix spacing in style
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* fix spacing on table
* fix spacing on table
* rm temp files
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* include span_ruler for default warning filter (#11333)
* Add uk pipelines to website (#11332)
* Check for . in factory names (#11336)
* Make fixes for PR #11349
* Fix roman numeral coverage in #11349
Co-authored-by: Patrick J. Burns <patricks@diyclassics.org>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com>
Co-authored-by: Jules Belveze <32683010+JulesBelveze@users.noreply.github.com>
Co-authored-by: stefawolf <wlf.ste@gmail.com>
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
* Switch to mecab-ko as default Korean tokenizer
Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.
Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.
* Temporarily run tests with mecab-ko tokenizer
* Fix types
* Fix duplicate test names
* Update requirements test
* Revert "Temporarily run tests with mecab-ko tokenizer"
This reverts commit d2083e7044.
* Add mecab_args setting, fix pickle for KoreanNattoTokenizer
* Fix length check
* Update docs
* Formatting
* Update natto-py error message
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Added examples for Slovene
* Update spacy/lang/sl/examples.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Corrected a typo in one of the sentences
* Updated support for Slovenian
* Some minor changes to corrections
* Added forint currency
* Corrected HYPHENS_PERMITTED regex and some formatting
* Minor changes
* Un-xfail tokenizer test
* Format
Co-authored-by: Luka Dragar <D20124481@mytudublin.ie>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Handle Russian, Ukrainian and Bulgarian
* Corrections
* Correction
* Correction to comment
* Changes based on review
* Correction
* Reverted irrelevant change in punctuation.py
* Remove unnecessary group
* Reverted accidental change
* Add basic tests for Tamil (ta)
* Add comment
Remove superfluous condition
* Remove superfluous call to `pipe`
Instantiate new tokenizer for special case
* added failing test case for the issue.
* Fixed typo.
* fixed typo in test.
* added corrected typo word into test_tr_lex_attrs_capitals as param. Test passes. Also tried and confirmed that test is failing after fixing the typo in the test case I wrote. Deleted the test case for typo.
Co-authored-by: Yunus Atahan <yunus.atahan@trmotor.local>
* Add support basic support for lower sorbian.
* Add some test for dsb.
* Update spacy/lang/dsb/examples.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add support basic support for upper sorbian.
* Add tokenizer exceptions and tests.
* Update spacy/lang/hsb/examples.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* added iob to int
* added tests
* added iob strings
* added error
* blacked attrs
* Update spacy/tests/lang/test_attrs.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/attrs.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* added iob strings as global
* minor refinement with iob
* removed iob strings from token
* changed to uppercase
* cleaned and went back to master version
* imported iob from attrs
* Update and format errors
* Support and test both str and int ENT_IOB key
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Edited Slovenian stop words list (#9707)
* Noun chunks for Italian (#9662)
* added it vocab
* copied portuguese
* added possessive determiner
* added conjed Nps
* added nmoded Nps
* test misc
* more examples
* fixed typo
* fixed parenth
* fixed comma
* comma fix
* added syntax iters
* fix some index problems
* fixed index
* corrected heads for test case
* fixed tets case
* fixed determiner gender
* cleaned left over
* added example with apostophe
* French NP review (#9667)
* adapted from pt
* added basic tests
* added fr vocab
* fixed noun chunks
* more examples
* typo fix
* changed naming
* changed the naming
* typo fix
* Add Japanese kana characters to default exceptions (fix#9693) (#9742)
This includes the main kana, or phonetic characters, used in Japanese.
There are some supplemental kana blocks in Unicode outside the BMP that
could also be included, but because their actual use is rare I omitted
them for now, but maybe they should be added. The omitted blocks are:
- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension
* Remove NER words from stop words in Norwegian (#9820)
Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.
Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.
See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831
* Bump sudachipy version
* Update sudachipy versions
* Bump versions
Bumping to the most recent dictionary just to keep thing current.
Bumping sudachipy to 5.2 because older versions don't support recent
dictionaries.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
Co-authored-by: Duygu Altinok <duygu@explosion.ai>
Co-authored-by: Haakon Meland Eriksen <haakon.eriksen@far.no>
* Migrate regressions 1-1000
* Move serialize test to correct file
* Remove tests that won't work in v3
* Migrate regressions 1000-1500
Removed regression test 1250 because v3 doesn't support the old LEX
scheme anymore.
* Add missing imports in serializer tests
* Migrate tests 1500-2000
* Migrate regressions from 2000-2500
* Migrate regressions from 2501-3000
* Migrate regressions from 3000-3501
* Migrate regressions from 3501-4000
* Migrate regressions from 4001-4500
* Migrate regressions from 4501-5000
* Migrate regressions from 5001-5501
* Migrate regressions from 5501 to 7000
* Migrate regressions from 7001 to 8000
* Migrate remaining regression tests
* Fixing missing imports
* Update docs with new system [ci skip]
* Update CONTRIBUTING.md
- Fix formatting
- Update wording
* Remove lemmatizer tests in el lang
* Move a few tests into the general tokenizer
* Separate Doc and DocBin tests
* Added Slovak
* Added Slovenian tests
* Added Estonian tests
* Added Croatian tests
* Added Latvian tests
* Added Icelandic tests
* Added Afrikaans tests
* Added language-independent tests
* Added Kannada tests
* Tidied up
* Added Albanian tests
* Formatted with black
* Added failing tests for anomalies
* Update spacy/tests/lang/af/test_text.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Estonian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Croatian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Icelandic tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Latvian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Slovak tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Slovenian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>