* Edited Slovenian stop words list (#9707)
* Noun chunks for Italian (#9662)
* added it vocab
* copied portuguese
* added possessive determiner
* added conjed Nps
* added nmoded Nps
* test misc
* more examples
* fixed typo
* fixed parenth
* fixed comma
* comma fix
* added syntax iters
* fix some index problems
* fixed index
* corrected heads for test case
* fixed tets case
* fixed determiner gender
* cleaned left over
* added example with apostophe
* French NP review (#9667)
* adapted from pt
* added basic tests
* added fr vocab
* fixed noun chunks
* more examples
* typo fix
* changed naming
* changed the naming
* typo fix
* Add Japanese kana characters to default exceptions (fix#9693) (#9742)
This includes the main kana, or phonetic characters, used in Japanese.
There are some supplemental kana blocks in Unicode outside the BMP that
could also be included, but because their actual use is rare I omitted
them for now, but maybe they should be added. The omitted blocks are:
- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension
* Remove NER words from stop words in Norwegian (#9820)
Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.
Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.
See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831
* Bump sudachipy version
* Update sudachipy versions
* Bump versions
Bumping to the most recent dictionary just to keep thing current.
Bumping sudachipy to 5.2 because older versions don't support recent
dictionaries.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
Co-authored-by: Duygu Altinok <duygu@explosion.ai>
Co-authored-by: Haakon Meland Eriksen <haakon.eriksen@far.no>
* Added Slovak
* Added Slovenian tests
* Added Estonian tests
* Added Croatian tests
* Added Latvian tests
* Added Icelandic tests
* Added Afrikaans tests
* Added language-independent tests
* Added Kannada tests
* Tidied up
* Added Albanian tests
* Formatted with black
* Added failing tests for anomalies
* Update spacy/tests/lang/af/test_text.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Estonian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Croatian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Icelandic tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Latvian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Slovak tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Slovenian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add ancient Greek language support
Initial commit
* Contributor Agreement
* grc tokenizer test added and files formatted with black, unnecessary import removed
Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Commas in lists fixed. __init__py added to test
* Update lex_attrs.py
* Update stop_words.py
* Update stop_words.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* ✨ implement noun_chunks for dutch language
* copy/paste FR and SV syntax iterators to accomodate UD tags
* added tests with dutch text
* signed contributor agreement
* 🐛 fix noun chunks generator
* built from scratch
* define noun chunk as a single Noun-Phrase
* includes some corner cases debugging (incorrect POS tagging)
* test with provided annotated sample (POS, DEP)
* ✅ fix failing test
* CI pipeline did not like the added sample file
* add the sample as a pytest fixture
* Update spacy/lang/nl/syntax_iterators.py
* Update spacy/lang/nl/syntax_iterators.py
Code readability
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/tests/lang/nl/test_noun_chunks.py
correct comment
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* finalize code
* change "if next_word" into "if next_word is not None"
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese
Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
* Add Amharic to space
* clean up
* Add some PRON_LEMMA
* add Tigrinya support
* remove text_noun_chunks
* Tigrinya Support
* added some more details for ti
* fix unit test
* add amharic char range
* changes from review
* amharic and tigrinya share same unicode block
* get rid of _amharic/_tigrinya in char_classes
Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>
* Include Macedonian language
* Fix indentation at char_classes.py
* Fix indentation at char_classes.py
* Add Macedonian tests, update lex_attrs and char_classes
* Import unicode literals for python 2
* added tr_vocab to config
* basic test
* added syntax iterator to Turkish lang class
* first version for Turkish syntax iter, without flat
* added simple tests with nmod, amod, det
* more tests to amod and nmod
* separated noun chunks and parser test
* rearrangement after nchunk parser separation
* added recursive NPs
* tests with complicated recursive NPs
* tests with conjed NPs
* additional tests for conj NP
* small modification for shaving off conj from NP
* added tests with flat
* more tests with flat
* added examples with flats conjed
* added inner func for flat trick
* corrected parse
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* added tr_vocab to config
* basic test
* added syntax iterator to Turkish lang class
* first version for Turkish syntax iter, without flat
* added simple tests with nmod, amod, det
* more tests to amod and nmod
* separated noun chunks and parser test
* rearrangement after nchunk parser separation
* added recursive NPs
* tests with complicated recursive NPs
* tests with conjed NPs
* additional tests for conj NP
* small modification for shaving off conj from NP
* added tests with flat
* more tests with flat
* added examples with flats conjed
* added inner func for flat trick
* corrected parse
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Hindi: Adds tests for lexical attributes (norm and like_num)
* Signs and sdds the contributor agreement
* Add ordinal numbers to be tagged as like_num
* Adds alternate pronunciation for 31 and 39
* Create lex_attrs.py
Hello,
I am missing a CZECH language in SpaCy. So I would like to help to push it a little. This file is base on others lex_attrs.py files just with translation to Czech.
* Update __init__.py
Updated for use with new Czech Lex_attrs file
* Update stop_words.py
* Create test_text.py
* add like_num testing for czech
Co-authored-by: holubvl3 <47881982+holubvl3@users.noreply.github.com>
Co-authored-by: holubvl3 <vilemrousi@gmail.com>
Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>
* Add Lemmatizer and simplify related components
* Add `Lemmatizer` pipe with `lookup` and `rule` modes using the
`Lookups` tables.
* Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma)
* Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer,
or morph rules)
* Remove lemmatizer from `Vocab`
* Adjust many many tests
Differences:
* No default lookup lemmas
* No special treatment of TAG in `from_array` and similar required
* Easier to modify labels in a `Tagger`
* No extra strings added from morphology / tag map
* Fix test
* Initial fix for Lemmatizer config/serialization
* Adjust init test to be more generic
* Adjust init test to force empty Lookups
* Add simple cache to rule-based lemmatizer
* Convert language-specific lemmatizers
Convert language-specific lemmatizers to component lemmatizers. Remove
previous lemmatizer class.
* Fix French and Polish lemmatizers
* Remove outdated UPOS conversions
* Update Russian lemmatizer init in tests
* Add minimal init/run tests for custom lemmatizers
* Add option to overwrite existing lemmas
* Update mode setting, lookup loading, and caching
* Make `mode` an immutable property
* Only enforce strict `load_lookups` for known supported modes
* Move caching into individual `_lemmatize` methods
* Implement strict when lang is not found in lookups
* Fix tables/lookups in make_lemmatizer
* Reallow provided lookups and allow for stricter checks
* Add lookups asset to all Lemmatizer pipe tests
* Rename lookups in lemmatizer init test
* Clean up merge
* Refactor lookup table loading
* Add helper from `load_lemmatizer_lookups` that loads required and
optional lookups tables based on settings provided by a config.
Additional slight refactor of lookups:
* Add `Lookups.set_table` to set a table from a provided `Table`
* Reorder class definitions to be able to specify type as `Table`
* Move registry assets into test methods
* Refactor lookups tables config
Use class methods within `Lemmatizer` to provide the config for
particular modes and to load the lookups from a config.
* Add pipe and score to lemmatizer
* Simplify Tagger.score
* Add missing import
* Clean up imports and auto-format
* Remove unused kwarg
* Tidy up and auto-format
* Update docstrings for Lemmatizer
Update docstrings for Lemmatizer.
Additionally modify `is_base_form` API to take `Token` instead of
individual features.
* Update docstrings
* Remove tag map values from Tagger.add_label
* Update API docs
* Fix relative link in Lemmatizer API docs
* Update with WIP
* Update with WIP
* Update with pipeline serialization
* Update types and pipe factories
* Add deep merge, tidy up and add tests
* Fix pipe creation from config
* Don't validate default configs on load
* Update spacy/language.py
Co-authored-by: Ines Montani <ines@ines.io>
* Adjust factory/component meta error
* Clean up factory args and remove defaults
* Add test for failing empty dict defaults
* Update pipeline handling and methods
* provide KB as registry function instead of as object
* small change in test to make functionality more clear
* update example script for EL configuration
* Fix typo
* Simplify test
* Simplify test
* splitting pipes.pyx into separate files
* moving default configs to each component file
* fix batch_size type
* removing default values from component constructors where possible (TODO: test 4725)
* skip instead of xfail
* Add test for config -> nlp with multiple instances
* pipeline.pipes -> pipeline.pipe
* Tidy up, document, remove kwargs
* small cleanup/generalization for Tok2VecListener
* use DEFAULT_UPSTREAM field
* revert to avoid circular imports
* Fix tests
* Replace deprecated arg
* Make model dirs require config
* fix pickling of keyword-only arguments in constructor
* WIP: clean up and integrate full config
* Add helper to handle function args more reliably
Now also includes keyword-only args
* Fix config composition and serialization
* Improve config debugging and add visual diff
* Remove unused defaults and fix type
* Remove pipeline and factories from meta
* Update spacy/default_config.cfg
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/default_config.cfg
* small UX edits
* avoid printing stack trace for debug CLI commands
* Add support for language-specific factories
* specify the section of the config which holds the model to debug
* WIP: add Language.from_config
* Update with language data refactor WIP
* Auto-format
* Add backwards-compat handling for Language.factories
* Update morphologizer.pyx
* Fix morphologizer
* Update and simplify lemmatizers
* Fix Japanese tests
* Port over tagger changes
* Fix Chinese and tests
* Update to latest Thinc
* WIP: xfail first Russian lemmatizer test
* Fix component-specific overrides
* fix nO for output layers in debug_model
* Fix default value
* Fix tests and don't pass objects in config
* Fix deep merging
* Fix lemma lookup data registry
Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)
* Add types
* Add Vocab.from_config
* Fix typo
* Fix tests
* Make config copying more elegant
* Fix pipe analysis
* Fix lemmatizers and is_base_form
* WIP: move language defaults to config
* Fix morphology type
* Fix vocab
* Remove comment
* Update to latest Thinc
* Add morph rules to config
* Tidy up
* Remove set_morphology option from tagger factory
* Hack use_gpu
* Move [pipeline] to top-level block and make [nlp.pipeline] list
Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them
* Fix use_gpu and resume in CLI
* Auto-format
* Remove resume from config
* Fix formatting and error
* [pipeline] -> [components]
* Fix types
* Fix tagger test: requires set_morphology?
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Refactor Chinese tokenizer configuration
Refactor `ChineseTokenizer` configuration so that it uses a single
`segmenter` setting to choose between character segmentation, jieba, and
pkuseg.
* replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting
`segmenter` with the supported values: `char`, `jieba`, `pkuseg`
* make the default segmenter plain character segmentation `char` (no
additional libraries required)
* Fix Chinese serialization test to use char default
* Warn if attempting to customize other segmenter
Add a warning if `Chinese.pkuseg_update_user_dict` is called when
another segmenter is selected.