* Edited Slovenian stop words list (#9707)
* Noun chunks for Italian (#9662)
* added it vocab
* copied portuguese
* added possessive determiner
* added conjed Nps
* added nmoded Nps
* test misc
* more examples
* fixed typo
* fixed parenth
* fixed comma
* comma fix
* added syntax iters
* fix some index problems
* fixed index
* corrected heads for test case
* fixed tets case
* fixed determiner gender
* cleaned left over
* added example with apostophe
* French NP review (#9667)
* adapted from pt
* added basic tests
* added fr vocab
* fixed noun chunks
* more examples
* typo fix
* changed naming
* changed the naming
* typo fix
* Add Japanese kana characters to default exceptions (fix#9693) (#9742)
This includes the main kana, or phonetic characters, used in Japanese.
There are some supplemental kana blocks in Unicode outside the BMP that
could also be included, but because their actual use is rare I omitted
them for now, but maybe they should be added. The omitted blocks are:
- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension
* Remove NER words from stop words in Norwegian (#9820)
Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.
Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.
See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831
* Bump sudachipy version
* Update sudachipy versions
* Bump versions
Bumping to the most recent dictionary just to keep thing current.
Bumping sudachipy to 5.2 because older versions don't support recent
dictionaries.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
Co-authored-by: Duygu Altinok <duygu@explosion.ai>
Co-authored-by: Haakon Meland Eriksen <haakon.eriksen@far.no>
* Refactor Docs.is_ flags
* Add derived `Doc.has_annotation` method
* `Doc.has_annotation(attr)` returns `True` for partial annotation
* `Doc.has_annotation(attr, require_complete=True)` returns `True` for
complete annotation
* Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced`
and `is_nered`
* Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs
for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The
list is the `DocBin` attributes list plus `SPACY` and `LENGTH`.
Notes on `Doc.has_annotation`:
* `HEAD` is converted to `DEP` because heads don't have an unset state
* Accept `IS_SENT_START` as a synonym of `SENT_START`
Additional changes:
* Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for
`DocBin`
* In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override
`SENT_START`
* In `Doc.from_array()` using `attrs` other than
`Doc._get_array_attrs()` (i.e., a user's custom list rather than our
default internal list) with both `HEAD` and `SENT_START` shows a warning
that `HEAD` will override `SENT_START`
* `set_children_from_heads` does not require dependency labels to set
sentence boundaries and sets `sent_start` for all non-sentence starts to
`-1`
* Fix call to set_children_form_heads
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Limiting noun_chunks for specific langauges
* Limiting noun_chunks for specific languages
Contributor Agreement
* Addressing review comments
* Removed unused fixtures and imports
* Add fa_tokenizer in test suite
* Use fa_tokenizer in test
* Undo extraneous reformatting
Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>