* Switch to mecab-ko as default Korean tokenizer
Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.
Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.
* Temporarily run tests with mecab-ko tokenizer
* Fix types
* Fix duplicate test names
* Update requirements test
* Revert "Temporarily run tests with mecab-ko tokenizer"
This reverts commit d2083e7044.
* Add mecab_args setting, fix pickle for KoreanNattoTokenizer
* Fix length check
* Update docs
* Formatting
* Update natto-py error message
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
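A minimal sketch of how the previous tokenizer might be selected through a config override. The registered name `spacy.KoreanNattoTokenizer.v1` comes from the notes above; placing it under `nlp.tokenizer` follows spaCy's usual config layout and is an assumption here, and `natto_config` is an illustrative variable name.

```python
# Config override that selects the natto-py based tokenizer by its
# registered name. Assumption: the tokenizer is configured under
# [nlp.tokenizer], as in other spaCy language configs.
natto_config = {
    "nlp": {
        "tokenizer": {
            "@tokenizers": "spacy.KoreanNattoTokenizer.v1",
        }
    }
}
```

With spaCy, natto-py and a MeCab installation available, a dict like this would typically be passed as `spacy.blank("ko", config=natto_config)`.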
* Add token and span custom attributes to to_json()
* Change logic for to_json
* Add functionality to from_json
* Small adjustments
* Move token/span attributes to new dict key
* Fix test
* Fix the same test but much better
* Add backwards compatibility tests and adjust logic
* Add test to check if attributes not set in underscore are not saved in the json
* Add tests for json compatibility
* Adjust test names
* Fix tests and clean up code
* Fix assert json tests
* small adjustment
* adjust naming and code readability
* Adjust naming, added more tests and changed logic
* Fix typo
* Adjust errors, naming, and small test optimization
* Fix byte tests
* Fix bytes tests
* Change naming and json structure
* update schema
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/tokens/doc.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/tokens/doc.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update schema for underscore attributes
* Adjust underscore schema
* adjust schema tests
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
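The idea behind these changes can be sketched in plain Python: custom attributes go under their own dict key, attributes that were never set are left out, and payloads without the key still load. This is not the code in `spacy/tokens/doc.pyx`; the function names, the `underscore_token` key and the entry layout are assumptions made for illustration.

```python
import json
from typing import Any, Dict, List


def attrs_to_json(tokens: List[str], token_attrs: Dict[str, Dict[int, Any]]) -> Dict[str, Any]:
    """Serialize tokens, keeping custom token attributes under a separate key."""
    data: Dict[str, Any] = {
        "tokens": [{"id": i, "text": text} for i, text in enumerate(tokens)]
    }
    # Only attributes that have at least one value set are written out.
    custom = {
        name: [{"token_start": i, "value": value} for i, value in sorted(values.items())]
        for name, values in token_attrs.items()
        if values
    }
    if custom:
        data["underscore_token"] = custom  # hypothetical key name
    return data


def attrs_from_json(data: Dict[str, Any]) -> Dict[str, Dict[int, Any]]:
    """Restore custom attributes; payloads without the key yield an empty dict."""
    restored: Dict[str, Dict[int, Any]] = {}
    for name, entries in data.get("underscore_token", {}).items():
        restored[name] = {entry["token_start"]: entry["value"] for entry in entries}
    return restored


payload = attrs_to_json(["Hello", "world"], {"is_greeting": {0: True}, "unused": {}})
round_tripped = attrs_from_json(json.loads(json.dumps(payload)))
```

Keeping the custom values outside the per-token entries means a reader that ignores the extra key sees the same token structure as before.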
* Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents`
* Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents`
* Make `Span.ent_id` an alias of `Span.id` rather than a read-only view
of the root token's `ent_id` annotation
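A simplified sketch of the three points above, using plain Python stand-ins rather than spaCy's Cython classes. `TokenSketch`, `SpanSketch` and `set_ents` are hypothetical names, and integer IDs stand in for the hashed string values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenSketch:
    ent_id: int = 0
    ent_kb_id: int = 0


class SpanSketch:
    def __init__(self, start: int, end: int, span_id: int = 0, kb_id: int = 0) -> None:
        self.start = start
        self.end = end
        self.id = span_id
        self.kb_id = kb_id

    @property
    def ent_id(self) -> int:
        # Alias: reads go straight to `id`, not to a token annotation.
        return self.id

    @ent_id.setter
    def ent_id(self, value: int) -> None:
        # Alias: writes update `id` as well.
        self.id = value


def set_ents(tokens: List[TokenSketch], spans: List[SpanSketch]) -> None:
    # Reset stale annotations first so removed entities leave no trace.
    for token in tokens:
        token.ent_id = 0
        token.ent_kb_id = 0
    # Then copy each span's id and kb_id onto every token it covers.
    for span in spans:
        for token in tokens[span.start:span.end]:
            token.ent_id = span.id
            token.ent_kb_id = span.kb_id


tokens = [TokenSketch(), TokenSketch(), TokenSketch(ent_id=7, ent_kb_id=9)]
span = SpanSketch(0, 2, span_id=5)
set_ents(tokens, [span])
```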
* adding spans to doc_annotation in Example.to_dict
* to_dict compatible with from_dict: tuples instead of spans
* use strings for label and kb_id
* Simplify test
* Update data formats docs
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Added examples for Slovene
* Update spacy/lang/sl/examples.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Corrected a typo in one of the sentences
* Updated support for Slovenian
* Some minor changes to corrections
* Added forint currency
* Corrected HYPHENS_PERMITTED regex and some formatting
* Minor changes
* Un-xfail tokenizer test
* Format
Co-authored-by: Luka Dragar <D20124481@mytudublin.ie>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Setters that take a different type than what the getter returns are still
problematic for MyPy. Replace the setter with a method, so that type inference
works everywhere.
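The pattern can be sketched as follows. `LabelHolder`, `labels` and `set_labels` are hypothetical names chosen for illustration; the note above does not say which attribute was affected.

```python
from typing import Iterable, Tuple


class LabelHolder:
    """Hypothetical container showing a read-only property plus a setter method."""

    def __init__(self) -> None:
        self._labels: Tuple[str, ...] = ()

    @property
    def labels(self) -> Tuple[str, ...]:
        # The getter always returns a tuple, so its type is unambiguous.
        return self._labels

    def set_labels(self, labels: Iterable[str]) -> None:
        # An explicit method can accept a broader input type than the
        # getter returns without confusing static type inference.
        self._labels = tuple(labels)


holder = LabelHolder()
holder.set_labels(["PERSON", "ORG"])
```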
* add additional REL_OP
* change to condition and new rel_op symbols
* add operators to docs
* add the anchor while we're in here
* add tests
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
* Min_max_operators
1. Modified API and Usage for spaCy website to include min_max operator
2. Modified matcher.pyx to include min_max function {n,m} and its variants
3. Modified schemas.py to include min_max validation error
4. Added test cases to test_matcher_api.py, test_matcher_logic.py and test_pattern_validation.py
* attempt to fix mypy/pydantic compat issue
* formatting
* Update spacy/tests/matcher/test_pattern_validation.py
Co-authored-by: Source-Shen <82353723+Source-Shen@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
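To illustrate what `{n,m}` and its variants express, here is a small standalone parser that turns an operator string into a `(min, max)` pair. It is a sketch and not the logic in `matcher.pyx` or `schemas.py`; `parse_min_max` is a hypothetical helper, and the accepted variants (`{n}`, `{n,m}`, `{n,}`, `{,m}`) are assumed.

```python
import re
from typing import Optional, Tuple

# Optional lower bound, optional comma, optional upper bound, inside braces.
_QUANTIFIER = re.compile(r"^\{(\d*)(,?)(\d*)\}$")


def parse_min_max(op: str) -> Tuple[int, Optional[int]]:
    """Return (minimum, maximum); a maximum of None means unbounded."""
    match = _QUANTIFIER.match(op)
    if match is None:
        raise ValueError(f"Not a min/max operator: {op!r}")
    low, comma, high = match.groups()
    if not low and not high:
        # "{}" and "{,}" carry no bounds at all.
        raise ValueError(f"Operator has no bounds: {op!r}")
    minimum = int(low) if low else 0
    if comma:
        maximum = int(high) if high else None
    else:
        # "{n}" means exactly n repetitions.
        maximum = minimum
    if maximum is not None and maximum < minimum:
        raise ValueError(f"Maximum is smaller than minimum: {op!r}")
    return minimum, maximum
```

Rejecting malformed operators up front mirrors the role of the validation error mentioned in item 3.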
Distinguish between vectors that are 0 vs. missing vectors when warning
about missing vectors.
Update `Doc.has_vector` to match `Span.has_vector` and
`Token.has_vector` for cases where the vocab has vectors but none of the
tokens in the container have vectors.
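The distinction can be sketched with a plain dict standing in for the vectors table. The helper names are hypothetical, and the real checks in spaCy operate on the vocab's vectors rather than a dict.

```python
from typing import Dict, List, Sequence

VectorTable = Dict[str, List[float]]


def vector_status(table: VectorTable, word: str) -> str:
    """Tell a missing vector apart from one that is present but all zeros."""
    if word not in table:
        return "missing"
    if not any(table[word]):
        return "zero"
    return "ok"


def container_has_vector(table: VectorTable, words: Sequence[str]) -> bool:
    # A non-empty table is not enough: at least one word in the container
    # needs an entry of its own.
    return any(word in table for word in words)


table: VectorTable = {"apple": [0.2, 0.5], "pad": [0.0, 0.0]}
```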
* Handle Russian, Ukrainian and Bulgarian
* Corrections
* Correction
* Correction to comment
* Changes based on review
* Correction
* Reverted irrelevant change in punctuation.py
* Remove unnecessary group
* Reverted accidental change
This change adds the new `activations` attribute to `Doc`. This
attribute can be used by trainable pipes to store their activations,
probabilities, and guesses for downstream users.
As an example, this change modifies the `tagger` and `senter` pipes to
add a `store_activations` option. When this option is enabled, the
probabilities and guesses are stored in `set_annotations`.
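A rough sketch of the behaviour described above, with plain Python objects in place of spaCy's `Doc` and pipes. `ActivationDoc`, the free function `set_annotations`, its parameters and the `"probabilities"` / `"guesses"` keys are assumptions for illustration.

```python
from typing import Dict, List


class ActivationDoc:
    """Hypothetical stand-in for a Doc with an `activations` attribute."""

    def __init__(self, words: List[str]) -> None:
        self.words = words
        self.tags: List[str] = []
        # Maps a pipe name to the values that pipe chose to store.
        self.activations: Dict[str, Dict[str, list]] = {}


def set_annotations(
    doc: ActivationDoc,
    probabilities: List[List[float]],
    labels: List[str],
    *,
    name: str = "tagger",
    store_activations: bool = False,
) -> None:
    # The guess for each token is the index of its highest probability.
    guesses = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
    doc.tags = [labels[i] for i in guesses]
    if store_activations:
        doc.activations[name] = {"probabilities": probabilities, "guesses": guesses}


doc = ActivationDoc(["dogs", "bark"])
set_annotations(doc, [[0.1, 0.9], [0.8, 0.2]], ["VERB", "NOUN"], store_activations=True)
```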
* Enable flag on spacy.load: foundation for include, enable arguments.
* Enable flag on spacy.load: fixed tests.
* Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests.
* Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added to default config.
* Enable flag on spacy.load: added support for fields not in pipeline.
* Enable flag on spacy.load: removed serialization fields from supported fields.
* Enable flag on spacy.load: removed 'enable' from config again.
* Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes.
* Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests.
* Enable flag on spacy.load: comments w.r.t. resolution workarounds.
* Enable flag on spacy.load: remove include fields. Update website docs.
* Enable flag on spacy.load: updates w.r.t. changes in master.
* Implement Doc.from_json(): update docstrings.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): remove newline.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): change error message for E1038.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars.
* Enable flag on spacy.load: changed exmples for enable flag.
* Remove newline.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix docstring for Language._resolve_component_status().
* Rename E1038 to E1042.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
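How an `enable` list might be resolved against `disable` can be sketched as below. This is not `Language._resolve_component_status()`; `resolve_disabled` is a hypothetical helper, and the consistency rule (an explicit `disable` list must match what `enable` implies) is an assumption.

```python
from typing import List, Sequence


def resolve_disabled(
    pipe_names: Sequence[str],
    disable: Sequence[str] = (),
    enable: Sequence[str] = (),
) -> List[str]:
    """Return the components to disable, in pipeline order."""
    if not enable:
        return [name for name in pipe_names if name in disable]
    # Everything not explicitly enabled is switched off.
    implied = [name for name in pipe_names if name not in enable]
    if disable and sorted(disable) != sorted(implied):
        raise ValueError("'enable' and 'disable' describe different pipelines")
    return implied


pipes = ["tagger", "parser", "ner"]
only_ner = resolve_disabled(pipes, enable=["ner"])
```

Raising on a mismatch, rather than silently preferring one argument, keeps a misspecified call from loading an unexpected pipeline.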
* account for NER labels with a hyphen in the name
* cleanup
* fix docstring
* add return type to helper method
* shorter method and few more occurrences
* use helper method across repo
* fix circular import
* partial revert to avoid circular import
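The hyphen problem is easy to reproduce in plain Python: splitting a tag such as `B-co-founder` on every hyphen breaks the label apart, while splitting only on the first hyphen keeps it whole. `split_tag` below is a hypothetical helper for illustration, not necessarily the one added to the repo.

```python
from typing import Tuple


def split_tag(tag: str) -> Tuple[str, str]:
    """Split a tag like 'B-ORG' into (prefix, label) at the first hyphen only."""
    prefix, _, label = tag.partition("-")
    return prefix, label


naive_parts = "B-co-founder".split("-")
prefix, label = split_tag("B-co-founder")
```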