* Add default to MorphAnalysis.get
Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for
the user to provide a default return value if the field is not found.
The default return value remains `[]`, which is not the same as
`dict.get`, but is already established as this method's default return
value with the return type `List[str]`. However the new `default` option
does not enforce that the user-provided default is actually `List[str]`.
* Restore test case
* Add span_id to Span.char_span, update Doc/Span.char_span docs
`Span.char_span(id=)` should be removed in the future.
* Also use Union[int, str] in Doc docstring
* Convert all individual values explicitly to uint64 for array-based doc representations
* Temporarily test with latest numpy v1.24.0rc
* Remove unnecessary conversion from attr_t
* Reduce number of individual casts
* Convert specifically from int32 to uint64
* Revert "Temporarily test with latest numpy v1.24.0rc"
This reverts commit eb0e3c5006.
* Also use int32 in tests
* Fix multiple extensions and character offset
* Rename token_start/end to start/end
* Refactor Doc.from_json based on review
* Iterate over user_data items
* Only add non-empty underscore entries
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Remove side effects from Doc.__init__()
* Changes based on review comment
* Readd test
* Change interface of Doc.__init__()
* Simplify test
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update doc.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add token and span custom attributes to to_json()
* Change logic for to_json
* Add functionality to from_json
* Small adjustments
* Move token/span attributes to new dict key
* Fix test
* Fix the same test but much better
* Add backwards compatibility tests and adjust logic
* Add test to check if attributes not set in underscore are not saved in the json
* Add tests for json compatibility
* Adjust test names
* Fix tests and clean up code
* Fix assert json tests
* small adjustment
* adjust naming and code readability
* Adjust naming, added more tests and changed logic
* Fix typo
* Adjust errors, naming, and small test optimization
* Fix byte tests
* Fix bytes tests
* Change naming and json structure
* update schema
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/tokens/doc.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/tokens/doc.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update schema for underscore attributes
* Adjust underscore schema
* adjust schema tests
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add SpanRuler component
Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.
* Update spacy/pipeline/span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix cast
* Add self.key property
* Use number of patterns as length
* Remove patterns kwarg from init
* Update spacy/tests/pipeline/test_span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add options for spans filter and setting to ents
* Add `spans_filter` option as a registered function'
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.
* Update and generalize tests
* Add test for setting doc.ents, fix key property type
* Fix typing
* Allow independent doc.spans and doc.ents
* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_fitler`.
* Use `util.filter_spans` by default as `ents_filter`.
* Use a custom warning if the filter does not work for `doc.ents`.
* Enable use of SpanC.id in Span
* Support id in SpanRuler as Span.id
* Update types
* `id` can only be provided as string (already by `PatternType`
definition)
* Update all uses of Span.id/ent_id in Doc
* Rename Span id kwarg to span_id
* Update types and docs
* Add ents filter to mimic EntityRuler overwrite_ents
* Refactor `ents_filter` to take `entities, spans` args for more
filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
(`spacy.first_longest_spans_filter.v1`) to take any number of
`Iterable[Span]` objects as args so it can be used for spans filter
or ents filter
* Implement future entity ruler as span ruler
Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
customized. (Necessary for the same sorting/filtering as in
`EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
`preserve_existing_ents_filter` to support
`EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
`entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
properties.
Additional changes:
* Move all config settings to top-level attributes to avoid duplicating
settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
casting.)
* Format
* Fix filter make method name
* Refactor to use same error for removing by label or ID
* Also provide existing spans to spans filter
* Support ids property
* Remove token_patterns and phrase_patterns
* Update docstrings
* Add span ruler docs
* Fix types
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Move sorting into filters
* Check for all tokens in seen tokens in entity ruler filters
* Remove registered sort key
* Set Token.ent_id in a backwards-compatible way in Doc.set_ents
* Remove sort options from API docs
* Update docstrings
* Rename entity ruler filters
* Fix and parameterize scoring
* Add id to Span API docs
* Fix typo in API docs
* Include explicit labeled=True for scorer
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects
* Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap
* Added a method to efficiently merge SpanGroups
* Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge
* Renamed merge to concat and added missing things to documentation
* Added operator+ and operator += in the documentation
* Added a test for Doc deallocation
* Update spacy/tokens/span_group.pyx
* Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal fnction
* Fixed typos in SpanGroup documentation
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation
* SpanGroup: moved repetitive list index check/adjustment in a separate function
* Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove formatting that hurts readability spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Support more internal methods for SpanGroup
Add support for:
* `__setitem__`
* `__delitem__`
* `__iadd__`: for `SpanGroup` or `Iterable[Span]`
* `__add__`: for `SpanGroup` only
Adapted from #9698 with the scope limited to the magic methods.
* Use v3.3 as new version in docs
* Add new tag to SpanGroup.copy in API docs
* Remove duplicate import
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remaining suggestions and formatting
Co-authored-by: nrodnova <nrodnova@hotmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>
* remove duplicate line
* add sent start/end token attributes to the docs
* let has_annotation work with IS_SENT_END
* elif instead of if
* add has_annotation test for sent attributes
* fix typo
* remove duplicate is_sent_start entry in docs
* Corrected Span's __richcmp__ implementation to take end, label and kb_id in consideration
* Updated test
* Updated test
* Removed formatting from a test for readability sake
* Use same tuples for all comparisons
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Added sents property to Span class that returns a generator of sentences the Span belongs to
* Added description to Span.sents property
* Update test_span to clarify the difference between span.sent and span.sents
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/tests/doc/test_span.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix documentation typos in spacy/tokens/span.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update Span.sents doc string in spacy/tokens/span.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Parametrized test_span_spans
* Corrected Span.sents to check for span-level hook first. Also, made Span.sent respect doc-level sents hook if no span-level hook is provided
* Corrected Span ocumentation copy/paste issue
* Put back accidentally deleted lines
* Fixed formatting in span.pyx
* Moved check for SENT_START annotation after user hooks in Span.sents
* add version where the property was introduced
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Migrate regressions 1-1000
* Move serialize test to correct file
* Remove tests that won't work in v3
* Migrate regressions 1000-1500
Removed regression test 1250 because v3 doesn't support the old LEX
scheme anymore.
* Add missing imports in serializer tests
* Migrate tests 1500-2000
* Migrate regressions from 2000-2500
* Migrate regressions from 2501-3000
* Migrate regressions from 3000-3501
* Migrate regressions from 3501-4000
* Migrate regressions from 4001-4500
* Migrate regressions from 4501-5000
* Migrate regressions from 5001-5501
* Migrate regressions from 5501 to 7000
* Migrate regressions from 7001 to 8000
* Migrate remaining regression tests
* Fixing missing imports
* Update docs with new system [ci skip]
* Update CONTRIBUTING.md
- Fix formatting
- Update wording
* Remove lemmatizer tests in el lang
* Move a few tests into the general tokenizer
* Separate Doc and DocBin tests
* Validate pos values when creating Doc
* Add clear error when setting invalid pos
This also changes the error language slightly.
* Fix variable name
* Update spacy/tokens/doc.pyx
* Test that setting invalid pos raises an error
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Support a cfg field in transition system
* Make NER 'has gold' check use right alignment for span
* Pass 'negative_samples_key' property into NER transition system
* Add field for negative samples to NER transition system
* Check neg_key in NER has_gold
* Support negative examples in NER oracle
* Test for negative examples in NER
* Fix name of config variable in NER
* Remove vestiges of old-style partial annotation
* Remove obsolete tests
* Add comment noting lack of support for negative samples in parser
* Additions to "neg examples" PR (#8201)
* add custom error and test for deprecated format
* add test for unlearning an entity
* add break also for Begin's cost
* add negative_samples_key property on Parser
* rename
* extend docs & fix some older docs issues
* add subclass constructors, clean up tests, fix docs
* add flaky test with ValueError if gold parse was not found
* remove ValueError if n_gold == 0
* fix docstring
* Hack in environment variables to try out training
* Remove hack
* Remove NER hack, and support 'negative O' samples
* Fix O oracle
* Fix transition parser
* Remove 'not O' from oracle
* Fix NER oracle
* check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation
* use set instead of list in consistency check
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Change span lemmas to use original whitespace (fix#8368)
This is a redo of #8371 based off master.
The test for this required some changes to existing tests. I don't think
the changes were significant but I'd like someone to check them.
* Remove mystery docstring
This sentence was uncompleted for years, and now we will never know how
it ends.
* Fill in deps if not provided with heads
Before this change, if heads were passed without deps they would be
silently ignored, which could be confusing. See #8334.
* Use "dep" instead of a blank string
This is the customary placeholder dep. It might be better to show an
error here instead though.
* Throw error on heads without deps
* Add a test
* Fix tests
* Formatting
* Fix all tests
* Fix a test I missed
* Revise error message
* Clean up whitespace
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix range in Span.get_lca_matrix
Fix the adjusted token index / lca matrix index ranges for
`_get_lca_matrix` for spans.
* The range for `k` should correspond to the adjusted indices in
`lca_matrix` with the `start` indexed at `0`
* Update test for v3.x
* Handle partial entities in Span.as_doc
In `Span.as_doc` replace partial entities at the beginning or end of the
span with missing entity annotation.
Fixes a bug where invalid entity annotation (no initial `B`) was
returned for an initial partial entity.
* Check for empty span in ents conversion
Note: `Span.as_doc()` will still fail on an empty span due to failures
in `Span.vector`.
* Adjust custom extension data when copying user data in `Span.as_doc()`
* Restrict `Doc.from_docs()` to adjusting offsets for custom extension
data
* Update test to use extension
* (Duplicate bug fix for character offset from #7497)
Merge data from `doc.spans` in `Doc.from_docs()`.
* Fix internal character offset set when merging empty docs (only
affects tokens and spans in `user_data` if an empty doc is in the list
of docs)
In the retokenizer, only reset sent starts (with
`set_children_from_head`) if the doc is parsed. If there is no parse,
merged tokens have the unset `token.is_sent_start == None` by default after
retokenization.