* Require that all SpanGroup spans are from the current doc
The restriction on only adding spans from the current doc was already
implemented for all operations except for `SpanGroup.__init__`.
Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in
order to validate the character offsets and to make it possible to copy
spans between documents with differing tokenization. Currently there is
no validation that the document texts are identical, but the span char
offsets must be valid spans in the target doc, which prevents you from
ending up with completely invalid spans.
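The character-offset check described above can be illustrated without spaCy. The following is a minimal sketch, not the library's implementation: `char_span_sketch` and the token offset lists are hypothetical names, and returning `None` for misaligned offsets is an assumption made for the sketch.

```python
from typing import List, Optional, Tuple


def char_span_sketch(
    token_offsets: List[Tuple[int, int]], start_char: int, end_char: int
) -> Optional[Tuple[int, int]]:
    """Map character offsets to a (start, end) token range, or None.

    token_offsets holds one (start_char, end_char) pair per token.
    """
    starts = {s: i for i, (s, _) in enumerate(token_offsets)}
    ends = {e: i for i, (_, e) in enumerate(token_offsets)}
    if start_char not in starts or end_char not in ends:
        return None  # offsets do not line up with token boundaries
    start, end = starts[start_char], ends[end_char] + 1
    return (start, end) if start < end else None


# The same text "New York" under two hypothetical tokenizations.
source_tokens = [(0, 3), (4, 8)]  # "New", "York"
target_tokens = [(0, 8)]          # "New York" as one token
```

A span covering characters 0 to 8 is valid in both tokenizations, while a span covering only characters 0 to 3 has no valid counterpart in `target_tokens`.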
* Undo change in test_beam_overfitting_IO
* Add default to MorphAnalysis.get
Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for
the user to provide a default return value if the field is not found.
The default return value remains `[]`, which is not the same as
`dict.get`, but is already established as this method's default return
value with the return type `List[str]`. However, the new `default` option
does not enforce that the user-provided default is actually `List[str]`.
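A minimal sketch of the `get` semantics described above, using a hypothetical stand-in class (`MorphSketch`) backed by a plain dict rather than the real `MorphAnalysis`:

```python
from typing import Dict, List, Optional


class MorphSketch:
    """Hypothetical stand-in for a morphological analysis."""

    def __init__(self, feats: Dict[str, List[str]]) -> None:
        self._feats = feats

    def get(self, field: str, default: Optional[List[str]] = None) -> List[str]:
        values = self._feats.get(field)
        if values:
            return list(values)
        # Unlike dict.get, a missing field yields [] unless a default is given.
        return [] if default is None else default


morph = MorphSketch({"Number": ["Sing"]})
```

As in the description, nothing checks the type of `default`, so `morph.get("Case", default="n/a")` would return a plain string.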
* Restore test case
* Enforce that Span.start/end(_char) remain valid and in sync
Allowing span attributes to be writable starting in v3 has made it
possible for the internal `Span.start/end/start_char/end_char` to get
out-of-sync or have invalid values.
This checks that the values are valid and syncs the token and char
offsets if any attributes are modified directly. It does not yet handle
the case where the underlying doc is modified.
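One way to picture the validate-and-sync behavior is a pure-Python class whose setters re-derive the character offsets. This is a sketch under assumptions: `SpanSketch` is a hypothetical name, tokens are modeled as `(start_char, end_char)` pairs, and empty spans are rejected for brevity.

```python
from typing import List, Tuple


class SpanSketch:
    """Hypothetical span over tokens stored as (start_char, end_char) pairs."""

    def __init__(self, tokens: List[Tuple[int, int]], start: int, end: int) -> None:
        self._tokens = tokens
        self._update(start, end)

    def _update(self, start: int, end: int) -> None:
        # Simplification: the sketch rejects empty spans (start == end).
        if not 0 <= start < end <= len(self._tokens):
            raise IndexError(f"invalid token offsets: start={start}, end={end}")
        self._start, self._end = start, end
        # Re-derive the character offsets so both views stay in sync.
        self.start_char = self._tokens[start][0]
        self.end_char = self._tokens[end - 1][1]

    @property
    def start(self) -> int:
        return self._start

    @start.setter
    def start(self, value: int) -> None:
        self._update(value, self._end)

    @property
    def end(self) -> int:
        return self._end

    @end.setter
    def end(self, value: int) -> None:
        self._update(self._start, value)


tokens = [(0, 5), (6, 11), (12, 13)]  # "Hello", "world", "!"
span = SpanSketch(tokens, 0, 2)
span.end = 3  # end_char is updated by the setter
```

Because validation happens before any assignment, a rejected write leaves the previous offsets untouched.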
* Format
* Init
* fix tests
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix test_blank_languages
* Rename xx to mul in docs
* Format _util with black
* prettier formatting
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add span_id to Span.char_span, update Doc/Span.char_span docs
`Span.char_span(id=)` should be removed in the future.
* Also use Union[int, str] in Doc docstring
* Init
* Fix return type for mypy
* adjust types and improve setting new attributes
* Add underscore changes to json conversion
* Add test and underscore changes to from_docs
* add underscore changes and test to span.to_doc
* update return values
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add types to function
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* adjust formatting
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* shorten return type
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add helper function to improve readability
* Improve code and add comments
* rerun azure tests
* Fix tests for json conversion
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Convert all individual values explicitly to uint64 for array-based doc representations
* Temporarily test with latest numpy v1.24.0rc
* Remove unnecessary conversion from attr_t
* Reduce number of individual casts
* Convert specifically from int32 to uint64
* Revert "Temporarily test with latest numpy v1.24.0rc"
This reverts commit eb0e3c5006.
* Also use int32 in tests
* remove sentiment attribute
* remove sentiment from docs
* add test for backwards compatibility
* replace from_disk with from_bytes
* Fix docs and format file
* Fix formatting
* Fix multiple extensions and character offset
* Rename token_start/end to start/end
* Refactor Doc.from_json based on review
* Iterate over user_data items
* Only add non-empty underscore entries
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Remove side effects from Doc.__init__()
* Changes based on review comment
* Readd test
* Change interface of Doc.__init__()
* Simplify test
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update doc.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add token and span custom attributes to to_json()
* Change logic for to_json
* Add functionality to from_json
* Small adjustments
* Move token/span attributes to new dict key
* Fix test
* Fix the same test but much better
* Add backwards compatibility tests and adjust logic
* Add test to check if attributes not set in underscore are not saved in the json
* Add tests for json compatibility
* Adjust test names
* Fix tests and clean up code
* Fix assert json tests
* small adjustment
* adjust naming and code readability
* Adjust naming, added more tests and changed logic
* Fix typo
* Adjust errors, naming, and small test optimization
* Fix byte tests
* Fix bytes tests
* Change naming and json structure
* update schema
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/tokens/doc.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/tokens/doc.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update schema for underscore attributes
* Adjust underscore schema
* adjust schema tests
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents`
* Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents`
* Make `Span.ent_id` an alias of `Span.id` rather than a read-only view
of the root token's `ent_id` annotation
* Add SpanRuler component
Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.
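The deduplicate-then-sort step can be sketched on plain tuples. The `(start, end, label)` layout and the name `merge_matches` are assumptions for illustration, not the component's internals.

```python
from typing import List, Tuple

Match = Tuple[int, int, str]  # (start, end, label), hypothetical layout


def merge_matches(
    token_matches: List[Match], phrase_matches: List[Match]
) -> List[Match]:
    """Combine both matchers' output, drop exact duplicates, and sort."""
    return sorted(set(token_matches + phrase_matches))


token_matches = [(3, 5, "ORG"), (0, 2, "GPE")]
phrase_matches = [(0, 2, "GPE"), (0, 3, "ORG")]
merged = merge_matches(token_matches, phrase_matches)
```

The overlapping matches `(0, 2, "GPE")` and `(0, 3, "ORG")` both survive, since only exact duplicates are removed.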
* Update spacy/pipeline/span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix cast
* Add self.key property
* Use number of patterns as length
* Remove patterns kwarg from init
* Update spacy/tests/pipeline/test_span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add options for spans filter and setting to ents
* Add `spans_filter` option as a registered function
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.
* Update and generalize tests
* Add test for setting doc.ents, fix key property type
* Fix typing
* Allow independent doc.spans and doc.ents
* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_filter`.
* Use `util.filter_spans` by default as `ents_filter`.
* Use a custom warning if the filter does not work for `doc.ents`.
* Enable use of SpanC.id in Span
* Support id in SpanRuler as Span.id
* Update types
* `id` can only be provided as string (already by `PatternType`
definition)
* Update all uses of Span.id/ent_id in Doc
* Rename Span id kwarg to span_id
* Update types and docs
* Add ents filter to mimic EntityRuler overwrite_ents
* Refactor `ents_filter` to take `entities, spans` args for more
filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
(`spacy.first_longest_spans_filter.v1`) to take any number of
`Iterable[Span]` objects as args so it can be used for spans filter
or ents filter
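A filter that prefers the longest span, then the earliest one, and accepts any number of span iterables could look roughly like this. It is a sketch on `(start, end)` tuples with a hypothetical function name, not the registered filter itself.

```python
from itertools import chain
from typing import Iterable, List, Set, Tuple

SpanT = Tuple[int, int]  # (start, end) token offsets, end exclusive


def first_longest_filter(*span_groups: Iterable[SpanT]) -> List[SpanT]:
    """Keep the longest spans, preferring earlier ones on ties, without overlaps."""
    candidates = sorted(
        chain(*span_groups), key=lambda s: (s[1] - s[0], -s[0]), reverse=True
    )
    seen: Set[int] = set()
    kept: List[SpanT] = []
    for start, end in candidates:
        # Longer spans come first, so checking both edges detects any overlap.
        if start not in seen and end - 1 not in seen:
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


result = first_longest_filter([(0, 2), (1, 4)], [(3, 5), (5, 6)])
```

Taking `*span_groups` is what lets the same function serve as a spans filter (one iterable) or as an ents filter (existing entities plus new spans).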
* Implement future entity ruler as span ruler
Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
customized. (Necessary for the same sorting/filtering as in
`EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
`preserve_existing_ents_filter` to support
`EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
`entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
properties.
Additional changes:
* Move all config settings to top-level attributes to avoid duplicating
settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
casting.)
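The two `overwrite_ents` behaviors can be contrasted in a small sketch. The function names are shortened and hypothetical, spans are plain `(start, end)` tuples, and the longest-first ordering is an assumption borrowed from the span filter.

```python
from typing import Iterable, List, Set, Tuple

SpanT = Tuple[int, int]  # (start, end) token offsets, end exclusive


def _longest_first(spans: Iterable[SpanT]) -> List[SpanT]:
    return sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)


def overwrite_overlapping(entities: Iterable[SpanT], spans: Iterable[SpanT]) -> List[SpanT]:
    """New spans win: existing entities that overlap a new span are dropped."""
    entities = list(entities)
    new: List[SpanT] = []
    seen: Set[int] = set()
    for start, end in _longest_first(spans):
        if seen.isdisjoint(range(start, end)):
            new.append((start, end))
            entities = [e for e in entities if not (e[0] < end and e[1] > start)]
            seen.update(range(start, end))
    return sorted(entities + new)


def preserve_existing(entities: Iterable[SpanT], spans: Iterable[SpanT]) -> List[SpanT]:
    """Existing entities win: new spans are added only where no entity exists."""
    entities = list(entities)
    seen: Set[int] = {i for s, e in entities for i in range(s, e)}
    new: List[SpanT] = []
    for start, end in _longest_first(spans):
        if seen.isdisjoint(range(start, end)):
            new.append((start, end))
            seen.update(range(start, end))
    return sorted(entities + new)


existing = [(0, 2), (6, 7)]
matched = [(1, 4), (4, 5)]
```

With these inputs, the first function replaces `(0, 2)` with `(1, 4)`, while the second keeps `(0, 2)` and discards `(1, 4)`.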
* Format
* Fix filter make method name
* Refactor to use same error for removing by label or ID
* Also provide existing spans to spans filter
* Support ids property
* Remove token_patterns and phrase_patterns
* Update docstrings
* Add span ruler docs
* Fix types
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Move sorting into filters
* Check for all tokens in seen tokens in entity ruler filters
* Remove registered sort key
* Set Token.ent_id in a backwards-compatible way in Doc.set_ents
* Remove sort options from API docs
* Update docstrings
* Rename entity ruler filters
* Fix and parameterize scoring
* Add id to Span API docs
* Fix typo in API docs
* Include explicit labeled=True for scorer
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added new convenience Cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects
* Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap
* Added a method to efficiently merge SpanGroups
* Added __delitem__, __add__ and __iadd__. Also allowed passing span lists to the merge function. Replaced the extend() body with a call to merge
* Renamed merge to concat and added missing things to documentation
* Added operator+ and operator += in the documentation
* Added a test for Doc deallocation
* Update spacy/tokens/span_group.pyx
* Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal function
* Fixed typos in SpanGroup documentation
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation
* SpanGroup: moved repetitive list index check/adjustment in a separate function
* Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove formatting that hurts readability in spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Support more internal methods for SpanGroup
Add support for:
* `__setitem__`
* `__delitem__`
* `__iadd__`: for `SpanGroup` or `Iterable[Span]`
* `__add__`: for `SpanGroup` only
Adapted from #9698 with the scope limited to the magic methods.
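The asymmetry between `__iadd__` (group or iterable) and `__add__` (group only) can be shown with a list-backed sketch. `SpanGroupSketch` is a hypothetical stand-in, and the index helper only illustrates the kind of check/adjustment a container like this needs.

```python
from typing import Iterable, List, Tuple

SpanT = Tuple[int, int]


class SpanGroupSketch:
    """Hypothetical list-backed stand-in for a span group."""

    def __init__(self, spans: Iterable[SpanT] = ()) -> None:
        self._spans: List[SpanT] = list(spans)

    def _normalize(self, i: int) -> int:
        # Shared index check/adjustment for negative and out-of-range indices.
        n = len(self._spans)
        if i < 0:
            i += n
        if not 0 <= i < n:
            raise IndexError("span group index out of range")
        return i

    def __len__(self) -> int:
        return len(self._spans)

    def __getitem__(self, i: int) -> SpanT:
        return self._spans[self._normalize(i)]

    def __setitem__(self, i: int, span: SpanT) -> None:
        self._spans[self._normalize(i)] = span

    def __delitem__(self, i: int) -> None:
        del self._spans[self._normalize(i)]

    def __iadd__(self, other: "SpanGroupSketch | Iterable[SpanT]") -> "SpanGroupSketch":
        spans = other._spans if isinstance(other, SpanGroupSketch) else other
        self._spans.extend(spans)
        return self

    def __add__(self, other: object) -> "SpanGroupSketch":
        if not isinstance(other, SpanGroupSketch):
            return NotImplemented  # only group + group is supported
        return SpanGroupSketch(self._spans + other._spans)


group = SpanGroupSketch([(0, 1), (1, 2)])
group[-1] = (1, 3)
del group[0]
group += [(4, 5)]
combined = group + SpanGroupSketch([(7, 8)])
```

In the sketch, `group + [(1, 2)]` raises `TypeError`, while `group += [(1, 2)]` extends the group in place.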
* Use v3.3 as new version in docs
* Add new tag to SpanGroup.copy in API docs
* Remove duplicate import
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remaining suggestions and formatting
Co-authored-by: nrodnova <nrodnova@hotmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>
* remove duplicate line
* add sent start/end token attributes to the docs
* let has_annotation work with IS_SENT_END
* elif instead of if
* add has_annotation test for sent attributes
* fix typo
* remove duplicate is_sent_start entry in docs
* Corrected Span's __richcmp__ implementation to take end, label and kb_id into consideration
* Updated test
* Updated test
* Removed formatting from a test for readability's sake
* Use same tuples for all comparisons
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
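Comparing spans through one shared tuple keeps every comparison operator consistent. The sketch below uses a hypothetical `SpanKey` dataclass with string labels; `order=True` derives all comparisons from the same field tuple. How the real implementation represents labels is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class SpanKey:
    """Hypothetical comparison key: every operator uses the same field tuple."""

    start: int
    end: int
    label: str = ""
    kb_id: str = ""


a = SpanKey(0, 2, "ORG")
b = SpanKey(0, 3, "ORG")
c = SpanKey(0, 2, "GPE")
```

Two spans with the same `start` but a different `end` or `label` no longer compare equal, which is the case the corrected comparison addresses.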
* Span/SpanGroup: wrap SpanC in shared_ptr
When a Span that was retrieved from a SpanGroup was modified, these
changes were not reflected in the SpanGroup because the underlying
SpanC struct was copied.
This change applies the solution proposed by @nrodnova, to wrap SpanC in
a shared_ptr. This makes a SpanGroup and Spans derived from it share the
same SpanC. So, changes made through a Span are visible in the SpanGroup
as well.
Fixes #9556
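The difference between copying the underlying struct and sharing it can be mimicked in Python, where an object reference plays the role of the `shared_ptr`. This is an analogy only: `SpanData`, `CopyingGroup`, and `SharingGroup` are hypothetical names and do not mirror the Cython code.

```python
import copy
from typing import List


class SpanData:
    """Hypothetical stand-in for the underlying span struct."""

    def __init__(self, start: int, end: int, label: str = "") -> None:
        self.start, self.end, self.label = start, end, label


class CopyingGroup:
    """Old behavior: items are copied on the way in and on the way out."""

    def __init__(self) -> None:
        self._items: List[SpanData] = []

    def append(self, data: SpanData) -> None:
        self._items.append(copy.copy(data))

    def __getitem__(self, i: int) -> SpanData:
        return copy.copy(self._items[i])


class SharingGroup:
    """New behavior: the group and its spans refer to the same object."""

    def __init__(self) -> None:
        self._items: List[SpanData] = []

    def append(self, data: SpanData) -> None:
        self._items.append(data)

    def __getitem__(self, i: int) -> SpanData:
        return self._items[i]


copying, sharing = CopyingGroup(), SharingGroup()
copying.append(SpanData(0, 2, "A"))
sharing.append(SpanData(0, 2, "A"))
copying[0].label = "B"  # modifies a throwaway copy
sharing[0].label = "B"  # modifies the stored object
```

After these writes, only the sharing group reflects the new label.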
* Test that a SpanGroup is modified through its Spans
* SpanGroup.push_back: remove nogil
Modifying std::vector is not thread-safe.
* C++ >= 11 does not allow const T in vector<T>
* Add Span.span_c as a shorthand for Span.c.get
Since this method is cdef'ed, it is only visible from Cython, so we
avoid using raw pointers in Python.
Replace existing uses of span.c.get() with this new method.
* Fix formatting
* Style fix: pointer types
* SpanGroup.to_bytes: reduce number of shared_ptr::get calls
* Mark SpanGroup modification test with issue
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added sents property to Span class that returns a generator of sentences the Span belongs to
* Added description to Span.sents property
* Update test_span to clarify the difference between span.sent and span.sents
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/tests/doc/test_span.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix documentation typos in spacy/tokens/span.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update Span.sents doc string in spacy/tokens/span.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Parametrized test_span_spans
* Corrected Span.sents to check for span-level hook first. Also, made Span.sent respect doc-level sents hook if no span-level hook is provided
* Corrected Span documentation copy/paste issue
* Put back accidentally deleted lines
* Fixed formatting in span.pyx
* Moved check for SENT_START annotation after user hooks in Span.sents
* add version where the property was introduced
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
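The idea behind `Span.sents`, yielding every sentence a span overlaps rather than only one, can be sketched as a generator over sentence boundaries. The function name and the tuple representation are assumptions, and user hooks are left out of the sketch.

```python
from typing import Iterator, List, Tuple

Bounds = Tuple[int, int]  # (start, end) token offsets, end exclusive


def span_sents(sent_bounds: List[Bounds], start: int, end: int) -> Iterator[Bounds]:
    """Yield every sentence that overlaps the span [start, end)."""
    for sent_start, sent_end in sent_bounds:
        if sent_start < end and start < sent_end:
            yield (sent_start, sent_end)


sentences = [(0, 4), (4, 9), (9, 12)]
crossing = list(span_sents(sentences, 2, 6))  # span crosses a sentence boundary
inside = list(span_sents(sentences, 5, 7))    # span lies within one sentence
```

A span that crosses a sentence boundary yields both sentences, which is where `sents` differs from a single-sentence `sent` property.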
* Migrate regressions 1-1000
* Move serialize test to correct file
* Remove tests that won't work in v3
* Migrate regressions 1000-1500
Removed regression test 1250 because v3 doesn't support the old LEX
scheme anymore.
* Add missing imports in serializer tests
* Migrate tests 1500-2000
* Migrate regressions from 2000-2500
* Migrate regressions from 2501-3000
* Migrate regressions from 3000-3501
* Migrate regressions from 3501-4000
* Migrate regressions from 4001-4500
* Migrate regressions from 4501-5000
* Migrate regressions from 5001-5501
* Migrate regressions from 5501 to 7000
* Migrate regressions from 7001 to 8000
* Migrate remaining regression tests
* Fixing missing imports
* Update docs with new system [ci skip]
* Update CONTRIBUTING.md
- Fix formatting
- Update wording
* Remove lemmatizer tests in el lang
* Move a few tests into the general tokenizer
* Separate Doc and DocBin tests
* Validate pos values when creating Doc
* Add clear error when setting invalid pos
This also changes the error language slightly.
* Fix variable name
* Update spacy/tokens/doc.pyx
* Test that setting invalid pos raises an error
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
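Validating part-of-speech values up front might look like the sketch below. The accepted set here is the 17 Universal Dependencies tags, which is an assumption (the library's own set may be larger), and treating the empty string as "unset" is likewise assumed; `validate_pos` is a hypothetical name.

```python
from typing import Iterable, List

# Assumption: the Universal Dependencies UPOS tag set.
UPOS_TAGS = frozenset(
    {
        "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
    }
)


def validate_pos(pos_values: Iterable[str]) -> List[str]:
    """Return the values unchanged, or raise on the first invalid one."""
    checked = list(pos_values)
    for i, pos in enumerate(checked):
        # Assumption: an empty string means "no value set" and is allowed.
        if pos and pos not in UPOS_TAGS:
            raise ValueError(f"invalid part-of-speech value at index {i}: {pos!r}")
    return checked


valid = validate_pos(["NOUN", "VERB", ""])
```

Raising at construction time gives a clear error at the point where the bad value is supplied, rather than a failure later on.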