761 Commits

Author SHA1 Message Date
Daniël de Kok
096794dd74 Account for differences between Span.sents in spaCy 3/4 2023-06-22 15:38:22 +02:00
Daniël de Kok
50c5e9a2dd Merge remote-tracking branch 'upstream/master' into sync-v4-master-20230612 2023-06-12 15:57:10 +02:00
Adriane Boyd
c4112a1da3
Require that all SpanGroup spans are from the current doc (#12569)
* Require that all SpanGroup spans are from the current doc

The restriction on only adding spans from the current doc was already
implemented for all operations except for `SpanGroup.__init__`.

Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in
order to validate the character offsets and to make it possible to copy
spans between documents with differing tokenization. Currently there is
no validation that the document texts are identical, but the span char
offsets must be valid spans in the target doc, which prevents you from
ending up with completely invalid spans.

* Undo change in test_beam_overfitting_IO
2023-06-01 19:19:17 +02:00
Basile Dura
6ea4155487
feat: add comparison operators in span.pyi (#12652)
* feat: add comparison operators in span.pyi

remove Cython-specific `__richcmp__`

* fix: comparison operators should be defined for any other object
2023-05-23 08:50:37 +02:00
Basile Dura
95fd46b1dd
feat: add type hinting on SpanGroup.__iter__ (#12642) 2023-05-17 14:20:00 +02:00
Adriane Boyd
b60b027927
Add default option to MorphAnalysis.get (#12545)
* Add default to MorphAnalysis.get

Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for
the user to provide a default return value if the field is not found.
The default return value remains `[]`, which is not the same as
`dict.get`, but is already established as this method's default return
value with the return type `List[str]`. However, the new `default` option
does not enforce that the user-provided default is actually `List[str]`.

* Restore test case
2023-04-20 14:06:32 +02:00
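The `default` behaviour described in #12545 can be sketched in plain Python. `ToyMorphAnalysis` is a hypothetical stand-in written for illustration, not spaCy's Cython implementation; only the `get(field, default)` shape and the `[]` fallback are taken from the commit message.

```python
from typing import Dict, List, Optional


class ToyMorphAnalysis:
    """Hypothetical stand-in for MorphAnalysis (illustration only)."""

    def __init__(self, feats: str) -> None:
        # Parse a UD-style string such as "Case=Nom|Number=Sing".
        self._fields: Dict[str, List[str]] = {}
        for pair in filter(None, feats.split("|")):
            field, values = pair.split("=", 1)
            self._fields[field] = values.split(",")

    def get(self, field: str, default: Optional[List[str]] = None) -> List[str]:
        # A known field returns a copy of its values.
        if field in self._fields:
            return list(self._fields[field])
        # Unlike dict.get, the fallback is [] rather than None.
        # The user-provided default is returned as-is, without type checks.
        return [] if default is None else default


morph = ToyMorphAnalysis("Case=Nom|Number=Sing")
print(morph.get("Case"))
print(morph.get("Tense"))
print(morph.get("Tense", ["Pres"]))
```

Keeping `[]` as the fallback means existing callers that iterate over the result keep working when the field is missing.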
Adriane Boyd
5d0f48fe69
Enforce that Span.start/end(_char) remain valid and in sync (#12268)
* Enforce that Span.start/end(_char) remain valid and in sync

Allowing span attributes to be writable starting in v3 has made it
possible for the internal `Span.start/end/start_char/end_char` to get
out-of-sync or have invalid values.

This checks that the values are valid and syncs the token and char
offsets if any attributes are modified directly. It does not yet handle
the case where the underlying doc is modified.

* Format
2023-04-06 16:01:59 +02:00
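One way to keep token and character offsets in sync, as #12268 describes, is to validate every write and derive the character offsets from the token offsets. The sketch below is a simplified assumption, not spaCy's code: `ToySpan` and its `token_offsets` argument are hypothetical, and it only handles non-empty spans.

```python
from typing import List, Tuple


class ToySpan:
    """Hypothetical span over a fixed list of (start_char, end_char) tokens."""

    def __init__(self, token_offsets: List[Tuple[int, int]], start: int, end: int) -> None:
        self._offsets = list(token_offsets)
        self._validate(start, end)
        self._start, self._end = start, end

    def _validate(self, start: int, end: int) -> None:
        # Simplification: only non-empty spans inside the doc are accepted.
        if not 0 <= start < end <= len(self._offsets):
            raise IndexError(f"invalid token offsets: start={start}, end={end}")

    @property
    def start(self) -> int:
        return self._start

    @start.setter
    def start(self, value: int) -> None:
        self._validate(value, self._end)
        self._start = value

    @property
    def end(self) -> int:
        return self._end

    @end.setter
    def end(self, value: int) -> None:
        self._validate(self._start, value)
        self._end = value

    # Character offsets are derived, so they cannot drift from start/end.
    @property
    def start_char(self) -> int:
        return self._offsets[self._start][0]

    @start_char.setter
    def start_char(self, value: int) -> None:
        starts = [s for s, _ in self._offsets]
        if value not in starts:
            raise ValueError(f"{value} is not the start of a token")
        self.start = starts.index(value)

    @property
    def end_char(self) -> int:
        return self._offsets[self._end - 1][1]

    @end_char.setter
    def end_char(self, value: int) -> None:
        ends = [e for _, e in self._offsets]
        if value not in ends:
            raise ValueError(f"{value} is not the end of a token")
        self.end = ends.index(value) + 1
```

As in the commit message, this does not cover the case where the underlying token list itself changes after the span is created.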
Adriane Boyd
4a1ec332de
Add Span.kb_id/Span.id strings to Doc/DocBin serialization if set (#12493)
* Add Span.kb_id/Span.id strings to Doc/DocBin serialization if set

* Format
2023-04-03 15:11:12 +02:00
Adriane Boyd
4538ceb507
Remove redundant strings.add for Doc.char_span (#12429) 2023-04-03 11:38:56 +02:00
Raphael Mitsch
d85df9d577
Fix Span.sents for edge case of Span being the only Span in the last sentence of a Doc. (#12484) 2023-03-29 18:54:47 +02:00
Raphael Mitsch
e8cab4625c
Fix sentence indexing bug in Span.sents (#12405)
* Add test for partial sentences in ent.sents.

* Removed unneeded import.

* Format. Simplify code.
2023-03-14 10:21:53 +01:00
Adriane Boyd
da75896ef5
Return Tuple[Span] for all Doc/Span attrs that provide spans (#12288)
* Return Tuple[Span] for all Doc/Span attrs that provide spans

* Update Span types
2023-03-01 16:00:02 +01:00
Adriane Boyd
df4c069a13
Remove backoff from .vector to .tensor (#12292) 2023-02-23 11:36:50 +01:00
Adriane Boyd
b95123060a
Make Span.char_span optional args keyword-only (#12257)
* Make Span.char_span optional args keyword-only

* Make kb_id and following kw-only

* Format
2023-02-15 12:34:33 +01:00
Adriane Boyd
cbc2ae933e
Remove unused Span.char_span(id=) (#12250) 2023-02-08 14:46:07 +01:00
Adriane Boyd
5089efa2d0
Use the same tuple in Span cmp and hash (#12251) 2023-02-08 14:28:34 +01:00
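The idea behind #12251 is the usual Python rule that objects which compare equal must hash equal, which is easiest to guarantee when both operations read one shared key. The sketch below assumes a hypothetical `ToySpanKey` class; its field list is illustrative and not necessarily the tuple spaCy uses.

```python
from functools import total_ordering


@total_ordering
class ToySpanKey:
    """Hypothetical span-like object; the fields are assumptions."""

    def __init__(self, start: int, end: int, label: int = 0, kb_id: int = 0) -> None:
        self.start = start
        self.end = end
        self.label = label
        self.kb_id = kb_id

    def _key(self) -> tuple:
        # Single source of truth for comparison and hashing.
        return (self.start, self.end, self.label, self.kb_id)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ToySpanKey):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other: object) -> bool:
        if not isinstance(other, ToySpanKey):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self) -> int:
        return hash(self._key())
```

If `__eq__` and `__hash__` were built from different tuples, two spans could compare equal yet land in different dict or set buckets.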
Adriane Boyd
cd95b29053 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master-7 2023-02-02 13:06:15 +01:00
Adriane Boyd
5f8a398bb9
Add span_id to Span.char_span, update Doc/Span.char_span docs (#12196)
* Add span_id to Span.char_span, update Doc/Span.char_span docs

`Span.char_span(id=)` should be removed in the future.

* Also use Union[int, str] in Doc docstring
2023-01-27 15:09:17 +01:00
Simon Gurcke
774c10fa39
Add alignment_mode argument to Span.char_span() (#12145)
* Add alignment_mode argument to Span.char_span()

* Update website

* Update spacy/tokens/span.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-27 11:43:40 +01:00
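The alignment modes that #12145 brings to `Span.char_span()` can be illustrated without spaCy. The function below is a hedged sketch: `toy_char_span` and its `token_offsets` argument are hypothetical, and the mode semantics follow the documented `Doc.char_span` behaviour (`"strict"` requires exact token boundaries, `"contract"` keeps tokens fully inside the range, `"expand"` keeps tokens at least partly covered).

```python
from typing import List, Optional, Tuple


def toy_char_span(
    token_offsets: List[Tuple[int, int]],
    start_char: int,
    end_char: int,
    alignment_mode: str = "strict",
) -> Optional[Tuple[int, int]]:
    """Map character offsets to a (start_token, end_token) pair, or None."""
    if alignment_mode not in ("strict", "contract", "expand"):
        raise ValueError(f"unknown alignment_mode: {alignment_mode}")

    if alignment_mode == "strict":
        # Both offsets must fall exactly on token boundaries.
        starts = {s: i for i, (s, _) in enumerate(token_offsets)}
        ends = {e: i + 1 for i, (_, e) in enumerate(token_offsets)}
        if start_char not in starts or end_char not in ends:
            return None
        start, end = starts[start_char], ends[end_char]
    else:
        if alignment_mode == "contract":
            # Keep tokens that lie completely inside the character range.
            hits = [
                i for i, (s, e) in enumerate(token_offsets)
                if s >= start_char and e <= end_char
            ]
        else:
            # "expand": keep tokens that overlap the character range at all.
            hits = [
                i for i, (s, e) in enumerate(token_offsets)
                if s < end_char and e > start_char
            ]
        if not hits:
            return None
        start, end = hits[0], hits[-1] + 1

    return (start, end) if start < end else None


# Token offsets for the text "Hello brave world".
offsets = [(0, 5), (6, 11), (12, 17)]
print(toy_char_span(offsets, 2, 9))
print(toy_char_span(offsets, 2, 9, alignment_mode="expand"))
```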
Paul O'Leary McCann
de360bc981
Refactor lexeme mem passing (#12125)
* Don't pass mem pool to new lexeme function

* Remove unused mem from function args

Two methods calling _new_lexeme, get and get_by_orth, took mem arguments
just to call the internal method. That's no longer necessary, so this
cleans it up.

* prettier formatting

* Remove more unused mem args
2023-01-25 12:50:21 +09:00
Daniël de Kok
207565a788 Merge remote-tracking branch 'upstream/master' into chore/v4-merge-master-20221222 2022-12-22 10:08:54 +01:00
Raphael Mitsch
eef3d950b4
Fix SpanGroup and Span typing (#12009)
* Correct Span.label, Span.kb_id types. Fix SpanGroup.__iter__().

* Extend test.

* Rename test. Fix typo.

* Add comment.

* Fix types for Span.label, Span.kb_id, Span.char_span().

* Update spacy/tests/doc/test_span_group.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update docs.

* Fix typo.

* Update spacy/tokens/span_group.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-12-21 18:54:27 +01:00
Edward
ca75190a3d
Custom extensions for spans with equal boundaries (#11429)
* Init

* Fix return type for mypy

* adjust types and improve setting new attributes

* Add underscore changes to json conversion

* Add test and underscore changes to from_docs

* add underscore changes and test to span.to_doc

* update return values

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add types to function

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* adjust formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* shorten return type

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* add helper function to improve readability

* Improve code and add comments

* rerun azure tests

* Fix tests for json conversion

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-12-12 08:55:53 +01:00
Adriane Boyd
0591e67265
Cast to uint64 for all array-based doc representations (#11933)
* Convert all individual values explicitly to uint64 for array-based doc representations

* Temporarily test with latest numpy v1.24.0rc

* Remove unnecessary conversion from attr_t

* Reduce number of individual casts

* Convert specifically from int32 to uint64

* Revert "Temporarily test with latest numpy v1.24.0rc"

This reverts commit eb0e3c5006.

* Also use int32 in tests
2022-12-12 08:45:35 +01:00
svlandeg
04fea09ffd Merge branch 'copy_master' into copy_v4 2022-12-05 08:56:15 +01:00
Edward
e79910d57e
Remove sentiment extension (#11722)
* remove sentiment attribute

* remove sentiment from docs

* add test for backwards compatibility

* replace from_disk with from_bytes

* Fix docs and format file

* Fix formatting
2022-11-23 13:09:32 +01:00
Adriane Boyd
ea326cf47d
Fix types for Span.id and Span.id_ (#11744) 2022-11-07 08:11:13 +01:00
Adriane Boyd
40e1000db0
Restore Doc attr getter values in Doc.to_json (#11700) 2022-11-03 11:49:08 +01:00
Adriane Boyd
103b24fb25 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master 2022-10-21 09:13:32 +02:00
Adriane Boyd
7e56701057 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5 2022-10-20 13:38:49 +02:00
Edward
d66ccb8eb0
Fix multiple entries per custom extension in doc json (#11551)
* Fix multiple extensions and character offset

* Rename token_start/end to start/end

* Refactor Doc.from_json based on review

* Iterate over user_data items

* Only add non-empty underscore entries

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-10-19 15:52:47 +02:00
Madeesh Kannan
446a3ecf34
StringStore refactoring (#11344)
* `strings`: Remove unused `hash32_utf8` function

* `strings`: Make `hash_utf8` and `decode_Utf8Str` private

* `strings`: Reorganize private functions

* `strings`: Raise error when non-string/-int types are passed to functions that don't accept them

* `strings`: Add `items()` method, add type hints, remove unused methods, restrict inputs to specific types, reorganize methods

* `Morphology`: Use `StringStore.items()` to enumerate features when pickling

* `test_stringstore`: Update pre-Python 3 tests

* Update `StringStore` docs

* Fix `get_string_id` imports

* Replace redundant test with tests for type checking

* Rename `_retrieve_interned_str`, remove `.get` default arg

* Add `get_string_id` to `strings.pyi`
Remove `mypy` ignore directives from imports of the above

* `strings.pyi`: Replace functions that consume `Union`-typed params with overloads

* `strings.pyi`: Revert some function signatures

* Update `SYMBOLS_BY_INT` lookups and error codes post-merge

* Revert clobbered change introduced in a previous merge

* Remove unnecessary type hint

* Invert tuple order in `StringStore.items()`

* Add test for `StringStore.items()`

* Revert "`Morphology`: Use `StringStore.items()` to enumerate features when pickling"

This reverts commit 1af9510ceb.

* Rename `keys` and `key_map`

* Add `keys()` and `values()`

* Add comment about the inverted key-value semantics in the API

* Fix type hints

* Implement `keys()`, `values()`, `items()` without generators

* Fix type hints, remove unnecessary boxing

* Update docs

* Simplify `keys/values/items()` impl

* `mypy` fix

* Fix error message, doc fixes
2022-10-06 10:51:06 +02:00
svlandeg
e3027c65b8 Merge branch 'copy_develop' into copy_v4 2022-10-03 14:12:16 +02:00
svlandeg
9c8cdb403e Merge branch 'master_copy' into develop_copy 2022-09-30 15:40:26 +02:00
Richard Hudson
6f692a06d5
Remove side effects from Doc.__init__() (#11506)
* Remove side effects from Doc.__init__()

* Changes based on review comment

* Readd test

* Change interface of Doc.__init__()

* Simplify test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update doc.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-09-26 15:58:21 +02:00
Daniël de Kok
efdbb722c5
Store activations in Docs when save_activations is enabled (#11002)
* Store activations in Doc when `store_activations` is enabled

This change adds the new `activations` attribute to `Doc`. This
attribute can be used by trainable pipes to store their activations,
probabilities, and guesses for downstream users.

As an example, this change modifies the `tagger` and `senter` pipes to
add a `store_activations` option. When this option is enabled, the
probabilities and guesses are stored in `set_annotations`.

* Change type of `store_activations` to `Union[bool, List[str]]`

When the value is:

- A bool: all activations are stored when set to `True`.
- A List[str]: the activations named in the list are stored

* Formatting fixes in Tagger

* Support store_activations in spancat and morphologizer

* Make Doc.activations type visible to MyPy

* textcat/textcat_multilabel: add store_activations option

* trainable_lemmatizer/entity_linker: add store_activations option

* parser/ner: do not currently support returning activations

* Extend tagger and senter tests

So that they, like the other tests, also check that we get no
activations if no activations were requested.

* Document `Doc.activations` and `store_activations` in the relevant pipes

* Start errors/warnings at higher numbers to avoid merge conflicts

Between the master and v4 branches.

* Add `store_activations` to docstrings.

* Replace store_activations setter by set_store_activations method

Setters that take a different type than what the getter returns are still
problematic for MyPy. Replace the setter by a method, so that type inference
works everywhere.

* Use dict comprehension suggested by @svlandeg

* Revert "Use dict comprehension suggested by @svlandeg"

This reverts commit 6e7b958f70.

* EntityLinker: add type annotations to _add_activations

* _store_activations: make kwarg-only, remove doc_scores_lens arg

* set_annotations: add type annotations

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* TextCat.predict: return dict

* Make the `TrainablePipe.store_activations` property a bool

This means that we can also bring back `store_activations` setter.

* Remove `TrainablePipe.activations`

We do not need to enumerate the activations anymore since `store_activations` is
`bool`.

* Add type annotations for activations in predict/set_annotations

* Rename `TrainablePipe.store_activations` to `save_activations`

* Error E1400 is not used anymore

This error was used when activations were still `Union[bool, List[str]]`.

* Change wording in API docs after store -> save change

* docs: tag (save_)activations as new in spaCy 4.0

* Fix copied line in morphologizer activations test

* Don't train in any test_save_activations test

* Rename activations

- "probs" -> "probabilities"
- "guesses" -> "label_ids", except in the edit tree lemmatizer, where
  "guesses" -> "tree_ids".

* Remove unused W400 warning.

This warning was used when we still allowed the user to specify
which activations to save.

* Formatting fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Replace "kb_ids" by a constant

* spancat: replace a cast by an assertion

* Fix EOF spacing

* Fix comments in test_save_activations tests

* Do not set RNG seed in activation saving tests

* Revert "spancat: replace a cast by an assertion"

This reverts commit 0bd5730d16.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-13 09:51:12 +02:00
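The final shape of the feature in #11002 (a boolean `save_activations` flag, with results stored per pipe under `Doc.activations`) can be sketched with toy classes. `ToyDoc` and `ToyTagger` are hypothetical and the scores are fake; only the flag name, the `activations` attribute, and the `"probabilities"` / `"label_ids"` keys come from the commit message.

```python
from typing import Dict, List


class ToyDoc:
    """Hypothetical minimal doc with an activations store."""

    def __init__(self, words: List[str]) -> None:
        self.words = list(words)
        self.tags: List[str] = []
        # Maps pipe name -> activation name -> values.
        self.activations: Dict[str, Dict[str, list]] = {}


class ToyTagger:
    """Hypothetical pipe; not a real model."""

    name = "tagger"

    def __init__(self, labels: List[str], save_activations: bool = False) -> None:
        self.labels = list(labels)
        self.save_activations = save_activations

    def predict(self, doc: ToyDoc) -> Dict[str, list]:
        n = len(self.labels)
        probabilities = []
        for word in doc.words:
            # Fake, deterministic scores that depend only on word length.
            raw = [1.0 + (len(word) + i) % n for i in range(n)]
            total = sum(raw)
            probabilities.append([r / total for r in raw])
        label_ids = [max(range(n), key=row.__getitem__) for row in probabilities]
        return {"probabilities": probabilities, "label_ids": label_ids}

    def set_annotations(self, doc: ToyDoc, activations: Dict[str, list]) -> None:
        doc.tags = [self.labels[i] for i in activations["label_ids"]]
        # Activations are kept only when the flag is enabled.
        if self.save_activations:
            doc.activations[self.name] = activations

    def __call__(self, doc: ToyDoc) -> ToyDoc:
        self.set_annotations(doc, self.predict(doc))
        return doc


doc = ToyTagger(["NOUN", "VERB"], save_activations=True)(ToyDoc(["cats", "sleep"]))
print(sorted(doc.activations["tagger"]))
```

A plain `bool` avoids the getter/setter type mismatch that the earlier `Union[bool, List[str]]` design caused for MyPy.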
shademe
977b847cce
Merge branch 'develop' into merge-develop-into-v4 2022-09-07 11:35:47 +02:00
Adriane Boyd
4a615cacd2
Consolidate and freeze symbols (#11352)
* Consolidate and freeze symbols

Instead of having symbol values defined in three potentially conflicting
places (`spacy.attrs`, `spacy.parts_of_speech`, `spacy.symbols`), define
all symbols in `spacy.symbols` and reference those values in
`spacy.attrs` and `spacy.parts_of_speech`.

Remove deprecated and placeholder symbols from `spacy.attrs.IDS`.

Make `spacy.attrs.NAMES` and `spacy.symbols.NAMES` reverse dicts rather
than lists in order to support future use of hash values in `attr_id_t`.

Minor changes:

* Use `uint64_t` for attrs in `Doc.to_array` to support future use of
hash values
* Remove unneeded attrs filter for error message in `Doc.to_array`
* Remove unused attr `SENT_END`

* Handle dynamic size of attr_id_t in Doc.to_array

* Undo added warnings

* Refactor to make Doc.to_array more similar to Doc.from_array

* Improve refactoring
2022-09-02 09:08:40 +02:00
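The switch from list-based to dict-based `NAMES` in #11352 matters once IDs stop being small, dense integers. The sketch below uses a hypothetical symbol table: the attribute names are familiar, but every integer value, including the large "hash-like" one, is made up for illustration.

```python
from typing import Dict

# Hypothetical IDs: three dense values plus one large, hash-like value.
IDS: Dict[str, int] = {
    "ORTH": 0,
    "LEMMA": 1,
    "POS": 2,
    "CUSTOM": 2**63 + 5,
}


def build_names(ids: Dict[str, int]) -> Dict[int, str]:
    """Build the reverse mapping and reject conflicting values."""
    names: Dict[int, str] = {}
    for name, value in ids.items():
        if value in names:
            raise ValueError(f"{name} and {names[value]} share ID {value}")
        names[value] = name
    return names


NAMES = build_names(IDS)

# A list indexed by ID would need more than 2**63 slots for "CUSTOM";
# the dict stores exactly one entry per symbol.
print(NAMES[IDS["CUSTOM"]])
print(len(NAMES))
```

Building the reverse dict from a single table also surfaces conflicting definitions early, which is the risk the commit describes with three separate places for symbol values.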
Madeesh Kannan
604a7c3c26
SpanGroup(s)-related optimizations (#11380)
* `SpanGroup`: Add support for binding copies to a new reference document

* `SpanGroups`: Replace superfluous serialize-deserialize roundtrip in `copy`

Instead, directly copy the in-memory representations of the constituent `SpanGroup`s.

* Update `SpanGroup.copy()` signature

* Rename `new_doc` param to `doc`

* Fix kwarg

* Update `.pyi` file and docstrings

* `mypy` fix

* Update spacy/tokens/span_group.pyx

* Update docs

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-31 09:03:20 +02:00
Adriane Boyd
98a916e01a
Make stable private modules public and adjust names (#11353)
* Make stable private modules public and adjust names

* `spacy.ml._character_embed` -> `spacy.ml.character_embed`
* `spacy.ml._precomputable_affine` -> `spacy.ml.precomputable_affine`
* `spacy.tokens._serialize` -> `spacy.tokens.doc_bin`
* `spacy.tokens._retokenize` -> `spacy.tokens.retokenize`
* `spacy.tokens._dict_proxies` -> `spacy.tokens.span_groups`

* Skip _precomputable_affine

* retokenize -> retokenizer

* Fix imports
2022-08-30 13:56:35 +02:00
Adriane Boyd
c44d243f25 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master 2022-08-24 07:15:41 +02:00
Edward
5afa98aabf
Support custom attributes for tokens and spans in json conversion (#11125)
* Add token and span custom attributes to to_json()

* Change logic for to_json

* Add functionality to from_json

* Small adjustments

* Move token/span attributes to new dict key

* Fix test

* Fix the same test but much better

* Add backwards compatibility tests and adjust logic

* Add test to check if attributes not set in underscore are not saved in the json

* Add tests for json compatibility

* Adjust test names

* Fix tests and clean up code

* Fix assert json tests

* small adjustment

* adjust naming and code readability

* Adjust naming, added more tests and changed logic

* Fix typo

* Adjust errors, naming, and small test optimization

* Fix byte tests

* Fix bytes tests

* Change naming and json structure

* update schema

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update schema for underscore attributes

* Adjust underscore schema

* adjust schema tests

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 10:05:02 +02:00
Adriane Boyd
bb0e178878
Make Span/Doc.ents more consistent for ent_kb_id and ent_id (#11328)
* Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents`
* Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents`
* Make `Span.ent_id` an alias of `Span.id` rather than a read-only view
of the root token's `ent_id` annotation
2022-08-22 20:28:57 +02:00
Daniël de Kok
1ff683a50b Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220728 2022-07-28 13:53:59 +02:00
Madeesh Kannan
ba18d2913d
Morphology/Morphologizer optimizations and refactoring (#11024)
* `Morphology`: Refactor to use C types, reduce allocations, remove unused code

* `Morphologizer`: Avoid unnecessary sorting of morpho features

* `Morphologizer`: Remove excessive reallocations of labels, improve hash lookups of labels, coerce `numpy` numeric types to native ints
Update docs

* Remove unused method

* Replace `unique_ptr` usage with `shared_ptr`

* Add type annotations to internal Python methods, rename `hash` variable, fix typos

* Add comment to clarify implementation detail

* Fix return type

* `Morphology`: Stop early when splitting fields and values
2022-07-15 11:14:08 +02:00
Nicolai Bjerre Pedersen
2fa983aa2e
Fix span typings (#11119)
Add id, id_ to span.pyi.
2022-07-12 13:47:35 +02:00
Adriane Boyd
24f4908fce
Update vector handling in similarity methods (#11013)
Distinguish between vectors that are 0 vs. missing vectors when warning
about missing vectors.

Update `Doc.has_vector` to match `Span.has_vector` and
`Token.has_vector` for cases where the vocab has vectors but none of the
tokens in the container have vectors.
2022-06-28 19:50:47 +02:00
Daniël de Kok
2f05c6824c Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220609 2022-06-09 10:18:25 +02:00
Madeesh Kannan
41389ffe1e
Avoid pickling Doc inputs passed to Language.pipe() (#10864)
* `Language.pipe()`: Serialize `Doc` objects to bytes when using multiprocessing to avoid pickling overhead

* `Doc.to_dict()`: Serialize `_context` attribute (keeping in line with `(un)pickle_doc()`)

* Correct type annotations

* Fix typo

* `Doc`: Do not serialize `_context`

* `Language.pipe`: Send context objects to child processes, Simplify `as_tuples` handling

* Fix type annotation

* `Language.pipe`: Simplify `as_tuple` multiprocessor handling

* Cleanup code, fix typos

* MyPy fixes

* Move doc preparation function into `_multiprocessing_pipe`
Whitespace changes

* Remove superfluous comma

* Rename `prepare_doc` to `prepare_input`

* Update spacy/errors.py

* Undo renaming for error

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-02 20:06:49 +02:00
single-fingal
6c6b8da7cc
Fix: De/Serialize SpanGroups including the SpanGroup keys (#10707)
* fix: De/Serialize `SpanGroups` including the SpanGroup keys

This prevents the loss of `SpanGroup`s that have the same .name as other `SpanGroup`s within the same `SpanGroups` object (upon de/serialization of the `SpanGroups`).

Fixes #10685

* Maintain backwards compatibility for serialized `SpanGroups`

(serialized as: a list of `SpanGroup`s, or b'')

* Add tests for `SpanGroups` deserialization backwards-compatibility

* Move a `SpanGroups` de/serialization test (test_issue10685)
  to tests/serialize/test_serialize_spangroups.py

* Output a warning if deserializing a `SpanGroups` with duplicate .name-d `SpanGroup`s

* Minor refactor

* `SpanGroups.from_bytes` handles only `list` and `dict` types with
`dict` as the expected default
* For lists, keep first rather than last value encountered
* Update error message
* Rename and update tests

* Update to preserve list serialization of SpanGroups

To avoid breaking compatibility of serialized `Doc` and `DocBin` with
earlier versions of spacy v3, revert back to a list-only serialization,
but update the names just for serialization so that the SpanGroups keys
override the SpanGroup names.

* Preserve object identity and current key overwrite

* Preserve SpanGroup object identity
* Preserve last rather than first span group from SpanGroup list
  format without SpanGroups keys

* Update inline comments

* Fix types

* Add type info for SpanGroup.copy

* Deserialize `SpanGroup`s as copies

when a single SpanGroup is the value for more than 1 `SpanGroups` key.
This is because we serialize `SpanGroups` as dicts (to maintain backward-
and forward-compatibility) and we can't assume `SpanGroup`s with the same
bytes/serialization were the same (identical) object, pre-serialization.

* Update spacy/tokens/_dict_proxies.py

* Add more SpanGroups serialization tests

Test that serialized SpanGroups maintain their Span order

* small clarification on older spaCy version

* Update spacy/tests/serialize/test_serialize_span_groups.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 15:56:27 +02:00
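The serialization approach that #10707 settles on (a list-only format in which each group's name is replaced by its `SpanGroups` key) can be sketched as follows. This is an assumption-laden toy: `ToySpanGroup`, `groups_to_bytes`, and `groups_from_bytes` are hypothetical names, and JSON stands in for spaCy's actual binary format.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToySpanGroup:
    """Hypothetical span group: a name plus (start, end) token pairs."""

    name: str
    spans: List[Tuple[int, int]] = field(default_factory=list)


def groups_to_bytes(groups: Dict[str, ToySpanGroup]) -> bytes:
    # Write a list, but store the dict key in place of the group's own name
    # so that groups sharing a name are not collapsed on load.
    payload = [{"name": key, "spans": group.spans} for key, group in groups.items()]
    return json.dumps(payload).encode("utf8")


def groups_from_bytes(data: bytes) -> Dict[str, ToySpanGroup]:
    # b"" is accepted as an empty container.
    if not data:
        return {}
    restored: Dict[str, ToySpanGroup] = {}
    for entry in json.loads(data.decode("utf8")):
        spans = [tuple(span) for span in entry["spans"]]
        # In a list with duplicate names, the last entry wins.
        restored[entry["name"]] = ToySpanGroup(entry["name"], spans)
    return restored


shared = ToySpanGroup("errors", [(0, 1)])
groups = {"a": shared, "b": shared, "c": ToySpanGroup("errors", [(2, 4)])}
restored = groups_from_bytes(groups_to_bytes(groups))
print(sorted(restored))
```

In this toy format a group stored under several keys comes back as separate copies, since identical serialized data cannot show that the originals were the same object.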