* add span key option for CLI evaluation
* Rephrase CLI help to refer to Doc.spans instead of spancat
* Rephrase docs to refer to Doc.spans instead of spancat
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
SpaCy's HashEmbedCNN layer performs convolutions over tokens to produce
contextualized embeddings using a `MaxoutWindowEncoder` layer. These
convolutions are implemented using Thinc's `expand_window` layer, which
concatenates `window_size` neighboring sequence items on either side of
the sequence item being processed. This is repeated across `depth`
convolutional layers.
For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder`
layer with a context window of 1 and a depth of 2. We'll focus on the
token "C". We can visually represent the contextual embedding produced
for "C" as:
```mermaid
flowchart LR
A0(A<sub>0</sub>)
B0(B<sub>0</sub>)
C0(C<sub>0</sub>)
D0(D<sub>0</sub>)
E0(E<sub>0</sub>)
B1(B<sub>1</sub>)
C1(C<sub>1</sub>)
D1(D<sub>1</sub>)
C2(C<sub>2</sub>)
A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```
Described in words, this graph shows that before the first layer of the
convolution, the "receptive field" centered at each token consists only
of that same token. That is to say, that we have a receptive field of 1.
The first layer of the convolution adds one neighboring token on either
side to the receptive field. Since this is done on both sides, the
receptive field increases by 2, giving the first layer a receptive field
of 3. The second layer of the convolutions adds an _additional_
neighboring token on either side to the receptive field, giving a final
receptive field of 5.
However, this doesn't match the formula currently given in the docs,
which read:
> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.
Substituting in our depth of 2 and window size of 1, this formula gives
us a receptive field of:
```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```
This not only doesn't match our computations from above, it's also an
even number! This is suspicious, since the receptive field is supposed
to be centered on a token, and not between tokens. Generally, this
formula results in an even number for any even value of `depth`.
The error in this formula is that the adjustment for the center token
is multiplied by the depth, when it should occur only once. The
corrected formula, `depth * window_size * 2 + 1`, gives the correct
value for our small example from above:
```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```
These changes update the docs to correct the receptive field formula and
the example receptive field size.
* remove migration support form
* initial test commit
* add fixture
* add combo test
* pull out parameter example data
* fix formatting on examples
* remove unused import
* remove unncessary fmt:off instructions
* only set logger level if verbose flag is explicitly set
---------
Co-authored-by: svlandeg <svlandeg@github.com>
* Add data structures to docs
* Adjusted descriptions for more consistency
* Add _optional_ flag to parameters
* Add tests and adjust optional title key in doc
* Add title to dep visualizations
* fix typo
---------
Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
* Add cli for finding locations of registered func
* fixes: naming and typing
* isort
* update naming
* remove to find-function
* remove file:// bit
* use registry name if given and exit gracefully if a registry was not found
* clean up failure msg
* specify registry_name options
* mypy fixes
* return location for internal usage
* add documentation
* more mypy fixes
* clean up example
* add section to menu
* add tests
---------
Co-authored-by: svlandeg <svlandeg@github.com>
* modified: spacy/language.py
- corrected typo in docstring for :method:`Language.replace_listeners`
- added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned
modified: website/docs/api/language.mdx
- corrected typo in `Language.replace_listeners` markdown
* modified: spacy/language.py
- removed noqa comment
---------
Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>
These changes add a missing call to `escape_html` in the displaCy span
renderer. Previously span-annotated tokens would be inserted into the
page markup without being escaped, resulting in potentially incorrect
rendering. When I encountered this issue, it resulted in some docs and
span underlines being superimposed on top of properly rendered docs and
span underlines near the beginning of the visualization (due to an
unescaped `<span>` tag).
When the default `max_length` is not set and there are longer training
documents, it can be difficult to train and evaluate the span finder due
to memory limits and the time it takes to evaluate a huge number of
predicted spans.
* Support custom token/lexeme attribute for vectors
* Fix imports
* Back off to ORTH without Vectors.attr
* Fallback if vectors.attr doesn't exist
* Update docs
When sourcing a component, the object from the original pipeline is added to the new pipeline as the same object. This creates a situation where there are several attributes that cannot be in sync between the original pipeline and the new pipeline at the same time for this one object:
* component.name
* component.listener_map / component.listening_components for tok2vec and transformer
When running replace_listeners on a component, the config is not updated correctly if the state of the component is incorrect for the current pipeline (in particular changes that should be applied from model.attrs["replace_listener_cfg"] as used in spacy-transformers) due to the fact that:
* find_listeners relies on component.name to set the name in the listener_map
* replace_listeners relies on listener_map to determine how to modify the configs
In addition, there are several places where pipeline components are modified and the listener map and/or internal component names aren't currently updated.
In cases where there is a component shared by two pipelines that cannot be in sync, this PR chooses to prioritize the most recently modified or initialized pipeline. There is no actual solution with the current source behavior that will make both pipelines usable, so the current pipeline is updated whenever components are added/renamed/removed or the pipeline is initialized for training.
* Use isort with Black profile
* isort all the things
* Fix import cycles as a result of import sorting
* Add DOCBIN_ALL_ATTRS type definition
* Add isort to requirements
* Remove isort from build dependencies check
* Typo
* span finder integrated into spacy from experimental
* black
* isort
* black
* default spankey constant
* black
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* rename
* rename
* max_length and min_length as Optional[int] and strict checking
* black
* mypy fix for integer type infinity
* revert line order
* implement all comparison operators for inf int
* avoid two for loops over all docs by not precomputing
* interleave thresholding with span creation
* black
* revert to not interleaving (relized its faster)
* black
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* update dosctring
* enforce that the gold and predicted documents have the same text
* new error for ensuring reference and predicted texts are the same
* remove todo
* adjust test
* black
* handle misaligned tokenization
* return correct variable
* failing overfit test
* only use a single spans_key like in spancat
* black
* remove debug lines
* typo
* remove comment
* remove near duplicate reduntant method
* use the 'spans_key' variable name everywhere
* Update spacy/pipeline/span_finder.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* flaky test fix suggestion, hand set bias terms
* only test suggester and test result exhaustively
* make it clear that the span_finder_suggester is more general (not specific to span_finder)
* Update spacy/tests/pipeline/test_span_finder.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Apply suggestions from code review
* remove question comment
* move preset_spans_suggester test to spancat tests
* Add docs and unify default configs for spancat and span finder
* Add `allow_overlap=True` to span finder scorer
* Fix offset bug in set_annotations
* Ignore labels in span finder scorer
* Format
* Add span_finder to quickstart template
* Move settings to self.cfg, store min/max unset as None
* Remove debugging
* Update docstrings and docs
* Update spacy/pipeline/span_finder.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix imports
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Require that all SpanGroup spans are from the current doc
The restriction on only adding spans from the current doc were already
implemented for all operations except for `SpanGroup.__init__`.
Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in
order to validate the character offsets and to make it possible to copy
spans between documents with differing tokenization. Currently there is
no validation that the document texts are identical, but the span char
offsets must be valid spans in the target doc, which prevents you from
ending up with completely invalid spans.
* Undo change in test_beam_overfitting_IO
* Address upcoming numpy v1.25 deprecations in test suite
* Temporarily test most recent numpy prerelease in CI
* Revert "Temporarily test most recent numpy prerelease in CI"
This reverts commit d75a66e55e.
* Add scorer option to return per-component scores
Add `per_component` option to `Language.evaluate` and `Scorer.score` to
return scores keyed by `tokenizer` (hard-coded) or by component name.
Add option to `evaluate` CLI to score by component. Per-component scores
can only be saved to JSON.
* Update help text and messages
This reverts commit 6f314f99c4.
We are reverting this until we can support this normalization more
consistently across vectors, training corpora, and lemmatizer data.
* Use Latin normalization for Serbian attrs
Use Latin normalization for Serbian `NORM`, `PREFIX`, and `SUFFIX`.
* Update NORMs in tokenizer exceptions and related tests
* Add tests for all custom lex attrs
* Remove unused imports
* Add spans in spacy benchmark
The current implementation of spaCy benchmark accuracy / spacy evaluate
doesn't include the "spans" type, so calling the command doesn't render
the HTML displaCy file needed.
This PR attempts to fix that by creating a new parameter for "spans"
and calling the appropriate displaCy value.
* Reformat file with black
* Add tests for evaluate
* Fix spans -> span for displacy style
* Update test to check render instead
* Update source so mypy passes
* Add parser information to avoid warnings
* avoid nesting then flattening
* mypy fix
* Apply suggestions from code review
* Add type for indices
* Run full matrix for mypy
* Add back modified type: ignore
* Revert "Run full matrix for mypy"
This reverts commit e218873d04.
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add noun chunking to la syntax iterators
* Expand list of numeral, ordinal words
* Expand abbreviations in la tokenizer_exceptions
* Add example sents
* Update spacy/lang/la/syntax_iterators.py
Reorganize la syntax iterators
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Minor updates based on review
* fix call
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add default to MorphAnalysis.get
Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for
the user to provide a default return value if the field is not found.
The default return value remains `[]`, which is not the same as
`dict.get`, but is already established as this method's default return
value with the return type `List[str]`. However the new `default` option
does not enforce that the user-provided default is actually `List[str]`.
* Restore test case