SpaCy's HashEmbedCNN layer performs convolutions over tokens to produce
contextualized embeddings using a `MaxoutWindowEncoder` layer. These
convolutions are implemented using Thinc's `expand_window` layer, which
concatenates `window_size` neighboring sequence items on either side of
the sequence item being processed. This is repeated across `depth`
convolutional layers.
For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder`
layer with a context window of 1 and a depth of 2. We'll focus on the
token "C". We can visually represent the contextual embedding produced
for "C" as:
```mermaid
flowchart LR
A0(A<sub>0</sub>)
B0(B<sub>0</sub>)
C0(C<sub>0</sub>)
D0(D<sub>0</sub>)
E0(E<sub>0</sub>)
B1(B<sub>1</sub>)
C1(C<sub>1</sub>)
D1(D<sub>1</sub>)
C2(C<sub>2</sub>)
A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```
Described in words, this graph shows that before the first layer of the
convolution, the "receptive field" centered at each token consists only
of that same token. That is to say, that we have a receptive field of 1.
The first layer of the convolution adds one neighboring token on either
side to the receptive field. Since this is done on both sides, the
receptive field increases by 2, giving the first layer a receptive field
of 3. The second layer of the convolutions adds an _additional_
neighboring token on either side to the receptive field, giving a final
receptive field of 5.
However, this doesn't match the formula currently given in the docs,
which read:
> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.
Substituting in our depth of 2 and window size of 1, this formula gives
us a receptive field of:
```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```
This not only doesn't match our computations from above, it's also an
even number! This is suspicious, since the receptive field is supposed
to be centered on a token, and not between tokens. Generally, this
formula results in an even number for any even value of `depth`.
The error in this formula is that the adjustment for the center token
is multiplied by the depth, when it should occur only once. The
corrected formula, `depth * window_size * 2 + 1`, gives the correct
value for our small example from above:
```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```
These changes update the docs to correct the receptive field formula and
the example receptive field size.
There was a mistake in the regex pattern which caused not matching all the desired tokens. The problem was that when we use r string literal prefix to suppose a raw text, we should not use two backslashes to demonstrate a backslash.
* remove migration support form
* initial test commit
* add fixture
* add combo test
* pull out parameter example data
* fix formatting on examples
* remove unused import
* remove unncessary fmt:off instructions
* only set logger level if verbose flag is explicitly set
---------
Co-authored-by: svlandeg <svlandeg@github.com>
* Add data structures to docs
* Adjusted descriptions for more consistency
* Add _optional_ flag to parameters
* Add tests and adjust optional title key in doc
* Add title to dep visualizations
* fix typo
---------
Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
* Add cli for finding locations of registered func
* fixes: naming and typing
* isort
* update naming
* remove to find-function
* remove file:// bit
* use registry name if given and exit gracefully if a registry was not found
* clean up failure msg
* specify registry_name options
* mypy fixes
* return location for internal usage
* add documentation
* more mypy fixes
* clean up example
* add section to menu
* add tests
---------
Co-authored-by: svlandeg <svlandeg@github.com>
* modified: spacy/language.py
- corrected typo in docstring for :method:`Language.replace_listeners`
- added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned
modified: website/docs/api/language.mdx
- corrected typo in `Language.replace_listeners` markdown
* modified: spacy/language.py
- removed noqa comment
---------
Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>
These changes add a missing call to `escape_html` in the displaCy span
renderer. Previously span-annotated tokens would be inserted into the
page markup without being escaped, resulting in potentially incorrect
rendering. When I encountered this issue, it resulted in some docs and
span underlines being superimposed on top of properly rendered docs and
span underlines near the beginning of the visualization (due to an
unescaped `<span>` tag).
When the default `max_length` is not set and there are longer training
documents, it can be difficult to train and evaluate the span finder due
to memory limits and the time it takes to evaluate a huge number of
predicted spans.
* Fix problem with universe pages using `docker` language
* Fix problem with universe pages using `r` language
* Add fallback, in case code language is unknown
* Support custom token/lexeme attribute for vectors
* Fix imports
* Back off to ORTH without Vectors.attr
* Fallback if vectors.attr doesn't exist
* Update docs
When sourcing a component, the object from the original pipeline is added to the new pipeline as the same object. This creates a situation where there are several attributes that cannot be in sync between the original pipeline and the new pipeline at the same time for this one object:
* component.name
* component.listener_map / component.listening_components for tok2vec and transformer
When running replace_listeners on a component, the config is not updated correctly if the state of the component is incorrect for the current pipeline (in particular changes that should be applied from model.attrs["replace_listener_cfg"] as used in spacy-transformers) due to the fact that:
* find_listeners relies on component.name to set the name in the listener_map
* replace_listeners relies on listener_map to determine how to modify the configs
In addition, there are several places where pipeline components are modified and the listener map and/or internal component names aren't currently updated.
In cases where there is a component shared by two pipelines that cannot be in sync, this PR chooses to prioritize the most recently modified or initialized pipeline. There is no actual solution with the current source behavior that will make both pipelines usable, so the current pipeline is updated whenever components are added/renamed/removed or the pipeline is initialized for training.