* initial
* initial documentation run
* fix typo
* Remove mentions of Torchscript and quantization
Both are disabled in the initial release of `spacy-curated-transformers`.
* Fix `piece_encoder` entries
* Remove `spacy-transformers`-specific warning
* Fix duplicate entries in tables
* Doc fixes
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove type aliases
* Fix copy-paste typo
* Change `debug pieces` version tag to `3.7`
* Set curated transformers API version to `3.7`
* Fix transformer listener naming
* Add docs for `init fill-config-transformer`
* Update CLI command invocation syntax
* Update intro section of the pipeline component docs
* Fix source URL
* Add a note to the architectures section about the `init fill-config-transformer` CLI command
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update CLI command name, args
* Remove hyphen from the `curated-transformers.mdx` filename
* Fix links
* Remove placeholder text
* Add text to the model/tokenizer loader sections
* Fill in the `DocTransformerOutput` section
* Formatting fixes
* Add curated transformer page to API docs sidebar
* More formatting fixes
* Remove TODO comment
* Remove outdated info about default config
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add link to HF model hub
* `prettier`
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
SpaCy's HashEmbedCNN layer performs convolutions over tokens to produce
contextualized embeddings using a `MaxoutWindowEncoder` layer. These
convolutions are implemented using Thinc's `expand_window` layer, which
concatenates `window_size` neighboring sequence items on either side of
the sequence item being processed. This is repeated across `depth`
convolutional layers.
For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder`
layer with a context window of 1 and a depth of 2. We'll focus on the
token "C". We can visually represent the contextual embedding produced
for "C" as:
```mermaid
flowchart LR
A0(A<sub>0</sub>)
B0(B<sub>0</sub>)
C0(C<sub>0</sub>)
D0(D<sub>0</sub>)
E0(E<sub>0</sub>)
B1(B<sub>1</sub>)
C1(C<sub>1</sub>)
D1(D<sub>1</sub>)
C2(C<sub>2</sub>)
A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```
Described in words, this graph shows that before the first layer of the
convolution, the "receptive field" centered at each token consists only
of that same token. That is to say, that we have a receptive field of 1.
The first layer of the convolution adds one neighboring token on either
side to the receptive field. Since this is done on both sides, the
receptive field increases by 2, giving the first layer a receptive field
of 3. The second layer of the convolutions adds an _additional_
neighboring token on either side to the receptive field, giving a final
receptive field of 5.
However, this doesn't match the formula currently given in the docs,
which read:
> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.
Substituting in our depth of 2 and window size of 1, this formula gives
us a receptive field of:
```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```
This not only doesn't match our computations from above, it's also an
even number! This is suspicious, since the receptive field is supposed
to be centered on a token, and not between tokens. Generally, this
formula results in an even number for any even value of `depth`.
The error in this formula is that the adjustment for the center token
is multiplied by the depth, when it should occur only once. The
corrected formula, `depth * window_size * 2 + 1`, gives the correct
value for our small example from above:
```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```
These changes update the docs to correct the receptive field formula and
the example receptive field size.
There was a mistake in the regex pattern which caused not matching all the desired tokens. The problem was that when we use r string literal prefix to suppose a raw text, we should not use two backslashes to demonstrate a backslash.
* Support registered vectors
* Format
* Auto-fill [nlp] on load from config and from bytes/disk
* Only auto-fill [nlp]
* Undo all changes to Language.from_disk
* Expand BaseVectors
These methods are needed in various places for training and vector
similarity.
* isort
* More linting
* Only fill [nlp.vectors]
* Update spacy/vocab.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Revert changes to test related to auto-filling [nlp]
* Add vectors registry
* Rephrase error about vocab methods for vectors
* Switch to dummy implementation for BaseVectors.to_ops
* Add initial draft of docs
* Remove example from BaseVectors docs
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/basevectors.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix type and lint bpemb example
* Update website/docs/api/basevectors.mdx
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add data structures to docs
* Adjusted descriptions for more consistency
* Add _optional_ flag to parameters
* Add tests and adjust optional title key in doc
* Add title to dep visualizations
* fix typo
---------
Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
* Add cli for finding locations of registered func
* fixes: naming and typing
* isort
* update naming
* remove to find-function
* remove file:// bit
* use registry name if given and exit gracefully if a registry was not found
* clean up failure msg
* specify registry_name options
* mypy fixes
* return location for internal usage
* add documentation
* more mypy fixes
* clean up example
* add section to menu
* add tests
---------
Co-authored-by: svlandeg <svlandeg@github.com>
* modified: spacy/language.py
- corrected typo in docstring for :method:`Language.replace_listeners`
- added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned
modified: website/docs/api/language.mdx
- corrected typo in `Language.replace_listeners` markdown
* modified: spacy/language.py
- removed noqa comment
---------
Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>
* Fix problem with universe pages using `docker` language
* Fix problem with universe pages using `r` language
* Add fallback, in case code language is unknown
* Support custom token/lexeme attribute for vectors
* Fix imports
* Back off to ORTH without Vectors.attr
* Fallback if vectors.attr doesn't exist
* Update docs
* Add SpanMarker for NER to spaCy universe
* Escape the newlines in the text in the code example
Or at least, attempt to
* Remove now unnecessary import
* Disable NER pipeline component in code example
This is a really odd bug, where Firefox doesn't re-render the `code` element, even though `children` changed.
Two things fixed that:
- remove the `language-ini` `className`
- replace the `code` block with a `div`
Both are not ideal. Therefor this solution adds an inner `div` that now has the classes while still maintaining the semantic `code` element.
I couldn't find any explanation for why this is happening and why it only happens in Firefox. I assume it is a bug caused by one of our many dependencies (or their interplay)
To make matters worse: This bug *doesn't* occure when running the site in dev mode. You have to build and serve the site to recreate it.
* span finder integrated into spacy from experimental
* black
* isort
* black
* default spankey constant
* black
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* rename
* rename
* max_length and min_length as Optional[int] and strict checking
* black
* mypy fix for integer type infinity
* revert line order
* implement all comparison operators for inf int
* avoid two for loops over all docs by not precomputing
* interleave thresholding with span creation
* black
* revert to not interleaving (relized its faster)
* black
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* update dosctring
* enforce that the gold and predicted documents have the same text
* new error for ensuring reference and predicted texts are the same
* remove todo
* adjust test
* black
* handle misaligned tokenization
* return correct variable
* failing overfit test
* only use a single spans_key like in spancat
* black
* remove debug lines
* typo
* remove comment
* remove near duplicate reduntant method
* use the 'spans_key' variable name everywhere
* Update spacy/pipeline/span_finder.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* flaky test fix suggestion, hand set bias terms
* only test suggester and test result exhaustively
* make it clear that the span_finder_suggester is more general (not specific to span_finder)
* Update spacy/tests/pipeline/test_span_finder.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Apply suggestions from code review
* remove question comment
* move preset_spans_suggester test to spancat tests
* Add docs and unify default configs for spancat and span finder
* Add `allow_overlap=True` to span finder scorer
* Fix offset bug in set_annotations
* Ignore labels in span finder scorer
* Format
* Add span_finder to quickstart template
* Move settings to self.cfg, store min/max unset as None
* Remove debugging
* Update docstrings and docs
* Update spacy/pipeline/span_finder.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix imports
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add vetiver to spacy universe
* remove image
* update logo to render correctly in thumbnail
* apply Basil's suggestion
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
* refer to the same model
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
* Add scorer option to return per-component scores
Add `per_component` option to `Language.evaluate` and `Scorer.score` to
return scores keyed by `tokenizer` (hard-coded) or by component name.
Add option to `evaluate` CLI to score by component. Per-component scores
can only be saved to JSON.
* Update help text and messages
* parsigs universe
* added model installation explanation in the description
* Update website/meta/universe.json
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
* added model installement instruction in the code example
---------
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
* Add default to MorphAnalysis.get
Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for
the user to provide a default return value if the field is not found.
The default return value remains `[]`, which is not the same as
`dict.get`, but is already established as this method's default return
value with the return type `List[str]`. However the new `default` option
does not enforce that the user-provided default is actually `List[str]`.
* Restore test case
Replace `progress_bar = "all_steps"` with `progress_bar = "eval"`, which is consistent with the default behavior for `spacy.ConsoleLogger.v1` and `spacy.ConsoleLogger.v2`.
* [wip] Update
* [wip] Update
* Add initial port
* [wip] Update
* Fix all imports
* Add spancat_exclusive to pipeline
* [WIP] Update
* [ci skip] Add breakpoint for debugging
* Use spacy.SpanCategorizer.v1 as default archi
* Update spacy/pipeline/spancat_exclusive.py
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* [ci skip] Small updates
* Use Softmax v2 directly from thinc
* Cache the label map
* Fix mypy errors
However, I ignored line 370 because it opened up a bunch of type errors
that might be trickier to solve and might lead to a more complicated
codebase.
* avoid multiplication with 1.0
Co-authored-by: kadarakos <kadar.akos@gmail.com>
* Update spacy/pipeline/spancat_exclusive.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update component versions to v2
* Add scorer to docstring
* Add _n_labels property to SpanCategorizer
Instead of using len(self.labels) in initialize() I am using a private
property self._n_labels. This achieves implementation parity and allows
me to delete the whole initialize() method for spancat_exclusive (since
it's now the same with spancat).
* Inherit from SpanCat instead of TrainablePipe
This commit changes the inheritance structure of Exclusive_Spancat,
now it's inheriting from SpanCategorizer than TrainablePipe. This
allows me to remove duplicate methods that are already present in
the parent function.
* Revert documentation link to spancat
* Fix init call for exclusive spancat
* Update spacy/pipeline/spancat_exclusive.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Import Suggester from spancat
* Include zero_init.v1 for spancat
* Implement _allow_extra_label to use _n_labels
To ensure that spancat / spancat_exclusive cannot be resized after
initialization, I inherited the _allow_extra_label() method from
spacy/pipeline/trainable_pipe.pyx and used self._n_labels instead
of len(self.labels) for checking.
I think that changing it locally is a better solution rather than
forcing each class that inherits TrainablePipe to use the self._n_labels
attribute.
Also note that I turned-off black formatting in this block of code
because it reads better without the overhang.
* Extend existing tests to spancat_exclusive
In this commit, I extended the existing tests for spancat to include
spancat_exclusive. I parametrized the test functions with 'name'
(similar var name with textcat and textcat_multilabel) for each
applicable test.
TODO: Add overfitting tests for spancat_exclusive
* Update documentation for spancat
* Turn on formatting for allow_extra_label
* Remove initializers in default config
* Use DEFAULT_EXCL_SPANCAT_MODEL
I also renamed spancat_exclusive_default_config into
spancat_excl_default_config because black does some not pretty
formatting changes.
* Update documentation
Update grammar and usage
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clarify docstring for Exclusive_SpanCategorizer
* Remove mypy ignore and typecast labels to list
* Fix documentation API
* Use a single variable for tests
* Update defaults for number of rows
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Put back initializers in spancat config
Whenever I remove model.scorer.init_w and model.scorer.init_b,
I encounter an error in the test:
SystemError: <method '__getitem__' of 'dict' objects> returned a result
with an error set.
My Thinc version is 8.1.5, but I can't seem to check what's causing the
error.
* Update spancat_exclusive docstring
* Remove init_W and init_B parameters
This commit is expected to fail until the new Thinc release.
* Require thinc>=8.1.6 for serializable Softmax defaults
* Handle zero suggestions to make tests pass
I'm not sure if this is the most elegant solution. But what should
happen is that the _make_span_group function MUST return an empty
SpanGroup if there are no suggestions.
The error happens when the 'scores' variable is empty. We cannot
get the 'predicted' and other downstream vars.
* Better approach for handling zero suggestions
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spancategorizer headers
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add default value in negative_weight in docs
* Add default value in allow_overlap in docs
* Update how spancat_exclusive is constructed
In this commit, I added the following:
- Put the default values of negative_weight and allow_overlap
in the default_config dictionary.
- Rename make_spancat -> make_exclusive_spancat
* Run prettier on spancategorizer.mdx
* Change exactly one -> at most one
* Add suggester documentation in Exclusive_SpanCategorizer
* Add suggester to spancat docstrings
* merge multilabel and singlelabel spancat
* rename spancat_exclusive to singlelable
* wire up different make_spangroups for single and multilabel
* black
* black
* add docstrings
* more docstring and fix negative_label
* don't rely on default arguments
* black
* remove spancat exclusive
* replace single_label with add_negative_label and adjust inference
* mypy
* logical bug in configuration check
* add spans.attrs[scores]
* single label make_spangroup test
* bugfix
* black
* tests for make_span_group with negative labels
* refactor make_span_group
* black
* Update spacy/tests/pipeline/test_spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* remove duplicate declaration
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* raise error instead of just print
* make label mapper private
* update docs
* run prettier
* Update website/docs/api/spancategorizer.mdx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.mdx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* don't keep recomputing self._label_map for each span
* typo in docs
* Intervals to private and document 'name' param
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* add Tag to new features
* replace tags
* revert
* revert
* revert
* revert
* Update website/docs/api/spancategorizer.mdx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.mdx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* prettier
* Fix merge
* Update website/docs/api/spancategorizer.mdx
* remove references to 'single_label'
* remove old paragraph
* Add spancat_singlelabel to config template
* Format
* Extend init config tests
---------
Co-authored-by: kadarakos <kadar.akos@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Make empty_kb() configurable.
* Format.
* Update docs.
* Be more specific in KB serialization test.
* Update KB serialization tests. Update docs.
* Remove doc update for batched candidate generation.
* Fix serialization of subclassed KB in tests.
* Format.
* Update docstring.
* Update docstring.
* Switch from pickle to json for custom field serialization.
* Add immediate left/right child/parent dependency relations
* Add tests for new REL_OPs: `>+`, `>-`, `<+`, and `<-`.
---------
Co-authored-by: Tan Long <tanloong@foxmail.com>
* Fix FUZZY operator definition
The default length of the FUZZY operator is 2 and not 3.
* adjust edit distance in matcher usage docs too
---------
Co-authored-by: svlandeg <svlandeg@github.com>
* Add span_id to Span.char_span, update Doc/Span.char_span docs
`Span.char_span(id=)` should be removed in the future.
* Also use Union[int, str] in Doc docstring
* Add `spacy.PlainTextCorpusReader.v1`
This is a corpus reader that reads plain text corpora with the following
format:
- UTF-8 encoding
- One line per document.
- Blank lines are ignored.
It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.
* Update spacy/training/corpus.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* docs: add version to `PlainTextCorpus`
* Add docstring to registry function
* Add plain text corpus tests
* Only strip newline/carriage return
* Add return type _string_to_tmp_file helper
* Use a temporary directory in place of file name
Different OS auto delete/sharing semantics are just wonky.
* This will be new in 3.5.1 (rather than 4)
* Test improvements from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Rename CSS class to make use more clear
* Rename component prop to improve code readability
* Fix `aria-hidden` directly on a link element
This link wouldn't have been clickable by screenreaders
* Refactor component
This removes a unnessary `div` and a duplicate link
Co-authored-by: Ines Montani <ines@ines.io>
Originally introduced in 62b9c9c6d7
Original error: Warning: Invalid DOM property `class`. Did you mean `className`?
React doesn't have `class`, it uses `className`.