* Fix docstring for EntityRenderer
* Add warning in displacy if doc.spans are empty
* Implement parse_spans converter
One notable change here is that the default spans_key is sc, and
it's set by the user through the options.
* Implement SpanRenderer
Here, I implemented a SpanRenderer that looks similar to the
EntityRenderer except for some templates. The spans_key, by default, is
set to sc, but can be configured in the options (see parse_spans). The
way I rendered these spans is per-token, i.e., I first check if each
token (1) belongs to a given span type and (2) a starting token of a
given span type. Once I have this information, I render them into the
markup.
* Fix mypy issues on typing
* Add tests for displacy spans support
* Update colors from RGB to hex
Co-authored-by: Ines Montani <ines@ines.io>
* Remove unnecessary CSS properties
* Add documentation for website
* Remove unnecesasry scripts
* Update wording on the documentation
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Put typing dependency on top of file
* Put back z-index so that spans overlap properly
* Make warning more explicit for spans_key
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update universe.json
added classy-classification to Spacy universe
* Update universe.json
added classy-classification to the spacy universe resources
* Update universe.json
corrected a small typo in json
* Update website/meta/universe.json
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/meta/universe.json
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/meta/universe.json
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update universe.json
processed merge feedback
* Update universe.json
* updated information for Classy Classificaiton
Made a more comprehensible and easy description for Classy Classification based on feedback of Philip Vollet to prepare for sharing.
* added note about examples
* corrected for wrong formatting changes
* Update website/meta/universe.json with small typo correction
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* resolved another typo
* Update website/meta/universe.json
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Tagger: use unnormalized probabilities for inference
Using unnormalized softmax avoids use of the relatively expensive exp function,
which can significantly speed up non-transformer models (e.g. I got a speedup
of 27% on a German tagging + parsing pipeline).
* Add spacy.Tagger.v2 with configurable normalization
Normalization of probabilities is disabled by default to improve
performance.
* Update documentation, models, and tests to spacy.Tagger.v2
* Move Tagger.v1 to spacy-legacy
* docs/architectures: run prettier
* Unnormalized softmax is now a Softmax_v2 option
* Require thinc 8.0.14 and spacy-legacy 3.0.9
* Add save_candidates attribute
* Change spancat api
* Add unit test
* reimplement method to produce a list of doc
* Add method to docs
* Add new version tag
* Add intended use to docstring
* prettier formatting
* Partial fix of entity linker batching
* Add import
* Better name
* Add `use_gold_ents` option, docs
* Change to v2, create stub v1, update docs etc.
* Fix error type
Honestly no idea what the right type to use here is.
ConfigValidationError seems wrong. Maybe a NotImplementedError?
* Make mypy happy
* Add hacky fix for init issue
* Add legacy pipeline entity linker
* Fix references to class name
* Add __init__.py for legacy
* Attempted fix for loss issue
* Remove placeholder V1
* formatting
* slightly more interesting train data
* Handle batches with no usable examples
This adds a test for batches that have docs but not entities, and a
check in the component that detects such cases and skips the update step
as thought the batch were empty.
* Remove todo about data verification
Check for empty data was moved further up so this should be OK now - the
case in question shouldn't be possible.
* Fix gradient calculation
The model doesn't know which entities are not in the kb, so it generates
embeddings for the context of all of them.
However, the loss does know which entities aren't in the kb, and it
ignores them, as there's no sensible gradient.
This has the issue that the gradient will not be calculated for some of
the input embeddings, which causes a dimension mismatch in backprop.
That should have caused a clear error, but with numpyops it was causing
nans to happen, which is another problem that should be addressed
separately.
This commit changes the loss to give a zero gradient for entities not in
the kb.
* add failing test for v1 EL legacy architecture
* Add nasty but simple working check for legacy arch
* Clarify why init hack works the way it does
* Clarify use_gold_ents use case
* Fix use gold ents related handling
* Add tests for no gold ents and fix other tests
* Use aligned ents function (not working)
This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.
* Use proper matching ent check
This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.
* Move get_matching_ents to Example
* Use model attribute to check for legacy arch
* Rename flag
* bump spacy-legacy to lower 3.0.9
Co-authored-by: svlandeg <svlandeg@github.com>
* remove duplicate line
* add sent start/end token attributes to the docs
* let has_annotation work with IS_SENT_END
* elif instead of if
* add has_annotation test for sent attributes
* fix typo
* remove duplicate is_sent_start entry in docs
* Added spacy-wrap to universe
Added spacy-wrap to universe a small package for wrapping fine-tuned huggingface transformers to a spacy pipeline following the same API as spacy-transformers. (Currently limited to classification models)
* Update website/meta/universe.json
* Update website/meta/universe.json
* Update website/meta/universe.json
* Update website/meta/universe.json
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clarify Span.ents documentation
Ref: #10135
Retain current behaviour. Span.ents will only include entities within
said span. You can't get tokens outside of the original span.
* Reword docstrings
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update API docs in the website
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix infix as prefix in Tokenizer.explain
Update `Tokenizer.explain` to align with the `Tokenizer` algorithm:
* skip infix matches that are prefixes in the current substring
* Update tokenizer pseudocode in docs
* Support version tags in universe and add note about reporting
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* added new field
* added exception for IOb strings
* minor refinement to schema
* removed field
* fixed typo
* imported numeriacla val
* changed the code bit
* cosmetics
* added test for matcher
* set ents of moc docs
* added invalid pattern
* minor update to documentation
* blacked matcher
* added pattern validation
* add IOB vals to schema
* changed into test
* mypy compat
* cleaned left over
* added compat import
* changed type
* added compat import
* changed literal a bit
* went back to old
* made explicit type
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add link to pattern file info in EntityRuler.initialize docs
* Update website/docs/api/entityruler.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Use Vectors.shape rather than Vectors.data.shape
* Use Vectors.size rather than Vectors.data.size
* Add Vectors.to_ops to move data between different ops
* Add documentation for Vector.to_ops
* add entry for Applied Language Technology under "Courses"
Added the following entry into `universe.json`:
```
{
"type": "education",
"id": "applt-course",
"title": "Applied Language Technology",
"slogan": "NLP for newcomers using spaCy and Stanza",
"description": "These learning materials provide an introduction to applied language technology for audiences who are unfamiliar with language technology and programming. The learning materials assume no previous knowledge of the Python programming language.",
"url": "https://applied-language-technology.readthedocs.io/",
"image": "https://www.mv.helsinki.fi/home/thiippal/images/applt-preview.jpg",
"thumb": "https://applied-language-technology.readthedocs.io/en/latest/_static/logo.png",
"author": "Tuomo Hiippala",
"author_links": {
"twitter": "tuomo_h",
"github": "thiippal",
"website": "https://www.mv.helsinki.fi/home/thiippal/"
},
"category": ["courses"]
},
```
* Update the entry for "Applied Language Technology"
* Fix Scorer.score_cats for missing labels
* Add test case for Scorer.score_cats missing labels
* semantic nitpick
* black formatting
* adjust test to give different results depending on multi_label setting
* fix loss function according to whether or not missing values are supported
* add note to docs
* small fixes
* make mypy happy
* Update spacy/pipeline/textcat.py
Co-authored-by: Florian Cäsar <florian.caesar@pm.me>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>