* Update docs for displacy style kwargs
Added "span" to the accepted values for the style kwarg in the displacy.serve and displacy.render top-level functions. This style is new as of spaCy 3.3, so I added the "new" tag for that option only.
* restored alpha ordering
* Rename to spans_key for consistency
* Implement spans length in debug data
* Implement how span bounds and spans are obtained
In this commit, I implemented how spans and span boundaries (the tokens
around a given span) are obtained. I've put them in the compile_gold()
function so that they're accessible later on. I will do the actual
computation of the span and boundary distinctiveness in the main
function above.
* Compute for p_spans and p_bounds
* Add computation for SD and BD
* Fix mypy issues
* Add weighted average computation
* Fix compile_gold conditional logic
* Add test for frequency distribution computation
* Add tests for kl-divergence computation
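The frequency-distribution and KL-divergence steps named above can be sketched in plain Python. This is a minimal illustration, not the code in debug_data.py: the helper names `frequency_distribution` and `kl_divergence`, the toy corpus, and the choice to skip tokens with zero probability in the reference distribution are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def frequency_distribution(tokens: Iterable[str]) -> Dict[str, float]:
    """Normalize token counts into a probability distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q), summed over tokens that have nonzero probability in both."""
    total = 0.0
    for token, p_i in p.items():
        q_i = q.get(token, 0.0)
        # Assumption: terms where q is zero are skipped rather than smoothed.
        if p_i > 0 and q_i > 0:
            total += p_i * math.log(p_i / q_i)
    return total


# Toy data (hypothetical): all corpus tokens vs. tokens found inside spans.
corpus = "the cat sat on the mat near the red door".split()
span_tokens = ["red", "door"]

p_corpus = frequency_distribution(corpus)
p_spans = frequency_distribution(span_tokens)
sd = kl_divergence(p_spans, p_corpus)
print(round(sd, 4))
```

A higher value means the tokens inside spans are distributed less like the corpus as a whole; the same function applied to boundary tokens would give the boundary counterpart.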
* Fix weighted average computation
* Make tables more compact by rounding them
* Add more descriptive checks for spans
* Modularize span computation methods
In this commit, I added the _get_span_characteristics and
_print_span_characteristics functions so that they can be reusable
anywhere.
* Remove unnecessary arguments and make functions more compact
* Update a few parameter arguments
* Add tests for print_span and get_span methods
* Update API to talk about span characteristics in brief
* Add better reporting of spans_length
* Add test for span length reporting
* Update formatting of span length report
Removed the '' quotes to indicate that it's not a string, then
sorted the n-grams by their length, not by their frequency.
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Show all frequency distribution when -V
In this commit, I displayed the full frequency distribution of the
span lengths when --verbose is passed. To make things simpler, I
rewrote some of the formatter functions so that I can call them
whenever.
Another notable change is that instead of showing percentages as
integers, I showed them as floats (max two decimal places). I did this
because it looks weird when it displays (0%).
* Update logic on how total is computed
The way the 90% thresholding is computed now is that we keep
adding the percentages until we reach >= 90%. I also updated the wording
and used the term "At least" to denote that >= 90% of your spans have
these distributions.
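The thresholding described above can be sketched as follows. This is a rough illustration only: the function name `lengths_covering`, the toy data, and the choice to accumulate the most frequent lengths first are assumptions, since the log does not state the accumulation order.

```python
from collections import Counter
from typing import Dict, List


def lengths_covering(span_lengths: List[int], threshold: float = 90.0) -> Dict[int, float]:
    """Add up length percentages until the running total reaches the threshold."""
    counts = Counter(span_lengths)
    total = sum(counts.values())
    covered: Dict[int, float] = {}
    running = 0.0
    # Assumption: the most frequent lengths are added first.
    for length, count in counts.most_common():
        percentage = count * 100 / total
        covered[length] = round(percentage, 2)  # floats with max two decimals
        running += percentage
        if running >= threshold:
            break
    return covered


# Toy data (hypothetical): 100 spans with lengths 1 to 4.
lengths = [1] * 50 + [2] * 30 + [3] * 15 + [4] * 5
report = lengths_covering(lengths)

# Display sorted by span length rather than by frequency.
for length, percentage in sorted(report.items()):
    print(f"{length} ({percentage}%)")
```

Because the loop stops only once the total is at or above the threshold, the reported lengths cover "at least" 90% of the spans, which matches the wording change in this commit.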
* Fix display when showing the threshold percentage
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add better phrasing for span information
* Update spacy/cli/debug_data.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add minor edits for whitespaces etc.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Allow assets to be optional in spacy project: draft for optional flag/download_all options.
* Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality.
* Allow assets to be optional in spacy project: renamed --all to --extra.
* Allow assets to be optional in spacy project: included optional flag in project config test.
* Allow assets to be optional in spacy project: added documentation.
* Allow assets to be optional in spacy project: fixing deprecated --all reference.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Allow assets to be optional in spacy project: fixed project_assets() docstring.
* Allow assets to be optional in spacy project: adjusted wording in justification of optional assets.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: switched to as keyword in project.yml. Updated docs.
* Allow assets to be optional in spacy project: updated comment.
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add v1 and v2 tests for tok2vec architectures
* textcat architectures are not "layers"
* test older textcat architectures
* test older parser architecture
* Document different ways to create a pipeline: moved up/slightly modified paragraph on pipeline creation.
* Document different ways to create a pipeline: changed Finnish to Ukrainian in example for language without trained pipeline.
* Document different ways to create a pipeline: added explanation of blank pipeline.
* Document different ways to create a pipeline: exchanged Ukrainian with Yoruba.
* signing contributor agreement
* adding new content to the spaCy universe
* updating outdated example code
* resolving issues for the PR
* resolve review for klayers
* remove contributor-agreement file from the PR
* Update code example of spaCySentiWS
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy-sentiws code example
Co-authored-by: schaeran <schaeran1994@gmail.com>
Co-authored-by: schaeran <schaeran@explosion.ai>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* added crosslingual coreference to spacy universe
* Updated example to demonstrate batching.
Co-authored-by: David Berenstein <david.berenstein@pandoraintelligence.com>
* Add initial design for diff command
For now, the diffing process looks like this:
- The default config is created based on some values in the user
config (e.g. which pipeline components were used, the lang, etc.)
- The user must manually specify whether it was optimized for accuracy or
efficiency and whether pretraining was involved.
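As a rough illustration of the comparison step, the standard-library difflib module can diff a default config against a user config. This is only a sketch under assumptions: the log does not say how the command computes its diff, and both config snippets below are made up for the example.

```python
import difflib

# Hypothetical default config derived from the user's lang and pipeline.
default_config = """[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]

[training]
max_steps = 20000
"""

# Hypothetical user config with one modified training setting.
user_config = """[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]

[training]
max_steps = 5000
"""

diff_lines = list(
    difflib.unified_diff(
        default_config.splitlines(),
        user_config.splitlines(),
        fromfile="default",
        tofile="user",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

Only the changed setting appears with `-` and `+` prefixes, which is the kind of output that is easy to paste into an issue.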
* Make diff command structure similar to siblings
* Include gpu as a user option for CLI
* Make variables more explicit
* Fix type declaration for optimize enum
* Improve docstrings for diff CLI
* Add debug-diff to website API docs
* Switch position of configs so that user config is modded
* Add markdown flag for debug diff
This commit adds a --markdown (--md) flag that allows easier
copy-pasting to GitHub issues. Please note that this commit is dependent
on an unreleased version of wasabi (for the time being).
For posterity, the related PR is found here: https://github.com/ines/wasabi/pull/20
* Bump version of wasabi to 0.9.1
So that we can use the add_symbols parameter.
* Apply suggestions from code review
Co-authored-by: Ines Montani <ines@ines.io>
* Update docs based on code review suggestions
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Change command name from diff -> diff-config
* Clarify when options are relevant or not
* Rerun prettier on cli.md
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects
* Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap
* Added a method to efficiently merge SpanGroups
* Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge
* Renamed merge to concat and added missing things to documentation
* Added operator+ and operator += in the documentation
* Added a test for Doc deallocation
* Update spacy/tokens/span_group.pyx
* Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal function
* Fixed typos in SpanGroup documentation
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation
* SpanGroup: moved repetitive list index check/adjustment in a separate function
* Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove formatting that hurts readability spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Support more internal methods for SpanGroup
Add support for:
* `__setitem__`
* `__delitem__`
* `__iadd__`: for `SpanGroup` or `Iterable[Span]`
* `__add__`: for `SpanGroup` only
Adapted from #9698 with the scope limited to the magic methods.
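The split between the four methods can be illustrated with a toy container. `ToySpanGroup` and the `(start, end)` tuple used in place of a real `Span` are hypothetical stand-ins, not the Cython implementation; the sketch only mirrors the stated rule that `__iadd__` takes a group or any iterable of spans while `__add__` takes a group only.

```python
from typing import Iterable, Iterator, List, Tuple

Span = Tuple[int, int]  # hypothetical stand-in: (start, end) token offsets


class ToySpanGroup:
    def __init__(self, spans: Iterable[Span] = ()) -> None:
        self._spans: List[Span] = list(spans)

    def __len__(self) -> int:
        return len(self._spans)

    def __iter__(self) -> Iterator[Span]:
        return iter(self._spans)

    def __getitem__(self, i: int) -> Span:
        return self._spans[i]

    def __setitem__(self, i: int, span: Span) -> None:
        self._spans[i] = span

    def __delitem__(self, i: int) -> None:
        del self._spans[i]

    def __add__(self, other: "ToySpanGroup") -> "ToySpanGroup":
        # Only another group is accepted; anything else raises TypeError.
        if not isinstance(other, ToySpanGroup):
            return NotImplemented
        return ToySpanGroup(self._spans + other._spans)

    def __iadd__(self, other: Iterable[Span]) -> "ToySpanGroup":
        # Accepts a group or any iterable of spans and extends in place.
        self._spans.extend(other)
        return self


group = ToySpanGroup([(0, 2), (3, 5)])
group[0] = (0, 1)          # __setitem__
del group[1]               # __delitem__
group += [(4, 6)]          # __iadd__ with a plain list
combined = group + ToySpanGroup([(7, 9)])  # __add__ with another group
print(list(combined))
```

Keeping `__add__` stricter than `__iadd__` means the result type of `a + b` is never ambiguous, while in-place extension stays convenient.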
* Use v3.3 as new version in docs
* Add new tag to SpanGroup.copy in API docs
* Remove duplicate import
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remaining suggestions and formatting
Co-authored-by: nrodnova <nrodnova@hotmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>
* Add vector deduplication
* Add `Vocab.deduplicate_vectors()`
* Always run deduplication in `spacy init vectors`
* Clean up a few vector-related error messages and docs examples
* Always unique with numpy
* Fix types
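The idea behind deduplication can be sketched without spaCy: identical rows are stored once and every key is remapped to the surviving row. The function `deduplicate_rows` and the toy table are hypothetical, and this pure-Python version stands in for the numpy-based approach mentioned above; it is not the body of `Vocab.deduplicate_vectors()`.

```python
from typing import Dict, List, Sequence, Tuple


def deduplicate_rows(
    rows: Sequence[Sequence[float]], key_to_row: Dict[str, int]
) -> Tuple[List[Tuple[float, ...]], Dict[str, int]]:
    """Keep one copy of each distinct row and remap keys to the kept copies."""
    unique_rows: List[Tuple[float, ...]] = []
    seen: Dict[Tuple[float, ...], int] = {}
    old_to_new: List[int] = []
    for row in rows:
        as_tuple = tuple(row)
        if as_tuple not in seen:
            seen[as_tuple] = len(unique_rows)
            unique_rows.append(as_tuple)
        old_to_new.append(seen[as_tuple])
    # Point every key at the new index of its (possibly shared) row.
    remapped = {key: old_to_new[i] for key, i in key_to_row.items()}
    return unique_rows, remapped


# Toy table (hypothetical): rows 0 and 2 are identical.
rows = [[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]]
key_to_row = {"cat": 0, "dog": 1, "kitty": 2}

unique_rows, remapped = deduplicate_rows(rows, key_to_row)
print(unique_rows)
print(remapped)
```

Lookups by key return the same vector values before and after, while the table itself shrinks by the number of duplicate rows.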
* Add edit tree lemmatizer
Co-authored-by: Daniël de Kok <me@danieldk.eu>
* Hide edit tree lemmatizer labels
* Use relative imports
* Switch to single quotes in error message
* Type annotation fixes
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Reformat edit_tree_lemmatizer with black
* EditTreeLemmatizer.predict: take Iterable
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Validate edit trees during deserialization
This change also changes the serialized representation. Rather than
mirroring the deep C structure, we use a simple flat union of the match
and substitution node types.
* Move edit_trees to _edit_tree_internals
* Fix invalid edit tree format error message
* edit_tree_lemmatizer: remove outdated TODO comment
* Rename factory name to trainable_lemmatizer
* Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14
* Switch to Tagger.v2
* Add documentation for EditTreeLemmatizer
* docs: Fix 3.2 -> 3.3 somewhere
* trainable_lemmatizer documentation fixes
* docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update universe.json
added classy-classification to spaCy universe
* Update universe.json
added classy-classification to the spaCy universe resources
* Update universe.json
corrected a small typo in json
* Update website/meta/universe.json
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/meta/universe.json
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/meta/universe.json
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update universe.json
processed merge feedback
* Update universe.json
* updated information for Classy Classification
Wrote a clearer, simpler description for Classy Classification based on feedback from Philip Vollet to prepare for sharing.
* added note about examples
* corrected for wrong formatting changes
* Update website/meta/universe.json with small typo correction
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* resolved another typo
* Update website/meta/universe.json
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* added Concise Concepts package to spaCy universe.
* updated example code Concise Concepts
* updated description for Concise Concepts
* updated PR with more visually appealing examples
Shout-out to koaning for the suggestions.
* corrected small JSON typos in concise concepts
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add tokenizer option to allow Matcher handling for all rules
Add tokenizer option `with_faster_rules_heuristics` that determines
whether the special cases applied by the internal `Matcher` are filtered
by whether they contain affixes or space. If `True` (default), the rules
are filtered to prioritize speed over rare edge cases. If `False`, all
rules are included in the final `Matcher`-based pass over the doc.
* Reset all caches when reloading special cases
* Revert "Reset all caches when reloading special cases"
This reverts commit 4ef6bd171d.
* Initialize max_length properly
* Add new tag to API docs
* Rename to faster heuristics
* Fix docstring for EntityRenderer
* Add warning in displacy if doc.spans are empty
* Implement parse_spans converter
One notable change here is that the default spans_key is sc, and
it can be overridden by the user through the options.
* Implement SpanRenderer
Here, I implemented a SpanRenderer that looks similar to the
EntityRenderer except for some templates. The spans_key, by default, is
set to sc, but can be configured in the options (see parse_spans). The
spans are rendered per token, i.e., I first check whether each
token (1) belongs to a given span type and (2) is the starting token of a
given span type. Once I have this information, I render them into the
markup.
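The per-token check can be sketched independently of the renderer. The tuple format for spans, the function name `per_token_spans`, and the output dictionary layout are assumptions for illustration; the real renderer works on parsed doc data and HTML templates that are not shown in this log.

```python
from typing import Dict, List, Tuple

# Hypothetical span format: (start_token, end_token_exclusive, label)
SpanTuple = Tuple[int, int, str]


def per_token_spans(tokens: List[str], spans: List[SpanTuple]) -> List[Dict]:
    """For each token, list the labels covering it and whether it starts each span."""
    result = []
    for i, text in enumerate(tokens):
        entities = [
            {"label": label, "is_start": i == start}
            for start, end, label in spans
            if start <= i < end  # (1) token belongs to this span
        ]
        result.append({"text": text, "entities": entities})
    return result


tokens = ["Welcome", "to", "the", "Bank", "of", "China"]
spans = [(3, 6, "ORG"), (5, 6, "GPE")]
per_token = per_token_spans(tokens, spans)

for item in per_token:
    print(item)
```

Because each token carries every label that covers it, overlapping spans such as the two ending on "China" can be drawn as stacked lines, with the label text shown only where `is_start` is true.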
* Fix mypy issues on typing
* Add tests for displacy spans support
* Update colors from RGB to hex
Co-authored-by: Ines Montani <ines@ines.io>
* Remove unnecessary CSS properties
* Add documentation for website
* Remove unnecessary scripts
* Update wording on the documentation
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Put typing dependency on top of file
* Put back z-index so that spans overlap properly
* Make warning more explicit for spans_key
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Tagger: use unnormalized probabilities for inference
Using unnormalized softmax avoids use of the relatively expensive exp function,
which can significantly speed up non-transformer models (e.g. I got a speedup
of 27% on a German tagging + parsing pipeline).
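Skipping normalization is safe at inference time because softmax preserves the ordering of the scores, so the highest raw score and the highest probability always belong to the same class. The snippet below is a small plain-Python check of that property, with made-up scores and helper names; it is not the thinc implementation.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax; needs one exp call per class."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical raw scores for four tags on a single token.
scores = [1.2, -0.3, 3.1, 0.7]

predicted_from_scores = argmax(scores)
predicted_from_probs = argmax(softmax(scores))
assert predicted_from_scores == predicted_from_probs
print(predicted_from_scores)
```

The normalized probabilities are still needed when a loss or a calibrated confidence is computed, which is why normalization is configurable rather than removed outright.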
* Add spacy.Tagger.v2 with configurable normalization
Normalization of probabilities is disabled by default to improve
performance.
* Update documentation, models, and tests to spacy.Tagger.v2
* Move Tagger.v1 to spacy-legacy
* docs/architectures: run prettier
* Unnormalized softmax is now a Softmax_v2 option
* Require thinc 8.0.14 and spacy-legacy 3.0.9
* Add save_candidates attribute
* Change spancat api
* Add unit test
* reimplement method to produce a list of doc
* Add method to docs
* Add new version tag
* Add intended use to docstring
* prettier formatting