* Add a dry run flag to download
* Remove --dry-run, add --url option to `spacy info` instead
* Make mypy happy
* Print only the URL, so it's easier to use in scripts
* Don't add the egg hash unless downloading an sdist
* Update spacy/cli/info.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add two implementations of requirements
* Clean up requirements sample slightly
This should make mypy happy
* Update URL help string
* Remove requirements option
* Add url option to docs
* Add URL to spacy info model output, when available
* Add types-setuptools to testing reqs
* Add types-setuptools to requirements
* Add "compatible", expand docstring
* Update spacy/cli/info.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Run prettier on CLI docs
* Update docs
Add a sidebar about finding download URLs, with some examples of the new
command.
* Add download URLs to table on model page
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Updates from review
* download url -> download link
* Update docs
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* adding unit test for spacy.load with disable/exclude string arg
* allow pure strings in from_config
* update docs
* upstream type adjustements
* docs update
* make docstring more consistent
* Update spacy/language.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* two more cleanups
* fix type in internal method
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clean up old Matcher call style related stuff
In v2 Matcher.add was called with (key, on_match, *patterns). In v3 this
was changed to (key, patterns, *, on_match=None), but there were various
points where the old call syntax was documented or handled specially.
This removes all those.
The Matcher itself didn't need any code changes, as it just gives a
generic type error. However the PhraseMatcher required some changes
because it would automatically "fix" the old call style.
Surprisingly, the tokenizer was still using the old call style in one
place.
After these changes tests failed in two places:
1. one test for the "new" call style, including the "old" call style. I
removed this test.
2. deserializing the PhraseMatcher fails because the input docs are a
set.
I am not sure why 2 is happening - I guess it's a quirk of the
serialization format? - so for now I just convert the set to a list when
deserializing. The check that the input Docs are a List in the
PhraseMatcher is a new check, but makes it parallel with the other
Matchers, which seemed like the right thing to do.
* Add notes related to input docs / deserialization type
* Remove Typing import
* Remove old note about call style change
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Use separate method for setting internal doc representations
In addition to the title change, this changes the internal dict to be a
defaultdict, instead of a dict with frequent use of setdefault.
* Add _add_from_arrays for unpickling
* Cleanup around adding from arrays
This moves adding to internal structures into the private batch method,
and removes the single-add method.
This has one behavioral change for `add`, in that if something is wrong
with the list of input Docs (such as one of the items not being a Doc),
valid items before the invalid one will not be added. Also the callback
will not be updated if anything is invalid. This change should not be
significant.
This also adds a test to check failure when given a non-Doc.
* Update spacy/matcher/phrasematcher.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add lang folder for la (Latin)
* Add Latin lang classes
* Add minimal tokenizer exceptions
* Add minimal stopwords
* Add minimal lex_attrs
* Update stopwords, tokenizer exceptions
* Add la tests; register la_tokenizer in conftest.py
* Update spacy/lang/la/lex_attrs.py
Remove duplicate form in Latin lex_attrs
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update natto-py version spec (#11222)
* Update natto-py version spec
* Update setup.cfg
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add scorer to textcat API docs config settings (#11263)
* Update docs for pipeline initialize() methods (#11221)
* Update documentation for dependency parser
* Update documentation for trainable_lemmatizer
* Update documentation for entity_linker
* Update documentation for ner
* Update documentation for morphologizer
* Update documentation for senter
* Update documentation for spancat
* Update documentation for tagger
* Update documentation for textcat
* Update documentation for tok2vec
* Run prettier on edited files
* Apply similar changes in transformer docs
* Remove need to say annotated example explicitly
I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.
* Run prettier on transformer docs
* chore: add 'concepCy' to spacy universe (#11255)
* chore: add 'concepCy' to spacy universe
* docs: add 'slogan' to concepCy
* Support full prerelease versions in the compat table (#11228)
* Support full prerelease versions in the compat table
* Fix types
* adding spans to doc_annotation in Example.to_dict (#11261)
* adding spans to doc_annotation in Example.to_dict
* to_dict compatible with from_dict: tuples instead of spans
* use strings for label and kb_id
* Simplify test
* Update data formats docs
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix regex invalid escape sequences (#11276)
* Add W605 to the errors raised by flake8 in the CI (#11283)
* Clean up automated label-based issue handling (#11284)
* Clean up automated label-based issue handline
1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config
* Use old, longer message
* Fix label name
* Fix Dutch noun chunks to skip overlapping spans (#11275)
* Add test for overlapping noun chunks
* Skip overlapping noun chunks
* Update spacy/tests/lang/nl/test_noun_chunks.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950)
* add in spans example and parse references
* rm autoformatter
* rm extra ents copy
* TypedDict draft
* type fixes
* restore non-documentation files
* docs update
* fix spans example
* fix hyperlinks
* add parse example
* example fix + argument fix
* fix api arg in docs
* fix bad variable replacement
* fix spacing in style
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* fix spacing on table
* fix spacing on table
* rm temp files
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* include span_ruler for default warning filter (#11333)
* Add uk pipelines to website (#11332)
* Check for . in factory names (#11336)
* Make fixes for PR #11349
* Fix roman numeral coverage in #11349
Co-authored-by: Patrick J. Burns <patricks@diyclassics.org>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com>
Co-authored-by: Jules Belveze <32683010+JulesBelveze@users.noreply.github.com>
Co-authored-by: stefawolf <wlf.ste@gmail.com>
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
* Switch to mecab-ko as default Korean tokenizer
Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.
Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.
* Temporarily run tests with mecab-ko tokenizer
* Fix types
* Fix duplicate test names
* Update requirements test
* Revert "Temporarily run tests with mecab-ko tokenizer"
This reverts commit d2083e7044.
* Add mecab_args setting, fix pickle for KoreanNattoTokenizer
* Fix length check
* Update docs
* Formatting
* Update natto-py error message
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents`
* Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents`
* Make `Span.ent_id` an alias of `Span.id` rather than a read-only view
of the root token's `ent_id` annotation
* adding spans to doc_annotation in Example.to_dict
* to_dict compatible with from_dict: tuples instead of spans
* use strings for label and kb_id
* Simplify test
* Update data formats docs
Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update documentation for dependency parser
* Update documentation for trainable_lemmatizer
* Update documentation for entity_linker
* Update documentation for ner
* Update documentation for morphologizer
* Update documentation for senter
* Update documentation for spancat
* Update documentation for tagger
* Update documentation for textcat
* Update documentation for tok2vec
* Run prettier on edited files
* Apply similar changes in transformer docs
* Remove need to say annotated example explicitly
I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.
* Run prettier on transformer docs
* add additional REL_OP
* change to condition and new rel_op symbols
* add operators to docs
* add the anchor while we're in here
* add tests
Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
* Min_max_operators
1. Modified API and Usage for spaCy website to include min_max operator
2. Modified matcher.pyx to include min_max function {n,m} and its variants
3. Modified schemas.py to include min_max validation error
4. Added test cases to test_matcher_api.py, test_matcher_logic.py and test_pattern_validation.py
* attempt to fix mypy/pydantic compat issue
* formatting
* Update spacy/tests/matcher/test_pattern_validation.py
Co-authored-by: Source-Shen <82353723+Source-Shen@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Enable flag on spacy.load: foundation for include, enable arguments.
* Enable flag on spacy.load: fixed tests.
* Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests.
* Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added to default config.
* Enable flag on spacy.load: added support for fields not in pipeline.
* Enable flag on spacy.load: removed serialization fields from supported fields.
* Enable flag on spacy.load: removed 'enable' from config again.
* Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes.
* Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests.
* Enable flag on spacy.load: comments w.r.t. resolution workarounds.
* Enable flag on spacy.load: remove include fields. Update website docs.
* Enable flag on spacy.load: updates w.r.t. changes in master.
* Implement Doc.from_json(): update docstrings.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): remove newline.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): change error message for E1038.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars.
* Enable flag on spacy.load: changed exmples for enable flag.
* Remove newline.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix docstring for Language._resolve_component_status().
* Rename E1038 to E1042.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add SpanRuler component
Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.
* Update spacy/pipeline/span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix cast
* Add self.key property
* Use number of patterns as length
* Remove patterns kwarg from init
* Update spacy/tests/pipeline/test_span_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add options for spans filter and setting to ents
* Add `spans_filter` option as a registered function'
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.
* Update and generalize tests
* Add test for setting doc.ents, fix key property type
* Fix typing
* Allow independent doc.spans and doc.ents
* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_fitler`.
* Use `util.filter_spans` by default as `ents_filter`.
* Use a custom warning if the filter does not work for `doc.ents`.
* Enable use of SpanC.id in Span
* Support id in SpanRuler as Span.id
* Update types
* `id` can only be provided as string (already by `PatternType`
definition)
* Update all uses of Span.id/ent_id in Doc
* Rename Span id kwarg to span_id
* Update types and docs
* Add ents filter to mimic EntityRuler overwrite_ents
* Refactor `ents_filter` to take `entities, spans` args for more
filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
(`spacy.first_longest_spans_filter.v1`) to take any number of
`Iterable[Span]` objects as args so it can be used for spans filter
or ents filter
* Implement future entity ruler as span ruler
Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
customized. (Necessary for the same sorting/filtering as in
`EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
`preserve_existing_ents_filter` to support
`EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
`entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
properties.
Additional changes:
* Move all config settings to top-level attributes to avoid duplicating
settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
casting.)
* Format
* Fix filter make method name
* Refactor to use same error for removing by label or ID
* Also provide existing spans to spans filter
* Support ids property
* Remove token_patterns and phrase_patterns
* Update docstrings
* Add span ruler docs
* Fix types
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Move sorting into filters
* Check for all tokens in seen tokens in entity ruler filters
* Remove registered sort key
* Set Token.ent_id in a backwards-compatible way in Doc.set_ents
* Remove sort options from API docs
* Update docstrings
* Rename entity ruler filters
* Fix and parameterize scoring
* Add id to Span API docs
* Fix typo in API docs
* Include explicit labeled=True for scorer
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update docs for displacy style kwargs
Added "span" to the accepted values for the style kwarg in the displacy.serve and displacy.render top-level functions. These styles are new as of SpaCy 3.3, so I added the "new" tag for that option only
* restored alpha ordering
* Rename to spans_key for consistency
* Implement spans length in debug data
* Implement how span bounds and spans are obtained
In this commit, I implemented how span boundaries (the tokens) around a
given span and spans are obtained. I've put them in the compile_gold()
function so that it's accessible later on. I will do the actual
computation of the span and boundary distinctiveness in the main
function above.
* Compute for p_spans and p_bounds
* Add computation for SD and BD
* Fix mypy issues
* Add weighted average computation
* Fix compile_gold conditional logic
* Add test for frequency distribution computation
* Add tests for kl-divergence computation
* Fix weighted average computation
* Make tables more compact by rounding them
* Add more descriptive checks for spans
* Modularize span computation methods
In this commit, I added the _get_span_characteristics and
_print_span_characteristics functions so that they can be reusable
anywhere.
* Remove unnecessary arguments and make fxs more compact
* Update a few parameter arguments
* Add tests for print_span and get_span methods
* Update API to talk about span characteristics in brief
* Add better reporting of spans_length
* Add test for span length reporting
* Update formatting of span length report
Removed '' to indicate that it's not a string, then
sort the n-grams by their length, not by their frequency.
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Show all frequency distribution when -V
In this commit, I displayed the full frequency distribution of the
span lengths when --verbose is passed. To make things simpler, I
rewrote some of the formatter functions so that I can call them
whenever.
Another notable change is that instead of showing percentages as
Integers, I showed them as floats (max 2-decimal places). I did this
because it looks weird when it displays (0%).
* Update logic on how total is computed
The way the 90% thresholding is computed now is that we keep
adding the percentages until we reach >= 90%. I also updated the wording
and used the term "At least" to denote that >= 90% of your spans have
these distributions.
* Fix display when showing the threshold percentage
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add better phrasing for span information
* Update spacy/cli/debug_data.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add minor edits for whitespaces etc.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>