* fixing argument order for rehearse
* rehearse test for ner and tagger
* rehearse bugfix
* added test for parser
* test for multilabel textcat
* rehearse fix
* remove debug line
* Update spacy/tests/training/test_rehearse.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/tests/training/test_rehearse.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Make core projectivization methods cdef nogil
While profiling the parser, I noticed that a relatively large share of
time is spent in projectivization. This change rewrites the functions in
the core loops as cdef nogil for efficiency.
On the C++ side, we use vector in place of Python lists, and absent heads
are represented as -1 in place of None.
* _heads_to_c: add assertion
Validation should be performed by the caller, but this assertion ensures that
we are not reading/writing out of bounds with incorrect input.
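
A minimal sketch of the head representation described above, written in plain Python rather than Cython. The names `heads_to_c` and `is_nonproj_arc` are hypothetical, and the real code works on C++ vectors under `nogil`; this only illustrates the -1 convention and the bounds assertion.

```python
from typing import List, Optional


def heads_to_c(heads: List[Optional[int]]) -> List[int]:
    """Convert Python heads (None = absent) to ints (-1 = absent)."""
    c_heads = []
    for head in heads:
        if head is None:
            c_heads.append(-1)
        else:
            # Guard against reading/writing out of bounds on bad input.
            assert 0 <= head < len(heads)
            c_heads.append(head)
    return c_heads


def is_nonproj_arc(token: int, heads: List[int]) -> bool:
    """An arc head -> token is non-projective if some token strictly
    between the two is not a descendant of the head."""
    head = heads[token]
    if head == -1 or head == token:
        return False
    start, end = (head + 1, token) if head < token else (token + 1, head)
    for k in range(start, end):
        ancestor = k
        steps = 0
        # Climb towards the root; the step cap protects against cycles.
        while ancestor != -1 and ancestor != head and steps < len(heads):
            parent = heads[ancestor]
            if parent == ancestor:  # self-headed root: head not reached
                ancestor = -1
                break
            ancestor = parent
            steps += 1
        if ancestor != head:
            return True
    return False
```

In this sketch a root token is assumed to point at itself, which is one common convention and not necessarily the one used in the parser.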
* Fix NER check in CoNLL-U converter
Leave ents unset if no NER annotation is found in the MISC column.
* Revert to global rather than per-sentence NER check
* Update spacy/training/converters/conllu_to_docs.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Add whitespace augmenter that inserts a single whitespace token into a
doc containing annotation used in core trained pipelines.
Add a combined augmenter that handles lowercasing, orth variants and
whitespace augmentation.
* Extended list of numbers for ru language
Extended the list of numbers with all forms and cases, including short forms, slang variants and Roman numerals.
* Update lex_attrs.py
* Update 'like_num' function with percentages
Added support for numbers with percentages, such as 12% and 1.2%, to the 'like_num' function.
* black formatting
Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
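
A rough sketch of how a `like_num` check could accept percentages such as 12% and 1.2%. This is not the code from `lex_attrs.py`; the small `_NUM_WORDS` set and the exact stripping rules are assumptions made for illustration.

```python
# Hypothetical, tiny stand-in for the full list of Russian number words.
_NUM_WORDS = {"один", "два", "три"}


def like_num(text: str) -> bool:
    # Drop a leading sign so that "-5" and "+5" are treated like "5".
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Drop a trailing percent sign so that "12%" is treated like "12".
    if text.endswith("%"):
        text = text[:-1]
    # Ignore thousands separators and decimal points.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    return text.lower() in _NUM_WORDS
```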
* Extend list of abbreviations for ru language
Extended the list of abbreviations for the ru language that may influence tokenization.
* black formatting
Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
* Delay loading of mecab in Korean tokenizer
Delay loading of mecab until the tokenizer is called the first time so
that it's possible to initialize a blank `ko` pipeline without having
mecab installed, e.g. for use with `spacy init vectors`.
* Move mecab import back to __init__
Move mecab import back to __init__ to warn users at the same point as
before for missing python dependencies.
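
The delayed-loading idea can be sketched with a lazily created backend. `LazyTokenizer` and `load_backend` are hypothetical names, and `str.split` stands in for mecab here; the point is only that constructing the tokenizer does not load the backend, while the first call does.

```python
class LazyTokenizer:
    def __init__(self, loader):
        # Store how to build the backend, but do not build it yet.
        self._loader = loader
        self._backend = None

    @property
    def backend(self):
        # Build the backend on first access and reuse it afterwards.
        if self._backend is None:
            self._backend = self._loader()
        return self._backend

    def __call__(self, text):
        return self.backend(text)


calls = []


def load_backend():
    # Stand-in for an expensive or optional dependency.
    calls.append("loaded")
    return str.split


tok = LazyTokenizer(load_backend)
loaded_before = len(calls)  # nothing loaded at construction time
tokens = tok("a b")
tok("c d")
loaded_after = len(calls)  # loaded exactly once, on the first call
```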
* remove duplicate line
* add sent start/end token attributes to the docs
* let has_annotation work with IS_SENT_END
* elif instead of if
* add has_annotation test for sent attributes
* fix typo
* remove duplicate is_sent_start entry in docs
* Setup debug data for spancat
* Add check for missing labels
* Add low-level data warning error
* Improve logic when compiling the gold train data
* Implement check for negative examples
* Remove breakpoint
* Remove ws_ents and missing entity checks
* Fix mypy errors
* Make variable name spans_key consistent
* Rename pipeline -> component for consistency
* Account for missing labels per spans_key
* Cleanup variable names for consistency
* Improve brevity of conditional statements
* Remove unused variables
* Include spans_key as an argument for _get_examples
* Add a conditional check for spans_key
* Update spancat debug data based on new API
- Instead of using _get_labels_from_model(), I'm now using
_get_labels_from_spancat() (cf. https://github.com/explosion/spaCy/pull/10079)
- The way information is displayed was also changed (text -> table)
* Rename model_labels to ensure mypy works
* Update wording on warning messages
Use "span type" instead of "entity type" in wording the warning messages.
This is because Spans aren't necessarily entities.
* Update component type into a Literal
This is to make it clear that the component parameter should only accept
either 'spancat' or 'ner'.
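
As an illustration of restricting a parameter to two values, here is a small sketch using `typing.Literal` (Python 3.8+). The alias `ComponentName` and the helper `check_component` are hypothetical and not taken from `debug_data.py`; a `Literal` annotation is only checked by static tools such as mypy, so the runtime check is added separately.

```python
from typing import Literal, get_args

# Hypothetical alias: only these two component names are accepted.
ComponentName = Literal["spancat", "ner"]


def check_component(component: str) -> str:
    # Literal is not enforced at runtime, so validate explicitly.
    if component not in get_args(ComponentName):
        raise ValueError(f"Unsupported component: {component!r}")
    return component


try:
    check_component("tagger")
    rejected = False
except ValueError:
    rejected = True
```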
* Update checks to include actual model span_keys
Instead of looking at everything in the data, we only check those
span_keys from the actual spancat component. Instead of filtering
inside the for-loop, I made another dictionary,
data_labels_in_component, to hold this value.
* Update spacy/cli/debug_data.py
* Show label counts only when verbose is True
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix debug data check for ents that cross sents
* Use aligned sent starts to have the same indices for the NER and sent
start annotation
* Add a temporary, insufficient hack for the case where a
sentence-initial reference token is split into multiple tokens in the
predicted doc, since `Example.get_aligned("SENT_START")` currently
aligns `True` to all the split tokens.
* Improve test example
* Use Example.get_aligned_sent_starts
* Add test for crossing entity
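
A simplified sketch of the crossing check, assuming entities are given as `(start, end)` token offsets with an exclusive end, and that the sentence starts are a list of `True`/`False`/`None` values sharing the same token indices. The function name `ents_crossing_sents` is hypothetical, and the split-token hack mentioned above is not modeled.

```python
from typing import List, Optional, Tuple


def ents_crossing_sents(
    ents: List[Tuple[int, int]], sent_starts: List[Optional[bool]]
) -> List[Tuple[int, int]]:
    crossing = []
    for start, end in ents:
        # The first token of an entity may start a sentence; any later
        # token that starts a sentence means the entity crosses a boundary.
        if any(sent_starts[i] is True for i in range(start + 1, end)):
            crossing.append((start, end))
    return crossing
```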
* Auto-format code with black
* add black requirement to dev dependencies and pin to 22.x
* ignore black dependency for comparison with setup.cfg
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
So that overriding `paths.vectors` works consistently in generated
configs, set vectors model in `paths.vectors` and always refer to this
path in `initialize.vectors`.
Remove exception for whitespace tokens in `Example.get_aligned` so that
annotation on whitespace tokens is aligned in the same way as for
non-whitespace tokens.
* Clarify Span.ents documentation
Ref: #10135
Retain current behaviour. Span.ents will only include entities within
said span. You can't get tokens outside of the original span.
* Reword docstrings
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update API docs in the website
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* This comma was most probably left out unintentionally, leading to string concatenation between the two consecutive lines. This issue was found automatically using a regular expression.
* This comma was most probably left out unintentionally, leading to string concatenation between the two consecutive lines. This issue was found automatically using a regular expression.
* Fix infix as prefix in Tokenizer.explain
Update `Tokenizer.explain` to align with the `Tokenizer` algorithm:
* skip infix matches that are prefixes in the current substring
* Update tokenizer pseudocode in docs
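
The infix handling described above can be sketched in plain Python as follows. `split_infixes` and the `INFIX` pattern are hypothetical and only mirror the idea: an infix match at the very start of the current substring is skipped rather than split off, so its text stays attached to the next piece.

```python
import re
from typing import List

INFIX = re.compile(r"-")  # toy infix pattern, for illustration only


def split_infixes(substring: str, infix_re: "re.Pattern[str]") -> List[str]:
    pieces = []
    offset = 0
    for match in infix_re.finditer(substring):
        # Skip infix matches that are prefixes in the current substring.
        if offset == 0 and match.start() == 0:
            continue
        if substring[offset:match.start()]:
            pieces.append(substring[offset:match.start()])
        if substring[match.start():match.end()]:
            pieces.append(substring[match.start():match.end()])
        offset = match.end()
    # Whatever is left after the last infix is a final piece.
    if substring[offset:]:
        pieces.append(substring[offset:])
    return pieces
```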
* Improve typing hints for Matcher.__call__
* Add typing hints for DependencyMatcher
* Add typing hints to underscore extensions
* Update Doc.tensor type (requires numpy 1.21)
* Fix typing hints for Language.component decorator
* Use generic np.ndarray type in Doc to avoid numpy version update
* Fix mypy errors
* Fix cyclic import caused by Underscore typing hints
* Use Literal type from spacy.compat
* Update matcher.pyi import format
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Instead of running the actual suggester, which may require annotation
from annotating components that is not necessarily present in the
reference docs, use the built-in 1-gram suggester.
* added iob to int
* added tests
* added iob strings
* added error
* blacked attrs
* Update spacy/tests/lang/test_attrs.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/attrs.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* added iob strings as global
* minor refinement with iob
* removed iob strings from token
* changed to uppercase
* cleaned and went back to master version
* imported iob from attrs
* Update and format errors
* Support and test both str and int ENT_IOB key
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
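
A small sketch of accepting both string and integer IOB values. It assumes the integer encoding documented for `Token.ent_iob` (0 = unset, 1 = I, 2 = O, 3 = B); the helper name `iob_to_int` and the error message are hypothetical.

```python
# Index in this tuple is the integer code: 0 = unset, 1 = I, 2 = O, 3 = B.
IOB_STRINGS = ("", "I", "O", "B")


def iob_to_int(value):
    # Integers are accepted as-is when they are valid codes.
    if isinstance(value, int) and not isinstance(value, bool):
        if 0 <= value < len(IOB_STRINGS):
            return value
    # Strings are mapped to their position in IOB_STRINGS.
    elif isinstance(value, str) and value in IOB_STRINGS:
        return IOB_STRINGS.index(value)
    raise ValueError(f"Invalid IOB value: {value!r}")
```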
* added new field
* added exception for IOB strings
* minor refinement to schema
* removed field
* fixed typo
* imported numerical val
* changed the code bit
* cosmetics
* added test for matcher
* set ents of mock docs
* added invalid pattern
* minor update to documentation
* blacked matcher
* added pattern validation
* add IOB vals to schema
* changed into test
* mypy compat
* cleaned left over
* added compat import
* changed type
* added compat import
* changed literal a bit
* went back to old
* made explicit type
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Determine labels by factory name in debug data
For all components, return labels for all components with the
corresponding factory name rather than for only the default name.
For `spancat`, return labels as a dict keyed by `spans_key`.
* Refactor for typing
* Add test
* Use assert instead of cast, removed unneeded arg
* Mark test as slow
* Use Vectors.shape rather than Vectors.data.shape
* Use Vectors.size rather than Vectors.data.size
* Add Vectors.to_ops to move data between different ops
* Add documentation for Vectors.to_ops
* Corrected Span's __richcmp__ implementation to take end, label and kb_id into consideration
* Updated test
* Updated test
* Removed formatting from a test for readability's sake
* Use same tuples for all comparisons
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
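
To illustrate the "same tuples for all comparisons" idea, here is a toy class that derives every rich comparison from a single key tuple. `ToySpan` and the exact fields in `_key` are assumptions for this sketch; the actual `Span.__richcmp__` is implemented in Cython and may compare additional fields.

```python
from functools import total_ordering


@total_ordering
class ToySpan:
    def __init__(self, start, end, label="", kb_id=""):
        self.start = start
        self.end = end
        self.label = label
        self.kb_id = kb_id

    def _key(self):
        # One tuple shared by ==, <, and the derived comparisons.
        return (self.start, self.end, self.label, self.kb_id)

    def __eq__(self, other):
        if not isinstance(other, ToySpan):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, ToySpan):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        return hash(self._key())
```

With a start-only comparison, two spans that begin at the same token but differ in end, label or kb_id would compare as equal; using the full tuple avoids that.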