* Make core projectivization methods cdef nogil
While profiling the parser, I noticed that relatively a lot of time is
spent in projectivization. This change rewrites the functions in the
core loops as cdef nogil for efficiency.
In C++-land, we use vector in place of Python lists and absent heads
are represented as -1 in place of None.
* _heads_to_c: add assertion
Validation should be performed by the caller, but this assertion ensures that
we are not reading/writing out of bounds with incorrect input.
* Fix NER check in CoNLL-U converter
Leave ents unset if no NER annotation is found in the MISC column.
* Revert to global rather than per-sentence NER check
* Update spacy/training/converters/conllu_to_docs.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Add whitespace augmenter that inserts a single whitespace token into a
doc containing annotation used in core trained pipelines.
Add a combined augmenter that handles lowercasing, orth variants and
whitespace augmentation.
* Extended list of numbers for ru language
Extended list of numbers with all forms and cases including short forms, slang variants and roman numerals.
* Update lex_attrs.py
* Update 'like_num' function with percentages
Added support for numbers with percentages like 12%, 1.2% and etc. to the 'like_num' function.
* black formatting
Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
* Extend list of abbreviations for ru language
Extended list of abbreviations for ru language those may have influence on tokenization.
* black formatting
Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
* Delay loading of mecab in Korean tokenizer
Delay loading of mecab until the tokenizer is called the first time so
that it's possible to initialize a blank `ko` pipeline without having
mecab installed, e.g. for use with `spacy init vectors`.
* Move mecab import back to __init__
Move mecab import back to __init__ to warn users at the same point as
before for missing python dependencies.
* remove duplicate line
* add sent start/end token attributes to the docs
* let has_annotation work with IS_SENT_END
* elif instead of if
* add has_annotation test for sent attributes
* fix typo
* remove duplicate is_sent_start entry in docs
* Setup debug data for spancat
* Add check for missing labels
* Add low-level data warning error
* Improve logic when compiling the gold train data
* Implement check for negative examples
* Remove breakpoint
* Remove ws_ents and missing entity checks
* Fix mypy errors
* Make variable name spans_key consistent
* Rename pipeline -> component for consistency
* Account for missing labels per spans_key
* Cleanup variable names for consistency
* Improve brevity of conditional statements
* Remove unused variables
* Include spans_key as an argument for _get_examples
* Add a conditional check for spans_key
* Update spancat debug data based on new API
- Instead of using _get_labels_from_model(), I'm now using
_get_labels_from_spancat() (cf. https://github.com/explosion/spaCy/pull10079)
- The way information is displayed was also changed (text -> table)
* Rename model_labels to ensure mypy works
* Update wording on warning messages
Use "span type" instead of "entity type" in wording the warning messages.
This is because Spans aren't necessarily entities.
* Update component type into a Literal
This is to make it clear that the component parameter should only accept
either 'spancat' or 'ner'.
* Update checks to include actual model span_keys
Instead of looking at everything in the data, we only check those
span_keys from the actual spancat component. Instead of doing the filter
inside the for-loop, I just made another dictionary,
data_labels_in_component to hold this value.
* Update spacy/cli/debug_data.py
* Show label counts only when verbose is True
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix debug data check for ents that cross sents
* Use aligned sent starts to have the same indices for the NER and sent
start annotation
* Add a temporary, insufficient hack for the case where a
sentence-initial reference token is split into multiple tokens in the
predicted doc, since `Example.get_aligned("SENT_START")` currently
aligns `True` to all the split tokens.
* Improve test example
* Use Example.get_aligned_sent_starts
* Add test for crossing entity
* Auto-format code with black
* add black requirement to dev dependencies and pin to 22.x
* ignore black dependency for comparison with setup.cfg
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
So that overriding `paths.vectors` works consistently in generated
configs, set vectors model in `paths.vectors` and always refer to this
path in `initialize.vectors`.
Remove exception for whitespace tokens in `Example.get_aligned` so that
annotation on whitespace tokens is aligned in the same way as for
non-whitespace tokens.
* Clarify Span.ents documentation
Ref: #10135
Retain current behaviour. Span.ents will only include entities within
said span. You can't get tokens outside of the original span.
* Reword docstrings
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update API docs in the website
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* This comma has been most probably been left out unintentionally, leading to string concatenation between the two consecutive lines. This issue has been found automatically using a regular expression.
* This comma has been most probably been left out unintentionally, leading to string concatenation between the two consecutive lines. This issue has been found automatically using a regular expression.
* Fix infix as prefix in Tokenizer.explain
Update `Tokenizer.explain` to align with the `Tokenizer` algorithm:
* skip infix matches that are prefixes in the current substring
* Update tokenizer pseudocode in docs
* Improve typing hints for Matcher.__call__
* Add typing hints for DependencyMatcher
* Add typing hints to underscore extensions
* Update Doc.tensor type (requires numpy 1.21)
* Fix typing hints for Language.component decorator
* Use generic np.ndarray type in Doc to avoid numpy version update
* Fix mypy errors
* Fix cyclic import caused by Underscore typing hints
* Use Literal type from spacy.compat
* Update matcher.pyi import format
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Instead of the running the actual suggester, which may require
annotation from annotating components that is not necessarily present in
the reference docs, use the built-in 1-gram suggester.
* added iob to int
* added tests
* added iob strings
* added error
* blacked attrs
* Update spacy/tests/lang/test_attrs.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/attrs.pyx
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* added iob strings as global
* minor refinement with iob
* removed iob strings from token
* changed to uppercase
* cleaned and went back to master version
* imported iob from attrs
* Update and format errors
* Support and test both str and int ENT_IOB key
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* added new field
* added exception for IOb strings
* minor refinement to schema
* removed field
* fixed typo
* imported numeriacla val
* changed the code bit
* cosmetics
* added test for matcher
* set ents of moc docs
* added invalid pattern
* minor update to documentation
* blacked matcher
* added pattern validation
* add IOB vals to schema
* changed into test
* mypy compat
* cleaned left over
* added compat import
* changed type
* added compat import
* changed literal a bit
* went back to old
* made explicit type
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/schemas.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Determine labels by factory name in debug data
For all components, return labels for all components with the
corresponding factory name rather than for only the default name.
For `spancat`, return labels as a dict keyed by `spans_key`.
* Refactor for typing
* Add test
* Use assert instead of cast, removed unneeded arg
* Mark test as slow
* Use Vectors.shape rather than Vectors.data.shape
* Use Vectors.size rather than Vectors.data.size
* Add Vectors.to_ops to move data between different ops
* Add documentation for Vector.to_ops
* Corrected Span's __richcmp__ implementation to take end, label and kb_id in consideration
* Updated test
* Updated test
* Removed formatting from a test for readability sake
* Use same tuples for all comparisons
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Edited Slovenian stop words list (#9707)
* Noun chunks for Italian (#9662)
* added it vocab
* copied portuguese
* added possessive determiner
* added conjed Nps
* added nmoded Nps
* test misc
* more examples
* fixed typo
* fixed parenth
* fixed comma
* comma fix
* added syntax iters
* fix some index problems
* fixed index
* corrected heads for test case
* fixed tets case
* fixed determiner gender
* cleaned left over
* added example with apostophe
* French NP review (#9667)
* adapted from pt
* added basic tests
* added fr vocab
* fixed noun chunks
* more examples
* typo fix
* changed naming
* changed the naming
* typo fix
* Add Japanese kana characters to default exceptions (fix#9693) (#9742)
This includes the main kana, or phonetic characters, used in Japanese.
There are some supplemental kana blocks in Unicode outside the BMP that
could also be included, but because their actual use is rare I omitted
them for now, but maybe they should be added. The omitted blocks are:
- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension
* Remove NER words from stop words in Norwegian (#9820)
Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.
Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.
See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831
* Bump sudachipy version
* Update sudachipy versions
* Bump versions
Bumping to the most recent dictionary just to keep thing current.
Bumping sudachipy to 5.2 because older versions don't support recent
dictionaries.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
Co-authored-by: Duygu Altinok <duygu@explosion.ai>
Co-authored-by: Haakon Meland Eriksen <haakon.eriksen@far.no>
This change changes the type of left/right-arc collections from
vector[ArcC] to unordered_map[int, vector[Arc]], so that the arcs are
keyed by the head. This allows us to find all the left/right arcs for a
particular head in constant time in StateC::{L,R}.
Benchmarks with long docs (N is the number of text repetitions):
Before (using #10019):
N Time (s)
400 3.2
800 5.0
1600 9.5
3200 23.2
6400 66.8
12800 220.0
After (this commit):
N Time (s)
400 3.1
800 4.3
1600 6.7
3200 12.0
6400 22.0
12800 42.0
Related to #9858 and #10019.
* Speed up the StateC::L feature function
This function gets the n-th most-recent left-arc with a particular head.
Before this change, StateC::L would construct a vector of all left-arcs
with the given head and then pick the n-th most recent from that vector.
Since the number of left-arcs strongly correlates with the doc length
and the feature is constructed for every transition, this can make
transition-parsing quadratic.
With this change StateC::L:
- Searches left-arcs backwards.
- Stops early when the n-th matching transition is found.
- Does not construct a vector (reducing memory pressure).
This change doesn't avoid the linear search when the transition that is
queried does not occur in the left-arcs. Regardless, performance is
improved quite a bit with very long docs:
Before:
N Time
400 3.3
800 5.4
1600 11.6
3200 30.7
After:
N Time
400 3.2
800 5.0
1600 9.5
3200 23.2
We can probably do better with more tailored data structures, but I
first wanted to make a low-impact PR.
Found while investigating #9858.
* StateC::L: simplify loop
* Speed up the StateC::L feature function
This function gets the n-th most-recent left-arc with a particular head.
Before this change, StateC::L would construct a vector of all left-arcs
with the given head and then pick the n-th most recent from that vector.
Since the number of left-arcs strongly correlates with the doc length
and the feature is constructed for every transition, this can make
transition-parsing quadratic.
With this change StateC::L:
- Searches left-arcs backwards.
- Stops early when the n-th matching transition is found.
- Does not construct a vector (reducing memory pressure).
This change doesn't avoid the linear search when the transition that is
queried does not occur in the left-arcs. Regardless, performance is
improved quite a bit with very long docs:
Before:
N Time
400 3.3
800 5.4
1600 11.6
3200 30.7
After:
N Time
400 3.2
800 5.0
1600 9.5
3200 23.2
We can probably do better with more tailored data structures, but I
first wanted to make a low-impact PR.
Found while investigating #9858.
* StateC::L: simplify loop
* Check for assets with size of 0 bytes
* Update spacy/cli/project/assets.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix Scorer.score_cats for missing labels
* Add test case for Scorer.score_cats missing labels
* semantic nitpick
* black formatting
* adjust test to give different results depending on multi_label setting
* fix loss function according to whether or not missing values are supported
* add note to docs
* small fixes
* make mypy happy
* Update spacy/pipeline/textcat.py
Co-authored-by: Florian Cäsar <florian.caesar@pm.me>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
* change '_' to '' to allow Token.pos, when no value for token pos in conllu data
* Minor code style
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.
Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.
See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831
* added ruler coe
* added error for none existing pattern
* changed error to warning
* changed error to warning
* added basic tests
* fixed place
* added test files
* went back to error
* went back to pattern error
* minor change to docs
* changed style
* changed doc
* changed error slightly
* added remove to phrasem api
* error key already existed
* phrase matcher match code to api
* blacked tests
* moved comments before expr
* corrected error no
* Update website/docs/api/entityruler.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/entityruler.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added sents property to Span class that returns a generator of sentences the Span belongs to
* Added description to Span.sents property
* Update test_span to clarify the difference between span.sent and span.sents
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/tests/doc/test_span.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix documentation typos in spacy/tokens/span.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update Span.sents doc string in spacy/tokens/span.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Parametrized test_span_spans
* Corrected Span.sents to check for span-level hook first. Also, made Span.sent respect doc-level sents hook if no span-level hook is provided
* Corrected Span ocumentation copy/paste issue
* Put back accidentally deleted lines
* Fixed formatting in span.pyx
* Moved check for SENT_START annotation after user hooks in Span.sents
* add version where the property was introduced
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Migrate regressions 1-1000
* Move serialize test to correct file
* Remove tests that won't work in v3
* Migrate regressions 1000-1500
Removed regression test 1250 because v3 doesn't support the old LEX
scheme anymore.
* Add missing imports in serializer tests
* Migrate tests 1500-2000
* Migrate regressions from 2000-2500
* Migrate regressions from 2501-3000
* Migrate regressions from 3000-3501
* Migrate regressions from 3501-4000
* Migrate regressions from 4001-4500
* Migrate regressions from 4501-5000
* Migrate regressions from 5001-5501
* Migrate regressions from 5501 to 7000
* Migrate regressions from 7001 to 8000
* Migrate remaining regression tests
* Fixing missing imports
* Update docs with new system [ci skip]
* Update CONTRIBUTING.md
- Fix formatting
- Update wording
* Remove lemmatizer tests in el lang
* Move a few tests into the general tokenizer
* Separate Doc and DocBin tests
This includes the main kana, or phonetic characters, used in Japanese.
There are some supplemental kana blocks in Unicode outside the BMP that
could also be included, but because their actual use is rare I omitted
them for now, but maybe they should be added. The omitted blocks are:
- Kana Supplement
- Kana Extended (A and B)
- Small Kana Extension
* morphologizer: avoid recreating label tuple for each token
The `labels` property converts the dictionary key set to a tuple. This
property was used for every annotated token, recreating the tuple over
and over again.
Construct the tuple once in the set_annotations function and reuse it.
On a Finnish pipeline that I was experimenting with, this results in a
speedup of ~15% (~13000 -> ~15000 WPS).
* tagger: avoid recreating label tuple for each token
* Add support for kb_id to be displayed via displacy.serve. The current support is only limited to the manual option in displacy.render
* Commit to check pre-commit hooks are run.
* Update spacy/displacy/__init__.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Changes as per suggestions on the PR.
* Update website/docs/api/top-level.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/top-level.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* tag option as new from 3.2.1 onwards
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
* Use internal names for factories
If a component factory is registered like `@French.factory(...)` instead
of `@Language.factory(...)`, the name in the factories registry will be
prefixed with the language code. However in the nlp.config object the
factory will be listed without the language code. The `add_pipe` code
has fallback logic to handle this, but packaging code and the registry
itself don't.
This change makes it so that the factory name in nlp.config is the
language-specific form. It's not clear if this will break anything else,
but it does seem to fix the inconsistency and resolve the specific user
issue that brought this to our attention.
* Change approach to use fallback in package lookup
This adds fallback logic to the package lookup, so it doesn't have to
touch the way the config is built. It seems to fix the tests too.
* Remove unecessary line
* Add test
Thsi also adds an assert that seems to have been forgotten.
* Added Slovak
* Added Slovenian tests
* Added Estonian tests
* Added Croatian tests
* Added Latvian tests
* Added Icelandic tests
* Added Afrikaans tests
* Added language-independent tests
* Added Kannada tests
* Tidied up
* Added Albanian tests
* Formatted with black
* Added failing tests for anomalies
* Update spacy/tests/lang/af/test_text.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Estonian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Croatian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Icelandic tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Latvian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Slovak tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Slovenian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added ENT_ID and ENT_KB_ID into the list of the attributes that Matcher matches on
* Added ENT_ID and ENT_KB_ID to TEST_PATTERNS in test_pattern_validation.py. Disabled tests that I added before
* Update website/docs/api/matcher.md
* Format
* Remove skipped tests
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* added error string
* added serialization test
* added more to if statements
* wrote file to tempdir
* added tempdir
* changed parameter a bit
* Update spacy/tests/pipeline/test_entity_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
If the predicted docs are missing annotation according to
`has_annotation`, treat the docs as having no predictions rather than
raising errors when the annotation is missing.
The motivation for this is a combined tokenization+sents scorer for a
component where the sents annotation is optional. To provide a single
scorer in the component factory, it needs to be possible for the scorer
to continue despite missing sents annotation in the case where the
component is not annotating sents.
Exclude strings from `Vector.to_bytes()` comparions for v3.2+ `Vectors`
that now include the string store so that the source vector comparison
is only comparing the vectors and not the strings.
* Clarify how to fill in init_tok2vec after pretraining
* Ignore init_tok2vec arg in pretraining
* Update docs, config setting
* Remove obsolete note about not filling init_tok2vec early
This seems to have also caught some lines that needed cleanup.