* extend span scorer with consider_label and allow_overlap
* unit test for spans y2x overlap
* add score_spans unit test
* docs for new fields in scorer.score_spans
* rename to include_label
* spell out if-else for clarity
* rename to 'labeled'
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Data in the JSON format is split into sentences, and each sentence is
saved with is_sent_start flags. Currently the flags are 1 for the first
token and 0 for the others. When deserialized this results in a pattern
of True, None, None, None... which makes single-sentence documents look
as though they haven't had sentence boundaries set.
Since items saved in JSON format have been split into sentences already,
the is_sent_start values should all be True or False.
* Support match alignments
* change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case
* remove added errors, utilize bint type, cleanup whitespace
* fix no new line in end of file
* Minor formatting
* Skip alignments processing if as_spans is set
* Add with_alignments to Matcher API docs
* Update website/docs/api/matcher.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Support infinite generators for training corpora
Support a training corpus with an infinite generator in the `spacy
train` training loop:
* Revert `create_train_batches` to the state where an infinite generator
can be used as the in the first epoch of exactly one epoch without
resulting in a memory leak (`max_epochs != 1` will still result in a
memory leak)
* Move the shuffling for the first epoch into the corpus reader,
renaming it to `spacy.Corpus.v2`.
* Switch to training option for shuffling in memory
Training loop:
* Add option `training.shuffle_train_corpus_in_memory` that controls
whether the corpus is loaded in memory once and shuffled in the training
loop
* Revert changes to `create_train_batches` and rename to
`create_train_batches_with_shuffling` for use with `spacy.Corpus.v1` and
a corpus that should be loaded in memory
* Add `create_train_batches_without_shuffling` for a corpus that
should not be shuffled in the training loop: the corpus is merely
batched during training
Corpus readers:
* Restore `spacy.Corpus.v1`
* Add `spacy.ShuffledCorpus.v1` for a corpus shuffled in memory in the
reader instead of the training loop
* In combination with `shuffle_train_corpus_in_memory = False`, each
epoch could result in a different augmentation
* Refactor create_train_batches, validation
* Rename config setting to `training.shuffle_train_corpus`
* Refactor to use a single `create_train_batches` method with a
`shuffle` option
* Only validate `get_examples` in initialize step if:
* labels are required
* labels are not provided
* Switch back to max_epochs=-1 for streaming train corpus
* Use first 100 examples for stream train corpus init
* Always check validate_get_examples in initialize
* Add failing test for PRFScore
* Fix erroneous implementation of __add__
* Simplify constructor
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Adjust custom extension data when copying user data in `Span.as_doc()`
* Restrict `Doc.from_docs()` to adjusting offsets for custom extension
data
* Update test to use extension
* (Duplicate bug fix for character offset from #7497)
Merge data from `doc.spans` in `Doc.from_docs()`.
* Fix internal character offset set when merging empty docs (only
affects tokens and spans in `user_data` if an empty doc is in the list
of docs)
In the retokenizer, only reset sent starts (with
`set_children_from_head`) if the doc is parsed. If there is no parse,
merged tokens have the unset `token.is_sent_start == None` by default after
retokenization.
* Add util method for check
* Add new languages to list with lexeme norm tables
* Add check to all relevant components
* Add config details to warning message
Note that we're not actually inspecting the model config to see if
`NORM` is used as an attribute, so it may warn in cases where it's not
relevant.
See here:
https://github.com/explosion/spaCy/discussions/7463
Still need to check if there are any side effects of listeners being
present but not in the pipeline, but this commit will silence the
warnings.
* To allow default lookup lemmatization with a blank Russian model,
rename pymorphy2 lookup mode to `pymorphy2_lookup`
* Bug fix: update pymorphy2 lookup lemmatize to return list rather than
string
* add multi-label textcat to menu
* add infobox on textcat API
* add info to v3 migration guide
* small edits
* further fixes in doc strings
* add infobox to textcat architectures
* add textcat_multilabel to overview of built-in components
* spelling
* fix unrelated warn msg
* Add textcat_multilabel to quickstart [ci skip]
* remove separate documentation page for multilabel_textcategorizer
* small edits
* positive label clarification
* avoid duplicating information in self.cfg and fix textcat.score
* fix multilabel textcat too
* revert threshold to storage in cfg
* revert threshold stuff for multi-textcat
Co-authored-by: Ines Montani <ines@ines.io>
* Fix aborted/skipped augmentation for `spacy.orth_variants.v1` if
lowercasing was enabled for an example
* Simplify `spacy.orth_variants.v1` for `Example` vs. `GoldParse`
* Preserve reference tokenization in `spacy.lower_case.v1`
* initialize NLP with train corpus
* add more pretraining tests
* more tests
* function to fetch tok2vec layer for pretraining
* clarify parameter name
* test different objectives
* formatting
* fix check for static vectors when using vectors objective
* clarify docs
* logger statement
* fix init_tok2vec and proc.initialize order
* test training after pretraining
* add init_config tests for pretraining
* pop pretraining block to avoid config validation errors
* custom errors
* Fix patience for identical scores
Fix training patience so that the earliest best step is chosen for
identical max scores.
* Restore break, remove print
* Explicitly define best_step for clarity
* Add hint for --gpu-id to CLI device info
If the user has `cupy` and an available GPU, add a hint about using
`--gpu-id 0` to the CLI output.
* Undo change to original CPU message
* Fix `is_cython_func` for imported code loaded under `python_code`
module name
* Add `make_named_tempfile` context manager to test utils to test
loading of imported code
* Add test for validation of `initialize` params in custom module
Fix class variable and init for `UkrainianLemmatizer` so that it loads
the `uk` dictionaries rather than having the parent `RussianLemmatizer`
override with the `ru` settings.
Now that `nlp.evaluate()` does not modify the examples, rerun the
pipeline on the (limited) texts in order to provide the predicted
annotation in the displacy output option.
* Add test for #7035
* Update test for issue 7056
* Fix test
* Fix transitions method used in testing
* Fix state eol detection when rebuffer
* Clean up redundant fix
* Add regression test
* Run PhraseMatcher on Spans
* Add test for PhraseMatcher on Spans and Docs
* Add SCA
* Add test with 3 matches in Doc, 1 match in Span
* Update docs
* Use doc.length for find_matches in tokenizer
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Now that the initialize step is fully implemented, the source of E923 is
typically missing or improperly converted/formatted data rather than a
bug in spaCy, so rephrase the error and message and remove the prompt to
open an issue.
```python
def test_vocab_lexeme_add_flag_auto_id(en_vocab):
is_len4 = en_vocab.add_flag(lambda string: len(string) == 4)
assert en_vocab["1999"].check_flag(is_len4) is True
assert en_vocab["1999"].check_flag(IS_DIGIT) is True
assert en_vocab["199"].check_flag(is_len4) is False
> assert en_vocab["199"].check_flag(IS_DIGIT) is True
E assert False is True
E + where False = <built-in method check_flag of spacy.lexeme.Lexeme object at 0x7fa155c36840>(3)
E + where <built-in method check_flag of spacy.lexeme.Lexeme object at 0x7fa155c36840> = <spacy.lexeme.Lexeme object at 0x7fa155c36840>.check_flag
spacy/tests/vocab_vectors/test_lexeme.py:49: AssertionError
```
> `pytest==6.1.1`
>
> `numpy==1.19.2`
>
> `Python version: 3.8.3`
To reproduce the error, run `pytest --random-order-bucket=global --random-order-seed=170158 -v spacy/tests`
If `test_vocab_lexeme_add_flag_auto_id` is run after `test_vocab_lexeme_add_flag_provided_id`, it fails.
It seems like `test_vocab_lexeme_add_flag_provided_id` uses the `IS_DIGIT` bit for testing purposes but does not reset the bit.
This solution seems to work but, if anyone has a better fix, please let me know and I will integrate it.
* add capture argument to project_run and run_commands
* git bump to 3.0.1
* Set version to 3.0.1.dev0
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
When `--no-cache-dir` is present, it prevents caching to properly function.
If the user still wants to do this, there is the possibility to pass options with `user_pip_args`.
But you should not enforce options like these. In my case this is preventing some docker build (using buildkit caching) to have proper caching of models.
Instead of silently using only the first token in each matched span:
* Forbid `OP: ?/*/+` through `DependencyMatcher` validation
* As a fail-safe, add warning if a token match that's not exactly one
token long is found by a token pattern.
* add error handler for pipe methods
* add unit tests
* remove pipe method that are the same as their base class
* have Language keep track of a default error handler
* cleanup
* formatting
* small refactor
* add documentation
* Initial Spanish lemmatizer
* Handle merged verb+pron(s) multi-word tokens
* Use VERB for AUX rule lookup
* Add morph to lemma cache key
* Fix aux lookups, minor refactoring
* Improve verb+pron handling
* Move verb+pron handling into its own method
* Check for exceptions (primarily for se)
* Collect pronouns in the same (not reversed) order
* Only add modified possible lemmas
* Fix `spacy.util.minibatch` when the size iterator is finished (#6745)
* Skip 0-length matches (#6759)
Add hack to prevent matcher from returning 0-length matches.
* support IS_SENT_START in PhraseMatcher (#6771)
* support IS_SENT_START in PhraseMatcher
* add unit test and friendlier error
* use IDS.get instead
* ensure span.text works for an empty span (#6772)
* Remove unicode_literals
Co-authored-by: Santiago Castro <bryant@montevideo.com.uy>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow output_path to be None during training
* Fix cat scoring (?)
* Improve error message for weighted None score
* Improve messages
So we can call this in other places etc.
* FIx output path check
* Use latest wasabi
* Revert "Improve error message for weighted None score"
This reverts commit 7059926763.
* Exclude None scores from final score by default
It's otherwise very difficult to keep track of the score weights if we modify a config programmatically, source components etc.
* Update warnings and use logger.warning
* Spacy Cli info method causing backward compatibility issues #6791
fix backward compatibility by setting default value to exclude in info
method.
* setting empty list as default argument is dangerous.
so setting default to None and then setting it to emptylist, if None.
Reference : https://nikos7am.com/posts/mutable-default-arguments/
* Adding contributor agreement for user werew
* [DependencyMatcher] Comment and clean code
* [DependencyMatcher] Use defaultdicts
* [DependencyMatcher] Simplify _retrieve_tree method
* [DependencyMatcher] Remove prepended underscores
* [DependencyMatcher] Address TODO and move grouping of token's positions out of the loop
* [DependencyMatcher] Remove _nodes attribute
* [DependencyMatcher] Use enumerate in _retrieve_tree method
* [DependencyMatcher] Clean unused vars and use camel_case naming
* [DependencyMatcher] Memoize node+operator map
* Add root property to Token
* [DependencyMatcher] Groups matches by root
* [DependencyMatcher] Remove unused _keys_to_token attribute
* [DependencyMatcher] Use a list to map tokens to matcher's keys
* [DependencyMatcher] Remove recursion
* [DependencyMatcher] Use a generator to retrieve matches
* [DependencyMatcher] Remove unused memory pool
* [DependencyMatcher] Hide private methods and attributes
* [DependencyMatcher] Improvements to the matches validation
* Apply suggestions from code review
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* [DependencyMatcher] Fix keys_to_position_maps
* Remove Token.root property
* [DependencyMatcher] Remove functools' lru_cache
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* warn when frozen components break listener pattern
* few notes in the documentation
* update arg name
* formatting
* cleanup
* specify listeners return type
* raise NotImplementedError when noun_chunks iterator is not implemented
* bring back, fix and document span.noun_chunks
* formatting
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Add long_token_splitter component
Add a `long_token_splitter` component for use with transformer
pipelines. This component splits up long tokens like URLs into smaller
tokens. This is particularly relevant for pretrained pipelines with
`strided_spans`, since the user can't change the length of the span
`window` and may not wish to preprocess the input texts.
The `long_token_splitter` splits tokens that are at least
`long_token_length` tokens long into smaller tokens of `split_length`
size.
Notes:
* Since this is intended for use as the first component in a pipeline,
the token splitter does not try to preserve any token annotation.
* API docs to come when the API is stable.
* Adjust API, add test
* Fix name in factory
Add all strings from the source model when adding a pipe from a source
model.
Minor:
* Skip `disable=["vocab", "tokenizer"]` when loading a source model from
the config, since this doesn't do anything and is misleading.
* Handle unset token.morph in Morphologizer
Handle unset `token.morph` in `Morphologizer.initialize` and
`Morphologizer.get_loss`. If both `token.morph` and `token.pos` are
unset, treat the annotation as missing rather than empty.
* Add token.has_morph()
* Override language defaults for null token and URL match
When the serialized `token_match` or `url_match` is `None`, override the
language defaults to preserve `None` on deserialization.
* Fix fixtures in tests
* Draft out initial Spans data structure
* Initial span group commit
* Basic span group support on Doc
* Basic test for span group
* Compile span_group.pyx
* Draft addition of SpanGroup to DocBin
* Add deserialization for SpanGroup
* Add tests for serializing SpanGroup
* Fix serialization of SpanGroup
* Add EdgeC and GraphC structs
* Add draft Graph data structure
* Compile graph
* More work on Graph
* Update GraphC
* Upd graph
* Fix walk functions
* Let Graph take nodes and edges on construction
* Fix walking and getting
* Add graph tests
* Fix import
* Add module with the SpanGroups dict thingy
* Update test
* Rename 'span_groups' attribute
* Try to fix c++11 compilation
* Fix test
* Update DocBin
* Try to fix compilation
* Try to fix graph
* Improve SpanGroup docstrings
* Add doc.spans to documentation
* Fix serialization
* Tidy up and add docs
* Update docs [ci skip]
* Add SpanGroup.has_overlap
* WIP updated Graph API
* Start testing new Graph API
* Update Graph tests
* Update Graph
* Add docstring
Co-authored-by: Ines Montani <ines@ines.io>
Validate both `[initialize]` and `[training]` in `debug data` and
`nlp.initialize()` with separate config validation error blocks that
indicate which block of the config is being validated.
Add `initialize.before_init` and `initialize.after_init` callbacks to
the config. The `initialize.before_init` callback is a place to
implement one-time tokenizer customizations that are then saved with the
model.
* Update stop_words.py
Added three aditional stopwords: "a" and "o" that means "the", and "e" that means "and"
* Create cristianasp.md
* zero edit to push CI
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* fix TorchBiLSTMEncoder documentation
* ensure the types of the encoding Tok2vec layers are correct
* update references from v1 to v2 for the new architectures
* add syntax iterators for danish
* add test noun chunks for danish syntax iterators
* add contributor agreement
* update da syntax iterators to remove nested chunks
* add tests for da noun chunks
* Fix test
* add missing import
* fix example
* Prevent overlapping noun chunks
Prevent overlapping noun chunks by tracking the end index of the
previous noun chunk span.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* clean up of ner tests
* beam_parser tests
* implement get_beam_parses and scored_parses for the dep parser
* we don't have to add the parse if there are no arcs
* add convenience method to determine tok2vec width in a model
* fix transformer tok2vec dimensions in TextCatEnsemble architecture
* init function should not be nested to avoid pickle issues
* small fixes and formatting
* bring test_issue4313 up-to-date, currently fails
* formatting
* add get_beam_parses method back
* add scored_ents function
* delete tag map
Instead of unsetting lemmas on retokenized tokens, set the default
lemmas to:
* merge: concatenate any existing lemmas with `SPACY` preserved
* split: use the new `ORTH` values if lemmas were previously set,
otherwise leave unset
* multi-label textcat component
* formatting
* fix comment
* cleanup
* fix from #6481
* random edit to push the tests
* add explicit error when textcat is called with multi-label gold data
* fix error nr
* small fix
* Fix memory issues in Language.evaluate
Reset annotation in predicted docs before evaluating and store all data
in `examples`.
* Minor refactor to docs generator init
* Fix generator expression
* Fix final generator check
* Refactor pipeline loop
* Handle examples generator in Language.evaluate
* Add test with generator
* Use make_doc
* Add Amharic to space
* clean up
* Add some PRON_LEMMA
* add Tigrinya support
* remove text_noun_chunks
* Tigrinya Support
* added some more details for ti
* fix unit test
* add amharic char range
* changes from review
* amharic and tigrinya share same unicode block
* get rid of _amharic/_tigrinya in char_classes
Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>
Fix lookup of empty morph in the morphology table, which fixes a memory
leak where a new morphology tag was allocated each time the empty morph
tag was added.
* Switch converters to generator functions
To reduce the memory usage when converting large corpora, refactor the
convert methods to be generator functions.
* Update tests
* Get basic beam tests working
* Get basic beam tests working
* Compile _beam_utils
* Remove prints
* Test beam density
* Beam parser seems to train
* Draft beam NER
* Upd beam
* Add hypothesis as dev dependency
* Implement missing is-gold-parse method
* Implement early update
* Fix state hashing
* Fix test
* Fix test
* Default to non-beam in parser constructor
* Improve oracle for beam
* Start refactoring beam
* Update test
* Refactor beam
* Update nn
* Refactor beam and weight by cost
* Update ner beam settings
* Update test
* Add __init__.pxd
* Upd test
* Fix test
* Upd test
* Fix test
* Remove ring buffer history from StateC
* WIP change arc-eager transitions
* Add state tests
* Support ternary sent start values
* Fix arc eager
* Fix NER
* Pass oracle cut size for beam
* Fix ner test
* Fix beam
* Improve StateC.clone
* Improve StateClass.borrow
* Work directly with StateC, not StateClass
* Remove print statements
* Fix state copy
* Improve state class
* Refactor parser oracles
* Fix arc eager oracle
* Fix arc eager oracle
* Use a vector to implement the stack
* Refactor state data structure
* Fix alignment of sent start
* Add get_aligned_sent_starts method
* Add test for ae oracle when bad sentence starts
* Fix sentence segment handling
* Avoid Reduce that inserts illegal sentence
* Update preset SBD test
* Fix test
* Remove prints
* Fix sent starts in Example
* Improve python API of StateClass
* Tweak comments and debug output of arc eager
* Upd test
* Fix state test
* Fix state test
* add test for multi-label textcat reproducibility
* remove positive_label
* fix lengths dtype
* fix comments
* remove comment that we should not have forgotten :-)
Remove the non-working `--use-chars` option from the train CLI. The
implementation of the option across component types and the CLI settings
could be fixed, but the `CharacterEmbed` model does not work on GPU in
v2 so it's better to remove it.
* define new architectures for the pretraining objective
* add loss function as attr of the omdel
* cleanup
* cleanup
* shorten name
* fix typo
* remove unused error
Preserve `token.spacy` corresponding to the span end token in the
original doc rather than adjusting for the current offset.
* If not modifying in place, this checks in the original document
(`doc.c` rather than `tokens`).
* If modifying in place, the document has not been modified past the
current span start position so the value at the current span end
position is valid.
* When checking for token alignments, check not only that the tokens are
identical but that the character positions are both at the start of a
token.
It's possible for the tokens to be identical even though the two
tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs.
`["a", "''", "'"]`, where the middle tokens are identical but should not
be aligned on the token level at character position 2 since it's the
start of one token but the middle of another.
* Use the lowercased version of the token texts to create the
character-to-token alignment because lowercasing can change the string
length (e.g., for `İ`, see the not-a-bug bug report:
https://bugs.python.org/issue34723)
* Only set NORM on Token in retokenizer
Instead of setting `NORM` on both the token and lexeme, set `NORM` only
on the token.
The retokenizer tries to set all possible attributes with
`Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate
which attributes are available for each. `NORM` is the only attribute
that's stored on both and for most cases it doesn't make sense to set
the global norms based on a individual retokenization. For lexeme-only
attributes like `IS_STOP` there's no way to avoid the global side
effects, but I think that `NORM` would be better only on the token.
* Fix test