Provide more customized normalization table warnings when training a new
model. Only suggest installing `spacy-lookups-data` if it's not already
installed and it includes a table for this language (currently checked
in a hard-coded list).
* Fix warning message for lemmatization tables
* Add a warning when the `lexeme_norm` table is empty. (Given the
relatively lang-specific loading for `Lookups`, it seemed like too much
overhead to dynamically extract the list of languages, so for now it's
hard-coded.)
* Fix most_similar for vectors with unused rows
Address issues related to the unused rows in the vector table and
`most_similar`:
* Update `most_similar()` to search only through rows that are in use
according to `key2row`.
* Raise an error when `most_similar(n=n)` is larger than the number of
vectors in the table.
* Set and restore `_unset` correctly when vectors are added or
deserialized so that new vectors are added in the correct row.
* Set data and keys to the same length in `Vocab.prune_vectors()` to
avoid spurious entries in `key2row`.
* Fix regression test using `most_similar`
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Add warning for misaligned character offset spans
* Resolve conflict
* Filter warnings in example scripts
Filter warnings in example scripts to show warnings once, in particular
warnings about misaligned entities.
Co-authored-by: Ines Montani <ines@ines.io>
Check that row is within bounds for the vector data array when adding a
vector.
Don't add vectors with rank OOV_RANK in `init-model` (change is due to
shift from OOV as 0 to OOV as OOV_RANK).
Reconstruction of the original PR #4697 by @MiniLau.
Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema
because the Matcher is only going to be able to support `IS_SENT_START`.
Improve GoldParse NER alignment by including all cases where the start
and end of the NER span can be aligned, regardless of internal
tokenization differences.
To do this, convert BILUO tags to character offsets, check start/end
alignment with `doc.char_span()`, and assign the BILUO tags for the
aligned spans. Alignment for `O/-` tags is handled through the
one-to-one and multi alignments.
* Matcher support for Span, as well as Doc #5056
* Removes an import unused
* Signed contributors agreement
* Code optimization and better test
* Add error message for bad Matcher call argument
* Fix merging
* Add Doc init from list of words and text
Add an option to initialize a `Doc` from a text and list of words where
the words may or may not include all whitespace tokens. If the text and
words are mismatched, raise an error.
* Fix error code
* Remove all whitespace before aligning words/text
* Move words/text init to util function
* Update error message
* Rename to get_words_and_spaces
* Fix formatting
* Modify Vector.resize to work with cupy
Modify `Vectors.resize` to work with cupy. Modify behavior when resizing
to a different vector dimension so that individual vectors are truncated
or extended with zeros instead of having the original values filled into
the new shape without regard for the original axes.
* Update spacy/tests/vocab_vectors/test_vectors.py
Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Improve token head verification
Improve the verification for valid token heads when heads are set:
* in `Token.head`: heads come from the same document
* in `Doc.from_array()`: head indices are within the bounds of the
document
* Improve error message
Iterate over lr_edges until all heads are within the current sentence.
Instead of iterating over them for a fixed number of iterations, check
whether the sentence boundaries are correct for the heads and stop when
all are correct. Stop after a maximum of 10 iterations, providing a
warning in this case since the sentence boundaries may not be correct.
* Add work in progress
* Update analysis helpers and component decorator
* Fix porting of docstrings for Python 2
* Fix docstring stuff on Python 2
* Support meta factories when loading model
* Put auto pipeline analysis behind flag for now
* Analyse pipes on remove_pipe and replace_pipe
* Move analysis to root for now
Try to find a better place for it, but it needs to go for now to avoid circular imports
* Simplify decorator
Don't return a wrapped class and instead just write to the object
* Update existing components and factories
* Add condition in factory for classes vs. functions
* Add missing from_nlp classmethods
* Add "retokenizes" to printed overview
* Update assigns/requires declarations of builtins
* Only return data if no_print is enabled
* Use multiline table for overview
* Don't support Span
* Rewrite errors/warnings and move them to spacy.errors
* Implement new API for {Phrase}Matcher.add (backwards-compatible)
* Update docs
* Also update DependencyMatcher.add
* Update internals
* Rewrite tests to use new API
* Add basic check for common mistake
Raise error with suggestion if user likely passed in a pattern instead of a list of patterns
* Fix typo [ci skip]
* Error for ill-formed input to iob_to_biluo()
Check for empty label in iob_to_biluo(), which can result from
ill-formed input.
* Check for empty NER label in debug-data
* fix overflow error on windows
* more documentation & logging fixes
* md fix
* 3 different limit parameters to play with execution time
* bug fixes directory locations
* small fixes
* exclude dev test articles from prior probabilities stats
* small fixes
* filtering wikidata entities, removing numeric and meta items
* adding aliases from wikidata also to the KB
* fix adding WD aliases
* adding also new aliases to previously added entities
* fixing comma's
* small doc fixes
* adding subclassof filtering
* append alias functionality in KB
* prevent appending the same entity-alias pair
* fix for appending WD aliases
* remove date filter
* remove unnecessary import
* small corrections and reformatting
* remove WD aliases for now (too slow)
* removing numeric entities from training and evaluation
* small fixes
* shortcut during prediction if there is only one candidate
* add counts and fscore logging, remove FP NER from evaluation
* fix entity_linker.predict to take docs instead of single sentences
* remove enumeration sentences from the WP dataset
* entity_linker.update to process full doc instead of single sentence
* spelling corrections and dump locations in readme
* NLP IO fix
* reading KB is unnecessary at the end of the pipeline
* small logging fix
* remove empty files
* Move test
* Allow default in Lookups.get_table
* Start with blank tables in Lookups.from_bytes
* Refactor lemmatizer to hold instance of Lookups
* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need
* Update tests and docs
* Fix more tests
* Fix lemmatizer
* Upgrade pytest to try and fix weird CI errors
* Try pytest 4.6.5
* Replace PhraseMatcher with Aho-Corasick
Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.
The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.
Fixes#4308.
* Restore support for pickling
* Fix internal keyword add/remove for numpy arrays
* Add missing loop for match ID set in search loop
* Remove cruft in matching loop for partial matches
There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.
* Replace dict trie with MapStruct trie
* Fix how match ID hash is stored/added
* Update fix for match ID vocab
* Switch from map_get_unless_missing to map_get
* Switch from numpy array to Token.get_struct_attr
Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.
Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)
* Restructure imports to export find_matches
* Implement full remove()
Remove unnecessary trie paths and free unused maps.
Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.
* Store docs internally only as attr lists
* Reduces size for pickle
* Remove duplicate keywords store
Now that docs are stored as lists of attr hashes, there's no need to
have the duplicate _keywords store.