* Sync Span __eq__ and __hash__
Use the same tuple for `__eq__` and `__hash__`, including all attributes
except `vector` and `vector_norm`.
* Update entity comparison in tests
Update `assert_docs_equal()` test util to compare `Span` properties for
ents rather than `Span` objects.
Modify flag settings so that `DEP` is not sufficient to set `is_parsed`
and only run `set_children_from_heads()` if `HEAD` is provided.
Then the combination `[SENT_START, DEP]` will set deps and not clobber
sent starts with a lot of one-word sentences.
* Restructure tag maps for MorphAnalysis changes
Prepare tag maps for upcoming MorphAnalysis changes that allow
arbritrary features.
* Use default tag map rather than duplicating for ca / uk / vi
* Import tag map into defaults for ga
* Modify tag maps so all morphological fields and features are strings
* Move features from `"Other"` to the top level
* Rewrite tuples as strings separated by `","`
* Rewrite morph symbols for fr lemmatizer as strings
* Export MorphAnalysis under spacy.tokens
* Modify morphology to support arbitrary features
Modify `Morphology` and `MorphAnalysis` so that arbitrary features are
supported.
* Modify `MorphAnalysisC` so that it can support arbitrary features and
multiple values per field. `MorphAnalysisC` is redesigned to contain:
* key: hash of UD FEATS string of morphological features
* array of `MorphFeatureC` structs that each contain a hash of `Field`
and `Field=Value` for a given morphological feature, which makes it
possible to:
* find features by field
* represent multiple values for a given field
* `get_field()` is renamed to `get_by_field()` and is no longer `nogil`.
Instead a new helper function `get_n_by_field()` is `nogil` and returns
`n` features by field.
* `MorphAnalysis.get()` returns all possible values for a field as a
list of individual features such as `["Tense=Pres", "Tense=Past"]`.
* `MorphAnalysis`'s `str()` and `repr()` are the UD FEATS string.
* `Morphology.feats_to_dict()` converts a UD FEATS string to a dict
where:
* Each field has one entry in the dict
* Multiple values remain separated by a separator in the value string
* `Token.morph_` returns the UD FEATS string and you can set
`Token.morph_` with a UD FEATS string or with a tag map dict.
* Modify get_by_field to use np.ndarray
Modify `get_by_field()` to use np.ndarray. Remove `max_results` from
`get_n_by_field()` and always iterate over all the fields.
* Rewrite without MorphFeatureC
* Add shortcut for existing feats strings as keys
Add shortcut for existing feats strings as keys in `Morphology.add()`.
* Check for '_' as empty analysis when adding morphs
* Extend helper converters in Morphology
Add and extend helper converters that convert and normalize between:
* UD FEATS strings (`"Case=dat,gen|Number=sing"`)
* per-field dict of feats (`{"Case": "dat,gen", "Number": "sing"}`)
* list of individual features (`["Case=dat", "Case=gen",
"Number=sing"]`)
All converters sort fields and values where applicable.
* Update util.filter_spans() to prefer earlier spans
* Add filter_spans test for first same-length span
* Update entity relation example to refer to util.filter_spans()
* Move test
* Allow default in Lookups.get_table
* Start with blank tables in Lookups.from_bytes
* Refactor lemmatizer to hold instance of Lookups
* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need
* Update tests and docs
* Fix more tests
* Fix lemmatizer
* Upgrade pytest to try and fix weird CI errors
* Try pytest 4.6.5
* remove duplicate unit test
* unit test (currently failing) for issue 4267
* bugfix: ensure doc.ents preserves kb_id annotations
* fix in setting doc.ents with empty label
* rename
* test for presetting an entity to a certain type
* allow overwriting Outside + blocking presets
* fix actions when previous label needs to be kept
* fix default ent_iob in set entities
* cleaner solution with U- action
* remove debugging print statements
* unit tests with explicit transitions and is_valid testing
* remove U- from move_names explicitly
* remove unit tests with pre-trained models that don't work
* remove (working) unit tests with pre-trained models
* clean up unit tests
* move unit tests
* small fixes
* remove two TODO's from doc.ents comments
* Improve load_language_data helper
* WIP: Add Lookups implementation
* Start moving lemma data over to JSON
* WIP: move data over for more languages
* Convert more languages
* Fix lemmatizer fixtures in tests
* Finish conversion
* Auto-format JSON files
* Fix test for now
* Make sure tables are stored on instance
* Update docstrings
* Update docstrings and errors
* Update test
* Add Lookups.__len__
* Add serialization methods
* Add Lookups.remove_table
* Use msgpack for serialization to disk
* Fix file exists check
* Try using OrderedDict for everything
* Update .flake8 [ci skip]
* Try fixing serialization
* Update test_lookups.py
* Update test_serialize_vocab_strings.py
* Lookups / Tables now work
This implements the stubs in the Lookups/Table classes. Currently this
is in Cython but with no type declarations, so that could be improved.
* Add lookups to setup.py
* Actually add lookups pyx
The previous commit added the old py file...
* Lookups work-in-progress
* Move from pyx back to py
* Add string based lookups, fix serialization
* Update tests, language/lemmatizer to work with string lookups
There are some outstanding issues here:
- a pickling-related test fails due to the bloom filter
- some custom lemmatizers (fr/nl at least) have issues
More generally, there's a question of how to deal with the case where
you have a string but want to use the lookup table. Currently the table
allows access by string or id, but that's getting pretty awkward.
* Change lemmatizer lookup method to pass (orth, string)
* Fix token lookup
* Fix French lookup
* Fix lt lemmatizer test
* Fix Dutch lemmatizer
* Fix lemmatizer lookup test
This was using a normal dict instead of a Table, so checks for the
string instead of an integer key failed.
* Make uk/nl/ru lemmatizer lookup methods consistent
The mentioned tokenizers all have their own implementation of the
`lookup` method, which accesses a `Lookups` table. The way that was
called in `token.pyx` was changed so this should be updated to have the
same arguments as `lookup` in `lemmatizer.py` (specificially (orth/id,
string)).
Prior to this change tests weren't failing, but there would probably be
issues with normal use of a model. More tests should proably be added.
Additionally, the language-specific `lookup` implementations seem like
they might not be needed, since they handle things like lower-casing
that aren't actually language specific.
* Make recently added Greek method compatible
* Remove redundant class/method
Leftovers from a merge not cleaned up adequately.
* Allow copying the user_data with as_doc + unit test
* add option to docs
* add typing
* import fix
* workaround to avoid bool clashing ...
* bint instead of bool
* Modify retokenizer to use span root attributes
* tag/pos/morph are set to root tag/pos/morph
* lemma and norm are reset and end up as orth (not ideal, but better
than orth of first token)
* Also handle individual merge case
* Add test
* Attempt to handle ent_iob and ent_type in merges
* Fix check for whether B-ENT should become I-ENT
* Move IOB consistency check to after attrs
Move all IOB consistency checks after attrs are set and simplify to
check entire document, modifying I to B at the beginning of the document
or if the entity type of the previous token isn't the same.
* Move IOB consistency check for single merge
Move IOB consistency check after the token array is compressed for the
single merge case.
* Update spacy/tokens/_retokenize.pyx
Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
* Remove single vs. multiple merge distinction
Remove original single-instance `_merge()` and use `_bulk_merge()` (now
renamed `_merge()`) for all merges.
* Add out-of-bound check in previous entity check
* Add custom __dir__ to Underscore (see #3707)
* Make sure custom extension methods keep their docstrings (see #3707)
* Improve tests
* Prepend note on partial to docstring (see #3707)
* Remove print statement
* Handle cases where docstring is None
* label in span not writable anymore
* more explicit unit test and error message for readonly label
* bit more explanation (view)
* error msg tailored to specific case
* fix None case
* Make serialization methods consistent
exclude keyword argument instead of random named keyword arguments and deprecation handling
* Update docs and add section on serialization fields
* Use default return instead of else
* Add Doc.is_nered to indicate if entities have been set
* Add properties in Doc.to_json if they were set, not if they're available
This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.
<!--- Provide a general summary of your changes in the title. -->
## Description
This PR adds the abilility to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attribute that have both a getter *and* a setter implemented.
```python
Token.set_extension('is_musician', default=False)
doc = nlp("I like David Bowie.")
with doc.retokenize() as retokenizer:
attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}}
retokenizer.merge(doc[2:4], attrs=attrs)
assert doc[2].text == "David Bowie"
assert doc[2].lemma_ == "David Bowie"
assert doc[2]._.is_musician
```
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.