* Modify retokenizer to use span root attributes
* tag/pos/morph are set to root tag/pos/morph
* lemma and norm are reset and end up as orth (not ideal, but better
than orth of first token)
* Also handle individual merge case
* Add test
* Attempt to handle ent_iob and ent_type in merges
* Fix check for whether B-ENT should become I-ENT
* Move IOB consistency check to after attrs
Move all IOB consistency checks after attrs are set and simplify to
check entire document, modifying I to B at the beginning of the document
or if the entity type of the previous token isn't the same.
* Move IOB consistency check for single merge
Move IOB consistency check after the token array is compressed for the
single merge case.
* Update spacy/tokens/_retokenize.pyx
Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
* Remove single vs. multiple merge distinction
Remove original single-instance `_merge()` and use `_bulk_merge()` (now
renamed `_merge()`) for all merges.
* Add out-of-bound check in previous entity check
* Add custom __dir__ to Underscore (see #3707)
* Make sure custom extension methods keep their docstrings (see #3707)
* Improve tests
* Prepend note on partial to docstring (see #3707)
* Remove print statement
* Handle cases where docstring is None
* label in span not writable anymore
* more explicit unit test and error message for readonly label
* bit more explanation (view)
* error msg tailored to specific case
* fix None case
* Make serialization methods consistent
Use an `exclude` keyword argument instead of arbitrarily named keyword arguments, and add deprecation handling
* Update docs and add section on serialization fields
* Use default return instead of else
* Add Doc.is_nered to indicate if entities have been set
* Add properties in Doc.to_json if they were set, not if they're available
This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.
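The IOB consistency rule described in the merge commits above (an `I` tag becomes `B` at the beginning of the document, or when the previous token's entity type differs) can be sketched in plain Python. This is only an illustration: the function name `fix_iob` and the `(iob, ent_type)` tuple representation are assumptions made for this example, not the Cython code in `spacy/tokens/_retokenize.pyx`.

```python
def fix_iob(tags):
    """Return a copy of `tags` with inconsistent I tags turned into B tags.

    `tags` is a list of (iob, ent_type) pairs, where iob is "B", "I" or "O"
    and tokens outside an entity carry an empty ent_type.
    (Hypothetical representation, chosen for illustration only.)
    """
    fixed = []
    for i, (iob, ent_type) in enumerate(tags):
        if iob == "I":
            # An I tag is only valid if the previous token exists and
            # belongs to an entity of the same type.
            if i == 0 or fixed[i - 1][1] != ent_type:
                iob = "B"
        fixed.append((iob, ent_type))
    return fixed


tags = [("I", "PERSON"), ("I", "PERSON"), ("O", ""), ("I", "ORG")]
result = fix_iob(tags)
```

Running the check once over the whole document, after all attributes are set, avoids having to reason about partial state in the middle of a merge.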
## Description
This PR adds the ability to override custom extension attributes during merging. This only works for attributes that are writable, i.e. attributes registered with a default value like `default=False`, or attributes that have both a getter *and* a setter implemented.
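```python
import spacy
from spacy.tokens import Token

# Assumption: a blank English pipeline is enough here, since only the tokenizer is needed
nlp = spacy.blank("en")
# Register a writable extension (it has a default value)
Token.set_extension('is_musician', default=False)
doc = nlp("I like David Bowie.")
with doc.retokenize() as retokenizer:
    # Built-in attributes and custom "_" attributes for the merged token
    attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}}
    retokenizer.merge(doc[2:4], attrs=attrs)
assert doc[2].text == "David Bowie"
assert doc[2].lemma_ == "David Bowie"
assert doc[2]._.is_musician
```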
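The "writable" condition from the description can be expressed as a small predicate. The sketch below is a plain-Python illustration under stated assumptions: `is_writable` and the `_MISSING` sentinel are hypothetical names, and the real check inside the retokenizer may be structured differently.

```python
# Sentinel so that an explicit default of None still counts as "has a default".
_MISSING = object()


def is_writable(default=_MISSING, getter=None, setter=None):
    """Return True if an extension registered this way could be overwritten.

    Hypothetical helper mirroring the rule in the PR description:
    - a getter-based attribute needs a matching setter
    - otherwise the attribute needs a default value
    """
    if getter is not None:
        return setter is not None
    return default is not _MISSING


with_default = is_writable(default=False)
getter_only = is_writable(getter=lambda token: True)
getter_and_setter = is_writable(
    getter=lambda token: True,
    setter=lambda token, value: None,
)
```

Under this rule, a getter-only attribute cannot be set during a merge, because there is nowhere to store the new value.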
### Types of change
enhancement
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Change retokenize.split() API for heads
* Pass lists as values for attrs in split
* Fix test_doc_split filename
* Add error for mismatched tokens after split
* Raise error if new tokens don't match text
* Fix doc test
* Fix error
* Move deps under attrs
* Fix split tests
* Fix retokenize.split
* Add splitting of one token into several (resolves #2838)
* Improve error message for token splitting
* Make retokenizer.split() tests use a Token object
Change retokenizer.split() to use a Token object, instead of an index.
* Pass Token into retokenize.split()
Tweak retokenize.split() API so that we pass the `Token` object, not the index.
* Fix token.idx in retokenize.split()
* Test that token.idx is correct after split
* Fix token.idx for split tokens
* Fix retokenize.split()
* Fix retokenize.split
* Fix retokenize.split() test
* Add custom MatchPatternError
* Improve validators and add validation option to Matcher
* Adjust formatting
* Never validate in Matcher within PhraseMatcher
If we do decide to make validate default to True, the PhraseMatcher's Matcher shouldn't ever validate. Here, we create the patterns automatically anyways (and it's currently unclear whether the validation has performance impacts at a very large scale).
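Returning to the `retokenizer.split()` commits above: two of them require that the new tokens match the original text and that `token.idx` is correct after the split. A minimal plain-Python sketch of both requirements follows. The helper name `split_offsets` and the error wording are assumptions for illustration, not spaCy's actual implementation.

```python
def split_offsets(token_text, token_idx, orths):
    """Compute character offsets for the pieces of a split token.

    token_text: text of the token being split
    token_idx:  character offset of that token in the document
    orths:      texts of the new tokens, in order
    (Hypothetical helper, for illustration only.)
    """
    # The pieces must concatenate back to the original token text.
    if "".join(orths) != token_text:
        raise ValueError(
            "New tokens {!r} don't match the original text {!r}".format(
                orths, token_text
            )
        )
    offsets = []
    cursor = token_idx
    for orth in orths:
        # Each piece starts where the previous one ended.
        offsets.append(cursor)
        cursor += len(orth)
    return offsets


offsets = split_offsets("NewYork", 7, ["New", "York"])
```

Here the first piece keeps the original offset and each later piece starts where the previous one ends, since the pieces come from a single token with no whitespace in between.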
This PR adds a test for an untested case of `Span.get_lca_matrix`, and fixes a bug for that scenario, which I introduced in [this PR](https://github.com/explosion/spaCy/pull/3089) (sorry!).
## Description
The previous implementation of get_lca_matrix was failing for the case `doc[j:k].get_lca_matrix()` where `j > 0`. A test has been added for this case and the bug has been fixed.
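To illustrate why a span starting at `j > 0` needs care, here is a plain-Python sketch of a lowest-common-ancestor matrix over a span. It assumes heads are given as absolute document indices (a root is its own head) and that an LCA outside the span is reported as `-1`. The function names are hypothetical and this is not the Cython implementation.

```python
def ancestors(heads, i):
    """Return the chain i, head(i), ... ending at the root (its own head)."""
    chain = [i]
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain


def span_lca_matrix(heads, start, end):
    """LCA matrix for tokens start..end-1, using span-relative indices."""
    n = end - start
    matrix = [[-1] * n for _ in range(n)]
    for a in range(start, end):
        chain_a = ancestors(heads, a)
        for b in range(start, end):
            chain_b = set(ancestors(heads, b))
            # First ancestor of a (including a itself) that is also one of b.
            lca = next((x for x in chain_a if x in chain_b), -1)
            if start <= lca < end:
                # Subtract the span start on both the position and the value;
                # forgetting this offset only goes unnoticed when start == 0.
                matrix[a - start][b - start] = lca - start
    return matrix


# Token 1 is the root; token 3 attaches to it; tokens 2 and 4 attach to 3.
heads = [1, 1, 3, 1, 3]
lca = span_lca_matrix(heads, 2, 5)
```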
### Types of change
Bug fix
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
Initially, `span.as_doc()` was designed to return a view of the span's contents as a `Doc` object. This was a nice idea, but it fails because of the `token.idx` property, which refers to the character offset within the string. In a span, the `idx` of the first token might not be 0. Because this data differs between the span and a standalone `Doc`, a view would be inconsistent.
This patch changes `span.as_doc()` to return a copy instead. The docs are updated accordingly. Closes #1537
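The offset problem can be shown with a small plain-Python sketch. The helper `rebase_offsets` is a hypothetical name for illustration: it shows the re-based character offsets a standalone copy needs, which a view sharing the original token data could not provide.

```python
def rebase_offsets(token_idxs, start, end):
    """Return character offsets for tokens start..end-1, relative to the
    first token of the span, as a standalone copy would need them.
    (Hypothetical helper, for illustration only.)
    """
    base = token_idxs[start]
    return [idx - base for idx in token_idxs[start:end]]


# "I like New York": character offsets of the four tokens in the document.
doc_idxs = [0, 2, 7, 11]
span_idxs = rebase_offsets(doc_idxs, 2, 4)
```

In the original document the span "New York" starts at offset 7, but in the copied `Doc` its first token has to start at 0.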
* Update test for span.as_doc()
* Make span.as_doc() return a copy. Closes #1537
* Document change to Span.as_doc()
Fixes #3027.
* Allow Span.__init__ to take unicode values for the `label` argument.
* Allow `Span.label_` to be writeable.
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Support nowrap setting in util.prints
* Tidy up and fix whitespace
* Simplify script and use read_jsonl helper
* Add JSON schemas (see #2928)
* Deprecate Doc.print_tree
Will be replaced with Doc.to_json, which will produce a unified format
* Add Doc.to_json() method (see #2928)
Converts Doc objects to JSON using the same unified format as the training data. The method also supports serializing selected custom attributes in the `doc._` namespace.
* Remove outdated test
* Add write_json and write_jsonl helpers
* WIP: Update spacy train
* Tidy up spacy train
* WIP: Use wasabi for formatting
* Add GoldParse helpers for JSON format
* WIP: add debug-data command
* Fix typo
* Add missing import
* Update wasabi pin
* Add missing import
* 💫 Refactor CLI (#2943)
To be merged into #2932.
## Description
- [x] refactor CLI to use [`wasabi`](https://github.com/ines/wasabi)
- [x] use [`black`](https://github.com/ambv/black) for auto-formatting
- [x] add `flake8` config
- [x] move all messy UD-related scripts to `cli.ud`
- [x] make converters functions that take the opened file and return the converted data (instead of having them handle the IO)
### Types of change
enhancement
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Update wasabi pin
* Delete old test
* Update errors
* Fix typo
* Tidy up and format remaining code
* Fix formatting
* Improve formatting of messages
* Auto-format remaining code
* Add tok2vec stuff to spacy.train
* Fix typo
* Update wasabi pin
* Fix path checks for when train() is called as function
* Reformat and tidy up pretrain script
* Update argument annotations
* Raise error if model language doesn't match lang
* Document new train command