* Add test for case where parser overwrite annotations
* Move test to its own file
Also add note about how other tokens modify results.
* Fix xfail decorator
* Add util function to unique lists and preserve order
* Use unique function instead of list(set())
list(set()) has the issue that it's not consistent between runs of the
Python interpreter, so order can vary.
list(set()) calls were left in a few places where they were behind calls
to sorted(). I think in this case the calls to list() can be removed,
but this commit doesn't do that.
* Use the existing pattern for this
* Fix inconsistency
This makes the failing test pass, so that behavior is consistent whether
patterns are added in one call or two.
The issue is that the hash for patterns depended on the index of the
pattern in the list of current patterns, not the list of total patterns,
so a second call would get identical match ids.
* Add illustrative test case
* Add failing test for remove case
Patterns are not removed from the internal matcher on calls to remove,
which causes spurious weird matches (or misses).
* Fix removal issue
Remove patterns from the internal matcher.
* Check that the single add call also gets no matches
Since a component may reference anything in the vocab, share the full
vocab when loading source components and vectors (which will include
`strings` as of #8909).
When loading a source component from a config, save and restore the
vocab state after loading source pipelines, in particular to preserve
the original state without vectors, since `[initialize.vectors]
= null` skips rather than resets the vectors.
The vocab references are not synced for components loaded with
`Language.add_pipe(source=)` because the pipelines are already loaded
and not necessarily with the same vocab. A warning could be added in
`Language.create_pipe_from_source` that it may be necessary to save and
reload before training, but it's a rare enough case that this kind of
warning may be too noisy overall.
* Add link to Discussions FAQ
* Remove old FAQ entries
I think these are no longer relevant.
- no-cache-dir: affected pip versions are *very* old now
- narrow unicode: not an issue from py3.3+
- utf-8 osx: upstream bug closed in 2019
Some of the other issues are also maybe not frequent.
* Add link to Discussions FAQ
* Remove old FAQ entries
I think these are no longer relevant.
- no-cache-dir: affected pip versions are *very* old now
- narrow unicode: not an issue from py3.3+
- utf-8 osx: upstream bug closed in 2019
Some of the other issues are also maybe not frequent.
* Validate pos values when creating Doc
* Add clear error when setting invalid pos
This also changes the error language slightly.
* Fix variable name
* Update spacy/tokens/doc.pyx
* Test that setting invalid pos raises an error
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* First take at StringStore/Vocab docs
Things to check:
1. The mysterious vocab members
2. How to make table of contents? Is it autogenerated?
3. Anything I missed / needs more detail?
* Update docs
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Updates based on review feedback
* Minor fix
* Move example code down
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>