* Accept Doc input in pipelines
Allow `Doc` input to `Language.__call__` and `Language.pipe`, which
skips `Language.make_doc` and passes the doc directly to the pipeline.
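The pass-through can be illustrated with a minimal, library-free sketch. `ToyDoc`, `make_doc`, and `ensure_doc` here are hypothetical stand-ins written for this note, not spaCy's own implementation:

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc; it only stores the tokens."""

    words: List[str]


def make_doc(text: str) -> ToyDoc:
    # Stand-in tokenizer (assumption): split on whitespace.
    return ToyDoc(words=text.split())


def ensure_doc(doc_like: Union[str, ToyDoc]) -> ToyDoc:
    """Return an existing doc unchanged; tokenize plain strings."""
    if isinstance(doc_like, ToyDoc):
        # Already a doc: skip make_doc and hand it straight to the pipeline.
        return doc_like
    if isinstance(doc_like, str):
        return make_doc(doc_like)
    raise ValueError(f"Expected a string or a doc, got {type(doc_like)}")
```

Returning the same object for doc input means any annotations already set on it survive the call.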
* ensure_doc helper function
* avoid running multiple processes on GPU
* Update spacy/tests/test_language.py
Co-authored-by: svlandeg <svlandeg@github.com>
* Validate pos values when creating Doc
* Add clear error when setting invalid pos
This also changes the error language slightly.
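A rough sketch of this kind of validation, assuming the allowed values are the 17 Universal POS tags. The set spaCy actually accepts may differ, and `validate_pos` is a hypothetical name:

```python
# Universal POS tags (assumption: the real accepted set may include more).
UPOS_TAGS = frozenset(
    {
        "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
    }
)


def validate_pos(pos: str) -> str:
    """Return pos unchanged if it is a known tag, else raise a clear error."""
    if pos not in UPOS_TAGS:
        raise ValueError(
            f"Invalid pos value '{pos}'. Expected one of: {sorted(UPOS_TAGS)}"
        )
    return pos
```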
* Fix variable name
* Update spacy/tokens/doc.pyx
* Test that setting invalid pos raises an error
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* First take at StringStore/Vocab docs
Things to check:
1. The mysterious vocab members
2. How to make table of contents? Is it autogenerated?
3. Anything I missed / needs more detail?
* Update docs
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Updates based on review feedback
* Minor fix
* Move example code down
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove two attributes marked for removal in 3.1
* Add back unused ints with changed names
* Change data_dir to _unused_object
This is still kept in the type definition, but I removed it from the
serialization code.
* Put serialization code back for now
Not sure how this interacts with old serialized models yet.
* Replace all basestring references with unicode
`basestring` was a compatibility type introduced by Cython to make
dealing with UTF-8 strings in Python 2 easier. In Python 3 it is
equivalent to the unicode (or str) type.
I replaced all references to basestring with unicode, since that was
used elsewhere, but we could also just replace them with str, which
should also be equivalent.
All tests pass locally.
* Replace all references to unicode type with str
Since we only support Python 3 this is simpler.
* Remove all references to unicode type
This removes all references to the unicode type across the codebase and
replaces them with `str`, which makes it more drastic than the prior
commits. In order to make this work importing `unicode_literals` had to
be removed, and one explicit unicode literal also had to be removed (it
is unclear why this is necessary in Cython with language level 3, but
without doing it there were errors about implicit conversion).
When `unicode` is used as a type in comments it was also edited to be
`str`.
Additionally `coding: utf8` headers were removed from a few files.
* Handle spacy-legacy in package CLI for dependencies
* Implement legacy backoff in spacy registry.find
* Remove unused import
* Update and format test
* pass alignments to callbacks
* refactor for single callback loop
* Update spacy/matcher/matcher.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix surprises when asking for the root of a git repo
In the case of the first asset I wanted to get from git, the data I
wanted was the entire repository. I tried leaving "path" blank, which
gave a less-than-helpful error, and then I tried `path: "/"`, which
started copying my entire filesystem into the project. The path I should
have used was "".
I've made two changes to make this smoother for others:
- The 'path' within a git clone defaults to ""
- If the path points outside of the tmpdir that the git clone goes
into, we fail with an error
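The containment check can be sketched with `pathlib` as follows. `resolve_inside` is a hypothetical helper written for this note, not the code in the change itself:

```python
from pathlib import Path


def resolve_inside(base_dir: Path, rel_path: str) -> Path:
    """Resolve rel_path under base_dir, refusing anything outside it."""
    base = base_dir.resolve()
    # Joining with an absolute path such as "/" discards base entirely,
    # which is how `path: "/"` ended up pointing at the filesystem root.
    target = (base / rel_path).resolve()
    if target != base and base not in target.parents:
        raise ValueError(
            f"Path '{rel_path}' points outside of the clone directory {base}"
        )
    return target
```

With this check, `""` resolves to the clone directory itself, while `"/"` and `"../x"` are rejected.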
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* use a descriptive error instead of a default
plus some minor fixes from PR review
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* check for None values in assets
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
* Add textcat docs
* Add NER docs
* Add Entity Linker docs
* Add assigned fields docs for the tagger
This also adds a preamble, since there wasn't one.
* Add morphologizer docs
* Add dependency parser docs
* Update entityrecognizer docs
This is a little weird because `Doc.ents` is the only thing assigned to,
but it's actually a bidirectional property.
* Add token fields for entityrecognizer
* Fix section name
* Add entity ruler docs
* Add lemmatizer docs
* Add sentencizer/recognizer docs
* Update website/docs/api/entityrecognizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/entityruler.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/tagger.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/entityruler.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update type for Doc.ents
This was `Tuple[Span, ...]` everywhere but `Tuple[Span]` seems to be
correct.
* Run prettier
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Run prettier
* Add transformers section
This basically just moves and renames the "custom attributes" section
from the bottom of the page to be consistent with "assigned attributes"
on other pages.
I looked at moving the paragraph just above the section into the
section, but it includes the unrelated registry additions, so it seemed
better to leave it unchanged.
* Make table header consistent
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix inference of epoch_resume
When an epoch_resume value is not specified individually, it can often
be inferred from the filename. The value inference code was there but
the value wasn't passed back to the training loop.
This also adds a specific error in the case where no epoch_resume value
is provided and it can't be inferred from the filename.
* Add new error
* Always use the epoch resume value if specified
Before this, the value in the filename was used if found.
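The resulting precedence can be sketched as below. The `model<N>` filename pattern and the function name `resolve_epoch_resume` are assumptions made for illustration:

```python
import re
from typing import Optional


def resolve_epoch_resume(filename: str, epoch_resume: Optional[int] = None) -> int:
    """Pick the epoch to resume from: explicit value first, then the filename."""
    if epoch_resume is not None:
        # An explicitly specified value always wins.
        return epoch_resume
    # Assumed naming scheme: the epoch number follows "model", e.g. model7.bin.
    match = re.search(r"model(\d+)", filename)
    if match is None:
        raise ValueError(
            f"No epoch_resume given and none could be inferred from '{filename}'"
        )
    return int(match.group(1))
```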
* Start Listeners documentation
* intro table of different architectures
* initialization, linking, dim inference
* internal comm (WIP)
* expand internal comm section
* frozen components and replacing listeners
* various small fixes
* fix content table
* fix link