* Move Entity Linker v1 component to spacy-legacy
This is a follow up to #11889 that moves the component instead of
removing it.
In general, we never import from spacy-legacy in spaCy proper. However,
to use this component, that kind of import will be necessary. I was able
to test this without issues, but is this current import strategy
acceptable? Or should we put the component in a registry?
* Use spacy-legacy pr for CI
This will need to be reverted before merging.
* Add temporary step to log installed spacy-legacy version
* Modify requirements.txt to trigger tests
* Add comment to Python to trigger tests
* TODO REVERT This is a commit with logic changes to trigger tests
* Remove pipe from YAML
This works locally, but the pipe may be causing a quoting error in CI.
* Revert "TODO REVERT This is a commit with logic changes to trigger tests"
This reverts commit 689fae71f3.
* Revert "Add comment to Python to trigger tests"
This reverts commit 11840fc598.
* Add more logging
* Try installing directly in workflow
* Try explicitly uninstalling spacy-legacy first
* Cat requirements.txt to confirm contents
In the branch, the thinc version spec is `thinc>=8.1.0,<8.2.0`. But in
the logs, it's clear that a development release of 9.0 is being
installed. It's not clear why that would happen.
* Log requirements at start of build
* TODO REVERT Change thinc spec
Want to see what happens to the installed thinc spec with this change.
* Update thinc requirements
This makes it the same as it was before the merge, >=8.1.0,<8.2.0.
* Use same thinc version as v4 branch
* TODO REVERT Mark dependency check as xfail
spacy-legacy is specified as a git checkout in requirements.txt while
this PR is in progress, which makes the consistency check here fail.
* Remove debugging output / install step
* Revert "Remove debugging output / install step"
This reverts commit 923ea7448b.
* Clean up debugging output
The manual install step with the URL fragment seems to have caused
issues on Windows due to the = in the URL being misinterpreted. On the
other hand, removing it seems to mean the git version of spacy-legacy
isn't actually installed.
This PR removes the URL fragment but keeps the direct command-line
install. Additionally, since it looks like this job is configured to use
the default shell (and not bash), it removes a comment that upsets the
Windows cmd shell.
* Revert "TODO REVERT Mark dependency check as xfail"
This reverts commit d4863ec156.
* Fix requirements.txt, increasing spacy-legacy version
* Raise spacy legacy version in setup.cfg
* Remove azure build workarounds
* make spacy-legacy version explicit in error message
* Remove debugging line
* Suggestions from code review
* Init
* fix tests
* Update spacy/errors.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix test_blank_languages
* Rename xx to mul in docs
* Format _util with black
* prettier formatting
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Language.distill: copy both reference and predicted
In distillation we also modify the teacher docs (e.g. in tok2vec
components), so we need to copy both the reference and predicted doc.
Problem caught by @shadeMe
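A minimal sketch of why both sides need copying, using toy stand-ins rather than spaCy's real classes. `ToyDoc`, `ToyExample` and `copy_examples` are hypothetical names for illustration only and are not the actual `_copy_examples` implementation:

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyDoc:
    # Hypothetical stand-in for a doc: text plus mutable annotations.
    text: str
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class ToyExample:
    # Hypothetical stand-in for an example pairing two docs.
    predicted: ToyDoc  # student side
    reference: ToyDoc  # teacher side


def copy_examples(examples: List[ToyExample]) -> List[ToyExample]:
    # Copy BOTH docs: in distillation the teacher side is annotated too,
    # so copying only `predicted` would leak changes into the caller's data.
    return [
        ToyExample(
            predicted=copy.deepcopy(eg.predicted),
            reference=copy.deepcopy(eg.reference),
        )
        for eg in examples
    ]
```

With this shape, annotations written to either doc of a copied example leave the caller's original examples untouched.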
* Make new `_copy_examples` args kwonly
* Add the configuration schema for distillation
This also adds the default configuration and some tests. The schema will
be used by the training loop and `distill` subcommand.
* Format
* Change distillation shortopt to -d
* Fix description of max_epochs
* Rename distillation flag to -dt
* Rename `pipe_map` to `student_to_teacher`
* Don't re-download installed models
When downloading a model, this checks if the same version of the same
model is already installed. If it is then the download is skipped.
This is necessary because pip uses the final download URL for its
caching feature, but because of the way models are hosted on Github,
their URLs change every few minutes.
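The check described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this change: `already_installed` is a hypothetical helper name, and it compares version strings literally, without normalization:

```python
from importlib.metadata import PackageNotFoundError, version


def already_installed(package: str, wanted_version: str) -> bool:
    """Return True if `package` is installed at exactly `wanted_version`.

    Hypothetical helper: a plain string comparison, so "1.0" and "1.0.0"
    are treated as different versions.
    """
    try:
        return version(package) == wanted_version
    except PackageNotFoundError:
        # Package is not installed at all, so a download is needed.
        return False
```

A caller would skip the pip invocation whenever this returns `True` and fall through to the normal download otherwise.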
* Use importlib instead of meta.json
* Use get_package_version
* Add untested, disabled test
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add `Language.distill`
This method is the distillation counterpart of `Language.update`. It
takes a teacher `Language` instance and distills the student pipes on
the teacher pipes.
* Apply suggestions from code review
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Clarify how `Example` is used in distillation
* Update transition parser distill docstring for examples argument
* Pass optimizer to `TrainablePipe.distill`
* Annotate pipe before update
As discussed internally, we want to let a pipe annotate before doing an
update with gold/silver data. Otherwise, the output may be (too)
informed by the gold/silver data.
* Rename `component_map` to `student_to_teacher`
* Better synopsis in `Language.distill` docstring
* `name` -> `student_name`
* Fix labels type in docstring
* Mark distill test as slow
* Fix `student_to_teacher` type in docs
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* WIP
* rm ipython embeds
* rm total
* WIP
* cleanup
* cleanup + reword
* rm component function
* remove migration support form
* fix reference dataset for dev data
* additional fixes
- set approach to identifying unique trees
- adjust line length on messages
- add logic for detecting docs without annotations
* use 0 instead of none for no annotation
* partial annotation support
* initial tests for _compile_gold lemma attributes
Using the example data from the edit tree lemmatizer tests for:
- lemmatizer_trees
- partial_lemma_annotations
- n_low_cardinality_lemmas
- no_lemma_annotations
* adds output test for cli app
* switch msg level
* rm unclear uniqueness check
* Revert "rm unclear uniqueness check"
This reverts commit 6ea2b3524b.
* remove good message on uniqueness
* formatting
* use en_vocab fixture
* clarify data set source in messages
* remove unnecessary import
Co-authored-by: svlandeg <svlandeg@github.com>
* Add `spacy.PlainTextCorpusReader.v1`
This is a corpus reader that reads plain text corpora with the following
format:
- UTF-8 encoding
- One line per document.
- Blank lines are ignored.
It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.
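The format above can be read with a few lines of plain Python. This is a sketch only: `iter_plain_text_docs` is a hypothetical name, it yields raw strings rather than spaCy objects, and it assumes that "blank" means empty after the line ending is removed:

```python
from pathlib import Path
from typing import Iterator, Union


def iter_plain_text_docs(path: Union[str, Path]) -> Iterator[str]:
    """Yield one document text per non-blank line of a UTF-8 file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Strip only the line ending; other whitespace is kept.
            text = line.rstrip("\r\n")
            if text:
                yield text
```

Because the file is streamed line by line, memory use stays flat regardless of corpus size, which is the point of avoiding a serialized format here.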
* Update spacy/training/corpus.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* docs: add version to `PlainTextCorpus`
* Add docstring to registry function
* Add plain text corpus tests
* Only strip newline/carriage return
* Add return type _string_to_tmp_file helper
* Use a temporary directory in place of file name
Auto-delete and file-sharing semantics for temporary files differ too much across operating systems.
* This will be new in 3.5.1 (rather than 4)
* Test improvements from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Don't pass mem pool to new lexeme function
* Remove unused mem from function args
Two methods calling _new_lexeme, get and get_by_orth, took mem arguments
just to call the internal method. That's no longer necessary, so this
cleans it up.
* prettier formatting
* Remove more unused mem args
* Rename CSS class to make use more clear
* Rename component prop to improve code readability
* Fix `aria-hidden` directly on a link element
This link wouldn't have been clickable for screen reader users.
* Refactor component
This removes an unnecessary `div` and a duplicate link.
Co-authored-by: Ines Montani <ines@ines.io>
Originally introduced in 62b9c9c6d7
Original error: Warning: Invalid DOM property `class`. Did you mean `className`?
React doesn't have `class`, it uses `className`.
* Fix missing comma
* Activate user zoom for website
This is recommended by Lighthouse:
> Disabling zooming is problematic for users with low vision who rely on screen magnification to properly see the contents of a web page.
iOS already ignores this attribute anyway.
* Fix gap in landing pattern at the top
* Replace `.jpg` patterns with `.png`
This drastically reduces file size (for the landing page, from 221 kB to 57 kB) while doubling the resolution so that the patterns look sharper on retina displays.
* Refactor _scores2guesses
* Handle arrays on GPU
* Convert argmax result to raw integer
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Use NumpyOps() to copy data to CPU
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Changes based on review comments
* Use different _scores2guesses depending on tree_k
* Add tests for corner cases
* Add empty line for consistency
* Improve naming
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Improve naming
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* skeleton
* Fill in non-CLI details from release notes draft
* Add TODO for fuzzy matching
* Website updates for v3-5 draft
* Fill in usage examples
* Add fuzzy matching to intro
* Fix fuzzy examples
* Shell example formatting
* Fix typo
* Format
* Remove trailing periods in internal list
* Update
* Fix spacing for nested lists
* Update InMemoryLookupKB link
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
* API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar
* adjust to mdx
* linkout to InMemoryLookupKB at first occurrence in kb.mdx
* fix links to docs
* revert Azure trigger setting (I'll make a separate PR)
Co-authored-by: svlandeg <svlandeg@github.com>
* Update years in website landing page
* Update website/pages/index.tsx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>