Commit Graph

15902 Commits

Author SHA1 Message Date
Daniël de Kok
eec5ccd72f
Language.update: ensure that tok2vec gets updated (#12136)
* `Language.update`: ensure that tok2vec gets updated

The components in a pipeline can be updated independently. However,
tok2vec implementations are an exception to this, since they depend on
listeners for their gradients. The update method of a tok2vec
implementation computes the tok2vec forward and passes this along with a
backprop function to the listeners. This backprop function accumulates
gradients for all the listeners. There are two ways in which the
accumulated gradients can be used to update the tok2vec weights:

1. Call the `finish_update` method of tok2vec *after* the `update`
   method is called on all of the pipes that use a tok2vec listener.
2. Pass an optimizer to the `update` method of tok2vec. In this
   case, tok2vec will give the last listener a special backprop
   function that calls `finish_update` on the tok2vec.

Unfortunately, `Language.update` did neither of these. Instead, it
immediately called `finish_update` on every pipe after `update`. As a
result, the tok2vec weights are updated when no gradients have been
accumulated from listeners yet. And the gradients of the listeners are
only used in the next call to `Language.update` (when `finish_update` is
called on tok2vec again).

This change fixes this issue by passing the optimizer to the `update`
method of trainable pipes, leading to use of the second strategy
outlined above.

The main updating loop in `Language.update` is also simplified by using
the `TrainableComponent` protocol consistently.

* Train loop: `sgd` is `Optional[Optimizer]`, do not pass false

* Language.update: call pipe finish_update after all pipe updates

This gives correct and fast updates when multiple components update the
same parameters.

* Add comment why we moved `finish_update` to a separate loop
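
The ordering problem described above can be illustrated with a toy model. This is a minimal sketch and not spaCy's implementation: `ToyTok2Vec`, `ToyListenerPipe`, `update_interleaved` and `update_deferred` are hypothetical names, and gradients are plain floats. It only shows why calling `finish_update` right after each `update` leaves the shared weights untouched, while a separate `finish_update` loop applies the accumulated gradients.

```python
from typing import Callable, List, Optional


class ToyTok2Vec:
    """Hypothetical stand-in for a shared embedding component."""

    def __init__(self) -> None:
        self.weight = 1.0
        self.grad = 0.0
        self.backprop: Optional[Callable[[float], None]] = None

    def update(self) -> None:
        # The forward pass hands listeners a callback that accumulates gradients.
        def backprop(d_output: float) -> None:
            self.grad += d_output

        self.backprop = backprop

    def finish_update(self, lr: float) -> None:
        # Apply whatever has been accumulated so far, then reset.
        self.weight -= lr * self.grad
        self.grad = 0.0


class ToyListenerPipe:
    """Hypothetical downstream pipe that sends a gradient to the tok2vec."""

    def __init__(self, upstream: ToyTok2Vec, d_output: float) -> None:
        self.upstream = upstream
        self.d_output = d_output

    def update(self) -> None:
        assert self.upstream.backprop is not None
        self.upstream.backprop(self.d_output)

    def finish_update(self, lr: float) -> None:
        pass  # This toy listener has no weights of its own.


def update_interleaved(pipes: List, lr: float) -> None:
    # Old ordering: finish_update immediately after each update.
    for pipe in pipes:
        pipe.update()
        pipe.finish_update(lr)


def update_deferred(pipes: List, lr: float) -> None:
    # New ordering: all updates first, then all finish_update calls.
    for pipe in pipes:
        pipe.update()
    for pipe in pipes:
        pipe.finish_update(lr)


def run(update_fn: Callable[[List, float], None]) -> ToyTok2Vec:
    tok2vec = ToyTok2Vec()
    pipes = [tok2vec, ToyListenerPipe(tok2vec, 0.5), ToyListenerPipe(tok2vec, 0.25)]
    update_fn(pipes, 1.0)
    return tok2vec


stale = run(update_interleaved)
fresh = run(update_deferred)
print(stale.weight, stale.grad)  # weight unchanged, gradients left over
print(fresh.weight, fresh.grad)  # gradients applied in the same step
```

With the interleaved ordering, the listener gradients stay in `grad` until the next call, which matches the one-step delay described in the first bullet.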
2023-02-03 15:22:25 +01:00
Sofie Van Landeghem
c47ec5b5c6
Merge pull request #12218 from adrianeboyd/chore/update-v4-from-master-7
Update v4 from master
2023-02-03 12:04:20 +01:00
Paul O'Leary McCann
89f974d4f5
Cleanup/remove backwards compat overwrite settings (#11888)
* Remove backwards-compatible overwrite from Entity Linker

This also adds a docstring about overwrite, since it wasn't present.

* Fix docstring

* Remove backward compat settings in Morphologizer

This also needed a docstring added.

For this component it's less clear what the right overwrite settings
are.

* Remove backward compat from sentencizer

This was simple

* Remove backward compat from senter

Another simple one

* Remove backward compat setting from tagger

* Add docstrings

* Update spacy/pipeline/morphologizer.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update docs

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-02 14:13:38 +01:00
Adriane Boyd
cd95b29053 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master-7 2023-02-02 13:06:15 +01:00
Sofie Van Landeghem
4c60afb946
Backslash fixes in docs (#12213)
* backslash fixes

* revert unrelated change
2023-02-01 10:15:38 +01:00
Paul O'Leary McCann
6920fb7baf
Move Entity Linker v1 to spacy-legacy (#12006)
* Move Entity Linker v1 component to spacy-legacy

This is a follow up to #11889 that moves the component instead of
removing it.

In general, we never import from spacy-legacy in spaCy proper. However,
to use this component, that kind of import will be necessary. I was able
to test this without issues, but is this current import strategy
acceptable? Or should we put the component in a registry?

* Use spacy-legacy pr for CI

This will need to be reverted before merging.

* Add temporary step to log installed spacy-legacy version

* Modify requirements.txt to trigger tests

* Add comment to Python to trigger tests

* TODO REVERT This is a commit with logic changes to trigger tests

* Remove pipe from YAML

Works locally, but possibly this is causing a quoting error or
something.

* Revert "TODO REVERT This is a commit with logic changes to trigger tests"

This reverts commit 689fae71f3.

* Revert "Add comment to Python to trigger tests"

This reverts commit 11840fc598.

* Add more logging

* Try installing directly in workflow

* Try explicitly uninstalling spacy-legacy first

* Cat requirements.txt to confirm contents

In the branch, the thinc version spec is `thinc>=8.1.0,<8.2.0`. But in
the logs, it's clear that a development release of 9.0 is being
installed. It's not clear why that would happen.

* Log requirements at start of build

* TODO REVERT Change thinc spec

Want to see what happens to the installed thinc spec with this change.

* Update thinc requirements

This makes it the same as it was before the merge, >=8.1.0,<8.2.0.

* Use same thinc version as v4 branch

* TODO REVERT Mark dependency check as xfail

spacy-legacy is specified as a git checkout in requirements.txt while
this PR is in progress, which makes the consistency check here fail.

* Remove debugging output / install step

* Revert "Remove debugging output / install step"

This reverts commit 923ea7448b.

* Clean up debugging output

The manual install step with the URL fragment seems to have caused
issues on Windows due to the = in the URL being misinterpreted. On the
other hand, removing it seems to mean the git version of spacy-legacy
isn't actually installed.

This PR removes the URL fragment but keeps the direct command-line
install. Additionally, since it looks like this job is configured to use
the default shell (and not bash), it removes a comment that upsets the
Windows cmd shell.

* Revert "TODO REVERT Mark dependency check as xfail"

This reverts commit d4863ec156.

* Fix requirements.txt, increasing spacy-legacy version

* Raise spacy legacy version in setup.cfg

* Remove azure build workarounds

* make spacy-legacy version explicit in error message

* Remove debugging line

* Suggestions from code review
2023-02-01 09:47:56 +01:00
Edward
360ccf628a
Rename language codes (Icelandic, multi-language) (#12149)
* Init

* fix tests

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix test_blank_languages

* Rename xx to mul in docs

* Format _util with black

* prettier formatting

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-31 17:30:43 +01:00
Raphael Mitsch
02af17a5c8
Remove flaky assertions. (#12210) 2023-01-31 16:52:06 +01:00
Daniël de Kok
c6cca4c00a
Language.distill: copy both reference and predicted (#12209)
* Language.distill: copy both reference and predicted

In distillation we also modify the teacher docs (e.g. in tok2vec
components), so we need to copy both the reference and predicted doc.

Problem caught by @shadeMe

* Make new `_copy_examples` args kwonly
2023-01-31 13:19:42 +01:00
Daniël de Kok
fb7f018ded
Add the configuration schema for distillation (#12201)
* Add the configuration schema for distillation

This also adds the default configuration and some tests. The schema will
be used by the training loop and `distill` subcommand.

* Format

* Change distillation shortopt to -d

* Fix description of max_epochs

* Rename distillation flag to -dt

* Rename `pipe_map` to `student_to_teacher`
2023-01-31 13:06:02 +01:00
Paul O'Leary McCann
1b5aba9e22
Don't re-download installed models (#12188)
* Don't re-download installed models

When downloading a model, this checks if the same version of the same
model is already installed. If it is then the download is skipped.

This is necessary because pip uses the final download URL for its
caching feature, but because of the way models are hosted on Github,
their URLs change every few minutes.

* Use importlib instead of meta.json

* Use get_package_version

* Add untested, disabled test
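
The installed-version check described above can be sketched with the standard library alone. This is an illustration under assumptions, not the code from this commit: `installed_version` and `needs_download` are hypothetical helpers, and `importlib.metadata` is used here in place of spaCy's own `get_package_version`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


def needs_download(
    package_name: str,
    wanted_version: str,
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> bool:
    """Skip the download only when exactly the wanted version is installed."""
    return lookup(package_name) != wanted_version
```

The `lookup` parameter exists only so the check can be exercised without installing anything; a caller would normally rely on the default.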

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-31 11:31:17 +01:00
Adriane Boyd
0e51c918ae
Normalize whitespace in evaluate CLI output test (#12157)
* Normalize whitespace in evaluate CLI output test

Depending on terminal settings, lines may be padded to the screen width
so the comparison is too strict with only the command string replacement.

* Move to test util method

* Change to normalization method
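
A rough sketch of the idea, assuming a regex-based helper: collapse every run of whitespace before comparing, so that lines padded to the terminal width compare equal to unpadded ones. The function below is illustrative and may differ from the test utility added in this commit.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


padded = "alpha  beta          \ngamma             \n"
unpadded = "alpha beta\ngamma\n"
print(normalize_whitespace(padded) == normalize_whitespace(unpadded))
```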
2023-01-30 17:51:27 +01:00
Daniël de Kok
6b07be2110
Add Language.distill (#12116)
* Add `Language.distill`

This method is the distillation counterpart of `Language.update`.  It
takes a teacher `Language` instance and distills the student pipes on
the teacher pipes.

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Clarify how Example is used in distillation

* Update transition parser distill docstring for examples argument

* Pass optimizer to `TrainablePipe.distill`

* Annotate pipe before update

As discussed internally, we want to let a pipe annotate before doing an
update with gold/silver data. Otherwise, the output may be (too)
informed by the gold/silver data.

* Rename `component_map` to `student_to_teacher`

* Better synopsis in `Language.distill` docstring

* `name` -> `student_name`

* Fix labels type in docstring

* Mark distill test as slow

* Fix `student_to_teacher` type in docs
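
To illustrate the role of `student_to_teacher`, here is a toy sketch of pairing student pipes with teacher pipes by name. It is not the `Language.distill` implementation: `pair_pipes` is a hypothetical helper, and the fallback to an identical pipe name when no mapping entry exists is an assumption made for this example.

```python
from typing import Dict, List, Optional, Tuple


def pair_pipes(
    student_names: List[str],
    teacher_names: List[str],
    student_to_teacher: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, str]]:
    """Pair each student pipe with the teacher pipe it should learn from."""
    mapping = student_to_teacher or {}
    pairs = []
    for student_name in student_names:
        # Assumption: without an explicit entry, the names are expected to match.
        teacher_name = mapping.get(student_name, student_name)
        if teacher_name not in teacher_names:
            raise ValueError(
                f"No teacher pipe '{teacher_name}' for student pipe '{student_name}'"
            )
        pairs.append((student_name, teacher_name))
    return pairs


pairs = pair_pipes(
    ["tagger", "ner"],
    ["tagger", "entity_recognizer"],
    student_to_teacher={"ner": "entity_recognizer"},
)
print(pairs)
```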

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-01-30 12:44:11 +01:00
Paul O'Leary McCann
8932f4dc35
Add extra flag to assets docs (#12194)
* Add extra flag to assets docs

For some reason this wasn't included.

* Add new tag to docs
2023-01-30 10:05:23 +01:00
Adriane Boyd
606273f7e4
Normalize whitespace in evaluate CLI output test (#12157)
* Normalize whitespace in evaluate CLI output test

Depending on terminal settings, lines may be padded to the screen width
so the comparison is too strict with only the command string replacement.

* Move to test util method

* Change to normalization method
2023-01-27 16:13:34 +01:00
Adriane Boyd
ec45f704b1
Drop python 3.6/3.7, remove unneeded compat (#12187)
* Drop python 3.6/3.7, remove unneeded compat

* Remove unused import

* Minimal python 3.8+ docs updates
2023-01-27 15:48:20 +01:00
Sofie Van Landeghem
bd739e67d6
explain KB change and how to remedy (#12189) 2023-01-27 15:13:20 +01:00
Adriane Boyd
5f8a398bb9
Add span_id to Span.char_span, update Doc/Span.char_span docs (#12196)
* Add span_id to Span.char_span, update Doc/Span.char_span docs

`Span.char_span(id=)` should be removed in the future.

* Also use Union[int, str] in Doc docstring
2023-01-27 15:09:17 +01:00
Sofie Van Landeghem
1678a98449
Merge pull request #12192 from adrianeboyd/chore/update-v4-from-master-5
Update v4 from master, format, update CI
2023-01-27 14:59:26 +01:00
Simon Gurcke
774c10fa39
Add alignment_mode argument to Span.char_span() (#12145)
* Add alignment_mode argument to Span.char_span()

* Update website

* Update spacy/tokens/span.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-27 11:43:40 +01:00
Adriane Boyd
16609517f1 CI: Skip tests that require published pipelines 2023-01-27 08:37:02 +01:00
Adriane Boyd
fd911fe2af Format 2023-01-27 08:29:46 +01:00
Adriane Boyd
8548d4d16e Merge remote-tracking branch 'upstream/master' into update-v4-from-master-1 2023-01-27 08:29:09 +01:00
Peter Baumgartner
c68e6b8a96
trainable_lemmatizer in debug data (#11419)
* WIP

* rm ipython embeds

* rm total

* WIP

* cleanup

* cleanup + reword

* rm component function

* remove migration support form

* fix reference dataset for dev data

* additional fixes

- set approach to identifying unique trees
- adjust line length on messages
- add logic for detecting docs without annotations

* use 0 instead of none for no annotation

* partial annotation support

* initial tests for _compile_gold lemma attributes

Using the example data from the edit tree lemmatizer tests for:
- lemmatizer_trees
- partial_lemma_annotations
- n_low_cardinality_lemmas
- no_lemma_annotations

* adds output test for cli app

* switch msg level

* rm unclear uniqueness check

* Revert "rm unclear uniqueness check"

This reverts commit 6ea2b3524b.

* remove good message on uniqueness

* formatting

* use en_vocab fixture

* clarify data set source in messages

* remove unnecessary import

Co-authored-by: svlandeg <svlandeg@github.com>
2023-01-26 17:36:50 +01:00
Daniël de Kok
8d69874afb
Add spacy.PlainTextCorpusReader.v1 (#12122)
* Add `spacy.PlainTextCorpusReader.v1`

This is a corpus reader that reads plain text corpora with the following
format:

- UTF-8 encoding
- One line per document.
- Blank lines are ignored.

It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.

* Update spacy/training/corpus.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* docs: add version to `PlainTextCorpus`

* Add docstring to registry function

* Add plain text corpus tests

* Only strip newline/carriage return

* Add return type _string_to_tmp_file helper

* Use a temporary directory in place of file name

Different OS auto delete/sharing semantics are just wonky.

* This will be new in 3.5.1 (rather than 4)

* Test improvements from code review
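
The file format described above can be read with a few lines of Python. This is a minimal sketch, not the `spacy.PlainTextCorpusReader.v1` implementation: `read_plain_text_corpus` is a hypothetical helper, and it assumes that a line counts as blank only when nothing remains after the trailing newline and carriage return are stripped.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def read_plain_text_corpus(path: Path) -> Iterator[str]:
    """Yield one document text per non-empty line of a UTF-8 file."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            # Only strip the line terminator; other whitespace is kept.
            text = line.rstrip("\r\n")
            if text:
                yield text


# Small demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = Path(tmp_dir) / "corpus.txt"
    corpus_path.write_bytes(b"first doc\n\n  second doc  \r\n")
    docs = list(read_plain_text_corpus(corpus_path))
print(docs)
```

Because the file is read lazily, the whole corpus never has to fit in memory, which is the motivation given above for very large corpora.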

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-26 11:33:22 +01:00
Marcus Blättermann
a37117abd0
Fix text colors in docs (#12186) 2023-01-26 10:30:24 +01:00
Marcus Blättermann
056b73468c
Load components dynamically (decrease initial file size for docs) (#12175)
* Extract `CodeBlock` component into own file

* Extract `InlineCode` component into own file

* Extract `TypeAnnotation` component into own file

* Convert named `export` to `default export`

* Remove unused `export`

* Simplify `TypeAnnotation` to remove dependency for Prism

* Load `Code` component dynamically

* Extract `MarkdownToReact` component into own file

* WIP Code Dynamic

* Load `MarkdownToReact` component dynamically

* Extract `htmlToReact` to own file

* Load `htmlToReact` component dynamically

* Dynamically load `Juniper`
2023-01-25 17:30:41 +01:00
Adriane Boyd
07dfa54669
CI: Extend website excludes (#12185) 2023-01-25 15:35:17 +01:00
Marcus Blättermann
11f10fff60
Fix frontpage image (#12184) 2023-01-25 13:17:35 +01:00
Marcus Blättermann
5a6000fb8b
Fix text color in docs (#12183)
* Fix text color on landing page

* Fix code color
2023-01-25 13:14:32 +01:00
Adriane Boyd
8ea15240ca
Update binder version to v3.5 (#12153) 2023-01-25 13:14:23 +01:00
Adriane Boyd
2dbb764183
CI: Add black formatting check to validation (#12182) 2023-01-25 12:51:37 +01:00
Marcus Blättermann
99a05734a8
Add aria-label to quickstart widget (#12179) 2023-01-25 11:46:55 +01:00
Marcus Blättermann
0298b1a863
WEB-28 Increase contrast of grey text (#12178)
* Use transparent colors to increase contrast on darker backgrounds

* Increase color contrast of grey text
2023-01-25 11:46:43 +01:00
Marcus Blättermann
3062fae2ca
Fix broken URL (#12176) 2023-01-25 11:42:19 +01:00
Marcus Blättermann
57ba37bc52
Fix regression with links in prompts (#12172) 2023-01-25 08:51:40 +01:00
Marcus Blättermann
05a3685849
Fix broken syntax for type annotations (#12171) 2023-01-25 08:51:25 +01:00
Paul O'Leary McCann
de360bc981
Refactor lexeme mem passing (#12125)
* Don't pass mem pool to new lexeme function

* Remove unused mem from function args

Two methods calling _new_lexeme, get and get_by_orth, took mem arguments
just to call the internal method. That's no longer necessary, so this
cleans it up.

* prettier formatting

* Remove more unused mem args
2023-01-25 12:50:21 +09:00
Marcus Blättermann
f3c586f74a
Fix navigation alert (#12169)
Fixes a regression introduced in #12163
2023-01-24 16:40:40 +01:00
Marcus Blättermann
49237f05a6
Fix aria-hidden element (#12163)
* Rename CSS class to make use more clear

* Rename component prop to improve code readability

* Fix `aria-hidden` directly on a link element

This link wouldn't have been clickable by screenreaders

* Refactor component

This removes an unnecessary `div` and a duplicate link

Co-authored-by: Ines Montani <ines@ines.io>
2023-01-24 14:44:47 +01:00
Marcus Blättermann
0a70696923
Fix wrong HTML element attribute (#12151)
Originally introduced in 62b9c9c6d7

Original error: Warning: Invalid DOM property `class`. Did you mean `className`?

React doesn't have `class`, it uses `className`.
2023-01-24 14:35:31 +01:00
Marcus Blättermann
9555e7aecf
Remove unnecessary links (#12159)
There is no need to link to the image we are already viewing and this is also considered an accessibility issue.
2023-01-24 14:01:00 +01:00
Marcus Blättermann
031f6c7b60
WEB-27 Add alt tags to images (#12166)
* Update spaCy badge `alt` text

* Add `next/image` component to Universe

* Add missing `alt` texts
2023-01-24 13:56:14 +01:00
Marcus Blättermann
c9beb47ab7
Increase contrast of text and theme color (#12165) 2023-01-24 13:55:20 +01:00
Marcus Blättermann
a7d6a62f7c
Remove zoom locking (#12164)
* Fix missing comma

* Activate user zoom for website

This is recommended by lighthouse:

> Disabling zooming is problematic for users with low vision who rely on screen magnification to properly see the contents of a web page.

Also iOS already ignores this attribute anyway.
2023-01-24 13:54:49 +01:00
Marcus Blättermann
48159e1d60
Update explosion logo (#12162)
This fixes a misalignment of the explosion logo
2023-01-24 13:53:51 +01:00
Marcus Blättermann
7160f7835d
Fix GitHub badge (#12161)
* Extract component

* Remove rounded border from GitHub Stars badge

* Add `alt` text
2023-01-24 13:53:28 +01:00
Marcus Blättermann
3aa61e615f
Add missing label (#12160) 2023-01-24 13:52:55 +01:00
Marcus Blättermann
fcedcd54a8
WEB-30 spaCy pattern in .png (#12158)
* Fix gap in landing pattern at the top

* Replace `.jpg` patterns with `.png`

This drastically reduces file size (for the landing page from 221kb to 57kb) while doubling the resolution to look sharper on retina displays.
2023-01-24 13:51:39 +01:00
Sofie Van Landeghem
de1fe8dce3
Fix Azure ignoring website files (#12129)
* ignore all mdx files and all files in website

* have both .md and .mdx

* exclude everything but universe.json
2023-01-24 10:02:07 +01:00