Commit Graph

1868 Commits

Author SHA1 Message Date
thomashacker
5f57bdb588 Merge branch 'master' into add-more-info-to-custom-code-doc 2023-03-31 10:21:26 +02:00
Ye Lei (叶磊)
ce258670b7
Allow passing a Span to displacy.parse_deps (#12477)
* Allow passing a Span to displacy.parse_deps

* Update docstring

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update API docs

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-31 09:44:01 +02:00
Adriane Boyd
dc1a3d07fc
Update website/docs/usage/training.mdx 2023-03-29 09:02:21 +02:00
Edward
dba4e7bece
Add info to stringstore and vocab (#12471) 2023-03-27 13:15:14 +02:00
thomashacker
1e820aceb4 link to spacy load and mention difference 2023-03-27 13:03:46 +02:00
Prajakta Darade
ae7779e830
corrected example code (#12466) 2023-03-27 11:32:49 +02:00
kadarakos
d1474fdd91
add explanation about overwriting behaviour (#12464)
* add explanation about overwriting behaviour

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* format

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-27 10:27:11 +02:00
Vinit Ravishankar
28de85737f
Tagger label smoothing (#12293)
* add label smoothing

* use True/False instead of floats

* add entropy to debug data

* formatting

* docs

* change test to check difference in distributions

* Update website/docs/api/tagger.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/tagger.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* bool -> float

* update docs

* fix seed

* black

* update tests to use label_smoothing = 0.0

* set default to 0.0, update quickstart

* Update spacy/pipeline/tagger.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update morphologizer, tagger test

* fix morph docs

* add url to docs

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-22 12:17:56 +01:00
Adriane Boyd
2ce9a220db
Fix --verbose for spacy find-threshold (#12418) 2023-03-14 17:16:49 +01:00
Lj Miranda
913d74f509
Add spancat_singlelabel pipeline for multiclass and non-overlapping span labelling tasks (#11365)
* [wip] Update

* [wip] Update

* Add initial port

* [wip] Update

* Fix all imports

* Add spancat_exclusive to pipeline

* [WIP] Update

* [ci skip] Add breakpoint for debugging

* Use spacy.SpanCategorizer.v1 as default archi

* Update spacy/pipeline/spancat_exclusive.py

Co-authored-by: kadarakos <kadar.akos@gmail.com>

* [ci skip] Small updates

* Use Softmax v2 directly from thinc

* Cache the label map

* Fix mypy errors

However, I ignored line 370 because it opened up a bunch of type errors
that might be trickier to solve and might lead to a more complicated
codebase.

* avoid multiplication with 1.0

Co-authored-by: kadarakos <kadar.akos@gmail.com>

* Update spacy/pipeline/spancat_exclusive.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update component versions to v2

* Add scorer to docstring

* Add _n_labels property to SpanCategorizer

Instead of using len(self.labels) in initialize() I am using a private
property self._n_labels. This achieves implementation parity and allows
me to delete the whole initialize() method for spancat_exclusive (since
it's now the same with spancat).

* Inherit from SpanCat instead of TrainablePipe

This commit changes the inheritance structure of Exclusive_Spancat,
now it's inheriting from SpanCategorizer than TrainablePipe. This
allows me to remove duplicate methods that are already present in
the parent function.

* Revert documentation link to spancat

* Fix init call for exclusive spancat

* Update spacy/pipeline/spancat_exclusive.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Import Suggester from spancat

* Include zero_init.v1 for spancat

* Implement _allow_extra_label to use _n_labels

To ensure that spancat / spancat_exclusive cannot be resized after
initialization, I inherited the _allow_extra_label() method from
spacy/pipeline/trainable_pipe.pyx and used self._n_labels instead
of len(self.labels) for checking.

I think that changing it locally is a better solution rather than
forcing each class that inherits TrainablePipe to use the self._n_labels
attribute.

Also note that I turned-off black formatting in this block of code
because it reads better without the overhang.

* Extend existing tests to spancat_exclusive

In this commit, I extended the existing tests for spancat to include
spancat_exclusive. I parametrized the test functions with 'name'
(similar var name with textcat and textcat_multilabel) for each
applicable test.

TODO: Add overfitting tests for spancat_exclusive

* Update documentation for spancat

* Turn on formatting for allow_extra_label

* Remove initializers in default config

* Use DEFAULT_EXCL_SPANCAT_MODEL

I also renamed spancat_exclusive_default_config into
spancat_excl_default_config because black does some not pretty
formatting changes.

* Update documentation

Update grammar and usage

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Clarify docstring for Exclusive_SpanCategorizer

* Remove mypy ignore and typecast labels to list

* Fix documentation API

* Use a single variable for tests

* Update defaults for number of rows

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Put back initializers in spancat config

Whenever I remove model.scorer.init_w and model.scorer.init_b,
I encounter an error in the test:

    SystemError: <method '__getitem__' of 'dict' objects> returned a result
    with an error set.

My Thinc version is 8.1.5, but I can't seem to check what's causing the
error.

* Update spancat_exclusive docstring

* Remove init_W and init_B parameters

This commit is expected to fail until the new Thinc release.

* Require thinc>=8.1.6 for serializable Softmax defaults

* Handle zero suggestions to make tests pass

I'm not sure if this is the most elegant solution. But what should
happen is that the _make_span_group function MUST return an empty
SpanGroup if there are no suggestions.

The error happens when the 'scores' variable is empty. We cannot
get the 'predicted' and other downstream vars.

* Better approach for handling zero suggestions

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spancategorizer headers

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add default value in negative_weight in docs

* Add default value in allow_overlap in docs

* Update how spancat_exclusive is constructed

In this commit, I added the following:
- Put the default values of negative_weight and allow_overlap
    in the default_config dictionary.
- Rename make_spancat -> make_exclusive_spancat

* Run prettier on spancategorizer.mdx

* Change exactly one -> at most one

* Add suggester documentation in Exclusive_SpanCategorizer

* Add suggester to spancat docstrings

* merge multilabel and singlelabel spancat

* rename spancat_exclusive to singlelable

* wire up different make_spangroups for single and multilabel

* black

* black

* add docstrings

* more docstring and fix negative_label

* don't rely on default arguments

* black

* remove spancat exclusive

* replace single_label with add_negative_label and adjust inference

* mypy

* logical bug in configuration check

* add spans.attrs[scores]

* single label make_spangroup test

* bugfix

* black

* tests for make_span_group with negative labels

* refactor make_span_group

* black

* Update spacy/tests/pipeline/test_spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* remove duplicate declaration

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* raise error instead of just print

* make label mapper private

* update docs

* run prettier

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* don't keep recomputing self._label_map for each span

* typo in docs

* Intervals to private and document 'name' param

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* add Tag to new features

* replace tags

* revert

* revert

* revert

* revert

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* prettier

* Fix merge

* Update website/docs/api/spancategorizer.mdx

* remove references to 'single_label'

* remove old paragraph

* Add spancat_singlelabel to config template

* Format

* Extend init config tests

---------

Co-authored-by: kadarakos <kadar.akos@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-09 10:30:59 +01:00
Edward
f21d12bbff
Fix wording
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-03-06 15:04:41 +01:00
thomashacker
13f9a9af20 Add warning section 2023-03-06 12:16:29 +01:00
Raphael Mitsch
1ea31552be Merge branch 'master' into sync/master-into-v4
# Conflicts:
#	requirements.txt
#	spacy/pipeline/entity_linker.py
#	spacy/util.py
#	website/docs/api/entitylinker.mdx
2023-03-02 16:24:15 +01:00
Raphael Mitsch
6aa6b86d49
Make generation of empty KnowledgeBase instances configurable in EntityLinker (#12320)
* Make empty_kb() configurable.

* Format.

* Update docs.

* Be more specific in KB serialization test.

* Update KB serialization tests. Update docs.

* Remove doc update for batched candidate generation.

* Fix serialization of subclassed KB in tests.

* Format.

* Update docstring.

* Update docstring.

* Switch from pickle to json for custom field serialization.
2023-03-01 16:02:55 +01:00
Adriane Boyd
da75896ef5
Return Tuple[Span] for all Doc/Span attrs that provide spans (#12288)
* Return Tuple[Span] for all Doc/Span attrs that provide spans

* Update Span types
2023-03-01 16:00:02 +01:00
kadarakos
56aa0cc75f
Displacy doc fix (#12352)
* more details for color setting

* more details for color setting

* prettier
2023-03-01 15:38:23 +01:00
Raphael Mitsch
efbc3d37b3
Update docs w.r.t. spacy.CandidateBatchGenerator.v1. (#12350) 2023-03-01 11:01:35 +01:00
Adriane Boyd
33864f1d07
Add new tags in docs for #12334 (#12348) 2023-03-01 10:46:13 +01:00
TAN Long
071667376a
Add new REL_OPs: >+, >-, <+, and <- (#12334)
* Add immediate left/right child/parent dependency relations

* Add tests for new REL_OPs: `>+`, `>-`, `<+`, and `<-`.

---------

Co-authored-by: Tan Long <tanloong@foxmail.com>
2023-02-28 14:36:33 +01:00
Adriane Boyd
4539fbae17
Revert "Fix FUZZY operator definition (#12318)" (#12336)
This reverts commit daedc45d05.

The default length depends on the length of the pattern string and was
correct for this example.
2023-02-27 09:48:36 +01:00
Adriane Boyd
df4c069a13
Remove backoff from .vector to .tensor (#12292) 2023-02-23 11:36:50 +01:00
andyjessen
daedc45d05
Fix FUZZY operator definition (#12318)
* Fix FUZZY operator definition

The default length of the FUZZY operator is 2 and not 3.

* adjust edit distance in matcher usage docs too

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2023-02-23 09:37:40 +01:00
Adriane Boyd
b95123060a
Make Span.char_span optional args keyword-only (#12257)
* Make Span.char_span optional args keyword-only

* Make kb_id and following kw-only

* Format
2023-02-15 12:34:33 +01:00
Raphael Mitsch
2d4fb94ba0
Fix wrong file name in docs for rule-based matcher. (#12262) 2023-02-09 12:58:14 +01:00
Adriane Boyd
cbc2ae933e
Remove unused Span.char_span(id=) (#12250) 2023-02-08 14:46:07 +01:00
Adriane Boyd
cf85b81f34
Remove names for vectors (#12243)
* Remove names for vectors

Named vectors are basically a carry-over from v2 and aren't used for
anything.

* Format
2023-02-08 14:37:42 +01:00
Raphael Mitsch
d38a88f0f3
Remove negation. (#12252) 2023-02-08 14:18:33 +01:00
Sofie Van Landeghem
c47ec5b5c6
Merge pull request #12218 from adrianeboyd/chore/update-v4-from-master-7
Update v4 from master
2023-02-03 12:04:20 +01:00
Paul O'Leary McCann
89f974d4f5
Cleanup/remove backwards compat overwrite settings (#11888)
* Remove backwards-compatible overwrite from Entity Linker

This also adds a docstring about overwrite, since it wasn't present.

* Fix docstring

* Remove backward compat settings in Morphologizer

This also needed a docstring added.

For this component it's less clear what the right overwrite settings
are.

* Remove backward compat from sentencizer

This was simple

* Remove backward compat from senter

Another simple one

* Remove backward compat setting from tagger

* Add docstrings

* Update spacy/pipeline/morphologizer.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update docs

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-02 14:13:38 +01:00
Adriane Boyd
cd95b29053 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master-7 2023-02-02 13:06:15 +01:00
Sofie Van Landeghem
4c60afb946
Backslash fixes in docs (#12213)
* backslash fixes

* revert unrelated change
2023-02-01 10:15:38 +01:00
Edward
360ccf628a
Rename language codes (Icelandic, multi-language) (#12149)
* Init

* fix tests

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix test_blank_languages

* Rename xx to mul in docs

* Format _util with black

* prettier formatting

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-31 17:30:43 +01:00
Daniël de Kok
6b07be2110
Add Language.distill (#12116)
* Add `Language.distill`

This method is the distillation counterpart of `Language.update`.  It
takes a teacher `Language` instance and distills the student pipes on
the teacher pipes.

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Clarify that how Example is used in distillation

* Update transition parser distill docstring for examples argument

* Pass optimizer to `TrainablePipe.distill`

* Annotate pipe before update

As discussed internally, we want to let a pipe annotate before doing an
update with gold/silver data. Otherwise, the output may be (too)
informed by the gold/silver data.

* Rename `component_map` to `student_to_teacher`

* Better synopsis in `Language.distill` docstring

* `name` -> `student_name`

* Fix labels type in docstring

* Mark distill test as slow

* Fix `student_to_teacher` type in docs

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-01-30 12:44:11 +01:00
Paul O'Leary McCann
8932f4dc35
Add extra flag to assets docs (#12194)
* Add extra flag to assets docs

For some reason this wasn't included.

* Add new tag to docs
2023-01-30 10:05:23 +01:00
Adriane Boyd
ec45f704b1
Drop python 3.6/3.7, remove unneeded compat (#12187)
* Drop python 3.6/3.7, remove unneeded compat

* Remove unused import

* Minimal python 3.8+ docs updates
2023-01-27 15:48:20 +01:00
Sofie Van Landeghem
bd739e67d6
explain KB change and how to remedy (#12189) 2023-01-27 15:13:20 +01:00
Adriane Boyd
5f8a398bb9
Add span_id to Span.char_span, update Doc/Span.char_span docs (#12196)
* Add span_id to Span.char_span, update Doc/Span.char_span docs

`Span.char_span(id=)` should be removed in the future.

* Also use Union[int, str] in Doc docstring
2023-01-27 15:09:17 +01:00
Simon Gurcke
774c10fa39
Add alignment_mode argument to Span.char_span() (#12145)
* Add alignment_mode argument to Span.char_span()

* Update website

* Update spacy/tokens/span.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-27 11:43:40 +01:00
Adriane Boyd
8548d4d16e Merge remote-tracking branch 'upstream/master' into update-v4-from-master-1 2023-01-27 08:29:09 +01:00
Daniël de Kok
8d69874afb
Add spacy.PlainTextCorpusReader.v1 (#12122)
* Add `spacy.PlainTextCorpusReader.v1`

This is a corpus reader that reads plain text corpora with the following
format:

- UTF-8 encoding
- One line per document.
- Blank lines are ignored.

It is useful for applications where we deal with very large corpora,
such as distillation, and don't want to deal with the space overhead of
serialized formats. Additionally, many large corpora already use such
a text format, keeping the necessary preprocessing to a minimum.

* Update spacy/training/corpus.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* docs: add version to `PlainTextCorpus`

* Add docstring to registry function

* Add plain text corpus tests

* Only strip newline/carriage return

* Add return type _string_to_tmp_file helper

* Use a temporary directory in place of file name

Different OS auto delete/sharing semantics are just wonky.

* This will be new in 3.5.1 (rather than 4)

* Test improvements from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-26 11:33:22 +01:00
Marcus Blättermann
05a3685849
Fix broken syntax for type annotations (#12171) 2023-01-25 08:51:25 +01:00
Paul O'Leary McCann
de360bc981
Refactor lexeme mem passing (#12125)
* Don't pass mem pool to new lexeme function

* Remove unused mem from function args

Two methods calling _new_lexeme, get and get_by_orth, took mem arguments
just to call the internal method. That's no longer necessary, so this
cleans it up.

* prettier formatting

* Remove more unused mem args
2023-01-25 12:50:21 +09:00
Marcus Blättermann
031f6c7b60
WEB-27 Add alt tags to images (#12166)
* Update spaCy badge `alt` text

* Add `next/image` component to Universe

* Add missing `alt`texts
2023-01-24 13:56:14 +01:00
Edward
e9048fd4a1
Add how to load probability tables to existing models to spaCy docs (#12051)
* add section about adding tables to models

* change to lexeme_norm

* Change syntax

* change to _prob

* Update website/docs/usage/saving-loading.mdx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-24 10:01:22 +01:00
Sofie Van Landeghem
0f5d8a27f2
3.5 usage page (#12057)
* skeleton

* Fill in non-CLI details from release notes draft

* Add TODO for fuzzy matching

* Website updates for v3-5 draft

* Fill in usage examples

* Add fuzzy matching to intro

* Fix fuzzy examples

* Shell example formatting

* Fix typo

* Format

* Remove trailing periods in internal list

* Update

* Fix spacing for nested lists

* Update InMemoryLookupKB link

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2023-01-19 16:13:04 +01:00
Adriane Boyd
3b8918e166
API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar (#12128)
* API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar

* adjust to mdx

* linkout to InMemoryLookupKB at first occurrence in kb.mdx

* fix links to docs

* revert Azure trigger setting (I'll make a separate PR)

Co-authored-by: svlandeg <svlandeg@github.com>
2023-01-19 13:29:17 +01:00
Sofie Van Landeghem
7d88c55eeb
update docs for apply (#12127)
* update docs for apply

* prettier
2023-01-19 10:37:09 +01:00
Daniël de Kok
b052b1b47f
Fix batching regression (#12094)
* Fix batching regression

Some time ago, the spaCy v4 branch switched to the new Thinc v9
schedule. However, this introduced an error in how batching is handed.

In the PR, the batchers were changed to keep track of their step,
so that the step can be passed to the schedule. However, the issue
is that the training loop repeatedly calls the batching functions
(rather than using an infinite generator/iterator). So, the step and
therefore the schedule would be reset each epoch. Before the schedule
switch we didn't have this issue, because the old schedules were
stateful.

This PR fixes this issue by reverting the batching functions to use
a (stateful) generator. Their registry functions do accept a `Schedule`
and we convert `Schedule`s to generators.

* Update batcher docs

* Docstring fixes

* Make minibatch take iterables again as well

* Bump thinc requirement to 9.0.0.dev2

* Use type declaration

* Convert another comment into a proper type declaration
2023-01-18 18:28:30 +01:00
Daniël de Kok
a183db3cef
Merge the parser refactor into v4 (#10940)
* Try to fix doc.copy

* Set dev version

* Make vocab always own lexemes

* Change version

* Add SpanGroups.copy method

* Fix set_annotations during Parser.update

* Fix dict proxy copy

* Upd version

* Fix copying SpanGroups

* Fix set_annotations in parser.update

* Fix parser set_annotations during update

* Revert "Fix parser set_annotations during update"

This reverts commit eb138c89ed.

* Revert "Fix set_annotations in parser.update"

This reverts commit c6df0eafd0.

* Fix set_annotations during parser update

* Inc version

* Handle final states in get_oracle_sequence

* Inc version

* Try to fix parser training

* Inc version

* Fix

* Inc version

* Fix parser oracle

* Inc version

* Inc version

* Fix transition has_gold

* Inc version

* Try to use real histories, not oracle

* Inc version

* Upd parser

* Inc version

* WIP on rewrite parser

* WIP refactor parser

* New progress on parser model refactor

* Prepare to remove parser_model.pyx

* Convert parser from cdef class

* Delete spacy.ml.parser_model

* Delete _precomputable_affine module

* Wire up tb_framework to new parser model

* Wire up parser model

* Uncython ner.pyx and dep_parser.pyx

* Uncython

* Work on parser model

* Support unseen_classes in parser model

* Support unseen classes in parser

* Cleaner handling of unseen classes

* Work through tests

* Keep working through errors

* Keep working through errors

* Work on parser. 15 tests failing

* Xfail beam stuff. 9 failures

* More xfail. 7 failures

* Xfail. 6 failures

* cleanup

* formatting

* fixes

* pass nO through

* Fix empty doc in update

* Hackishly fix resizing. 3 failures

* Fix redundant test. 2 failures

* Add reference version

* black formatting

* Get tests passing with reference implementation

* Fix missing prints

* Add missing file

* Improve indexing on reference implementation

* Get non-reference forward func working

* Start rigging beam back up

* removing redundant tests, cf #8106

* black formatting

* temporarily xfailing issue 4314

* make flake8 happy again

* mypy fixes

* ensure labels are added upon predict

* cleanup remnants from merge conflicts

* Improve unseen label masking

Two changes to speed up masking by ~10%:

- Use a bool array rather than an array of float32.

- Let the mask indicate whether a label was seen, rather than
  unseen. The mask is most frequently used to index scores for
  seen labels. However, since the mask marked unseen labels,
  this required computing an intermittent flipped mask.

* Write moves costs directly into numpy array (#10163)

This avoids elementwise indexing and the allocation of an additional
array.

Gives a ~15% speed improvement when using batch_by_sequence with size
32.

* Temporarily disable ner and rehearse tests

Until rehearse is implemented again in the refactored parser.

* Fix loss serialization issue (#10600)

* Fix loss serialization issue

Serialization of a model fails with:

TypeError: array(738.3855, dtype=float32) is not JSON serializable

Fix this using float conversion.

* Disable CI steps that require spacy.TransitionBasedParser.v2

After finishing the refactor, TransitionBasedParser.v2 should be
provided for backwards compat.

* Add back support for beam parsing to the refactored parser (#10633)

* Add back support for beam parsing

Beam parsing was already implemented as part of the `BeamBatch` class.
This change makes its counterpart `GreedyBatch`. Both classes are hooked
up in `TransitionModel`, selecting `GreedyBatch` when the beam size is
one, or `BeamBatch` otherwise.

* Use kwarg for beam width

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Avoid implicit default for beam_width and beam_density

* Parser.{beam,greedy}_parse: ensure labels are added

* Remove 'deprecated' comments

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Parser `StateC` optimizations (#10746)

* `StateC`: Optimizations

Avoid GIL acquisition in `__init__`
Increase default buffer capacities on init
Reduce C++ exception overhead

* Fix typo

* Replace `set::count` with `set::find`

* Add exception attribute to c'tor

* Remove unused import

* Use a power-of-two value for initial capacity
Use default-insert to init `_heads` and `_unshiftable`

* Merge `cdef` variable declarations and assignments

* Vectorize `example.get_aligned_parses` (#10789)

* `example`: Vectorize `get_aligned_parse`
Rename `numpy` import

* Convert aligned array to lists before returning

* Revert import renaming

* Elide slice arguments when selecting the entire range

* Tagger/morphologizer alignment performance optimizations (#10798)

* `example`: Unwrap `numpy` scalar arrays before passing them to `StringStore.__getitem__`

* `AlignmentArray`: Use native list as staging buffer for offset calculation

* `example`: Vectorize `get_aligned`

* Hoist inner functions out of `get_aligned`

* Replace inline `if..else` clause in assignment statement

* `AlignmentArray`: Use raw indexing into offset and data `numpy` arrays

* `example`: Replace array unique value check with `groupby`

* `example`: Correctly exclude tokens with no alignment in `_get_aligned_vectorized`
Simplify `_get_aligned_non_vectorized`

* `util`: Update `all_equal` docstring

* Explicitly use `int32_t*`

* Restore C CPU inference in the refactored parser (#10747)

* Bring back the C parsing model

The C parsing model is used for CPU inference and is still faster for
CPU inference than the forward pass of the Thinc model.

* Use C sgemm provided by the Ops implementation

* Make tb_framework module Cython, merge in C forward implementation

* TransitionModel: raise in backprop returned from forward_cpu

* Re-enable greedy parse test

* Return transition scores when forward_cpu is used

* Apply suggestions from code review

Import `Model` from `thinc.api`

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use relative imports in tb_framework

* Don't assume a default for beam_width

* We don't have a direct dependency on BLIS anymore

* Rename forwards to _forward_{fallback,greedy_cpu}

* Require thinc >=8.1.0,<8.2.0

* tb_framework: clean up imports

* Fix return type of _get_seen_mask

* Move up _forward_greedy_cpu

* Style fixes.

* Lower thinc lowerbound to 8.1.0.dev0

* Formatting fix

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Reimplement parser rehearsal function (#10878)

* Reimplement parser rehearsal function

Before the parser refactor, rehearsal was driven by a loop in the
`rehearse` method itself. For each parsing step, the loops would:

1. Get the predictions of the teacher.
2. Get the predictions and backprop function of the student.
3. Compute the loss and backprop into the student.
4. Move the teacher and student forward with the predictions of
   the student.

In the refactored parser, we cannot perform search stepwise rehearsal
anymore, since the model now predicts all parsing steps at once.
Therefore, rehearsal is performed in the following steps:

1. Get the predictions of all parsing steps from the student, along
   with its backprop function.
2. Get the predictions from the teacher, but use the predictions of
   the student to advance the parser while doing so.
3. Compute the loss and backprop into the student.

To support the second step a new method, `advance_with_actions` is
added to `GreedyBatch`, which performs the provided parsing steps.

* tb_framework: wrap upper_W and upper_b in Linear

Thinc's Optimizer cannot handle resizing of existing parameters. Until
it does, we work around this by wrapping the weights/biases of the upper
layer of the parser model in Linear. When the upper layer is resized, we
copy over the existing parameters into a new Linear instance. This does
not trigger an error in Optimizer, because it sees the resized layer as
a new set of parameters.

* Add test for TransitionSystem.apply_actions

* Better FIXME marker

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Fixes from Madeesh

* Apply suggestions from Sofie

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove useless assignment

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename some identifiers in the parser refactor (#10935)

* Rename _parseC to _parse_batch

* tb_framework: prefix many auxiliary functions with underscore

To clearly state the intent that they are private.

* Rename `lower` to `hidden`, `upper` to `output`

* Parser slow test fixup

We don't have TransitionBasedParser.{v1,v2} until we bring it back as a
legacy option.

* Remove last vestiges of PrecomputableAffine

This does not exist anymore as a separate layer.

* ner: re-enable sentence boundary checks

* Re-enable test that works now.

* test_ner: make loss test more strict again

* Remove commented line

* Re-enable some more beam parser tests

* Remove unused _forward_reference function

* Update for CBlas changes in Thinc 8.1.0.dev2

Bump thinc dependency to 8.1.0.dev3.

* Remove references to spacy.TransitionBasedParser.{v1,v2}

Since they will not be offered starting with spaCy v4.

* `tb_framework`: Replace references to `thinc.backends.linalg` with `CBlas`

* dont use get_array_module (#11056) (#11293)

Co-authored-by: kadarakos <kadar.akos@gmail.com>

* Move `thinc.extra.search` to `spacy.pipeline._parser_internals` (#11317)

* `search`: Move from `thinc.extra.search`
Fix NPE in `Beam.__dealloc__`

* `pytest`: Add support for executing Cython tests
Move `search` tests from thinc and patch them to run with `pytest`

* `mypy` fix

* Update comment

* `conftest`: Expose `register_cython_tests`

* Remove unused import

* Move `argmax` impls to new `_parser_utils` Cython module (#11410)

* Parser does not have to be a cdef class anymore

This also fixes validation of the initialization schema.

* Add back spacy.TransitionBasedParser.v2

* Fix a rename that was missed in #10878.

So that rehearsal tests pass.

* Remove module from setup.py that got added during the merge

* Bring back support for `update_with_oracle_cut_size` (#12086)

* Bring back support for `update_with_oracle_cut_size`

This option was available in the pre-refactor parser, but was never
implemented in the refactored parser. This option cuts transition
sequences that are longer than `update_with_oracle_cut` size into
separate sequences that have at most `update_with_oracle_cut`
transitions. The oracle (gold standard) transition sequence is used to
determine the cuts and the initial states for the additional sequences.

Applying this cut makes the batches more homogeneous in the transition
sequence lengths, making forward passes (and as a consequence training)
much faster.

Training time 1000 steps on de_core_news_lg:

- Before this change: 149s
- After this change: 68s
- Pre-refactor parser: 81s

* Fix a rename that was missed in #10878.

So that rehearsal tests pass.

* Apply suggestions from @shadeMe

* Use chained conditional

* Test with update_with_oracle_cut_size={0, 1, 5, 100}

And fix a git that occurs with a cut size of 1.

* Fix up some merge fall out

* Update parser distillation for the refactor

In the old parser, we'd iterate over the transitions in the distill
function and compute the loss/gradients on the go. In the refactored
parser, we first let the student model parse the inputs. Then we'll let
the teacher compute the transition probabilities of the states in the
student's transition sequence. We can then compute the gradients of the
student given the teacher.

* Add back spacy.TransitionBasedParser.v1 references

- Accordion in the architecture docs.
- Test in test_parse, but disabled until we have a spacy-legacy release.

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: kadarakos <kadar.akos@gmail.com>
2023-01-18 11:27:45 +01:00
Adriane Boyd
794cea6907
Fix comments and examples for levenshtein_compare (#12113) 2023-01-18 08:02:33 +01:00