Compare commits


110 Commits

Author SHA1 Message Date
Daniël de Kok
17c4a3d646
Set version to v4.0.0.dev1 (#12748) 2023-06-23 09:43:41 +02:00
Sofie Van Landeghem
95619b6736
Merge pull request #12717 from danieldk/sync-v4-master-20230612
Merge master into v4
2023-06-22 17:44:57 +01:00
Daniël de Kok
096794dd74 Account for differences between Span.sents in spaCy 3/4 2023-06-22 15:38:22 +02:00
Daniël de Kok
4990cfefb4 spancat type fixes 2023-06-12 16:43:11 +02:00
Daniël de Kok
d82e167aea Remove Python 3.7 builds 2023-06-12 16:16:03 +02:00
Daniël de Kok
50c5e9a2dd Merge remote-tracking branch 'upstream/master' into sync-v4-master-20230612 2023-06-12 15:57:10 +02:00
Daniël de Kok
8a5814bf2c
Add distillation loop (#12542)
* Add distillation initialization and loop

* Fix up configuration keys

* Add docstring

* Type annotations

* init_nlp_distill -> init_nlp_student

* Do not resolve dot name distill corpus in initialization

(Since we don't use it.)

* student: do not request use of optimizer in student pipe

We finish up the updates once in the training loop instead.

Also add the necessary logic to `Language.distill` to mirror
`Language.update`.

* Correctly determine sort key in subdivide_batch

* Fix _distill_loop docstring wrt. stopping condition

* _distill_loop: fix distill_data docstring

Make similar changes in train_while_improving, since it also had
incorrect types and missing type annotations.

* Move `set_{gpu_allocator,seed}_from_config` to spacy.util

* Update Language.update docs for the sgd argument

* Type annotation

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-04-21 13:49:40 +02:00
Adriane Boyd
5d0f48fe69
Enforce that Span.start/end(_char) remain valid and in sync (#12268)
* Enforce that Span.start/end(_char) remain valid and in sync

Allowing span attributes to be writable starting in v3 has made it
possible for the internal `Span.start/end/start_char/end_char` to get
out-of-sync or have invalid values.

This checks that the values are valid and syncs the token and char
offsets if any attributes are modified directly. It does not yet handle
the case where the underlying doc is modified.

* Format
2023-04-06 16:01:59 +02:00
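A minimal sketch of the behaviour described above, assuming a blank English pipeline: after this change, writing a token offset such as `Span.end` should also re-validate and re-sync the character offsets.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("The quick brown fox jumps over the lazy dog")
span = doc[1:3]  # "quick brown"

# Span offsets are writable since v3; with this change the char offsets
# are validated and kept in sync when a token offset is modified directly.
span.end = 4
print(span.text, span.start_char, span.end_char)
```
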
Daniël de Kok
b734e5314d
Avoid TrainablePipe.finish_update getting called twice during training (#12450)
* Avoid `TrainablePipe.finish_update` getting called twice during training

PR #12136 fixed an issue where the tok2vec pipe was updated before
gradients were accumulated. However, it introduced a new bug that caused
`finish_update` to be called twice when using the training loop. This
causes a fairly large slowdown.

The `Language.update` method accepts the `sgd` argument for passing an
optimizer. This argument has three possible values:

- `Optimizer`: use the given optimizer to finish pipe updates.
- `None`: use a default optimizer to finish pipe updates.
- `False`: do not finish pipe updates.

However, the latter option was not documented and not valid with the
existing type of `sgd`. I assumed that this was a remnant of earlier
spaCy versions and removed handling of `False`.

However, with that change, we are passing `None` to `Language.update`.
As a result, we were calling `finish_update` in both `Language.update`
and in the training loop after all subbatches are processed.

This change restores proper handling/use of `False`. Moreover, the role
of `False` is now documented and added to the type to avoid future
accidents.

* Fix typo

* Document defaults for `Language.update`
2023-03-30 09:30:42 +02:00
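For reference, a hedged sketch of the three `sgd` values described above in a custom loop; the tiny textcat setup is only there to make the snippet self-contained.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

examples = [
    Example.from_dict(nlp.make_doc("I love this"), {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    Example.from_dict(nlp.make_doc("I hate this"), {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
optimizer = nlp.initialize(lambda: examples)

# sgd=Optimizer: each trainable pipe finishes its own update with this optimizer.
nlp.update(examples, sgd=optimizer)

# sgd=None (the default): a default optimizer is created/reused internally.
nlp.update(examples)

# sgd=False (the restored option): gradients are accumulated but not applied,
# so the caller finishes the update once after all subbatches are processed.
losses = {}
nlp.update(examples, sgd=False, losses=losses)
for name, proc in nlp.pipeline:
    if hasattr(proc, "finish_update"):
        proc.finish_update(optimizer)
```
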
Edward
a653dec654
Add info that Vocab and StringStore are not static in docs (#12427)
* Add size increase info about vocab and stringstore

* Update website/docs/api/stringstore.mdx

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* Update website/docs/api/vocab.mdx

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* Change wording

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-03-27 09:18:23 +02:00
Raphael Mitsch
3102e2e27a
Entity linking: use SpanGroup instead of Iterable[Span] for mentions (#12344)
* Convert Candidate from Cython to Python class.

* Format.

* Fix .entity_ typo in _add_activations() usage.

* Change the type of the mentions to look up entity candidates for from Iterable[Span] to SpanGroup.

* Update docs.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update doc string of BaseCandidate.__init__().

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.

* Adjust Candidate to support and mandate numerical entity IDs.

* Format.

* Fix docstring and docs.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename alias -> mention.

* Refactor Candidate attribute names. Update docs and tests accordingly.

* Refactor Candidate attributes and their usage.

* Format.

* Fix mypy error.

* Update error code in line with v4 convention.

* Reverse erroneous changes during merge.

* Update return type in EL tests.

* Re-add Candidate to setup.py.

* Format updated docs.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-20 12:25:18 +01:00
Raphael Mitsch
9340eb8ad2
Introduce hierarchy for EL Candidate objects (#12341)
* Convert Candidate from Cython to Python class.

* Format.

* Fix .entity_ typo in _add_activations() usage.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update doc string of BaseCandidate.__init__().

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate.

* Adjust Candidate to support and mandate numerical entity IDs.

* Format.

* Fix docstring and docs.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename alias -> mention.

* Refactor Candidate attribute names. Update docs and tests accordingly.

* Refactor Candidate attributes and their usage.

* Format.

* Fix mypy error.

* Update error code in line with v4 convention.

* Update spacy/kb/candidate.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Updated error code.

* Simplify interface for int/str representations.

* Update website/docs/api/kb.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename 'alias' to 'mention'.

* Port Candidate and InMemoryCandidate to Cython.

* Remove redundant entry in setup.py.

* Add abstract class check.

* Drop storing mention.

* Update spacy/kb/candidate.pxd

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix entity_id refactoring problems in docstrings.

* Drop unused InMemoryCandidate._entity_hash.

* Update docstrings.

* Move attributes out of Candidate.

* Partially fix alias/mention terminology usage. Convert Candidate to interface.

* Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs().

* Update docstrings related to prior_prob.

* Update alias/mention usage in doc(strings).

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs.

* Update docstrings.

* Fix InMemoryCandidate attribute names.

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/models/entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update W401 test.

* Update spacy/errors.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/kb/kb.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use Candidate output type for toy generators in the test suite to mimic best practices

* fix docs

* fix import

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-03-20 00:34:35 +01:00
Adriane Boyd
6ae7618418
Clean up Vocab constructor (#12290)
* Clean up Vocab constructor

* Change effective type of `strings` from `Iterable[str]` to `Optional[StringStore]`
  * Don't automatically add strings to vocab
* Change default values to `None`
* Remove `**deprecated_kwargs`

* Format
2023-03-19 23:41:20 +01:00
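A small sketch of the cleaned-up constructor, assuming the v4 behaviour described above: `strings` is an optional `StringStore` rather than an iterable of strings, and nothing is added to it implicitly.

```python
from spacy.strings import StringStore
from spacy.vocab import Vocab

# Share an existing StringStore with the new Vocab (or pass strings=None).
store = StringStore(["apple", "orange"])
vocab = Vocab(strings=store)

print(vocab.strings["apple"])    # 64-bit hash of "apple" from the shared store
print("apple" in vocab.strings)  # True
```
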
Madeesh Kannan
520279ff7c
Tok2Vec: Add distill method (#12108)
* `Tok2Vec`: Add `distill` method

* `Tok2Vec`: Refactor `update`

* Add `Tok2Vec.distill` test

* Update `distill` signature to accept `Example`s instead of separate teacher and student docs

* Add docs

* Remove docstring

* Update test

* Remove `update` calls from test

* Update `Tok2Vec.distill` docstring
2023-03-09 09:37:19 +01:00
Raphael Mitsch
41b3a0d932
Drop support for EntityLinker_v1. (#12377) 2023-03-07 13:10:45 +01:00
Adriane Boyd
8ca71f9591
Merge pull request #12371 from rmitsch/sync/master-into-v4
Sync `v4` with latest from `master`
2023-03-06 17:10:19 +01:00
Raphael Mitsch
749e446ee3 Merge branch 'master' into sync/master-into-v4
# Conflicts:
#	.github/azure-steps.yml
2023-03-06 16:27:56 +01:00
Sofie Van Landeghem
04f41854c1
Merge pull request #12356 from rmitsch/sync/master-into-v4
Sync `v4` with latest from `master`
2023-03-03 09:31:45 +01:00
Raphael Mitsch
1ea31552be Merge branch 'master' into sync/master-into-v4
# Conflicts:
#	requirements.txt
#	spacy/pipeline/entity_linker.py
#	spacy/util.py
#	website/docs/api/entitylinker.mdx
2023-03-02 16:24:15 +01:00
Adriane Boyd
da75896ef5
Return Tuple[Span] for all Doc/Span attrs that provide spans (#12288)
* Return Tuple[Span] for all Doc/Span attrs that provide spans

* Update Span types
2023-03-01 16:00:02 +01:00
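Following the change above, span-providing attributes come back as plain tuples of `Span`; a minimal sketch using `Doc.ents`.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a startup")
doc.ents = [Span(doc, 0, 1, label="ORG")]

ents = doc.ents
print(type(ents).__name__)                 # "tuple"
print([(e.text, e.label_) for e in ents])  # [("Apple", "ORG")]
```
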
Adriane Boyd
df4c069a13
Remove backoff from .vector to .tensor (#12292) 2023-02-23 11:36:50 +01:00
Daniël de Kok
e27c60a702
Reimplement distillation with oracle cut size (#12214)
* Improve the correctness of _parse_patch

* If there are no more actions, do not attempt to make further
  transitions, even if not all states are final.
* Assert that the number of actions for a step is the same as
  the number of states.

* Reimplement distillation with oracle cut size

The code for distillation with an oracle cut size was not reimplemented
after the parser refactor. We did not notice, because we did not have
tests for this functionality. This change brings back the functionality
and adds this to the parser tests.

* Rename states2actions to _states_to_actions for consistency

* Test distillation max cuts in NER

* Mark parser/NER tests as slow

* Typo

* Fix invariant in _states_diff_to_actions

* Rename _init_batch -> _init_batch_from_teacher

* Ninja edit the ninja edit

* Check that we raise an exception when we pass the incorrect number of actions

* Remove unnecessary get

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Write out condition more explicitly

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-02-21 15:47:18 +01:00
Paul O'Leary McCann
dd3f138830
Use tempfile.TemporaryDirectory (#12285) 2023-02-16 11:08:55 +01:00
Adriane Boyd
b95123060a
Make Span.char_span optional args keyword-only (#12257)
* Make Span.char_span optional args keyword-only

* Make kb_id and following kw-only

* Format
2023-02-15 12:34:33 +01:00
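A sketch of the keyword-only convention described above: everything after the character offsets now has to be passed by name.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup")
span = doc[:]  # the whole doc as a Span

# Works: label, kb_id, etc. are passed as keyword arguments.
ent = span.char_span(0, 5, label="ORG", kb_id="Q312")
print(ent.text, ent.label_, ent.kb_id_)

# No longer accepted: span.char_span(0, 5, "ORG")
```
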
Adriane Boyd
cbc2ae933e
Remove unused Span.char_span(id=) (#12250) 2023-02-08 14:46:07 +01:00
Adriane Boyd
cf85b81f34
Remove names for vectors (#12243)
* Remove names for vectors

Named vectors are basically a carry-over from v2 and aren't used for
anything.

* Format
2023-02-08 14:37:42 +01:00
Adriane Boyd
5089efa2d0
Use the same tuple in Span cmp and hash (#12251) 2023-02-08 14:28:34 +01:00
Daniël de Kok
eec5ccd72f
Language.update: ensure that tok2vec gets updated (#12136)
* `Language.update`: ensure that tok2vec gets updated

The components in a pipeline can be updated independently. However,
tok2vec implementations are an exception to this, since they depend on
listeners for their gradients. The update method of a tok2vec
implementation computes the tok2vec forward and passes this along with a
backprop function to the listeners. This backprop function accumulates
gradients for all the listeners. There are two ways in which the
accumulated gradients can be used to update the tok2vec weights:

1. Call the `finish_update` method of tok2vec *after* the `update`
   method is called on all of the pipes that use a tok2vec listener.
2. Pass an optimizer to the `update` method of tok2vec. In this
   case, tok2vec will give the last listener a special backprop
   function that calls `finish_update` on the tok2vec.

Unfortunately, `Language.update` did neither of these. Instead, it
immediately called `finish_update` on every pipe after `update`. As a
result, the tok2vec weights are updated when no gradients have been
accumulated from listeners yet. And the gradients of the listeners are
only used in the next call to `Language.update` (when `finish_update` is
called on tok2vec again).

This change fixes this issue by passing the optimizer to the `update`
method of trainable pipes, leading to use of the second strategy
outlined above.

The main updating loop in `Language.update` is also simplified by using
the `TrainableComponent` protocol consistently.

* Train loop: `sgd` is `Optional[Optimizer]`, do not pass false

* Language.update: call pipe finish_update after all pipe updates

This gives correct and fast updates when multiple components update the
same parameters.

* Add comment why we moved `finish_update` to a separate loop
2023-02-03 15:22:25 +01:00
Sofie Van Landeghem
c47ec5b5c6
Merge pull request #12218 from adrianeboyd/chore/update-v4-from-master-7
Update v4 from master
2023-02-03 12:04:20 +01:00
Paul O'Leary McCann
89f974d4f5
Cleanup/remove backwards compat overwrite settings (#11888)
* Remove backwards-compatible overwrite from Entity Linker

This also adds a docstring about overwrite, since it wasn't present.

* Fix docstring

* Remove backward compat settings in Morphologizer

This also needed a docstring added.

For this component it's less clear what the right overwrite settings
are.

* Remove backward compat from sentencizer

This was simple

* Remove backward compat from senter

Another simple one

* Remove backward compat setting from tagger

* Add docstrings

* Update spacy/pipeline/morphologizer.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update docs

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-02-02 14:13:38 +01:00
Adriane Boyd
cd95b29053 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master-7 2023-02-02 13:06:15 +01:00
Paul O'Leary McCann
6920fb7baf
Move Entity Linker v1 to spacy-legacy (#12006)
* Move Entity Linker v1 component to spacy-legacy

This is a follow up to #11889 that moves the component instead of
removing it.

In general, we never import from spacy-legacy in spaCy proper. However,
to use this component, that kind of import will be necessary. I was able
to test this without issues, but is this current import strategy
acceptable? Or should we put the component in a registry?

* Use spacy-legacy pr for CI

This will need to be reverted before merging.

* Add temporary step to log installed spacy-legacy version

* Modify requirements.txt to trigger tests

* Add comment to Python to trigger tests

* TODO REVERT This is a commit with logic changes to trigger tests

* Remove pipe from YAML

Works locally, but possibly this is causing a quoting error or
something.

* Revert "TODO REVERT This is a commit with logic changes to trigger tests"

This reverts commit 689fae71f3.

* Revert "Add comment to Python to trigger tests"

This reverts commit 11840fc598.

* Add more logging

* Try installing directly in workflow

* Try explicitly uninstalling spacy-legacy first

* Cat requirements.txt to confirm contents

In the branch, the thinc version spec is `thinc>=8.1.0,<8.2.0`. But in
the logs, it's clear that a development release of 9.0 is being
installed. It's not clear why that would happen.

* Log requirements at start of build

* TODO REVERT Change thinc spec

Want to see what happens to the installed thinc spec with this change.

* Update thinc requirements

This makes it the same as it was before the merge, >=8.1.0,<8.2.0.

* Use same thinc version as v4 branch

* TODO REVERT Mark dependency check as xfail

spacy-legacy is specified as a git checkout in requirements.txt while
this PR is in progress, which makes the consistency check here fail.

* Remove debugging output / install step

* Revert "Remove debugging output / install step"

This reverts commit 923ea7448b.

* Clean up debugging output

The manual install step with the URL fragment seems to have caused
issues on Windows due to the = in the URL being misinterpreted. On the
other hand, removing it seems to mean the git version of spacy-legacy
isn't actually installed.

This PR removes the URL fragment but keeps the direct command-line
install. Additionally, since it looks like this job is configured to use
the default shell (and not bash), it removes a comment that upsets the
Windows cmd shell.

* Revert "TODO REVERT Mark dependency check as xfail"

This reverts commit d4863ec156.

* Fix requirements.txt, increasing spacy-legacy version

* Raise spacy legacy version in setup.cfg

* Remove azure build workarounds

* make spacy-legacy version explicit in error message

* Remove debugging line

* Suggestions from code review
2023-02-01 09:47:56 +01:00
Edward
360ccf628a
Rename language codes (Icelandic, multi-language) (#12149)
* Init

* fix tests

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix test_blank_languages

* Rename xx to mul in docs

* Format _util with black

* prettier formatting

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-31 17:30:43 +01:00
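The practical effect of the rename above, as a one-liner: the multi-language class is now requested with the code `mul` instead of `xx`.

```python
import spacy

# Previously: spacy.blank("xx")
nlp = spacy.blank("mul")  # multi-language ("any language") pipeline
print(nlp.lang)
```
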
Daniël de Kok
c6cca4c00a
Language.distill: copy both reference and predicted (#12209)
* Language.distill: copy both reference and predicted

In distillation we also modify the teacher docs (e.g. in tok2vec
components), so we need to copy both the reference and predicted doc.

Problem caught by @shadeMe

* Make new `_copy_examples` args kwonly
2023-01-31 13:19:42 +01:00
Daniël de Kok
fb7f018ded
Add the configuration schema for distillation (#12201)
* Add the configuration schema for distillation

This also adds the default configuration and some tests. The schema will
be used by the training loop and `distill` subcommand.

* Format

* Change distillation shortopt to -d

* Fix description of max_epochs

* Rename distillation flag to -dt

* Rename `pipe_map` to `student_to_teacher`
2023-01-31 13:06:02 +01:00
Paul O'Leary McCann
1b5aba9e22
Don't re-download installed models (#12188)
* Don't re-download installed models

When downloading a model, this checks if the same version of the same
model is already installed. If it is then the download is skipped.

This is necessary because pip uses the final download URL for its
caching feature, but because of the way models are hosted on Github,
their URLs change every few minutes.

* Use importlib instead of meta.json

* Use get_package_version

* Add untested, disabled test

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-01-31 11:31:17 +01:00
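An illustrative sketch of the check described above, not spaCy's actual implementation: compare the installed distribution version against the requested one before downloading.

```python
from importlib import metadata

def is_installed(package: str, version: str) -> bool:
    """True if `package` is already installed at exactly `version`."""
    try:
        return metadata.version(package) == version
    except metadata.PackageNotFoundError:
        return False

# Hypothetical usage: only trigger the pip download when needed.
if not is_installed("en_core_web_sm", "3.5.0"):
    print("downloading en_core_web_sm-3.5.0 ...")
else:
    print("already installed, skipping download")
```
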
Daniël de Kok
6b07be2110
Add Language.distill (#12116)
* Add `Language.distill`

This method is the distillation counterpart of `Language.update`.  It
takes a teacher `Language` instance and distills the student pipes on
the teacher pipes.

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Clarify how Example is used in distillation

* Update transition parser distill docstring for examples argument

* Pass optimizer to `TrainablePipe.distill`

* Annotate pipe before update

As discussed internally, we want to let a pipe annotate before doing an
update with gold/silver data. Otherwise, the output may be (too)
informed by the gold/silver data.

* Rename `component_map` to `student_to_teacher`

* Better synopsis in `Language.distill` docstring

* `name` -> `student_name`

* Fix labels type in docstring

* Mark distill test as slow

* Fix `student_to_teacher` type in docs

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-01-30 12:44:11 +01:00
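A hedged sketch of the new method described above; the exact signature may differ, and `teacher_nlp`, `student_nlp` and `examples` are assumed to already exist.

```python
from thinc.api import Adam

optimizer = Adam(0.001)
losses = {}

# Distillation counterpart of Language.update: the student pipes are trained
# on the teacher's predictions instead of gold annotations.
student_nlp.distill(teacher_nlp, examples, sgd=optimizer, losses=losses)
print(losses)
```
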
Adriane Boyd
ec45f704b1
Drop python 3.6/3.7, remove unneeded compat (#12187)
* Drop python 3.6/3.7, remove unneeded compat

* Remove unused import

* Minimal python 3.8+ docs updates
2023-01-27 15:48:20 +01:00
Sofie Van Landeghem
1678a98449
Merge pull request #12192 from adrianeboyd/chore/update-v4-from-master-5
Update v4 from master, format, update CI
2023-01-27 14:59:26 +01:00
Adriane Boyd
16609517f1 CI: Skip tests that require published pipelines 2023-01-27 08:37:02 +01:00
Adriane Boyd
fd911fe2af Format 2023-01-27 08:29:46 +01:00
Adriane Boyd
8548d4d16e Merge remote-tracking branch 'upstream/master' into update-v4-from-master-1 2023-01-27 08:29:09 +01:00
Paul O'Leary McCann
de360bc981
Refactor lexeme mem passing (#12125)
* Don't pass mem pool to new lexeme function

* Remove unused mem from function args

Two methods calling _new_lexeme, get and get_by_orth, took mem arguments
just to call the internal method. That's no longer necessary, so this
cleans it up.

* prettier formatting

* Remove more unused mem args
2023-01-25 12:50:21 +09:00
Daniël de Kok
6348a7a4b4
Set version to v4.0.0.dev0 (#12126) 2023-01-19 09:25:34 +01:00
Daniël de Kok
b052b1b47f
Fix batching regression (#12094)
* Fix batching regression

Some time ago, the spaCy v4 branch switched to the new Thinc v9
schedule. However, this introduced an error in how batching is handled.

In the PR, the batchers were changed to keep track of their step,
so that the step can be passed to the schedule. However, the issue
is that the training loop repeatedly calls the batching functions
(rather than using an infinite generator/iterator). So, the step and
therefore the schedule would be reset each epoch. Before the schedule
switch we didn't have this issue, because the old schedules were
stateful.

This PR fixes this issue by reverting the batching functions to use
a (stateful) generator. Their registry functions do accept a `Schedule`
and we convert `Schedule`s to generators.

* Update batcher docs

* Docstring fixes

* Make minibatch take iterables again as well

* Bump thinc requirement to 9.0.0.dev2

* Use type declaration

* Convert another comment into a proper type declaration
2023-01-18 18:28:30 +01:00
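The `minibatch` helper mentioned above ("Make minibatch take iterables again as well") in its simplest form; a minimal sketch with plain integers as items.

```python
from spacy.util import minibatch

items = list(range(10))

# minibatch is a plain (stateful) generator: it takes any iterable and
# yields lists of at most `size` items.
for batch in minibatch(items, size=4):
    print(batch)
# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9]
```
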
Daniël de Kok
a183db3cef
Merge the parser refactor into v4 (#10940)
* Try to fix doc.copy

* Set dev version

* Make vocab always own lexemes

* Change version

* Add SpanGroups.copy method

* Fix set_annotations during Parser.update

* Fix dict proxy copy

* Upd version

* Fix copying SpanGroups

* Fix set_annotations in parser.update

* Fix parser set_annotations during update

* Revert "Fix parser set_annotations during update"

This reverts commit eb138c89ed.

* Revert "Fix set_annotations in parser.update"

This reverts commit c6df0eafd0.

* Fix set_annotations during parser update

* Inc version

* Handle final states in get_oracle_sequence

* Inc version

* Try to fix parser training

* Inc version

* Fix

* Inc version

* Fix parser oracle

* Inc version

* Inc version

* Fix transition has_gold

* Inc version

* Try to use real histories, not oracle

* Inc version

* Upd parser

* Inc version

* WIP on rewrite parser

* WIP refactor parser

* New progress on parser model refactor

* Prepare to remove parser_model.pyx

* Convert parser from cdef class

* Delete spacy.ml.parser_model

* Delete _precomputable_affine module

* Wire up tb_framework to new parser model

* Wire up parser model

* Uncython ner.pyx and dep_parser.pyx

* Uncython

* Work on parser model

* Support unseen_classes in parser model

* Support unseen classes in parser

* Cleaner handling of unseen classes

* Work through tests

* Keep working through errors

* Keep working through errors

* Work on parser. 15 tests failing

* Xfail beam stuff. 9 failures

* More xfail. 7 failures

* Xfail. 6 failures

* cleanup

* formatting

* fixes

* pass nO through

* Fix empty doc in update

* Hackishly fix resizing. 3 failures

* Fix redundant test. 2 failures

* Add reference version

* black formatting

* Get tests passing with reference implementation

* Fix missing prints

* Add missing file

* Improve indexing on reference implementation

* Get non-reference forward func working

* Start rigging beam back up

* removing redundant tests, cf #8106

* black formatting

* temporarily xfailing issue 4314

* make flake8 happy again

* mypy fixes

* ensure labels are added upon predict

* cleanup remnants from merge conflicts

* Improve unseen label masking

Two changes to speed up masking by ~10%:

- Use a bool array rather than an array of float32.

- Let the mask indicate whether a label was seen, rather than
  unseen. The mask is most frequently used to index scores for
  seen labels. However, since the mask marked unseen labels,
  this required computing an intermediate flipped mask.

* Write moves costs directly into numpy array (#10163)

This avoids elementwise indexing and the allocation of an additional
array.

Gives a ~15% speed improvement when using batch_by_sequence with size
32.

* Temporarily disable ner and rehearse tests

Until rehearse is implemented again in the refactored parser.

* Fix loss serialization issue (#10600)

* Fix loss serialization issue

Serialization of a model fails with:

TypeError: array(738.3855, dtype=float32) is not JSON serializable

Fix this using float conversion.

* Disable CI steps that require spacy.TransitionBasedParser.v2

After finishing the refactor, TransitionBasedParser.v2 should be
provided for backwards compat.

* Add back support for beam parsing to the refactored parser (#10633)

* Add back support for beam parsing

Beam parsing was already implemented as part of the `BeamBatch` class.
This change makes its counterpart `GreedyBatch`. Both classes are hooked
up in `TransitionModel`, selecting `GreedyBatch` when the beam size is
one, or `BeamBatch` otherwise.

* Use kwarg for beam width

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Avoid implicit default for beam_width and beam_density

* Parser.{beam,greedy}_parse: ensure labels are added

* Remove 'deprecated' comments

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Parser `StateC` optimizations (#10746)

* `StateC`: Optimizations

Avoid GIL acquisition in `__init__`
Increase default buffer capacities on init
Reduce C++ exception overhead

* Fix typo

* Replace `set::count` with `set::find`

* Add exception attribute to c'tor

* Remove unused import

* Use a power-of-two value for initial capacity
Use default-insert to init `_heads` and `_unshiftable`

* Merge `cdef` variable declarations and assignments

* Vectorize `example.get_aligned_parses` (#10789)

* `example`: Vectorize `get_aligned_parse`
Rename `numpy` import

* Convert aligned array to lists before returning

* Revert import renaming

* Elide slice arguments when selecting the entire range

* Tagger/morphologizer alignment performance optimizations (#10798)

* `example`: Unwrap `numpy` scalar arrays before passing them to `StringStore.__getitem__`

* `AlignmentArray`: Use native list as staging buffer for offset calculation

* `example`: Vectorize `get_aligned`

* Hoist inner functions out of `get_aligned`

* Replace inline `if..else` clause in assignment statement

* `AlignmentArray`: Use raw indexing into offset and data `numpy` arrays

* `example`: Replace array unique value check with `groupby`

* `example`: Correctly exclude tokens with no alignment in `_get_aligned_vectorized`
Simplify `_get_aligned_non_vectorized`

* `util`: Update `all_equal` docstring

* Explicitly use `int32_t*`

* Restore C CPU inference in the refactored parser (#10747)

* Bring back the C parsing model

The C parsing model is used for CPU inference and is still faster than
the forward pass of the Thinc model on CPU.

* Use C sgemm provided by the Ops implementation

* Make tb_framework module Cython, merge in C forward implementation

* TransitionModel: raise in backprop returned from forward_cpu

* Re-enable greedy parse test

* Return transition scores when forward_cpu is used

* Apply suggestions from code review

Import `Model` from `thinc.api`

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use relative imports in tb_framework

* Don't assume a default for beam_width

* We don't have a direct dependency on BLIS anymore

* Rename forwards to _forward_{fallback,greedy_cpu}

* Require thinc >=8.1.0,<8.2.0

* tb_framework: clean up imports

* Fix return type of _get_seen_mask

* Move up _forward_greedy_cpu

* Style fixes.

* Lower thinc lowerbound to 8.1.0.dev0

* Formatting fix

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Reimplement parser rehearsal function (#10878)

* Reimplement parser rehearsal function

Before the parser refactor, rehearsal was driven by a loop in the
`rehearse` method itself. For each parsing step, the loop would:

1. Get the predictions of the teacher.
2. Get the predictions and backprop function of the student.
3. Compute the loss and backprop into the student.
4. Move the teacher and student forward with the predictions of
   the student.

In the refactored parser, we cannot perform such stepwise rehearsal
anymore, since the model now predicts all parsing steps at once.
Therefore, rehearsal is performed in the following steps:

1. Get the predictions of all parsing steps from the student, along
   with its backprop function.
2. Get the predictions from the teacher, but use the predictions of
   the student to advance the parser while doing so.
3. Compute the loss and backprop into the student.

To support the second step a new method, `advance_with_actions` is
added to `GreedyBatch`, which performs the provided parsing steps.

* tb_framework: wrap upper_W and upper_b in Linear

Thinc's Optimizer cannot handle resizing of existing parameters. Until
it does, we work around this by wrapping the weights/biases of the upper
layer of the parser model in Linear. When the upper layer is resized, we
copy over the existing parameters into a new Linear instance. This does
not trigger an error in Optimizer, because it sees the resized layer as
a new set of parameters.

* Add test for TransitionSystem.apply_actions

* Better FIXME marker

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Fixes from Madeesh

* Apply suggestions from Sofie

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove useless assignment

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Rename some identifiers in the parser refactor (#10935)

* Rename _parseC to _parse_batch

* tb_framework: prefix many auxiliary functions with underscore

To clearly state the intent that they are private.

* Rename `lower` to `hidden`, `upper` to `output`

* Parser slow test fixup

We don't have TransitionBasedParser.{v1,v2} until we bring it back as a
legacy option.

* Remove last vestiges of PrecomputableAffine

This does not exist anymore as a separate layer.

* ner: re-enable sentence boundary checks

* Re-enable test that works now.

* test_ner: make loss test more strict again

* Remove commented line

* Re-enable some more beam parser tests

* Remove unused _forward_reference function

* Update for CBlas changes in Thinc 8.1.0.dev2

Bump thinc dependency to 8.1.0.dev3.

* Remove references to spacy.TransitionBasedParser.{v1,v2}

Since they will not be offered starting with spaCy v4.

* `tb_framework`: Replace references to `thinc.backends.linalg` with `CBlas`

* dont use get_array_module (#11056) (#11293)

Co-authored-by: kadarakos <kadar.akos@gmail.com>

* Move `thinc.extra.search` to `spacy.pipeline._parser_internals` (#11317)

* `search`: Move from `thinc.extra.search`
Fix NPE in `Beam.__dealloc__`

* `pytest`: Add support for executing Cython tests
Move `search` tests from thinc and patch them to run with `pytest`

* `mypy` fix

* Update comment

* `conftest`: Expose `register_cython_tests`

* Remove unused import

* Move `argmax` impls to new `_parser_utils` Cython module (#11410)

* Parser does not have to be a cdef class anymore

This also fixes validation of the initialization schema.

* Add back spacy.TransitionBasedParser.v2

* Fix a rename that was missed in #10878.

So that rehearsal tests pass.

* Remove module from setup.py that got added during the merge

* Bring back support for `update_with_oracle_cut_size` (#12086)

* Bring back support for `update_with_oracle_cut_size`

This option was available in the pre-refactor parser, but was never
implemented in the refactored parser. This option cuts transition
sequences that are longer than `update_with_oracle_cut` size into
separate sequences that have at most `update_with_oracle_cut`
transitions. The oracle (gold standard) transition sequence is used to
determine the cuts and the initial states for the additional sequences.

Applying this cut makes the batches more homogeneous in the transition
sequence lengths, making forward passes (and as a consequence training)
much faster.

Training time 1000 steps on de_core_news_lg:

- Before this change: 149s
- After this change: 68s
- Pre-refactor parser: 81s

* Fix a rename that was missed in #10878.

So that rehearsal tests pass.

* Apply suggestions from @shadeMe

* Use chained conditional

* Test with update_with_oracle_cut_size={0, 1, 5, 100}

And fix a bug that occurs with a cut size of 1.

* Fix up some merge fall out

* Update parser distillation for the refactor

In the old parser, we'd iterate over the transitions in the distill
function and compute the loss/gradients on the go. In the refactored
parser, we first let the student model parse the inputs. Then we'll let
the teacher compute the transition probabilities of the states in the
student's transition sequence. We can then compute the gradients of the
student given the teacher.

* Add back spacy.TransitionBasedParser.v1 references

- Accordion in the architecture docs.
- Test in test_parse, but disabled until we have a spacy-legacy release.

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: kadarakos <kadar.akos@gmail.com>
2023-01-18 11:27:45 +01:00
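The `update_with_oracle_cut_size` option restored above is a per-component training setting; a hedged sketch of setting it when adding the parser to a blank pipeline.

```python
import spacy

nlp = spacy.blank("en")

# Cut oracle transition sequences into chunks of at most 100 transitions
# during training, keeping batches more homogeneous (see the timings above).
nlp.add_pipe("parser", config={"update_with_oracle_cut_size": 100})
print(nlp.pipe_names)
```
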
Daniël de Kok
5e297aa20e
Add TrainablePipe.{distill,get_teacher_student_loss} (#12016)
* Add `TrainablePipe.{distill,get_teacher_student_loss}`

This change adds two methods:

- `TrainablePipe::distill` which performs a training step of a
   student pipe on a teacher pipe, given a batch of `Doc`s.
- `TrainablePipe::get_teacher_student_loss` computes the loss
  of a student relative to the teacher.

The `distill` or `get_teacher_student_loss` methods are also implemented
in the tagger, edit tree lemmatizer, and parser pipes, to enable
distillation in those pipes and as an example for other pipes.

* Fix stray `Beam` import

* Fix incorrect import

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* TrainablePipe.distill: use `Iterable[Example]`

* Add Pipe.is_distillable method

* Add `validate_distillation_examples`

This first calls `validate_examples` and then checks that the
student/teacher tokens are the same.

* Update distill documentation

* Add distill documentation for all pipes that support distillation

* Fix incorrect identifier

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add comment to explain `is_distillable`

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-01-16 10:25:53 +01:00
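A hedged sketch of the per-pipe API described above, assuming teacher and student pipelines that both contain a tagger; `teacher_nlp`, `student_nlp`, `examples` and `optimizer` are placeholders.

```python
teacher_tagger = teacher_nlp.get_pipe("tagger")
student_tagger = student_nlp.get_pipe("tagger")

# Only pipes that implement distillation report is_distillable as True.
if student_tagger.is_distillable:
    losses = {}
    student_tagger.distill(teacher_tagger, examples, sgd=optimizer, losses=losses)
    print(losses.get("tagger"))
```
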
Sofie Van Landeghem
c2f3e699ca
fix anchors (#12095) 2023-01-13 11:14:58 +01:00
Sofie Van Landeghem
2c2e66e145
Merge pull request #12096 from svlandeg/copy_v4
Sync with latest from master
2023-01-11 20:46:33 +01:00
svlandeg
fc2723925b update tests from master to follow v4 principles (2) 2023-01-11 19:04:06 +01:00
svlandeg
6ff5eb256c update tests from master to follow v4 principles 2023-01-11 18:57:50 +01:00
svlandeg
b2fd9490e3 Merge branch 'copy_master' into copy_v4 2023-01-11 18:40:55 +01:00
Madeesh Kannan
a231bf65af
Pass step=0 to Schedule class to yield initial learning rate (#12078) 2023-01-09 20:15:02 +01:00
Tetsuo Kiso
b510fbd0aa
Delete unused imports for StringStore (#12040) 2023-01-03 17:43:09 +01:00
Sofie Van Landeghem
326b541312
Merge pull request #12049 from svlandeg/copy_v4
Sync v4 with latest from master
2023-01-03 16:43:54 +01:00
svlandeg
6852adc8b7 Merge branch 'copy_master' into copy_v4 2023-01-03 13:34:05 +01:00
Daniël de Kok
20b63943f5
Adjust to new Schedule class and pass scores to Optimizer (#12008)
* Adjust to new `Schedule` class and pass scores to `Optimizer`

Requires https://github.com/explosion/thinc/pull/804

* Bump minimum Thinc requirement to 9.0.0.dev1
2022-12-29 08:03:24 +01:00
Daniël de Kok
d30ba9b7b8
Merge pull request #12015 from danieldk/chore/v4-merge-master-20221222
Merge master into v4
2022-12-22 11:22:33 +01:00
Daniël de Kok
2f08deea2a Fix fallout from a previous merge 2022-12-22 10:23:31 +01:00
Daniël de Kok
207565a788 Merge remote-tracking branch 'upstream/master' into chore/v4-merge-master-20221222 2022-12-22 10:08:54 +01:00
Daniël de Kok
f9308aae13
Fix v4 branch to build against Thinc v9 (#11921)
* Move `thinc.extra.search` to `spacy.pipeline._parser_internals`

Backport of:
https://github.com/explosion/spaCy/pull/11317

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Replace references to `thinc.backends.linalg` with `CBlas`

Backport of:
https://github.com/explosion/spaCy/pull/11292

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Use cross entropy from `thinc.legacy`

* Require thinc>=9.0.0.dev0,<9.1.0

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2022-12-17 14:32:19 +01:00
Edward
ca75190a3d
Custom extensions for spans with equal boundaries (#11429)
* Init

* Fix return type for mypy

* adjust types and improve setting new attributes

* Add underscore changes to json conversion

* Add test and underscore changes to from_docs

* add underscore changes and test to span.to_doc

* update return values

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add types to function

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* adjust formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* shorten return type

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* add helper function to improve readability

* Improve code and add comments

* rerun azure tests

* Fix tests for json conversion

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-12-12 08:55:53 +01:00
Madeesh Kannan
f5aabaf7d6
Remove unused, experimental multi-task components (#11919)
* Remove experimental multi-task components

These are incomplete implementations and are not usable in their current state.

* Remove orphaned error message

* Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)

* Switch ubuntu-latest to ubuntu-20.04 in main tests

* Only use 20.04 for 3.6

* Revert "Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)"

This reverts commit 77c0fd7b17.

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-12-08 13:24:45 +01:00
Paul O'Leary McCann
d60997febb
Remove old model shortcuts (#11916)
* Remove old model shortcuts

* Remove error, docs warnings about shortcuts

* Fix import in util

Accidentally deleted the whole import and not just the old part...

* Change universe example to v3 style

* Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)

* Switch ubuntu-latest to ubuntu-20.04 in main tests

* Only use 20.04 for 3.6

* Update some model loading in Universe

* Add v2 tag to neuralcoref

* Use the spacy-version feature instead of a v2 tag

Co-authored-by: svlandeg <svlandeg@github.com>
2022-12-08 11:45:52 +01:00
Paul O'Leary McCann
6b9af38eeb
Remove all references to "begin_training" (#11943)
When v3 was released, `begin_training` was renamed to `initialize`.
There were warnings in the code and docs about that. This PR removes
them.
2022-12-08 11:43:52 +01:00
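For completeness, the rename referred to above in code form; a minimal sketch.

```python
import spacy

nlp = spacy.blank("en")

# v2: optimizer = nlp.begin_training()
# v3 and later: the same step is spelled initialize()
optimizer = nlp.initialize()
print(type(optimizer).__name__)
```
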
Sofie Van Landeghem
60379cec65
Merge pull request #11929 from svlandeg/copy_v4
sync v4 with latest master
2022-12-07 15:24:07 +01:00
Paul O'Leary McCann
8267aa1b65 Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)
* Switch ubuntu-latest to ubuntu-20.04 in main tests

* Only use 20.04 for 3.6
2022-12-05 09:44:19 +01:00
svlandeg
799d226676 prettier formatting 2022-12-05 08:57:24 +01:00
svlandeg
04fea09ffd Merge branch 'copy_master' into copy_v4 2022-12-05 08:56:15 +01:00
Edward
e79910d57e
Remove sentiment extension (#11722)
* remove sentiment attribute

* remove sentiment from docs

* add test for backwards compatibility

* replace from_disk with from_bytes

* Fix docs and format file

* Fix formatting
2022-11-23 13:09:32 +01:00
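Code that relied on the removed built-in attribute can register an equivalent custom extension; a minimal sketch that reuses the old name.

```python
import spacy
from spacy.tokens import Doc

# Re-create the removed built-in `Doc.sentiment` as a user extension.
Doc.set_extension("sentiment", default=0.0)

nlp = spacy.blank("en")
doc = nlp("What a wonderful day")
doc._.sentiment = 0.9
print(doc._.sentiment)
```
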
Adriane Boyd
d0fc871a1c
Merge pull request #11741 from adrianeboyd/chore/update-v4-from-master-4
Update v4 from master
2022-11-03 12:39:15 +01:00
Adriane Boyd
68b8fa2df2 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master-4 2022-11-03 09:42:36 +01:00
Adriane Boyd
cae4589f5a
Replace EntityRuler with SpanRuler implementation (#11320)
* Replace EntityRuler with SpanRuler implementation

Remove `EntityRuler` and rename the `SpanRuler`-based
`future_entity_ruler` to `entity_ruler`.

Main changes:

* It is no longer possible to load patterns on init as with
`EntityRuler(patterns=)`.
* The older serialization formats (`patterns.jsonl`) are no longer
supported and the related tests are removed.
* The config settings are only stored in the config, not in the
serialized component (in particular the `phrase_matcher_attr` and
overwrite settings).

* Add migration guide to EntityRuler API docs

* docs update

* Minor edit

Co-authored-by: svlandeg <svlandeg@github.com>
2022-10-24 09:11:35 +02:00
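After the replacement described above, the day-to-day workflow under the `entity_ruler` name looks like this; a sketch, noting that patterns can no longer be passed to the constructor.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Patterns are added after construction; EntityRuler(patterns=...) is gone.
ruler.add_patterns([{"label": "ORG", "pattern": "spaCy"}])

doc = nlp("spaCy is written in Cython")
print([(ent.text, ent.label_) for ent in doc.ents])
```
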
Adriane Boyd
a4bd890f32
Merge pull request #11686 from adrianeboyd/chore/update-v4-from-master
Update v4 from master
2022-10-21 12:55:53 +02:00
Paul O'Leary McCann
0e2b7fb28b
Remove thinc util reimports (#11665)
* Remove imports marked as v2 leftovers

There are a few functions that were in `spacy.util` in v2, but were
moved to Thinc. In v3 these were imported in `spacy.util` so that code
could be used unchanged, but the comment over them indicates they should
always be imported from Thinc. This commit removes those imports.

It doesn't look like any DeprecationWarning was ever thrown for using
these, but it is probably fine to remove them anyway with a major
version. It is not clear that they were widely used.

* Import fix_random_seed correctly

This seems to be the only place in spaCy that was using the old import.
2022-10-21 11:01:18 +02:00
Adriane Boyd
103b24fb25 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master 2022-10-21 09:13:32 +02:00
Madeesh Kannan
446a3ecf34
StringStore refactoring (#11344)
* `strings`: Remove unused `hash32_utf8` function

* `strings`: Make `hash_utf8` and `decode_Utf8Str` private

* `strings`: Reorganize private functions

* `strings`: Raise error when non-string/-int types are passed to functions that don't accept them

* `strings`: Add `items()` method, add type hints, remove unused methods, restrict inputs to specific types, reorganize methods

* `Morphology`: Use `StringStore.items()` to enumerate features when pickling

* `test_stringstore`: Update pre-Python 3 tests

* Update `StringStore` docs

* Fix `get_string_id` imports

* Replace redundant test with tests for type checking

* Rename `_retrieve_interned_str`, remove `.get` default arg

* Add `get_string_id` to `strings.pyi`
Remove `mypy` ignore directives from imports of the above

* `strings.pyi`: Replace functions that consume `Union`-typed params with overloads

* `strings.pyi`: Revert some function signatures

* Update `SYMBOLS_BY_INT` lookups and error codes post-merge

* Revert clobbered change introduced in a previous merge

* Remove unnecessary type hint

* Invert tuple order in `StringStore.items()`

* Add test for `StringStore.items()`

* Revert "`Morphology`: Use `StringStore.items()` to enumerate features when pickling"

This reverts commit 1af9510ceb.

* Rename `keys` and `key_map`

* Add `keys()` and `values()`

* Add comment about the inverted key-value semantics in the API

* Fix type hints

* Implement `keys()`, `values()`, `items()` without generators

* Fix type hints, remove unnecessary boxing

* Update docs

* Simplify `keys/values/items()` impl

* `mypy` fix

* Fix error message, doc fixes
2022-10-06 10:51:06 +02:00
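A small sketch of the refreshed `StringStore` API mentioned above; the exact shapes returned by `keys()`/`values()`/`items()` follow the (intentionally inverted) semantics noted in the commit, so treat them as assumptions.

```python
from spacy.strings import StringStore

strings = StringStore(["coffee", "cats"])

h = strings["coffee"]  # 64-bit hash of "coffee"
print(h, strings[h])   # ... and back to the string

# Dict-like accessors added in this refactor:
print(list(strings.items()))
```
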
Sofie Van Landeghem
c6704f368c
Merge pull request #11572 from svlandeg/copy_v4
update v4 with latest from master and develop
2022-10-03 15:30:55 +02:00
svlandeg
d4922f25fc fix test for EL activations with refactored KB 2022-10-03 14:41:15 +02:00
svlandeg
e3027c65b8 Merge branch 'copy_develop' into copy_v4 2022-10-03 14:12:16 +02:00
Sofie Van Landeghem
5157e4e823
disable mypy run for Python 3.10 (#11508) (#11512) 2022-09-15 17:06:58 +02:00
Daniël de Kok
efdbb722c5
Store activations in Docs when save_activations is enabled (#11002)
* Store activations in Doc when `store_activations` is enabled

This change adds the new `activations` attribute to `Doc`. This
attribute can be used by trainable pipes to store their activations,
probabilities, and guesses for downstream users.

As an example, this change modifies the `tagger` and `senter` pipes to
add a `store_activations` option. When this option is enabled, the
probabilities and guesses are stored in `set_annotations`.

* Change type of `store_activations` to `Union[bool, List[str]]`

When the value is:

- A bool: all activations are stored when set to `True`.
- A List[str]: the activations named in the list are stored

* Formatting fixes in Tagger

* Support store_activations in spancat and morphologizer

* Make Doc.activations type visible to MyPy

* textcat/textcat_multilabel: add store_activations option

* trainable_lemmatizer/entity_linker: add store_activations option

* parser/ner: do not currently support returning activations

* Extend tagger and senter tests

So that they, like the other tests, also check that we get no
activations if no activations were requested.

* Document `Doc.activations` and `store_activations` in the relevant pipes

* Start errors/warnings at higher numbers to avoid merge conflicts

Between the master and v4 branches.

* Add `store_activations` to docstrings.

* Replace store_activations setter by set_store_activations method

Setters that take a different type than what the getter returns are still
problematic for MyPy. Replace the setter by a method, so that type inference
works everywhere.

* Use dict comprehension suggested by @svlandeg

* Revert "Use dict comprehension suggested by @svlandeg"

This reverts commit 6e7b958f70.

* EntityLinker: add type annotations to _add_activations

* _store_activations: make kwarg-only, remove doc_scores_lens arg

* set_annotations: add type annotations

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* TextCat.predict: return dict

* Make the `TrainablePipe.store_activations` property a bool

This means that we can also bring back `store_activations` setter.

* Remove `TrainablePipe.activations`

We do not need to enumerate the activations anymore since `store_activations` is
`bool`.

* Add type annotations for activations in predict/set_annotations

* Rename `TrainablePipe.store_activations` to `save_activations`

* Error E1400 is not used anymore

This error was used when activations were still `Union[bool, List[str]]`.

* Change wording in API docs after store -> save change

* docs: tag (save_)activations as new in spaCy 4.0

* Fix copied line in morphologizer activations test

* Don't train in any test_save_activations test

* Rename activations

- "probs" -> "probabilities"
- "guesses" -> "label_ids", except in the edit tree lemmatizer, where
  "guesses" -> "tree_ids".

* Remove unused W400 warning.

This warning was used when we still allowed the user to specify
which activations to save.

* Formatting fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Replace "kb_ids" by a constant

* spancat: replace a cast by an assertion

* Fix EOF spacing

* Fix comments in test_save_activations tests

* Do not set RNG seed in activation saving tests

* Revert "spancat: replace a cast by an assertion"

This reverts commit 0bd5730d16.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-09-13 09:51:12 +02:00
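A hedged sketch of the feature described above; `en_core_web_sm` is only a placeholder trained pipeline, and the activation names follow the commit ("probabilities", "label_ids").

```python
import spacy

nlp = spacy.load("en_core_web_sm")           # placeholder trained pipeline
nlp.get_pipe("tagger").save_activations = True

doc = nlp("This is a test")
# Activations are stored on the Doc, keyed by component name.
print(doc.activations["tagger"].keys())      # e.g. "probabilities", "label_ids"
```
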
Madeesh Kannan
60c050e82b
Merge pull request #11448 from shadeMe/merge-develop-into-v4
Merge `develop` into `v4`
2022-09-07 13:26:11 +02:00
shademe
977b847cce
Merge branch 'develop' into merge-develop-into-v4 2022-09-07 11:35:47 +02:00
Adriane Boyd
4a615cacd2
Consolidate and freeze symbols (#11352)
* Consolidate and freeze symbols

Instead of having symbol values defined in three potentially conflicting
places (`spacy.attrs`, `spacy.parts_of_speech`, `spacy.symbols`), define
all symbols in `spacy.symbols` and reference those values in
`spacy.attrs` and `spacy.parts_of_speech`.

Remove deprecated and placeholder symbols from `spacy.attrs.IDS`.

Make `spacy.attrs.NAMES` and `spacy.symbols.NAMES` reverse dicts rather
than lists in order to support future use of hash values in `attr_id_t`.

Minor changes:

* Use `uint64_t` for attrs in `Doc.to_array` to support future use of
hash values
* Remove unneeded attrs filter for error message in `Doc.to_array`
* Remove unused attr `SENT_END`

* Handle dynamic size of attr_id_t in Doc.to_array

* Undo added warnings

* Refactor to make Doc.to_array more similar to Doc.from_array

* Improve refactoring
2022-09-02 09:08:40 +02:00
Paul O'Leary McCann
698b8b495f
Update/remove old Matcher syntax (#11370)
* Clean up old Matcher call style related stuff

In v2 Matcher.add was called with (key, on_match, *patterns). In v3 this
was changed to (key, patterns, *, on_match=None), but there were various
points where the old call syntax was documented or handled specially.
This removes all those.

The Matcher itself didn't need any code changes, as it just gives a
generic type error. However the PhraseMatcher required some changes
because it would automatically "fix" the old call style.

Surprisingly, the tokenizer was still using the old call style in one
place.

After these changes tests failed in two places:

1. one test for the "new" call style, including the "old" call style. I
   removed this test.
2. deserializing the PhraseMatcher fails because the input docs are a
   set.

I am not sure why 2 is happening - I guess it's a quirk of the
serialization format? - so for now I just convert the set to a list when
deserializing. The check that the input Docs are a List in the
PhraseMatcher is a new check, but makes it parallel with the other
Matchers, which seemed like the right thing to do.

* Add notes related to input docs / deserialization type

* Remove Typing import

* Remove old note about call style change

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Use separate method for setting internal doc representations

In addition to the title change, this changes the internal dict to be a
defaultdict, instead of a dict with frequent use of setdefault.

* Add _add_from_arrays for unpickling

* Cleanup around adding from arrays

This moves adding to internal structures into the private batch method,
and removes the single-add method.

This has one behavioral change for `add`, in that if something is wrong
with the list of input Docs (such as one of the items not being a Doc),
valid items before the invalid one will not be added. Also the callback
will not be updated if anything is invalid. This change should not be
significant.

This also adds a test to check failure when given a non-Doc.

* Update spacy/matcher/phrasematcher.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-30 15:40:31 +02:00
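For contrast with the removed v2 form described above, the v3+ call style; a minimal runnable sketch.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]

# v2 (removed): matcher.add("HELLO", None, pattern)
# v3 and later: a list of patterns, with on_match keyword-only.
matcher.add("HELLO", [pattern], on_match=None)

doc = nlp("hello world")
print(matcher(doc))  # [(match_id, start, end)]
```
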
Adriane Boyd
98a916e01a
Make stable private modules public and adjust names (#11353)
* Make stable private modules public and adjust names

* `spacy.ml._character_embed` -> `spacy.ml.character_embed`
* `spacy.ml._precomputable_affine` -> `spacy.ml.precomputable_affine`
* `spacy.tokens._serialize` -> `spacy.tokens.doc_bin`
* `spacy.tokens._retokenize` -> `spacy.tokens.retokenize`
* `spacy.tokens._dict_proxies` -> `spacy.tokens.span_groups`

* Skip _precomputable_affine

* retokenize -> retokenizer

* Fix imports
2022-08-30 13:56:35 +02:00
Adriane Boyd
4bce8fa755
Remove setup_requires from setup.cfg (#11384)
* Remove setup_requires from setup.cfg

* Update requirements test to ignore cython in setup.cfg
2022-08-29 13:23:24 +02:00
Adriane Boyd
2a558a7cdc
Switch to mecab-ko as default Korean tokenizer (#11294)
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7044.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-08-26 10:11:18 +02:00
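A hedged sketch of keeping the previous tokenizer after the default switch above, using the legacy registered name from the commit; running it requires the optional natto-py/MeCab dependencies.

```python
import spacy

# Default in v4 is the mecab-ko based tokenizer. To keep the previous
# natto-py tokenizer, override the tokenizer entry in the config:
config = {"nlp": {"tokenizer": {"@tokenizers": "spacy.KoreanNattoTokenizer.v1"}}}
nlp = spacy.blank("ko", config=config)
print(nlp.tokenizer.__class__.__name__)
```
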
Adriane Boyd
1eb7ce5ef7
Merge pull request #11377 from adrianeboyd/chore/update-v4-from-develop-2
Update v4 from develop
2022-08-25 08:26:55 +02:00
Adriane Boyd
740c33fe58 Merge remote-tracking branch 'upstream/develop' into chore/update-v4-from-develop 2022-08-24 20:43:07 +02:00
Sofie Van Landeghem
8dd1fa9896
Merge pull request #11366 from adrianeboyd/chore/update-v4-from-master
Update v4 from master
2022-08-24 09:45:55 +02:00
Adriane Boyd
c44d243f25 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master 2022-08-24 07:15:41 +02:00
Adriane Boyd
bb0e178878
Make Span/Doc.ents more consistent for ent_kb_id and ent_id (#11328)
* Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents`
* Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents`
* Make `Span.ent_id` an alias of `Span.id` rather than a read-only view
of the root token's `ent_id` annotation
2022-08-22 20:28:57 +02:00
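A sketch of the mapping described above: the span's `id` is copied to `Token.ent_id` when `Doc.ents` is set, and `Span.ent_id` aliases `Span.id`.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a startup")

doc.ents = [Span(doc, 0, 1, label="ORG", span_id="apple-1")]

print(doc[0].ent_id_)       # "apple-1", mapped from Span.id
print(doc.ents[0].ent_id_)  # alias of Span.id_ under the new behaviour
```
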
Sofie Van Landeghem
1a5be63715
Cleanup Cython structs (#11337)
* cleanup Tokenizer fields

* remove unused object from vocab

* remove IS_OOV_DEPRECATED

* add back in as FLAG13

* FLAG 18 instead

* import fix

* fix clumsy fingers

* revert symbol changes in favor of #11352

* bint instead of bool
2022-08-22 15:52:24 +02:00
Adriane Boyd
d757dec5c4
Remove intify_attrs(_do_deprecated) (#11319) 2022-08-17 12:13:54 +02:00
antonpibm
551e73ccfc
Match private networks as URLs (#11121) 2022-08-11 11:26:26 +02:00
Sofie Van Landeghem
5d54c0e32a
Rename modules for consistency (#11286)
* rename Python module to entity_ruler

* rename Python module to attribute_ruler
2022-08-10 11:44:05 +02:00
Daniël de Kok
e581eeac34
precompute_hiddens/Parser: look up CPU ops once (v4) (#11068)
* precompute_hiddens/Parser: look up CPU ops once

* precompute_hiddens: make cpu_ops private
2022-07-29 15:12:19 +02:00
Daniël de Kok
b2d05f9f66
Merge pull request #11242 from danieldk/merge-master-v4-20220728
Merge `master` into `v4`
2022-07-29 09:17:02 +02:00
Daniël de Kok
1ff683a50b Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220728 2022-07-28 13:53:59 +02:00
Madeesh Kannan
ba18d2913d
Morphology/Morphologizer optimizations and refactoring (#11024)
* `Morphology`: Refactor to use C types, reduce allocations, remove unused code

* `Morphologizer`: Avoid unnecessary sorting of morphological features

* `Morphologizer`: Remove excessive reallocations of labels, improve hash lookups of labels, coerce `numpy` numeric types to native ints
Update docs

* Remove unused method

* Replace `unique_ptr` usage with `shared_ptr`

* Add type annotations to internal Python methods, rename `hash` variable, fix typos

* Add comment to clarify implementation detail

* Fix return type

* `Morphology`: Stop early when splitting fields and values
2022-07-15 11:14:08 +02:00
Daniël de Kok
851a7ca4fa
Merge pull request #11035 from danieldk/merge-master-v4-20220627-2
Merge `master` into `v4`
2022-06-27 19:20:21 +02:00
Daniël de Kok
1605ef7319 Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220627-2 2022-06-27 17:45:45 +02:00
Daniël de Kok
7f3842f54d
Merge pull request #10936 from danieldk/merge-master-v4-20220609
Merge `master` into `v4`
2022-06-09 12:47:13 +02:00
Daniël de Kok
2f05c6824c Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220609 2022-06-09 10:18:25 +02:00
Adriane Boyd
10b7223021
Merge pull request #10816 from danieldk/v4-merge-master-20220518
Merge `master` into `v4`
2022-05-20 09:56:06 +02:00
Daniël de Kok
5586fd9311 Merge remote-tracking branch 'upstream/master' into v4-merge-master-20220518 2022-05-18 11:34:54 +02:00
Adriane Boyd
0e71bd973f
Return doc offsets in Matcher on spans (#10576)
The returned match offsets were only adjusted for `as_spans`, not
generally. Because the `on_match` callbacks are always applied to the
doc, the `Matcher` matches on spans should consistently use the doc
offsets.
2022-04-15 15:34:58 +02:00
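A quick sketch of what the consistent offsets look like in practice (pattern and text are made up):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("the quick brown fox jumps over the lazy dog")
matcher = Matcher(nlp.vocab)
matcher.add("FOX", [[{"LOWER": "brown"}, {"LOWER": "fox"}]])

# Matching on a span: the returned start/end are doc token offsets,
# matching what as_spans and on_match callbacks already used.
for match_id, start, end in matcher(doc[2:6]):
    print(start, end, doc[start:end].text)  # expected: 2 4 brown fox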
Daniël de Kok
75f7c15187
Span/SpanGroup: wrap SpanC in shared_ptr (#9869)
* Span/SpanGroup: wrap SpanC in shared_ptr

When a Span that was retrieved from a SpanGroup was modified, these
changes were not reflected in the SpanGroup because the underlying
SpanC struct was copied.

This change applies the solution proposed by @nrodnova, to wrap SpanC in
a shared_ptr. This makes a SpanGroup and Spans derived from it share the
same SpanC. So, changes made through a Span are visible in the SpanGroup
as well.

Fixes #9556
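A minimal sketch of the behaviour this enables (label and text are placeholders):

import spacy

nlp = spacy.blank("en")
doc = nlp("spans and span groups now share the underlying SpanC")
doc.spans["demo"] = [doc[0:2]]

span = doc.spans["demo"][0]
span.label_ = "EDITED"  # Span attributes are writable in v3+

# The Span and its SpanGroup share the same SpanC, so the edit is visible
# when the group is read again.
assert doc.spans["demo"][0].label_ == "EDITED"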

* Test that a SpanGroup is modified through its Spans

* SpanGroup.push_back: remove nogil

Modifying std::vector is not thread-safe.

* C++ >= 11 does not allow const T in vector<T>

* Add Span.span_c as a shorthand for Span.c.get

Since this method is cdef'ed, it is only visible from Cython, so we
avoid using raw pointers in Python

Replace existing uses of span.c.get() to use this new method.

* Fix formatting

* Style fix: pointer types

* SpanGroup.to_bytes: reduce number of shared_ptr::get calls

* Mark SpanGroup modification test with issue

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-12 13:38:52 +01:00
241 changed files with 7620 additions and 5214 deletions

View File

@ -30,7 +30,7 @@ jobs:
- name: Configure Python version
uses: actions/setup-python@v4
with:
python-version: "3.7"
python-version: "3.8"
architecture: x64
- name: black
@ -50,13 +50,9 @@ jobs:
os: [ubuntu-latest, windows-latest, macos-latest]
python_version: ["3.11"]
include:
- os: ubuntu-20.04
python_version: "3.6"
- os: windows-latest
python_version: "3.7"
- os: macos-latest
python_version: "3.8"
- os: ubuntu-latest
- os: ubuntu-20.04
python_version: "3.9"
- os: windows-latest
python_version: "3.10"
@ -85,7 +81,6 @@ jobs:
- name: Run mypy
run: |
python -m mypy spacy
if: matrix.python_version != '3.6'
- name: Delete source directory and .egg-info
run: |
@ -107,22 +102,22 @@ jobs:
- name: Test import
run: python -W error -c "import spacy"
# - name: "Test download CLI"
# run: |
# python -m spacy download ca_core_news_sm
# python -m spacy download ca_core_news_md
# python -c "import spacy; nlp=spacy.load('ca_core_news_sm'); doc=nlp('test')"
# if: matrix.python_version == '3.9'
#
# - name: "Test download_url in info CLI"
# run: |
# python -W error -m spacy info ca_core_news_sm | grep -q download_url
# if: matrix.python_version == '3.9'
#
# - name: "Test no warnings on load (#11713)"
# run: |
# python -W error -c "import ca_core_news_sm; nlp = ca_core_news_sm.load(); doc=nlp('test')"
# if: matrix.python_version == '3.9'
# - name: "Test download CLI"
# run: |
# python -m spacy download ca_core_news_sm
# python -m spacy download ca_core_news_md
# python -c "import spacy; nlp=spacy.load('ca_core_news_sm'); doc=nlp('test')"
# if: matrix.python_version == '3.9'
#
# - name: "Test download_url in info CLI"
# run: |
# python -W error -m spacy info ca_core_news_sm | grep -q download_url
# if: matrix.python_version == '3.9'
#
# - name: "Test no warnings on load (#11713)"
# run: |
# python -W error -c "import ca_core_news_sm; nlp = ca_core_news_sm.load(); doc=nlp('test')"
# if: matrix.python_version == '3.9'
- name: "Test convert CLI"
run: |
@ -146,17 +141,17 @@ jobs:
python -m spacy train ner.cfg --paths.train ner-token-per-line-conll2003.spacy --paths.dev ner-token-per-line-conll2003.spacy --training.max_steps 10 --gpu-id -1
if: matrix.python_version == '3.9'
# - name: "Test assemble CLI"
# run: |
# python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'ca_core_news_sm'}; config.to_disk('ner_source_sm.cfg')"
# PYTHONWARNINGS="error,ignore::DeprecationWarning" python -m spacy assemble ner_source_sm.cfg output_dir
# if: matrix.python_version == '3.9'
#
# - name: "Test assemble CLI vectors warning"
# run: |
# python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'ca_core_news_md'}; config.to_disk('ner_source_md.cfg')"
# python -m spacy assemble ner_source_md.cfg output_dir 2>&1 | grep -q W113
# if: matrix.python_version == '3.9'
# - name: "Test assemble CLI"
# run: |
# python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'ca_core_news_sm'}; config.to_disk('ner_source_sm.cfg')"
# PYTHONWARNINGS="error,ignore::DeprecationWarning" python -m spacy assemble ner_source_sm.cfg output_dir
# if: matrix.python_version == '3.9'
#
# - name: "Test assemble CLI vectors warning"
# run: |
# python -c "import spacy; config = spacy.util.load_config('ner.cfg'); config['components']['ner'] = {'source': 'ca_core_news_md'}; config.to_disk('ner_source_md.cfg')"
# python -m spacy assemble ner_source_md.cfg output_dir 2>&1 | grep -q W113
# if: matrix.python_version == '3.9'
- name: "Install test requirements"
run: |

View File

@ -25,7 +25,7 @@ jobs:
- name: Configure Python version
uses: actions/setup-python@v4
with:
python-version: "3.7"
python-version: "3.8"
architecture: x64
- name: Validate website/meta/universe.json

View File

@ -3,7 +3,7 @@ repos:
rev: 22.3.0
hooks:
- id: black
language_version: python3.7
language_version: python3.8
additional_dependencies: ['click==8.0.4']
- repo: https://github.com/pycqa/flake8
rev: 5.0.4

View File

@ -276,7 +276,7 @@ except: # noqa: E722
### Python conventions
All Python code must be written **compatible with Python 3.6+**. More detailed
All Python code must be written **compatible with Python 3.8+**. More detailed
code conventions can be found in the [developer docs](https://github.com/explosion/spaCy/blob/master/extra/DEVELOPER_DOCS/Code%20Conventions.md).
#### I/O and handling paths

View File

@ -5,7 +5,7 @@ override SPACY_EXTRAS = spacy-lookups-data==1.0.2 jieba spacy-pkuseg==0.0.28 sud
endif
ifndef PYVER
override PYVER = 3.6
override PYVER = 3.8
endif
VENV := ./env$(PYVER)

View File

@ -109,7 +109,7 @@ For detailed installation instructions, see the
- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual
Studio)
- **Python version**: Python 3.6+ (only 64 bit)
- **Python version**: Python 3.8+ (only 64 bit)
- **Package managers**: [pip] · [conda] (via `conda-forge`)
[pip]: https://pypi.org/project/spacy/

View File

@ -1,6 +1,4 @@
# build version constraints for use with wheelwright + multibuild
numpy==1.15.0; python_version<='3.7' and platform_machine!='aarch64'
numpy==1.19.2; python_version<='3.7' and platform_machine=='aarch64'
numpy==1.17.3; python_version=='3.8' and platform_machine!='aarch64'
numpy==1.19.2; python_version=='3.8' and platform_machine=='aarch64'
numpy==1.19.3; python_version=='3.9'

View File

@ -5,7 +5,7 @@ requires = [
"cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0",
"thinc>=8.1.8,<8.2.0",
"thinc>=9.0.0.dev2,<9.1.0",
"numpy>=1.15.0",
]
build-backend = "setuptools.build_meta"

View File

@ -1,9 +1,9 @@
# Our libraries
spacy-legacy>=3.0.11,<3.1.0
spacy-legacy>=4.0.0.dev0,<4.1.0
spacy-loggers>=1.0.0,<2.0.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.1.8,<8.2.0
thinc>=9.0.0.dev2,<9.1.0
ml_datasets>=0.2.0,<0.3.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.9.1,<1.2.0
@ -22,7 +22,6 @@ langcodes>=3.2.0,<4.0.0
# Official Python utilities
setuptools
packaging>=20.0
typing_extensions>=3.7.4.1,<4.5.0; python_version < "3.8"
# Development dependencies
pre-commit>=2.13.0
cython>=0.25,<3.0
@ -31,8 +30,7 @@ pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0
flake8>=3.8.0,<6.0.0
hypothesis>=3.27.0,<7.0.0
mypy>=0.990,<1.1.0; platform_machine != "aarch64" and python_version >= "3.7"
types-dataclasses>=0.1.3; python_version < "3.7"
mypy>=0.990,<1.1.0; platform_machine != "aarch64"
types-mock>=0.1.1
types-setuptools>=57.0.0
types-requests

View File

@ -17,8 +17,6 @@ classifiers =
Operating System :: Microsoft :: Windows
Programming Language :: Cython
Programming Language :: Python :: 3
Programming Language :: Python :: 3.6
Programming Language :: Python :: 3.7
Programming Language :: Python :: 3.8
Programming Language :: Python :: 3.9
Programming Language :: Python :: 3.10
@ -31,7 +29,7 @@ project_urls =
[options]
zip_safe = false
include_package_data = true
python_requires = >=3.6
python_requires = >=3.8
setup_requires =
cython>=0.25,<3.0
numpy>=1.15.0
@ -39,15 +37,15 @@ setup_requires =
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0
thinc>=8.1.8,<8.2.0
thinc>=9.0.0.dev2,<9.1.0
install_requires =
# Our libraries
spacy-legacy>=3.0.11,<3.1.0
spacy-legacy>=4.0.0.dev0,<4.1.0
spacy-loggers>=1.0.0,<2.0.0
murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
thinc>=8.1.8,<8.2.0
thinc>=9.0.0.dev2,<9.1.0
wasabi>=0.9.1,<1.2.0
srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0
@ -63,7 +61,6 @@ install_requires =
# Official Python utilities
setuptools
packaging>=20.0
typing_extensions>=3.7.4.1,<4.5.0; python_version < "3.8"
langcodes>=3.2.0,<4.0.0
[options.entry_points]
@ -120,7 +117,7 @@ ja =
sudachipy>=0.5.2,!=0.6.1
sudachidict_core>=20211220
ko =
natto-py>=0.9.0
mecab-ko>=1.0.0
th =
pythainlp>=2.0

View File

@ -33,13 +33,10 @@ MOD_NAMES = [
"spacy.kb.candidate",
"spacy.kb.kb",
"spacy.kb.kb_in_memory",
"spacy.ml.parser_model",
"spacy.ml.tb_framework",
"spacy.morphology",
"spacy.pipeline.dep_parser",
"spacy.pipeline._edit_tree_internals.edit_trees",
"spacy.pipeline.morphologizer",
"spacy.pipeline.multitask",
"spacy.pipeline.ner",
"spacy.pipeline.pipe",
"spacy.pipeline.trainable_pipe",
"spacy.pipeline.sentencizer",
@ -47,12 +44,15 @@ MOD_NAMES = [
"spacy.pipeline.tagger",
"spacy.pipeline.transition_parser",
"spacy.pipeline._parser_internals.arc_eager",
"spacy.pipeline._parser_internals.batch",
"spacy.pipeline._parser_internals.ner",
"spacy.pipeline._parser_internals.nonproj",
"spacy.pipeline._parser_internals.search",
"spacy.pipeline._parser_internals._state",
"spacy.pipeline._parser_internals.stateclass",
"spacy.pipeline._parser_internals.transition_system",
"spacy.pipeline._parser_internals._beam_utils",
"spacy.pipeline._parser_internals._parser_utils",
"spacy.tokenizer",
"spacy.training.align",
"spacy.training.gold_io",
@ -62,12 +62,13 @@ MOD_NAMES = [
"spacy.tokens.span_group",
"spacy.tokens.graph",
"spacy.tokens.morphanalysis",
"spacy.tokens._retokenize",
"spacy.tokens.retokenizer",
"spacy.matcher.matcher",
"spacy.matcher.phrasematcher",
"spacy.matcher.dependencymatcher",
"spacy.symbols",
"spacy.vectors",
"spacy.tests.parser._search",
]
COMPILE_OPTIONS = {
"msvc": ["/Ox", "/EHsc"],

View File

@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy"
__version__ = "3.6.0.dev1"
__version__ = "4.0.0.dev1"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects"

View File

@ -1,98 +1,49 @@
# Reserve 64 values for flag features
from . cimport symbols
cdef enum attr_id_t:
NULL_ATTR
IS_ALPHA
IS_ASCII
IS_DIGIT
IS_LOWER
IS_PUNCT
IS_SPACE
IS_TITLE
IS_UPPER
LIKE_URL
LIKE_NUM
LIKE_EMAIL
IS_STOP
IS_OOV_DEPRECATED
IS_BRACKET
IS_QUOTE
IS_LEFT_PUNCT
IS_RIGHT_PUNCT
IS_CURRENCY
NULL_ATTR = 0
IS_ALPHA = symbols.IS_ALPHA
IS_ASCII = symbols.IS_ASCII
IS_DIGIT = symbols.IS_DIGIT
IS_LOWER = symbols.IS_LOWER
IS_PUNCT = symbols.IS_PUNCT
IS_SPACE = symbols.IS_SPACE
IS_TITLE = symbols.IS_TITLE
IS_UPPER = symbols.IS_UPPER
LIKE_URL = symbols.LIKE_URL
LIKE_NUM = symbols.LIKE_NUM
LIKE_EMAIL = symbols.LIKE_EMAIL
IS_STOP = symbols.IS_STOP
IS_BRACKET = symbols.IS_BRACKET
IS_QUOTE = symbols.IS_QUOTE
IS_LEFT_PUNCT = symbols.IS_LEFT_PUNCT
IS_RIGHT_PUNCT = symbols.IS_RIGHT_PUNCT
IS_CURRENCY = symbols.IS_CURRENCY
FLAG19 = 19
FLAG20
FLAG21
FLAG22
FLAG23
FLAG24
FLAG25
FLAG26
FLAG27
FLAG28
FLAG29
FLAG30
FLAG31
FLAG32
FLAG33
FLAG34
FLAG35
FLAG36
FLAG37
FLAG38
FLAG39
FLAG40
FLAG41
FLAG42
FLAG43
FLAG44
FLAG45
FLAG46
FLAG47
FLAG48
FLAG49
FLAG50
FLAG51
FLAG52
FLAG53
FLAG54
FLAG55
FLAG56
FLAG57
FLAG58
FLAG59
FLAG60
FLAG61
FLAG62
FLAG63
ID = symbols.ID
ORTH = symbols.ORTH
LOWER = symbols.LOWER
NORM = symbols.NORM
SHAPE = symbols.SHAPE
PREFIX = symbols.PREFIX
SUFFIX = symbols.SUFFIX
ID
ORTH
LOWER
NORM
SHAPE
PREFIX
SUFFIX
LENGTH = symbols.LENGTH
CLUSTER = symbols.CLUSTER
LEMMA = symbols.LEMMA
POS = symbols.POS
TAG = symbols.TAG
DEP = symbols.DEP
ENT_IOB = symbols.ENT_IOB
ENT_TYPE = symbols.ENT_TYPE
HEAD = symbols.HEAD
SENT_START = symbols.SENT_START
SPACY = symbols.SPACY
PROB = symbols.PROB
LENGTH
CLUSTER
LEMMA
POS
TAG
DEP
ENT_IOB
ENT_TYPE
HEAD
SENT_START
SPACY
PROB
LANG
LANG = symbols.LANG
ENT_KB_ID = symbols.ENT_KB_ID
MORPH
MORPH = symbols.MORPH
ENT_ID = symbols.ENT_ID
IDX
SENT_END
IDX = symbols.IDX

View File

@ -16,57 +16,11 @@ IDS = {
"LIKE_NUM": LIKE_NUM,
"LIKE_EMAIL": LIKE_EMAIL,
"IS_STOP": IS_STOP,
"IS_OOV_DEPRECATED": IS_OOV_DEPRECATED,
"IS_BRACKET": IS_BRACKET,
"IS_QUOTE": IS_QUOTE,
"IS_LEFT_PUNCT": IS_LEFT_PUNCT,
"IS_RIGHT_PUNCT": IS_RIGHT_PUNCT,
"IS_CURRENCY": IS_CURRENCY,
"FLAG19": FLAG19,
"FLAG20": FLAG20,
"FLAG21": FLAG21,
"FLAG22": FLAG22,
"FLAG23": FLAG23,
"FLAG24": FLAG24,
"FLAG25": FLAG25,
"FLAG26": FLAG26,
"FLAG27": FLAG27,
"FLAG28": FLAG28,
"FLAG29": FLAG29,
"FLAG30": FLAG30,
"FLAG31": FLAG31,
"FLAG32": FLAG32,
"FLAG33": FLAG33,
"FLAG34": FLAG34,
"FLAG35": FLAG35,
"FLAG36": FLAG36,
"FLAG37": FLAG37,
"FLAG38": FLAG38,
"FLAG39": FLAG39,
"FLAG40": FLAG40,
"FLAG41": FLAG41,
"FLAG42": FLAG42,
"FLAG43": FLAG43,
"FLAG44": FLAG44,
"FLAG45": FLAG45,
"FLAG46": FLAG46,
"FLAG47": FLAG47,
"FLAG48": FLAG48,
"FLAG49": FLAG49,
"FLAG50": FLAG50,
"FLAG51": FLAG51,
"FLAG52": FLAG52,
"FLAG53": FLAG53,
"FLAG54": FLAG54,
"FLAG55": FLAG55,
"FLAG56": FLAG56,
"FLAG57": FLAG57,
"FLAG58": FLAG58,
"FLAG59": FLAG59,
"FLAG60": FLAG60,
"FLAG61": FLAG61,
"FLAG62": FLAG62,
"FLAG63": FLAG63,
"ID": ID,
"ORTH": ORTH,
"LOWER": LOWER,
@ -92,12 +46,11 @@ IDS = {
}
# ATTR IDs, in order of the symbol
NAMES = [key for key, value in sorted(IDS.items(), key=lambda item: item[1])]
NAMES = {v: k for k, v in IDS.items()}
locals().update(IDS)
def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
def intify_attrs(stringy_attrs, strings_map=None):
"""
Normalize a dictionary of attributes, converting them to ints.
@ -109,75 +62,6 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
converted to ints.
"""
inty_attrs = {}
if _do_deprecated:
if "F" in stringy_attrs:
stringy_attrs["ORTH"] = stringy_attrs.pop("F")
if "L" in stringy_attrs:
stringy_attrs["LEMMA"] = stringy_attrs.pop("L")
if "pos" in stringy_attrs:
stringy_attrs["TAG"] = stringy_attrs.pop("pos")
if "morph" in stringy_attrs:
morphs = stringy_attrs.pop("morph")
if "number" in stringy_attrs:
stringy_attrs.pop("number")
if "tenspect" in stringy_attrs:
stringy_attrs.pop("tenspect")
morph_keys = [
"PunctType",
"PunctSide",
"Other",
"Degree",
"AdvType",
"Number",
"VerbForm",
"PronType",
"Aspect",
"Tense",
"PartType",
"Poss",
"Hyph",
"ConjType",
"NumType",
"Foreign",
"VerbType",
"NounType",
"Gender",
"Mood",
"Negative",
"Tense",
"Voice",
"Abbr",
"Derivation",
"Echo",
"Foreign",
"NameType",
"NounType",
"NumForm",
"NumValue",
"PartType",
"Polite",
"StyleVariant",
"PronType",
"AdjType",
"Person",
"Variant",
"AdpType",
"Reflex",
"Negative",
"Mood",
"Aspect",
"Case",
"Polarity",
"PrepCase",
"Animacy", # U20
]
for key in morph_keys:
if key in stringy_attrs:
stringy_attrs.pop(key)
elif key.lower() in stringy_attrs:
stringy_attrs.pop(key.lower())
elif key.upper() in stringy_attrs:
stringy_attrs.pop(key.upper())
for name, value in stringy_attrs.items():
int_key = intify_attr(name)
if int_key is not None:

View File

@ -1,4 +1,4 @@
from typing import Dict, Any, Union, List, Optional, Tuple, Iterable
from typing import Dict, Any, Union, List, Optional, Tuple, Iterable, Literal
from typing import TYPE_CHECKING, overload
import sys
import shutil
@ -16,10 +16,10 @@ from thinc.util import gpu_is_available
from configparser import InterpolationError
import os
from ..compat import Literal
from ..schemas import ProjectConfigSchema, validate
from ..util import import_file, run_command, make_tempdir, registry, logger
from ..util import is_compatible_version, SimpleFrozenDict, ENV_VARS
from ..errors import RENAMED_LANGUAGE_CODES
from .. import about
if TYPE_CHECKING:
@ -135,6 +135,16 @@ def _parse_override(value: Any) -> Any:
return str(value)
def _handle_renamed_language_codes(lang: Optional[str]) -> None:
# Throw error for renamed language codes in v4
if lang in RENAMED_LANGUAGE_CODES:
msg.fail(
title="Renamed language code",
text=f"Language code '{lang}' was replaced with '{RENAMED_LANGUAGE_CODES[lang]}' in spaCy v4. Update the language code from '{lang}' to '{RENAMED_LANGUAGE_CODES[lang]}'.",
exits=1,
)
def load_project_config(
path: Path, interpolate: bool = True, overrides: Dict[str, Any] = SimpleFrozenDict()
) -> Dict[str, Any]:

View File

@ -7,7 +7,7 @@ import re
import sys
import itertools
from ._util import app, Arg, Opt, walk_directory
from ._util import app, Arg, Opt, _handle_renamed_language_codes, walk_directory
from ..training import docs_to_json
from ..tokens import Doc, DocBin
from ..training.converters import iob_to_docs, conll_ner_to_docs, json_to_docs
@ -112,6 +112,10 @@ def convert(
input_path = Path(input_path)
if not msg:
msg = Printer(no_print=silent)
# Throw error for renamed language codes in v4
_handle_renamed_language_codes(lang)
ner_map = srsly.read_json(ner_map) if ner_map is not None else None
doc_files = []
for input_loc in walk_directory(input_path, converter):

View File

@ -1,5 +1,5 @@
from typing import Any, Dict, Iterable, List, Optional, Sequence, Set, Tuple, Union
from typing import cast, overload
from typing import Literal, cast, overload
from pathlib import Path
from collections import Counter
import sys
@ -22,7 +22,6 @@ from ..pipeline._edit_tree_internals.edit_trees import EditTrees
from ..morphology import Morphology
from ..language import Language
from ..util import registry, resolve_dot_names
from ..compat import Literal
from ..vectors import Mode as VectorsMode
from .. import util

View File

@ -7,8 +7,8 @@ import typer
from ._util import app, Arg, Opt, WHEEL_SUFFIX, SDIST_SUFFIX
from .. import about
from ..util import is_package, get_minor_version, run_command
from ..util import is_prerelease_version
from ..errors import OLD_MODEL_SHORTCUTS
from ..util import is_prerelease_version, get_installed_models
from ..util import get_package_version
@app.command(
@ -61,15 +61,17 @@ def download(
version = components[-1]
else:
model_name = model
if model in OLD_MODEL_SHORTCUTS:
msg.warn(
f"As of spaCy v3.0, shortcuts like '{model}' are deprecated. Please "
f"use the full pipeline package name '{OLD_MODEL_SHORTCUTS[model]}' instead."
)
model_name = OLD_MODEL_SHORTCUTS[model]
compatibility = get_compatibility()
version = get_version(model_name, compatibility)
# If we already have this version installed, skip downloading
installed = get_installed_models()
if model_name in installed:
installed_version = get_package_version(model_name)
if installed_version == version:
msg.warn(f"{model_name} v{version} already installed, skipping")
return
filename = get_model_filename(model_name, version, sdist)
download_model(filename, pip_args)

View File

@ -4,12 +4,12 @@ import json
from pathlib import Path
from wasabi import Printer, MarkdownRenderer
import srsly
import importlib.metadata
from ._util import app, Arg, Opt, string_to_list
from .download import get_model_filename, get_latest_version
from .. import util
from .. import about
from ..compat import importlib_metadata
@app.command("info")
@ -137,7 +137,7 @@ def info_installed_model_url(model: str) -> Optional[str]:
dist-info available.
"""
try:
dist = importlib_metadata.distribution(model)
dist = importlib.metadata.distribution(model)
text = dist.read_text("direct_url.json")
if isinstance(text, str):
data = json.loads(text)

View File

@ -8,11 +8,11 @@ import re
from jinja2 import Template
from .. import util
from ..language import DEFAULT_CONFIG_PRETRAIN_PATH
from ..language import DEFAULT_CONFIG_DISTILL_PATH, DEFAULT_CONFIG_PRETRAIN_PATH
from ..schemas import RecommendationSchema
from ..util import SimpleFrozenList
from ._util import init_cli, Arg, Opt, show_validation_error, COMMAND
from ._util import string_to_list, import_code
from ._util import string_to_list, import_code, _handle_renamed_language_codes
ROOT = Path(__file__).parent / "templates"
@ -43,7 +43,7 @@ class InitValues:
def init_config_cli(
# fmt: off
output_file: Path = Arg(..., help="File to save the config to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
lang: str = Opt(InitValues.lang, "--lang", "-l", help="Two-letter code of the language to use"),
lang: str = Opt(InitValues.lang, "--lang", "-l", help="Code of the language to use"),
pipeline: str = Opt(",".join(InitValues.pipeline), "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include (without 'tok2vec' or 'transformer')"),
optimize: Optimizations = Opt(InitValues.optimize, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
gpu: bool = Opt(InitValues.gpu, "--gpu", "-G", help="Whether the model can run on GPU. This will impact the choice of architecture, pretrained weights and related hyperparameters."),
@ -83,6 +83,7 @@ def init_fill_config_cli(
# fmt: off
base_path: Path = Arg(..., help="Path to base config to fill", exists=True, dir_okay=False),
output_file: Path = Arg("-", help="Path to output .cfg file (or - for stdout)", allow_dash=True),
distillation: bool = Opt(False, "--distillation", "-dt", help="Include config for distillation (with 'spacy distill')"),
pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"),
diff: bool = Opt(False, "--diff", "-D", help="Print a visual diff highlighting the changes"),
code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
@ -98,13 +99,20 @@ def init_fill_config_cli(
DOCS: https://spacy.io/api/cli#init-fill-config
"""
import_code(code_path)
fill_config(output_file, base_path, pretraining=pretraining, diff=diff)
fill_config(
output_file,
base_path,
distillation=distillation,
pretraining=pretraining,
diff=diff,
)
def fill_config(
output_file: Path,
base_path: Path,
*,
distillation: bool = False,
pretraining: bool = False,
diff: bool = False,
silent: bool = False,
@ -123,6 +131,9 @@ def fill_config(
# replaced with their actual config after loading, so we have to re-add them
sourced = util.get_sourced_components(config)
filled["components"].update(sourced)
if distillation:
distillation_config = util.load_config(DEFAULT_CONFIG_DISTILL_PATH)
filled = distillation_config.merge(filled)
if pretraining:
validate_config_for_pretrain(filled, msg)
pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH)
@ -158,6 +169,10 @@ def init_config(
msg = Printer(no_print=silent)
with TEMPLATE_PATH.open("r") as f:
template = Template(f.read())
# Throw error for renamed language codes in v4
_handle_renamed_language_codes(lang)
# Filter out duplicates since tok2vec and transformer are added by template
pipeline = [pipe for pipe in pipeline if pipe not in ("tok2vec", "transformer")]
defaults = RECOMMENDATIONS["__default__"]
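Given the updated `fill_config` entry point above, adding the distillation defaults to an existing config might look like the sketch below (paths are placeholders; the CLI equivalent would be `spacy init fill-config base.cfg filled.cfg --distillation`):

from pathlib import Path
from spacy.cli.init_config import fill_config

# Merge the [distillation] defaults into base.cfg and write the result.
fill_config(
    Path("filled.cfg"),  # output_file
    Path("base.cfg"),    # base_path
    distillation=True,
    pretraining=False,
    diff=False,
)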

View File

@ -9,7 +9,7 @@ from .. import util
from ..training.initialize import init_nlp, convert_vectors
from ..language import Language
from ._util import init_cli, Arg, Opt, parse_config_overrides, show_validation_error
from ._util import import_code, setup_gpu
from ._util import import_code, setup_gpu, _handle_renamed_language_codes
@init_cli.command("vectors")
@ -21,7 +21,6 @@ def init_vectors_cli(
prune: int = Opt(-1, "--prune", "-p", help="Optional number of vectors to prune to"),
truncate: int = Opt(0, "--truncate", "-t", help="Optional number of vectors to truncate to when reading in vectors file"),
mode: str = Opt("default", "--mode", "-m", help="Vectors mode: default or floret"),
name: Optional[str] = Opt(None, "--name", "-n", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"),
verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
jsonl_loc: Optional[Path] = Opt(None, "--lexemes-jsonl", "-j", help="Location of JSONL-formatted attributes file", hidden=True),
# fmt: on
@ -31,6 +30,10 @@ def init_vectors_cli(
a model with vectors.
"""
util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
# Throw error for renamed language codes in v4
_handle_renamed_language_codes(lang)
msg.info(f"Creating blank nlp object for language '{lang}'")
nlp = util.get_lang_class(lang)()
if jsonl_loc is not None:
@ -40,7 +43,6 @@ def init_vectors_cli(
vectors_loc,
truncate=truncate,
prune=prune,
name=name,
mode=mode,
)
msg.good(f"Successfully converted {len(nlp.vocab.vectors)} vectors")

View File

@ -90,12 +90,11 @@ grad_factor = 1.0
factory = "parser"
[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
@architectures = "spacy.TransitionBasedParser.v3"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = false
nO = null
[components.parser.model.tok2vec]
@ -111,12 +110,11 @@ grad_factor = 1.0
factory = "ner"
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
@architectures = "spacy.TransitionBasedParser.v3"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null
[components.ner.model.tok2vec]
@ -385,12 +383,11 @@ width = ${components.tok2vec.model.encode.width}
factory = "parser"
[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
@architectures = "spacy.TransitionBasedParser.v3"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = true
nO = null
[components.parser.model.tok2vec]
@ -403,12 +400,11 @@ width = ${components.tok2vec.model.encode.width}
factory = "ner"
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
@architectures = "spacy.TransitionBasedParser.v3"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[components.ner.model.tok2vec]

View File

@ -22,19 +22,6 @@ try:
except ImportError:
cupy = None
if sys.version_info[:2] >= (3, 8): # Python 3.8+
from typing import Literal, Protocol, runtime_checkable
else:
from typing_extensions import Literal, Protocol, runtime_checkable # noqa: F401
# Important note: The importlib_metadata "backport" includes functionality
# that's not part of the built-in importlib.metadata. We should treat this
# import like the built-in and only use what's available there.
try: # Python 3.8+
import importlib.metadata as importlib_metadata
except ImportError:
from catalogue import _importlib_metadata as importlib_metadata # type: ignore[no-redef] # noqa: F401
from thinc.api import Optimizer # noqa: F401
pickle = pickle

View File

@ -0,0 +1,34 @@
[paths]
raw_text = null
[distillation]
corpus = "corpora.distillation"
dropout = 0.1
max_epochs = 1
max_steps = 0
student_to_teacher = {}
[distillation.batcher]
@batchers = "spacy.batch_by_words.v1"
size = 3000
discard_oversize = false
tolerance = 0.2
[distillation.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 1e-8
learn_rate = 1e-4
[corpora]
[corpora.distillation]
@readers = "spacy.PlainTextCorpus.v1"
path = ${paths.raw_text}
min_length = 0
max_length = 0
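This default `[distillation]` block pairs with the new `Language.distill` method added later in this diff (see the `language.py` changes). A very rough usage sketch, with placeholder pipelines and data, and assuming the student's components have already been set up for distillation:

import spacy
from spacy.training import Example

teacher = spacy.load("en_core_web_sm")  # placeholder trained pipeline
student = spacy.blank("en")             # student pipeline (components omitted here)

texts = ["Distillation only needs raw text."]
# predicted = student doc, reference = teacher doc; tokenization must match.
examples = [Example(student.make_doc(t), teacher.make_doc(t)) for t in texts]

losses = student.distill(teacher, examples, drop=0.1)
print(losses)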

View File

@ -1,5 +1,5 @@
from typing import Literal
import warnings
from .compat import Literal
class ErrorsWithCodes(type):
@ -82,7 +82,7 @@ class Warnings(metaclass=ErrorsWithCodes):
"ignoring the duplicate entry.")
W021 = ("Unexpected hash collision in PhraseMatcher. Matches may be "
"incorrect. Modify PhraseMatcher._terminal_hash to fix.")
W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in "
W024 = ("Entity '{entity}' - alias '{alias}' combination already exists in "
"the Knowledge Base.")
W026 = ("Unable to set all sentence boundaries from dependency parses. If "
"you are constructing a parse tree incrementally by setting "
@ -131,13 +131,6 @@ class Warnings(metaclass=ErrorsWithCodes):
"and make it independent. For example, `replace_listeners = "
"[\"model.tok2vec\"]` See the documentation for details: "
"https://spacy.io/usage/training#config-components-listeners")
W088 = ("The pipeline component {name} implements a `begin_training` "
"method, which won't be called by spaCy. As of v3.0, `begin_training` "
"has been renamed to `initialize`, so you likely want to rename the "
"component method. See the documentation for details: "
"https://spacy.io/api/language#initialize")
W089 = ("As of spaCy v3.0, the `nlp.begin_training` method has been renamed "
"to `nlp.initialize`.")
W090 = ("Could not locate any {format} files in path '{path}'.")
W091 = ("Could not clean/remove the temp directory at {dir}: {msg}.")
W092 = ("Ignoring annotations for sentence starts, as dependency heads are set.")
@ -216,6 +209,12 @@ class Warnings(metaclass=ErrorsWithCodes):
"`enabled` ({enabled}). Be aware that this might affect other components in your pipeline.")
W124 = ("{host}:{port} is already in use, using the nearest available port {serve_port} as an alternative.")
# v4 warning strings
W400 = ("`use_upper=False` is ignored, the upper layer is always enabled")
W401 = ("`incl_prior is True`, but the selected knowledge base type {kb_type} doesn't support prior probability "
"lookups so this setting will be ignored. If your KB does support prior probability lookups, make sure "
"to return `True` in `.supports_prior_probs`.")
class Errors(metaclass=ErrorsWithCodes):
E001 = ("No component '{name}' found in pipeline. Available names: {opts}")
@ -251,9 +250,7 @@ class Errors(metaclass=ErrorsWithCodes):
"https://spacy.io/usage/models")
E011 = ("Unknown operator: '{op}'. Options: {opts}")
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
E016 = ("MultitaskObjective target should be function or one of: dep, "
"tag, ent, dep_tag_offset, ent_tag.")
E017 = ("Can only add unicode or bytes. Got type: {value_type}")
E017 = ("Can only add 'str' inputs to StringStore. Got type: {value_type}")
E018 = ("Can't retrieve string for hash '{hash_value}'. This usually "
"refers to an issue with the `Vocab` or `StringStore`.")
E019 = ("Can't create transition with unknown action ID: {action}. Action "
@ -465,13 +462,13 @@ class Errors(metaclass=ErrorsWithCodes):
"same, but found '{nlp}' and '{vocab}' respectively.")
E152 = ("The attribute {attr} is not supported for token patterns. "
"Please use the option `validate=True` with the Matcher, PhraseMatcher, "
"EntityRuler or AttributeRuler for more details.")
"SpanRuler or AttributeRuler for more details.")
E153 = ("The value type {vtype} is not supported for token patterns. "
"Please use the option validate=True with Matcher, PhraseMatcher, "
"EntityRuler or AttributeRuler for more details.")
"SpanRuler or AttributeRuler for more details.")
E154 = ("One of the attributes or values is not supported for token "
"patterns. Please use the option `validate=True` with the Matcher, "
"PhraseMatcher, or EntityRuler for more details.")
"PhraseMatcher, or SpanRuler for more details.")
E155 = ("The pipeline needs to include a {pipe} in order to use "
"Matcher or PhraseMatcher with the attribute {attr}. "
"Try using `nlp()` instead of `nlp.make_doc()` or `list(nlp.pipe())` "
@ -495,7 +492,7 @@ class Errors(metaclass=ErrorsWithCodes):
"Current DocBin: {current}\nOther DocBin: {other}")
E169 = ("Can't find module: {module}")
E170 = ("Cannot apply transition {name}: invalid for the current state.")
E171 = ("Matcher.add received invalid 'on_match' callback argument: expected "
E171 = ("{name}.add received invalid 'on_match' callback argument: expected "
"callable or None, but got: {arg_type}")
E175 = ("Can't remove rule for unknown match pattern ID: {key}")
E176 = ("Alias '{alias}' is not defined in the Knowledge Base.")
@ -734,13 +731,6 @@ class Errors(metaclass=ErrorsWithCodes):
"method in component '{name}'. If you want to use this "
"method, make sure it's overwritten on the subclass.")
E940 = ("Found NaN values in scores.")
E941 = ("Can't find model '{name}'. It looks like you're trying to load a "
"model from a shortcut, which is obsolete as of spaCy v3.0. To "
"load the model, use its full name instead:\n\n"
"nlp = spacy.load(\"{full}\")\n\nFor more details on the available "
"models, see the models directory: https://spacy.io/models. If you "
"want to create a blank model, use spacy.blank: "
"nlp = spacy.blank(\"{name}\")")
E942 = ("Executing `after_{name}` callback failed. Expected the function to "
"return an initialized nlp object but got: {value}. Maybe "
"you forgot to return the modified object in your function?")
@ -754,7 +744,7 @@ class Errors(metaclass=ErrorsWithCodes):
"loaded nlp object, but got: {source}")
E947 = ("`Matcher.add` received invalid `greedy` argument: expected "
"a string value from {expected} but got: '{arg}'")
E948 = ("`Matcher.add` received invalid 'patterns' argument: expected "
E948 = ("`{name}.add` received invalid 'patterns' argument: expected "
"a list, but got: {arg_type}")
E949 = ("Unable to align tokens for the predicted and reference docs. It "
"is only possible to align the docs when both texts are the same "
@ -928,8 +918,6 @@ class Errors(metaclass=ErrorsWithCodes):
E1021 = ("`pos` value \"{pp}\" is not a valid Universal Dependencies tag. "
"Non-UD tags should use the `tag` property.")
E1022 = ("Words must be of type str or int, but input is of type '{wtype}'")
E1023 = ("Couldn't read EntityRuler from the {path}. This file doesn't "
"exist.")
E1024 = ("A pattern with {attr_type} '{label}' is not present in "
"'{component}' patterns.")
E1025 = ("Cannot intify the value '{value}' as an IOB string. The only "
@ -940,7 +928,7 @@ class Errors(metaclass=ErrorsWithCodes):
E1029 = ("Edit tree cannot be applied to form.")
E1030 = ("Edit tree identifier out of range.")
E1031 = ("Could not find gold transition - see logs above.")
E1032 = ("`{var}` should not be {forbidden}, but received {value}.")
E1032 = ("Span {var} {value} is out of bounds for {obj} with length {length}.")
E1033 = ("Dimension {name} invalid -- only nO, nF, nP")
E1034 = ("Node index {i} out of bounds ({length})")
E1035 = ("Token index {i} out of bounds ({length})")
@ -978,15 +966,22 @@ class Errors(metaclass=ErrorsWithCodes):
E1054 = ("The text, including whitespace, must match between reference and "
"predicted docs when training {component}.")
# v4 error strings
E4000 = ("Expected a Doc as input, but got: '{type}'")
E4001 = ("Expected input to be one of the following types: ({expected_types}), "
"but got '{received_type}'")
E4002 = ("Pipe '{name}' requires a teacher pipe for distillation.")
E4003 = ("Training examples for distillation must have the exact same tokens in the "
"reference and predicted docs.")
E4004 = ("Backprop is not supported when is_train is not set.")
E4005 = ("EntityLinker_v1 is not supported in spaCy v4. Update your configuration.")
E4006 = ("Expected `entity_id` to be of type {exp_type}, but is of type {found_type}.")
E4007 = ("Span {var} {value} must be {op} Span {existing_var} "
"{existing_value}.")
E4008 = ("Span {pos}_char {value} does not correspond to a token {pos}.")
# Deprecated model shortcuts, only used in errors and warnings
OLD_MODEL_SHORTCUTS = {
"en": "en_core_web_sm", "de": "de_core_news_sm", "es": "es_core_news_sm",
"pt": "pt_core_news_sm", "fr": "fr_core_news_sm", "it": "it_core_news_sm",
"nl": "nl_core_news_sm", "el": "el_core_news_sm", "nb": "nb_core_news_sm",
"lt": "lt_core_news_sm", "xx": "xx_ent_wiki_sm"
}
RENAMED_LANGUAGE_CODES = {"xx": "mul", "is": "isl"}
# fmt: on

View File

@ -1,3 +1,5 @@
from .kb import KnowledgeBase
from .kb_in_memory import InMemoryLookupKB
from .candidate import Candidate, get_candidates, get_candidates_batch
from .candidate import Candidate, InMemoryCandidate
__all__ = ["KnowledgeBase", "InMemoryLookupKB", "Candidate", "InMemoryCandidate"]

View File

@ -1,12 +1,15 @@
from .kb cimport KnowledgeBase
from libcpp.vector cimport vector
from .kb_in_memory cimport InMemoryLookupKB
from ..typedefs cimport hash_t
# Object used by the Entity Linker that summarizes one entity-alias candidate combination.
cdef class Candidate:
cdef readonly KnowledgeBase kb
cdef hash_t entity_hash
cdef float entity_freq
cdef vector[float] entity_vector
cdef hash_t alias_hash
cdef float prior_prob
pass
cdef class InMemoryCandidate(Candidate):
cdef readonly hash_t _entity_hash
cdef readonly hash_t _alias_hash
cpdef vector[float] _entity_vector
cdef float _prior_prob
cdef readonly InMemoryLookupKB _kb
cdef float _entity_freq

View File

@ -1,74 +1,96 @@
# cython: infer_types=True, profile=True
from typing import Iterable
from .kb cimport KnowledgeBase
from ..tokens import Span
from .kb_in_memory cimport InMemoryLookupKB
from ..errors import Errors
cdef class Candidate:
"""A `Candidate` object refers to a textual mention (`alias`) that may or may not be resolved
to a specific `entity` from a Knowledge Base. This will be used as input for the entity linking
"""A `Candidate` object refers to a textual mention that may or may not be resolved
to a specific entity from a Knowledge Base. This will be used as input for the entity linking
algorithm which will disambiguate the various candidates to the correct one.
Each candidate (alias, entity) pair is assigned a certain prior probability.
Each candidate, which represents a possible link between one textual mention and one entity in the knowledge base,
is assigned a certain prior probability.
DOCS: https://spacy.io/api/kb/#candidate-init
"""
def __init__(self, KnowledgeBase kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob):
self.kb = kb
self.entity_hash = entity_hash
self.entity_freq = entity_freq
self.entity_vector = entity_vector
self.alias_hash = alias_hash
self.prior_prob = prior_prob
def __init__(self):
# Make sure abstract Candidate is not instantiated.
if self.__class__ == Candidate:
raise TypeError(
Errors.E1046.format(cls_name=self.__class__.__name__)
)
@property
def entity(self) -> int:
"""RETURNS (uint64): hash of the entity's KB ID/name"""
return self.entity_hash
def entity_id(self) -> int:
"""RETURNS (int): Numerical representation of entity ID (if entity ID is numerical, this is just the entity ID,
otherwise the hash of the entity ID string)."""
raise NotImplementedError
@property
def entity_(self) -> str:
"""RETURNS (str): ID/name of this entity in the KB"""
return self.kb.vocab.strings[self.entity_hash]
def entity_id_(self) -> str:
"""RETURNS (str): String representation of entity ID."""
raise NotImplementedError
@property
def alias(self) -> int:
"""RETURNS (uint64): hash of the alias"""
return self.alias_hash
def entity_vector(self) -> vector[float]:
"""RETURNS (vector[float]): Entity vector."""
raise NotImplementedError
cdef class InMemoryCandidate(Candidate):
"""Candidate for InMemoryLookupKB."""
def __init__(
self,
kb: InMemoryLookupKB,
entity_hash: int,
alias_hash: int,
entity_vector: vector[float],
prior_prob: float,
entity_freq: float
):
"""
kb (InMemoryLookupKB): InMemoryLookupKB instance.
entity_id (int): Entity ID as hash that can be looked up with InMemoryKB.vocab.strings.__getitem__().
entity_freq (int): Entity frequency in KB corpus.
entity_vector (List[float]): Entity embedding.
alias_hash (int): Alias hash.
prior_prob (float): Prior probability of the entity for this alias, i.e. the probability that, independent of
the context, this alias - which matches one of this entity's aliases - resolves to this entity.
"""
super().__init__()
self._entity_hash = entity_hash
self._entity_vector = entity_vector
self._prior_prob = prior_prob
self._kb = kb
self._alias_hash = alias_hash
self._entity_freq = entity_freq
@property
def alias_(self) -> str:
"""RETURNS (str): ID of the original alias"""
return self.kb.vocab.strings[self.alias_hash]
def entity_id(self) -> int:
return self._entity_hash
@property
def entity_freq(self) -> float:
return self.entity_freq
@property
def entity_vector(self) -> Iterable[float]:
return self.entity_vector
def entity_vector(self) -> vector[float]:
return self._entity_vector
@property
def prior_prob(self) -> float:
return self.prior_prob
"""RETURNS (float): Prior probability that this alias, which matches one of this entity's synonyms, resolves to
this entity."""
return self._prior_prob
@property
def alias(self) -> str:
"""RETURNS (str): Alias."""
return self._kb.vocab.strings[self._alias_hash]
def get_candidates(kb: KnowledgeBase, mention: Span) -> Iterable[Candidate]:
"""
Return candidate entities for a given mention and fetching appropriate entries from the index.
kb (KnowledgeBase): Knowledge base to query.
mention (Span): Entity mention for which to identify candidates.
RETURNS (Iterable[Candidate]): Identified candidates.
"""
return kb.get_candidates(mention)
@property
def entity_id_(self) -> str:
return self._kb.vocab.strings[self._entity_hash]
def get_candidates_batch(kb: KnowledgeBase, mentions: Iterable[Span]) -> Iterable[Iterable[Candidate]]:
"""
Return candidate entities for the given mentions and fetching appropriate entries from the index.
kb (KnowledgeBase): Knowledge base to query.
mention (Iterable[Span]): Entity mentions for which to identify candidates.
RETURNS (Iterable[Iterable[Candidate]]): Identified candidates.
"""
return kb.get_candidates_batch(mentions)
@property
def entity_freq(self) -> float:
"""RETURNS (float): Entity frequency in KB corpus."""
return self._entity_freq
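A short sketch of how the renamed accessors read from user code (entity, vector and probability values are made up):

import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.blank("en")
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q62", freq=12, entity_vector=[1.0, 2.0, 3.0])
kb.add_alias(alias="San Francisco", entities=["Q62"], probabilities=[0.8])

doc = nlp("San Francisco is foggy")
for cand in kb.get_candidates(doc[0:2]):
    # InMemoryCandidate exposes entity_id/entity_id_, alias, prior_prob, entity_freq.
    print(cand.entity_id_, cand.alias, cand.prior_prob, cand.entity_freq)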

View File

@ -5,7 +5,7 @@ from typing import Iterable, Tuple, Union
from cymem.cymem cimport Pool
from .candidate import Candidate
from ..tokens import Span
from ..tokens import Span, SpanGroup
from ..util import SimpleFrozenList
from ..errors import Errors
@ -30,21 +30,23 @@ cdef class KnowledgeBase:
self.entity_vector_length = entity_vector_length
self.mem = Pool()
def get_candidates_batch(self, mentions: Iterable[Span]) -> Iterable[Iterable[Candidate]]:
def get_candidates_batch(self, mentions: SpanGroup) -> Iterable[Iterable[Candidate]]:
"""
Return candidate entities for specified texts. Each candidate defines the entity, the original alias,
and the prior probability of that alias resolving to that entity.
If no candidate is found for a given text, an empty list is returned.
mentions (Iterable[Span]): Mentions for which to get candidates.
Return candidate entities for a specified Span mention. Each candidate defines at least the entity and the
entity's embedding vector. Depending on the KB implementation, further properties - such as the prior
probability of the specified mention text resolving to that entity - might be included.
If no candidates are found for a given mention, an empty list is returned.
mentions (SpanGroup): Mentions for which to get candidates.
RETURNS (Iterable[Iterable[Candidate]]): Identified candidates.
"""
return [self.get_candidates(span) for span in mentions]
def get_candidates(self, mention: Span) -> Iterable[Candidate]:
"""
Return candidate entities for specified text. Each candidate defines the entity, the original alias,
and the prior probability of that alias resolving to that entity.
If the no candidate is found for a given text, an empty list is returned.
Return candidate entities for a specific mention. Each candidate defines at least the entity and the
entity's embedding vector. Depending on the KB implementation, further properties - such as the prior
probability of the specified mention text resolving to that entity - might be included.
If no candidate is found for the given mention, an empty list is returned.
mention (Span): Mention for which to get candidates.
RETURNS (Iterable[Candidate]): Identified candidates.
"""
@ -106,3 +108,10 @@ cdef class KnowledgeBase:
raise NotImplementedError(
Errors.E1045.format(parent="KnowledgeBase", method="from_disk", name=self.__name__)
)
@property
def supports_prior_probs(self) -> bool:
"""RETURNS (bool): Whether this KB type supports looking up prior probabilities for entity mentions."""
raise NotImplementedError(
Errors.E1045.format(parent="KnowledgeBase", method="supports_prior_probs", name=self.__name__)
)
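For custom KB implementations, the extended base class above implies roughly the following shape (a placeholder sketch, not a working knowledge base):

from typing import Iterable
from spacy.kb import Candidate, KnowledgeBase
from spacy.tokens import Span, SpanGroup

class PlaceholderKB(KnowledgeBase):
    # A real implementation would query some external entity index here.
    def get_candidates(self, mention: Span) -> Iterable[Candidate]:
        return []

    def get_candidates_batch(self, mentions: SpanGroup) -> Iterable[Iterable[Candidate]]:
        return [self.get_candidates(span) for span in mentions]

    @property
    def supports_prior_probs(self) -> bool:
        # No stored alias/entity priors, so `incl_prior` would be ignored (see W401).
        return False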

View File

@ -18,7 +18,7 @@ from .. import util
from ..util import SimpleFrozenList, ensure_path
from ..vocab cimport Vocab
from .kb cimport KnowledgeBase
from .candidate import Candidate as Candidate
from .candidate import InMemoryCandidate
cdef class InMemoryLookupKB(KnowledgeBase):
@ -226,10 +226,10 @@ cdef class InMemoryLookupKB(KnowledgeBase):
alias_entry.probs = probs
self._aliases_table[alias_index] = alias_entry
def get_candidates(self, mention: Span) -> Iterable[Candidate]:
return self.get_alias_candidates(mention.text) # type: ignore
def get_candidates(self, mention: Span) -> Iterable[InMemoryCandidate]:
return self._get_alias_candidates(mention.text) # type: ignore
def get_alias_candidates(self, str alias) -> Iterable[Candidate]:
def _get_alias_candidates(self, str alias) -> Iterable[InMemoryCandidate]:
"""
Return candidate entities for an alias. Each candidate defines the entity, the original alias,
and the prior probability of that alias resolving to that entity.
@ -241,14 +241,18 @@ cdef class InMemoryLookupKB(KnowledgeBase):
alias_index = <int64_t>self._alias_index.get(alias_hash)
alias_entry = self._aliases_table[alias_index]
return [Candidate(kb=self,
entity_hash=self._entries[entry_index].entity_hash,
entity_freq=self._entries[entry_index].freq,
entity_vector=self._vectors_table[self._entries[entry_index].vector_index],
alias_hash=alias_hash,
prior_prob=prior_prob)
for (entry_index, prior_prob) in zip(alias_entry.entry_indices, alias_entry.probs)
if entry_index != 0]
return [
InMemoryCandidate(
kb=self,
entity_hash=self._entries[entry_index].entity_hash,
alias_hash=alias_hash,
entity_vector=self._vectors_table[self._entries[entry_index].vector_index],
prior_prob=prior_prob,
entity_freq=self._entries[entry_index].freq
)
for (entry_index, prior_prob) in zip(alias_entry.entry_indices, alias_entry.probs)
if entry_index != 0
]
def get_vector(self, str entity):
cdef hash_t entity_hash = self.vocab.strings[entity]
@ -279,6 +283,9 @@ cdef class InMemoryLookupKB(KnowledgeBase):
return 0.0
def supports_prior_probs(self) -> bool:
return True
def to_bytes(self, **kwargs):
"""Serialize the current state to a binary string.
"""

View File

@ -7,7 +7,7 @@ class IcelandicDefaults(BaseDefaults):
class Icelandic(Language):
lang = "is"
lang = "isl"
Defaults = IcelandicDefaults

View File

@ -18,34 +18,23 @@ DEFAULT_CONFIG = """
[nlp.tokenizer]
@tokenizers = "spacy.ko.KoreanTokenizer"
mecab_args = ""
"""
@registry.tokenizers("spacy.ko.KoreanTokenizer")
def create_tokenizer():
def create_tokenizer(mecab_args: str):
def korean_tokenizer_factory(nlp):
return KoreanTokenizer(nlp.vocab)
return KoreanTokenizer(nlp.vocab, mecab_args=mecab_args)
return korean_tokenizer_factory
class KoreanTokenizer(DummyTokenizer):
def __init__(self, vocab: Vocab):
def __init__(self, vocab: Vocab, *, mecab_args: str = ""):
self.vocab = vocab
self._mecab = try_mecab_import() # type: ignore[func-returns-value]
self._mecab_tokenizer = None
@property
def mecab_tokenizer(self):
# This is a property so that initializing a pipeline with blank:ko is
# possible without actually requiring mecab-ko, e.g. to run
# `spacy init vectors ko` for a pipeline that will have a different
# tokenizer in the end. The languages need to match for the vectors
# to be imported and there's no way to pass a custom config to
# `init vectors`.
if self._mecab_tokenizer is None:
self._mecab_tokenizer = self._mecab("-F%f[0],%f[7]")
return self._mecab_tokenizer
mecab = try_mecab_import()
self.mecab_tokenizer = mecab.Tagger(mecab_args)
def __reduce__(self):
return KoreanTokenizer, (self.vocab,)
@ -68,13 +57,15 @@ class KoreanTokenizer(DummyTokenizer):
def detailed_tokens(self, text: str) -> Iterator[Dict[str, Any]]:
# 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3],
# 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], *
for node in self.mecab_tokenizer.parse(text, as_nodes=True):
if node.is_eos():
for line in self.mecab_tokenizer.parse(text).split("\n"):
if line == "EOS":
break
surface = node.surface
feature = node.feature
tag, _, expr = feature.partition(",")
lemma, _, remainder = expr.partition("/")
surface, _, expr = line.partition("\t")
features = expr.split("/")[0].split(",")
tag = features[0]
lemma = "*"
if len(features) >= 8:
lemma = features[7]
if lemma == "*":
lemma = surface
yield {"surface": surface, "lemma": lemma, "tag": tag}
@ -97,20 +88,94 @@ class Korean(Language):
Defaults = KoreanDefaults
def try_mecab_import() -> None:
def try_mecab_import():
try:
from natto import MeCab
import mecab_ko as MeCab
return MeCab
except ImportError:
raise ImportError(
'The Korean tokenizer ("spacy.ko.KoreanTokenizer") requires '
"[mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
"[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), "
"and [natto-py](https://github.com/buruzaemon/natto-py)"
"the python package `mecab-ko`: pip install mecab-ko"
) from None
@registry.tokenizers("spacy.KoreanNattoTokenizer.v1")
def create_natto_tokenizer():
def korean_natto_tokenizer_factory(nlp):
return KoreanNattoTokenizer(nlp.vocab)
return korean_natto_tokenizer_factory
class KoreanNattoTokenizer(DummyTokenizer):
def __init__(self, vocab: Vocab):
self.vocab = vocab
self._mecab = self._try_mecab_import() # type: ignore[func-returns-value]
self._mecab_tokenizer = None
@property
def mecab_tokenizer(self):
# This is a property so that initializing a pipeline with blank:ko is
# possible without actually requiring mecab-ko, e.g. to run
# `spacy init vectors ko` for a pipeline that will have a different
# tokenizer in the end. The languages need to match for the vectors
# to be imported and there's no way to pass a custom config to
# `init vectors`.
if self._mecab_tokenizer is None:
self._mecab_tokenizer = self._mecab("-F%f[0],%f[7]")
return self._mecab_tokenizer
def __reduce__(self):
return KoreanNattoTokenizer, (self.vocab,)
def __call__(self, text: str) -> Doc:
dtokens = list(self.detailed_tokens(text))
surfaces = [dt["surface"] for dt in dtokens]
doc = Doc(self.vocab, words=surfaces, spaces=list(check_spaces(text, surfaces)))
for token, dtoken in zip(doc, dtokens):
first_tag, sep, eomi_tags = dtoken["tag"].partition("+")
token.tag_ = first_tag # stem(어간) or pre-final(선어말 어미)
if token.tag_ in TAG_MAP:
token.pos = TAG_MAP[token.tag_][POS]
else:
token.pos = X
token.lemma_ = dtoken["lemma"]
doc.user_data["full_tags"] = [dt["tag"] for dt in dtokens]
return doc
def detailed_tokens(self, text: str) -> Iterator[Dict[str, Any]]:
# 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3],
# 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], *
for node in self.mecab_tokenizer.parse(text, as_nodes=True):
if node.is_eos():
break
surface = node.surface
feature = node.feature
tag, _, expr = feature.partition(",")
lemma, _, remainder = expr.partition("/")
if lemma == "*" or lemma == "":
lemma = surface
yield {"surface": surface, "lemma": lemma, "tag": tag}
def score(self, examples):
validate_examples(examples, "KoreanTokenizer.score")
return Scorer.score_tokenization(examples)
def _try_mecab_import(self):
try:
from natto import MeCab
return MeCab
except ImportError:
raise ImportError(
'The Korean Natto tokenizer ("spacy.ko.KoreanNattoTokenizer") requires '
"[mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
"[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), "
"and [natto-py](https://github.com/buruzaemon/natto-py)"
) from None
def check_spaces(text, tokens):
prev_end = -1
start = 0

View File

@ -3,10 +3,10 @@ from ...language import Language
class MultiLanguage(Language):
"""Language class to be used for models that support multiple languages.
This module allows models to specify their language ID as 'xx'.
This module allows models to specify their language ID as 'mul'.
"""
lang = "xx"
lang = "mul"
__all__ = ["MultiLanguage"]

View File

@ -17,10 +17,6 @@ URL_PATTERN = (
r"(?:\S+(?::\S*)?@)?"
r"(?:"
# IP address exclusion
# private & local networks
r"(?!(?:10|127)(?:\.\d{1,3}){3})"
r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
# IP address dotted notation octets
# excludes loopback network 0.0.0.0
# excludes reserved space >= 224.0.0.0

View File

@ -1,4 +1,4 @@
from typing import Iterator, Optional, Any, Dict, Callable, Iterable
from typing import Iterator, Optional, Any, Dict, Callable, Iterable, Literal
from typing import Union, Tuple, List, Set, Pattern, Sequence
from typing import NoReturn, TypeVar, cast, overload
@ -22,7 +22,7 @@ from . import ty
from .tokens.underscore import Underscore
from .vocab import Vocab, create_vocab
from .pipe_analysis import validate_attrs, analyze_pipes, print_pipe_analysis
from .training import Example, validate_examples
from .training import Example, validate_examples, validate_distillation_examples
from .training.initialize import init_vocab, init_tok2vec
from .scorer import Scorer
from .util import registry, SimpleFrozenList, _pipe, raise_error, _DEFAULT_EMPTY_PIPES
@ -40,7 +40,6 @@ from .git_info import GIT_VERSION
from . import util
from . import about
from .lookups import load_lookups
from .compat import Literal
PipeCallable = Callable[[Doc], Doc]
@ -49,6 +48,9 @@ PipeCallable = Callable[[Doc], Doc]
# This is the base config will all settings (training etc.)
DEFAULT_CONFIG_PATH = Path(__file__).parent / "default_config.cfg"
DEFAULT_CONFIG = util.load_config(DEFAULT_CONFIG_PATH)
# This is the base config for the [distillation] block and currently not included
# in the main config and only added via the 'init fill-config' command
DEFAULT_CONFIG_DISTILL_PATH = Path(__file__).parent / "default_config_distillation.cfg"
# This is the base config for the [pretraining] block and currently not included
# in the main config and only added via the 'init fill-config' command
DEFAULT_CONFIG_PRETRAIN_PATH = Path(__file__).parent / "default_config_pretraining.cfg"
@ -172,8 +174,7 @@ class Language:
if not isinstance(vocab, Vocab) and vocab is not True:
raise ValueError(Errors.E918.format(vocab=vocab, vocab_type=type(Vocab)))
if vocab is True:
vectors_name = meta.get("vectors", {}).get("name")
vocab = create_vocab(self.lang, self.Defaults, vectors_name=vectors_name)
vocab = create_vocab(self.lang, self.Defaults)
else:
if (self.lang and vocab.lang) and (self.lang != vocab.lang):
raise ValueError(Errors.E150.format(nlp=self.lang, vocab=vocab.lang))
@ -227,7 +228,6 @@ class Language:
"width": self.vocab.vectors_length,
"vectors": len(self.vocab.vectors),
"keys": self.vocab.vectors.n_keys,
"name": self.vocab.vectors.name,
"mode": self.vocab.vectors.mode,
}
self._meta["labels"] = dict(self.pipe_labels)
@ -1018,6 +1018,116 @@ class Language:
raise ValueError(Errors.E005.format(name=name, returned_type=type(doc)))
return doc
def distill(
self,
teacher: "Language",
examples: Iterable[Example],
*,
drop: float = 0.0,
sgd: Union[Optimizer, None, Literal[False]] = None,
losses: Optional[Dict[str, float]] = None,
component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
exclude: Iterable[str] = SimpleFrozenList(),
annotates: Iterable[str] = SimpleFrozenList(),
student_to_teacher: Optional[Dict[str, str]] = None,
):
"""Distill the models in a student pipeline from a teacher pipeline.
teacher (Language): Teacher to distill from.
examples (Iterable[Example]): Distillation examples. The reference
(teacher) and predicted (student) docs must have the same number of
tokens and the same orthography.
drop (float): The dropout rate.
sgd (Union[Optimizer, None, Literal[False]]): An optimizer. Will
be created via create_optimizer if 'None'. No optimizer will
be used when set to 'False'.
losses (Optional[Dict[str, float]]): Dictionary to update with the loss,
keyed by component.
component_cfg (Optional[Dict[str, Dict[str, Any]]]): Config parameters
for specific pipeline components, keyed by component name.
exclude (Iterable[str]): Names of components that shouldn't be updated.
annotates (Iterable[str]): Names of components that should set
annotations on the predicted examples after updating.
student_to_teacher (Optional[Dict[str, str]]): Map student pipe name to
teacher pipe name, only needed for pipes where the student pipe
name does not match the teacher pipe name.
RETURNS (Dict[str, float]): The updated losses dictionary
DOCS: https://spacy.io/api/language#distill
"""
if student_to_teacher is None:
student_to_teacher = {}
if losses is None:
losses = {}
if isinstance(examples, list) and len(examples) == 0:
return losses
validate_distillation_examples(examples, "Language.distill")
examples = _copy_examples(examples, copy_x=True, copy_y=True)
if sgd is None:
if self._optimizer is None:
self._optimizer = self.create_optimizer()
sgd = self._optimizer
if component_cfg is None:
component_cfg = {}
pipe_kwargs = {}
for student_name, student_proc in self.pipeline:
component_cfg.setdefault(student_name, {})
pipe_kwargs[student_name] = deepcopy(component_cfg[student_name])
component_cfg[student_name].setdefault("drop", drop)
pipe_kwargs[student_name].setdefault("batch_size", self.batch_size)
teacher_pipes = dict(teacher.pipeline)
for student_name, student_proc in self.pipeline:
if student_name in annotates:
for doc, eg in zip(
_pipe(
(eg.predicted for eg in examples),
proc=student_proc,
name=student_name,
default_error_handler=self.default_error_handler,
kwargs=pipe_kwargs[student_name],
),
examples,
):
eg.predicted = doc
if (
student_name not in exclude
and isinstance(student_proc, ty.DistillableComponent)
and student_proc.is_distillable
):
# A missing teacher pipe is not an error; some student pipes
# do not need a teacher, such as tok2vec layer losses.
teacher_name = (
student_to_teacher[student_name]
if student_name in student_to_teacher
else student_name
)
teacher_pipe = teacher_pipes.get(teacher_name, None)
student_proc.distill(
teacher_pipe,
examples,
sgd=None,
losses=losses,
**component_cfg[student_name],
)
# Only finish the update after all component updates are done. Some
# components may share weights (such as tok2vec) and we only want
# to apply weight updates after all gradients are accumulated.
for student_name, student_proc in self.pipeline:
if (
student_name not in exclude
and isinstance(student_proc, ty.DistillableComponent)
and student_proc.is_distillable
and sgd not in (None, False)
):
student_proc.finish_update(sgd)
return losses
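A minimal usage sketch of the new Language.distill API, assuming a trained teacher pipeline is available and the student's distillable components have already been added and initialized (the model name and texts below are illustrative):

import spacy
from spacy.training import Example

teacher = spacy.load("en_core_web_lg")          # assumed trained teacher pipeline
student = spacy.blank("en")
# ... add and initialize distillable student components here ...
texts = ["Distillation copies behaviour from a larger model."]
examples = [Example(student.make_doc(t), teacher.make_doc(t)) for t in texts]
losses = student.distill(teacher, examples, drop=0.1)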
def disable_pipes(self, *names) -> "DisabledPipes":
"""Disable one or more pipeline components. If used as a context
manager, the pipeline will be restored to the initial state at the end
@ -1106,7 +1216,7 @@ class Language:
_: Optional[Any] = None,
*,
drop: float = 0.0,
sgd: Optional[Optimizer] = None,
sgd: Union[Optimizer, None, Literal[False]] = None,
losses: Optional[Dict[str, float]] = None,
component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
exclude: Iterable[str] = SimpleFrozenList(),
@ -1117,7 +1227,9 @@ class Language:
examples (Iterable[Example]): A batch of examples
_: Should not be set - serves to catch backwards-incompatible scripts.
drop (float): The dropout rate.
sgd (Optimizer): An optimizer.
sgd (Union[Optimizer, None, Literal[False]]): An optimizer. Will
be created via create_optimizer if 'None'. No optimizer will
be used when set to 'False'.
losses (Dict[str, float]): Dictionary to update with the loss, keyed by
component.
component_cfg (Dict[str, Dict]): Config parameters for specific pipeline
@ -1150,17 +1262,12 @@ class Language:
component_cfg[name].setdefault("drop", drop)
pipe_kwargs[name].setdefault("batch_size", self.batch_size)
for name, proc in self.pipeline:
# ignore statements are used here because mypy ignores hasattr
if name not in exclude and hasattr(proc, "update"):
proc.update(examples, sgd=None, losses=losses, **component_cfg[name]) # type: ignore
if sgd not in (None, False):
if (
name not in exclude
and isinstance(proc, ty.TrainableComponent)
and proc.is_trainable
and proc.model not in (True, False, None)
):
proc.finish_update(sgd)
if (
name not in exclude
and isinstance(proc, ty.TrainableComponent)
and proc.is_trainable
):
proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
if name in annotates:
for doc, eg in zip(
_pipe(
@ -1173,6 +1280,18 @@ class Language:
examples,
):
eg.predicted = doc
# Only finish the update after all component updates are done. Some
# components may share weights (such as tok2vec) and we only want
# to apply weight updates after all gradients are accumulated.
for name, proc in self.pipeline:
if (
name not in exclude
and isinstance(proc, ty.TrainableComponent)
and proc.is_trainable
and sgd not in (None, False)
):
proc.finish_update(sgd)
return losses
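A minimal sketch of what the documented sgd=False option enables, assuming an existing `nlp` pipeline, an `optimizer`, and a list of `subbatches`: gradients are accumulated across several calls to update, and the optimizer is applied once at the end.

losses = {}
for subbatch in subbatches:
    # sgd=False: accumulate gradients, do not call finish_update per subbatch
    nlp.update(subbatch, sgd=False, losses=losses)
for name, proc in nlp.pipeline:
    # apply the accumulated weight updates once, mirroring the loop above
    if getattr(proc, "is_trainable", False):
        proc.finish_update(optimizer)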
def rehearse(
@ -1239,25 +1358,20 @@ class Language:
sgd(key, W, dW) # type: ignore[call-arg, misc]
return losses
def begin_training(
self,
get_examples: Optional[Callable[[], Iterable[Example]]] = None,
*,
sgd: Optional[Optimizer] = None,
) -> Optimizer:
warnings.warn(Warnings.W089, DeprecationWarning)
return self.initialize(get_examples, sgd=sgd)
def initialize(
self,
get_examples: Optional[Callable[[], Iterable[Example]]] = None,
*,
labels: Optional[Dict[str, Any]] = None,
sgd: Optional[Optimizer] = None,
) -> Optimizer:
"""Initialize the pipe for training, using data examples if available.
get_examples (Callable[[], Iterable[Example]]): Optional function that
returns gold-standard Example objects.
labels (Optional[Dict[str, Any]]): Labels to pass to pipe initialization,
using the names of the pipes as keys. Overrides labels that are in
the model configuration.
sgd (Optional[Optimizer]): An optimizer to use for updates. If not
provided, will be created using the .create_optimizer() method.
RETURNS (thinc.api.Optimizer): The optimizer.
@ -1305,6 +1419,8 @@ class Language:
for name, proc in self.pipeline:
if isinstance(proc, ty.InitializableComponent):
p_settings = I["components"].get(name, {})
if labels is not None and name in labels:
p_settings["labels"] = labels[name]
p_settings = validate_init_settings(
proc.initialize, p_settings, section="components", name=name
)
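A sketch of the new labels override in Language.initialize, assuming `nlp` contains a tagger and `examples` is a list of Example objects (the label values are illustrative):

optimizer = nlp.initialize(
    lambda: examples,
    labels={"tagger": ["NOUN", "VERB", "ADJ"]},   # keyed by pipe name; overrides config labels
)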
@ -1741,6 +1857,7 @@ class Language:
# using the nlp.config with all defaults.
config = util.copy_config(config)
orig_pipeline = config.pop("components", {})
orig_distill = config.pop("distillation", None)
orig_pretraining = config.pop("pretraining", None)
config["components"] = {}
if auto_fill:
@ -1749,6 +1866,9 @@ class Language:
filled = config
filled["components"] = orig_pipeline
config["components"] = orig_pipeline
if orig_distill is not None:
filled["distillation"] = orig_distill
config["distillation"] = orig_distill
if orig_pretraining is not None:
filled["pretraining"] = orig_pretraining
config["pretraining"] = orig_pretraining
@ -2098,9 +2218,6 @@ class Language:
if path.exists():
data = srsly.read_json(path)
self.meta.update(data)
# self.meta always overrides meta["vectors"] with the metadata
# from self.vocab.vectors, so set the name directly
self.vocab.vectors.name = data.get("vectors", {}).get("name")
def deserialize_vocab(path: Path) -> None:
if path.exists():
@ -2169,9 +2286,6 @@ class Language:
def deserialize_meta(b):
data = srsly.json_loads(b)
self.meta.update(data)
# self.meta always overrides meta["vectors"] with the metadata
# from self.vocab.vectors, so set the name directly
self.vocab.vectors.name = data.get("vectors", {}).get("name")
deserializers: Dict[str, Callable[[bytes], Any]] = {}
deserializers["config.cfg"] = lambda b: self.config.from_bytes(
@ -2238,13 +2352,18 @@ class DisabledPipes(list):
self[:] = []
def _copy_examples(examples: Iterable[Example]) -> List[Example]:
def _copy_examples(
examples: Iterable[Example], *, copy_x: bool = True, copy_y: bool = False
) -> List[Example]:
"""Make a copy of a batch of examples, copying the predicted Doc as well.
This is used in contexts where we need to take ownership of the examples
so that they can be mutated, for instance during Language.evaluate and
Language.update.
"""
return [Example(eg.x.copy(), eg.y) for eg in examples]
return [
Example(eg.x.copy() if copy_x else eg.x, eg.y.copy() if copy_y else eg.y)
for eg in examples
]
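A short illustration of the new copy flags, assuming `examples` is a list of Example objects; Language.distill passes copy_x=True and copy_y=True so that both the predicted and the reference docs can be mutated without touching the caller's objects:

copies = _copy_examples(examples, copy_x=True, copy_y=True)
assert copies[0].x is not examples[0].x     # predicted doc copied
assert copies[0].y is not examples[0].y     # reference doc copied as well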
def _apply_pipes(

View File

@ -5,7 +5,6 @@ from .attrs cimport attr_id_t
from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, LANG
from .structs cimport LexemeC
from .strings cimport StringStore
from .vocab cimport Vocab

View File

@ -20,7 +20,6 @@ class Lexeme:
def vector_norm(self) -> float: ...
vector: Floats1d
rank: int
sentiment: float
@property
def orth_(self) -> str: ...
@property

View File

@ -41,7 +41,7 @@ cdef class Lexeme:
"""
self.vocab = vocab
self.orth = orth
self.c = <LexemeC*><void*>vocab.get_by_orth(vocab.mem, orth)
self.c = <LexemeC*><void*>vocab.get_by_orth(orth)
if self.c.orth != orth:
raise ValueError(Errors.E071.format(orth=orth, vocab_orth=self.c.orth))
@ -173,19 +173,6 @@ cdef class Lexeme:
def __set__(self, value):
self.c.id = value
property sentiment:
"""RETURNS (float): A scalar value indicating the positivity or
negativity of the lexeme."""
def __get__(self):
sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment", {})
return sentiment_table.get(self.c.orth, 0.0)
def __set__(self, float x):
if "lexeme_sentiment" not in self.vocab.lookups:
self.vocab.lookups.add_table("lexeme_sentiment")
sentiment_table = self.vocab.lookups.get_table("lexeme_sentiment")
sentiment_table[self.c.orth] = x
@property
def orth_(self):
"""RETURNS (str): The original verbatim text of the lexeme

View File

@ -169,9 +169,9 @@ cdef class DependencyMatcher:
on_match (callable): Optional callback executed on match.
"""
if on_match is not None and not hasattr(on_match, "__call__"):
raise ValueError(Errors.E171.format(arg_type=type(on_match)))
if patterns is None or not isinstance(patterns, List): # old API
raise ValueError(Errors.E948.format(arg_type=type(patterns)))
raise ValueError(Errors.E171.format(name="DependencyMatcher", arg_type=type(on_match)))
if patterns is None or not isinstance(patterns, List):
raise ValueError(Errors.E948.format(name="DependencyMatcher", arg_type=type(patterns)))
for pattern in patterns:
if len(pattern) == 0:
raise ValueError(Errors.E012.format(key=key))

View File

@ -1,6 +1,5 @@
from typing import Any, List, Dict, Tuple, Optional, Callable, Union
from typing import Any, List, Dict, Tuple, Optional, Callable, Union, Literal
from typing import Iterator, Iterable, overload
from ..compat import Literal
from ..vocab import Vocab
from ..tokens import Doc, Span

View File

@ -23,7 +23,7 @@ from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH, POS, TAG, DEP, LEMMA, MORPH
from .levenshtein import levenshtein_compare
from ..schemas import validate_token_pattern
from ..errors import Errors, MatchPatternError, Warnings
from ..strings import get_string_id
from ..strings cimport get_string_id
from ..attrs import IDS
from ..util import registry
@ -115,9 +115,9 @@ cdef class Matcher:
"""
errors = {}
if on_match is not None and not hasattr(on_match, "__call__"):
raise ValueError(Errors.E171.format(arg_type=type(on_match)))
if patterns is None or not isinstance(patterns, List): # old API
raise ValueError(Errors.E948.format(arg_type=type(patterns)))
raise ValueError(Errors.E171.format(name="Matcher", arg_type=type(on_match)))
if patterns is None or not isinstance(patterns, List):
raise ValueError(Errors.E948.format(name="Matcher", arg_type=type(patterns)))
if greedy is not None and greedy not in ["FIRST", "LONGEST"]:
raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=greedy))
for i, pattern in enumerate(patterns):
@ -265,6 +265,10 @@ cdef class Matcher:
# non-overlapping ones this `match` can be either (start, end) or
# (start, end, alignments) depending on `with_alignments=` option.
for key, *match in matches:
# Adjust span matches to doc offsets
if isinstance(doclike, Span):
match[0] += doclike.start
match[1] += doclike.start
span_filter = self._filter.get(key)
if span_filter is not None:
pairs = pairs_by_id.get(key, [])
@ -295,9 +299,6 @@ cdef class Matcher:
if as_spans:
final_results = []
for key, start, end, *_ in final_matches:
if isinstance(doclike, Span):
start += doclike.start
end += doclike.start
final_results.append(Span(doc, start, end, label=key))
elif with_alignments:
# convert alignments List[Dict[str, int]] --> List[int]

View File

@ -1,5 +1,5 @@
from typing import List, Tuple, Union, Optional, Callable, Any, Dict, overload
from ..compat import Literal
from typing import List, Tuple, Union, Optional, Callable, Any, Dict, Literal
from typing import overload
from .matcher import Matcher
from ..vocab import Vocab
from ..tokens import Doc, Span
@ -20,6 +20,15 @@ class PhraseMatcher:
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
] = ...,
) -> None: ...
def _add_from_arrays(
self,
key: str,
specs: List[List[int]],
*,
on_match: Optional[
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
] = ...,
) -> None: ...
def remove(self, key: str) -> None: ...
@overload
def __call__(

View File

@ -1,4 +1,6 @@
# cython: infer_types=True, profile=True
from typing import List
from collections import defaultdict
from libc.stdint cimport uintptr_t
from preshed.maps cimport map_init, map_set, map_get, map_clear, map_iter
@ -39,7 +41,7 @@ cdef class PhraseMatcher:
"""
self.vocab = vocab
self._callbacks = {}
self._docs = {}
self._docs = defaultdict(set)
self._validate = validate
self.mem = Pool()
@ -155,66 +157,24 @@ cdef class PhraseMatcher:
del self._callbacks[key]
del self._docs[key]
def add(self, key, docs, *_docs, on_match=None):
"""Add a match-rule to the phrase-matcher. A match-rule consists of: an ID
key, an on_match callback, and one or more patterns.
Since spaCy v2.2.2, PhraseMatcher.add takes a list of patterns as the
second argument, with the on_match callback as an optional keyword
argument.
def _add_from_arrays(self, key, specs, *, on_match=None):
"""Add a preprocessed list of specs, with an optional callback.
key (str): The match ID.
docs (list): List of `Doc` objects representing match patterns.
specs (List[List[int]]): A list of lists of hashes to match.
on_match (callable): Callback executed on match.
*_docs (Doc): For backwards compatibility: list of patterns to add
as variable arguments. Will be ignored if a list of patterns is
provided as the second argument.
DOCS: https://spacy.io/api/phrasematcher#add
"""
if docs is None or hasattr(docs, "__call__"): # old API
on_match = docs
docs = _docs
_ = self.vocab[key]
self._callbacks[key] = on_match
self._docs.setdefault(key, set())
cdef MapStruct* current_node
cdef MapStruct* internal_node
cdef void* result
if isinstance(docs, Doc):
raise ValueError(Errors.E179.format(key=key))
for doc in docs:
if len(doc) == 0:
continue
if isinstance(doc, Doc):
attrs = (TAG, POS, MORPH, LEMMA, DEP)
has_annotation = {attr: doc.has_annotation(attr) for attr in attrs}
for attr in attrs:
if self.attr == attr and not has_annotation[attr]:
if attr == TAG:
pipe = "tagger"
elif attr in (POS, MORPH):
pipe = "morphologizer or tagger+attribute_ruler"
elif attr == LEMMA:
pipe = "lemmatizer"
elif attr == DEP:
pipe = "parser"
error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr))
raise ValueError(error_msg)
if self._validate and any(has_annotation.values()) \
and self.attr not in attrs:
string_attr = self.vocab.strings[self.attr]
warnings.warn(Warnings.W012.format(key=key, attr=string_attr))
keyword = self._convert_to_array(doc)
else:
keyword = doc
self._docs[key].add(tuple(keyword))
self._callbacks[key] = on_match
for spec in specs:
self._docs[key].add(tuple(spec))
current_node = self.c_map
for token in keyword:
for token in spec:
if token == self._terminal_hash:
warnings.warn(Warnings.W021)
break
@ -233,6 +193,57 @@ cdef class PhraseMatcher:
result = internal_node
map_set(self.mem, <MapStruct*>result, self.vocab.strings[key], NULL)
def add(self, key, docs, *, on_match=None):
"""Add a match-rule to the phrase-matcher. A match-rule consists of: an ID
key, a list of one or more patterns, and (optionally) an on_match callback.
key (str): The match ID.
docs (list): List of `Doc` objects representing match patterns.
on_match (callable): Callback executed on match.
If any of the input Docs are invalid, no internal state will be updated.
DOCS: https://spacy.io/api/phrasematcher#add
"""
if isinstance(docs, Doc):
raise ValueError(Errors.E179.format(key=key))
if docs is None or not isinstance(docs, List):
raise ValueError(Errors.E948.format(name="PhraseMatcher", arg_type=type(docs)))
if on_match is not None and not hasattr(on_match, "__call__"):
raise ValueError(Errors.E171.format(name="PhraseMatcher", arg_type=type(on_match)))
_ = self.vocab[key]
specs = []
for doc in docs:
if len(doc) == 0:
continue
if not isinstance(doc, Doc):
raise ValueError(Errors.E4000.format(type=type(doc)))
attrs = (TAG, POS, MORPH, LEMMA, DEP)
has_annotation = {attr: doc.has_annotation(attr) for attr in attrs}
for attr in attrs:
if self.attr == attr and not has_annotation[attr]:
if attr == TAG:
pipe = "tagger"
elif attr in (POS, MORPH):
pipe = "morphologizer or tagger+attribute_ruler"
elif attr == LEMMA:
pipe = "lemmatizer"
elif attr == DEP:
pipe = "parser"
error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr))
raise ValueError(error_msg)
if self._validate and any(has_annotation.values()) \
and self.attr not in attrs:
string_attr = self.vocab.strings[self.attr]
warnings.warn(Warnings.W012.format(key=key, attr=string_attr))
specs.append(self._convert_to_array(doc))
self._add_from_arrays(key, specs, on_match=on_match)
def __call__(self, object doclike, *, as_spans=False):
"""Find all sequences matching the supplied patterns on the `Doc`.
@ -345,7 +356,7 @@ def unpickle_matcher(vocab, docs, callbacks, attr):
matcher = PhraseMatcher(vocab, attr=attr)
for key, specs in docs.items():
callback = callbacks.get(key, None)
matcher.add(key, specs, on_match=callback)
matcher._add_from_arrays(key, specs, on_match=callback)
return matcher
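A usage sketch of the reworked add signature, assuming an `nlp` object: patterns must now be passed as a list of Doc objects, with on_match as a keyword-only argument, and the old positional-callback form described in the removed docstring is no longer accepted.

from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(name) for name in ["Barack Obama", "Angela Merkel"]]
matcher.add("PERSON", patterns, on_match=None)
matches = matcher(nlp("Angela Merkel visited Washington"))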

View File

@ -1,164 +0,0 @@
from thinc.api import Model, normal_init
from ..util import registry
@registry.layers("spacy.PrecomputableAffine.v1")
def PrecomputableAffine(nO, nI, nF, nP, dropout=0.1):
model = Model(
"precomputable_affine",
forward,
init=init,
dims={"nO": nO, "nI": nI, "nF": nF, "nP": nP},
params={"W": None, "b": None, "pad": None},
attrs={"dropout_rate": dropout},
)
return model
def forward(model, X, is_train):
nF = model.get_dim("nF")
nO = model.get_dim("nO")
nP = model.get_dim("nP")
nI = model.get_dim("nI")
W = model.get_param("W")
# Preallocate array for layer output, including padding.
Yf = model.ops.alloc2f(X.shape[0] + 1, nF * nO * nP, zeros=False)
model.ops.gemm(X, W.reshape((nF * nO * nP, nI)), trans2=True, out=Yf[1:])
Yf = Yf.reshape((Yf.shape[0], nF, nO, nP))
# Set padding. Padding has shape (1, nF, nO, nP). Unfortunately, we cannot
# change its shape to (nF, nO, nP) without breaking existing models. So
# we'll squeeze the first dimension here.
Yf[0] = model.ops.xp.squeeze(model.get_param("pad"), 0)
def backward(dY_ids):
# This backprop is particularly tricky, because we get back a different
# thing from what we put out. We put out an array of shape:
# (nB, nF, nO, nP), and get back:
# (nB, nO, nP) and ids (nB, nF)
# The ids tell us the values of nF, so we would have:
#
# dYf = zeros((nB, nF, nO, nP))
# for b in range(nB):
# for f in range(nF):
# dYf[b, ids[b, f]] += dY[b]
#
# However, we avoid building that array for efficiency -- and just pass
# in the indices.
dY, ids = dY_ids
assert dY.ndim == 3
assert dY.shape[1] == nO, dY.shape
assert dY.shape[2] == nP, dY.shape
# nB = dY.shape[0]
model.inc_grad("pad", _backprop_precomputable_affine_padding(model, dY, ids))
Xf = X[ids]
Xf = Xf.reshape((Xf.shape[0], nF * nI))
model.inc_grad("b", dY.sum(axis=0))
dY = dY.reshape((dY.shape[0], nO * nP))
Wopfi = W.transpose((1, 2, 0, 3))
Wopfi = Wopfi.reshape((nO * nP, nF * nI))
dXf = model.ops.gemm(dY.reshape((dY.shape[0], nO * nP)), Wopfi)
dWopfi = model.ops.gemm(dY, Xf, trans1=True)
dWopfi = dWopfi.reshape((nO, nP, nF, nI))
# (o, p, f, i) --> (f, o, p, i)
dWopfi = dWopfi.transpose((2, 0, 1, 3))
model.inc_grad("W", dWopfi)
return dXf.reshape((dXf.shape[0], nF, nI))
return Yf, backward
def _backprop_precomputable_affine_padding(model, dY, ids):
nB = dY.shape[0]
nF = model.get_dim("nF")
nP = model.get_dim("nP")
nO = model.get_dim("nO")
# Backprop the "padding", used as a filler for missing values.
# Values that are missing are set to -1, and each state vector could
# have multiple missing values. The padding has different values for
# different missing features. The gradient of the padding vector is:
#
# for b in range(nB):
# for f in range(nF):
# if ids[b, f] < 0:
# d_pad[f] += dY[b]
#
# Which can be rewritten as:
#
# (ids < 0).T @ dY
mask = model.ops.asarray(ids < 0, dtype="f")
d_pad = model.ops.gemm(mask, dY.reshape(nB, nO * nP), trans1=True)
return d_pad.reshape((1, nF, nO, nP))
def init(model, X=None, Y=None):
"""This is like the 'layer sequential unit variance', but instead
of taking the actual inputs, we randomly generate whitened data.
Why's this all so complicated? We have a huge number of inputs,
and the maxout unit makes guessing the dynamics tricky. Instead
we set the maxout weights to values that empirically result in
whitened outputs given whitened inputs.
"""
if model.has_param("W") and model.get_param("W").any():
return
nF = model.get_dim("nF")
nO = model.get_dim("nO")
nP = model.get_dim("nP")
nI = model.get_dim("nI")
W = model.ops.alloc4f(nF, nO, nP, nI)
b = model.ops.alloc2f(nO, nP)
pad = model.ops.alloc4f(1, nF, nO, nP)
ops = model.ops
W = normal_init(ops, W.shape, mean=float(ops.xp.sqrt(1.0 / nF * nI)))
pad = normal_init(ops, pad.shape, mean=1.0)
model.set_param("W", W)
model.set_param("b", b)
model.set_param("pad", pad)
ids = ops.alloc((5000, nF), dtype="f")
ids += ops.xp.random.uniform(0, 1000, ids.shape)
ids = ops.asarray(ids, dtype="i")
tokvecs = ops.alloc((5000, nI), dtype="f")
tokvecs += ops.xp.random.normal(loc=0.0, scale=1.0, size=tokvecs.size).reshape(
tokvecs.shape
)
def predict(ids, tokvecs):
# nS ids. nW tokvecs. Exclude the padding array.
hiddens = model.predict(tokvecs[:-1]) # (nW, f, o, p)
vectors = model.ops.alloc((ids.shape[0], nO * nP), dtype="f")
# need nS vectors
hiddens = hiddens.reshape((hiddens.shape[0] * nF, nO * nP))
model.ops.scatter_add(vectors, ids.flatten(), hiddens)
vectors = vectors.reshape((vectors.shape[0], nO, nP))
vectors += b
vectors = model.ops.asarray(vectors)
if nP >= 2:
return model.ops.maxout(vectors)[0]
else:
return vectors * (vectors >= 0)
tol_var = 0.01
tol_mean = 0.01
t_max = 10
W = model.get_param("W").copy()
b = model.get_param("b").copy()
for t_i in range(t_max):
acts1 = predict(ids, tokvecs)
var = model.ops.xp.var(acts1)
mean = model.ops.xp.mean(acts1)
if abs(var - 1.0) >= tol_var:
W /= model.ops.xp.sqrt(var)
model.set_param("W", W)
elif abs(mean) >= tol_mean:
b -= mean
model.set_param("b", b)
else:
break

View File

@ -23,6 +23,7 @@ DEFAULT_NVTX_ANNOTATABLE_PIPE_METHODS = [
"update",
"rehearse",
"get_loss",
"get_teacher_student_loss",
"initialize",
"begin_update",
"finish_update",

View File

@ -6,9 +6,9 @@ from thinc.api import Model, Maxout, Linear, tuplify, Ragged
from ...util import registry
from ...kb import KnowledgeBase, InMemoryLookupKB
from ...kb import Candidate, get_candidates, get_candidates_batch
from ...kb import Candidate
from ...vocab import Vocab
from ...tokens import Span, Doc
from ...tokens import Doc, Span, SpanGroup
from ..extract_spans import extract_spans
from ...errors import Errors
@ -114,6 +114,28 @@ def create_candidates() -> Callable[[KnowledgeBase, Span], Iterable[Candidate]]:
@registry.misc("spacy.CandidateBatchGenerator.v1")
def create_candidates_batch() -> Callable[
[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]
[KnowledgeBase, SpanGroup], Iterable[Iterable[Candidate]]
]:
return get_candidates_batch
def get_candidates(kb: KnowledgeBase, mention: Span) -> Iterable[Candidate]:
"""
Return candidate entities for a given mention, fetching appropriate entries from the index.
kb (KnowledgeBase): Knowledge base to query.
mention (Span): Entity mention for which to identify candidates.
RETURNS (Iterable[Candidate]): Identified candidates.
"""
return kb.get_candidates(mention)
def get_candidates_batch(
kb: KnowledgeBase, mentions: SpanGroup
) -> Iterable[Iterable[Candidate]]:
"""
Return candidate entities for the given mentions, fetching appropriate entries from the index.
kb (KnowledgeBase): Knowledge base to query.
mentions (SpanGroup): Entity mentions for which to identify candidates.
RETURNS (Iterable[Iterable[Candidate]]): Identified candidates.
"""
return kb.get_candidates_batch(mentions)
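A sketch of the updated batch candidate generation, assuming an `nlp` object and a populated InMemoryLookupKB `kb`: mentions are now passed as a SpanGroup rather than an arbitrary iterable of Spans.

from spacy.tokens import SpanGroup

doc = nlp("Douglas Adams wrote The Hitchhiker's Guide to the Galaxy")
mentions = SpanGroup(doc, spans=[doc[0:2]])           # "Douglas Adams"
candidates = get_candidates_batch(kb, mentions)       # Iterable[Iterable[Candidate]]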

View File

@ -1,17 +1,19 @@
from typing import Optional, List, cast
from thinc.api import Model, chain, list2array, Linear, zero_init, use_ops
from typing import Optional, List, Tuple, Any, Literal
from thinc.types import Floats2d
from thinc.api import Model
import warnings
from ...errors import Errors
from ...compat import Literal
from ...errors import Errors, Warnings
from ...util import registry
from .._precomputable_affine import PrecomputableAffine
from ..tb_framework import TransitionModel
from ...tokens import Doc
from ...tokens.doc import Doc
TransitionSystem = Any # TODO
State = Any # TODO
@registry.architectures("spacy.TransitionBasedParser.v2")
def build_tb_parser_model(
@registry.architectures.register("spacy.TransitionBasedParser.v2")
def transition_parser_v2(
tok2vec: Model[List[Doc], List[Floats2d]],
state_type: Literal["parser", "ner"],
extra_state_tokens: bool,
@ -19,6 +21,46 @@ def build_tb_parser_model(
maxout_pieces: int,
use_upper: bool,
nO: Optional[int] = None,
) -> Model:
if not use_upper:
warnings.warn(Warnings.W400)
return build_tb_parser_model(
tok2vec,
state_type,
extra_state_tokens,
hidden_width,
maxout_pieces,
nO=nO,
)
@registry.architectures.register("spacy.TransitionBasedParser.v3")
def transition_parser_v3(
tok2vec: Model[List[Doc], List[Floats2d]],
state_type: Literal["parser", "ner"],
extra_state_tokens: bool,
hidden_width: int,
maxout_pieces: int,
nO: Optional[int] = None,
) -> Model:
return build_tb_parser_model(
tok2vec,
state_type,
extra_state_tokens,
hidden_width,
maxout_pieces,
nO=nO,
)
def build_tb_parser_model(
tok2vec: Model[List[Doc], List[Floats2d]],
state_type: Literal["parser", "ner"],
extra_state_tokens: bool,
hidden_width: int,
maxout_pieces: int,
nO: Optional[int] = None,
) -> Model:
"""
Build a transition-based parser model. Can be applied to NER or dependency parsing.
@ -51,14 +93,7 @@ def build_tb_parser_model(
feature sets (for the NER) or 13 (for the parser).
hidden_width (int): The width of the hidden layer.
maxout_pieces (int): How many pieces to use in the state prediction layer.
Recommended values are 1, 2 or 3. If 1, the maxout non-linearity
is replaced with a ReLu non-linearity if use_upper=True, and no
non-linearity if use_upper=False.
use_upper (bool): Whether to use an additional hidden layer after the state
vector in order to predict the action scores. It is recommended to set
this to False for large pretrained models such as transformers, and True
for smaller networks. The upper layer is computed on CPU, which becomes
a bottleneck on larger GPU-based models, where it's also less necessary.
Recommended values are 1, 2 or 3.
nO (int or None): The number of actions the model will predict between.
Usually inferred from data at the beginning of training, or loaded from
disk.
@ -69,106 +104,11 @@ def build_tb_parser_model(
nr_feature_tokens = 6 if extra_state_tokens else 3
else:
raise ValueError(Errors.E917.format(value=state_type))
t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None
tok2vec = chain(
tok2vec,
list2array(),
Linear(hidden_width, t2v_width),
return TransitionModel(
tok2vec=tok2vec,
state_tokens=nr_feature_tokens,
hidden_width=hidden_width,
maxout_pieces=maxout_pieces,
nO=nO,
unseen_classes=set(),
)
tok2vec.set_dim("nO", hidden_width)
lower = _define_lower(
nO=hidden_width if use_upper else nO,
nF=nr_feature_tokens,
nI=tok2vec.get_dim("nO"),
nP=maxout_pieces,
)
upper = None
if use_upper:
with use_ops("cpu"):
# Initialize weights at zero, as it's a classification layer.
upper = _define_upper(nO=nO, nI=None)
return TransitionModel(tok2vec, lower, upper, resize_output)
def _define_upper(nO, nI):
return Linear(nO=nO, nI=nI, init_W=zero_init)
def _define_lower(nO, nF, nI, nP):
return PrecomputableAffine(nO=nO, nF=nF, nI=nI, nP=nP)
def resize_output(model, new_nO):
if model.attrs["has_upper"]:
return _resize_upper(model, new_nO)
return _resize_lower(model, new_nO)
def _resize_upper(model, new_nO):
upper = model.get_ref("upper")
if upper.has_dim("nO") is None:
upper.set_dim("nO", new_nO)
return model
elif new_nO == upper.get_dim("nO"):
return model
smaller = upper
nI = smaller.maybe_get_dim("nI")
with use_ops("cpu"):
larger = _define_upper(nO=new_nO, nI=nI)
# it could be that the model is not initialized yet, then skip this bit
if smaller.has_param("W"):
larger_W = larger.ops.alloc2f(new_nO, nI)
larger_b = larger.ops.alloc1f(new_nO)
smaller_W = smaller.get_param("W")
smaller_b = smaller.get_param("b")
# Weights are stored in (nr_out, nr_in) format, so we're basically
# just adding rows here.
if smaller.has_dim("nO"):
old_nO = smaller.get_dim("nO")
larger_W[:old_nO] = smaller_W
larger_b[:old_nO] = smaller_b
for i in range(old_nO, new_nO):
model.attrs["unseen_classes"].add(i)
larger.set_param("W", larger_W)
larger.set_param("b", larger_b)
model._layers[-1] = larger
model.set_ref("upper", larger)
return model
def _resize_lower(model, new_nO):
lower = model.get_ref("lower")
if lower.has_dim("nO") is None:
lower.set_dim("nO", new_nO)
return model
smaller = lower
nI = smaller.maybe_get_dim("nI")
nF = smaller.maybe_get_dim("nF")
nP = smaller.maybe_get_dim("nP")
larger = _define_lower(nO=new_nO, nI=nI, nF=nF, nP=nP)
# it could be that the model is not initialized yet, then skip this bit
if smaller.has_param("W"):
larger_W = larger.ops.alloc4f(nF, new_nO, nP, nI)
larger_b = larger.ops.alloc2f(new_nO, nP)
larger_pad = larger.ops.alloc4f(1, nF, new_nO, nP)
smaller_W = smaller.get_param("W")
smaller_b = smaller.get_param("b")
smaller_pad = smaller.get_param("pad")
# Copy the old weights and padding into the new layer
if smaller.has_dim("nO"):
old_nO = smaller.get_dim("nO")
larger_W[:, 0:old_nO, :, :] = smaller_W
larger_pad[:, :, 0:old_nO, :] = smaller_pad
larger_b[0:old_nO, :] = smaller_b
for i in range(old_nO, new_nO):
model.attrs["unseen_classes"].add(i)
larger.set_param("W", larger_W)
larger.set_param("b", larger_b)
larger.set_param("pad", larger_pad)
model._layers[1] = larger
model.set_ref("lower", larger)
return model
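A sketch of resolving the new v3 architecture directly from the registry, assuming a `tok2vec` Model has already been built; v3 drops the use_upper argument that v2 still accepts, and the remaining argument values below are illustrative:

from spacy import registry

make_parser_model = registry.architectures.get("spacy.TransitionBasedParser.v3")
parser_model = make_parser_model(
    tok2vec=tok2vec,
    state_type="ner",
    extra_state_tokens=False,
    hidden_width=64,
    maxout_pieces=2,
)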

View File

@ -7,7 +7,7 @@ from thinc.api import expand_window, residual, Maxout, Mish, PyTorchLSTM
from ...tokens import Doc
from ...util import registry
from ...errors import Errors
from ...ml import _character_embed
from ...ml import character_embed
from ..staticvectors import StaticVectors
from ..featureextractor import FeatureExtractor
from ...pipeline.tok2vec import Tok2VecListener
@ -226,7 +226,7 @@ def CharacterEmbed(
if feature is None:
raise ValueError(Errors.E911.format(feat=feature))
char_embed = chain(
_character_embed.CharacterEmbed(nM=nM, nC=nC),
character_embed.CharacterEmbed(nM=nM, nC=nC),
cast(Model[List[Floats2d], Ragged], list2ragged()),
)
feature_extractor: Model[List[Doc], Ragged] = chain(

View File

@ -1,49 +0,0 @@
from libc.string cimport memset, memcpy
from thinc.backends.cblas cimport CBlas
from ..typedefs cimport weight_t, hash_t
from ..pipeline._parser_internals._state cimport StateC
cdef struct SizesC:
int states
int classes
int hiddens
int pieces
int feats
int embed_width
cdef struct WeightsC:
const float* feat_weights
const float* feat_bias
const float* hidden_bias
const float* hidden_weights
const float* seen_classes
cdef struct ActivationsC:
int* token_ids
float* unmaxed
float* scores
float* hiddens
int* is_valid
int _curr_size
int _max_size
cdef WeightsC get_c_weights(model) except *
cdef SizesC get_c_sizes(model, int batch_size) except *
cdef ActivationsC alloc_activations(SizesC n) nogil
cdef void free_activations(const ActivationsC* A) nogil
cdef void predict_states(CBlas cblas, ActivationsC* A, StateC** states,
const WeightsC* W, SizesC n) nogil
cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) nogil
cdef void cpu_log_loss(float* d_scores,
const float* costs, const int* is_valid, const float* scores, int O) nogil

View File

@ -1,492 +0,0 @@
# cython: infer_types=True, cdivision=True, boundscheck=False
cimport numpy as np
from libc.math cimport exp
from libc.string cimport memset, memcpy
from libc.stdlib cimport calloc, free, realloc
from thinc.backends.linalg cimport Vec, VecVec
from thinc.backends.cblas cimport saxpy, sgemm
import numpy
import numpy.random
from thinc.api import Model, CupyOps, NumpyOps, get_ops
from .. import util
from ..errors import Errors
from ..typedefs cimport weight_t, class_t, hash_t
from ..pipeline._parser_internals.stateclass cimport StateClass
cdef WeightsC get_c_weights(model) except *:
cdef WeightsC output
cdef precompute_hiddens state2vec = model.state2vec
output.feat_weights = state2vec.get_feat_weights()
output.feat_bias = <const float*>state2vec.bias.data
cdef np.ndarray vec2scores_W
cdef np.ndarray vec2scores_b
if model.vec2scores is None:
output.hidden_weights = NULL
output.hidden_bias = NULL
else:
vec2scores_W = model.vec2scores.get_param("W")
vec2scores_b = model.vec2scores.get_param("b")
output.hidden_weights = <const float*>vec2scores_W.data
output.hidden_bias = <const float*>vec2scores_b.data
cdef np.ndarray class_mask = model._class_mask
output.seen_classes = <const float*>class_mask.data
return output
cdef SizesC get_c_sizes(model, int batch_size) except *:
cdef SizesC output
output.states = batch_size
if model.vec2scores is None:
output.classes = model.state2vec.get_dim("nO")
else:
output.classes = model.vec2scores.get_dim("nO")
output.hiddens = model.state2vec.get_dim("nO")
output.pieces = model.state2vec.get_dim("nP")
output.feats = model.state2vec.get_dim("nF")
output.embed_width = model.tokvecs.shape[1]
return output
cdef ActivationsC alloc_activations(SizesC n) nogil:
cdef ActivationsC A
memset(&A, 0, sizeof(A))
resize_activations(&A, n)
return A
cdef void free_activations(const ActivationsC* A) nogil:
free(A.token_ids)
free(A.scores)
free(A.unmaxed)
free(A.hiddens)
free(A.is_valid)
cdef void resize_activations(ActivationsC* A, SizesC n) nogil:
if n.states <= A._max_size:
A._curr_size = n.states
return
if A._max_size == 0:
A.token_ids = <int*>calloc(n.states * n.feats, sizeof(A.token_ids[0]))
A.scores = <float*>calloc(n.states * n.classes, sizeof(A.scores[0]))
A.unmaxed = <float*>calloc(n.states * n.hiddens * n.pieces, sizeof(A.unmaxed[0]))
A.hiddens = <float*>calloc(n.states * n.hiddens, sizeof(A.hiddens[0]))
A.is_valid = <int*>calloc(n.states * n.classes, sizeof(A.is_valid[0]))
A._max_size = n.states
else:
A.token_ids = <int*>realloc(A.token_ids,
n.states * n.feats * sizeof(A.token_ids[0]))
A.scores = <float*>realloc(A.scores,
n.states * n.classes * sizeof(A.scores[0]))
A.unmaxed = <float*>realloc(A.unmaxed,
n.states * n.hiddens * n.pieces * sizeof(A.unmaxed[0]))
A.hiddens = <float*>realloc(A.hiddens,
n.states * n.hiddens * sizeof(A.hiddens[0]))
A.is_valid = <int*>realloc(A.is_valid,
n.states * n.classes * sizeof(A.is_valid[0]))
A._max_size = n.states
A._curr_size = n.states
cdef void predict_states(CBlas cblas, ActivationsC* A, StateC** states,
const WeightsC* W, SizesC n) nogil:
cdef double one = 1.0
resize_activations(A, n)
for i in range(n.states):
states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats)
memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float))
memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float))
sum_state_features(cblas, A.unmaxed,
W.feat_weights, A.token_ids, n.states, n.feats, n.hiddens * n.pieces)
for i in range(n.states):
VecVec.add_i(&A.unmaxed[i*n.hiddens*n.pieces],
W.feat_bias, 1., n.hiddens * n.pieces)
for j in range(n.hiddens):
index = i * n.hiddens * n.pieces + j * n.pieces
which = Vec.arg_max(&A.unmaxed[index], n.pieces)
A.hiddens[i*n.hiddens + j] = A.unmaxed[index + which]
memset(A.scores, 0, n.states * n.classes * sizeof(float))
if W.hidden_weights == NULL:
memcpy(A.scores, A.hiddens, n.states * n.classes * sizeof(float))
else:
# Compute hidden-to-output
sgemm(cblas)(False, True, n.states, n.classes, n.hiddens,
1.0, <const float *>A.hiddens, n.hiddens,
<const float *>W.hidden_weights, n.hiddens,
0.0, A.scores, n.classes)
# Add bias
for i in range(n.states):
VecVec.add_i(&A.scores[i*n.classes],
W.hidden_bias, 1., n.classes)
# Set unseen classes to minimum value
i = 0
min_ = A.scores[0]
for i in range(1, n.states * n.classes):
if A.scores[i] < min_:
min_ = A.scores[i]
for i in range(n.states):
for j in range(n.classes):
if not W.seen_classes[j]:
A.scores[i*n.classes+j] = min_
cdef void sum_state_features(CBlas cblas, float* output,
const float* cached, const int* token_ids, int B, int F, int O) nogil:
cdef int idx, b, f, i
cdef const float* feature
padding = cached
cached += F * O
cdef int id_stride = F*O
cdef float one = 1.
for b in range(B):
for f in range(F):
if token_ids[f] < 0:
feature = &padding[f*O]
else:
idx = token_ids[f] * id_stride + f*O
feature = &cached[idx]
saxpy(cblas)(O, one, <const float*>feature, 1, &output[b*O], 1)
token_ids += F
cdef void cpu_log_loss(float* d_scores,
const float* costs, const int* is_valid, const float* scores,
int O) nogil:
"""Do multi-label log loss"""
cdef double max_, gmax, Z, gZ
best = arg_max_if_gold(scores, costs, is_valid, O)
guess = Vec.arg_max(scores, O)
if best == -1 or guess == -1:
# These shouldn't happen, but if they do, we want to make sure we don't
# cause an OOB access.
return
Z = 1e-10
gZ = 1e-10
max_ = scores[guess]
gmax = scores[best]
for i in range(O):
Z += exp(scores[i] - max_)
if costs[i] <= costs[best]:
gZ += exp(scores[i] - gmax)
for i in range(O):
if costs[i] <= costs[best]:
d_scores[i] = (exp(scores[i]-max_) / Z) - (exp(scores[i]-gmax)/gZ)
else:
d_scores[i] = exp(scores[i]-max_) / Z
cdef int arg_max_if_gold(const weight_t* scores, const weight_t* costs,
const int* is_valid, int n) nogil:
# Find minimum cost
cdef float cost = 1
for i in range(n):
if is_valid[i] and costs[i] < cost:
cost = costs[i]
# Now find best-scoring with that cost
cdef int best = -1
for i in range(n):
if costs[i] <= cost and is_valid[i]:
if best == -1 or scores[i] > scores[best]:
best = i
return best
cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) nogil:
cdef int best = -1
for i in range(n):
if is_valid[i] >= 1:
if best == -1 or scores[i] > scores[best]:
best = i
return best
class ParserStepModel(Model):
def __init__(self, docs, layers, *, has_upper, unseen_classes=None, train=True,
dropout=0.1):
Model.__init__(self, name="parser_step_model", forward=step_forward)
self.attrs["has_upper"] = has_upper
self.attrs["dropout_rate"] = dropout
self.tokvecs, self.bp_tokvecs = layers[0](docs, is_train=train)
if layers[1].get_dim("nP") >= 2:
activation = "maxout"
elif has_upper:
activation = None
else:
activation = "relu"
self.state2vec = precompute_hiddens(len(docs), self.tokvecs, layers[1],
activation=activation, train=train)
if has_upper:
self.vec2scores = layers[-1]
else:
self.vec2scores = None
self.cuda_stream = util.get_cuda_stream(non_blocking=True)
self.backprops = []
self._class_mask = numpy.zeros((self.nO,), dtype='f')
self._class_mask.fill(1)
if unseen_classes is not None:
for class_ in unseen_classes:
self._class_mask[class_] = 0.
def clear_memory(self):
del self.tokvecs
del self.bp_tokvecs
del self.state2vec
del self.backprops
del self._class_mask
@property
def nO(self):
if self.attrs["has_upper"]:
return self.vec2scores.get_dim("nO")
else:
return self.state2vec.get_dim("nO")
def class_is_unseen(self, class_):
return self._class_mask[class_]
def mark_class_unseen(self, class_):
self._class_mask[class_] = 0
def mark_class_seen(self, class_):
self._class_mask[class_] = 1
def get_token_ids(self, states):
cdef StateClass state
states = [state for state in states if not state.is_final()]
cdef np.ndarray ids = numpy.zeros((len(states), self.state2vec.nF),
dtype='i', order='C')
ids.fill(-1)
c_ids = <int*>ids.data
for state in states:
state.c.set_context_tokens(c_ids, ids.shape[1])
c_ids += ids.shape[1]
return ids
def backprop_step(self, token_ids, d_vector, get_d_tokvecs):
if isinstance(self.state2vec.ops, CupyOps) \
and not isinstance(token_ids, self.state2vec.ops.xp.ndarray):
# Move token_ids and d_vector to GPU, asynchronously
self.backprops.append((
util.get_async(self.cuda_stream, token_ids),
util.get_async(self.cuda_stream, d_vector),
get_d_tokvecs
))
else:
self.backprops.append((token_ids, d_vector, get_d_tokvecs))
def finish_steps(self, golds):
# Add a padding vector to the d_tokvecs gradient, so that missing
# values don't affect the real gradient.
d_tokvecs = self.ops.alloc((self.tokvecs.shape[0]+1, self.tokvecs.shape[1]))
# Tells CUDA to block, so our async copies complete.
if self.cuda_stream is not None:
self.cuda_stream.synchronize()
for ids, d_vector, bp_vector in self.backprops:
d_state_features = bp_vector((d_vector, ids))
ids = ids.flatten()
d_state_features = d_state_features.reshape(
(ids.size, d_state_features.shape[2]))
self.ops.scatter_add(d_tokvecs, ids,
d_state_features)
# Padded -- see update()
self.bp_tokvecs(d_tokvecs[:-1])
return d_tokvecs
NUMPY_OPS = NumpyOps()
def step_forward(model: ParserStepModel, states, is_train):
token_ids = model.get_token_ids(states)
vector, get_d_tokvecs = model.state2vec(token_ids, is_train)
mask = None
if model.attrs["has_upper"]:
dropout_rate = model.attrs["dropout_rate"]
if is_train and dropout_rate > 0:
mask = NUMPY_OPS.get_dropout_mask(vector.shape, 0.1)
vector *= mask
scores, get_d_vector = model.vec2scores(vector, is_train)
else:
scores = NumpyOps().asarray(vector)
get_d_vector = lambda d_scores: d_scores
# If the class is unseen, make sure its score is minimum
scores[:, model._class_mask == 0] = numpy.nanmin(scores)
def backprop_parser_step(d_scores):
# Zero vectors for unseen classes
d_scores *= model._class_mask
d_vector = get_d_vector(d_scores)
if mask is not None:
d_vector *= mask
model.backprop_step(token_ids, d_vector, get_d_tokvecs)
return None
return scores, backprop_parser_step
cdef class precompute_hiddens:
"""Allow a model to be "primed" by pre-computing input features in bulk.
This is used for the parser, where we want to take a batch of documents,
and compute vectors for each (token, position) pair. These vectors can then
be reused, especially for beam-search.
Let's say we're using 12 features for each state, e.g. word at start of
buffer, three words on stack, their children, etc. In the normal arc-eager
system, a document of length N is processed in 2*N states. This means we'll
create 2*N*12 feature vectors --- but if we pre-compute, we only need
N*12 vector computations. The saving for beam-search is much better:
if we have a beam of k, we'll normally make 2*N*12*K computations --
so we can save the factor k. This also gives a nice CPU/GPU division:
we can do all our hard maths up front, packed into large multiplications,
and do the hard-to-program parsing on the CPU.
"""
cdef readonly int nF, nO, nP
cdef bint _is_synchronized
cdef public object ops
cdef public object numpy_ops
cdef np.ndarray _features
cdef np.ndarray _cached
cdef np.ndarray bias
cdef object _cuda_stream
cdef object _bp_hiddens
cdef object activation
def __init__(self, batch_size, tokvecs, lower_model, cuda_stream=None,
activation="maxout", train=False):
gpu_cached, bp_features = lower_model(tokvecs, train)
cdef np.ndarray cached
if not isinstance(gpu_cached, numpy.ndarray):
# Note the passing of cuda_stream here: it lets
# cupy make the copy asynchronously.
# We then have to block before first use.
cached = gpu_cached.get(stream=cuda_stream)
else:
cached = gpu_cached
if not isinstance(lower_model.get_param("b"), numpy.ndarray):
self.bias = lower_model.get_param("b").get(stream=cuda_stream)
else:
self.bias = lower_model.get_param("b")
self.nF = cached.shape[1]
if lower_model.has_dim("nP"):
self.nP = lower_model.get_dim("nP")
else:
self.nP = 1
self.nO = cached.shape[2]
self.ops = lower_model.ops
self.numpy_ops = NumpyOps()
assert activation in (None, "relu", "maxout")
self.activation = activation
self._is_synchronized = False
self._cuda_stream = cuda_stream
self._cached = cached
self._bp_hiddens = bp_features
cdef const float* get_feat_weights(self) except NULL:
if not self._is_synchronized and self._cuda_stream is not None:
self._cuda_stream.synchronize()
self._is_synchronized = True
return <float*>self._cached.data
def has_dim(self, name):
if name == "nF":
return self.nF if self.nF is not None else True
elif name == "nP":
return self.nP if self.nP is not None else True
elif name == "nO":
return self.nO if self.nO is not None else True
else:
return False
def get_dim(self, name):
if name == "nF":
return self.nF
elif name == "nP":
return self.nP
elif name == "nO":
return self.nO
else:
raise ValueError(Errors.E1033.format(name=name))
def set_dim(self, name, value):
if name == "nF":
self.nF = value
elif name == "nP":
self.nP = value
elif name == "nO":
self.nO = value
else:
raise ValueError(Errors.E1033.format(name=name))
def __call__(self, X, bint is_train):
if is_train:
return self.begin_update(X)
else:
return self.predict(X), lambda X: X
def predict(self, X):
return self.begin_update(X)[0]
def begin_update(self, token_ids):
cdef np.ndarray state_vector = numpy.zeros(
(token_ids.shape[0], self.nO, self.nP), dtype='f')
# This is tricky, but (assuming GPU available);
# - Input to forward on CPU
# - Output from forward on CPU
# - Input to backward on GPU!
# - Output from backward on GPU
bp_hiddens = self._bp_hiddens
cdef CBlas cblas
if isinstance(self.ops, CupyOps):
cblas = NUMPY_OPS.cblas()
else:
cblas = self.ops.cblas()
feat_weights = self.get_feat_weights()
cdef int[:, ::1] ids = token_ids
sum_state_features(cblas, <float*>state_vector.data,
feat_weights, &ids[0,0],
token_ids.shape[0], self.nF, self.nO*self.nP)
state_vector += self.bias
state_vector, bp_nonlinearity = self._nonlinearity(state_vector)
def backward(d_state_vector_ids):
d_state_vector, token_ids = d_state_vector_ids
d_state_vector = bp_nonlinearity(d_state_vector)
d_tokens = bp_hiddens((d_state_vector, token_ids))
return d_tokens
return state_vector, backward
def _nonlinearity(self, state_vector):
if self.activation == "maxout":
return self._maxout_nonlinearity(state_vector)
else:
return self._relu_nonlinearity(state_vector)
def _maxout_nonlinearity(self, state_vector):
state_vector, mask = self.numpy_ops.maxout(state_vector)
# We're outputting to CPU, but we need this variable on GPU for the
# backward pass.
mask = self.ops.asarray(mask)
def backprop_maxout(d_best):
return self.ops.backprop_maxout(d_best, mask, self.nP)
return state_vector, backprop_maxout
def _relu_nonlinearity(self, state_vector):
state_vector = state_vector.reshape((state_vector.shape[0], -1))
mask = state_vector >= 0.
state_vector *= mask
# We're outputting to CPU, but we need this variable on GPU for the
# backward pass.
mask = self.ops.asarray(mask)
def backprop_relu(d_best):
d_best *= mask
return d_best.reshape((d_best.shape + (1,)))
return state_vector, backprop_relu
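A small worked example of the savings described in the precompute_hiddens docstring above, for a document of N=50 tokens, F=12 features per state and beam width k=4 (the numbers are illustrative):

N, F, k = 50, 12, 4
naive_beam = 2 * N * F * k    # 4800 per-state feature computations without precomputation
precomputed = N * F           # 600 (token, feature) vectors computed once and reused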

spacy/ml/tb_framework.pxd (new file, 28 lines)
View File

@ -0,0 +1,28 @@
from libc.stdint cimport int8_t
cdef struct SizesC:
int states
int classes
int hiddens
int pieces
int feats
int embed_width
int tokens
cdef struct WeightsC:
const float* feat_weights
const float* feat_bias
const float* hidden_bias
const float* hidden_weights
const int8_t* seen_mask
cdef struct ActivationsC:
int* token_ids
float* unmaxed
float* hiddens
int* is_valid
int _curr_size
int _max_size

View File

@ -1,50 +0,0 @@
from thinc.api import Model, noop
from .parser_model import ParserStepModel
from ..util import registry
@registry.layers("spacy.TransitionModel.v1")
def TransitionModel(
tok2vec, lower, upper, resize_output, dropout=0.2, unseen_classes=set()
):
"""Set up a stepwise transition-based model"""
if upper is None:
has_upper = False
upper = noop()
else:
has_upper = True
# don't define nO for this object, because we can't dynamically change it
return Model(
name="parser_model",
forward=forward,
dims={"nI": tok2vec.maybe_get_dim("nI")},
layers=[tok2vec, lower, upper],
refs={"tok2vec": tok2vec, "lower": lower, "upper": upper},
init=init,
attrs={
"has_upper": has_upper,
"unseen_classes": set(unseen_classes),
"resize_output": resize_output,
},
)
def forward(model, X, is_train):
step_model = ParserStepModel(
X,
model.layers,
unseen_classes=model.attrs["unseen_classes"],
train=is_train,
has_upper=model.attrs["has_upper"],
)
return step_model, step_model.finish_steps
def init(model, X=None, Y=None):
model.get_ref("tok2vec").initialize(X=X)
lower = model.get_ref("lower")
lower.initialize()
if model.attrs["has_upper"]:
statevecs = model.ops.alloc2f(2, lower.get_dim("nO"))
model.get_ref("upper").initialize(X=statevecs)

spacy/ml/tb_framework.pyx (new file, 623 lines)
View File

@ -0,0 +1,623 @@
# cython: infer_types=True, cdivision=True, boundscheck=False
from typing import List, Tuple, Any, Optional, TypeVar, cast
from libc.string cimport memset, memcpy
from libc.stdlib cimport calloc, free, realloc
from libcpp.vector cimport vector
import numpy
cimport numpy as np
from thinc.api import Model, normal_init, chain, list2array, Linear
from thinc.api import uniform_init, glorot_uniform_init, zero_init
from thinc.api import NumpyOps
from thinc.backends.cblas cimport CBlas, saxpy, sgemm
from thinc.types import Floats1d, Floats2d, Floats3d, Floats4d
from thinc.types import Ints1d, Ints2d
from ..errors import Errors
from ..pipeline._parser_internals import _beam_utils
from ..pipeline._parser_internals.batch import GreedyBatch
from ..pipeline._parser_internals._parser_utils cimport arg_max
from ..pipeline._parser_internals.transition_system cimport c_transition_batch, c_apply_actions
from ..pipeline._parser_internals.transition_system cimport TransitionSystem
from ..pipeline._parser_internals.stateclass cimport StateC, StateClass
from ..tokens.doc import Doc
from ..util import registry
State = Any # TODO
@registry.layers("spacy.TransitionModel.v2")
def TransitionModel(
*,
tok2vec: Model[List[Doc], List[Floats2d]],
beam_width: int = 1,
beam_density: float = 0.0,
state_tokens: int,
hidden_width: int,
maxout_pieces: int,
nO: Optional[int] = None,
unseen_classes=set(),
) -> Model[Tuple[List[Doc], TransitionSystem], List[Tuple[State, List[Floats2d]]]]:
"""Set up a transition-based parsing model, using a maxout hidden
layer and a linear output layer.
"""
t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None
tok2vec_projected = chain(tok2vec, list2array(), Linear(hidden_width, t2v_width)) # type: ignore
tok2vec_projected.set_dim("nO", hidden_width)
# FIXME: we use `output` as a container for the output layer's
# weights and biases. Thinc optimizers cannot handle resizing
# of parameters. So, when the parser model is resized, we
# construct a new `output` layer, which has a different key in
# the optimizer. Once the optimizer supports parameter resizing,
# we can replace the `output` layer by `output_W` and `output_b`
# parameters in this model.
output = Linear(nO=None, nI=hidden_width, init_W=zero_init)
return Model(
name="parser_model",
forward=forward,
init=init,
layers=[tok2vec_projected, output],
refs={
"tok2vec": tok2vec_projected,
"output": output,
},
params={
"hidden_W": None, # Floats2d W for the hidden layer
"hidden_b": None, # Floats1d bias for the hidden layer
"hidden_pad": None, # Floats1d padding for the hidden layer
},
dims={
"nO": None, # Output size
"nP": maxout_pieces,
"nH": hidden_width,
"nI": tok2vec_projected.maybe_get_dim("nO"),
"nF": state_tokens,
},
attrs={
"beam_width": beam_width,
"beam_density": beam_density,
"unseen_classes": set(unseen_classes),
"resize_output": resize_output,
},
)
def resize_output(model: Model, new_nO: int) -> Model:
old_nO = model.maybe_get_dim("nO")
output = model.get_ref("output")
if old_nO is None:
model.set_dim("nO", new_nO)
output.set_dim("nO", new_nO)
output.initialize()
return model
elif new_nO <= old_nO:
return model
elif output.has_param("W"):
nH = model.get_dim("nH")
new_output = Linear(nO=new_nO, nI=nH, init_W=zero_init)
new_output.initialize()
new_W = new_output.get_param("W")
new_b = new_output.get_param("b")
old_W = output.get_param("W")
old_b = output.get_param("b")
new_W[:old_nO] = old_W # type: ignore
new_b[:old_nO] = old_b # type: ignore
for i in range(old_nO, new_nO):
model.attrs["unseen_classes"].add(i)
model.layers[-1] = new_output
model.set_ref("output", new_output)
# TODO: Avoid this private intrusion
model._dims["nO"] = new_nO
return model
def init(
model,
X: Optional[Tuple[List[Doc], TransitionSystem]] = None,
Y: Optional[Tuple[List[State], List[Floats2d]]] = None,
):
if X is not None:
docs, moves = X
model.get_ref("tok2vec").initialize(X=docs)
else:
model.get_ref("tok2vec").initialize()
inferred_nO = _infer_nO(Y)
if inferred_nO is not None:
current_nO = model.maybe_get_dim("nO")
if current_nO is None or current_nO != inferred_nO:
model.attrs["resize_output"](model, inferred_nO)
nO = model.get_dim("nO")
nP = model.get_dim("nP")
nH = model.get_dim("nH")
nI = model.get_dim("nI")
nF = model.get_dim("nF")
ops = model.ops
Wl = ops.alloc2f(nH * nP, nF * nI)
bl = ops.alloc1f(nH * nP)
padl = ops.alloc1f(nI)
# Wl = zero_init(ops, Wl.shape)
Wl = glorot_uniform_init(ops, Wl.shape)
padl = uniform_init(ops, padl.shape) # type: ignore
# TODO: Experiment with whether better to initialize output_W
model.set_param("hidden_W", Wl)
model.set_param("hidden_b", bl)
model.set_param("hidden_pad", padl)
# model = _lsuv_init(model)
return model
class TransitionModelInputs:
"""
Input to transition model.
"""
# dataclass annotation is not yet supported in Cython 0.29.x,
# so, we'll do something close to it.
actions: Optional[List[Ints1d]]
docs: List[Doc]
max_moves: int
moves: TransitionSystem
states: Optional[List[State]]
__slots__ = [
"actions",
"docs",
"max_moves",
"moves",
"states",
]
def __init__(
self,
docs: List[Doc],
moves: TransitionSystem,
actions: Optional[List[Ints1d]]=None,
max_moves: int=0,
states: Optional[List[State]]=None):
"""
actions (Optional[List[Ints1d]]): actions to apply for each Doc.
docs (List[Doc]): Docs to predict transition sequences for.
max_moves (int): the maximum number of moves to apply; values less
than 1 apply moves to states until they are final states.
moves (TransitionSystem): the transition system to use when predicting
the transition sequences.
states (Optional[List[State]]): the initial states to predict the
transition sequences for. When absent, the initial states are
initialized from the provided Docs.
"""
self.actions = actions
self.docs = docs
self.moves = moves
self.max_moves = max_moves
self.states = states
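A minimal sketch of driving the new model, assuming `docs`, a TransitionSystem `moves` and an initialized transition `model` are available; on the greedy CPU path the output is a (states, scores) pair:

inputs = TransitionModelInputs(docs=docs, moves=moves)
(states, scores), backprop = model(inputs, is_train=False)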
def forward(model, inputs: TransitionModelInputs, is_train: bool):
docs = inputs.docs
moves = inputs.moves
actions = inputs.actions
beam_width = model.attrs["beam_width"]
hidden_pad = model.get_param("hidden_pad")
tok2vec = model.get_ref("tok2vec")
states = moves.init_batch(docs) if inputs.states is None else inputs.states
tokvecs, backprop_tok2vec = tok2vec(docs, is_train)
tokvecs = model.ops.xp.vstack((tokvecs, hidden_pad))
feats, backprop_feats = _forward_precomputable_affine(model, tokvecs, is_train)
seen_mask = _get_seen_mask(model)
if not is_train and beam_width == 1 and isinstance(model.ops, NumpyOps):
# Note: max_moves is only used during training, so we don't need to
# pass it to the greedy inference path.
return _forward_greedy_cpu(model, moves, states, feats, seen_mask, actions=actions)
else:
return _forward_fallback(model, moves, states, tokvecs, backprop_tok2vec,
feats, backprop_feats, seen_mask, is_train, actions=actions,
max_moves=inputs.max_moves)
def _forward_greedy_cpu(model: Model, TransitionSystem moves, states: List[StateClass], np.ndarray feats,
np.ndarray[np.npy_bool, ndim=1] seen_mask, actions: Optional[List[Ints1d]]=None):
cdef vector[StateC*] c_states
cdef StateClass state
for state in states:
if not state.is_final():
c_states.push_back(state.c)
weights = _get_c_weights(model, <float*>feats.data, seen_mask)
# Precomputed features have rows for each token, plus one for padding.
cdef int n_tokens = feats.shape[0] - 1
sizes = _get_c_sizes(model, c_states.size(), n_tokens)
cdef CBlas cblas = model.ops.cblas()
scores = _parse_batch(cblas, moves, &c_states[0], weights, sizes, actions=actions)
def backprop(dY):
raise ValueError(Errors.E4004)
return (states, scores), backprop
cdef list _parse_batch(CBlas cblas, TransitionSystem moves, StateC** states,
WeightsC weights, SizesC sizes, actions: Optional[List[Ints1d]]=None):
cdef int i, j
cdef vector[StateC *] unfinished
cdef ActivationsC activations = _alloc_activations(sizes)
cdef np.ndarray step_scores
cdef np.ndarray step_actions
scores = []
while sizes.states >= 1 and (actions is None or len(actions) > 0):
step_scores = numpy.empty((sizes.states, sizes.classes), dtype="f")
step_actions = actions[0] if actions is not None else None
assert step_actions is None or step_actions.size == sizes.states, \
f"number of step actions ({step_actions.size}) must equal number of states ({sizes.states})"
with nogil:
_predict_states(cblas, &activations, <float*>step_scores.data, states, &weights, sizes)
if actions is None:
# Validate actions, argmax, take action.
c_transition_batch(moves, states, <const float*>step_scores.data, sizes.classes,
sizes.states)
else:
c_apply_actions(moves, states, <const int*>step_actions.data, sizes.states)
for i in range(sizes.states):
if not states[i].is_final():
unfinished.push_back(states[i])
for i in range(unfinished.size()):
states[i] = unfinished[i]
sizes.states = unfinished.size()
scores.append(step_scores)
unfinished.clear()
actions = actions[1:] if actions is not None else None
_free_activations(&activations)
return scores
def _forward_fallback(
model: Model,
moves: TransitionSystem,
states: List[StateClass],
tokvecs, backprop_tok2vec,
feats,
backprop_feats,
seen_mask,
is_train: bool,
actions: Optional[List[Ints1d]]=None,
max_moves: int=0):
nF = model.get_dim("nF")
output = model.get_ref("output")
hidden_b = model.get_param("hidden_b")
nH = model.get_dim("nH")
nP = model.get_dim("nP")
beam_width = model.attrs["beam_width"]
beam_density = model.attrs["beam_density"]
ops = model.ops
all_ids = []
all_which = []
all_statevecs = []
all_scores = []
if beam_width == 1:
batch = GreedyBatch(moves, states, None)
else:
batch = _beam_utils.BeamBatch(
moves, states, None, width=beam_width, density=beam_density
)
arange = ops.xp.arange(nF)
n_moves = 0
while not batch.is_done:
ids = numpy.zeros((len(batch.get_unfinished_states()), nF), dtype="i")
for i, state in enumerate(batch.get_unfinished_states()):
state.set_context_tokens(ids, i, nF)
# Sum the state features, add the bias and apply the activation (maxout)
# to create the state vectors.
preacts2f = feats[ids, arange].sum(axis=1) # type: ignore
preacts2f += hidden_b
preacts = ops.reshape3f(preacts2f, preacts2f.shape[0], nH, nP)
assert preacts.shape[0] == len(batch.get_unfinished_states()), preacts.shape
statevecs, which = ops.maxout(preacts)
# We don't use output's backprop, since we want to backprop for
# all states at once, rather than a single state.
scores = output.predict(statevecs)
scores[:, seen_mask] = ops.xp.nanmin(scores)
# Transition the states, filtering out any that are finished.
cpu_scores = ops.to_numpy(scores)
if actions is None:
batch.advance(cpu_scores)
else:
batch.advance_with_actions(actions[0])
actions = actions[1:]
all_scores.append(scores)
if is_train:
# Remember intermediate results for the backprop.
all_ids.append(ids)
all_statevecs.append(statevecs)
all_which.append(which)
if n_moves >= max_moves >= 1:
break
n_moves += 1
def backprop_parser(d_states_d_scores):
ids = ops.xp.vstack(all_ids)
which = ops.xp.vstack(all_which)
statevecs = ops.xp.vstack(all_statevecs)
_, d_scores = d_states_d_scores
if model.attrs.get("unseen_classes"):
# If we have a negative gradient (i.e. the probability should
# increase) on any classes we filtered out as unseen, mark
# them as seen.
for clas in set(model.attrs["unseen_classes"]):
if (d_scores[:, clas] < 0).any():
model.attrs["unseen_classes"].remove(clas)
d_scores *= seen_mask == False
# Calculate the gradients for the parameters of the output layer.
# The weight gemm is (nS, nO) @ (nS, nH).T
output.inc_grad("b", d_scores.sum(axis=0))
output.inc_grad("W", ops.gemm(d_scores, statevecs, trans1=True))
# Now calculate d_statevecs, by backproping through the output linear layer.
# This gemm is (nS, nO) @ (nO, nH)
output_W = output.get_param("W")
d_statevecs = ops.gemm(d_scores, output_W)
# Backprop through the maxout activation
d_preacts = ops.backprop_maxout(d_statevecs, which, nP)
d_preacts2f = ops.reshape2f(d_preacts, d_preacts.shape[0], nH * nP)
model.inc_grad("hidden_b", d_preacts2f.sum(axis=0))
# We don't need to backprop the summation, because we pass back the IDs instead
d_state_features = backprop_feats((d_preacts2f, ids))
d_tokvecs = ops.alloc2f(tokvecs.shape[0], tokvecs.shape[1])
ops.scatter_add(d_tokvecs, ids, d_state_features)
model.inc_grad("hidden_pad", d_tokvecs[-1])
return (backprop_tok2vec(d_tokvecs[:-1]), None)
return (list(batch), all_scores), backprop_parser
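# A hedged numpy sketch of the two output-layer gemms in backprop_parser above,
# with illustrative names: for scores = statevecs @ W.T + b, the parameter
# gradients are d_b = d_scores.sum(0) and d_W = d_scores.T @ statevecs, and the
# gradient flowing back into the state vectors is d_scores @ W.
import numpy
def _output_grads_sketch(statevecs, W, d_scores):
    d_b = d_scores.sum(axis=0)                # (nO,)
    d_W = d_scores.T @ statevecs              # (nS, nO).T @ (nS, nH) -> (nO, nH)
    d_statevecs = d_scores @ W                # (nS, nO) @ (nO, nH) -> (nS, nH)
    return d_W, d_b, d_statevecs
# rng = numpy.random.default_rng(0); nS, nH, nO = 6, 4, 3
# d_W, d_b, d_sv = _output_grads_sketch(rng.normal(size=(nS, nH)),
#                                        rng.normal(size=(nO, nH)),
#                                        rng.normal(size=(nS, nO)))
# assert d_W.shape == (nO, nH) and d_sv.shape == (nS, nH)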
def _get_seen_mask(model: Model) -> numpy.array[bool, 1]:
mask = model.ops.xp.zeros(model.get_dim("nO"), dtype="bool")
for class_ in model.attrs.get("unseen_classes", set()):
mask[class_] = True
return mask
def _forward_precomputable_affine(model, X: Floats2d, is_train: bool):
W: Floats2d = model.get_param("hidden_W")
nF = model.get_dim("nF")
nH = model.get_dim("nH")
nP = model.get_dim("nP")
nI = model.get_dim("nI")
# The weights start out (nH * nP, nF * nI). Transpose and reshape to (nF * nH * nP, nI).
W3f = model.ops.reshape3f(W, nH * nP, nF, nI)
W3f = W3f.transpose((1, 0, 2))
W2f = model.ops.reshape2f(W3f, nF * nH * nP, nI)
assert X.shape == (X.shape[0], nI), X.shape
Yf_ = model.ops.gemm(X, W2f, trans2=True)
Yf = model.ops.reshape3f(Yf_, Yf_.shape[0], nF, nH * nP)
def backward(dY_ids: Tuple[Floats3d, Ints2d]):
# This backprop is particularly tricky, because we get back a different
# thing from what we put out. We put out an array of shape:
# (nB, nF, nH, nP), and get back:
# (nB, nH, nP) and ids (nB, nF)
# The ids tell us the values of nF, so we would have:
#
# dYf = zeros((nB, nF, nH, nP))
# for b in range(nB):
# for f in range(nF):
# dYf[b, ids[b, f]] += dY[b]
#
# However, we avoid building that array for efficiency -- and just pass
# in the indices.
dY, ids = dY_ids
dXf = model.ops.gemm(dY, W)
Xf = X[ids].reshape((ids.shape[0], -1))
dW = model.ops.gemm(dY, Xf, trans1=True)
model.inc_grad("hidden_W", dW)
return model.ops.reshape3f(dXf, dXf.shape[0], nF, nI)
return Yf, backward
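# A self-contained numpy check of the precomputation trick above (illustrative
# names, not the Thinc code): caching W[:, f-block] @ tokvecs[t] for every
# token t and feature slot f lets the per-state hidden pre-activation be built
# by summing nF cached rows instead of redoing the full affine per state.
import numpy
rng = numpy.random.default_rng(0)
nF, nH, nP, nI, n_tokens = 3, 4, 2, 5, 7
W = rng.normal(size=(nH * nP, nF * nI))                      # hidden_W
tokvecs = rng.normal(size=(n_tokens, nI))
W3f = W.reshape(nH * nP, nF, nI).transpose(1, 0, 2)          # (nF, nH*nP, nI)
cached = numpy.stack([tokvecs @ W3f[f].T for f in range(nF)], axis=1)  # (n_tokens, nF, nH*nP)
ids = numpy.array([2, 0, 5])                                 # one context token per slot
fast = cached[ids, numpy.arange(nF)].sum(axis=0)             # as in the forward pass
direct = W @ tokvecs[ids].reshape(-1)                        # plain affine over concatenated features
assert numpy.allclose(fast, direct)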
def _infer_nO(Y: Optional[Tuple[List[State], List[Floats2d]]]) -> Optional[int]:
if Y is None:
return None
_, scores = Y
if len(scores) == 0:
return None
assert scores[0].shape[0] >= 1
assert len(scores[0].shape) == 2
return scores[0].shape[1]
def _lsuv_init(model: Model):
"""This is like the 'layer sequential unit variance', but instead
of taking the actual inputs, we randomly generate whitened data.
Why's this all so complicated? We have a huge number of inputs,
and the maxout unit makes guessing the dynamics tricky. Instead
we set the maxout weights to values that empirically result in
whitened outputs given whitened inputs.
"""
W = model.maybe_get_param("hidden_W")
if W is not None and W.any():
return
nF = model.get_dim("nF")
nH = model.get_dim("nH")
nP = model.get_dim("nP")
nI = model.get_dim("nI")
W = model.ops.alloc4f(nF, nH, nP, nI)
b = model.ops.alloc2f(nH, nP)
pad = model.ops.alloc4f(1, nF, nH, nP)
ops = model.ops
W = normal_init(ops, W.shape, mean=float(ops.xp.sqrt(1.0 / nF * nI)))
pad = normal_init(ops, pad.shape, mean=1.0)
model.set_param("W", W)
model.set_param("b", b)
model.set_param("pad", pad)
ids = ops.alloc_f((5000, nF), dtype="f")
ids += ops.xp.random.uniform(0, 1000, ids.shape)
ids = ops.asarray(ids, dtype="i")
tokvecs = ops.alloc_f((5000, nI), dtype="f")
tokvecs += ops.xp.random.normal(loc=0.0, scale=1.0, size=tokvecs.size).reshape(
tokvecs.shape
)
def predict(ids, tokvecs):
# nS ids. nW tokvecs. Exclude the padding array.
hiddens, _ = _forward_precomputable_affine(model, tokvecs[:-1], False)
vectors = model.ops.alloc2f(ids.shape[0], nH * nP)
# need nS vectors
hiddens = hiddens.reshape((hiddens.shape[0] * nF, nH * nP))
model.ops.scatter_add(vectors, ids.flatten(), hiddens)
vectors3f = model.ops.reshape3f(vectors, vectors.shape[0], nH, nP)
vectors3f += b
return model.ops.maxout(vectors3f)[0]
tol_var = 0.01
tol_mean = 0.01
t_max = 10
W = cast(Floats4d, model.get_param("hidden_W").copy())
b = cast(Floats2d, model.get_param("hidden_b").copy())
for t_i in range(t_max):
acts1 = predict(ids, tokvecs)
var = model.ops.xp.var(acts1)
mean = model.ops.xp.mean(acts1)
if abs(var - 1.0) >= tol_var:
W /= model.ops.xp.sqrt(var)
model.set_param("hidden_W", W)
elif abs(mean) >= tol_mean:
b -= mean
model.set_param("hidden_b", b)
else:
break
return model
cdef WeightsC _get_c_weights(model, const float* feats, np.ndarray[np.npy_bool, ndim=1] seen_mask) except *:
output = model.get_ref("output")
cdef np.ndarray hidden_b = model.get_param("hidden_b")
cdef np.ndarray output_W = output.get_param("W")
cdef np.ndarray output_b = output.get_param("b")
cdef WeightsC weights
weights.feat_weights = feats
weights.feat_bias = <const float*>hidden_b.data
weights.hidden_weights = <const float *> output_W.data
weights.hidden_bias = <const float *> output_b.data
weights.seen_mask = <const int8_t*> seen_mask.data
return weights
cdef SizesC _get_c_sizes(model, int batch_size, int tokens) except *:
cdef SizesC sizes
sizes.states = batch_size
sizes.classes = model.get_dim("nO")
sizes.hiddens = model.get_dim("nH")
sizes.pieces = model.get_dim("nP")
sizes.feats = model.get_dim("nF")
sizes.embed_width = model.get_dim("nI")
sizes.tokens = tokens
return sizes
cdef ActivationsC _alloc_activations(SizesC n) nogil:
cdef ActivationsC A
memset(&A, 0, sizeof(A))
_resize_activations(&A, n)
return A
cdef void _free_activations(const ActivationsC* A) nogil:
free(A.token_ids)
free(A.unmaxed)
free(A.hiddens)
free(A.is_valid)
cdef void _resize_activations(ActivationsC* A, SizesC n) nogil:
if n.states <= A._max_size:
A._curr_size = n.states
return
if A._max_size == 0:
A.token_ids = <int*>calloc(n.states * n.feats, sizeof(A.token_ids[0]))
A.unmaxed = <float*>calloc(n.states * n.hiddens * n.pieces, sizeof(A.unmaxed[0]))
A.hiddens = <float*>calloc(n.states * n.hiddens, sizeof(A.hiddens[0]))
A.is_valid = <int*>calloc(n.states * n.classes, sizeof(A.is_valid[0]))
A._max_size = n.states
else:
A.token_ids = <int*>realloc(A.token_ids,
n.states * n.feats * sizeof(A.token_ids[0]))
A.unmaxed = <float*>realloc(A.unmaxed,
n.states * n.hiddens * n.pieces * sizeof(A.unmaxed[0]))
A.hiddens = <float*>realloc(A.hiddens,
n.states * n.hiddens * sizeof(A.hiddens[0]))
A.is_valid = <int*>realloc(A.is_valid,
n.states * n.classes * sizeof(A.is_valid[0]))
A._max_size = n.states
A._curr_size = n.states
cdef void _predict_states(CBlas cblas, ActivationsC* A, float* scores, StateC** states, const WeightsC* W, SizesC n) nogil:
_resize_activations(A, n)
for i in range(n.states):
states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats)
memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float))
_sum_state_features(cblas, A.unmaxed, W.feat_weights, A.token_ids, n)
for i in range(n.states):
saxpy(cblas)(n.hiddens * n.pieces, 1., W.feat_bias, 1, &A.unmaxed[i*n.hiddens*n.pieces], 1)
for j in range(n.hiddens):
index = i * n.hiddens * n.pieces + j * n.pieces
which = arg_max(&A.unmaxed[index], n.pieces)
A.hiddens[i*n.hiddens + j] = A.unmaxed[index + which]
if W.hidden_weights == NULL:
memcpy(scores, A.hiddens, n.states * n.classes * sizeof(float))
else:
# Compute hidden-to-output
sgemm(cblas)(False, True, n.states, n.classes, n.hiddens,
1.0, <const float *>A.hiddens, n.hiddens,
<const float *>W.hidden_weights, n.hiddens,
0.0, scores, n.classes)
# Add bias
for i in range(n.states):
saxpy(cblas)(n.classes, 1., W.hidden_bias, 1, &scores[i*n.classes], 1)
# Set unseen classes to minimum value
i = 0
min_ = scores[0]
for i in range(1, n.states * n.classes):
if scores[i] < min_:
min_ = scores[i]
for i in range(n.states):
for j in range(n.classes):
if W.seen_mask[j]:
scores[i*n.classes+j] = min_
cdef void _sum_state_features(CBlas cblas, float* output,
const float* cached, const int* token_ids, SizesC n) nogil:
cdef int idx, b, f, i
cdef const float* feature
cdef int B = n.states
cdef int O = n.hiddens * n.pieces
cdef int F = n.feats
cdef int T = n.tokens
padding = cached + (T * F * O)
cdef int id_stride = F*O
cdef float one = 1.
for b in range(B):
for f in range(F):
if token_ids[f] < 0:
feature = &padding[f*O]
else:
idx = token_ids[f] * id_stride + f*O
feature = &cached[idx]
saxpy(cblas)(O, one, <const float*>feature, 1, &output[b*O], 1)
token_ids += F
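# A hedged numpy sketch of the padding convention above: the cached feature
# table carries one extra trailing block for padding, and any feature slot
# whose token id is negative draws its contribution from that block
# (names illustrative).
import numpy
n_tokens, nF, width = 4, 3, 5
cached = numpy.arange((n_tokens + 1) * nF * width, dtype="f").reshape(n_tokens + 1, nF, width)
padding = cached[-1]                          # the appended padding block, shape (nF, width)
token_ids = [2, -1, 0]                        # slot 1 is out of range -> use padding
rows = [cached[t, f] if t >= 0 else padding[f] for f, t in enumerate(token_ids)]
summed = numpy.sum(rows, axis=0)              # the state's summed feature vector
assert summed.shape == (width,)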

View File

@ -1,23 +1,41 @@
from cymem.cymem cimport Pool
from preshed.maps cimport PreshMap
cimport numpy as np
from libc.stdint cimport uint64_t
from libc.stdint cimport uint32_t, uint64_t
from libcpp.unordered_map cimport unordered_map
from libcpp.vector cimport vector
from libcpp.memory cimport shared_ptr
from .structs cimport MorphAnalysisC
from .strings cimport StringStore
from .typedefs cimport attr_t, hash_t
cdef cppclass Feature:
hash_t field
hash_t value
__init__():
this.field = 0
this.value = 0
cdef cppclass MorphAnalysisC:
hash_t key
vector[Feature] features
__init__():
this.key = 0
cdef class Morphology:
cdef readonly Pool mem
cdef readonly StringStore strings
cdef PreshMap tags # Keyed by hash, value is pointer to tag
cdef unordered_map[hash_t, shared_ptr[MorphAnalysisC]] tags
cdef MorphAnalysisC create_morph_tag(self, field_feature_pairs) except *
cdef int insert(self, MorphAnalysisC tag) except -1
cdef shared_ptr[MorphAnalysisC] _lookup_tag(self, hash_t tag_hash)
cdef void _intern_morph_tag(self, hash_t tag_key, feats)
cdef hash_t _add(self, features)
cdef str _normalize_features(self, features)
cdef str get_morph_str(self, hash_t morph_key)
cdef shared_ptr[MorphAnalysisC] get_morph_c(self, hash_t morph_key)
cdef int check_feature(const MorphAnalysisC* morph, attr_t feature) nogil
cdef list list_features(const MorphAnalysisC* morph)
cdef np.ndarray get_by_field(const MorphAnalysisC* morph, attr_t field)
cdef int get_n_by_field(attr_t* results, const MorphAnalysisC* morph, attr_t field) nogil
cdef int check_feature(const shared_ptr[MorphAnalysisC] morph, attr_t feature) nogil
cdef list list_features(const shared_ptr[MorphAnalysisC] morph)
cdef np.ndarray get_by_field(const shared_ptr[MorphAnalysisC] morph, attr_t field)
cdef int get_n_by_field(attr_t* results, const shared_ptr[MorphAnalysisC] morph, attr_t field) nogil

View File

@ -1,10 +1,10 @@
# cython: infer_types
import numpy
import warnings
from typing import Union, Tuple, List, Dict, Optional
from cython.operator cimport dereference as deref
from libcpp.memory cimport shared_ptr
from .attrs cimport POS
from .parts_of_speech import IDS as POS_IDS
from .errors import Warnings
from . import symbols
@ -24,134 +24,187 @@ cdef class Morphology:
EMPTY_MORPH = symbols.NAMES[symbols._]
def __init__(self, StringStore strings):
self.mem = Pool()
self.strings = strings
self.tags = PreshMap()
def __reduce__(self):
tags = set([self.get(self.strings[s]) for s in self.strings])
tags -= set([""])
return (unpickle_morphology, (self.strings, sorted(tags)), None, None)
def add(self, features):
cdef shared_ptr[MorphAnalysisC] _lookup_tag(self, hash_t tag_hash):
match = self.tags.find(tag_hash)
if match != self.tags.const_end():
return deref(match).second
else:
return shared_ptr[MorphAnalysisC]()
def _normalize_attr(self, attr_key : Union[int, str], attr_value : Union[int, str]) -> Optional[Tuple[str, Union[str, List[str]]]]:
if isinstance(attr_key, (int, str)) and isinstance(attr_value, (int, str)):
attr_key = self.strings.as_string(attr_key)
attr_value = self.strings.as_string(attr_value)
# Preserve multiple values as a list
if self.VALUE_SEP in attr_value:
values = attr_value.split(self.VALUE_SEP)
values.sort()
attr_value = values
else:
warnings.warn(Warnings.W100.format(feature={attr_key: attr_value}))
return None
return attr_key, attr_value
def _str_to_normalized_feat_dict(self, feats: str) -> Dict[str, str]:
if not feats or feats == self.EMPTY_MORPH:
return {}
out = []
for feat in feats.split(self.FEATURE_SEP):
field, values = feat.split(self.FIELD_SEP, 1)
normalized_attr = self._normalize_attr(field, values)
if normalized_attr is None:
continue
out.append((normalized_attr[0], normalized_attr[1]))
out.sort(key=lambda x: x[0])
return dict(out)
def _dict_to_normalized_feat_dict(self, feats: Dict[Union[int, str], Union[int, str]]) -> Dict[str, str]:
out = []
for field, values in feats.items():
normalized_attr = self._normalize_attr(field, values)
if normalized_attr is None:
continue
out.append((normalized_attr[0], normalized_attr[1]))
out.sort(key=lambda x: x[0])
return dict(out)
def _normalized_feat_dict_to_str(self, feats: Dict[str, str]) -> str:
norm_feats_string = self.FEATURE_SEP.join([
self.FIELD_SEP.join([field, self.VALUE_SEP.join(values) if isinstance(values, list) else values])
for field, values in feats.items()
])
return norm_feats_string or self.EMPTY_MORPH
cdef hash_t _add(self, features):
"""Insert a morphological analysis in the morphology table, if not
already present. The morphological analysis may be provided in the UD
FEATS format as a string or in the tag map dict format.
Returns the hash of the new analysis.
"""
cdef MorphAnalysisC* tag_ptr
cdef hash_t tag_hash = 0
cdef shared_ptr[MorphAnalysisC] tag
if isinstance(features, str):
if features == "":
features = self.EMPTY_MORPH
tag_ptr = <MorphAnalysisC*>self.tags.get(<hash_t>self.strings[features])
if tag_ptr != NULL:
return tag_ptr.key
features = self.feats_to_dict(features)
if not isinstance(features, dict):
tag_hash = self.strings[features]
tag = self._lookup_tag(tag_hash)
if tag:
return deref(tag).key
features = self._str_to_normalized_feat_dict(features)
elif isinstance(features, dict):
features = self._dict_to_normalized_feat_dict(features)
else:
warnings.warn(Warnings.W100.format(feature=features))
features = {}
string_features = {self.strings.as_string(field): self.strings.as_string(values) for field, values in features.items()}
# intified ("Field", "Field=Value") pairs
field_feature_pairs = []
for field in sorted(string_features):
values = string_features[field]
for value in values.split(self.VALUE_SEP):
field_feature_pairs.append((
self.strings.add(field),
self.strings.add(field + self.FIELD_SEP + value),
))
cdef MorphAnalysisC tag = self.create_morph_tag(field_feature_pairs)
# the hash key for the tag is either the hash of the normalized UFEATS
# string or the hash of an empty placeholder
norm_feats_string = self.normalize_features(features)
tag.key = self.strings.add(norm_feats_string)
self.insert(tag)
return tag.key
norm_feats_string = self._normalized_feat_dict_to_str(features)
tag_hash = self.strings.add(norm_feats_string)
tag = self._lookup_tag(tag_hash)
if tag:
return deref(tag).key
def normalize_features(self, features):
self._intern_morph_tag(tag_hash, features)
return tag_hash
cdef void _intern_morph_tag(self, hash_t tag_key, feats):
# intified ("Field", "Field=Value") pairs where fields with multiple values have
# been split into individual tuples, e.g.:
# [("Field1", "Field1=Value1"), ("Field1", "Field1=Value2"),
# ("Field2", "Field2=Value3")]
field_feature_pairs = []
# Feat dict is normalized at this point.
for field, values in feats.items():
field_key = self.strings.add(field)
if isinstance(values, list):
for value in values:
value_key = self.strings.add(field + self.FIELD_SEP + value)
field_feature_pairs.append((field_key, value_key))
else:
# We could box scalar values into a list and use a common
# code path to generate features but that incurs a small
# but measurable allocation/iteration overhead (as this
# branch is taken often enough).
value_key = self.strings.add(field + self.FIELD_SEP + values)
field_feature_pairs.append((field_key, value_key))
num_features = len(field_feature_pairs)
cdef shared_ptr[MorphAnalysisC] tag = shared_ptr[MorphAnalysisC](new MorphAnalysisC())
deref(tag).key = tag_key
deref(tag).features.resize(num_features)
for i in range(num_features):
deref(tag).features[i].field = field_feature_pairs[i][0]
deref(tag).features[i].value = field_feature_pairs[i][1]
self.tags[tag_key] = tag
cdef str get_morph_str(self, hash_t morph_key):
cdef shared_ptr[MorphAnalysisC] tag = self._lookup_tag(morph_key)
if not tag:
return ""
else:
return self.strings[deref(tag).key]
cdef shared_ptr[MorphAnalysisC] get_morph_c(self, hash_t morph_key):
return self._lookup_tag(morph_key)
cdef str _normalize_features(self, features):
"""Create a normalized FEATS string from a features string or dict.
features (Union[dict, str]): Features as dict or UFEATS string.
RETURNS (str): Features as normalized UFEATS string.
"""
if isinstance(features, str):
features = self.feats_to_dict(features)
if not isinstance(features, dict):
features = self._str_to_normalized_feat_dict(features)
elif isinstance(features, dict):
features = self._dict_to_normalized_feat_dict(features)
else:
warnings.warn(Warnings.W100.format(feature=features))
features = {}
features = self.normalize_attrs(features)
string_features = {self.strings.as_string(field): self.strings.as_string(values) for field, values in features.items()}
# normalized UFEATS string with sorted fields and values
norm_feats_string = self.FEATURE_SEP.join(sorted([
self.FIELD_SEP.join([field, values])
for field, values in string_features.items()
]))
return norm_feats_string or self.EMPTY_MORPH
def normalize_attrs(self, attrs):
"""Convert attrs dict so that POS is always by ID, other features are
by string. Values separated by VALUE_SEP are sorted.
"""
out = {}
attrs = dict(attrs)
for key, value in attrs.items():
# convert POS value to ID
if key == POS or (isinstance(key, str) and key.upper() == "POS"):
if isinstance(value, str) and value.upper() in POS_IDS:
value = POS_IDS[value.upper()]
elif isinstance(value, int) and value not in POS_IDS.values():
warnings.warn(Warnings.W100.format(feature={key: value}))
continue
out[POS] = value
# accept any string or ID fields and values and convert to strings
elif isinstance(key, (int, str)) and isinstance(value, (int, str)):
key = self.strings.as_string(key)
value = self.strings.as_string(value)
# sort values
if self.VALUE_SEP in value:
value = self.VALUE_SEP.join(sorted(value.split(self.VALUE_SEP)))
out[key] = value
else:
warnings.warn(Warnings.W100.format(feature={key: value}))
return out
return self._normalized_feat_dict_to_str(features)
cdef MorphAnalysisC create_morph_tag(self, field_feature_pairs) except *:
"""Creates a MorphAnalysisC from a list of intified
("Field", "Field=Value") tuples where fields with multiple values have
been split into individual tuples, e.g.:
[("Field1", "Field1=Value1"), ("Field1", "Field1=Value2"),
("Field2", "Field2=Value3")]
"""
cdef MorphAnalysisC tag
tag.length = len(field_feature_pairs)
if tag.length > 0:
tag.fields = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
tag.features = <attr_t*>self.mem.alloc(tag.length, sizeof(attr_t))
for i, (field, feature) in enumerate(field_feature_pairs):
tag.fields[i] = field
tag.features[i] = feature
return tag
def add(self, features):
return self._add(features)
cdef int insert(self, MorphAnalysisC tag) except -1:
cdef hash_t key = tag.key
if self.tags.get(key) == NULL:
tag_ptr = <MorphAnalysisC*>self.mem.alloc(1, sizeof(MorphAnalysisC))
tag_ptr[0] = tag
self.tags.set(key, <void*>tag_ptr)
def get(self, morph_key):
return self.get_morph_str(morph_key)
def get(self, hash_t morph):
tag = <MorphAnalysisC*>self.tags.get(morph)
if tag == NULL:
return ""
else:
return self.strings[tag.key]
def normalize_features(self, features):
return self._normalize_features(features)
@staticmethod
def feats_to_dict(feats):
def feats_to_dict(feats, *, sort_values=True):
if not feats or feats == Morphology.EMPTY_MORPH:
return {}
return {field: Morphology.VALUE_SEP.join(sorted(values.split(Morphology.VALUE_SEP))) for field, values in
[feat.split(Morphology.FIELD_SEP) for feat in feats.split(Morphology.FEATURE_SEP)]}
out = {}
for feat in feats.split(Morphology.FEATURE_SEP):
field, values = feat.split(Morphology.FIELD_SEP, 1)
if sort_values:
values = values.split(Morphology.VALUE_SEP)
values.sort()
values = Morphology.VALUE_SEP.join(values)
out[field] = values
return out
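# A pure-Python illustration of the FEATS handling above, assuming the usual
# UD separators (FEATURE_SEP="|", FIELD_SEP="=", VALUE_SEP=",") and "_" as the
# empty-morphology placeholder; the helper name is hypothetical.
def _feats_to_dict_sketch(feats, *, sort_values=True):
    if not feats or feats == "_":
        return {}
    out = {}
    for feat in feats.split("|"):
        field, values = feat.split("=", 1)
        if sort_values:
            values = ",".join(sorted(values.split(",")))
        out[field] = values
    return out
assert _feats_to_dict_sketch("Degree=Sup,Cmp|Number=Sing") == {"Degree": "Cmp,Sup", "Number": "Sing"}
assert _feats_to_dict_sketch("_") == {}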
@staticmethod
def dict_to_feats(feats_dict):
@ -160,34 +213,34 @@ cdef class Morphology:
return Morphology.FEATURE_SEP.join(sorted([Morphology.FIELD_SEP.join([field, Morphology.VALUE_SEP.join(sorted(values.split(Morphology.VALUE_SEP)))]) for field, values in feats_dict.items()]))
cdef int check_feature(const MorphAnalysisC* morph, attr_t feature) nogil:
cdef int check_feature(const shared_ptr[MorphAnalysisC] morph, attr_t feature) nogil:
cdef int i
for i in range(morph.length):
if morph.features[i] == feature:
for i in range(deref(morph).features.size()):
if deref(morph).features[i].value == feature:
return True
return False
cdef list list_features(const MorphAnalysisC* morph):
cdef list list_features(const shared_ptr[MorphAnalysisC] morph):
cdef int i
features = []
for i in range(morph.length):
features.append(morph.features[i])
for i in range(deref(morph).features.size()):
features.append(deref(morph).features[i].value)
return features
cdef np.ndarray get_by_field(const MorphAnalysisC* morph, attr_t field):
cdef np.ndarray results = numpy.zeros((morph.length,), dtype="uint64")
cdef np.ndarray get_by_field(const shared_ptr[MorphAnalysisC] morph, attr_t field):
cdef np.ndarray results = numpy.zeros((deref(morph).features.size(),), dtype="uint64")
n = get_n_by_field(<uint64_t*>results.data, morph, field)
return results[:n]
cdef int get_n_by_field(attr_t* results, const MorphAnalysisC* morph, attr_t field) nogil:
cdef int get_n_by_field(attr_t* results, const shared_ptr[MorphAnalysisC] morph, attr_t field) nogil:
cdef int n_results = 0
cdef int i
for i in range(morph.length):
if morph.fields[i] == field:
results[n_results] = morph.features[i]
for i in range(deref(morph).features.size()):
if deref(morph).features[i].field == field:
results[n_results] = deref(morph).features[i].value
n_results += 1
return n_results

View File

@ -3,22 +3,22 @@ from . cimport symbols
cpdef enum univ_pos_t:
NO_TAG = 0
ADJ = symbols.ADJ
ADP
ADV
AUX
CONJ
CCONJ # U20
DET
INTJ
NOUN
NUM
PART
PRON
PROPN
PUNCT
SCONJ
SYM
VERB
X
EOL
SPACE
ADP = symbols.ADP
ADV = symbols.ADV
AUX = symbols.AUX
CONJ = symbols.CONJ
CCONJ = symbols.CCONJ # U20
DET = symbols.DET
INTJ = symbols.INTJ
NOUN = symbols.NOUN
NUM = symbols.NUM
PART = symbols.PART
PRON = symbols.PRON
PROPN = symbols.PROPN
PUNCT = symbols.PUNCT
SCONJ = symbols.SCONJ
SYM = symbols.SYM
VERB = symbols.VERB
X = symbols.X
EOL = symbols.EOL
SPACE = symbols.SPACE

View File

@ -1,8 +1,7 @@
from .attributeruler import AttributeRuler
from .attribute_ruler import AttributeRuler
from .dep_parser import DependencyParser
from .edit_tree_lemmatizer import EditTreeLemmatizer
from .entity_linker import EntityLinker
from .entityruler import EntityRuler
from .functions import merge_entities, merge_noun_chunks, merge_subtokens
from .lemmatizer import Lemmatizer
from .morphologizer import Morphologizer
@ -24,7 +23,6 @@ __all__ = [
"DependencyParser",
"EntityLinker",
"EntityRecognizer",
"EntityRuler",
"Morphologizer",
"Lemmatizer",
"MultiLabel_TextCategorizer",

View File

@ -1,6 +1,6 @@
from ...typedefs cimport class_t, hash_t
# These are passed as callbacks to thinc.search.Beam
# These are passed as callbacks to .search.Beam
cdef int transition_state(void* _dest, void* _src, class_t clas, void* _moves) except -1
cdef int check_final_state(void* _state, void* extra_args) except -1

View File

@ -3,17 +3,17 @@
cimport numpy as np
import numpy
from cpython.ref cimport PyObject, Py_XDECREF
from thinc.extra.search cimport Beam
from thinc.extra.search import MaxViolation
from thinc.extra.search cimport MaxViolation
from ...typedefs cimport hash_t, class_t
from .transition_system cimport TransitionSystem, Transition
from ...errors import Errors
from .batch cimport Batch
from .search cimport Beam, MaxViolation
from .search import MaxViolation
from .stateclass cimport StateC, StateClass
# These are passed as callbacks to thinc.search.Beam
# These are passed as callbacks to .search.Beam
cdef int transition_state(void* _dest, void* _src, class_t clas, void* _moves) except -1:
dest = <StateC*>_dest
src = <StateC*>_src
@ -27,7 +27,7 @@ cdef int check_final_state(void* _state, void* extra_args) except -1:
return state.is_final()
cdef class BeamBatch(object):
cdef class BeamBatch(Batch):
cdef public TransitionSystem moves
cdef public object states
cdef public object docs

View File

@ -0,0 +1,2 @@
cdef int arg_max(const float* scores, const int n_classes) nogil
cdef int arg_max_if_valid(const float* scores, const int* is_valid, int n) nogil

View File

@ -0,0 +1,22 @@
# cython: infer_types=True
cdef inline int arg_max(const float* scores, const int n_classes) nogil:
if n_classes == 2:
return 0 if scores[0] > scores[1] else 1
cdef int i
cdef int best = 0
cdef float mode = scores[0]
for i in range(1, n_classes):
if scores[i] > mode:
mode = scores[i]
best = i
return best
cdef inline int arg_max_if_valid(const float* scores, const int* is_valid, int n) nogil:
cdef int best = -1
for i in range(n):
if is_valid[i] >= 1:
if best == -1 or scores[i] > scores[best]:
best = i
return best

View File

@ -6,7 +6,6 @@ cimport libcpp
from libcpp.unordered_map cimport unordered_map
from libcpp.vector cimport vector
from libcpp.set cimport set
from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno
from murmurhash.mrmr cimport hash64
from ...vocab cimport EMPTY_LEXEME
@ -26,7 +25,7 @@ cdef struct ArcC:
cdef cppclass StateC:
int* _heads
vector[int] _heads
const TokenC* _sent
vector[int] _stack
vector[int] _rebuffer
@ -34,31 +33,34 @@ cdef cppclass StateC:
unordered_map[int, vector[ArcC]] _left_arcs
unordered_map[int, vector[ArcC]] _right_arcs
vector[libcpp.bool] _unshiftable
vector[int] history
set[int] _sent_starts
TokenC _empty_token
int length
int offset
int _b_i
__init__(const TokenC* sent, int length) nogil:
__init__(const TokenC* sent, int length) nogil except +:
this._heads.resize(length, -1)
this._unshiftable.resize(length, False)
# Reserve memory ahead of time to minimize allocations during parsing.
# The initial capacity set here ideally reflects the expected average-case/majority usage.
cdef int init_capacity = 32
this._stack.reserve(init_capacity)
this._rebuffer.reserve(init_capacity)
this._ents.reserve(init_capacity)
this._left_arcs.reserve(init_capacity)
this._right_arcs.reserve(init_capacity)
this.history.reserve(init_capacity)
this._sent = sent
this._heads = <int*>calloc(length, sizeof(int))
if not (this._sent and this._heads):
with gil:
PyErr_SetFromErrno(MemoryError)
PyErr_CheckSignals()
this.offset = 0
this.length = length
this._b_i = 0
for i in range(length):
this._heads[i] = -1
this._unshiftable.push_back(0)
memset(&this._empty_token, 0, sizeof(TokenC))
this._empty_token.lex = &EMPTY_LEXEME
__dealloc__():
free(this._heads)
void set_context_tokens(int* ids, int n) nogil:
cdef int i, j
if n == 1:
@ -131,19 +133,20 @@ cdef cppclass StateC:
ids[i] = -1
int S(int i) nogil const:
if i >= this._stack.size():
cdef int stack_size = this._stack.size()
if i >= stack_size or i < 0:
return -1
elif i < 0:
return -1
return this._stack.at(this._stack.size() - (i+1))
else:
return this._stack[stack_size - (i+1)]
int B(int i) nogil const:
cdef int buf_size = this._rebuffer.size()
if i < 0:
return -1
elif i < this._rebuffer.size():
return this._rebuffer.at(this._rebuffer.size() - (i+1))
elif i < buf_size:
return this._rebuffer[buf_size - (i+1)]
else:
b_i = this._b_i + (i - this._rebuffer.size())
b_i = this._b_i + (i - buf_size)
if b_i >= this.length:
return -1
else:
@ -242,7 +245,7 @@ cdef cppclass StateC:
return 0
elif this._sent[word].sent_start == 1:
return 1
elif this._sent_starts.count(word) >= 1:
elif this._sent_starts.const_find(word) != this._sent_starts.const_end():
return 1
else:
return 0
@ -327,7 +330,7 @@ cdef cppclass StateC:
if item >= this._unshiftable.size():
return 0
else:
return this._unshiftable.at(item)
return this._unshiftable[item]
void set_reshiftable(int item) nogil:
if item < this._unshiftable.size():
@ -347,6 +350,9 @@ cdef cppclass StateC:
this._heads[child] = head
void map_del_arc(unordered_map[int, vector[ArcC]]* heads_arcs, int h_i, int c_i) nogil:
cdef vector[ArcC]* arcs
cdef ArcC* arc
arcs_it = heads_arcs.find(h_i)
if arcs_it == heads_arcs.end():
return
@ -355,12 +361,12 @@ cdef cppclass StateC:
if arcs.size() == 0:
return
arc = arcs.back()
arc = &arcs.back()
if arc.head == h_i and arc.child == c_i:
arcs.pop_back()
else:
for i in range(arcs.size()-1):
arc = arcs.at(i)
arc = &deref(arcs)[i]
if arc.head == h_i and arc.child == c_i:
arc.head = -1
arc.child = -1
@ -400,10 +406,11 @@ cdef cppclass StateC:
this._rebuffer = src._rebuffer
this._sent_starts = src._sent_starts
this._unshiftable = src._unshiftable
memcpy(this._heads, src._heads, this.length * sizeof(this._heads[0]))
this._heads = src._heads
this._ents = src._ents
this._left_arcs = src._left_arcs
this._right_arcs = src._right_arcs
this._b_i = src._b_i
this.offset = src.offset
this._empty_token = src._empty_token
this.history = src.history

View File

@ -15,7 +15,7 @@ from ...training.example cimport Example
from .stateclass cimport StateClass
from ._state cimport StateC, ArcC
from ...errors import Errors
from thinc.extra.search cimport Beam
from .search cimport Beam
cdef weight_t MIN_SCORE = -90000
cdef attr_t SUBTOK_LABEL = hash_string('subtok')
@ -773,6 +773,8 @@ cdef class ArcEager(TransitionSystem):
return list(arcs)
def has_gold(self, Example eg, start=0, end=None):
if end is not None and end < 0:
end = None
for word in eg.y[start:end]:
if word.dep != 0:
return True
@ -858,6 +860,7 @@ cdef class ArcEager(TransitionSystem):
state.print_state()
)))
action.do(state.c, action.label)
state.c.history.push_back(i)
break
else:
failed = False

View File

@ -0,0 +1,2 @@
cdef class Batch:
pass

View File

@ -0,0 +1,52 @@
from typing import Any
TransitionSystem = Any # TODO
cdef class Batch:
def advance(self, scores):
raise NotImplementedError
def get_states(self):
raise NotImplementedError
@property
def is_done(self):
raise NotImplementedError
def get_unfinished_states(self):
raise NotImplementedError
def __getitem__(self, i):
raise NotImplementedError
def __len__(self):
raise NotImplementedError
class GreedyBatch(Batch):
def __init__(self, moves: TransitionSystem, states, golds):
self._moves = moves
self._states = states
self._next_states = [s for s in states if not s.is_final()]
def advance(self, scores):
self._next_states = self._moves.transition_states(self._next_states, scores)
def advance_with_actions(self, actions):
self._next_states = self._moves.apply_actions(self._next_states, actions)
def get_states(self):
return self._states
@property
def is_done(self):
return all(s.is_final() for s in self._states)
def get_unfinished_states(self):
return [st for st in self._states if not st.is_final()]
def __getitem__(self, i):
return self._states[i]
def __len__(self):
return len(self._states)

View File

@ -1,10 +1,11 @@
import os
import random
from libc.stdint cimport int32_t
from libcpp.memory cimport shared_ptr
from libcpp.vector cimport vector
from cymem.cymem cimport Pool
from collections import Counter
from thinc.extra.search cimport Beam
from ...tokens.doc cimport Doc
from ...tokens.span import Span
@ -15,6 +16,7 @@ from ...attrs cimport IS_SPACE
from ...structs cimport TokenC, SpanC
from ...training import split_bilu_label
from ...training.example cimport Example
from .search cimport Beam
from .stateclass cimport StateClass
from ._state cimport StateC
from .transition_system cimport Transition, do_func_t
@ -43,9 +45,7 @@ MOVE_NAMES[OUT] = 'O'
cdef struct GoldNERStateC:
Transition* ner
SpanC* negs
int32_t length
int32_t nr_neg
vector[shared_ptr[SpanC]] negs
cdef class BiluoGold:
@ -78,8 +78,6 @@ cdef GoldNERStateC create_gold_state(
negs = []
assert example.x.length > 0
gs.ner = <Transition*>mem.alloc(example.x.length, sizeof(Transition))
gs.negs = <SpanC*>mem.alloc(len(negs), sizeof(SpanC))
gs.nr_neg = len(negs)
ner_ents, ner_tags = example.get_aligned_ents_and_ner()
for i, ner_tag in enumerate(ner_tags):
gs.ner[i] = moves.lookup_transition(ner_tag)
@ -93,8 +91,8 @@ cdef GoldNERStateC create_gold_state(
# In order to handle negative samples, we need to maintain the full
# (start, end, label) triple. If we break it down to the 'isnt B-LOC'
# thing, we'll get blocked if there's an incorrect prefix.
for i, neg in enumerate(negs):
gs.negs[i] = neg.c
for neg in negs:
gs.negs.push_back(neg.c)
return gs
@ -308,6 +306,8 @@ cdef class BiluoPushDown(TransitionSystem):
for span in eg.y.spans.get(neg_key, []):
if span.start >= start and span.end <= end:
return True
if end is not None and end < 0:
end = None
for word in eg.y[start:end]:
if word.ent_iob != 0:
return True
@ -411,6 +411,8 @@ cdef class Begin:
cdef int g_act = gold.ner[b0].move
cdef attr_t g_tag = gold.ner[b0].label
cdef shared_ptr[SpanC] span
if g_act == MISSING:
pass
elif g_act == BEGIN:
@ -428,8 +430,8 @@ cdef class Begin:
# be correct or not. However, we can at least tell whether we're
# going to be opening an entity where there's only one possible
# L.
for span in gold.negs[:gold.nr_neg]:
if span.label == label and span.start == b0:
for span in gold.negs:
if span.get().label == label and span.get().start == b0:
cost += 1
break
return cost
@ -574,8 +576,9 @@ cdef class Last:
# If we have negative-example entities, integrate them into the objective,
# by marking actions that close an entity that we know is incorrect
# as costly.
for span in gold.negs[:gold.nr_neg]:
if span.label == label and (span.end-1) == b0 and span.start == ent_start:
cdef shared_ptr[SpanC] span
for span in gold.negs:
if span.get().label == label and (span.get().end-1) == b0 and span.get().start == ent_start:
cost += 1
break
return cost
@ -639,8 +642,9 @@ cdef class Unit:
# This is fairly straight-forward for U- entities, as we have a single
# action
cdef int b0 = s.B(0)
for span in gold.negs[:gold.nr_neg]:
if span.label == label and span.start == b0 and span.end == (b0+1):
cdef shared_ptr[SpanC] span
for span in gold.negs:
if span.get().label == label and span.get().start == b0 and span.get().end == (b0+1):
cost += 1
break
return cost

View File

@ -0,0 +1,89 @@
from cymem.cymem cimport Pool
from libc.stdint cimport uint32_t
from libc.stdint cimport uint64_t
from libcpp.pair cimport pair
from libcpp.queue cimport priority_queue
from libcpp.vector cimport vector
from ...typedefs cimport class_t, weight_t, hash_t
ctypedef pair[weight_t, size_t] Entry
ctypedef priority_queue[Entry] Queue
ctypedef int (*trans_func_t)(void* dest, void* src, class_t clas, void* x) except -1
ctypedef void* (*init_func_t)(Pool mem, int n, void* extra_args) except NULL
ctypedef int (*del_func_t)(Pool mem, void* state, void* extra_args) except -1
ctypedef int (*finish_func_t)(void* state, void* extra_args) except -1
ctypedef hash_t (*hash_func_t)(void* state, void* x) except 0
cdef struct _State:
void* content
class_t* hist
weight_t score
weight_t loss
int i
int t
bint is_done
cdef class Beam:
cdef Pool mem
cdef class_t nr_class
cdef class_t width
cdef class_t size
cdef public weight_t min_density
cdef int t
cdef readonly bint is_done
cdef list histories
cdef list _parent_histories
cdef weight_t** scores
cdef int** is_valid
cdef weight_t** costs
cdef _State* _parents
cdef _State* _states
cdef del_func_t del_func
cdef int _fill(self, Queue* q, weight_t** scores, int** is_valid) except -1
cdef inline void* at(self, int i) nogil:
return self._states[i].content
cdef int initialize(self, init_func_t init_func, del_func_t del_func, int n, void* extra_args) except -1
cdef int advance(self, trans_func_t transition_func, hash_func_t hash_func,
void* extra_args) except -1
cdef int check_done(self, finish_func_t finish_func, void* extra_args) except -1
cdef inline void set_cell(self, int i, int j, weight_t score, int is_valid, weight_t cost) nogil:
self.scores[i][j] = score
self.is_valid[i][j] = is_valid
self.costs[i][j] = cost
cdef int set_row(self, int i, const weight_t* scores, const int* is_valid,
const weight_t* costs) except -1
cdef int set_table(self, weight_t** scores, int** is_valid, weight_t** costs) except -1
cdef class MaxViolation:
cdef Pool mem
cdef weight_t cost
cdef weight_t delta
cdef readonly weight_t p_score
cdef readonly weight_t g_score
cdef readonly double Z
cdef readonly double gZ
cdef class_t n
cdef readonly list p_hist
cdef readonly list g_hist
cdef readonly list p_probs
cdef readonly list g_probs
cpdef int check(self, Beam pred, Beam gold) except -1
cpdef int check_crf(self, Beam pred, Beam gold) except -1

View File

@ -0,0 +1,306 @@
# cython: profile=True, experimental_cpp_class_def=True, cdivision=True, infer_types=True
cimport cython
from libc.string cimport memset, memcpy
from libc.math cimport log, exp
import math
from cymem.cymem cimport Pool
from preshed.maps cimport PreshMap
cdef class Beam:
def __init__(self, class_t nr_class, class_t width, weight_t min_density=0.0):
assert nr_class != 0
assert width != 0
self.nr_class = nr_class
self.width = width
self.min_density = min_density
self.size = 1
self.t = 0
self.mem = Pool()
self.del_func = NULL
self._parents = <_State*>self.mem.alloc(self.width, sizeof(_State))
self._states = <_State*>self.mem.alloc(self.width, sizeof(_State))
cdef int i
self.histories = [[] for i in range(self.width)]
self._parent_histories = [[] for i in range(self.width)]
self.scores = <weight_t**>self.mem.alloc(self.width, sizeof(weight_t*))
self.is_valid = <int**>self.mem.alloc(self.width, sizeof(weight_t*))
self.costs = <weight_t**>self.mem.alloc(self.width, sizeof(weight_t*))
for i in range(self.width):
self.scores[i] = <weight_t*>self.mem.alloc(self.nr_class, sizeof(weight_t))
self.is_valid[i] = <int*>self.mem.alloc(self.nr_class, sizeof(int))
self.costs[i] = <weight_t*>self.mem.alloc(self.nr_class, sizeof(weight_t))
def __len__(self):
return self.size
property score:
def __get__(self):
return self._states[0].score
property min_score:
def __get__(self):
return self._states[self.size-1].score
property loss:
def __get__(self):
return self._states[0].loss
property probs:
def __get__(self):
return _softmax([self._states[i].score for i in range(self.size)])
property scores:
def __get__(self):
return [self._states[i].score for i in range(self.size)]
property histories:
def __get__(self):
return self.histories
cdef int set_row(self, int i, const weight_t* scores, const int* is_valid,
const weight_t* costs) except -1:
cdef int j
for j in range(self.nr_class):
self.scores[i][j] = scores[j]
self.is_valid[i][j] = is_valid[j]
self.costs[i][j] = costs[j]
cdef int set_table(self, weight_t** scores, int** is_valid, weight_t** costs) except -1:
cdef int i, j
for i in range(self.width):
memcpy(self.scores[i], scores[i], sizeof(weight_t) * self.nr_class)
memcpy(self.is_valid[i], is_valid[i], sizeof(bint) * self.nr_class)
memcpy(self.costs[i], costs[i], sizeof(int) * self.nr_class)
cdef int initialize(self, init_func_t init_func, del_func_t del_func, int n, void* extra_args) except -1:
for i in range(self.width):
self._states[i].content = init_func(self.mem, n, extra_args)
self._parents[i].content = init_func(self.mem, n, extra_args)
self.del_func = del_func
def __dealloc__(self):
if self.del_func == NULL:
return
for i in range(self.width):
self.del_func(self.mem, self._states[i].content, NULL)
self.del_func(self.mem, self._parents[i].content, NULL)
@cython.cdivision(True)
cdef int advance(self, trans_func_t transition_func, hash_func_t hash_func,
void* extra_args) except -1:
cdef weight_t** scores = self.scores
cdef int** is_valid = self.is_valid
cdef weight_t** costs = self.costs
cdef Queue* q = new Queue()
self._fill(q, scores, is_valid)
# For a beam of width k, we only ever need 2k state objects. How?
# Each transition takes a parent and a class and produces a new state.
# So, we don't need the whole history --- just the parent. So at
# each step, we take a parent, and apply one or more extensions to
# it.
self._parents, self._states = self._states, self._parents
self._parent_histories, self.histories = self.histories, self._parent_histories
cdef weight_t score
cdef int p_i
cdef int i = 0
cdef class_t clas
cdef _State* parent
cdef _State* state
cdef hash_t key
cdef PreshMap seen_states = PreshMap(self.width)
cdef uint64_t is_seen
cdef uint64_t one = 1
while i < self.width and not q.empty():
data = q.top()
p_i = data.second / self.nr_class
clas = data.second % self.nr_class
score = data.first
q.pop()
parent = &self._parents[p_i]
# Indicates terminal state reached; i.e. state is done
if parent.is_done:
# Now parent will not be changed, so we don't have to copy.
# Once finished, should also be unbranching.
self._states[i], parent[0] = parent[0], self._states[i]
parent.i = self._states[i].i
parent.t = self._states[i].t
parent.is_done = self._states[i].t
self._states[i].score = score
self.histories[i] = list(self._parent_histories[p_i])
i += 1
else:
state = &self._states[i]
# The supplied transition function should adjust the destination
# state to be the result of applying the class to the source state
transition_func(state.content, parent.content, clas, extra_args)
key = hash_func(state.content, extra_args) if hash_func is not NULL else 0
is_seen = <uint64_t>seen_states.get(key)
if key == 0 or key == 1 or not is_seen:
if key != 0 and key != 1:
seen_states.set(key, <void*>one)
state.score = score
state.loss = parent.loss + costs[p_i][clas]
self.histories[i] = list(self._parent_histories[p_i])
self.histories[i].append(clas)
i += 1
del q
self.size = i
assert self.size >= 1
for i in range(self.width):
memset(self.scores[i], 0, sizeof(weight_t) * self.nr_class)
memset(self.costs[i], 0, sizeof(weight_t) * self.nr_class)
memset(self.is_valid[i], 0, sizeof(int) * self.nr_class)
self.t += 1
cdef int check_done(self, finish_func_t finish_func, void* extra_args) except -1:
cdef int i
for i in range(self.size):
if not self._states[i].is_done:
self._states[i].is_done = finish_func(self._states[i].content, extra_args)
for i in range(self.size):
if not self._states[i].is_done:
self.is_done = False
break
else:
self.is_done = True
@cython.cdivision(True)
cdef int _fill(self, Queue* q, weight_t** scores, int** is_valid) except -1:
"""Populate the queue from a k * n matrix of scores, where k is the
beam-width, and n is the number of classes.
"""
cdef Entry entry
cdef weight_t score
cdef _State* s
cdef int i, j, move_id
assert self.size >= 1
cdef vector[Entry] entries
for i in range(self.size):
s = &self._states[i]
move_id = i * self.nr_class
if s.is_done:
# Update score by path average, following TACL '13 paper.
if self.histories[i]:
entry.first = s.score + (s.score / self.t)
else:
entry.first = s.score
entry.second = move_id
entries.push_back(entry)
else:
for j in range(self.nr_class):
if is_valid[i][j]:
entry.first = s.score + scores[i][j]
entry.second = move_id + j
entries.push_back(entry)
cdef double max_, Z, cutoff
if self.min_density == 0.0:
for i in range(entries.size()):
q.push(entries[i])
elif not entries.empty():
max_ = entries[0].first
Z = 0.
cutoff = 0.
# Softmax into probabilities, so we can prune
for i in range(entries.size()):
if entries[i].first > max_:
max_ = entries[i].first
for i in range(entries.size()):
Z += exp(entries[i].first-max_)
cutoff = (1. / Z) * self.min_density
for i in range(entries.size()):
prob = exp(entries[i].first-max_) / Z
if prob >= cutoff:
q.push(entries[i])
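# A hedged numpy sketch of the density pruning performed in _fill above
# (function name illustrative): candidate expansions are softmaxed and any
# candidate whose probability falls below min_density / Z is never pushed.
import numpy
def _prune_by_density_sketch(cand_scores, min_density):
    scores = numpy.asarray(cand_scores, dtype=float)
    expd = numpy.exp(scores - scores.max())
    Z = expd.sum()
    probs = expd / Z
    cutoff = min_density / Z
    return [i for i, p in enumerate(probs) if p >= cutoff]
assert _prune_by_density_sketch([2.0, 1.9, -3.0], min_density=0.05) == [0, 1]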
cdef class MaxViolation:
def __init__(self):
self.p_score = 0.0
self.g_score = 0.0
self.Z = 0.0
self.gZ = 0.0
self.delta = -1
self.cost = 0
self.p_hist = []
self.g_hist = []
self.p_probs = []
self.g_probs = []
cpdef int check(self, Beam pred, Beam gold) except -1:
cdef _State* p = &pred._states[0]
cdef _State* g = &gold._states[0]
cdef weight_t d = p.score - g.score
if p.loss >= 1 and (self.cost == 0 or d > self.delta):
self.cost = p.loss
self.delta = d
self.p_hist = list(pred.histories[0])
self.g_hist = list(gold.histories[0])
self.p_score = p.score
self.g_score = g.score
self.Z = 1e-10
self.gZ = 1e-10
for i in range(pred.size):
if pred._states[i].loss > 0:
self.Z += exp(pred._states[i].score)
for i in range(gold.size):
if gold._states[i].loss == 0:
prob = exp(gold._states[i].score)
self.Z += prob
self.gZ += prob
cpdef int check_crf(self, Beam pred, Beam gold) except -1:
d = pred.score - gold.score
seen_golds = set([tuple(gold.histories[i]) for i in range(gold.size)])
if pred.loss > 0 and (self.cost == 0 or d > self.delta):
p_hist = []
p_scores = []
g_hist = []
g_scores = []
for i in range(pred.size):
if pred._states[i].loss > 0:
p_scores.append(pred._states[i].score)
p_hist.append(list(pred.histories[i]))
# This can happen from non-monotonic actions
# If we find a better gold analysis this way, be sure to keep it.
elif pred._states[i].loss <= 0 \
and tuple(pred.histories[i]) not in seen_golds:
g_scores.append(pred._states[i].score)
g_hist.append(list(pred.histories[i]))
for i in range(gold.size):
if gold._states[i].loss == 0:
g_scores.append(gold._states[i].score)
g_hist.append(list(gold.histories[i]))
all_probs = _softmax(p_scores + g_scores)
p_probs = all_probs[:len(p_scores)]
g_probs_all = all_probs[len(p_scores):]
g_probs = _softmax(g_scores)
self.cost = pred.loss
self.delta = d
self.p_hist = p_hist
self.g_hist = g_hist
# TODO: These variables are misnamed! These are the gradients of the loss.
self.p_probs = p_probs
# Intuition here:
# The gradient of the loss is:
# P(model) - P(truth)
# Normally, P(truth) is 1 for the gold
# But, if we want to do the "partial credit" scheme, we want
# to create a distribution over the gold, proportional to the scores
# awarded.
self.g_probs = [x-y for x, y in zip(g_probs_all, g_probs)]
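# A small runnable sketch of the "partial credit" gradient described above
# (names illustrative): the gradient for each beam candidate is
# P(model) - P(truth), with P(truth) spread over the gold candidates in
# proportion to their scores, so the candidate gradients sum to zero.
from math import exp as _exp
def _softmax_sketch(xs):
    m = max(xs)
    es = [_exp(x - m) for x in xs]
    Z = sum(es)
    return [e / Z for e in es]
p_scores = [1.0, 0.2]                                    # violating candidates
g_scores = [0.8, 0.5]                                    # gold candidates
all_probs = _softmax_sketch(p_scores + g_scores)
p_grad = all_probs[:len(p_scores)]                       # P(truth) is 0 here
g_target = _softmax_sketch(g_scores)                     # P(truth) over golds
g_grad = [pm - pt for pm, pt in zip(all_probs[len(p_scores):], g_target)]
assert abs(sum(p_grad) + sum(g_grad)) < 1e-12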
def _softmax(nums):
if not nums:
return []
max_ = max(nums)
nums = [(exp(n-max_) if n is not None else None) for n in nums]
Z = sum(n for n in nums if n is not None)
return [(n/Z if n is not None else None) for n in nums]

View File

@ -20,6 +20,10 @@ cdef class StateClass:
if self._borrowed != 1:
del self.c
@property
def history(self):
return list(self.c.history)
@property
def stack(self):
return [self.S(i) for i in range(self.c.stack_depth())]
@ -176,3 +180,6 @@ cdef class StateClass:
def clone(self, StateClass src):
self.c.clone(src.c)
def set_context_tokens(self, int[:, :] output, int row, int n_feats):
self.c.set_context_tokens(&output[row, 0], n_feats)

View File

@ -53,3 +53,10 @@ cdef class TransitionSystem:
cdef int set_costs(self, int* is_valid, weight_t* costs,
const StateC* state, gold) except -1
cdef void c_apply_actions(TransitionSystem moves, StateC** states, const int* actions,
int batch_size) nogil
cdef void c_transition_batch(TransitionSystem moves, StateC** states, const float* scores,
int nr_class, int batch_size) nogil

View File

@ -1,6 +1,8 @@
# cython: infer_types=True
from __future__ import print_function
from cymem.cymem cimport Pool
from libc.stdlib cimport calloc, free
from libcpp.vector cimport vector
from collections import Counter
import srsly
@ -10,6 +12,7 @@ from ...typedefs cimport weight_t, attr_t
from ...tokens.doc cimport Doc
from ...structs cimport TokenC
from .stateclass cimport StateClass
from ._parser_utils cimport arg_max_if_valid
from ...errors import Errors
from ... import util
@ -73,7 +76,18 @@ cdef class TransitionSystem:
offset += len(doc)
return states
def follow_history(self, doc, history):
cdef int clas
cdef StateClass state = StateClass(doc)
for clas in history:
action = self.c[clas]
action.do(state.c, action.label)
state.c.history.push_back(clas)
return state
def get_oracle_sequence(self, Example example, _debug=False):
if not self.has_gold(example):
return []
states, golds, _ = self.init_gold_batch([example])
if not states:
return []
@ -85,6 +99,8 @@ cdef class TransitionSystem:
return self.get_oracle_sequence_from_state(state, gold)
def get_oracle_sequence_from_state(self, StateClass state, gold, _debug=None):
if state.is_final():
return []
cdef Pool mem = Pool()
# n_moves should not be zero at this point, but make sure to avoid zero-length mem alloc
assert self.n_moves > 0
@ -110,6 +126,7 @@ cdef class TransitionSystem:
"S0 head?", str(state.has_head(state.S(0))),
)))
action.do(state.c, action.label)
state.c.history.push_back(i)
break
else:
if _debug:
@ -137,6 +154,28 @@ cdef class TransitionSystem:
raise ValueError(Errors.E170.format(name=name))
action = self.lookup_transition(name)
action.do(state.c, action.label)
state.c.history.push_back(action.clas)
def apply_actions(self, states, const int[::1] actions):
assert len(states) == actions.shape[0]
cdef StateClass state
cdef vector[StateC*] c_states
c_states.resize(len(states))
cdef int i
for (i, state) in enumerate(states):
c_states[i] = state.c
c_apply_actions(self, &c_states[0], &actions[0], actions.shape[0])
return [state for state in states if not state.c.is_final()]
def transition_states(self, states, float[:, ::1] scores):
assert len(states) == scores.shape[0]
cdef StateClass state
cdef float* c_scores = &scores[0, 0]
cdef vector[StateC*] c_states
for state in states:
c_states.push_back(state.c)
c_transition_batch(self, &c_states[0], c_scores, scores.shape[1], scores.shape[0])
return [state for state in states if not state.c.is_final()]
cdef Transition lookup_transition(self, object name) except *:
raise NotImplementedError
@ -250,3 +289,35 @@ cdef class TransitionSystem:
self.cfg.update(msg['cfg'])
self.initialize_actions(labels)
return self
cdef void c_apply_actions(TransitionSystem moves, StateC** states, const int* actions,
int batch_size) nogil:
cdef int i
cdef Transition action
cdef StateC* state
for i in range(batch_size):
state = states[i]
action = moves.c[actions[i]]
action.do(state, action.label)
state.history.push_back(action.clas)
cdef void c_transition_batch(TransitionSystem moves, StateC** states, const float* scores,
int nr_class, int batch_size) nogil:
is_valid = <int*>calloc(moves.n_moves, sizeof(int))
cdef int i, guess
cdef Transition action
for i in range(batch_size):
moves.set_valid(is_valid, states[i])
guess = arg_max_if_valid(&scores[i*nr_class], is_valid, nr_class)
if guess == -1:
# This shouldn't happen, but it's hard to raise an error here,
# and we don't want to infinite loop. So, force to end state.
states[i].force_final()
else:
action = moves.c[guess]
action.do(states[i], action.label)
states[i].history.push_back(guess)
free(is_valid)
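# A numpy sketch of the validity-masked argmax applied per state above
# (hypothetical helper): invalid transitions are excluded before picking the
# best-scoring action, and -1 signals that the state must be forced final.
import numpy
def _best_valid_action_sketch(scores, is_valid):
    scores = numpy.asarray(scores, dtype=float)
    valid = numpy.asarray(is_valid, dtype=bool)
    if not valid.any():
        return -1
    return int(numpy.where(valid, scores, -numpy.inf).argmax())
assert _best_valid_action_sketch([0.1, 3.0, 2.0], [1, 0, 1]) == 2
assert _best_valid_action_sketch([0.5, 0.2], [0, 0]) == -1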

View File

@ -11,7 +11,7 @@ from ..matcher import Matcher
from ..scorer import Scorer
from ..symbols import IDS
from ..tokens import Doc, Span
from ..tokens._retokenize import normalize_token_attrs, set_token_attrs
from ..tokens.retokenizer import normalize_token_attrs, set_token_attrs
from ..vocab import Vocab
from ..util import SimpleFrozenList, registry
from .. import util

View File

@ -4,8 +4,8 @@ from typing import Optional, Iterable, Callable
from thinc.api import Model, Config
from ._parser_internals.transition_system import TransitionSystem
from .transition_parser cimport Parser
from ._parser_internals.arc_eager cimport ArcEager
from .transition_parser import Parser
from ._parser_internals.arc_eager import ArcEager
from .functions import merge_subtokens
from ..language import Language
@ -18,12 +18,11 @@ from ..util import registry
default_model_config = """
[model]
@architectures = "spacy.TransitionBasedParser.v2"
@architectures = "spacy.TransitionBasedParser.v3"
state_type = "parser"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
@ -123,6 +122,7 @@ def make_parser(
scorer=scorer,
)
@Language.factory(
"beam_parser",
assigns=["token.dep", "token.head", "token.is_sent_start", "doc.sents"],
@ -228,6 +228,7 @@ def parser_score(examples, **kwargs):
DOCS: https://spacy.io/api/dependencyparser#score
"""
def has_sents(doc):
return doc.has_annotation("SENT_START")
@ -235,8 +236,11 @@ def parser_score(examples, **kwargs):
dep = getattr(token, attr)
dep = token.vocab.strings.as_string(dep).lower()
return dep
results = {}
results.update(Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs))
results.update(
Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs)
)
kwargs.setdefault("getter", dep_getter)
kwargs.setdefault("ignore_labels", ("p", "punct"))
results.update(Scorer.score_deps(examples, "dep", **kwargs))
@ -249,11 +253,12 @@ def make_parser_scorer():
return parser_score
cdef class DependencyParser(Parser):
class DependencyParser(Parser):
"""Pipeline component for dependency parsing.
DOCS: https://spacy.io/api/dependencyparser
"""
TransitionSystem = ArcEager
def __init__(
@ -273,8 +278,7 @@ cdef class DependencyParser(Parser):
incorrect_spans_key=None,
scorer=parser_score,
):
"""Create a DependencyParser.
"""
"""Create a DependencyParser."""
super().__init__(
vocab,
model,

View File

@ -1,4 +1,4 @@
from typing import cast, Any, Callable, Dict, Iterable, List, Optional
from typing import cast, Any, Callable, Dict, Iterable, List, Optional, Union
from typing import Tuple
from collections import Counter
from itertools import islice
@ -6,7 +6,8 @@ import numpy as np
import srsly
from thinc.api import Config, Model, SequenceCategoricalCrossentropy, NumpyOps
from thinc.types import Floats2d, Ints2d
from thinc.types import ArrayXd, Floats2d, Ints1d
from thinc.legacy import LegacySequenceCategoricalCrossentropy
from ._edit_tree_internals.edit_trees import EditTrees
from ._edit_tree_internals.schemas import validate_edit_tree
@ -20,6 +21,7 @@ from ..vocab import Vocab
from .. import util
ActivationsT = Dict[str, Union[List[Floats2d], List[Ints1d]]]
# The cutoff value of *top_k* above which an alternative method is used to process guesses.
TOP_K_GUARDRAIL = 20
@ -52,6 +54,7 @@ DEFAULT_EDIT_TREE_LEMMATIZER_MODEL = Config().from_str(default_model_config)["mo
"overwrite": False,
"top_k": 1,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
"save_activations": False,
},
default_score_weights={"lemma_acc": 1.0},
)
@ -64,6 +67,7 @@ def make_edit_tree_lemmatizer(
overwrite: bool,
top_k: int,
scorer: Optional[Callable],
save_activations: bool,
):
"""Construct an EditTreeLemmatizer component."""
return EditTreeLemmatizer(
@ -75,6 +79,7 @@ def make_edit_tree_lemmatizer(
overwrite=overwrite,
top_k=top_k,
scorer=scorer,
save_activations=save_activations,
)
@ -94,6 +99,7 @@ class EditTreeLemmatizer(TrainablePipe):
overwrite: bool = False,
top_k: int = 1,
scorer: Optional[Callable] = lemmatizer_score,
save_activations: bool = False,
):
"""
Construct an edit tree lemmatizer.
@ -105,6 +111,7 @@ class EditTreeLemmatizer(TrainablePipe):
frequency in the training data.
overwrite (bool): overwrite existing lemma annotations.
top_k (int): try to apply at most the k most probable edit trees.
save_activations (bool): save model activations in Doc when annotating.
"""
self.vocab = vocab
self.model = model
@ -119,13 +126,16 @@ class EditTreeLemmatizer(TrainablePipe):
self.cfg: Dict[str, Any] = {"labels": []}
self.scorer = scorer
self.save_activations = save_activations
self.numpy_ops = NumpyOps()
def get_loss(
self, examples: Iterable[Example], scores: List[Floats2d]
) -> Tuple[float, List[Floats2d]]:
validate_examples(examples, "EditTreeLemmatizer.get_loss")
loss_func = SequenceCategoricalCrossentropy(normalize=False, missing_value=-1)
loss_func = LegacySequenceCategoricalCrossentropy(
normalize=False, missing_value=-1
)
truths = []
for eg in examples:
@ -148,7 +158,26 @@ class EditTreeLemmatizer(TrainablePipe):
return float(loss), d_scores
def predict(self, docs: Iterable[Doc]) -> List[Ints2d]:
def get_teacher_student_loss(
self, teacher_scores: List[Floats2d], student_scores: List[Floats2d]
) -> Tuple[float, List[Floats2d]]:
"""Calculate the loss and its gradient for a batch of student
scores, relative to teacher scores.
teacher_scores: Scores representing the teacher model's predictions.
student_scores: Scores representing the student model's predictions.
RETURNS (Tuple[float, List[Floats2d]]): The loss and the gradient.
DOCS: https://spacy.io/api/edittreelemmatizer#get_teacher_student_loss
"""
loss_func = LegacySequenceCategoricalCrossentropy(normalize=False)
d_scores, loss = loss_func(student_scores, teacher_scores)
if self.model.ops.xp.isnan(loss):
raise ValueError(Errors.E910.format(name=self.name))
return float(loss), d_scores
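get_teacher_student_loss delegates to thinc's LegacySequenceCategoricalCrossentropy, which here amounts to a cross-entropy of the student's probabilities against the teacher's soft targets. A rough NumPy sketch of that idea (toy probabilities; assuming softmax outputs, for which the gradient with respect to the logits reduces to student - teacher):

import numpy as np

def soft_cross_entropy(student_probs, teacher_probs, eps=1e-9):
    loss = -(teacher_probs * np.log(student_probs + eps)).sum()
    d_logits = student_probs - teacher_probs  # gradient under the softmax assumption
    return float(loss), d_logits

student = np.array([[0.7, 0.2, 0.1]])
teacher = np.array([[0.6, 0.3, 0.1]])
print(soft_cross_entropy(student, teacher))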
def predict(self, docs: Iterable[Doc]) -> ActivationsT:
if self.top_k == 1:
scores2guesses = self._scores2guesses_top_k_equals_1
elif self.top_k <= TOP_K_GUARDRAIL:
@ -165,14 +194,19 @@ class EditTreeLemmatizer(TrainablePipe):
if not any(len(doc) for doc in docs):
# Handle cases where there are no tokens in any docs.
n_labels = len(self.cfg["labels"])
guesses: List[Ints2d] = [self.model.ops.alloc2i(0, n_labels) for _ in docs]
guesses: List[Ints1d] = [
self.model.ops.alloc((0,), dtype="i") for doc in docs
]
scores: List[Floats2d] = [
self.model.ops.alloc((0, n_labels), dtype="i") for doc in docs
]
assert len(guesses) == n_docs
return guesses
return {"probabilities": scores, "tree_ids": guesses}
scores = self.model.predict(docs)
assert len(scores) == n_docs
guesses = scores2guesses(docs, scores)
assert len(guesses) == n_docs
return guesses
return {"probabilities": scores, "tree_ids": guesses}
def _scores2guesses_top_k_equals_1(self, docs, scores):
guesses = []
@ -232,8 +266,13 @@ class EditTreeLemmatizer(TrainablePipe):
return guesses
def set_annotations(self, docs: Iterable[Doc], batch_tree_ids):
def set_annotations(self, docs: Iterable[Doc], activations: ActivationsT):
batch_tree_ids = activations["tree_ids"]
for i, doc in enumerate(docs):
if self.save_activations:
doc.activations[self.name] = {}
for act_name, acts in activations.items():
doc.activations[self.name][act_name] = acts[i]
doc_tree_ids = batch_tree_ids[i]
if hasattr(doc_tree_ids, "get"):
doc_tree_ids = doc_tree_ids.get()
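With save_activations enabled, set_annotations stores the "probabilities" and "tree_ids" arrays on each Doc under the component's name. A hypothetical usage sketch (the pipeline path is illustrative, and "trainable_lemmatizer" is assumed to be the component's registered name):

import spacy

nlp = spacy.load("my_trained_pipeline")            # hypothetical trained pipeline
lemmatizer = nlp.get_pipe("trainable_lemmatizer")
lemmatizer.save_activations = True
doc = nlp("She was reading")
acts = doc.activations["trainable_lemmatizer"]
print(acts["tree_ids"], acts["probabilities"].shape)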

View File

@ -1,5 +1,7 @@
from typing import Optional, Iterable, Callable, Dict, Union, List, Any
from thinc.types import Floats2d
import warnings
from typing import Optional, Iterable, Callable, Dict, Sequence, Union, List, Any, cast
from numpy import dtype
from thinc.types import Floats1d, Floats2d, Ints1d, Ragged
from pathlib import Path
from itertools import islice
import srsly
@ -8,21 +10,23 @@ from thinc.api import CosineDistance, Model, Optimizer, Config
from thinc.api import set_dropout_rate
from ..kb import KnowledgeBase, Candidate
from ..ml import empty_kb
from ..tokens import Doc, Span
from ..ml import empty_kb
from ..tokens import Doc, Span, SpanGroup
from .pipe import deserialize_config
from .legacy.entity_linker import EntityLinker_v1
from .trainable_pipe import TrainablePipe
from ..language import Language
from ..vocab import Vocab
from ..training import Example, validate_examples, validate_get_examples
from ..errors import Errors
from ..errors import Errors, Warnings
from ..util import SimpleFrozenList, registry
from .. import util
from ..scorer import Scorer
# See #9050
BACKWARD_OVERWRITE = True
ActivationsT = Dict[str, Union[List[Ragged], List[str]]]
KNOWLEDGE_BASE_IDS = "kb_ids"
default_model_config = """
[model]
@ -54,12 +58,13 @@ DEFAULT_NEL_MODEL = Config().from_str(default_model_config)["model"]
"entity_vector_length": 64,
"get_candidates": {"@misc": "spacy.CandidateGenerator.v1"},
"get_candidates_batch": {"@misc": "spacy.CandidateBatchGenerator.v1"},
"overwrite": False,
"generate_empty_kb": {"@misc": "spacy.EmptyKB.v2"},
"overwrite": True,
"scorer": {"@scorers": "spacy.entity_linker_scorer.v1"},
"use_gold_ents": True,
"candidates_batch_size": 1,
"threshold": None,
"save_activations": False,
},
default_score_weights={
"nel_micro_f": 1.0,
@ -79,7 +84,7 @@ def make_entity_linker(
entity_vector_length: int,
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
get_candidates_batch: Callable[
[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]
[KnowledgeBase, SpanGroup], Iterable[Iterable[Candidate]]
],
generate_empty_kb: Callable[[Vocab, int], KnowledgeBase],
overwrite: bool,
@ -87,6 +92,7 @@ def make_entity_linker(
use_gold_ents: bool,
candidates_batch_size: int,
threshold: Optional[float] = None,
save_activations: bool,
):
"""Construct an EntityLinker component.
@ -101,7 +107,7 @@ def make_entity_linker(
get_candidates (Callable[[KnowledgeBase, Span], Iterable[Candidate]]): Function that
produces a list of candidates, given a certain knowledge base and a textual mention.
get_candidates_batch (
Callable[[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]], Iterable[Candidate]]
Callable[[KnowledgeBase, SpanGroup], Iterable[Iterable[Candidate]]], Iterable[Candidate]]
): Function that produces a list of candidates, given a certain knowledge base and several textual mentions.
generate_empty_kb (Callable[[Vocab, int], KnowledgeBase]): Callable returning empty KnowledgeBase.
scorer (Optional[Callable]): The scoring method.
@ -110,23 +116,11 @@ def make_entity_linker(
candidates_batch_size (int): Size of batches for entity candidate generation.
threshold (Optional[float]): Confidence threshold for entity predictions. If confidence is below the threshold,
prediction is discarded. If None, predictions are not filtered by any threshold.
save_activations (bool): save model activations in Doc when annotating.
"""
if not model.attrs.get("include_span_maker", False):
# The only difference in arguments here is that use_gold_ents and threshold aren't available.
return EntityLinker_v1(
nlp.vocab,
model,
name,
labels_discard=labels_discard,
n_sents=n_sents,
incl_prior=incl_prior,
incl_context=incl_context,
entity_vector_length=entity_vector_length,
get_candidates=get_candidates,
overwrite=overwrite,
scorer=scorer,
)
raise ValueError(Errors.E4005)
return EntityLinker(
nlp.vocab,
model,
@ -144,6 +138,7 @@ def make_entity_linker(
use_gold_ents=use_gold_ents,
candidates_batch_size=candidates_batch_size,
threshold=threshold,
save_activations=save_activations,
)
@ -177,14 +172,15 @@ class EntityLinker(TrainablePipe):
entity_vector_length: int,
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
get_candidates_batch: Callable[
[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]
[KnowledgeBase, SpanGroup], Iterable[Iterable[Candidate]]
],
generate_empty_kb: Callable[[Vocab, int], KnowledgeBase],
overwrite: bool = BACKWARD_OVERWRITE,
overwrite: bool = False,
scorer: Optional[Callable] = entity_linker_score,
use_gold_ents: bool,
candidates_batch_size: int,
threshold: Optional[float] = None,
save_activations: bool = False,
) -> None:
"""Initialize an entity linker.
@ -200,16 +196,18 @@ class EntityLinker(TrainablePipe):
get_candidates (Callable[[KnowledgeBase, Span], Iterable[Candidate]]): Function that
produces a list of candidates, given a certain knowledge base and a textual mention.
get_candidates_batch (
Callable[[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]],
Callable[[KnowledgeBase, SpanGroup], Iterable[Iterable[Candidate]]],
Iterable[Candidate]]
): Function that produces a list of candidates, given a certain knowledge base and several textual mentions.
generate_empty_kb (Callable[[Vocab, int], KnowledgeBase]): Callable returning empty KnowledgeBase.
overwrite (bool): Whether to overwrite existing non-empty annotations.
scorer (Optional[Callable]): The scoring method. Defaults to Scorer.score_links.
use_gold_ents (bool): Whether to copy entities from gold docs or not. If false, another
component must provide entity annotations.
candidates_batch_size (int): Size of batches for entity candidate generation.
threshold (Optional[float]): Confidence threshold for entity predictions. If confidence is below the
threshold, prediction is discarded. If None, predictions are not filtered by any threshold.
save_activations (bool): save model activations in Doc when annotating.
DOCS: https://spacy.io/api/entitylinker#init
"""
@ -239,9 +237,12 @@ class EntityLinker(TrainablePipe):
self.use_gold_ents = use_gold_ents
self.candidates_batch_size = candidates_batch_size
self.threshold = threshold
self.save_activations = save_activations
if candidates_batch_size < 1:
raise ValueError(Errors.E1044)
if self.incl_prior and not self.kb.supports_prior_probs:
warnings.warn(Warnings.W401)
def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
"""Define the KB of this pipe by providing a function that will
@ -427,7 +428,7 @@ class EntityLinker(TrainablePipe):
loss = loss / len(entity_encodings)
return float(loss), out
def predict(self, docs: Iterable[Doc]) -> List[str]:
def predict(self, docs: Iterable[Doc]) -> ActivationsT:
"""Apply the pipeline's model to a batch of docs, without modifying them.
Returns the KB IDs for each entity in each doc, including NIL if there is
no prediction.
@ -440,13 +441,24 @@ class EntityLinker(TrainablePipe):
self.validate_kb()
entity_count = 0
final_kb_ids: List[str] = []
xp = self.model.ops.xp
ops = self.model.ops
xp = ops.xp
docs_ents: List[Ragged] = []
docs_scores: List[Ragged] = []
if not docs:
return final_kb_ids
return {
KNOWLEDGE_BASE_IDS: final_kb_ids,
"ents": docs_ents,
"scores": docs_scores,
}
if isinstance(docs, Doc):
docs = [docs]
for i, doc in enumerate(docs):
for doc in docs:
doc_ents: List[Ints1d] = []
doc_scores: List[Floats1d] = []
if len(doc) == 0:
docs_scores.append(Ragged(ops.alloc1f(0), ops.alloc1i(0)))
docs_ents.append(Ragged(xp.zeros(0, dtype="uint64"), ops.alloc1i(0)))
continue
sentences = [s for s in doc.sents]
@ -463,7 +475,8 @@ class EntityLinker(TrainablePipe):
batch_candidates = list(
self.get_candidates_batch(
self.kb, [ent_batch[idx] for idx in valid_ent_idx]
self.kb,
SpanGroup(doc, spans=[ent_batch[idx] for idx in valid_ent_idx]),
)
if self.candidates_batch_size > 1
else [
@ -500,19 +513,38 @@ class EntityLinker(TrainablePipe):
if ent.label_ in self.labels_discard:
# ignoring this entity - setting to NIL
final_kb_ids.append(self.NIL)
self._add_activations(
doc_scores=doc_scores,
doc_ents=doc_ents,
scores=[0.0],
ents=[0],
)
else:
candidates = list(batch_candidates[j])
if not candidates:
# no prediction possible for this entity - setting to NIL
final_kb_ids.append(self.NIL)
self._add_activations(
doc_scores=doc_scores,
doc_ents=doc_ents,
scores=[0.0],
ents=[0],
)
elif len(candidates) == 1 and self.threshold is None:
# shortcut for efficiency reasons: take the 1 candidate
final_kb_ids.append(candidates[0].entity_)
final_kb_ids.append(candidates[0].entity_id_)
self._add_activations(
doc_scores=doc_scores,
doc_ents=doc_ents,
scores=[1.0],
ents=[candidates[0].entity_id],
)
else:
random.shuffle(candidates)
# set all prior probabilities to 0 if incl_prior=False
prior_probs = xp.asarray([c.prior_prob for c in candidates])
if not self.incl_prior:
if self.incl_prior and self.kb.supports_prior_probs:
prior_probs = xp.asarray([c.prior_prob for c in candidates]) # type: ignore
else:
prior_probs = xp.asarray([0.0 for _ in candidates])
scores = prior_probs
# add in similarity from the context
@ -536,33 +568,57 @@ class EntityLinker(TrainablePipe):
raise ValueError(Errors.E161)
scores = prior_probs + sims - (prior_probs * sims)
final_kb_ids.append(
candidates[scores.argmax().item()].entity_
candidates[scores.argmax().item()].entity_id_
if self.threshold is None
or scores.max() >= self.threshold
else EntityLinker.NIL
)
self._add_activations(
doc_scores=doc_scores,
doc_ents=doc_ents,
scores=scores,
ents=[c.entity_id for c in candidates],
)
self._add_doc_activations(
docs_scores=docs_scores,
docs_ents=docs_ents,
doc_scores=doc_scores,
doc_ents=doc_ents,
)
if not (len(final_kb_ids) == entity_count):
err = Errors.E147.format(
method="predict", msg="result variables not of equal length"
)
raise RuntimeError(err)
return final_kb_ids
return {
KNOWLEDGE_BASE_IDS: final_kb_ids,
"ents": docs_ents,
"scores": docs_scores,
}
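The final candidate score combines the KB prior and the context similarity as prior + sim - prior * sim, so either signal alone can push a candidate towards the top. A toy NumPy illustration of that combination:

import numpy as np

prior_probs = np.array([0.1, 0.6])
sims = np.array([0.9, 0.2])
scores = prior_probs + sims - (prior_probs * sims)
print(scores)          # [0.91 0.68]
print(scores.argmax()) # 0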
def set_annotations(self, docs: Iterable[Doc], kb_ids: List[str]) -> None:
def set_annotations(self, docs: Iterable[Doc], activations: ActivationsT) -> None:
"""Modify a batch of documents, using pre-computed scores.
docs (Iterable[Doc]): The documents to modify.
kb_ids (List[str]): The IDs to set, produced by EntityLinker.predict.
activations (ActivationsT): The activations used for setting annotations, produced
by EntityLinker.predict.
DOCS: https://spacy.io/api/entitylinker#set_annotations
"""
kb_ids = cast(List[str], activations[KNOWLEDGE_BASE_IDS])
count_ents = len([ent for doc in docs for ent in doc.ents])
if count_ents != len(kb_ids):
raise ValueError(Errors.E148.format(ents=count_ents, ids=len(kb_ids)))
i = 0
overwrite = self.cfg["overwrite"]
for doc in docs:
for j, doc in enumerate(docs):
if self.save_activations:
doc.activations[self.name] = {}
for act_name, acts in activations.items():
if act_name != KNOWLEDGE_BASE_IDS:
# We only copy activations that are Ragged.
doc.activations[self.name][act_name] = cast(Ragged, acts[j])
for ent in doc.ents:
kb_id = kb_ids[i]
i += 1
@ -661,3 +717,32 @@ class EntityLinker(TrainablePipe):
def add_label(self, label):
raise NotImplementedError
def _add_doc_activations(
self,
*,
docs_scores: List[Ragged],
docs_ents: List[Ragged],
doc_scores: List[Floats1d],
doc_ents: List[Ints1d],
):
if not self.save_activations:
return
ops = self.model.ops
lengths = ops.asarray1i([s.shape[0] for s in doc_scores])
docs_scores.append(Ragged(ops.flatten(doc_scores), lengths))
docs_ents.append(Ragged(ops.flatten(doc_ents), lengths))
def _add_activations(
self,
*,
doc_scores: List[Floats1d],
doc_ents: List[Ints1d],
scores: Sequence[float],
ents: Sequence[int],
):
if not self.save_activations:
return
ops = self.model.ops
doc_scores.append(ops.asarray1f(scores))
doc_ents.append(ops.asarray1i(ents, dtype="uint64"))
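_add_doc_activations flattens the per-entity score and candidate-ID arrays of a Doc into single Ragged values whose lengths record how many candidates each entity had. A small sketch with toy data, assuming two entities with one and two candidates respectively:

from thinc.api import NumpyOps
from thinc.types import Ragged

ops = NumpyOps()
doc_scores = [ops.asarray1f([1.0]), ops.asarray1f([0.2, 0.8])]  # one row per entity
lengths = ops.asarray1i([s.shape[0] for s in doc_scores])
scores_ragged = Ragged(ops.flatten(doc_scores), lengths)
print(scores_ragged.data, scores_ragged.lengths)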

View File

@ -1,541 +0,0 @@
from typing import Optional, Union, List, Dict, Tuple, Iterable, Any, Callable, Sequence
import warnings
from collections import defaultdict
from pathlib import Path
import srsly
from .pipe import Pipe
from ..training import Example
from ..language import Language
from ..errors import Errors, Warnings
from ..util import ensure_path, to_disk, from_disk, SimpleFrozenList, registry
from ..tokens import Doc, Span
from ..matcher import Matcher, PhraseMatcher
from ..matcher.levenshtein import levenshtein_compare
from ..scorer import get_ner_prf
DEFAULT_ENT_ID_SEP = "||"
PatternType = Dict[str, Union[str, List[Dict[str, Any]]]]
@Language.factory(
"entity_ruler",
assigns=["doc.ents", "token.ent_type", "token.ent_iob"],
default_config={
"phrase_matcher_attr": None,
"matcher_fuzzy_compare": {"@misc": "spacy.levenshtein_compare.v1"},
"validate": False,
"overwrite_ents": False,
"ent_id_sep": DEFAULT_ENT_ID_SEP,
"scorer": {"@scorers": "spacy.entity_ruler_scorer.v1"},
},
default_score_weights={
"ents_f": 1.0,
"ents_p": 0.0,
"ents_r": 0.0,
"ents_per_type": None,
},
)
def make_entity_ruler(
nlp: Language,
name: str,
phrase_matcher_attr: Optional[Union[int, str]],
matcher_fuzzy_compare: Callable,
validate: bool,
overwrite_ents: bool,
ent_id_sep: str,
scorer: Optional[Callable],
):
return EntityRuler(
nlp,
name,
phrase_matcher_attr=phrase_matcher_attr,
matcher_fuzzy_compare=matcher_fuzzy_compare,
validate=validate,
overwrite_ents=overwrite_ents,
ent_id_sep=ent_id_sep,
scorer=scorer,
)
def entity_ruler_score(examples, **kwargs):
return get_ner_prf(examples)
@registry.scorers("spacy.entity_ruler_scorer.v1")
def make_entity_ruler_scorer():
return entity_ruler_score
class EntityRuler(Pipe):
"""The EntityRuler lets you add spans to the `Doc.ents` using token-based
rules or exact phrase matches. It can be combined with the statistical
`EntityRecognizer` to boost accuracy, or used on its own to implement a
purely rule-based entity recognition system. After initialization, the
component is typically added to the pipeline using `nlp.add_pipe`.
DOCS: https://spacy.io/api/entityruler
USAGE: https://spacy.io/usage/rule-based-matching#entityruler
"""
def __init__(
self,
nlp: Language,
name: str = "entity_ruler",
*,
phrase_matcher_attr: Optional[Union[int, str]] = None,
matcher_fuzzy_compare: Callable = levenshtein_compare,
validate: bool = False,
overwrite_ents: bool = False,
ent_id_sep: str = DEFAULT_ENT_ID_SEP,
patterns: Optional[List[PatternType]] = None,
scorer: Optional[Callable] = entity_ruler_score,
) -> None:
"""Initialize the entity ruler. If patterns are supplied here, they
need to be a list of dictionaries with a `"label"` and `"pattern"`
key. A pattern can either be a token pattern (list) or a phrase pattern
(string). For example: `{'label': 'ORG', 'pattern': 'Apple'}`.
nlp (Language): The shared nlp object to pass the vocab to the matchers
and process phrase patterns.
name (str): Instance name of the current pipeline component. Typically
passed in automatically from the factory when the component is
added. Used to disable the current entity ruler while creating
phrase patterns with the nlp object.
phrase_matcher_attr (int / str): Token attribute to match on, passed
to the internal PhraseMatcher as `attr`.
matcher_fuzzy_compare (Callable): The fuzzy comparison method for the
internal Matcher. Defaults to
spacy.matcher.levenshtein.levenshtein_compare.
validate (bool): Whether patterns should be validated, passed to
Matcher and PhraseMatcher as `validate`
patterns (iterable): Optional patterns to load in.
overwrite_ents (bool): If existing entities are present, e.g. entities
added by the model, overwrite them by matches if necessary.
ent_id_sep (str): Separator used internally for entity IDs.
scorer (Optional[Callable]): The scoring method. Defaults to
spacy.scorer.get_ner_prf.
DOCS: https://spacy.io/api/entityruler#init
"""
self.nlp = nlp
self.name = name
self.overwrite = overwrite_ents
self.token_patterns = defaultdict(list) # type: ignore
self.phrase_patterns = defaultdict(list) # type: ignore
self._validate = validate
self.matcher_fuzzy_compare = matcher_fuzzy_compare
self.matcher = Matcher(
nlp.vocab, validate=validate, fuzzy_compare=self.matcher_fuzzy_compare
)
self.phrase_matcher_attr = phrase_matcher_attr
self.phrase_matcher = PhraseMatcher(
nlp.vocab, attr=self.phrase_matcher_attr, validate=validate
)
self.ent_id_sep = ent_id_sep
self._ent_ids = defaultdict(tuple) # type: ignore
if patterns is not None:
self.add_patterns(patterns)
self.scorer = scorer
def __len__(self) -> int:
"""The number of all patterns added to the entity ruler."""
n_token_patterns = sum(len(p) for p in self.token_patterns.values())
n_phrase_patterns = sum(len(p) for p in self.phrase_patterns.values())
return n_token_patterns + n_phrase_patterns
def __contains__(self, label: str) -> bool:
"""Whether a label is present in the patterns."""
return label in self.token_patterns or label in self.phrase_patterns
def __call__(self, doc: Doc) -> Doc:
"""Find matches in document and add them as entities.
doc (Doc): The Doc object in the pipeline.
RETURNS (Doc): The Doc with added entities, if available.
DOCS: https://spacy.io/api/entityruler#call
"""
error_handler = self.get_error_handler()
try:
matches = self.match(doc)
self.set_annotations(doc, matches)
return doc
except Exception as e:
return error_handler(self.name, self, [doc], e)
def match(self, doc: Doc):
self._require_patterns()
with warnings.catch_warnings():
warnings.filterwarnings("ignore", message="\\[W036")
matches = list(self.matcher(doc)) + list(self.phrase_matcher(doc))
final_matches = set(
[(m_id, start, end) for m_id, start, end in matches if start != end]
)
get_sort_key = lambda m: (m[2] - m[1], -m[1])
final_matches = sorted(final_matches, key=get_sort_key, reverse=True)
return final_matches
def set_annotations(self, doc, matches):
"""Modify the document in place"""
entities = list(doc.ents)
new_entities = []
seen_tokens = set()
for match_id, start, end in matches:
if any(t.ent_type for t in doc[start:end]) and not self.overwrite:
continue
# check for end - 1 here because boundaries are inclusive
if start not in seen_tokens and end - 1 not in seen_tokens:
if match_id in self._ent_ids:
label, ent_id = self._ent_ids[match_id]
span = Span(doc, start, end, label=label, span_id=ent_id)
else:
span = Span(doc, start, end, label=match_id)
new_entities.append(span)
entities = [
e for e in entities if not (e.start < end and e.end > start)
]
seen_tokens.update(range(start, end))
doc.ents = entities + new_entities
@property
def labels(self) -> Tuple[str, ...]:
"""All labels present in the match patterns.
RETURNS (set): The string labels.
DOCS: https://spacy.io/api/entityruler#labels
"""
keys = set(self.token_patterns.keys())
keys.update(self.phrase_patterns.keys())
all_labels = set()
for l in keys:
if self.ent_id_sep in l:
label, _ = self._split_label(l)
all_labels.add(label)
else:
all_labels.add(l)
return tuple(sorted(all_labels))
def initialize(
self,
get_examples: Callable[[], Iterable[Example]],
*,
nlp: Optional[Language] = None,
patterns: Optional[Sequence[PatternType]] = None,
):
"""Initialize the pipe for training.
get_examples (Callable[[], Iterable[Example]]): Function that
returns a representative sample of gold-standard Example objects.
nlp (Language): The current nlp object the component is part of.
patterns Optional[Iterable[PatternType]]: The list of patterns.
DOCS: https://spacy.io/api/entityruler#initialize
"""
self.clear()
if patterns:
self.add_patterns(patterns) # type: ignore[arg-type]
@property
def ent_ids(self) -> Tuple[Optional[str], ...]:
"""All entity ids present in the match patterns `id` properties
RETURNS (set): The string entity ids.
DOCS: https://spacy.io/api/entityruler#ent_ids
"""
keys = set(self.token_patterns.keys())
keys.update(self.phrase_patterns.keys())
all_ent_ids = set()
for l in keys:
if self.ent_id_sep in l:
_, ent_id = self._split_label(l)
all_ent_ids.add(ent_id)
return tuple(all_ent_ids)
@property
def patterns(self) -> List[PatternType]:
"""Get all patterns that were added to the entity ruler.
RETURNS (list): The original patterns, one dictionary per pattern.
DOCS: https://spacy.io/api/entityruler#patterns
"""
all_patterns = []
for label, patterns in self.token_patterns.items():
for pattern in patterns:
ent_label, ent_id = self._split_label(label)
p = {"label": ent_label, "pattern": pattern}
if ent_id:
p["id"] = ent_id
all_patterns.append(p)
for label, patterns in self.phrase_patterns.items():
for pattern in patterns:
ent_label, ent_id = self._split_label(label)
p = {"label": ent_label, "pattern": pattern.text}
if ent_id:
p["id"] = ent_id
all_patterns.append(p)
return all_patterns
def add_patterns(self, patterns: List[PatternType]) -> None:
"""Add patterns to the entity ruler. A pattern can either be a token
pattern (list of dicts) or a phrase pattern (string). For example:
{'label': 'ORG', 'pattern': 'Apple'}
{'label': 'GPE', 'pattern': [{'lower': 'san'}, {'lower': 'francisco'}]}
patterns (list): The patterns to add.
DOCS: https://spacy.io/api/entityruler#add_patterns
"""
# disable the nlp components after this one in case they hadn't been initialized / deserialised yet
try:
current_index = -1
for i, (name, pipe) in enumerate(self.nlp.pipeline):
if self == pipe:
current_index = i
break
subsequent_pipes = [pipe for pipe in self.nlp.pipe_names[current_index:]]
except ValueError:
subsequent_pipes = []
with self.nlp.select_pipes(disable=subsequent_pipes):
token_patterns = []
phrase_pattern_labels = []
phrase_pattern_texts = []
phrase_pattern_ids = []
for entry in patterns:
if isinstance(entry["pattern"], str):
phrase_pattern_labels.append(entry["label"])
phrase_pattern_texts.append(entry["pattern"])
phrase_pattern_ids.append(entry.get("id"))
elif isinstance(entry["pattern"], list):
token_patterns.append(entry)
phrase_patterns = []
for label, pattern, ent_id in zip(
phrase_pattern_labels,
self.nlp.pipe(phrase_pattern_texts),
phrase_pattern_ids,
):
phrase_pattern = {"label": label, "pattern": pattern}
if ent_id:
phrase_pattern["id"] = ent_id
phrase_patterns.append(phrase_pattern)
for entry in token_patterns + phrase_patterns: # type: ignore[operator]
label = entry["label"] # type: ignore
if "id" in entry:
ent_label = label
label = self._create_label(label, entry["id"])
key = self.matcher._normalize_key(label)
self._ent_ids[key] = (ent_label, entry["id"])
pattern = entry["pattern"] # type: ignore
if isinstance(pattern, Doc):
self.phrase_patterns[label].append(pattern)
self.phrase_matcher.add(label, [pattern]) # type: ignore
elif isinstance(pattern, list):
self.token_patterns[label].append(pattern)
self.matcher.add(label, [pattern])
else:
raise ValueError(Errors.E097.format(pattern=pattern))
def clear(self) -> None:
"""Reset all patterns."""
self.token_patterns = defaultdict(list)
self.phrase_patterns = defaultdict(list)
self._ent_ids = defaultdict(tuple)
self.matcher = Matcher(
self.nlp.vocab,
validate=self._validate,
fuzzy_compare=self.matcher_fuzzy_compare,
)
self.phrase_matcher = PhraseMatcher(
self.nlp.vocab, attr=self.phrase_matcher_attr, validate=self._validate
)
def remove(self, ent_id: str) -> None:
"""Remove a pattern by its ent_id if a pattern with this ent_id was added before
ent_id (str): id of the pattern to be removed
RETURNS: None
DOCS: https://spacy.io/api/entityruler#remove
"""
label_id_pairs = [
(label, eid) for (label, eid) in self._ent_ids.values() if eid == ent_id
]
if not label_id_pairs:
raise ValueError(
Errors.E1024.format(attr_type="ID", label=ent_id, component=self.name)
)
created_labels = [
self._create_label(label, eid) for (label, eid) in label_id_pairs
]
# remove the patterns from self.phrase_patterns
self.phrase_patterns = defaultdict(
list,
{
label: val
for (label, val) in self.phrase_patterns.items()
if label not in created_labels
},
)
# remove the patterns from self.token_patterns
self.token_patterns = defaultdict(
list,
{
label: val
for (label, val) in self.token_patterns.items()
if label not in created_labels
},
)
# remove the patterns from the matcher and phrase matcher
for label in created_labels:
if label in self.phrase_matcher:
self.phrase_matcher.remove(label)
else:
self.matcher.remove(label)
def _require_patterns(self) -> None:
"""Raise a warning if this component has no patterns defined."""
if len(self) == 0:
warnings.warn(Warnings.W036.format(name=self.name))
def _split_label(self, label: str) -> Tuple[str, Optional[str]]:
"""Split Entity label into ent_label and ent_id if it contains self.ent_id_sep
label (str): The value of label in a pattern entry
RETURNS (tuple): ent_label, ent_id
"""
if self.ent_id_sep in label:
ent_label, ent_id = label.rsplit(self.ent_id_sep, 1)
else:
ent_label = label
ent_id = None # type: ignore
return ent_label, ent_id
def _create_label(self, label: Any, ent_id: Any) -> str:
"""Join Entity label with ent_id if the pattern has an `id` attribute
If ent_id is not a string, the label is returned as is.
label (str): The label to set for ent.label_
ent_id (str): The entity ID to append to the label
RETURNS (str): The ent_label joined with configured `ent_id_sep`
"""
if isinstance(ent_id, str):
label = f"{label}{self.ent_id_sep}{ent_id}"
return label
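_create_label and _split_label are inverses around the configured separator; with the default "||" the round trip looks like this (toy values):

label = "ORG" + "||" + "apple-inc"           # _create_label("ORG", "apple-inc")
ent_label, ent_id = label.rsplit("||", 1)    # _split_label(label)
print(ent_label, ent_id)                     # ORG apple-inc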
def from_bytes(
self, patterns_bytes: bytes, *, exclude: Iterable[str] = SimpleFrozenList()
) -> "EntityRuler":
"""Load the entity ruler from a bytestring.
patterns_bytes (bytes): The bytestring to load.
RETURNS (EntityRuler): The loaded entity ruler.
DOCS: https://spacy.io/api/entityruler#from_bytes
"""
cfg = srsly.msgpack_loads(patterns_bytes)
self.clear()
if isinstance(cfg, dict):
self.add_patterns(cfg.get("patterns", cfg))
self.overwrite = cfg.get("overwrite", False)
self.phrase_matcher_attr = cfg.get("phrase_matcher_attr", None)
self.phrase_matcher = PhraseMatcher(
self.nlp.vocab,
attr=self.phrase_matcher_attr,
)
self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
else:
self.add_patterns(cfg)
return self
def to_bytes(self, *, exclude: Iterable[str] = SimpleFrozenList()) -> bytes:
"""Serialize the entity ruler patterns to a bytestring.
RETURNS (bytes): The serialized patterns.
DOCS: https://spacy.io/api/entityruler#to_bytes
"""
serial = {
"overwrite": self.overwrite,
"ent_id_sep": self.ent_id_sep,
"phrase_matcher_attr": self.phrase_matcher_attr,
"patterns": self.patterns,
}
return srsly.msgpack_dumps(serial)
def from_disk(
self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
) -> "EntityRuler":
"""Load the entity ruler from a file. Expects a file containing
newline-delimited JSON (JSONL) with one entry per line.
path (str / Path): The JSONL file to load.
RETURNS (EntityRuler): The loaded entity ruler.
DOCS: https://spacy.io/api/entityruler#from_disk
"""
path = ensure_path(path)
self.clear()
depr_patterns_path = path.with_suffix(".jsonl")
if path.suffix == ".jsonl": # user provides a jsonl
if path.is_file():
patterns = srsly.read_jsonl(path)
self.add_patterns(patterns)
else:
raise ValueError(Errors.E1023.format(path=path))
elif depr_patterns_path.is_file():
patterns = srsly.read_jsonl(depr_patterns_path)
self.add_patterns(patterns)
elif path.is_dir(): # path is a valid directory
cfg = {}
deserializers_patterns = {
"patterns": lambda p: self.add_patterns(
srsly.read_jsonl(p.with_suffix(".jsonl"))
)
}
deserializers_cfg = {"cfg": lambda p: cfg.update(srsly.read_json(p))}
from_disk(path, deserializers_cfg, {})
self.overwrite = cfg.get("overwrite", False)
self.phrase_matcher_attr = cfg.get("phrase_matcher_attr")
self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
self.phrase_matcher = PhraseMatcher(
self.nlp.vocab, attr=self.phrase_matcher_attr
)
from_disk(path, deserializers_patterns, {})
else: # path is not a valid directory or file
raise ValueError(Errors.E146.format(path=path))
return self
def to_disk(
self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
) -> None:
"""Save the entity ruler patterns to a directory. The patterns will be
saved as newline-delimited JSON (JSONL).
path (str / Path): The JSONL file to save.
DOCS: https://spacy.io/api/entityruler#to_disk
"""
path = ensure_path(path)
cfg = {
"overwrite": self.overwrite,
"phrase_matcher_attr": self.phrase_matcher_attr,
"ent_id_sep": self.ent_id_sep,
}
serializers = {
"patterns": lambda p: srsly.write_jsonl(
p.with_suffix(".jsonl"), self.patterns
),
"cfg": lambda p: srsly.write_json(p, cfg),
}
if path.suffix == ".jsonl": # user wants to save only JSONL
srsly.write_jsonl(path, self.patterns)
else:
to_disk(path, serializers, {})

View File

@ -1,3 +0,0 @@
from .entity_linker import EntityLinker_v1
__all__ = ["EntityLinker_v1"]

View File

@ -1,422 +0,0 @@
# This file is present to provide a prior version of the EntityLinker component
# for backwards compatibility. For details see #9669.
from typing import Optional, Iterable, Callable, Dict, Union, List, Any
from thinc.types import Floats2d
from pathlib import Path
from itertools import islice
import srsly
import random
from thinc.api import CosineDistance, Model, Optimizer
from thinc.api import set_dropout_rate
import warnings
from ...kb import KnowledgeBase, Candidate
from ...ml import empty_kb
from ...tokens import Doc, Span
from ..pipe import deserialize_config
from ..trainable_pipe import TrainablePipe
from ...language import Language
from ...vocab import Vocab
from ...training import Example, validate_examples, validate_get_examples
from ...errors import Errors, Warnings
from ...util import SimpleFrozenList
from ... import util
from ...scorer import Scorer
# See #9050
BACKWARD_OVERWRITE = True
def entity_linker_score(examples, **kwargs):
return Scorer.score_links(examples, negative_labels=[EntityLinker_v1.NIL], **kwargs)
class EntityLinker_v1(TrainablePipe):
"""Pipeline component for named entity linking.
DOCS: https://spacy.io/api/entitylinker
"""
NIL = "NIL" # string used to refer to a non-existing link
def __init__(
self,
vocab: Vocab,
model: Model,
name: str = "entity_linker",
*,
labels_discard: Iterable[str],
n_sents: int,
incl_prior: bool,
incl_context: bool,
entity_vector_length: int,
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
overwrite: bool = BACKWARD_OVERWRITE,
scorer: Optional[Callable] = entity_linker_score,
) -> None:
"""Initialize an entity linker.
vocab (Vocab): The shared vocabulary.
model (thinc.api.Model): The Thinc Model powering the pipeline component.
name (str): The component instance name, used to add entries to the
losses during training.
labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
n_sents (int): The number of neighbouring sentences to take into account.
incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
incl_context (bool): Whether or not to include the local context in the model.
entity_vector_length (int): Size of encoding vectors in the KB.
get_candidates (Callable[[KnowledgeBase, Span], Iterable[Candidate]]): Function that
produces a list of candidates, given a certain knowledge base and a textual mention.
scorer (Optional[Callable]): The scoring method. Defaults to Scorer.score_links.
DOCS: https://spacy.io/api/entitylinker#init
"""
self.vocab = vocab
self.model = model
self.name = name
self.labels_discard = list(labels_discard)
self.n_sents = n_sents
self.incl_prior = incl_prior
self.incl_context = incl_context
self.get_candidates = get_candidates
self.cfg: Dict[str, Any] = {"overwrite": overwrite}
self.distance = CosineDistance(normalize=False)
# how many neighbour sentences to take into account
# create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
self.kb = empty_kb(entity_vector_length)(self.vocab)
self.scorer = scorer
def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
"""Define the KB of this pipe by providing a function that will
create it using this object's vocab."""
if not callable(kb_loader):
raise ValueError(Errors.E885.format(arg_type=type(kb_loader)))
self.kb = kb_loader(self.vocab)
def validate_kb(self) -> None:
# Raise an error if the knowledge base is not initialized.
if self.kb is None:
raise ValueError(Errors.E1018.format(name=self.name))
if len(self.kb) == 0:
raise ValueError(Errors.E139.format(name=self.name))
def initialize(
self,
get_examples: Callable[[], Iterable[Example]],
*,
nlp: Optional[Language] = None,
kb_loader: Optional[Callable[[Vocab], KnowledgeBase]] = None,
):
"""Initialize the pipe for training, using a representative set
of data examples.
get_examples (Callable[[], Iterable[Example]]): Function that
returns a representative sample of gold-standard Example objects.
nlp (Language): The current nlp object the component is part of.
kb_loader (Callable[[Vocab], KnowledgeBase]): A function that creates an InMemoryLookupKB from a Vocab instance.
Note that providing this argument will overwrite all data accumulated in the current KB.
Use this only when loading a KB as-such from file.
DOCS: https://spacy.io/api/entitylinker#initialize
"""
validate_get_examples(get_examples, "EntityLinker_v1.initialize")
if kb_loader is not None:
self.set_kb(kb_loader)
self.validate_kb()
nO = self.kb.entity_vector_length
doc_sample = []
vector_sample = []
for example in islice(get_examples(), 10):
doc_sample.append(example.x)
vector_sample.append(self.model.ops.alloc1f(nO))
assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
assert len(vector_sample) > 0, Errors.E923.format(name=self.name)
self.model.initialize(
X=doc_sample, Y=self.model.ops.asarray(vector_sample, dtype="float32")
)
def update(
self,
examples: Iterable[Example],
*,
drop: float = 0.0,
sgd: Optional[Optimizer] = None,
losses: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
"""Learn from a batch of documents and gold-standard information,
updating the pipe's model. Delegates to predict and get_loss.
examples (Iterable[Example]): A batch of Example objects.
drop (float): The dropout rate.
sgd (thinc.api.Optimizer): The optimizer.
losses (Dict[str, float]): Optional record of the loss during training.
Updated using the component name as the key.
RETURNS (Dict[str, float]): The updated losses dictionary.
DOCS: https://spacy.io/api/entitylinker#update
"""
self.validate_kb()
if losses is None:
losses = {}
losses.setdefault(self.name, 0.0)
if not examples:
return losses
validate_examples(examples, "EntityLinker_v1.update")
sentence_docs = []
for eg in examples:
sentences = [s for s in eg.reference.sents]
kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
for ent in eg.reference.ents:
# KB ID of the first token is the same as the whole span
kb_id = kb_ids[ent.start]
if kb_id:
try:
# find the sentence in the list of sentences.
sent_index = sentences.index(ent.sent)
except AttributeError:
# Catch the exception when ent.sent is None and provide a user-friendly warning
raise RuntimeError(Errors.E030) from None
# get n previous sentences, if there are any
start_sentence = max(0, sent_index - self.n_sents)
# get n posterior sentences, or as many < n as there are
end_sentence = min(len(sentences) - 1, sent_index + self.n_sents)
# get token positions
start_token = sentences[start_sentence].start
end_token = sentences[end_sentence].end
# append that span as a doc to training
sent_doc = eg.predicted[start_token:end_token].as_doc()
sentence_docs.append(sent_doc)
set_dropout_rate(self.model, drop)
if not sentence_docs:
warnings.warn(Warnings.W093.format(name="Entity Linker"))
return losses
sentence_encodings, bp_context = self.model.begin_update(sentence_docs)
loss, d_scores = self.get_loss(
sentence_encodings=sentence_encodings, examples=examples
)
bp_context(d_scores)
if sgd is not None:
self.finish_update(sgd)
losses[self.name] += loss
return losses
def get_loss(self, examples: Iterable[Example], sentence_encodings: Floats2d):
validate_examples(examples, "EntityLinker_v1.get_loss")
entity_encodings = []
for eg in examples:
kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
for ent in eg.reference.ents:
kb_id = kb_ids[ent.start]
if kb_id:
entity_encoding = self.kb.get_vector(kb_id)
entity_encodings.append(entity_encoding)
entity_encodings = self.model.ops.asarray2f(entity_encodings)
if sentence_encodings.shape != entity_encodings.shape:
err = Errors.E147.format(
method="get_loss", msg="gold entities do not match up"
)
raise RuntimeError(err)
gradients = self.distance.get_grad(sentence_encodings, entity_encodings)
loss = self.distance.get_loss(sentence_encodings, entity_encodings)
loss = loss / len(entity_encodings)
return float(loss), gradients
def predict(self, docs: Iterable[Doc]) -> List[str]:
"""Apply the pipeline's model to a batch of docs, without modifying them.
Returns the KB IDs for each entity in each doc, including NIL if there is
no prediction.
docs (Iterable[Doc]): The documents to predict.
RETURNS (List[str]): The model's prediction for each document.
DOCS: https://spacy.io/api/entitylinker#predict
"""
self.validate_kb()
entity_count = 0
final_kb_ids: List[str] = []
if not docs:
return final_kb_ids
if isinstance(docs, Doc):
docs = [docs]
for i, doc in enumerate(docs):
sentences = [s for s in doc.sents]
if len(doc) > 0:
# Looping through each entity (TODO: rewrite)
for ent in doc.ents:
sent = ent.sent
sent_index = sentences.index(sent)
assert sent_index >= 0
# get n_neighbour sentences, clipped to the length of the document
start_sentence = max(0, sent_index - self.n_sents)
end_sentence = min(len(sentences) - 1, sent_index + self.n_sents)
start_token = sentences[start_sentence].start
end_token = sentences[end_sentence].end
sent_doc = doc[start_token:end_token].as_doc()
# currently, the context is the same for each entity in a sentence (should be refined)
xp = self.model.ops.xp
if self.incl_context:
sentence_encoding = self.model.predict([sent_doc])[0]
sentence_encoding_t = sentence_encoding.T
sentence_norm = xp.linalg.norm(sentence_encoding_t)
entity_count += 1
if ent.label_ in self.labels_discard:
# ignoring this entity - setting to NIL
final_kb_ids.append(self.NIL)
else:
candidates = list(self.get_candidates(self.kb, ent))
if not candidates:
# no prediction possible for this entity - setting to NIL
final_kb_ids.append(self.NIL)
elif len(candidates) == 1:
# shortcut for efficiency reasons: take the 1 candidate
final_kb_ids.append(candidates[0].entity_)
else:
random.shuffle(candidates)
# set all prior probabilities to 0 if incl_prior=False
prior_probs = xp.asarray([c.prior_prob for c in candidates])
if not self.incl_prior:
prior_probs = xp.asarray([0.0 for _ in candidates])
scores = prior_probs
# add in similarity from the context
if self.incl_context:
entity_encodings = xp.asarray(
[c.entity_vector for c in candidates]
)
entity_norm = xp.linalg.norm(entity_encodings, axis=1)
if len(entity_encodings) != len(prior_probs):
raise RuntimeError(
Errors.E147.format(
method="predict",
msg="vectors not of equal length",
)
)
# cosine similarity
sims = xp.dot(entity_encodings, sentence_encoding_t) / (
sentence_norm * entity_norm
)
if sims.shape != prior_probs.shape:
raise ValueError(Errors.E161)
scores = prior_probs + sims - (prior_probs * sims)
best_index = scores.argmax().item()
best_candidate = candidates[best_index]
final_kb_ids.append(best_candidate.entity_)
if not (len(final_kb_ids) == entity_count):
err = Errors.E147.format(
method="predict", msg="result variables not of equal length"
)
raise RuntimeError(err)
return final_kb_ids
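The context similarity used above is a plain cosine between each candidate's entity vector and the sentence encoding. A NumPy sketch with toy vectors:

import numpy as np

entity_encodings = np.array([[1.0, 0.0], [0.5, 0.5]])
sentence_encoding = np.array([1.0, 1.0])
sentence_norm = np.linalg.norm(sentence_encoding)
entity_norm = np.linalg.norm(entity_encodings, axis=1)
sims = entity_encodings @ sentence_encoding / (sentence_norm * entity_norm)
print(sims)  # approximately [0.707 1.0]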
def set_annotations(self, docs: Iterable[Doc], kb_ids: List[str]) -> None:
"""Modify a batch of documents, using pre-computed scores.
docs (Iterable[Doc]): The documents to modify.
kb_ids (List[str]): The IDs to set, produced by EntityLinker.predict.
DOCS: https://spacy.io/api/entitylinker#set_annotations
"""
count_ents = len([ent for doc in docs for ent in doc.ents])
if count_ents != len(kb_ids):
raise ValueError(Errors.E148.format(ents=count_ents, ids=len(kb_ids)))
i = 0
overwrite = self.cfg["overwrite"]
for doc in docs:
for ent in doc.ents:
kb_id = kb_ids[i]
i += 1
for token in ent:
if token.ent_kb_id == 0 or overwrite:
token.ent_kb_id_ = kb_id
def to_bytes(self, *, exclude=tuple()):
"""Serialize the pipe to a bytestring.
exclude (Iterable[str]): String names of serialization fields to exclude.
RETURNS (bytes): The serialized object.
DOCS: https://spacy.io/api/entitylinker#to_bytes
"""
self._validate_serialization_attrs()
serialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serialize["kb"] = self.kb.to_bytes
serialize["model"] = self.model.to_bytes
return util.to_bytes(serialize, exclude)
def from_bytes(self, bytes_data, *, exclude=tuple()):
"""Load the pipe from a bytestring.
exclude (Iterable[str]): String names of serialization fields to exclude.
RETURNS (TrainablePipe): The loaded object.
DOCS: https://spacy.io/api/entitylinker#from_bytes
"""
self._validate_serialization_attrs()
def load_model(b):
try:
self.model.from_bytes(b)
except AttributeError:
raise ValueError(Errors.E149) from None
deserialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
deserialize["kb"] = lambda b: self.kb.from_bytes(b)
deserialize["model"] = load_model
util.from_bytes(bytes_data, deserialize, exclude)
return self
def to_disk(
self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
) -> None:
"""Serialize the pipe to disk.
path (str / Path): Path to a directory.
exclude (Iterable[str]): String names of serialization fields to exclude.
DOCS: https://spacy.io/api/entitylinker#to_disk
"""
serialize = {}
serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["kb"] = lambda p: self.kb.to_disk(p)
serialize["model"] = lambda p: self.model.to_disk(p)
util.to_disk(path, serialize, exclude)
def from_disk(
self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
) -> "EntityLinker_v1":
"""Load the pipe from disk. Modifies the object in place and returns it.
path (str / Path): Path to a directory.
exclude (Iterable[str]): String names of serialization fields to exclude.
RETURNS (EntityLinker): The modified EntityLinker object.
DOCS: https://spacy.io/api/entitylinker#from_disk
"""
def load_model(p):
try:
with p.open("rb") as infile:
self.model.from_bytes(infile.read())
except AttributeError:
raise ValueError(Errors.E149) from None
deserialize: Dict[str, Callable[[Any], Any]] = {}
deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
deserialize["kb"] = lambda p: self.kb.from_disk(p)
deserialize["model"] = load_model
util.from_disk(path, deserialize, exclude)
return self
def rehearse(self, examples, *, sgd=None, losses=None, **config):
raise NotImplementedError
def add_label(self, label):
raise NotImplementedError

View File

@ -1,7 +1,9 @@
# cython: infer_types=True, profile=True, binding=True
from typing import Optional, Union, Dict, Callable
from typing import Callable, Dict, Iterable, List, Optional, Union
import srsly
from thinc.api import SequenceCategoricalCrossentropy, Model, Config
from thinc.api import Model, Config
from thinc.legacy import LegacySequenceCategoricalCrossentropy
from thinc.types import Floats2d, Ints1d
from itertools import islice
from ..tokens.doc cimport Doc
@ -13,7 +15,7 @@ from ..symbols import POS
from ..language import Language
from ..errors import Errors
from .pipe import deserialize_config
from .tagger import Tagger
from .tagger import ActivationsT, Tagger
from .. import util
from ..scorer import Scorer
from ..training import validate_examples, validate_get_examples
@ -52,8 +54,14 @@ DEFAULT_MORPH_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"morphologizer",
assigns=["token.morph", "token.pos"],
default_config={"model": DEFAULT_MORPH_MODEL, "overwrite": True, "extend": False,
"scorer": {"@scorers": "spacy.morphologizer_scorer.v1"}, "label_smoothing": 0.0},
default_config={
"model": DEFAULT_MORPH_MODEL,
"overwrite": True,
"extend": False,
"scorer": {"@scorers": "spacy.morphologizer_scorer.v1"},
"label_smoothing": 0.0,
"save_activations": False,
},
default_score_weights={"pos_acc": 0.5, "morph_acc": 0.5, "morph_per_feat": None},
)
def make_morphologizer(
@ -64,8 +72,10 @@ def make_morphologizer(
extend: bool,
label_smoothing: float,
scorer: Optional[Callable],
save_activations: bool,
):
return Morphologizer(nlp.vocab, model, name, overwrite=overwrite, extend=extend, label_smoothing=label_smoothing, scorer=scorer)
return Morphologizer(nlp.vocab, model, name, overwrite=overwrite, extend=extend, label_smoothing=label_smoothing, scorer=scorer,
save_activations=save_activations)
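The new save_activations flag is exposed through the factory's default_config, so it can be toggled like any other component setting. A hedged usage sketch on a blank pipeline:

import spacy

nlp = spacy.blank("en")
morphologizer = nlp.add_pipe("morphologizer", config={"save_activations": True})
print(morphologizer.save_activations)  # True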
def morphologizer_score(examples, **kwargs):
@ -98,6 +108,7 @@ class Morphologizer(Tagger):
extend: bool = BACKWARD_EXTEND,
label_smoothing: float = 0.0,
scorer: Optional[Callable] = morphologizer_score,
save_activations: bool = False,
):
"""Initialize a morphologizer.
@ -105,9 +116,12 @@ class Morphologizer(Tagger):
model (thinc.api.Model): The Thinc Model powering the pipeline component.
name (str): The component instance name, used to add entries to the
losses during training.
overwrite (bool): Whether to overwrite existing annotations.
extend (bool): Whether to extend existing annotations.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_token_attr for the attributes "pos" and "morph" and
Scorer.score_token_attr_per_feat for the attribute "morph".
save_activations (bool): save model activations in Doc when annotating.
DOCS: https://spacy.io/api/morphologizer#init
"""
@ -128,10 +142,11 @@ class Morphologizer(Tagger):
}
self.cfg = dict(sorted(cfg.items()))
self.scorer = scorer
self.save_activations = save_activations
@property
def labels(self):
"""RETURNS (Tuple[str]): The labels currently added to the component."""
"""RETURNS (Iterable[str]): The labels currently added to the component."""
return tuple(self.cfg["labels_morph"].keys())
@property
@ -155,7 +170,7 @@ class Morphologizer(Tagger):
# normalize label
norm_label = self.vocab.morphology.normalize_features(label)
# extract separate POS and morph tags
label_dict = Morphology.feats_to_dict(label)
label_dict = Morphology.feats_to_dict(label, sort_values=False)
pos = label_dict.get(self.POS_FEAT, "")
if self.POS_FEAT in label_dict:
label_dict.pop(self.POS_FEAT)
@ -193,7 +208,7 @@ class Morphologizer(Tagger):
continue
morph = str(token.morph)
# create and add the combined morph+POS label
morph_dict = Morphology.feats_to_dict(morph)
morph_dict = Morphology.feats_to_dict(morph, sort_values=False)
if pos:
morph_dict[self.POS_FEAT] = pos
norm_label = self.vocab.strings[self.vocab.morphology.add(morph_dict)]
@ -210,7 +225,7 @@ class Morphologizer(Tagger):
for i, token in enumerate(example.reference):
pos = token.pos_
morph = str(token.morph)
morph_dict = Morphology.feats_to_dict(morph)
morph_dict = Morphology.feats_to_dict(morph, sort_values=False)
if pos:
morph_dict[self.POS_FEAT] = pos
norm_label = self.vocab.strings[self.vocab.morphology.add(morph_dict)]
@ -221,40 +236,48 @@ class Morphologizer(Tagger):
assert len(label_sample) > 0, Errors.E923.format(name=self.name)
self.model.initialize(X=doc_sample, Y=label_sample)
def set_annotations(self, docs, batch_tag_ids):
def set_annotations(self, docs: Iterable[Doc], activations: ActivationsT):
"""Modify a batch of documents, using pre-computed scores.
docs (Iterable[Doc]): The documents to modify.
batch_tag_ids: The IDs to set, produced by Morphologizer.predict.
activations (ActivationsT): The activations used for setting annotations, produced by Morphologizer.predict.
DOCS: https://spacy.io/api/morphologizer#set_annotations
"""
batch_tag_ids = activations["label_ids"]
if isinstance(docs, Doc):
docs = [docs]
cdef Doc doc
cdef Vocab vocab = self.vocab
cdef bint overwrite = self.cfg["overwrite"]
cdef bint extend = self.cfg["extend"]
labels = self.labels
# We require random access for the upcoming ops, so we need
# to allocate a compatible container out of the iterable.
labels = tuple(self.labels)
for i, doc in enumerate(docs):
if self.save_activations:
doc.activations[self.name] = {}
for act_name, acts in activations.items():
doc.activations[self.name][act_name] = acts[i]
doc_tag_ids = batch_tag_ids[i]
if hasattr(doc_tag_ids, "get"):
doc_tag_ids = doc_tag_ids.get()
for j, tag_id in enumerate(doc_tag_ids):
morph = labels[tag_id]
morph = labels[int(tag_id)]
# set morph
if doc.c[j].morph == 0 or overwrite or extend:
if overwrite and extend:
# morphologizer morph overwrites any existing features
# while extending
extended_morph = Morphology.feats_to_dict(self.vocab.strings[doc.c[j].morph])
extended_morph.update(Morphology.feats_to_dict(self.cfg["labels_morph"].get(morph, 0)))
extended_morph = Morphology.feats_to_dict(self.vocab.strings[doc.c[j].morph], sort_values=False)
extended_morph.update(Morphology.feats_to_dict(self.cfg["labels_morph"].get(morph, 0), sort_values=False))
doc.c[j].morph = self.vocab.morphology.add(extended_morph)
elif extend:
# existing features are preserved and any new features
# are added
extended_morph = Morphology.feats_to_dict(self.cfg["labels_morph"].get(morph, 0))
extended_morph.update(Morphology.feats_to_dict(self.vocab.strings[doc.c[j].morph]))
extended_morph = Morphology.feats_to_dict(self.cfg["labels_morph"].get(morph, 0), sort_values=False)
extended_morph.update(Morphology.feats_to_dict(self.vocab.strings[doc.c[j].morph], sort_values=False))
doc.c[j].morph = self.vocab.morphology.add(extended_morph)
else:
# clobber
@ -274,8 +297,8 @@ class Morphologizer(Tagger):
DOCS: https://spacy.io/api/morphologizer#get_loss
"""
validate_examples(examples, "Morphologizer.get_loss")
loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False,
label_smoothing=self.cfg["label_smoothing"])
loss_func = LegacySequenceCategoricalCrossentropy(names=self.labels, normalize=False,
label_smoothing=self.cfg["label_smoothing"])
truths = []
for eg in examples:
eg_truths = []
@ -296,7 +319,7 @@ class Morphologizer(Tagger):
label = None
# Otherwise, generate the combined label
else:
label_dict = Morphology.feats_to_dict(morph)
label_dict = Morphology.feats_to_dict(morph, sort_values=False)
if pos:
label_dict[self.POS_FEAT] = pos
label = self.vocab.strings[self.vocab.morphology.add(label_dict)]
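In set_annotations' overwrite-and-extend branch, the existing analysis is converted to a feature dict and then updated with the predicted label, so predicted features win while unrelated existing features are kept. A plain-dict sketch of that merge, with toy feature values:

existing = {"Case": "Nom", "Number": "Sing"}   # feats_to_dict(existing token morph)
predicted = {"Number": "Plur", "Person": "3"}  # feats_to_dict(predicted label)

merged = dict(existing)
merged.update(predicted)   # predicted values overwrite, the rest is extended
print(merged)              # {'Case': 'Nom', 'Number': 'Plur', 'Person': '3'}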

View File

@ -1,221 +0,0 @@
# cython: infer_types=True, profile=True, binding=True
from typing import Optional
import numpy
from thinc.api import CosineDistance, to_categorical, Model, Config
from thinc.api import set_dropout_rate
from ..tokens.doc cimport Doc
from .trainable_pipe import TrainablePipe
from .tagger import Tagger
from ..training import validate_examples
from ..language import Language
from ._parser_internals import nonproj
from ..attrs import POS, ID
from ..errors import Errors
default_model_config = """
[model]
@architectures = "spacy.MultiTask.v1"
maxout_pieces = 3
token_vector_width = 96
[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 2
subword_features = true
"""
DEFAULT_MT_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"nn_labeller",
default_config={"labels": None, "target": "dep_tag_offset", "model": DEFAULT_MT_MODEL}
)
def make_nn_labeller(nlp: Language, name: str, model: Model, labels: Optional[dict], target: str):
return MultitaskObjective(nlp.vocab, model, name)
class MultitaskObjective(Tagger):
"""Experimental: Assist training of a parser or tagger, by training a
side-objective.
"""
def __init__(self, vocab, model, name="nn_labeller", *, target):
self.vocab = vocab
self.model = model
self.name = name
if target == "dep":
self.make_label = self.make_dep
elif target == "tag":
self.make_label = self.make_tag
elif target == "ent":
self.make_label = self.make_ent
elif target == "dep_tag_offset":
self.make_label = self.make_dep_tag_offset
elif target == "ent_tag":
self.make_label = self.make_ent_tag
elif target == "sent_start":
self.make_label = self.make_sent_start
elif hasattr(target, "__call__"):
self.make_label = target
else:
raise ValueError(Errors.E016)
cfg = {"labels": {}, "target": target}
self.cfg = dict(cfg)
@property
def labels(self):
return self.cfg.setdefault("labels", {})
@labels.setter
def labels(self, value):
self.cfg["labels"] = value
def set_annotations(self, docs, dep_ids):
pass
def initialize(self, get_examples, nlp=None, labels=None):
if not hasattr(get_examples, "__call__"):
err = Errors.E930.format(name="MultitaskObjective", obj=type(get_examples))
raise ValueError(err)
if labels is not None:
self.labels = labels
else:
for example in get_examples():
for token in example.y:
label = self.make_label(token)
if label is not None and label not in self.labels:
self.labels[label] = len(self.labels)
self.model.initialize() # TODO: fix initialization by defining X and Y
def predict(self, docs):
tokvecs = self.model.get_ref("tok2vec")(docs)
scores = self.model.get_ref("softmax")(tokvecs)
return tokvecs, scores
def get_loss(self, examples, scores):
cdef int idx = 0
correct = numpy.zeros((scores.shape[0],), dtype="i")
guesses = scores.argmax(axis=1)
docs = [eg.predicted for eg in examples]
for i, eg in enumerate(examples):
# Handles alignment for tokenization differences
doc_annots = eg.get_aligned() # TODO
for j in range(len(eg.predicted)):
tok_annots = {key: values[j] for key, values in tok_annots.items()}
label = self.make_label(j, tok_annots)
if label is None or label not in self.labels:
correct[idx] = guesses[idx]
else:
correct[idx] = self.labels[label]
idx += 1
correct = self.model.ops.xp.array(correct, dtype="i")
d_scores = scores - to_categorical(correct, n_classes=scores.shape[1])
loss = (d_scores**2).sum()
return float(loss), d_scores
@staticmethod
def make_dep(token):
return token.dep_
@staticmethod
def make_tag(token):
return token.tag_
@staticmethod
def make_ent(token):
if token.ent_iob_ == "O":
return "O"
else:
return token.ent_iob_ + "-" + token.ent_type_
@staticmethod
def make_dep_tag_offset(token):
dep = token.dep_
tag = token.tag_
offset = token.head.i - token.i
offset = min(offset, 2)
offset = max(offset, -2)
return f"{dep}-{tag}:{offset}"
@staticmethod
def make_ent_tag(token):
if token.ent_iob_ == "O":
ent = "O"
else:
ent = token.ent_iob_ + "-" + token.ent_type_
tag = token.tag_
return f"{tag}-{ent}"
@staticmethod
def make_sent_start(token):
"""A multi-task objective for representing sentence boundaries,
using BILU scheme. (O is impossible)
"""
if token.is_sent_start and token.is_sent_end:
return "U-SENT"
elif token.is_sent_start:
return "B-SENT"
else:
return "I-SENT"
class ClozeMultitask(TrainablePipe):
def __init__(self, vocab, model, **cfg):
self.vocab = vocab
self.model = model
self.cfg = cfg
self.distance = CosineDistance(ignore_zeros=True, normalize=False) # TODO: in config
def set_annotations(self, docs, dep_ids):
pass
def initialize(self, get_examples, nlp=None):
self.model.initialize() # TODO: fix initialization by defining X and Y
X = self.model.ops.alloc((5, self.model.get_ref("tok2vec").get_dim("nO")))
self.model.output_layer.initialize(X)
def predict(self, docs):
tokvecs = self.model.get_ref("tok2vec")(docs)
vectors = self.model.get_ref("output_layer")(tokvecs)
return tokvecs, vectors
def get_loss(self, examples, vectors, prediction):
validate_examples(examples, "ClozeMultitask.get_loss")
# The simplest way to implement this would be to vstack the
# token.vector values, but that's a bit inefficient, especially on GPU.
# Instead we fetch the index into the vectors table for each of our tokens,
# and look them up all at once. This prevents data copying.
ids = self.model.ops.flatten([eg.predicted.to_array(ID).ravel() for eg in examples])
target = vectors[ids]
gradient = self.distance.get_grad(prediction, target)
loss = self.distance.get_loss(prediction, target)
return float(loss), gradient
def update(self, examples, *, drop=0., sgd=None, losses=None):
pass
def rehearse(self, examples, drop=0., sgd=None, losses=None):
if losses is not None and self.name not in losses:
losses[self.name] = 0.
set_dropout_rate(self.model, drop)
validate_examples(examples, "ClozeMultitask.rehearse")
docs = [eg.predicted for eg in examples]
predictions, bp_predictions = self.model.begin_update()
loss, d_predictions = self.get_loss(examples, self.vocab.vectors.data, predictions)
bp_predictions(d_predictions)
if sgd is not None:
self.finish_update(sgd)
if losses is not None:
losses[self.name] += loss
return losses
def add_label(self, label):
raise NotImplementedError

View File

@ -4,22 +4,22 @@ from typing import Optional, Iterable, Callable
from thinc.api import Model, Config
from ._parser_internals.transition_system import TransitionSystem
from .transition_parser cimport Parser
from ._parser_internals.ner cimport BiluoPushDown
from .transition_parser import Parser
from ._parser_internals.ner import BiluoPushDown
from ..language import Language
from ..scorer import get_ner_prf, PRFScore
from ..training import validate_examples
from ..util import registry
from ..training import remove_bilu_prefix
default_model_config = """
[model]
@architectures = "spacy.TransitionBasedParser.v2"
@architectures = "spacy.TransitionBasedParser.v3"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
@ -44,8 +44,12 @@ DEFAULT_NER_MODEL = Config().from_str(default_model_config)["model"]
"incorrect_spans_key": None,
"scorer": {"@scorers": "spacy.ner_scorer.v1"},
},
default_score_weights={"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0, "ents_per_type": None},
default_score_weights={
"ents_f": 1.0,
"ents_p": 0.0,
"ents_r": 0.0,
"ents_per_type": None,
},
)
def make_ner(
nlp: Language,
@ -98,6 +102,7 @@ def make_ner(
scorer=scorer,
)
@Language.factory(
"beam_ner",
assigns=["doc.ents", "token.ent_iob", "token.ent_type"],
@ -111,7 +116,12 @@ def make_ner(
"incorrect_spans_key": None,
"scorer": None,
},
default_score_weights={"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0, "ents_per_type": None},
default_score_weights={
"ents_f": 1.0,
"ents_p": 0.0,
"ents_r": 0.0,
"ents_per_type": None,
},
)
def make_beam_ner(
nlp: Language,
@ -185,11 +195,12 @@ def make_ner_scorer():
return ner_score
cdef class EntityRecognizer(Parser):
class EntityRecognizer(Parser):
"""Pipeline component for named entity recognition.
DOCS: https://spacy.io/api/entityrecognizer
"""
TransitionSystem = BiluoPushDown
def __init__(
@ -207,15 +218,14 @@ cdef class EntityRecognizer(Parser):
incorrect_spans_key=None,
scorer=ner_score,
):
"""Create an EntityRecognizer.
"""
"""Create an EntityRecognizer."""
super().__init__(
vocab,
model,
name,
moves,
update_with_oracle_cut_size=update_with_oracle_cut_size,
min_action_freq=1, # not relevant for NER
min_action_freq=1, # not relevant for NER
learn_tokens=False, # not relevant for NER
beam_width=beam_width,
beam_density=beam_density,
@ -242,8 +252,11 @@ cdef class EntityRecognizer(Parser):
def labels(self):
# Get the labels from the model by looking at the available moves, e.g.
# B-PERSON, I-PERSON, L-PERSON, U-PERSON
labels = set(remove_bilu_prefix(move) for move in self.move_names
if move[0] in ("B", "I", "L", "U"))
labels = set(
remove_bilu_prefix(move)
for move in self.move_names
if move[0] in ("B", "I", "L", "U")
)
return tuple(sorted(labels))
def scored_ents(self, beams):
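The reformatted labels property strips the BILU prefixes from the transition names. A small sketch of the same computation on hand-written move names (the move names are illustrative; remove_bilu_prefix is the helper imported above from spacy.training):

from spacy.training import remove_bilu_prefix

move_names = ["B-PERSON", "I-PERSON", "L-PERSON", "U-PERSON", "U-GPE", "O"]
labels = set(
    remove_bilu_prefix(move) for move in move_names if move[0] in ("B", "I", "L", "U")
)
print(tuple(sorted(labels)))  # ('GPE', 'PERSON')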

View File

@ -19,13 +19,6 @@ cdef class Pipe:
DOCS: https://spacy.io/api/pipe
"""
@classmethod
def __init_subclass__(cls, **kwargs):
"""Raise a warning if an inheriting class implements 'begin_training'
(from v2) instead of the new 'initialize' method (from v3)"""
if hasattr(cls, "begin_training"):
warnings.warn(Warnings.W088.format(name=cls.__name__))
def __call__(self, Doc doc) -> Doc:
"""Apply the pipe to one document. The document is modified in place,
and returned. This usually happens under the hood when the nlp object
@ -94,6 +87,10 @@ cdef class Pipe:
return self.scorer(examples, **scorer_kwargs)
return {}
@property
def is_distillable(self) -> bool:
return False
@property
def is_trainable(self) -> bool:
return False
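The new is_distillable property defaults to False on Pipe, so training code can filter a pipeline for components that actually support distillation. A hedged sketch on this branch (the component choice is illustrative; the tagger overrides get_teacher_student_loss, while the sentencizer is not trainable):

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
nlp.add_pipe("sentencizer")
distillable = [name for name, pipe in nlp.pipeline if pipe.is_distillable]
print(distillable)  # ['tagger'] on this branch; the sentencizer reports False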

View File

@ -10,9 +10,6 @@ from ..language import Language
from ..scorer import Scorer
from .. import util
# see #9050
BACKWARD_OVERWRITE = False
@Language.factory(
"sentencizer",
assigns=["token.is_sent_start", "doc.sents"],
@ -52,13 +49,14 @@ class Sentencizer(Pipe):
name="sentencizer",
*,
punct_chars=None,
overwrite=BACKWARD_OVERWRITE,
overwrite=False,
scorer=senter_score,
):
"""Initialize the sentencizer.
punct_chars (list): Punctuation characters to split on. Will be
serialized with the nlp object.
overwrite (bool): Whether to overwrite existing annotations.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_spans for the attribute "sents".

View File

@ -1,13 +1,16 @@
# cython: infer_types=True, profile=True, binding=True
from typing import Optional, Callable
from typing import Dict, Iterable, Optional, Callable, List, Union
from itertools import islice
import srsly
from thinc.api import Model, SequenceCategoricalCrossentropy, Config
from thinc.api import Model, Config
from thinc.legacy import LegacySequenceCategoricalCrossentropy
from thinc.types import Floats2d, Ints1d
from ..tokens.doc cimport Doc
from .tagger import Tagger
from .tagger import ActivationsT, Tagger
from ..language import Language
from ..errors import Errors
from ..scorer import Scorer
@ -15,8 +18,6 @@ from ..training import validate_examples, validate_get_examples
from ..util import registry
from .. import util
# See #9050
BACKWARD_OVERWRITE = False
default_model_config = """
[model]
@ -38,11 +39,21 @@ DEFAULT_SENTER_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"senter",
assigns=["token.is_sent_start"],
default_config={"model": DEFAULT_SENTER_MODEL, "overwrite": False, "scorer": {"@scorers": "spacy.senter_scorer.v1"}},
default_config={
"model": DEFAULT_SENTER_MODEL,
"overwrite": False,
"scorer": {"@scorers": "spacy.senter_scorer.v1"},
"save_activations": False,
},
default_score_weights={"sents_f": 1.0, "sents_p": 0.0, "sents_r": 0.0},
)
def make_senter(nlp: Language, name: str, model: Model, overwrite: bool, scorer: Optional[Callable]):
return SentenceRecognizer(nlp.vocab, model, name, overwrite=overwrite, scorer=scorer)
def make_senter(nlp: Language,
name: str,
model: Model,
overwrite: bool,
scorer: Optional[Callable],
save_activations: bool):
return SentenceRecognizer(nlp.vocab, model, name, overwrite=overwrite, scorer=scorer, save_activations=save_activations)
def senter_score(examples, **kwargs):
@ -70,8 +81,9 @@ class SentenceRecognizer(Tagger):
model,
name="senter",
*,
overwrite=BACKWARD_OVERWRITE,
overwrite=False,
scorer=senter_score,
save_activations: bool = False,
):
"""Initialize a sentence recognizer.
@ -79,8 +91,10 @@ class SentenceRecognizer(Tagger):
model (thinc.api.Model): The Thinc Model powering the pipeline component.
name (str): The component instance name, used to add entries to the
losses during training.
overwrite (bool): Whether to overwrite existing annotations.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_spans for the attribute "sents".
save_activations (bool): save model activations in Doc when annotating.
DOCS: https://spacy.io/api/sentencerecognizer#init
"""
@ -90,6 +104,7 @@ class SentenceRecognizer(Tagger):
self._rehearsal_model = None
self.cfg = {"overwrite": overwrite}
self.scorer = scorer
self.save_activations = save_activations
@property
def labels(self):
@ -107,19 +122,24 @@ class SentenceRecognizer(Tagger):
def label_data(self):
return None
def set_annotations(self, docs, batch_tag_ids):
def set_annotations(self, docs: Iterable[Doc], activations: ActivationsT):
"""Modify a batch of documents, using pre-computed scores.
docs (Iterable[Doc]): The documents to modify.
batch_tag_ids: The IDs to set, produced by SentenceRecognizer.predict.
activations (ActivationsT): The activations used for setting annotations, produced by SentenceRecognizer.predict.
DOCS: https://spacy.io/api/sentencerecognizer#set_annotations
"""
batch_tag_ids = activations["label_ids"]
if isinstance(docs, Doc):
docs = [docs]
cdef Doc doc
cdef bint overwrite = self.cfg["overwrite"]
for i, doc in enumerate(docs):
if self.save_activations:
doc.activations[self.name] = {}
for act_name, acts in activations.items():
doc.activations[self.name][act_name] = acts[i]
doc_tag_ids = batch_tag_ids[i]
if hasattr(doc_tag_ids, "get"):
doc_tag_ids = doc_tag_ids.get()
@ -142,7 +162,7 @@ class SentenceRecognizer(Tagger):
"""
validate_examples(examples, "SentenceRecognizer.get_loss")
labels = self.labels
loss_func = SequenceCategoricalCrossentropy(names=labels, normalize=False)
loss_func = LegacySequenceCategoricalCrossentropy(names=labels, normalize=False)
truths = []
for eg in examples:
eg_truth = []
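Because save_activations is now part of the factory defaults, it can also be switched on through the component config rather than at runtime. A sketch of the wiring only (the pipe below is blank and untrained, so it is not actually run here):

import spacy

nlp = spacy.blank("en")
senter = nlp.add_pipe("senter", config={"save_activations": True})
print(senter.save_activations)  # True, read back through the TrainablePipe property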

View File

@ -11,7 +11,7 @@ from ..language import Language
from ..errors import Errors, Warnings
from ..util import ensure_path, SimpleFrozenList, registry
from ..tokens import Doc, Span
from ..scorer import Scorer
from ..scorer import Scorer, get_ner_prf
from ..matcher import Matcher, PhraseMatcher
from ..matcher.levenshtein import levenshtein_compare
from .. import util
@ -21,7 +21,7 @@ DEFAULT_SPANS_KEY = "ruler"
@Language.factory(
"future_entity_ruler",
"entity_ruler",
assigns=["doc.ents"],
default_config={
"phrase_matcher_attr": None,
@ -67,6 +67,15 @@ def make_entity_ruler(
)
def entity_ruler_score(examples, **kwargs):
return get_ner_prf(examples)
@registry.scorers("spacy.entity_ruler_scorer.v1")
def make_entity_ruler_scorer():
return entity_ruler_score
@Language.factory(
"span_ruler",
assigns=["doc.spans"],
@ -124,7 +133,7 @@ def prioritize_new_ents_filter(
) -> List[Span]:
"""Merge entities and spans into one list without overlaps by allowing
spans to overwrite any entities that they overlap with. Intended to
replicate the overwrite_ents=True behavior from the EntityRuler.
replicate the overwrite_ents=True behavior from the v3 EntityRuler.
entities (Iterable[Span]): The entities, already filtered for overlaps.
spans (Iterable[Span]): The spans to merge, may contain overlaps.
@ -155,7 +164,7 @@ def prioritize_existing_ents_filter(
) -> List[Span]:
"""Merge entities and spans into one list without overlaps by prioritizing
existing entities. Intended to replicate the overwrite_ents=False behavior
from the EntityRuler.
from the v3 EntityRuler.
entities (Iterable[Span]): The entities, already filtered for overlaps.
spans (Iterable[Span]): The spans to merge, may contain overlaps.

View File

@ -1,12 +1,15 @@
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union, cast
from typing import Union, Protocol, runtime_checkable
from dataclasses import dataclass
from functools import partial
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union, cast
from thinc.api import Config, Model, get_current_ops, set_dropout_rate, Ops
from thinc.api import Optimizer
from thinc.types import Ragged, Ints2d, Floats2d
import numpy
from thinc.api import Config, Model, Ops, Optimizer, get_current_ops, set_dropout_rate
from thinc.types import Floats2d, Ints1d, Ints2d, Ragged
from ..compat import Protocol, runtime_checkable
from ..errors import Errors
from ..language import Language
from ..scorer import Scorer
@ -16,6 +19,9 @@ from ..util import registry
from ..vocab import Vocab
from .trainable_pipe import TrainablePipe
ActivationsT = Dict[str, Union[Floats2d, Ragged]]
spancat_default_config = """
[model]
@architectures = "spacy.SpanCategorizer.v1"
@ -170,6 +176,7 @@ def build_preset_spans_suggester(spans_key: str) -> Suggester:
"model": DEFAULT_SPANCAT_MODEL,
"suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
"scorer": {"@scorers": "spacy.spancat_scorer.v1"},
"save_activations": False,
},
default_score_weights={"spans_sc_f": 1.0, "spans_sc_p": 0.0, "spans_sc_r": 0.0},
)
@ -182,6 +189,7 @@ def make_spancat(
scorer: Optional[Callable],
threshold: float,
max_positive: Optional[int],
save_activations: bool,
) -> "SpanCategorizer":
"""Create a SpanCategorizer component and configure it for multi-label
classification to be able to assign multiple labels for each span.
@ -209,6 +217,7 @@ def make_spancat(
0.5.
max_positive (Optional[int]): Maximum number of labels to consider positive
per span. Defaults to None, indicating no limit.
save_activations (bool): save model activations in Doc when annotating.
"""
return SpanCategorizer(
nlp.vocab,
@ -222,6 +231,7 @@ def make_spancat(
threshold=threshold,
scorer=scorer,
add_negative_label=False,
save_activations=save_activations,
)
@ -235,6 +245,7 @@ def make_spancat(
"suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
"scorer": {"@scorers": "spacy.spancat_scorer.v1"},
"allow_overlap": True,
"save_activations": False,
},
default_score_weights={"spans_sc_f": 1.0, "spans_sc_p": 0.0, "spans_sc_r": 0.0},
)
@ -247,6 +258,7 @@ def make_spancat_singlelabel(
negative_weight: float,
allow_overlap: bool,
scorer: Optional[Callable],
save_activations: bool,
) -> "SpanCategorizer":
"""Create a SpanCategorizer component and configure it for multi-class
classification. With this configuration each span can get at most one
@ -274,6 +286,7 @@ def make_spancat_singlelabel(
allow_overlap (bool): If True the data is assumed to contain overlapping spans.
Otherwise it produces non-overlapping spans greedily prioritizing
higher assigned label scores.
save_activations (bool): save model activations in Doc when annotating.
"""
return SpanCategorizer(
nlp.vocab,
@ -287,6 +300,7 @@ def make_spancat_singlelabel(
add_negative_label=True,
threshold=None,
scorer=scorer,
save_activations=save_activations,
)
@ -349,6 +363,7 @@ class SpanCategorizer(TrainablePipe):
max_positive: Optional[int] = None,
threshold: Optional[float] = 0.5,
scorer: Optional[Callable] = spancat_score,
save_activations: bool = False,
) -> None:
"""Initialize the multi-label or multi-class span categorizer.
@ -398,6 +413,7 @@ class SpanCategorizer(TrainablePipe):
self.model = model
self.name = name
self.scorer = scorer
self.save_activations = save_activations
self.add_negative_label = add_negative_label
if not allow_overlap and max_positive is not None and max_positive > 1:
raise ValueError(Errors.E1051.format(max_positive=max_positive))
@ -479,7 +495,7 @@ class SpanCategorizer(TrainablePipe):
else:
return None
def predict(self, docs: Iterable[Doc]):
def predict(self, docs: Iterable[Doc]) -> ActivationsT:
"""Apply the pipeline's model to a batch of docs, without modifying them.
docs (Iterable[Doc]): The documents to predict.
@ -492,7 +508,7 @@ class SpanCategorizer(TrainablePipe):
scores = self.model.ops.alloc2f(0, 0)
else:
scores = self.model.predict((docs, indices)) # type: ignore
return indices, scores
return {"indices": indices, "scores": scores}
def set_candidates(
self, docs: Iterable[Doc], *, candidates_key: str = "candidates"
@ -512,18 +528,27 @@ class SpanCategorizer(TrainablePipe):
for index in candidates.dataXd:
doc.spans[candidates_key].append(doc[index[0] : index[1]])
def set_annotations(self, docs: Iterable[Doc], indices_scores) -> None:
def set_annotations(self, docs: Iterable[Doc], activations: ActivationsT) -> None:
"""Modify a batch of Doc objects, using pre-computed scores.
docs (Iterable[Doc]): The documents to modify.
scores: The scores to set, produced by SpanCategorizer.predict.
activations (ActivationsT): The activations, produced by SpanCategorizer.predict.
DOCS: https://spacy.io/api/spancategorizer#set_annotations
"""
indices, scores = indices_scores
indices = activations["indices"]
assert isinstance(indices, Ragged)
scores = cast(Floats2d, activations["scores"])
offset = 0
for i, doc in enumerate(docs):
indices_i = indices[i].dataXd
indices_i = cast(Ints2d, indices[i].dataXd)
if self.save_activations:
doc.activations[self.name] = {}
doc.activations[self.name]["indices"] = indices_i
doc.activations[self.name]["scores"] = scores[
offset : offset + indices.lengths[i]
]
allow_overlap = cast(bool, self.cfg["allow_overlap"])
if self.cfg["max_positive"] == 1:
doc.spans[self.key] = self._make_span_group_singlelabel(
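With save_activations enabled, the span categorizer stores the suggested span offsets under "indices" and one row of label scores per suggested span under "scores". A sketch of inspecting them, assuming an existing trained pipeline with a "spancat" component (the package name is hypothetical):

import spacy

nlp = spacy.load("my_spancat_pipeline")  # hypothetical package with a trained "spancat"
spancat = nlp.get_pipe("spancat")
spancat.save_activations = True
doc = nlp("Apple is opening a store in San Francisco.")
acts = doc.activations["spancat"]
print(acts["indices"].shape)  # (n_suggested_spans, 2): token start/end per candidate
print(acts["scores"].shape)   # (n_suggested_spans, n_labels)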

View File

@ -1,9 +1,11 @@
# cython: infer_types=True, profile=True, binding=True
from typing import Callable, Optional
from typing import Callable, Dict, Iterable, List, Optional, Union
from typing import Tuple
import numpy
import srsly
from thinc.api import Model, set_dropout_rate, SequenceCategoricalCrossentropy, Config
from thinc.types import Floats2d
from thinc.api import Model, set_dropout_rate, Config
from thinc.legacy import LegacySequenceCategoricalCrossentropy
from thinc.types import Floats2d, Ints1d
import warnings
from itertools import islice
@ -22,8 +24,8 @@ from ..training import validate_examples, validate_get_examples
from ..util import registry
from .. import util
# See #9050
BACKWARD_OVERWRITE = False
ActivationsT = Dict[str, Union[List[Floats2d], List[Ints1d]]]
default_model_config = """
[model]
@ -45,7 +47,14 @@ DEFAULT_TAGGER_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"tagger",
assigns=["token.tag"],
default_config={"model": DEFAULT_TAGGER_MODEL, "overwrite": False, "scorer": {"@scorers": "spacy.tagger_scorer.v1"}, "neg_prefix": "!", "label_smoothing": 0.0},
default_config={
"model": DEFAULT_TAGGER_MODEL,
"overwrite": False,
"scorer": {"@scorers": "spacy.tagger_scorer.v1"},
"neg_prefix": "!",
"label_smoothing": 0.0,
"save_activations": False,
},
default_score_weights={"tag_acc": 1.0},
)
def make_tagger(
@ -56,6 +65,7 @@ def make_tagger(
scorer: Optional[Callable],
neg_prefix: str,
label_smoothing: float,
save_activations: bool,
):
"""Construct a part-of-speech tagger component.
@ -64,7 +74,8 @@ def make_tagger(
in size, and be normalized as probabilities (all scores between 0 and 1,
with the rows summing to 1).
"""
return Tagger(nlp.vocab, model, name, overwrite=overwrite, scorer=scorer, neg_prefix=neg_prefix, label_smoothing=label_smoothing)
return Tagger(nlp.vocab, model, name, overwrite=overwrite, scorer=scorer, neg_prefix=neg_prefix,
label_smoothing=label_smoothing, save_activations=save_activations)
def tagger_score(examples, **kwargs):
@ -87,10 +98,11 @@ class Tagger(TrainablePipe):
model,
name="tagger",
*,
overwrite=BACKWARD_OVERWRITE,
overwrite=False,
scorer=tagger_score,
neg_prefix="!",
label_smoothing=0.0,
save_activations: bool = False,
):
"""Initialize a part-of-speech tagger.
@ -98,8 +110,10 @@ class Tagger(TrainablePipe):
model (thinc.api.Model): The Thinc Model powering the pipeline component.
name (str): The component instance name, used to add entries to the
losses during training.
overwrite (bool): Whether to overwrite existing annotations.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_token_attr for the attribute "tag".
save_activations (bool): save model activations in Doc when annotating.
DOCS: https://spacy.io/api/tagger#init
"""
@ -107,9 +121,15 @@ class Tagger(TrainablePipe):
self.model = model
self.name = name
self._rehearsal_model = None
cfg = {"labels": [], "overwrite": overwrite, "neg_prefix": neg_prefix, "label_smoothing": label_smoothing}
cfg = {
"labels": [],
"overwrite": overwrite,
"neg_prefix": neg_prefix,
"label_smoothing": label_smoothing
}
self.cfg = dict(sorted(cfg.items()))
self.scorer = scorer
self.save_activations = save_activations
@property
def labels(self):
@ -128,7 +148,7 @@ class Tagger(TrainablePipe):
"""Data about the labels currently added to the component."""
return tuple(self.cfg["labels"])
def predict(self, docs):
def predict(self, docs) -> ActivationsT:
"""Apply the pipeline's model to a batch of docs, without modifying them.
docs (Iterable[Doc]): The documents to predict.
@ -141,12 +161,12 @@ class Tagger(TrainablePipe):
n_labels = len(self.labels)
guesses = [self.model.ops.alloc((0, n_labels)) for doc in docs]
assert len(guesses) == len(docs)
return guesses
return {"probabilities": guesses, "label_ids": guesses}
scores = self.model.predict(docs)
assert len(scores) == len(docs), (len(scores), len(docs))
guesses = self._scores2guesses(scores)
assert len(guesses) == len(docs)
return guesses
return {"probabilities": scores, "label_ids": guesses}
def _scores2guesses(self, scores):
guesses = []
@ -157,14 +177,15 @@ class Tagger(TrainablePipe):
guesses.append(doc_guesses)
return guesses
def set_annotations(self, docs, batch_tag_ids):
def set_annotations(self, docs: Iterable[Doc], activations: ActivationsT):
"""Modify a batch of documents, using pre-computed scores.
docs (Iterable[Doc]): The documents to modify.
batch_tag_ids: The IDs to set, produced by Tagger.predict.
activations (ActivationsT): The activations used for setting annotations, produced by Tagger.predict.
DOCS: https://spacy.io/api/tagger#set_annotations
"""
batch_tag_ids = activations["label_ids"]
if isinstance(docs, Doc):
docs = [docs]
cdef Doc doc
@ -172,6 +193,10 @@ class Tagger(TrainablePipe):
cdef bint overwrite = self.cfg["overwrite"]
labels = self.labels
for i, doc in enumerate(docs):
if self.save_activations:
doc.activations[self.name] = {}
for act_name, acts in activations.items():
doc.activations[self.name][act_name] = acts[i]
doc_tag_ids = batch_tag_ids[i]
if hasattr(doc_tag_ids, "get"):
doc_tag_ids = doc_tag_ids.get()
@ -227,7 +252,6 @@ class Tagger(TrainablePipe):
DOCS: https://spacy.io/api/tagger#rehearse
"""
loss_func = SequenceCategoricalCrossentropy()
if losses is None:
losses = {}
losses.setdefault(self.name, 0.0)
@ -241,12 +265,32 @@ class Tagger(TrainablePipe):
set_dropout_rate(self.model, drop)
tag_scores, bp_tag_scores = self.model.begin_update(docs)
tutor_tag_scores, _ = self._rehearsal_model.begin_update(docs)
grads, loss = loss_func(tag_scores, tutor_tag_scores)
loss, grads = self.get_teacher_student_loss(tutor_tag_scores, tag_scores)
bp_tag_scores(grads)
self.finish_update(sgd)
if sgd is not None:
self.finish_update(sgd)
losses[self.name] += loss
return losses
def get_teacher_student_loss(
self, teacher_scores: List[Floats2d], student_scores: List[Floats2d]
) -> Tuple[float, List[Floats2d]]:
"""Calculate the loss and its gradient for a batch of student
scores, relative to teacher scores.
teacher_scores: Scores representing the teacher model's predictions.
student_scores: Scores representing the student model's predictions.
RETURNS (Tuple[float, List[Floats2d]]): The loss and the gradient.
DOCS: https://spacy.io/api/tagger#get_teacher_student_loss
"""
loss_func = LegacySequenceCategoricalCrossentropy(normalize=False)
d_scores, loss = loss_func(student_scores, teacher_scores)
if self.model.ops.xp.isnan(loss):
raise ValueError(Errors.E910.format(name=self.name))
return float(loss), d_scores
def get_loss(self, examples, scores):
"""Find the loss and gradient of loss for the batch of documents and
their predicted scores.
@ -258,7 +302,12 @@ class Tagger(TrainablePipe):
DOCS: https://spacy.io/api/tagger#get_loss
"""
validate_examples(examples, "Tagger.get_loss")
loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False, neg_prefix=self.cfg["neg_prefix"], label_smoothing=self.cfg["label_smoothing"])
loss_func = LegacySequenceCategoricalCrossentropy(
names=self.labels,
normalize=False,
neg_prefix=self.cfg["neg_prefix"],
label_smoothing=self.cfg["label_smoothing"]
)
# Convert empty tag "" to missing value None so that both misaligned
# tokens and tokens with missing annotation have the default missing
# value None.
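The distillation loss added above is a categorical cross-entropy between the student's and the teacher's per-token tag distributions; for softmax outputs, its gradient is simply the difference of the two probability arrays. A plain numpy illustration of that relationship (illustrative numbers, not the thinc implementation used in the code):

import numpy as np

teacher = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1]])          # teacher tag distributions, two tokens
student = np.array([[0.5, 0.3, 0.2],
                    [0.2, 0.6, 0.2]])          # student tag distributions, same tokens
loss = -(teacher * np.log(student)).sum()      # cross-entropy with soft targets
d_scores = student - teacher                   # gradient w.r.t. the student's softmax output
print(round(float(loss), 3), d_scores)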

View File

@ -1,4 +1,4 @@
from typing import Iterable, Tuple, Optional, Dict, List, Callable, Any
from typing import Iterable, Tuple, Optional, Dict, List, Callable, Any, Union
from thinc.api import get_array_module, Model, Optimizer, set_dropout_rate, Config
from thinc.types import Floats2d
import numpy
@ -14,6 +14,9 @@ from ..util import registry
from ..vocab import Vocab
ActivationsT = Dict[str, Floats2d]
single_label_default_config = """
[model]
@architectures = "spacy.TextCatEnsemble.v2"
@ -75,6 +78,7 @@ subword_features = true
"threshold": 0.0,
"model": DEFAULT_SINGLE_TEXTCAT_MODEL,
"scorer": {"@scorers": "spacy.textcat_scorer.v2"},
"save_activations": False,
},
default_score_weights={
"cats_score": 1.0,
@ -95,6 +99,7 @@ def make_textcat(
model: Model[List[Doc], List[Floats2d]],
threshold: float,
scorer: Optional[Callable],
save_activations: bool,
) -> "TextCategorizer":
"""Create a TextCategorizer component. The text categorizer predicts categories
over a whole document. It can learn one or more labels, and the labels are considered
@ -104,8 +109,16 @@ def make_textcat(
scores for each category.
threshold (float): Cutoff to consider a prediction "positive".
scorer (Optional[Callable]): The scoring method.
save_activations (bool): save model activations in Doc when annotating.
"""
return TextCategorizer(nlp.vocab, model, name, threshold=threshold, scorer=scorer)
return TextCategorizer(
nlp.vocab,
model,
name,
threshold=threshold,
scorer=scorer,
save_activations=save_activations,
)
def textcat_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
@ -136,6 +149,7 @@ class TextCategorizer(TrainablePipe):
*,
threshold: float,
scorer: Optional[Callable] = textcat_score,
save_activations: bool = False,
) -> None:
"""Initialize a text categorizer for single-label classification.
@ -161,6 +175,7 @@ class TextCategorizer(TrainablePipe):
}
self.cfg = dict(cfg)
self.scorer = scorer
self.save_activations = save_activations
@property
def support_missing_values(self):
@ -185,7 +200,7 @@ class TextCategorizer(TrainablePipe):
"""
return self.labels # type: ignore[return-value]
def predict(self, docs: Iterable[Doc]):
def predict(self, docs: Iterable[Doc]) -> ActivationsT:
"""Apply the pipeline's model to a batch of docs, without modifying them.
docs (Iterable[Doc]): The documents to predict.
@ -198,12 +213,12 @@ class TextCategorizer(TrainablePipe):
tensors = [doc.tensor for doc in docs]
xp = self.model.ops.xp
scores = xp.zeros((len(list(docs)), len(self.labels)))
return scores
return {"probabilities": scores}
scores = self.model.predict(docs)
scores = self.model.ops.asarray(scores)
return scores
return {"probabilities": scores}
def set_annotations(self, docs: Iterable[Doc], scores) -> None:
def set_annotations(self, docs: Iterable[Doc], activations: ActivationsT) -> None:
"""Modify a batch of Doc objects, using pre-computed scores.
docs (Iterable[Doc]): The documents to modify.
@ -211,9 +226,13 @@ class TextCategorizer(TrainablePipe):
DOCS: https://spacy.io/api/textcategorizer#set_annotations
"""
probs = activations["probabilities"]
for i, doc in enumerate(docs):
if self.save_activations:
doc.activations[self.name] = {}
doc.activations[self.name]["probabilities"] = probs[i]
for j, label in enumerate(self.labels):
doc.cats[label] = float(scores[i, j])
doc.cats[label] = float(probs[i, j])
def update(
self,
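After this change doc.cats is filled from the same "probabilities" activation that can optionally be saved on the doc. A sketch, assuming an existing trained pipeline with a "textcat" component (the package name is hypothetical):

import spacy

nlp = spacy.load("my_textcat_pipeline")  # hypothetical package with a trained "textcat"
textcat = nlp.get_pipe("textcat")
textcat.save_activations = True
doc = nlp("This is great.")
probs = doc.activations["textcat"]["probabilities"]
print(len(probs) == len(textcat.labels))  # one score per label
print(doc.cats)                           # the same scores, keyed by label name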

View File

@ -1,4 +1,4 @@
from typing import Iterable, Optional, Dict, List, Callable, Any
from typing import Iterable, Optional, Dict, List, Callable, Any, Union
from thinc.types import Floats2d
from thinc.api import Model, Config
@ -75,6 +75,7 @@ subword_features = true
"threshold": 0.5,
"model": DEFAULT_MULTI_TEXTCAT_MODEL,
"scorer": {"@scorers": "spacy.textcat_multilabel_scorer.v2"},
"save_activations": False,
},
default_score_weights={
"cats_score": 1.0,
@ -95,8 +96,9 @@ def make_multilabel_textcat(
model: Model[List[Doc], List[Floats2d]],
threshold: float,
scorer: Optional[Callable],
save_activations: bool,
) -> "MultiLabel_TextCategorizer":
"""Create a MultiLabel_TextCategorizer component. The text categorizer predicts categories
"""Create a TextCategorizer component. The text categorizer predicts categories
over a whole document. It can learn one or more labels, and the labels are considered
to be non-mutually exclusive, which means that there can be zero or more labels
per doc.
@ -107,7 +109,12 @@ def make_multilabel_textcat(
scorer (Optional[Callable]): The scoring method.
"""
return MultiLabel_TextCategorizer(
nlp.vocab, model, name, threshold=threshold, scorer=scorer
nlp.vocab,
model,
name,
threshold=threshold,
scorer=scorer,
save_activations=save_activations,
)
@ -139,6 +146,7 @@ class MultiLabel_TextCategorizer(TextCategorizer):
*,
threshold: float,
scorer: Optional[Callable] = textcat_multilabel_score,
save_activations: bool = False,
) -> None:
"""Initialize a text categorizer for multi-label classification.
@ -148,6 +156,7 @@ class MultiLabel_TextCategorizer(TextCategorizer):
losses during training.
threshold (float): Cutoff to consider a prediction "positive".
scorer (Optional[Callable]): The scoring method.
save_activations (bool): save model activations in Doc when annotating.
DOCS: https://spacy.io/api/textcategorizer#init
"""
@ -158,6 +167,7 @@ class MultiLabel_TextCategorizer(TextCategorizer):
cfg = {"labels": [], "threshold": threshold}
self.cfg = dict(cfg)
self.scorer = scorer
self.save_activations = save_activations
@property
def support_missing_values(self):

View File

@ -1,5 +1,6 @@
from typing import Sequence, Iterable, Optional, Dict, Callable, List, Any
from typing import Sequence, Iterable, Optional, Dict, Callable, List, Any, Tuple
from thinc.api import Model, set_dropout_rate, Optimizer, Config
from thinc.types import Floats2d
from itertools import islice
from .trainable_pipe import TrainablePipe
@ -157,39 +158,9 @@ class Tok2Vec(TrainablePipe):
DOCS: https://spacy.io/api/tok2vec#update
"""
if losses is None:
losses = {}
validate_examples(examples, "Tok2Vec.update")
docs = [eg.predicted for eg in examples]
set_dropout_rate(self.model, drop)
tokvecs, bp_tokvecs = self.model.begin_update(docs)
d_tokvecs = [self.model.ops.alloc2f(*t2v.shape) for t2v in tokvecs]
losses.setdefault(self.name, 0.0)
def accumulate_gradient(one_d_tokvecs):
"""Accumulate tok2vec loss and gradient. This is passed as a callback
to all but the last listener. Only the last one does the backprop.
"""
nonlocal d_tokvecs
for i in range(len(one_d_tokvecs)):
d_tokvecs[i] += one_d_tokvecs[i]
losses[self.name] += float((one_d_tokvecs[i] ** 2).sum())
return [self.model.ops.alloc2f(*t2v.shape) for t2v in tokvecs]
def backprop(one_d_tokvecs):
"""Callback to actually do the backprop. Passed to last listener."""
accumulate_gradient(one_d_tokvecs)
d_docs = bp_tokvecs(d_tokvecs)
if sgd is not None:
self.finish_update(sgd)
return d_docs
batch_id = Tok2VecListener.get_batch_id(docs)
for listener in self.listeners[:-1]:
listener.receive(batch_id, tokvecs, accumulate_gradient)
if self.listeners:
self.listeners[-1].receive(batch_id, tokvecs, backprop)
return losses
return self._update_with_docs(docs, drop=drop, sgd=sgd, losses=losses)
def get_loss(self, examples, scores) -> None:
pass
@ -219,6 +190,96 @@ class Tok2Vec(TrainablePipe):
def add_label(self, label):
raise NotImplementedError
def distill(
self,
teacher_pipe: Optional["TrainablePipe"],
examples: Iterable["Example"],
*,
drop: float = 0.0,
sgd: Optional[Optimizer] = None,
losses: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
"""Performs an update of the student pipe's model using the
student's distillation examples and sets the annotations
of the teacher's distillation examples using the teacher pipe.
teacher_pipe (Optional[TrainablePipe]): The teacher pipe to use
for prediction.
examples (Iterable[Example]): Distillation examples. The reference (teacher)
and predicted (student) docs must have the same number of tokens and the
same orthography.
drop (float): dropout rate.
sgd (Optional[Optimizer]): An optimizer. Will be created via
create_optimizer if not set.
losses (Optional[Dict[str, float]]): Optional record of loss during
distillation.
RETURNS: The updated losses dictionary.
DOCS: https://spacy.io/api/tok2vec#distill
"""
# By default we require a teacher pipe, but there are downstream
# implementations that don't require a pipe.
if teacher_pipe is None:
raise ValueError(Errors.E4002.format(name=self.name))
teacher_docs = [eg.reference for eg in examples]
student_docs = [eg.predicted for eg in examples]
teacher_preds = teacher_pipe.predict(teacher_docs)
teacher_pipe.set_annotations(teacher_docs, teacher_preds)
return self._update_with_docs(student_docs, drop=drop, sgd=sgd, losses=losses)
def _update_with_docs(
self,
docs: Iterable[Doc],
*,
drop: float = 0.0,
sgd: Optional[Optimizer] = None,
losses: Optional[Dict[str, float]] = None,
):
if losses is None:
losses = {}
losses.setdefault(self.name, 0.0)
set_dropout_rate(self.model, drop)
tokvecs, accumulate_gradient, backprop = self._create_backprops(
docs, losses, sgd=sgd
)
batch_id = Tok2VecListener.get_batch_id(docs)
for listener in self.listeners[:-1]:
listener.receive(batch_id, tokvecs, accumulate_gradient)
if self.listeners:
self.listeners[-1].receive(batch_id, tokvecs, backprop)
return losses
def _create_backprops(
self,
docs: Iterable[Doc],
losses: Dict[str, float],
*,
sgd: Optional[Optimizer] = None,
) -> Tuple[Floats2d, Callable, Callable]:
tokvecs, bp_tokvecs = self.model.begin_update(docs)
d_tokvecs = [self.model.ops.alloc2f(*t2v.shape) for t2v in tokvecs]
def accumulate_gradient(one_d_tokvecs):
"""Accumulate tok2vec loss and gradient. This is passed as a callback
to all but the last listener. Only the last one does the backprop.
"""
nonlocal d_tokvecs
for i in range(len(one_d_tokvecs)):
d_tokvecs[i] += one_d_tokvecs[i]
losses[self.name] += float((one_d_tokvecs[i] ** 2).sum())
return [self.model.ops.alloc2f(*t2v.shape) for t2v in tokvecs]
def backprop(one_d_tokvecs):
"""Callback to actually do the backprop. Passed to last listener."""
accumulate_gradient(one_d_tokvecs)
d_docs = bp_tokvecs(d_tokvecs)
if sgd is not None:
self.finish_update(sgd)
return d_docs
return tokvecs, accumulate_gradient, backprop
class Tok2VecListener(Model):
"""A layer that gets fed its answers from an upstream connection,

View File

@ -6,3 +6,4 @@ cdef class TrainablePipe(Pipe):
cdef public object model
cdef public object cfg
cdef public object scorer
cdef bint _save_activations

View File

@ -2,11 +2,12 @@
from typing import Iterable, Iterator, Optional, Dict, Tuple, Callable
import srsly
from thinc.api import set_dropout_rate, Model, Optimizer
import warnings
from ..tokens.doc cimport Doc
from ..training import validate_examples
from ..errors import Errors
from ..training import validate_examples, validate_distillation_examples
from ..errors import Errors, Warnings
from .pipe import Pipe, deserialize_config
from .. import util
from ..vocab import Vocab
@ -55,6 +56,53 @@ cdef class TrainablePipe(Pipe):
except Exception as e:
error_handler(self.name, self, [doc], e)
def distill(self,
teacher_pipe: Optional["TrainablePipe"],
examples: Iterable["Example"],
*,
drop: float=0.0,
sgd: Optional[Optimizer]=None,
losses: Optional[Dict[str, float]]=None) -> Dict[str, float]:
"""Train a pipe (the student) on the predictions of another pipe
(the teacher). The student is typically trained on the probability
distribution of the teacher, but details may differ per pipe.
teacher_pipe (Optional[TrainablePipe]): The teacher pipe to learn
from.
examples (Iterable[Example]): Distillation examples. The reference
(teacher) and predicted (student) docs must have the same number of
tokens and the same orthography.
drop (float): dropout rate.
sgd (Optional[Optimizer]): An optimizer. Will be created via
create_optimizer if not set.
losses (Optional[Dict[str, float]]): Optional record of loss during
distillation.
RETURNS: The updated losses dictionary.
DOCS: https://spacy.io/api/pipe#distill
"""
# By default we require a teacher pipe, but there are downstream
# implementations that don't require a pipe.
if teacher_pipe is None:
raise ValueError(Errors.E4002.format(name=self.name))
if losses is None:
losses = {}
losses.setdefault(self.name, 0.0)
validate_distillation_examples(examples, "TrainablePipe.distill")
set_dropout_rate(self.model, drop)
for node in teacher_pipe.model.walk():
if node.name == "softmax":
node.attrs["softmax_normalize"] = True
teacher_scores = teacher_pipe.model.predict([eg.reference for eg in examples])
student_scores, bp_student_scores = self.model.begin_update([eg.predicted for eg in examples])
loss, d_scores = self.get_teacher_student_loss(teacher_scores, student_scores)
bp_student_scores(d_scores)
if sgd is not None:
self.finish_update(sgd)
losses[self.name] += loss
return losses
def pipe(self, stream: Iterable[Doc], *, batch_size: int=128) -> Iterator[Doc]:
"""Apply the pipe to a stream of documents. This usually happens under
the hood when the nlp object is called on a text and all components are
@ -168,6 +216,19 @@ cdef class TrainablePipe(Pipe):
"""
raise NotImplementedError(Errors.E931.format(parent="TrainablePipe", method="get_loss", name=self.name))
def get_teacher_student_loss(self, teacher_scores, student_scores):
"""Calculate the loss and its gradient for a batch of student
scores, relative to teacher scores.
teacher_scores: Scores representing the teacher model's predictions.
student_scores: Scores representing the student model's predictions.
RETURNS (Tuple[float, float]): The loss and the gradient.
DOCS: https://spacy.io/api/pipe#get_teacher_student_loss
"""
raise NotImplementedError(Errors.E931.format(parent="TrainablePipe", method="get_teacher_student_loss", name=self.name))
def create_optimizer(self) -> Optimizer:
"""Create an optimizer for the pipeline component.
@ -204,6 +265,14 @@ cdef class TrainablePipe(Pipe):
"""
raise NotImplementedError(Errors.E931.format(parent="Pipe", method="add_label", name=self.name))
@property
def is_distillable(self) -> bool:
# Normally a pipe overrides `get_teacher_student_loss` to implement
# distillation. In more exceptional cases, a pipe can provide its
# own `distill` implementation. If neither of these methods is
# overridden, the pipe does not implement distillation.
return not (self.__class__.distill is TrainablePipe.distill and self.__class__.get_teacher_student_loss is TrainablePipe.get_teacher_student_loss)
@property
def is_trainable(self) -> bool:
return True
@ -342,3 +411,11 @@ cdef class TrainablePipe(Pipe):
deserialize["model"] = load_model
util.from_disk(path, deserialize, exclude)
return self
@property
def save_activations(self):
return self._save_activations
@save_activations.setter
def save_activations(self, save_activations: bool):
self._save_activations = save_activations
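is_distillable is decided purely by method identity: a subclass counts as distillable as soon as it overrides distill or get_teacher_student_loss on the base class. A plain-Python sketch of the same check, outside of Cython and with hypothetical classes:

class Base:
    def distill(self, *args, **kwargs): ...
    def get_teacher_student_loss(self, teacher_scores, student_scores): ...
    @property
    def is_distillable(self):
        return not (
            self.__class__.distill is Base.distill
            and self.__class__.get_teacher_student_loss is Base.get_teacher_student_loss
        )

class PlainPipe(Base):
    pass

class DistillablePipe(Base):
    def get_teacher_student_loss(self, teacher_scores, student_scores):
        ...

print(PlainPipe().is_distillable)        # False
print(DistillablePipe().is_distillable)  # True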

View File

@ -1,20 +0,0 @@
from cymem.cymem cimport Pool
from thinc.backends.cblas cimport CBlas
from ..vocab cimport Vocab
from .trainable_pipe cimport TrainablePipe
from ._parser_internals.transition_system cimport Transition, TransitionSystem
from ._parser_internals._state cimport StateC
from ..ml.parser_model cimport WeightsC, ActivationsC, SizesC
cdef class Parser(TrainablePipe):
cdef public object _rehearsal_model
cdef readonly TransitionSystem moves
cdef public object _multitasks
cdef void _parseC(self, CBlas cblas, StateC** states,
WeightsC weights, SizesC sizes) nogil
cdef void c_transition_batch(self, StateC** states, const float* scores,
int nr_class, int batch_size) nogil

View File

@ -1,5 +1,6 @@
# cython: infer_types=True, cdivision=True, boundscheck=False, binding=True
from __future__ import print_function
from typing import Dict, Iterable, List, Optional, Tuple
from cymem.cymem cimport Pool
cimport numpy as np
from itertools import islice
@ -7,33 +8,43 @@ from libcpp.vector cimport vector
from libc.string cimport memset, memcpy
from libc.stdlib cimport calloc, free
import random
import contextlib
import srsly
from thinc.api import get_ops, set_dropout_rate, CupyOps, NumpyOps
from thinc.extra.search cimport Beam
from thinc.api import get_ops, set_dropout_rate, CupyOps, NumpyOps, Optimizer
from thinc.api import chain, softmax_activation, use_ops, get_array_module
from thinc.legacy import LegacySequenceCategoricalCrossentropy
from thinc.types import Floats2d, Ints1d
import numpy.random
import numpy
import warnings
from ._parser_internals.stateclass cimport StateClass
from ..ml.parser_model cimport alloc_activations, free_activations
from ..ml.parser_model cimport predict_states, arg_max_if_valid
from ..ml.parser_model cimport WeightsC, ActivationsC, SizesC, cpu_log_loss
from ..ml.parser_model cimport get_c_weights, get_c_sizes
from ..ml.tb_framework import TransitionModelInputs
from ._parser_internals.stateclass cimport StateC, StateClass
from ._parser_internals.search cimport Beam
from ..tokens.doc cimport Doc
from .trainable_pipe import TrainablePipe
from .trainable_pipe cimport TrainablePipe
from ._parser_internals cimport _beam_utils
from ._parser_internals import _beam_utils
from ..vocab cimport Vocab
from ._parser_internals.transition_system cimport Transition, TransitionSystem
from ..typedefs cimport weight_t
from ..training import validate_examples, validate_get_examples
from ..training import validate_distillation_examples
from ..errors import Errors, Warnings
from .. import util
# TODO: Remove when we switch to Cython 3.
cdef extern from "<algorithm>" namespace "std" nogil:
bint equal[InputIt1, InputIt2](InputIt1 first1, InputIt1 last1, InputIt2 first2) except +
NUMPY_OPS = NumpyOps()
cdef class Parser(TrainablePipe):
class Parser(TrainablePipe):
"""
Base class of the DependencyParser and EntityRecognizer.
"""
@ -123,6 +134,7 @@ cdef class Parser(TrainablePipe):
self._rehearsal_model = None
self.scorer = scorer
self._cpu_ops = get_ops("cpu") if isinstance(self.model.ops, CupyOps) else self.model.ops
def __getnewargs_ex__(self):
"""This allows pickling the Parser and its keyword-only init arguments"""
@ -132,8 +144,9 @@ cdef class Parser(TrainablePipe):
@property
def move_names(self):
names = []
cdef TransitionSystem moves = self.moves
for i in range(self.moves.n_moves):
name = self.moves.move_name(self.moves.c[i].move, self.moves.c[i].label)
name = self.moves.move_name(moves.c[i].move, moves.c[i].label)
# Explicitly removing the internal "U-" token used for blocking entities
if name != "U-":
names.append(name)
@ -202,6 +215,118 @@ cdef class Parser(TrainablePipe):
# Defined in subclasses, to avoid circular import
raise NotImplementedError
def distill(self,
teacher_pipe: Optional[TrainablePipe],
examples: Iterable["Example"],
*,
drop: float=0.0,
sgd: Optional[Optimizer]=None,
losses: Optional[Dict[str, float]]=None):
"""Train a pipe (the student) on the predictions of another pipe
(the teacher). The student is trained on the transition probabilities
of the teacher.
teacher_pipe (Optional[TrainablePipe]): The teacher pipe to learn
from.
examples (Iterable[Example]): Distillation examples. The reference
(teacher) and predicted (student) docs must have the same number of
tokens and the same orthography.
drop (float): dropout rate.
sgd (Optional[Optimizer]): An optimizer. Will be created via
create_optimizer if not set.
losses (Optional[Dict[str, float]]): Optional record of loss during
distillation.
RETURNS: The updated losses dictionary.
DOCS: https://spacy.io/api/dependencyparser#distill
"""
if teacher_pipe is None:
raise ValueError(Errors.E4002.format(name=self.name))
if losses is None:
losses = {}
losses.setdefault(self.name, 0.0)
validate_distillation_examples(examples, "TransitionParser.distill")
set_dropout_rate(self.model, drop)
student_docs = [eg.predicted for eg in examples]
max_moves = self.cfg["update_with_oracle_cut_size"]
if max_moves >= 1:
# Chop sequences into lengths of this many words, to make the
# batch uniform length. Since we do not have a gold standard
# sequence, we use the teacher's predictions as the gold
# standard.
max_moves = int(random.uniform(max(max_moves // 2, 1), max_moves * 2))
states = self._init_batch_from_teacher(teacher_pipe, student_docs, max_moves)
else:
states = self.moves.init_batch(student_docs)
# We distill as follows: 1. we first let the student predict transition
# sequences (and the corresponding transition probabilities); (2) we
# let the teacher follow the student's predicted transition sequences
# to obtain the teacher's transition probabilities; (3) we compute the
# gradients of the student's transition distributions relative to the
# teacher's distributions.
student_inputs = TransitionModelInputs(docs=student_docs,
states=[state.copy() for state in states], moves=self.moves, max_moves=max_moves)
(student_states, student_scores), backprop_scores = self.model.begin_update(student_inputs)
actions = _states_diff_to_actions(states, student_states)
teacher_inputs = TransitionModelInputs(docs=[eg.reference for eg in examples],
states=states, moves=teacher_pipe.moves, actions=actions)
(_, teacher_scores) = teacher_pipe.model.predict(teacher_inputs)
loss, d_scores = self.get_teacher_student_loss(teacher_scores, student_scores)
backprop_scores((student_states, d_scores))
if sgd is not None:
self.finish_update(sgd)
losses[self.name] += loss
return losses
def get_teacher_student_loss(
self, teacher_scores: List[Floats2d], student_scores: List[Floats2d],
normalize: bool=False,
) -> Tuple[float, List[Floats2d]]:
"""Calculate the loss and its gradient for a batch of student
scores, relative to teacher scores.
teacher_scores: Scores representing the teacher model's predictions.
student_scores: Scores representing the student model's predictions.
RETURNS (Tuple[float, List[Floats2d]]): The loss and the gradient.
DOCS: https://spacy.io/api/dependencyparser#get_teacher_student_loss
"""
# We can't easily hook up a softmax layer in the parsing model, since
# the get_loss does additional masking. So, we could apply softmax
# manually here and use Thinc's cross-entropy loss. But it's a bit
# suboptimal, since we can have a lot of states that would result in
many kernel launches. Furthermore, the parsing model's backprop expects
an XP array, so we'd have to concat the softmaxes anyway. So, like
# the get_loss implementation, we'll compute the loss and gradients
# ourselves.
teacher_scores = self.model.ops.softmax(self.model.ops.xp.vstack(teacher_scores),
axis=-1, inplace=True)
student_scores = self.model.ops.softmax(self.model.ops.xp.vstack(student_scores),
axis=-1, inplace=True)
assert teacher_scores.shape == student_scores.shape
d_scores = student_scores - teacher_scores
if normalize:
d_scores /= d_scores.shape[0]
loss = (d_scores**2).sum() / d_scores.size
return float(loss), d_scores
def init_multitask_objectives(self, get_examples, pipeline, **cfg):
"""Setup models for secondary objectives, to benefit from multi-task
learning. This method is intended to be overridden by subclasses.
@ -222,9 +347,6 @@ cdef class Parser(TrainablePipe):
stream: The sequence of documents to process.
batch_size (int): Number of documents to accumulate into a working set.
error_handler (Callable[[str, List[Doc], Exception], Any]): Function that
deals with a failing batch of documents. The default function just reraises
the exception.
YIELDS (Doc): Documents, in order.
"""
@ -246,83 +368,29 @@ cdef class Parser(TrainablePipe):
def predict(self, docs):
if isinstance(docs, Doc):
docs = [docs]
self._ensure_labels_are_added(docs)
if not any(len(doc) for doc in docs):
result = self.moves.init_batch(docs)
return result
if self.cfg["beam_width"] == 1:
return self.greedy_parse(docs, drop=0.0)
else:
return self.beam_parse(
docs,
drop=0.0,
beam_width=self.cfg["beam_width"],
beam_density=self.cfg["beam_density"]
)
with _change_attrs(self.model, beam_width=self.cfg["beam_width"], beam_density=self.cfg["beam_density"]):
inputs = TransitionModelInputs(docs=docs, moves=self.moves)
states_or_beams, _ = self.model.predict(inputs)
return states_or_beams
def greedy_parse(self, docs, drop=0.):
cdef vector[StateC*] states
cdef StateClass state
ops = self.model.ops
cdef CBlas cblas
if isinstance(ops, CupyOps):
cblas = NUMPY_OPS.cblas()
else:
cblas = ops.cblas()
self._resize()
self._ensure_labels_are_added(docs)
set_dropout_rate(self.model, drop)
batch = self.moves.init_batch(docs)
model = self.model.predict(docs)
weights = get_c_weights(model)
for state in batch:
if not state.is_final():
states.push_back(state.c)
sizes = get_c_sizes(model, states.size())
with nogil:
self._parseC(cblas, &states[0], weights, sizes)
model.clear_memory()
del model
return batch
with _change_attrs(self.model, beam_width=1):
inputs = TransitionModelInputs(docs=docs, moves=self.moves)
states, _ = self.model.predict(inputs)
return states
def beam_parse(self, docs, int beam_width, float drop=0., beam_density=0.):
cdef Beam beam
cdef Doc doc
self._ensure_labels_are_added(docs)
batch = _beam_utils.BeamBatch(
self.moves,
self.moves.init_batch(docs),
None,
beam_width,
density=beam_density
)
model = self.model.predict(docs)
while not batch.is_done:
states = batch.get_unfinished_states()
if not states:
break
scores = model.predict(states)
batch.advance(scores)
model.clear_memory()
del model
return list(batch)
cdef void _parseC(self, CBlas cblas, StateC** states,
WeightsC weights, SizesC sizes) nogil:
cdef int i, j
cdef vector[StateC*] unfinished
cdef ActivationsC activations = alloc_activations(sizes)
while sizes.states >= 1:
predict_states(cblas, &activations, states, &weights, sizes)
# Validate actions, argmax, take action.
self.c_transition_batch(states,
activations.scores, sizes.classes, sizes.states)
for i in range(sizes.states):
if not states[i].is_final():
unfinished.push_back(states[i])
for i in range(unfinished.size()):
states[i] = unfinished[i]
sizes.states = unfinished.size()
unfinished.clear()
free_activations(&activations)
with _change_attrs(self.model, beam_width=self.cfg["beam_width"], beam_density=self.cfg["beam_density"]):
inputs = TransitionModelInputs(docs=docs, moves=self.moves)
beams, _ = self.model.predict(inputs)
return beams
def set_annotations(self, docs, states_or_beams):
cdef StateClass state
@ -334,35 +402,6 @@ cdef class Parser(TrainablePipe):
for hook in self.postprocesses:
hook(doc)
def transition_states(self, states, float[:, ::1] scores):
cdef StateClass state
cdef float* c_scores = &scores[0, 0]
cdef vector[StateC*] c_states
for state in states:
c_states.push_back(state.c)
self.c_transition_batch(&c_states[0], c_scores, scores.shape[1], scores.shape[0])
return [state for state in states if not state.c.is_final()]
cdef void c_transition_batch(self, StateC** states, const float* scores,
int nr_class, int batch_size) nogil:
# n_moves should not be zero at this point, but make sure to avoid zero-length mem alloc
with gil:
assert self.moves.n_moves > 0, Errors.E924.format(name=self.name)
is_valid = <int*>calloc(self.moves.n_moves, sizeof(int))
cdef int i, guess
cdef Transition action
for i in range(batch_size):
self.moves.set_valid(is_valid, states[i])
guess = arg_max_if_valid(&scores[i*nr_class], is_valid, nr_class)
if guess == -1:
# This shouldn't happen, but it's hard to raise an error here,
# and we don't want to infinite loop. So, force to end state.
states[i].force_final()
else:
action = self.moves.c[guess]
action.do(states[i], action.label)
free(is_valid)
def update(self, examples, *, drop=0., sgd=None, losses=None):
cdef StateClass state
if losses is None:
@ -374,67 +413,99 @@ cdef class Parser(TrainablePipe):
)
for multitask in self._multitasks:
multitask.update(examples, drop=drop, sgd=sgd)
# We need to take care to act on the whole batch, because we might be
# getting vectors via a listener.
n_examples = len([eg for eg in examples if self.moves.has_gold(eg)])
if n_examples == 0:
return losses
set_dropout_rate(self.model, drop)
# The probability we use beam update, instead of falling back to
# a greedy update
beam_update_prob = self.cfg["beam_update_prob"]
if self.cfg['beam_width'] >= 2 and numpy.random.random() < beam_update_prob:
return self.update_beam(
examples,
beam_width=self.cfg["beam_width"],
sgd=sgd,
losses=losses,
beam_density=self.cfg["beam_density"]
)
docs = [eg.x for eg in examples if len(eg.x)]
max_moves = self.cfg["update_with_oracle_cut_size"]
if max_moves >= 1:
# Chop sequences into lengths of this many words, to make the
# batch uniform length.
max_moves = int(random.uniform(max_moves // 2, max_moves * 2))
states, golds, _ = self._init_gold_batch(
max_moves = int(random.uniform(max(max_moves // 2, 1), max_moves * 2))
init_states, gold_states, _ = self._init_gold_batch(
examples,
max_length=max_moves
)
else:
states, golds, _ = self.moves.init_gold_batch(examples)
if not states:
init_states, gold_states, _ = self.moves.init_gold_batch(examples)
inputs = TransitionModelInputs(docs=docs, moves=self.moves,
max_moves=max_moves, states=[state.copy() for state in init_states])
(pred_states, scores), backprop_scores = self.model.begin_update(inputs)
if sum(s.shape[0] for s in scores) == 0:
return losses
model, backprop_tok2vec = self.model.begin_update([eg.x for eg in examples])
all_states = list(states)
states_golds = list(zip(states, golds))
n_moves = 0
while states_golds:
states, golds = zip(*states_golds)
scores, backprop = model.begin_update(states)
d_scores = self.get_batch_loss(states, golds, scores, losses)
# Note that the gradient isn't normalized by the batch size
# here, because our "samples" are really the states...But we
# can't normalize by the number of states either, as then we'd
# be getting smaller gradients for states in long sequences.
backprop(d_scores)
# Follow the predicted action
self.transition_states(states, scores)
states_golds = [(s, g) for (s, g) in zip(states, golds) if not s.is_final()]
if max_moves >= 1 and n_moves >= max_moves:
break
n_moves += 1
backprop_tok2vec(golds)
d_scores = self.get_loss((gold_states, init_states, pred_states, scores),
examples, max_moves)
backprop_scores((pred_states, d_scores))
if sgd not in (None, False):
self.finish_update(sgd)
losses[self.name] += float((d_scores**2).sum())
# Ugh, this is annoying. If we're working on GPU, we want to free the
# memory ASAP. It seems that Python doesn't necessarily get around to
# removing these in time if we don't explicitly delete? It's confusing.
del backprop
del backprop_tok2vec
model.clear_memory()
del model
del backprop_scores
return losses
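Since `finish_update` is only applied when `sgd` is neither `None` nor `False`, a caller can pass `sgd=False` to accumulate gradients across several sub-batches and apply the optimizer once itself. A minimal sketch of that pattern (hypothetical names; not the actual spaCy training loop):

# Sketch: accumulate gradients over sub-batches, then take one optimizer step.
losses = {}
for subbatch in subbatches:                         # `subbatches` is assumed to exist
    nlp.update(subbatch, sgd=False, losses=losses)  # no finish_update inside update()
for name, proc in nlp.pipeline:
    if hasattr(proc, "finish_update"):
        proc.finish_update(optimizer)               # single optimizer step per full batch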
def get_loss(self, states_scores, examples, max_moves):
gold_states, init_states, pred_states, scores = states_scores
scores = self.model.ops.xp.vstack(scores)
costs = self._get_costs_from_histories(
examples,
gold_states,
init_states,
[list(state.history) for state in pred_states],
max_moves
)
xp = get_array_module(scores)
best_costs = costs.min(axis=1, keepdims=True)
gscores = scores.copy()
min_score = scores.min() - 1000
assert costs.shape == scores.shape, (costs.shape, scores.shape)
gscores[costs > best_costs] = min_score
max_ = scores.max(axis=1, keepdims=True)
gmax = gscores.max(axis=1, keepdims=True)
exp_scores = xp.exp(scores - max_)
exp_gscores = xp.exp(gscores - gmax)
Z = exp_scores.sum(axis=1, keepdims=True)
gZ = exp_gscores.sum(axis=1, keepdims=True)
d_scores = exp_scores / Z
d_scores -= (costs <= best_costs) * (exp_gscores / gZ)
return d_scores
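Put differently, `get_loss` returns the model's softmax over all actions minus a softmax renormalised over the minimal-cost ("oracle-optimal") actions. The same arithmetic as a standalone NumPy sketch (illustrative function name, CPU-only):

import numpy as np

def oracle_softmax_grad(scores: np.ndarray, costs: np.ndarray) -> np.ndarray:
    # scores: (n_states, n_actions) model scores; costs: oracle cost of each action.
    best = costs.min(axis=1, keepdims=True)
    gold_mask = costs <= best                        # minimal-cost actions act as gold
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                # softmax over all actions
    gscores = np.where(gold_mask, scores, scores.min() - 1000)
    q = np.exp(gscores - gscores.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)                # softmax restricted to gold actions
    return p - gold_mask * q                         # d_scores, as computed above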
def _get_costs_from_histories(self, examples, gold_states, init_states, histories, max_moves):
cdef TransitionSystem moves = self.moves
cdef StateClass state
cdef int clas
cdef int nF = self.model.get_dim("nF")
cdef int nO = moves.n_moves
cdef int nS = sum([len(history) for history in histories])
cdef Pool mem = Pool()
cdef np.ndarray costs_i
is_valid = <int*>mem.alloc(nO, sizeof(int))
batch = list(zip(init_states, histories, gold_states))
n_moves = 0
output = []
while batch:
costs = numpy.zeros((len(batch), nO), dtype="f")
for i, (state, history, gold) in enumerate(batch):
costs_i = costs[i]
clas = history.pop(0)
moves.set_costs(is_valid, <weight_t*>costs_i.data, state.c, gold)
action = moves.c[clas]
action.do(state.c, action.label)
state.c.history.push_back(clas)
output.append(costs)
batch = [(s, h, g) for s, h, g in batch if len(h) != 0]
if n_moves >= max_moves >= 1:
break
n_moves += 1
return self.model.ops.xp.vstack(output)
def rehearse(self, examples, sgd=None, losses=None, **cfg):
"""Perform a "rehearsal" update, to prevent catastrophic forgetting."""
if losses is None:
@@ -444,10 +515,9 @@ cdef class Parser(TrainablePipe):
multitask.rehearse(examples, losses=losses, sgd=sgd)
if self._rehearsal_model is None:
return None
losses.setdefault(self.name, 0.)
losses.setdefault(self.name, 0.0)
validate_examples(examples, "Parser.rehearse")
docs = [eg.predicted for eg in examples]
states = self.moves.init_batch(docs)
# This is pretty dirty, but the NER can resize itself in init_batch,
# if labels are missing. We therefore have to check whether we need to
# expand our model output.
@@ -455,85 +525,33 @@ cdef class Parser(TrainablePipe):
# Prepare the stepwise model, and get the callback for finishing the batch
set_dropout_rate(self._rehearsal_model, 0.0)
set_dropout_rate(self.model, 0.0)
tutor, _ = self._rehearsal_model.begin_update(docs)
model, backprop_tok2vec = self.model.begin_update(docs)
n_scores = 0.
loss = 0.
while states:
targets, _ = tutor.begin_update(states)
guesses, backprop = model.begin_update(states)
d_scores = (guesses - targets) / targets.shape[0]
# If all weights for an output are 0 in the original model, don't
# supervise that output. This allows us to add classes.
loss += (d_scores**2).sum()
backprop(d_scores)
# Follow the predicted action
self.transition_states(states, guesses)
states = [state for state in states if not state.is_final()]
n_scores += d_scores.size
# Do the backprop
backprop_tok2vec(docs)
student_inputs = TransitionModelInputs(docs=docs, moves=self.moves)
(student_states, student_scores), backprop_scores = self.model.begin_update(student_inputs)
actions = _states_to_actions(student_states)
teacher_inputs = TransitionModelInputs(docs=docs, moves=self.moves, actions=actions)
_, teacher_scores = self._rehearsal_model.predict(teacher_inputs)
loss, d_scores = self.get_teacher_student_loss(teacher_scores, student_scores, normalize=True)
teacher_scores = self.model.ops.xp.vstack(teacher_scores)
student_scores = self.model.ops.xp.vstack(student_scores)
assert teacher_scores.shape == student_scores.shape
d_scores = (student_scores - teacher_scores) / teacher_scores.shape[0]
# If all weights for an output are 0 in the original model, don't
# supervise that output. This allows us to add classes.
loss = (d_scores**2).sum() / d_scores.size
backprop_scores((student_states, d_scores))
if sgd is not None:
self.finish_update(sgd)
losses[self.name] += loss / n_scores
del backprop
del backprop_tok2vec
model.clear_memory()
tutor.clear_memory()
del model
del tutor
losses[self.name] += loss
return losses
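The rehearsal update above is, in effect, a teacher-student regression on the action scores: the current model is pulled back towards the frozen `_rehearsal_model`. A minimal sketch of that loss (hedged; the `get_teacher_student_loss` helper used above may differ in details such as normalisation):

import numpy as np

def mse_teacher_student(teacher_scores: np.ndarray, student_scores: np.ndarray):
    # Mean-squared-error style loss between teacher and student action scores,
    # mirroring the arithmetic shown in rehearse() above.
    assert teacher_scores.shape == student_scores.shape
    d_scores = (student_scores - teacher_scores) / teacher_scores.shape[0]
    loss = (d_scores ** 2).sum() / d_scores.size
    return float(loss), d_scores                     # gradient w.r.t. the student's scores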
def update_beam(self, examples, *, beam_width,
drop=0., sgd=None, losses=None, beam_density=0.0):
states, golds, _ = self.moves.init_gold_batch(examples)
if not states:
return losses
# Prepare the stepwise model, and get the callback for finishing the batch
model, backprop_tok2vec = self.model.begin_update(
[eg.predicted for eg in examples])
loss = _beam_utils.update_beam(
self.moves,
states,
golds,
model,
beam_width,
beam_density=beam_density,
)
losses[self.name] += loss
backprop_tok2vec(golds)
if sgd is not None:
self.finish_update(sgd)
def get_batch_loss(self, states, golds, float[:, ::1] scores, losses):
cdef StateClass state
cdef Pool mem = Pool()
cdef int i
# n_moves should not be zero at this point, but make sure to avoid zero-length mem alloc
assert self.moves.n_moves > 0, Errors.E924.format(name=self.name)
is_valid = <int*>mem.alloc(self.moves.n_moves, sizeof(int))
costs = <float*>mem.alloc(self.moves.n_moves, sizeof(float))
cdef np.ndarray d_scores = numpy.zeros((len(states), self.moves.n_moves),
dtype='f', order='C')
c_d_scores = <float*>d_scores.data
unseen_classes = self.model.attrs["unseen_classes"]
for i, (state, gold) in enumerate(zip(states, golds)):
memset(is_valid, 0, self.moves.n_moves * sizeof(int))
memset(costs, 0, self.moves.n_moves * sizeof(float))
self.moves.set_costs(is_valid, costs, state.c, gold)
for j in range(self.moves.n_moves):
if costs[j] <= 0.0 and j in unseen_classes:
unseen_classes.remove(j)
cpu_log_loss(c_d_scores,
costs, is_valid, &scores[i, 0], d_scores.shape[1])
c_d_scores += d_scores.shape[1]
# Note that we don't normalize this. See comment in update() for why.
if losses is not None:
losses.setdefault(self.name, 0.)
losses[self.name] += (d_scores**2).sum()
return d_scores
raise NotImplementedError
def set_output(self, nO):
self.model.attrs["resize_output"](self.model, nO)
@@ -572,7 +590,7 @@ cdef class Parser(TrainablePipe):
for example in islice(get_examples(), 10):
doc_sample.append(example.predicted)
assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
self.model.initialize(doc_sample)
self.model.initialize((doc_sample, self.moves))
if nlp is not None:
self.init_multitask_objectives(get_examples, nlp.pipeline)
@@ -629,28 +647,75 @@ cdef class Parser(TrainablePipe):
raise ValueError(Errors.E149) from None
return self
def _init_gold_batch(self, examples, max_length):
"""Make a square batch, of length equal to the shortest transition
def _init_batch_from_teacher(self, teacher_pipe, docs, max_length):
"""Make a square batch of length equal to the shortest transition
sequence or a cap. A long
doc will get multiple states. Let's say we have a doc of length 2*N,
where N is the shortest doc. We'll make two states, one representing
long_doc[:N], and another representing long_doc[N:]."""
long_doc[:N], and another representing long_doc[N:]. In contrast to
_init_gold_batch, this version uses a teacher model to generate the
cut sequences."""
cdef:
StateClass state
TransitionSystem moves = teacher_pipe.moves
# Start with the same heuristic as in supervised training: exclude
# docs that are within the maximum length.
all_states = moves.init_batch(docs)
states = []
to_cut = []
for state, doc in zip(all_states, docs):
if not state.is_final():
if len(doc) < max_length:
states.append(state)
else:
to_cut.append(state)
if not to_cut:
return states
# Parse the states that are too long with the teacher's parsing model.
teacher_inputs = TransitionModelInputs(docs=docs, moves=moves,
states=[state.copy() for state in to_cut])
(teacher_states, _ ) = teacher_pipe.model.predict(teacher_inputs)
# Step through the teacher's actions and store every state after
# each multiple of max_length.
teacher_actions = _states_to_actions(teacher_states)
while to_cut:
states.extend(state.copy() for state in to_cut)
for step_actions in teacher_actions[:max_length]:
to_cut = moves.apply_actions(to_cut, step_actions)
teacher_actions = teacher_actions[max_length:]
if len(teacher_actions) < max_length:
break
return states
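For illustration, with hypothetical numbers: if `max_length` is 10 and the teacher's parse of a long doc takes 25 actions, the loop above stores copies of the states after 0 and after 10 teacher actions, so that doc contributes two cut sequences of at most 10 moves each; once fewer than `max_length` teacher actions remain, no further starting points are added.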
def _init_gold_batch(self, examples, max_length):
"""Make a square batch, of length equal to the shortest transition
sequence or a cap. A long doc will get multiple states. Let's say we
have a doc of length 2*N, where N is the shortest doc. We'll make
two states, one representing long_doc[:N], and another representing
long_doc[N:]."""
cdef:
StateClass start_state
StateClass state
Transition action
all_states = self.moves.init_batch([eg.predicted for eg in examples])
TransitionSystem moves = self.moves
all_states = moves.init_batch([eg.predicted for eg in examples])
states = []
golds = []
to_cut = []
for state, eg in zip(all_states, examples):
if self.moves.has_gold(eg) and not state.is_final():
gold = self.moves.init_gold(state, eg)
if moves.has_gold(eg) and not state.is_final():
gold = moves.init_gold(state, eg)
if len(eg.x) < max_length:
states.append(state)
golds.append(gold)
else:
oracle_actions = self.moves.get_oracle_sequence_from_state(
oracle_actions = moves.get_oracle_sequence_from_state(
state.copy(), gold)
to_cut.append((eg, state, gold, oracle_actions))
if not to_cut:
@@ -660,13 +725,94 @@ cdef class Parser(TrainablePipe):
for i in range(0, len(oracle_actions), max_length):
start_state = state.copy()
for clas in oracle_actions[i:i+max_length]:
action = self.moves.c[clas]
action = moves.c[clas]
action.do(state.c, action.label)
if state.is_final():
break
if self.moves.has_gold(eg, start_state.B(0), state.B(0)):
if moves.has_gold(eg, start_state.B(0), state.B(0)):
states.append(start_state)
golds.append(gold)
if state.is_final():
break
return states, golds, max_length
@contextlib.contextmanager
def _change_attrs(model, **kwargs):
    """Temporarily modify a thinc model's attributes."""
    unset = object()
    old_attrs = {}
    for key, value in kwargs.items():
        old_attrs[key] = model.attrs.get(key, unset)
        model.attrs[key] = value
    yield model
    for key, value in old_attrs.items():
        if value is unset:
            model.attrs.pop(key)
        else:
            model.attrs[key] = value
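A hypothetical usage of `_change_attrs`, mirroring the calls in the parsing methods above, temporarily forces greedy decoding and then restores whatever was set before:

# Force beam_width=1 only for the duration of this prediction (`parser` and
# `inputs` are illustrative names).
with _change_attrs(parser.model, beam_width=1):
    states, _ = parser.model.predict(inputs)
# On exit the original beam_width value is restored, or removed if it was unset.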
def _states_to_actions(states: List[StateClass]) -> List[Ints1d]:
    cdef int step
    cdef StateClass state
    cdef StateC* c_state
    actions = []
    while True:
        step = len(actions)
        step_actions = []
        for state in states:
            c_state = state.c
            if step < c_state.history.size():
                step_actions.append(c_state.history[step])
        # We are done if we have exhausted all histories.
        if len(step_actions) == 0:
            break
        actions.append(numpy.array(step_actions, dtype="i"))
    return actions
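For example (hypothetical histories): three states with histories [1, 4, 2], [1, 4] and [3] are transposed into the per-step arrays [1, 1, 3], [4, 4] and [2]; per-state histories become ragged per-step action batches that can be replayed with `moves.apply_actions`, as in `_init_batch_from_teacher` above.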
def _states_diff_to_actions(
    before_states: List[StateClass],
    after_states: List[StateClass]
) -> List[Ints1d]:
    """
    Return for two sets of states the actions to go from the first set of
    states to the second set of states. The histories of the first set of
    states must be a prefix of the second set of states.
    """
    cdef StateClass before_state, after_state
    cdef StateC* c_state_before
    cdef StateC* c_state_after
    assert len(before_states) == len(after_states)
    # Check invariant: before states histories must be prefixes of after states.
    for before_state, after_state in zip(before_states, after_states):
        c_state_before = before_state.c
        c_state_after = after_state.c
        assert equal(c_state_before.history.begin(), c_state_before.history.end(),
                     c_state_after.history.begin())
    actions = []
    while True:
        step = len(actions)
        step_actions = []
        for before_state, after_state in zip(before_states, after_states):
            c_state_before = before_state.c
            c_state_after = after_state.c
            if step < c_state_after.history.size() - c_state_before.history.size():
                step_actions.append(c_state_after.history[c_state_before.history.size() + step])
        # We are done if we have exhausted all histories.
        if len(step_actions) == 0:
            break
        actions.append(numpy.array(step_actions, dtype="i"))
    return actions
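For example (hypothetical histories): if a before-state has history [1, 4] and its after-state has history [1, 4, 2, 3], the result is the per-step arrays [2] and [3], i.e. exactly the actions needed to advance the first state to the second.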


@@ -1,6 +1,5 @@
from typing import Dict, List, Union, Optional, Any, Callable, Type, Tuple
from typing import Iterable, TypeVar, TYPE_CHECKING
from .compat import Literal
from typing import Iterable, TypeVar, Literal, TYPE_CHECKING
from enum import Enum
from pydantic import BaseModel, Field, ValidationError, validator, create_model
from pydantic import StrictStr, StrictInt, StrictFloat, StrictBool, ConstrainedStr
@@ -144,7 +143,7 @@ def validate_init_settings(
def validate_token_pattern(obj: list) -> List[str]:
# Try to convert non-string keys (e.g. {ORTH: "foo"} -> {"ORTH": "foo"})
get_key = lambda k: NAMES[k] if isinstance(k, int) and k < len(NAMES) else k
get_key = lambda k: NAMES[k] if isinstance(k, int) and k in NAMES else k
if isinstance(obj, list):
converted = []
for pattern in obj:
@@ -423,6 +422,27 @@ class ConfigSchemaInit(BaseModel):
arbitrary_types_allowed = True
class ConfigSchemaDistillEmpty(BaseModel):
    class Config:
        extra = "forbid"


class ConfigSchemaDistill(BaseModel):
    # fmt: off
    batcher: Batcher = Field(..., title="Batcher for the training data")
    corpus: StrictStr = Field(..., title="Path in the config to the distillation data")
    dropout: StrictFloat = Field(..., title="Dropout rate")
    max_epochs: StrictInt = Field(..., title="Maximum number of epochs to distill for")
    max_steps: StrictInt = Field(..., title="Maximum number of steps to distill for")
    optimizer: Optimizer = Field(..., title="The optimizer to use")
    student_to_teacher: Dict[str, str] = Field(..., title="Mapping from student to teacher pipe")
    # fmt: on

    class Config:
        extra = "forbid"
        arbitrary_types_allowed = True
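Because `ConfigSchemaDistillEmpty` forbids extra fields, a config without a `[distillation]` block validates against the empty schema, while any populated block must satisfy `ConfigSchemaDistill`. A small sketch of that behaviour (assuming the classes above are importable, e.g. from `spacy.schemas`):

from pydantic import ValidationError
from spacy.schemas import ConfigSchemaDistillEmpty  # assumed import path

ConfigSchemaDistillEmpty()                # an absent/empty [distillation] section is valid
try:
    ConfigSchemaDistillEmpty(dropout=0.1)
except ValidationError:
    print("populated sections must match ConfigSchemaDistill instead")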
class ConfigSchema(BaseModel):
training: ConfigSchemaTraining
nlp: ConfigSchemaNlp
@@ -430,6 +450,7 @@ class ConfigSchema(BaseModel):
components: Dict[str, Dict[str, Any]]
corpora: Dict[str, Reader]
initialize: ConfigSchemaInit
distillation: Union[ConfigSchemaDistill, ConfigSchemaDistillEmpty] = {} # type: ignore[assignment]
class Config:
extra = "allow"
@@ -441,6 +462,7 @@ CONFIG_SCHEMAS = {
"training": ConfigSchemaTraining,
"pretraining": ConfigSchemaPretrain,
"initialize": ConfigSchemaInit,
"distillation": ConfigSchemaDistill,
}


@@ -104,7 +104,7 @@ class Scorer:
def __init__(
self,
nlp: Optional["Language"] = None,
default_lang: str = "xx",
default_lang: str = "mul",
default_pipeline: Iterable[str] = DEFAULT_PIPELINE,
**cfg,
) -> None:


@@ -1,4 +1,4 @@
from libc.stdint cimport int64_t
from libc.stdint cimport int64_t, uint32_t
from libcpp.vector cimport vector
from libcpp.set cimport set
from cymem.cymem cimport Pool
@@ -7,13 +7,6 @@ from murmurhash.mrmr cimport hash64
from .typedefs cimport attr_t, hash_t
cpdef hash_t hash_string(str string) except 0
cdef hash_t hash_utf8(char* utf8_string, int length) nogil
cdef str decode_Utf8Str(const Utf8Str* string)
ctypedef union Utf8Str:
unsigned char[8] s
unsigned char* p
@@ -21,9 +14,13 @@ ctypedef union Utf8Str:
cdef class StringStore:
cdef Pool mem
cdef vector[hash_t] _keys
cdef PreshMap _map
cdef vector[hash_t] keys
cdef public PreshMap _map
cdef hash_t _intern_str(self, str string)
cdef Utf8Str* _allocate_str_repr(self, const unsigned char* chars, uint32_t length) except *
cdef str _decode_str_repr(self, const Utf8Str* string)
cdef const Utf8Str* intern_unicode(self, str py_string)
cdef const Utf8Str* _intern_utf8(self, char* utf8_string, int length, hash_t* precalculated_hash)
cpdef hash_t hash_string(object string) except -1
cpdef hash_t get_string_id(object string_or_hash) except -1
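From Python, the hashing helpers declared here behave roughly as follows (hedged usage sketch):

from spacy.strings import StringStore, get_string_id

store = StringStore(["coffee"])
assert store["coffee"] == get_string_id("coffee")   # same 64-bit hash either way
assert store[store["coffee"]] == "coffee"           # hashes resolve back to strings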

Some files were not shown because too many files have changed in this diff.