Commit Graph

542 Commits

Author SHA1 Message Date
Paul O'Leary McCann
6029cfc391
Add scores to output in spancat (#8855)
* Add scores to output in spancat

This exposes the scores as an attribute on the SpanGroup. Includes a
basic test.

* Add basic doc note

* Vectorize score calcs

* Add "annotation format" section

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Clean up doc section

* Ran prettier on docs

* Get arrays off the gpu before iterating over them

* Remove int() calls

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-08-10 13:47:49 +02:00
Adriane Boyd
941a591f3c
Pass excludes when serializing vocab (#8824)
* Pass excludes when serializing vocab

Additional minor bug fix:

* Deserialize vocab in `EntityLinker.from_disk`

* Add test for excluding strings on load

* Fix formatting
2021-08-03 14:42:44 +02:00
Sofie Van Landeghem
83e27d262e
negative tag annotation (#8731)
* unit test to unlearn tag via negative annotation

* bump thinc to 8.0.8
2021-07-19 14:39:11 +02:00
Paul O'Leary McCann
8bd0474730 Run black 2021-07-18 20:20:22 +09:00
Paul O'Leary McCann
bc081c24fa Add full traditional scoring
This calculates scores as an average of three metrics. As noted in the
code, these metrics all have issues, but we want to use them to match up
with prior work.

This should be replaced with some simpler default scoring and the scorer
here should be moved to an external project to be passed in just for
generating the traditional scores.
2021-07-18 20:13:10 +09:00
explosion-bot
eff3d1088b Auto-format code with black 2021-07-16 08:03:36 +00:00
Sofie Van Landeghem
77859beb99
spacy.ngram_range_suggester.v1 (#8699) 2021-07-15 10:01:22 +02:00
Paul O'Leary McCann
80a17071d3 Remove unused code 2021-07-11 18:46:39 +09:00
Paul O'Leary McCann
447c7070e3 Fix loss
Accidentally deleted it
2021-07-10 22:45:25 +09:00
Paul O'Leary McCann
e00bd422d9 Fix span embeds
Some of the lengths and backprop weren't right.

Also various cleanup.
2021-07-10 21:38:53 +09:00
Paul O'Leary McCann
dc1f974d39 Merge branch 'master' into feature/coref 2021-07-10 18:10:40 +09:00
Sofie Van Landeghem
64fac754fe
add spacy prefix to ngram_suggester.v1 (#8623) 2021-07-07 08:09:30 +02:00
Sofie Van Landeghem
733e8ceea9
fix spancat initialize with labels (#8620) 2021-07-06 19:08:25 +02:00
Sofie Van Landeghem
3daf57d70c
Small spancat fixes (#8614)
* two small fixes + additional tests

* rename
2021-07-06 14:15:41 +02:00
Adriane Boyd
29906884c5
Raise an error for textcat with <2 labels (#8584)
* Raise an error for textcat with <2 labels

Raise an error if initializing a `textcat` component without at least
two labels.

* Add similar note to docs

* Update positive_label description in API docs
2021-07-06 12:35:22 +02:00
Paul O'Leary McCann
8f66176b2d Fix loss?
This rewrites the loss to not use the Thinc crossentropy code at all.
The main difference here is that the negative predictions are being
masked out (= marginalized over), but negative gradient is still being
reflected.

I'm still not sure this is exactly right but models seem to train
reliably now.
2021-07-05 18:17:10 +09:00
Paul O'Leary McCann
2d3c559dc4 On initialize, use just two samples
Coref docs are kind of long, and using 10 samples on a smallish GPU can
cause OOMs.
2021-07-03 18:43:03 +09:00
Paul O'Leary McCann
f2e0e9dc28 Move placeholder handling into model code 2021-07-03 18:38:48 +09:00
Ines Montani
7f65902702
Merge pull request #8522 from adrianeboyd/chore/update-flake8
Update flake8 version in reqs and CI
2021-06-28 21:46:06 +10:00
Adriane Boyd
86d01e9229 Tidy up with flake8: imports, comparisons, etc. 2021-06-28 12:08:15 +02:00
Adriane Boyd
5eeb25f043 Tidy up code 2021-06-28 12:08:15 +02:00
Adriane Boyd
4b0ed73ed4 Update flake8 version in reqs and CI
* Update some unneeded forward refs related to flake8 checks
2021-06-28 11:29:36 +02:00
Santiago Castro
a2bc743e47
Fix typo in comment 2021-06-25 18:58:38 -07:00
Adrian Zuber
f5aee0bbdf
Raise custom error in EntityLinker when KB is not set (#8442)
* Raise custom error in EntityLinker when KB is not set

* add contributor agreement

* Update E1018 error message
2021-06-25 23:04:00 +02:00
Matthew Honnibal
f9946154d9
Add SpanCategorizer component (#6747)
* Draft spancat model

* Add spancat model

* Add test for extract_spans

* Add extract_spans layer

* Upd extract_spans

* Add spancat model

* Add test for spancat model

* Upd spancat model

* Update spancat component

* Upd spancat

* Update spancat model

* Add quick spancat test

* Import SpanCategorizer

* Fix SpanCategorizer component

* Import SpanGroup

* Fix span extraction

* Fix import

* Fix import

* Upd model

* Update spancat models

* Add scoring, update defaults

* Update and add docs

* Fix type

* Update spacy/ml/extract_spans.py

* Auto-format and fix import

* Fix comment

* Fix type

* Fix type

* Update website/docs/api/spancategorizer.md

* Fix comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Better defense

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix labels list

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/extract_spans.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Set annotations during update

* Set annotations in spancat

* fix imports in test

* Update spacy/pipeline/spancat.py

* replace MaxoutLogistic with LinearLogistic

* fix config

* various small fixes

* remove set_annotations parameter in update

* use our beloved tupley format with recent support for doc.spans

* bugfix to allow renaming the default span_key (scores weren't showing up)

* use different key in docs example

* change defaults to better-working parameters from project (WIP)

* register spacy.extract_spans.v1 for legacy purposes

* Upd dev version so can build wheel

* layers instead of architectures for smaller building blocks

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Include additional scores from overrides in combined score weights

* Parameterize spans key in scoring

Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.

* Use the (intentionally very short) default spans key `sc` in the
  `SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
  returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
  the key in the `getter` instead.

Note that for `spancat` components with a custom `span_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.

* Update website/docs/api/scorer.md

* Fix scorer for spans key containing underscore

* Increment version

* Add Spans to Evaluate CLI (#8439)

* Add Spans to Evaluate CLI

* Change to spans_key

* Add spans per_type output

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix spancat GPU issues (#8455)

* Fix GPU issues

* Require thinc >=8.0.6

* Switch to glorot_uniform_init

* Fix and test ngram suggester

* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
2021-06-24 12:35:27 +02:00
Adriane Boyd
ec71a6b572
Filter W036 for entity ruler, etc. (#8424) 2021-06-21 09:34:29 +02:00
Paul O'Leary McCann
a62121e3b4 Expose more hyperparameters 2021-06-17 21:21:46 +09:00
Matthew Honnibal
6f5e308d17
Support negative examples in partial NER annotations (#8106)
* Support a cfg field in transition system

* Make NER 'has gold' check use right alignment for span

* Pass 'negative_samples_key' property into NER transition system

* Add field for negative samples to NER transition system

* Check neg_key in NER has_gold

* Support negative examples in NER oracle

* Test for negative examples in NER

* Fix name of config variable in NER

* Remove vestiges of old-style partial annotation

* Remove obsolete tests

* Add comment noting lack of support for negative samples in parser

* Additions to "neg examples" PR (#8201)

* add custom error and test for deprecated format

* add test for unlearning an entity

* add break also for Begin's cost

* add negative_samples_key property on Parser

* rename

* extend docs & fix some older docs issues

* add subclass constructors, clean up tests, fix docs

* add flaky test with ValueError if gold parse was not found

* remove ValueError if n_gold == 0

* fix docstring

* Hack in environment variables to try out training

* Remove hack

* Remove NER hack, and support 'negative O' samples

* Fix O oracle

* Fix transition parser

* Remove 'not O' from oracle

* Fix NER oracle

* check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation

* use set instead of list in consistency check

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-17 17:33:00 +10:00
Sofie Van Landeghem
e796aab4b3
Resizable textcat (#7862)
* implement textcat resizing for TextCatCNN

* resizing textcat in-place

* simplify code

* ensure predictions for old textcat labels remain the same after resizing (WIP)

* fix for softmax

* store softmax as attr

* fix ensemble weight copy and cleanup

* restructure slightly

* adjust documentation, update tests and quickstart templates to use latest versions

* extend unit test slightly

* revert unnecessary edits

* fix typo

* ensemble architecture won't be resizable for now

* use resizable layer (WIP)

* revert using resizable layer

* resizable container while avoid shape inference trouble

* cleanup

* ensure model continues training after resizing

* use fill_b parameter

* use fill_defaults

* resize_layer callback

* format

* bump thinc to 8.0.4

* bump spacy-legacy to 3.0.6
2021-06-16 11:45:00 +02:00
Adriane Boyd
5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
graue70
f34dd0b98f
Fix typos in comments (#8279) 2021-06-07 10:43:54 +02:00
Adriane Boyd
9dfd3c9484 Use warnings.warn instead of logger.warning 2021-06-04 17:44:08 +02:00
Sofie Van Landeghem
f0277bdeab Show warning if entity_ruler runs without patterns (#7807)
* Show warning if entity_ruler runs without patterns

* Show warning if matcher runs without patterns

* fix wording

* unit test for warning once (WIP)

* warn W036 only once

* cleanup

* create filter_warning helper
2021-06-04 17:37:38 +02:00
Paul O'Leary McCann
67d9ebc922 Transpose before calculating loss 2021-06-04 17:56:08 +09:00
Paul O'Leary McCann
d959603d51
Don't add duplicate patterns all the time in EntityRuler (fix #8216) (#8246)
* Don't add duplicate patterns (fix #8216)

* Refactor EntityRuler init

This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.

* Make EntityRuler.clear reset matchers

Includes a new test for this.

* Tidy PhraseMatcher instantiation

Since the attr can be None safely now, the guard if is no longer
required here.

Also renamed the `_validate` attr. Maybe it's not needed?

* Fix NER test

* Add test to make sure patterns aren't increasing

* Move test to regression tests
2021-06-03 09:05:26 +02:00
Paul O'Leary McCann
d54631f68b
Fix other open calls without context managers (#8245) 2021-05-31 19:04:29 +10:00
Dhruv Naik
283f64a98d
Fix bug from Entityruler: ent_ids returns None for phrases (#8169)
* bugfix for explosion/spaCy#8168

* add test for explosion/spaCy#8168
2021-05-31 18:38:53 +10:00
Sofie Van Landeghem
fff662e41f
Ensemble textcat with listener (#8012)
* add unit test for two listeners, with a textcat ensemble in the middle

* return zero gradients instead of None in accumulate_gradient
2021-05-31 18:21:06 +10:00
Sofie Van Landeghem
ff91e6dac7
Show warning if entity_ruler runs without patterns (#7807)
* Show warning if entity_ruler runs without patterns

* Show warning if matcher runs without patterns

* fix wording

* unit test for warning once (WIP)

* warn W036 only once

* cleanup

* create filter_warning helper
2021-05-31 18:20:27 +10:00
Paul O'Leary McCann
04239e94c7
Use a context manager when reading model (fix #7036) (#8244) 2021-05-31 17:36:17 +10:00
svlandeg
04b55bf054 removing unused imports 2021-05-27 16:31:38 +02:00
svlandeg
910026582d set versions to v1 instead of v0 2021-05-27 16:17:20 +02:00
svlandeg
ba2e491cc4 Merge remote-tracking branch 'upstream/master' into feature/coref 2021-05-27 13:50:32 +02:00
Paul O'Leary McCann
a484245f35 Remove references to coref_er 2021-05-24 19:08:45 +09:00
Paul O'Leary McCann
d6389b133d Don't use a generator for no reason 2021-05-24 19:06:15 +09:00
Paul O'Leary McCann
0942a0b51b Remove coref_er.py
The intent of this was that it would be a component pipeline that used
entities as input, but that's now covered by the get_mentions function
as a pipeline arg.
2021-05-21 18:20:25 +09:00
Paul O'Leary McCann
f6652c9252 Add new coref scoring
This is closer to the traditional evaluation method. That uses an
average of three scores, this is just using the bcubed metric for now
(nothing special about bcubed, just picked one).

The scoring implementation comes from the coval project. It relies on
scipy, which is one issue, and is rather involved, which is another.

Besides being comparable with traditional evaluations, this scoring is
relatively fast.
2021-05-21 15:56:40 +09:00
Paul O'Leary McCann
e1b4a85bb9 Fix loss
The loss was being returned as a single element array, which caused
training to die when it attempted to turn it into JSON.
2021-05-21 15:46:50 +09:00
Sofie Van Landeghem
202943bc8c
KB & NEL to/from bytes (#8113)
* unit test for pickling KB

* add pickling test for NEL

* KB to_bytes and from_bytes

* NEL to_bytes and from_bytes

* xfail pickle tests for now

* fix docs

* cleanup
2021-05-20 18:11:30 +10:00
Paul O'Leary McCann
d22acee4f7 Fix backprop
Training seems to actually run now!
2021-05-18 20:09:27 +09:00
Paul O'Leary McCann
2486b8ad4d Fix pipeline intialize 2021-05-18 19:56:27 +09:00
Paul O'Leary McCann
e303628205 Attempt to use registry correctly 2021-05-17 14:52:48 +09:00
Paul O'Leary McCann
91b111467b Minor fixes 2021-05-17 14:52:30 +09:00
Paul O'Leary McCann
7c42a8c90a Migrate coref code
This includes the coref code that was being tested separately, modified
to work in spaCy. It hasn't been tested yet and presumably still needs
fixes.

In particular, the evaluation code is currently omitted. It's unclear at
the moment whether we want to use a complex scorer similar to the
official one, or a simpler scorer using more modern evaluation methods.
2021-05-15 21:36:10 +09:00
Paul O'Leary McCann
3608b7b3f9 Merge branch 'master' into feature/coref 2021-05-15 20:05:17 +09:00
Adriane Boyd
6788d90f61
Preserve existing ENT_KB_ID annotation in NER (#7988)
* Preserve existing ENT_KB_ID annotation in NER

Preserve `ent_kb_id` annotation on existing entity spans, which is not
preserved by the transition system.

* Simplify kb_id assignment

* Simplify further
2021-05-06 18:49:55 +10:00
Adriane Boyd
d2bdaa7823
Replace negative rows with 0 in StaticVectors (#7674)
* Replace negative rows with 0 in StaticVectors

Replace negative row indices with 0-vectors in `StaticVectors`.

* Increase versions related to StaticVectors

* Increase versions of all architctures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations

Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5

* Update config defaults to new versions

* Update docs
2021-04-22 18:04:15 +10:00
Sofie Van Landeghem
27dbbb9903
Bugfix/nel crossing sentence (#7630)
* ensure each entity gets a KB ID, even when it's not within a sentence

* cleanup
2021-04-12 18:08:01 +10:00
Adriane Boyd
8008e2f75b
Use morph hash in lemmatizer cache key (#7690)
Use the morph hash rather than the `MorphAnalysis` object in the cache
key so that the `Lemmatizer` can be pickled.
2021-04-08 13:22:38 +02:00
Adriane Boyd
39153ef90f Update lexeme_norm checks
* Add util method for check
* Add new languages to list with lexeme norm tables
* Add check to all relevant components
* Add config details to warning message

Note that we're not actually inspecting the model config to see if
`NORM` is used as an attribute, so it may warn in cases where it's not
relevant.
2021-03-19 10:59:27 +01:00
Ines Montani
37fc495f5d
Merge pull request #7353 from jankrepl/fix_entity_rules_labels 2021-03-09 15:09:24 +01:00
Sofie Van Landeghem
932887b950
textcat scoring fix and multi_label docs (#6974)
* add multi-label textcat to menu

* add infobox on textcat API

* add info to v3 migration guide

* small edits

* further fixes in doc strings

* add infobox to textcat architectures

* add textcat_multilabel to overview of built-in components

* spelling

* fix unrelated warn msg

* Add textcat_multilabel to quickstart [ci skip]

* remove separate documentation page for multilabel_textcategorizer

* small edits

* positive label clarification

* avoid duplicating information in self.cfg and fix textcat.score

* fix multilabel textcat too

* revert threshold to storage in cfg

* revert threshold stuff for multi-textcat

Co-authored-by: Ines Montani <ines@ines.io>
2021-03-09 23:04:22 +11:00
Jan Krepl
f26b61e001 Make sure sorted 2021-03-09 10:49:53 +01:00
Sofie Van Landeghem
e0c45c669a
Native coref component (#7243)
* initial coref_er pipe

* matcher more flexible

* base coref component without actual model

* initial setup of coref_er.score

* rename to include_label

* preliminary score_clusters method

* apply scoring in coref component

* IO fix

* return None loss for now

* rename to CoreferenceResolver

* some preliminary unit tests

* use registry as callable
2021-03-03 13:50:14 +01:00
Adriane Boyd
10c930cc96
Re-refactor Sentencizer with Pipe API (#7176)
Reapply the refactoring (#4721) so that `Sentencizer` uses the faster
`predict` and `set_annotations` for both `__call__` and `pipe`.
2021-02-26 09:48:14 +01:00
Sofie Van Landeghem
b92f81d5da
fix NEL config and IO, and n_sents functionality (#7100)
* fix NEL config and IO, and n_sents functionality

* add docs

* fix test
2021-02-22 14:49:52 +11:00
Sofie Van Landeghem
ba5a50f62b
NEL docs & UX (#7129)
* EL set_kb docs fix

* custom warning for set_kb mistake
2021-02-22 11:04:22 +11:00
Matthew Honnibal
0fb8d437c0
Fix sentence fragments bug (#7056, #7035) (#7057)
* Add test for #7035

* Update test for issue 7056

* Fix test

* Fix transitions method used in testing

* Fix state eol detection when rebuffer

* Clean up redundant fix
2021-02-14 13:38:13 +11:00
Ines Montani
9ba715ed16 Tidy up and auto-format 2021-02-13 12:55:56 +11:00
svlandeg
aa3ad8825d loop instead of any 2021-02-12 13:14:30 +01:00
svlandeg
a52d466bfc any instead of all 2021-02-11 20:50:55 +01:00
Ines Montani
ad9ce3c8f6 Fix issue #6950: allow pickling Tok2Vec with listeners 2021-02-11 11:37:39 +11:00
Sofie Van Landeghem
a323ef90df
ensure the loss value is cast as float (#6928) 2021-02-07 07:51:56 +08:00
René Octavio Queiroz Dias
999ff03b19
fix: Fix textcat labels to expect a Optional[Iterable[str]] instead of Optional[Dict] (#6911)
* docs: Add agreement

* bug: Regression test

Issue #6908

* fix: Changed from Dict to Iterable[str]

Fix #6908

* Update test to use make_tempdir

* fix: Fix WindowsPath error

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-02-04 23:37:13 +01:00
Sofie Van Landeghem
f638306598
remove link_components flag again (#6883) 2021-02-02 10:08:40 +08:00
Ines Montani
d0c3775712 Replace links to nightly docs [ci skip] 2021-01-30 20:09:38 +11:00
Ines Montani
e6accb3a9e Tidy up and auto-format 2021-01-30 12:52:33 +11:00
Ines Montani
2102082478 Make Tok2Vec.remove_listener return bool
Whether listener was removed
2021-01-29 21:41:38 +11:00
Ines Montani
0f3e3eedc2 Add Tok2vec.remove_listener 2021-01-29 19:36:38 +11:00
Sofie Van Landeghem
837a4f53c2
Error handling in nlp.pipe (#6817)
* add error handler for pipe methods

* add unit tests

* remove pipe method that are the same as their base class

* have Language keep track of a default error handler

* cleanup

* formatting

* small refactor

* add documentation
2021-01-29 08:51:21 +08:00
Sofie Van Landeghem
6b68ad027b
Fix beam NER resizing (#6834)
* move label check to sub methods

* add tests
2021-01-27 23:39:14 +11:00
Matthew Honnibal
05050210f3 Dont add labels implicitly for parser 2021-01-27 13:04:47 +11:00
Matthew Honnibal
1d20e21f3e Add labels implicitly for parser and ner 2021-01-27 12:54:47 +11:00
Matthew Honnibal
f049df1715
Revert "Set annotations in update" (#6810)
* Revert "Set annotations in update (#6767)"

This reverts commit e680efc7cc.

* Fix version

* Update spacy/pipeline/entity_linker.py

* Update spacy/pipeline/entity_linker.py

* Update spacy/pipeline/tagger.pyx

* Update spacy/pipeline/tok2vec.py

* Update spacy/pipeline/tok2vec.py

* Update spacy/pipeline/transition_parser.pyx

* Update spacy/pipeline/transition_parser.pyx

* Update website/docs/api/multilabel_textcategorizer.md

* Update website/docs/api/tok2vec.md

* Update website/docs/usage/layers-architectures.md

* Update website/docs/usage/layers-architectures.md

* Update website/docs/api/transformer.md

* Update website/docs/api/textcategorizer.md

* Update website/docs/api/tagger.md

* Update spacy/pipeline/entity_linker.py

* Update website/docs/api/sentencerecognizer.md

* Update website/docs/api/pipe.md

* Update website/docs/api/morphologizer.md

* Update website/docs/api/entityrecognizer.md

* Update spacy/pipeline/entity_linker.py

* Update spacy/pipeline/multitask.pyx

* Update spacy/pipeline/tagger.pyx

* Update spacy/pipeline/tagger.pyx

* Update spacy/pipeline/textcat.py

* Update spacy/pipeline/textcat.py

* Update spacy/pipeline/textcat.py

* Update spacy/pipeline/tok2vec.py

* Update spacy/pipeline/trainable_pipe.pyx

* Update spacy/pipeline/trainable_pipe.pyx

* Update spacy/pipeline/transition_parser.pyx

* Update spacy/pipeline/transition_parser.pyx

* Update website/docs/api/entitylinker.md

* Update website/docs/api/dependencyparser.md

* Update spacy/pipeline/trainable_pipe.pyx
2021-01-25 22:18:45 +08:00
Sofie Van Landeghem
e680efc7cc
Set annotations in update (#6767)
* bump to 3.0.0rc4

* do set_annotations in component update calls

* update docs and remove set_annotations flag

* fix EL test
2021-01-20 11:49:25 +11:00
Sofie Van Landeghem
57640aa838
warn when frozen components break listener pattern (#6766)
* warn when frozen components break listener pattern

* few notes in the documentation

* update arg name

* formatting

* cleanup

* specify listeners return type
2021-01-20 11:12:35 +11:00
Ines Montani
e697609fef Update docstrings and types [ci skip] 2021-01-18 22:31:26 +11:00
Adriane Boyd
bf0cdae8d4
Add token_splitter component (#6726)
* Add long_token_splitter component

Add a `long_token_splitter` component for use with transformer
pipelines. This component splits up long tokens like URLs into smaller
tokens. This is particularly relevant for pretrained pipelines with
`strided_spans`, since the user can't change the length of the span
`window` and may not wish to preprocess the input texts.

The `long_token_splitter` splits tokens that are at least
`long_token_length` tokens long into smaller tokens of `split_length`
size.

Notes:

* Since this is intended for use as the first component in a pipeline,
the token splitter does not try to preserve any token annotation.
* API docs to come when the API is stable.

* Adjust API, add test

* Fix name in factory
2021-01-17 19:54:41 +08:00
Adriane Boyd
43a752a2a0
Fix assertion in default get oracle sequence usage (#6738)
Remove assertion for default debug value in 
`get_oracle_sequence_from_state`.
2021-01-16 16:07:39 +01:00
Matthew Honnibal
f0c696b4aa Fix failed merge of #6694 patch 2021-01-16 13:44:11 +11:00
Adriane Boyd
9328dd5625
Handle unset token.morph in Morphologizer (#6704)
* Handle unset token.morph in Morphologizer

Handle unset `token.morph` in `Morphologizer.initialize` and
`Morphologizer.get_loss`. If both `token.morph` and `token.pos` are
unset, treat the annotation as missing rather than empty.

* Add token.has_morph()
2021-01-15 17:20:10 +01:00
Matthew Honnibal
7b3f0c6f1b
Questionable fix for parser training bug with misaligned sentences (#6694)
* Questionable fix for parser training bug with misaligned sentences

* Fix

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-01-15 14:18:24 +01:00
Ines Montani
b0b743597c Tidy up and auto-format 2021-01-15 11:57:36 +11:00
svlandeg
fec9b81aa2 Merge remote-tracking branch 'upstream/develop' into feature/missing-dep 2021-01-13 17:46:12 +01:00
svlandeg
ed53bb979d cleanup 2021-01-13 14:20:05 +01:00
svlandeg
5b598bd1d5 formatting 2021-01-12 17:28:41 +01:00
Adriane Boyd
ad43cbb042
Sync missing and misaligned values in Tagger loss (#6689)
Use `None` for both missing and misaligned annotation in
`Tagger.get_loss`, reverting to the default missing value in the loss
function.
2021-01-10 11:30:37 +11:00
svlandeg
dd12c6c8fd allow missing information in deps and heads annotations 2021-01-07 19:10:32 +01:00
svlandeg
1abeca90a6 refer to _parser_internals.nonproj.DELIMITER 2021-01-07 18:58:13 +01:00
Sofie Van Landeghem
75d9019343
Fix types of Tok2Vec encoding architectures (#6442)
* fix TorchBiLSTMEncoder documentation

* ensure the types of the encoding Tok2vec layers are correct

* update references from v1 to v2 for the new architectures
2021-01-07 16:39:27 +11:00