Commit Graph

258 Commits

Author SHA1 Message Date
Paul O'Leary McCann
33f4f90ff0 Formatting 2022-05-10 19:09:52 +09:00
Paul O'Leary McCann
41fc092674 Split span predictor model into its own file 2022-05-10 19:08:21 +09:00
svlandeg
6b51258a58 clean up unused imports + black formatting 2022-05-09 13:34:50 +02:00
Paul O'Leary McCann
683f470852 Merge branch 'master' into feature/coref 2022-04-18 18:39:08 +09:00
kadarakos
b53113e3b8
Preparing span predictor for predicting from gold (#10547)
Note this is squashed because rebasing had conflicts.

* remove unnecessary .device

* span predictor debug start

* gearing up SpanPredictor for gold-heads

* merge SpanPredictor attributes

* remove useless extra prefix and device from spanpredictor

* make sure predicted and reference keeps aligned

* handle empty head_ids

* handle empty clusters

* addressing suggestions by @polm

* nicer restore

* fix score overwriting bug

* prepare for aligned heads-spans training

* span accuracy score

* update with eg.predited as other components

* add backprop callback to spanpredictor

* report start- and end-accuracies separately

* fixing scorer

Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl>
2022-04-13 19:42:49 +09:00
Kádár Ákos
2a1ad4c5d2 add backprop callback to spanpredictor 2022-04-08 14:56:44 +02:00
Kádár Ákos
4fc40340f9 handle empty head_ids 2022-03-28 11:28:21 +02:00
Kádár Ákos
83ac0477c8 remove useless extra prefix and device from spanpredictor 2022-03-24 16:44:50 +01:00
Kádár Ákos
1c5dabcb47 merge SpanPredictor attributes 2022-03-24 16:23:12 +01:00
Kádár Ákos
a872c69ffb merge 2022-03-24 16:10:04 +01:00
Kádár Ákos
706b2e6f25 gearing up SpanPredictor for gold-heads 2022-03-24 16:06:20 +01:00
Kádár Ákos
150e7c46d7 conflict 2022-03-23 11:27:02 +01:00
Kádár Ákos
1eaf8fb0cf span predictor debug start 2022-03-23 11:24:27 +01:00
Paul O'Leary McCann
eec00ce60d Fix various sizes in SpanPredictor FFNN 2022-03-23 16:20:31 +09:00
Paul O'Leary McCann
2190cbc0e6 Add progress on SpanPredictor component
This isn't working. There is a CUDA error in the torch code during
initialization and it's not clear why.
2022-03-19 19:39:49 +09:00
Kádár Ákos
db422abf01 remove unnecessary .device 2022-03-18 16:24:26 +01:00
Paul O'Leary McCann
0275ae29de Remove stale comment 2022-03-16 20:09:12 +09:00
Paul O'Leary McCann
6974f55daa Hack for transformer listener size 2022-03-16 15:15:53 +09:00
Paul O'Leary McCann
5650853c0f Remove unused functions 2022-03-16 14:38:11 +09:00
Daniël de Kok
e5debc68e4
Tagger: use unnormalized probabilities for inference (#10197)
* Tagger: use unnormalized probabilities for inference

Using unnormalized softmax avoids use of the relatively expensive exp function,
which can significantly speed up non-transformer models (e.g. I got a speedup
of 27% on a German tagging + parsing pipeline).

* Add spacy.Tagger.v2 with configurable normalization

Normalization of probabilities is disabled by default to improve
performance.

* Update documentation, models, and tests to spacy.Tagger.v2

* Move Tagger.v1 to spacy-legacy

* docs/architectures: run prettier

* Unnormalized softmax is now a Softmax_v2 option

* Require thinc 8.0.14 and spacy-legacy 3.0.9
2022-03-15 14:15:31 +01:00
Paul O'Leary McCann
d0ae2590db Delete all the coref-hoi code 2022-03-15 20:05:24 +09:00
Paul O'Leary McCann
abdc7d87af Clean up util code
Moved everything into coref_util.py, deleted wl-specific file.
2022-03-15 19:59:44 +09:00
Paul O'Leary McCann
0522a43116 Make span2head component 2022-03-15 19:19:15 +09:00
Paul O'Leary McCann
e6917d8dc4 Add util functions for wl-coref 2022-03-14 19:27:55 +09:00
Paul O'Leary McCann
8eadf3781b Training runs now
Evaluation needs fixing, and code still needs cleanup.
2022-03-14 19:02:17 +09:00
Paul O'Leary McCann
d22a002641 Forward/backward pass works
Evaluate does not work - predict hasn't been updated
2022-03-14 17:26:27 +09:00
Paul O'Leary McCann
c4f9c24738 The coref model is able to be loaded
The span predictor component is initialized but not used at all now.
Plan is to work on it after the word level clustering part is trainable
end-to-end.
2022-03-09 19:31:11 +09:00
Paul O'Leary McCann
35cc2b138f Add span predictor code
Accidentally omitted before
2022-03-08 18:13:26 +09:00
Paul O'Leary McCann
1c697b4011 Remove references to config
Replaced with model arguments
2022-03-08 18:13:09 +09:00
Paul O'Leary McCann
c0cd5025e3 Start bringin in wl-coref
This absolutely does not work. First step here is getting over most of
the code in roughly the files we want it in. After the code has been
pulled over it can be restructured to match spaCy and cleaned up.
2022-03-06 20:00:15 +09:00
Paul O'Leary McCann
91acc3ea75
Fix entity linker batching (#9669)
* Partial fix of entity linker batching

* Add import

* Better name

* Add `use_gold_ents` option, docs

* Change to v2, create stub v1, update docs etc.

* Fix error type

Honestly no idea what the right type to use here is.
ConfigValidationError seems wrong. Maybe a NotImplementedError?

* Make mypy happy

* Add hacky fix for init issue

* Add legacy pipeline entity linker

* Fix references to class name

* Add __init__.py for legacy

* Attempted fix for loss issue

* Remove placeholder V1

* formatting

* slightly more interesting train data

* Handle batches with no usable examples

This adds a test for batches that have docs but not entities, and a
check in the component that detects such cases and skips the update step
as thought the batch were empty.

* Remove todo about data verification

Check for empty data was moved further up so this should be OK now - the
case in question shouldn't be possible.

* Fix gradient calculation

The model doesn't know which entities are not in the kb, so it generates
embeddings for the context of all of them.

However, the loss does know which entities aren't in the kb, and it
ignores them, as there's no sensible gradient.

This has the issue that the gradient will not be calculated for some of
the input embeddings, which causes a dimension mismatch in backprop.
That should have caused a clear error, but with numpyops it was causing
nans to happen, which is another problem that should be addressed
separately.

This commit changes the loss to give a zero gradient for entities not in
the kb.

* add failing test for v1 EL legacy architecture

* Add nasty but simple working check for legacy arch

* Clarify why init hack works the way it does

* Clarify use_gold_ents use case

* Fix use gold ents related handling

* Add tests for no gold ents and fix other tests

* Use aligned ents function (not working)

This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.

* Use proper matching ent check

This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.

* Move get_matching_ents to Example

* Use model attribute to check for legacy arch

* Rename flag

* bump spacy-legacy to lower 3.0.9

Co-authored-by: svlandeg <svlandeg@github.com>
2022-03-04 09:17:36 +01:00
github-actions[bot]
91ccacea12
Auto-format code with black (#10209)
* Auto-format code with black

* add black requirement to dev dependencies and pin to 22.x

* ignore black dependency for comparison with setup.cfg

Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2022-02-06 16:30:30 +01:00
Paul O'Leary McCann
c7f586c4ba Merge branch 'master' into feature/coref
This brings coref up to date, in particular giving access to 3.2
features.
2022-02-03 19:01:18 +09:00
Daniël de Kok
50d2a2c930
User fewer Vector internals (#9879)
* Use Vectors.shape rather than Vectors.data.shape

* Use Vectors.size rather than Vectors.data.size

* Add Vectors.to_ops to move data between different ops

* Add documentation for Vector.to_ops
2022-01-18 17:14:35 +01:00
Peter Baumgartner
72abf9e102
MultiHashEmbed vector docs correction (#9918) 2021-12-27 11:18:08 +01:00
Paul O'Leary McCann
c1cc94a33a
Fix typo about receptive field size (#9564) 2021-11-03 15:16:55 +01:00
Daniël de Kok
d0631e3005
Replace use_ops("numpy") by use_ops("cpu") in the parser (#9501)
* Replace use_ops("numpy") by use_ops("cpu") in the parser

This ensures that the best available CPU implementation is chosen
(e.g. Thinc Apple Ops on macOS).

* Run spaCy tests with apple-thinc-ops on macOS
2021-10-21 11:22:45 +02:00
Connor Brinton
657af5f91f
🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167)
* 🚨 Ignore all existing Mypy errors

* 🏗 Add Mypy check to CI

* Add types-mock and types-requests as dev requirements

* Add additional type ignore directives

* Add types packages to dev-only list in reqs test

* Add types-dataclasses for python 3.6

* Add ignore to pretrain

* 🏷 Improve type annotation on `run_command` helper

The `run_command` helper previously declared that it returned an
`Optional[subprocess.CompletedProcess]`, but it isn't actually possible
for the function to return `None`. These changes modify the type
annotation of the `run_command` helper and remove all now-unnecessary
`# type: ignore` directives.

* 🔧 Allow variable type redefinition in limited contexts

These changes modify how Mypy is configured to allow variables to have
their type automatically redefined under certain conditions. The Mypy
documentation contains the following example:

```python
def process(items: List[str]) -> None:
    # 'items' has type List[str]
    items = [item.split() for item in items]
    # 'items' now has type List[List[str]]
    ...
```

This configuration change is especially helpful in reducing the number
of `# type: ignore` directives needed to handle the common pattern of:
* Accepting a filepath as a string
* Overwriting the variable using `filepath = ensure_path(filepath)`

These changes enable redefinition and remove all `# type: ignore`
directives rendered redundant by this change.

* 🏷 Add type annotation to converters mapping

* 🚨 Fix Mypy error in convert CLI argument verification

* 🏷 Improve type annotation on `resolve_dot_names` helper

* 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors`

* 🏷 Add type annotations for more `Vocab` attributes

* 🏷 Add loose type annotation for gold data compilation

* 🏷 Improve `_format_labels` type annotation

* 🏷 Fix `get_lang_class` type annotation

* 🏷 Loosen return type of `Language.evaluate`

* 🏷 Don't accept `Scorer` in `handle_scores_per_type`

* 🏷 Add `string_to_list` overloads

* 🏷 Fix non-Optional command-line options

* 🙈 Ignore redefinition of `wandb_logger` in `loggers.py`

*  Install `typing_extensions` in Python 3.8+

The `typing_extensions` package states that it should be used when
"writing code that must be compatible with multiple Python versions".
Since SpaCy needs to support multiple Python versions, it should be used
when newer `typing` module members are required. One example of this is
`Literal`, which is available starting with Python 3.8.

Previously SpaCy tried to import `Literal` from `typing`, falling back
to `typing_extensions` if the import failed. However, Mypy doesn't seem
to be able to understand what `Literal` means when the initial import
means. Therefore, these changes modify how `compat` imports `Literal` by
always importing it from `typing_extensions`.

These changes also modify how `typing_extensions` is installed, so that
it is a requirement for all Python versions, including those greater
than or equal to 3.8.

* 🏷 Improve type annotation for `Language.pipe`

These changes add a missing overload variant to the type signature of
`Language.pipe`. Additionally, the type signature is enhanced to allow
type checkers to differentiate between the two overload variants based
on the `as_tuple` parameter.

Fixes #8772

*  Don't install `typing-extensions` in Python 3.8+

After more detailed analysis of how to implement Python version-specific
type annotations using SpaCy, it has been determined that by branching
on a comparison against `sys.version_info` can be statically analyzed by
Mypy well enough to enable us to conditionally use
`typing_extensions.Literal`. This means that we no longer need to
install `typing_extensions` for Python versions greater than or equal to
3.8! 🎉

These changes revert previous changes installing `typing-extensions`
regardless of Python version and modify how we import the `Literal` type
to ensure that Mypy treats it properly.

* resolve mypy errors for Strict pydantic types

* refactor code to avoid missing return statement

* fix types of convert CLI command

* avoid list-set confustion in debug_data

* fix typo and formatting

* small fixes to avoid type ignores

* fix types in profile CLI command and make it more efficient

* type fixes in projects CLI

* put one ignore back

* type fixes for render

* fix render types - the sequel

* fix BaseDefault in language definitions

* fix type of noun_chunks iterator - yields tuple instead of span

* fix types in language-specific modules

* 🏷 Expand accepted inputs of `get_string_id`

`get_string_id` accepts either a string (in which case it returns its 
ID) or an ID (in which case it immediately returns the ID). These 
changes extend the type annotation of `get_string_id` to indicate that 
it can accept either strings or IDs.

* 🏷 Handle override types in `combine_score_weights`

The `combine_score_weights` function allows users to pass an `overrides` 
mapping to override data extracted from the `weights` argument. Since it 
allows `Optional` dictionary values, the return value may also include 
`Optional` dictionary values.

These changes update the type annotations for `combine_score_weights` to 
reflect this fact.

* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer`

* 🏷 Fix redefinition of `wandb_logger`

These changes fix the redefinition of `wandb_logger` by giving a 
separate name to each `WandbLogger` version. For 
backwards-compatibility, `spacy.train` still exports `wandb_logger_v3` 
as `wandb_logger` for now.

* more fixes for typing in language

* type fixes in model definitions

* 🏷 Annotate `_RandomWords.probs` as `NDArray`

* 🏷 Annotate `tok2vec` layers to help Mypy

* 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6

Also remove an import that I forgot to move to the top of the module 😅

* more fixes for matchers and other pipeline components

* quick fix for entity linker

* fixing types for spancat, textcat, etc

* bugfix for tok2vec

* type annotations for scorer

* add runtime_checkable for Protocol

* type and import fixes in tests

* mypy fixes for training utilities

* few fixes in util

* fix import

* 🐵 Remove unused `# type: ignore` directives

* 🏷 Annotate `Language._components`

* 🏷 Annotate `spacy.pipeline.Pipe`

* add doc as property to span.pyi

* small fixes and cleanup

* explicit type annotations instead of via comment

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2021-10-14 15:21:40 +02:00
j-frei
462b009648
Correct parser.py use_upper param info (#9180) 2021-09-10 16:19:58 +02:00
Paul O'Leary McCann
00d481dd12 Stack the mention scorer
In the reference implementations, there's usually a function to build a
ffnn of arbitrary depth, consisting of a stack of Linear >> Relu >>
Dropout. In practice the depth is always 1 in coref-hoi, but in earlier
iterations of the model, which are more similar to our model here (since
we aren't using attention or even necessarily BERT), using a small depth
like 2 was common. This hard-codes a stack of 2.

In brief tests this allows similar performance to the unstacked version
with much smaller embedding sizes.

The depth of the stack could be made into a hyperparameter.
2021-08-09 18:04:42 +09:00
Paul O'Leary McCann
56803d3909 Change mention limit to match reference implementations
This generall means fewer spans are considered, which makes individual
steps in training faster but can make training take longer to find the
good spans.
2021-08-08 19:55:52 +09:00
Paul O'Leary McCann
1d1679d431 Minor speedup
This continue should be a break. The current form doesn't cause errors
but using a break will be a bit faster.
2021-07-21 19:50:10 +09:00
Paul O'Leary McCann
8bd0474730 Run black 2021-07-18 20:20:22 +09:00
Paul O'Leary McCann
9b63cbb775 Add extract spans import 2021-07-15 18:16:53 +09:00
Paul O'Leary McCann
4a9dc00d86 Use relative indices for mentions
Was using batch absolute indices to manage mentions, but extract_spans
expects doc-relative ones.
2021-07-14 18:36:18 +09:00
Paul O'Leary McCann
f1796e4af7 Fix mention list bug
There was an off-by-one error in how mentions are generated that would
affect mentions at the end of a sentence. This was pretty nasty.
2021-07-14 18:19:00 +09:00
Adriane Boyd
f9fd2889b7
Use 0-vector for OOV lexemes (#8639) 2021-07-13 14:48:12 +10:00
Paul O'Leary McCann
c25ec292a9 Cleanup 2021-07-10 22:42:55 +09:00
Paul O'Leary McCann
e00bd422d9 Fix span embeds
Some of the lengths and backprop weren't right.

Also various cleanup.
2021-07-10 21:38:53 +09:00
Paul O'Leary McCann
d7d317a1b5 Clean up span embedding code
This is now cleaner and significantly faster. There's still some messy
parts in the code (particularly variable names), will get to that later.
2021-07-10 19:59:08 +09:00
Paul O'Leary McCann
dc1f974d39 Merge branch 'master' into feature/coref 2021-07-10 18:10:40 +09:00
Paul O'Leary McCann
f34915c1e8 Use scatter_add to speed up span embed backprop
This was the slowest part of the code, and using scatter_add here
probably reduces the runtime by 50%.
2021-07-10 18:08:51 +09:00
Paul O'Leary McCann
d0b041aff4 Switch to using Thinc tuplify
The tuplify code here was added to Thinc proper and that's been
released, so no need to have it here any more.
2021-07-08 16:08:36 +09:00
Sofie Van Landeghem
e7d747e3ee
TransitionBasedParser.v1 to legacy (#8586)
* TransitionBasedParser.v1 to legacy

* register sublayers

* bump spacy-legacy to 3.0.7
2021-07-06 15:26:45 +02:00
Paul O'Leary McCann
eb5820b593 Improve take_vecs implementation
This pulls out references to needed bits so that other parts (the larger
embeddings) can be freed before backprop.
2021-07-05 21:08:42 +09:00
Paul O'Leary McCann
13bef2ddb6 Add width prior feature
Not necessary for convergence, but in coref-hoi this seems to add a few
f1 points.

Note that there are two width-related features in coref-hoi. This is a
"prior" that is added to mention scores. The other width related feature
is appended to the span embedding representation for other layers to
reference.
2021-07-05 21:06:28 +09:00
Paul O'Leary McCann
8f66176b2d Fix loss?
This rewrites the loss to not use the Thinc crossentropy code at all.
The main difference here is that the negative predictions are being
masked out (= marginalized over), but negative gradient is still being
reflected.

I'm still not sure this is exactly right but models seem to train
reliably now.
2021-07-05 18:17:10 +09:00
Paul O'Leary McCann
5db28ec2fd Tweak mention limit calculation
The calculation of this in the coref-hoi code is hard to follow. Based
on comments and variable names it sounds like it's using the doc length,
but it might actually be the number of mentions? Number of mentions
should be much larger and seems more correct, but might want to revisit
this.
2021-07-03 21:13:32 +09:00
Paul O'Leary McCann
251a5b43ac Minor fix in crossing spans code
I think this was technically incorrect but harmless. The reason the code
here is different than the reference in coref-hoi is that the indices
there are such that they get +1 at the end of processing, while the code
here handles indices directly.
2021-07-03 18:41:46 +09:00
Paul O'Leary McCann
865caedebd Remove XXX comment
Comment wondered if there should be some subtraction to avoid double
counting, but it probably doesn't matter because the diagonal is 0.
2021-07-03 18:40:38 +09:00
Paul O'Leary McCann
d74fa82c80 Fix axis handling in topk
In practice this is only ever used with axis=1, so it wasn't causing
issues, even though it was wrong.
2021-07-03 18:39:25 +09:00
Paul O'Leary McCann
f2e0e9dc28 Move placeholder handling into model code 2021-07-03 18:38:48 +09:00
Paul O'Leary McCann
3f66e18592 Clean up pw_prod loss
This doesn't change the math but makes the transposes slightly easier to
understand (maybe?).
2021-07-03 18:33:17 +09:00
Ines Montani
7f65902702
Merge pull request #8522 from adrianeboyd/chore/update-flake8
Update flake8 version in reqs and CI
2021-06-28 21:46:06 +10:00
Adriane Boyd
5eeb25f043 Tidy up code 2021-06-28 12:08:15 +02:00
Adriane Boyd
4b0ed73ed4 Update flake8 version in reqs and CI
* Update some unneeded forward refs related to flake8 checks
2021-06-28 11:29:36 +02:00
Paul O'Leary McCann
4f377d8de8 Fix bug in crossing span detection 2021-06-28 18:20:33 +09:00
Paul O'Leary McCann
23344857b9 Remove unused function 2021-06-28 18:19:43 +09:00
Matthew Honnibal
f9946154d9
Add SpanCategorizer component (#6747)
* Draft spancat model

* Add spancat model

* Add test for extract_spans

* Add extract_spans layer

* Upd extract_spans

* Add spancat model

* Add test for spancat model

* Upd spancat model

* Update spancat component

* Upd spancat

* Update spancat model

* Add quick spancat test

* Import SpanCategorizer

* Fix SpanCategorizer component

* Import SpanGroup

* Fix span extraction

* Fix import

* Fix import

* Upd model

* Update spancat models

* Add scoring, update defaults

* Update and add docs

* Fix type

* Update spacy/ml/extract_spans.py

* Auto-format and fix import

* Fix comment

* Fix type

* Fix type

* Update website/docs/api/spancategorizer.md

* Fix comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Better defense

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix labels list

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/extract_spans.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Set annotations during update

* Set annotations in spancat

* fix imports in test

* Update spacy/pipeline/spancat.py

* replace MaxoutLogistic with LinearLogistic

* fix config

* various small fixes

* remove set_annotations parameter in update

* use our beloved tupley format with recent support for doc.spans

* bugfix to allow renaming the default span_key (scores weren't showing up)

* use different key in docs example

* change defaults to better-working parameters from project (WIP)

* register spacy.extract_spans.v1 for legacy purposes

* Upd dev version so can build wheel

* layers instead of architectures for smaller building blocks

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Include additional scores from overrides in combined score weights

* Parameterize spans key in scoring

Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.

* Use the (intentionally very short) default spans key `sc` in the
  `SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
  returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
  the key in the `getter` instead.

Note that for `spancat` components with a custom `span_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.

* Update website/docs/api/scorer.md

* Fix scorer for spans key containing underscore

* Increment version

* Add Spans to Evaluate CLI (#8439)

* Add Spans to Evaluate CLI

* Change to spans_key

* Add spans per_type output

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix spancat GPU issues (#8455)

* Fix GPU issues

* Require thinc >=8.0.6

* Switch to glorot_uniform_init

* Fix and test ngram suggester

* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
2021-06-24 12:35:27 +02:00
Paul O'Leary McCann
5c98c4c3b9 Probably fix pw prod backprop
I think this change is correct, but intuition doesn't really help
here...
2021-06-17 21:23:00 +09:00
Paul O'Leary McCann
ccf561112a Remove old comments 2021-06-17 21:22:17 +09:00
Paul O'Leary McCann
a62121e3b4 Expose more hyperparameters 2021-06-17 21:21:46 +09:00
Paul O'Leary McCann
848fd102e7 Small fix 2021-06-17 21:19:38 +09:00
Paul O'Leary McCann
fce804a79f Minor optimization 2021-06-17 21:10:46 +09:00
Paul O'Leary McCann
cb2364cf83 Fix type of mask
The call here was creating a float64 array, which was turning many
downstream scores into float64s. Later on these values were assigned to
a float32 array in backprop, and numerical underflow caused things to go
to zero.

That's almost certainly not the only reason things go to zero, but it is
incorrect.
2021-06-17 17:56:00 +09:00
Sofie Van Landeghem
e796aab4b3
Resizable textcat (#7862)
* implement textcat resizing for TextCatCNN

* resizing textcat in-place

* simplify code

* ensure predictions for old textcat labels remain the same after resizing (WIP)

* fix for softmax

* store softmax as attr

* fix ensemble weight copy and cleanup

* restructure slightly

* adjust documentation, update tests and quickstart templates to use latest versions

* extend unit test slightly

* revert unnecessary edits

* fix typo

* ensemble architecture won't be resizable for now

* use resizable layer (WIP)

* revert using resizable layer

* resizable container while avoid shape inference trouble

* cleanup

* ensure model continues training after resizing

* use fill_b parameter

* use fill_defaults

* resize_layer callback

* format

* bump thinc to 8.0.4

* bump spacy-legacy to 3.0.6
2021-06-16 11:45:00 +02:00
Paul O'Leary McCann
8452d117ef Fix typo, remove old comment 2021-06-13 19:42:55 +09:00
Paul O'Leary McCann
96be7e8858 Change topk to sort descending
Shouldn't change correctness but is a little clearer
2021-06-13 19:42:24 +09:00
Paul O'Leary McCann
d71198ed36 Replace squeeze with flatten
At a few points in the code it's normal to get a "2d" array where each
row is a single entry. Calling squeeze will make that a proper 1d
array... unless it's just one entry, in which case it turns into a 0d
scalar. That's not what we want; flatten() provides the desired
behavior.
2021-06-12 19:48:01 +09:00
Paul O'Leary McCann
e728b0e45d Silence warning 2021-06-12 19:31:35 +09:00
Paul O'Leary McCann
7efbc721a1 Don't use is_sentenced 2021-06-12 19:29:27 +09:00
Paul O'Leary McCann
18444fccd9 Remove old comment 2021-06-04 17:56:08 +09:00
Paul O'Leary McCann
4a4ef72191 Clean up unused functions
`make_clean_doc` is not needed and was removed.

`logsumexp` may be needed if I misunderstood the loss calculation, so I
left it in for now with a note.
2021-06-02 21:42:23 +09:00
Vito De Tullio
3672464e25
applying suggestion to avoid mypy errors (#8265)
* applying suggestion to avoid mypy errors

* sign contributor agreement
2021-06-02 19:25:30 +10:00
svlandeg
0aa1083ce8 avoid repetitive entities in the output 2021-05-28 16:52:51 +02:00
svlandeg
0f5c586e2f add basic tests for debugging 2021-05-28 14:19:55 +02:00
svlandeg
391b512afd fix types of fwd functions 2021-05-27 16:36:46 +02:00
svlandeg
04b55bf054 removing unused imports 2021-05-27 16:31:38 +02:00
svlandeg
910026582d set versions to v1 instead of v0 2021-05-27 16:17:20 +02:00
Paul O'Leary McCann
d6389b133d Don't use a generator for no reason 2021-05-24 19:06:15 +09:00
Paul O'Leary McCann
d6fd5fe1c0 Minor cleanup 2021-05-24 14:56:43 +09:00
Paul O'Leary McCann
ff3fed06cf Catch a stray reference 2021-05-20 21:30:46 +09:00
Paul O'Leary McCann
8c5df622d8 Help out python gc in coref backprop 2021-05-20 16:40:55 +09:00
Paul O'Leary McCann
fa92daf052 Break pairwise operations into pseudolayers
This makes their scope tighter and more contained, and has the nice side
effect that fewer things need to be passed around for backprop.
2021-05-20 15:59:51 +09:00
Paul O'Leary McCann
0620820857 Deal with generators in tuplify 2021-05-18 19:55:52 +09:00
Paul O'Leary McCann
a7d9c8156d Make get_sentence_map work with init
When sentences are not available, just treat the whole doc as one
sentence. A reasonable general fallback, but important due to the init
call, where upstream components aren't run.
2021-05-18 19:54:54 +09:00
Paul O'Leary McCann
883c137b26 Add basic tuplify init 2021-05-18 19:53:59 +09:00
Paul O'Leary McCann
051715506e Fiddle with get_mentions definition
Ended up not making a difference, but oh well.
2021-05-18 19:53:33 +09:00
Paul O'Leary McCann
e303628205 Attempt to use registry correctly 2021-05-17 14:52:48 +09:00
Paul O'Leary McCann
91b111467b Minor fixes 2021-05-17 14:52:30 +09:00