Commit Graph

15507 Commits

Author SHA1 Message Date
Basile Dura
107bab56b5
docs: add EDS-NLP to spaCy universe (#10489)
* docs: add EDS-NLP to spaCy universe

* fix: remove "standalone" tag for EDS-NLP

Co-authored-by: Basile Dura <basile.dura-ext@aphp.fr>
2022-03-21 11:03:39 +01:00
github-actions[bot]
bf1cf77a5b
Auto-format code with black (#10518)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-03-21 09:21:24 +01:00
Paul O'Leary McCann
2190cbc0e6 Add progress on SpanPredictor component
This isn't working. There is a CUDA error in the torch code during
initialization and it's not clear why.
2022-03-19 19:39:49 +09:00
Adriane Boyd
04f3f414d1
Update pytest to forbid ==7.1.0, allow >=7.1.1 (#10519) 2022-03-18 13:43:54 +01:00
Paul O'Leary McCann
a098849112 Add fake batching
The way fake batching works is that the pipeline component calls the
model repeatedly in a loop internally. It feels like this should break
something, but it worked in testing.

Another issue is that this changes the signature of some of the pipeline
functions, though I don't think that's an issue.

Tested with batch size of 2, so more testing is needed, but this is a
start.
2022-03-18 19:46:58 +09:00
Lj Miranda
0b02dc4c57
Fix mixed-up parameters for spacy-conll (#10516) 2022-03-18 08:56:21 +01:00
Grey Murav
3ff5a6a5c0
Extend list of _num_words (#10468) 2022-03-16 18:25:42 +01:00
Lj Miranda
a79cd3542b
Add displacy support for overlapping Spans (#10332)
* Fix docstring for EntityRenderer

* Add warning in displacy if doc.spans are empty

* Implement parse_spans converter

One notable change here is that the default spans_key is sc, and
it's set by the user through the options.

* Implement SpanRenderer

Here, I implemented a SpanRenderer that looks similar to the
EntityRenderer except for some templates.  The spans_key, by default, is
set to sc, but can be configured in the options (see parse_spans). The
way I rendered these spans is per-token, i.e., I first check if each
token (1) belongs to a given span type and (2) a starting token of a
given span type. Once I have this information, I render them into the
markup.

* Fix mypy issues on typing

* Add tests for displacy spans support

* Update colors from RGB to hex

Co-authored-by: Ines Montani <ines@ines.io>

* Remove unnecessary CSS properties

* Add documentation for website

* Remove unnecesasry scripts

* Update wording on the documentation

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Put typing dependency on top of file

* Put back z-index so that spans overlap properly

* Make warning more explicit for spans_key

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-16 18:14:34 +01:00
Paul O'Leary McCann
1a79d18796 Formatting 2022-03-16 20:10:47 +09:00
Paul O'Leary McCann
6855df0e66 Skeleton for span predictor component
This should be moved into its own file, but for now just stubbing out
the methods.
2022-03-16 20:09:33 +09:00
Paul O'Leary McCann
0275ae29de Remove stale comment 2022-03-16 20:09:12 +09:00
Paul O'Leary McCann
6974f55daa Hack for transformer listener size 2022-03-16 15:15:53 +09:00
Paul O'Leary McCann
7811a1194b Change architecture 2022-03-16 14:57:15 +09:00
Paul O'Leary McCann
5650853c0f Remove unused functions 2022-03-16 14:38:11 +09:00
David Berenstein
e021dc6279
Updated explenation for for classy classification (#10484)
* Update universe.json

added classy-classification to Spacy universe

* Update universe.json

added classy-classification to the spacy universe resources

* Update universe.json

corrected a small typo in json

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update universe.json

processed merge feedback

* Update universe.json

* updated information for Classy Classificaiton 

Made a more comprehensible and easy description for Classy Classification based on feedback of Philip Vollet to prepare for sharing.

* added note about examples

* corrected for wrong formatting changes

* Update website/meta/universe.json with small typo correction

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* resolved another typo

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-15 16:42:33 +01:00
Daniël de Kok
e5debc68e4
Tagger: use unnormalized probabilities for inference (#10197)
* Tagger: use unnormalized probabilities for inference

Using unnormalized softmax avoids use of the relatively expensive exp function,
which can significantly speed up non-transformer models (e.g. I got a speedup
of 27% on a German tagging + parsing pipeline).

* Add spacy.Tagger.v2 with configurable normalization

Normalization of probabilities is disabled by default to improve
performance.

* Update documentation, models, and tests to spacy.Tagger.v2

* Move Tagger.v1 to spacy-legacy

* docs/architectures: run prettier

* Unnormalized softmax is now a Softmax_v2 option

* Require thinc 8.0.14 and spacy-legacy 3.0.9
2022-03-15 14:15:31 +01:00
Paul O'Leary McCann
d0ae2590db Delete all the coref-hoi code 2022-03-15 20:05:24 +09:00
Paul O'Leary McCann
abdc7d87af Clean up util code
Moved everything into coref_util.py, deleted wl-specific file.
2022-03-15 19:59:44 +09:00
Paul O'Leary McCann
55039a66ad Remove old default config 2022-03-15 19:53:09 +09:00
Paul O'Leary McCann
17d017a177 Remove span2head
This doesn't work as a component because it needs to modify gold data,
so instead it's a conversion script (in another repo).
2022-03-15 19:52:20 +09:00
Paul O'Leary McCann
0522a43116 Make span2head component 2022-03-15 19:19:15 +09:00
Adriane Boyd
e8357923ec
Various install docs updates (#10487)
* Simplify quickstart source install to use only editable pip install

* Update pytorch install instructions to more recent versions
2022-03-15 11:12:50 +01:00
vincent d warmerdam
610001e8c7
Update universe.json (#10490)
The project moved away from Rasa and into my personal GitHub account.
2022-03-15 11:12:04 +01:00
Adriane Boyd
0dc454ba95
Update docs for Vocab.get_vector (#10486)
* Update docs for Vocab.get_vector

* Clarify description of 0-vector dimensions
2022-03-15 09:10:47 +01:00
Edward
2eef47dd26
Save span candidates produced by spancat suggesters (#10413)
* Add save_candidates attribute

* Change spancat api

* Add unit test

* reimplement method to produce a list of doc

* Add method to docs

* Add new version tag

* Add intended use to docstring

* prettier formatting
2022-03-14 16:46:58 +01:00
Edward
b68bf43f5b
Add spans to doc.to_json (#10073)
* Add spans to to_json

* adjustments to_json

* Change docstring

* change doc key naming

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-14 15:47:57 +01:00
Sofie Van Landeghem
23bc93d3d2
limit pytest to <7.1 (#10488)
* limit pytest to <7.1

* 7.1.0
2022-03-14 15:17:22 +01:00
Paul O'Leary McCann
e6917d8dc4 Add util functions for wl-coref 2022-03-14 19:27:55 +09:00
Paul O'Leary McCann
dfec6993d6 Training works now 2022-03-14 19:27:23 +09:00
Paul O'Leary McCann
8eadf3781b Training runs now
Evaluation needs fixing, and code still needs cleanup.
2022-03-14 19:02:17 +09:00
Lj Miranda
6af6c2e86c
Add a note to the dev docs on mypy (#10485) 2022-03-14 09:41:31 +01:00
Paul O'Leary McCann
d22a002641 Forward/backward pass works
Evaluate does not work - predict hasn't been updated
2022-03-14 17:26:27 +09:00
github-actions[bot]
1bbf232074
Auto-format code with black (#10479)
* Auto-format code with black

* Update spacy/lang/hsb/lex_attrs.py

Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-11 12:20:23 +01:00
Adriane Boyd
297dd82c86
Fix initial special cases for Tokenizer.explain (#10460)
Add the missing initial check for special cases to `Tokenizer.explain`
to align with `Tokenizer._tokenize_affixes`.
2022-03-11 10:50:47 +01:00
Paul O'Leary McCann
c4f9c24738 The coref model is able to be loaded
The span predictor component is initialized but not used at all now.
Plan is to work on it after the word level clustering part is trainable
end-to-end.
2022-03-09 19:31:11 +09:00
Peter Baumgartner
01ec6349ea
Add path.mkdir to custom component examples of to_disk (#10348)
* add `path.mkdir` to examples

* add ensure_path + mkdir

* update highlights
2022-03-08 16:04:10 +01:00
Adriane Boyd
191e8b31fa
Remove English tokenizer exception May. (#10463) 2022-03-08 14:28:46 +01:00
Adriane Boyd
60520d8669
Fix types in API docs for moves in parser and ner (#10464) 2022-03-08 13:51:11 +01:00
Paul O'Leary McCann
35cc2b138f Add span predictor code
Accidentally omitted before
2022-03-08 18:13:26 +09:00
Paul O'Leary McCann
1c697b4011 Remove references to config
Replaced with model arguments
2022-03-08 18:13:09 +09:00
Adriane Boyd
b2bbefd0b5
Add Finnish, Korean, and Swedish models and Korean support notes (#10355)
* Add Finnish, Korean, and Swedish models to website

* Add Korean language support notes
2022-03-07 17:03:45 +01:00
jnphilipp
5ca0dbae76
Add Lower Sorbian support. (#10431)
* Add support basic support for lower sorbian.

* Add some test for dsb.

* Update spacy/lang/dsb/examples.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 16:57:14 +01:00
Paul O'Leary McCann
61ba5450ff
Fix get_matching_ents (#10451)
* Fix get_matching_ents

Not sure what happened here - the code prior to this commit simply does
not work. It's already covered by entity linker tests, which were
succeeding in the NEL PR, but couldn't possibly succeed on master.

* Fix test

Test was indented inside another test and so doesn't seem to have been
running properly.
2022-03-07 16:56:57 +01:00
jnphilipp
7ed7908716
Add Upper Sorbian support. (#10432)
* Add support basic support for upper sorbian.

* Add tokenizer exceptions and tests.

* Update spacy/lang/hsb/examples.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 16:20:39 +01:00
David Berenstein
a6d5824e5f
added classy-classification package to spacy universe (#10393)
* Update universe.json

added classy-classification to Spacy universe

* Update universe.json

added classy-classification to the spacy universe resources

* Update universe.json

corrected a small typo in json

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update universe.json

processed merge feedback

* Update universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 12:47:26 +01:00
Paul O'Leary McCann
6f4f57f317
Update Issue Templates (#10446)
* Remove mention of python 3.10 wheels

These were released a while ago, just forgot to remove this notice.

* Add note about Discussions
2022-03-07 10:41:03 +01:00
Paul O'Leary McCann
c0cd5025e3 Start bringin in wl-coref
This absolutely does not work. First step here is getting over most of
the code in roughly the files we want it in. After the code has been
pulled over it can be restructured to match spaCy and cleaned up.
2022-03-06 20:00:15 +09:00
Sofie Van Landeghem
d89dac4066
hook up meta in load_model_from_config (#10400) 2022-03-04 11:07:45 +01:00
Paul O'Leary McCann
91acc3ea75
Fix entity linker batching (#9669)
* Partial fix of entity linker batching

* Add import

* Better name

* Add `use_gold_ents` option, docs

* Change to v2, create stub v1, update docs etc.

* Fix error type

Honestly no idea what the right type to use here is.
ConfigValidationError seems wrong. Maybe a NotImplementedError?

* Make mypy happy

* Add hacky fix for init issue

* Add legacy pipeline entity linker

* Fix references to class name

* Add __init__.py for legacy

* Attempted fix for loss issue

* Remove placeholder V1

* formatting

* slightly more interesting train data

* Handle batches with no usable examples

This adds a test for batches that have docs but not entities, and a
check in the component that detects such cases and skips the update step
as thought the batch were empty.

* Remove todo about data verification

Check for empty data was moved further up so this should be OK now - the
case in question shouldn't be possible.

* Fix gradient calculation

The model doesn't know which entities are not in the kb, so it generates
embeddings for the context of all of them.

However, the loss does know which entities aren't in the kb, and it
ignores them, as there's no sensible gradient.

This has the issue that the gradient will not be calculated for some of
the input embeddings, which causes a dimension mismatch in backprop.
That should have caused a clear error, but with numpyops it was causing
nans to happen, which is another problem that should be addressed
separately.

This commit changes the loss to give a zero gradient for entities not in
the kb.

* add failing test for v1 EL legacy architecture

* Add nasty but simple working check for legacy arch

* Clarify why init hack works the way it does

* Clarify use_gold_ents use case

* Fix use gold ents related handling

* Add tests for no gold ents and fix other tests

* Use aligned ents function (not working)

This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.

* Use proper matching ent check

This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.

* Move get_matching_ents to Example

* Use model attribute to check for legacy arch

* Rename flag

* bump spacy-legacy to lower 3.0.9

Co-authored-by: svlandeg <svlandeg@github.com>
2022-03-04 09:17:36 +01:00
Adriane Boyd
8e93fa8507
Fix Vectors.n_keys for floret vectors (#10394)
Fix `Vectors.n_keys` for floret vectors to match docstring description
and avoid W007 warnings in similarity methods.
2022-03-01 09:21:25 +01:00