Commit Graph

15596 Commits

Author SHA1 Message Date
Paul O'Leary McCann
6974f55daa Hack for transformer listener size 2022-03-16 15:15:53 +09:00
Paul O'Leary McCann
7811a1194b Change architecture 2022-03-16 14:57:15 +09:00
Paul O'Leary McCann
5650853c0f Remove unused functions 2022-03-16 14:38:11 +09:00
David Berenstein
e021dc6279
Updated explenation for for classy classification (#10484)
* Update universe.json

added classy-classification to Spacy universe

* Update universe.json

added classy-classification to the spacy universe resources

* Update universe.json

corrected a small typo in json

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update universe.json

processed merge feedback

* Update universe.json

* updated information for Classy Classificaiton 

Made a more comprehensible and easy description for Classy Classification based on feedback of Philip Vollet to prepare for sharing.

* added note about examples

* corrected for wrong formatting changes

* Update website/meta/universe.json with small typo correction

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* resolved another typo

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-15 16:42:33 +01:00
Daniël de Kok
e5debc68e4
Tagger: use unnormalized probabilities for inference (#10197)
* Tagger: use unnormalized probabilities for inference

Using unnormalized softmax avoids use of the relatively expensive exp function,
which can significantly speed up non-transformer models (e.g. I got a speedup
of 27% on a German tagging + parsing pipeline).

* Add spacy.Tagger.v2 with configurable normalization

Normalization of probabilities is disabled by default to improve
performance.

* Update documentation, models, and tests to spacy.Tagger.v2

* Move Tagger.v1 to spacy-legacy

* docs/architectures: run prettier

* Unnormalized softmax is now a Softmax_v2 option

* Require thinc 8.0.14 and spacy-legacy 3.0.9
2022-03-15 14:15:31 +01:00
Paul O'Leary McCann
d0ae2590db Delete all the coref-hoi code 2022-03-15 20:05:24 +09:00
Paul O'Leary McCann
abdc7d87af Clean up util code
Moved everything into coref_util.py, deleted wl-specific file.
2022-03-15 19:59:44 +09:00
Paul O'Leary McCann
55039a66ad Remove old default config 2022-03-15 19:53:09 +09:00
Paul O'Leary McCann
17d017a177 Remove span2head
This doesn't work as a component because it needs to modify gold data,
so instead it's a conversion script (in another repo).
2022-03-15 19:52:20 +09:00
Paul O'Leary McCann
0522a43116 Make span2head component 2022-03-15 19:19:15 +09:00
Adriane Boyd
e8357923ec
Various install docs updates (#10487)
* Simplify quickstart source install to use only editable pip install

* Update pytorch install instructions to more recent versions
2022-03-15 11:12:50 +01:00
vincent d warmerdam
610001e8c7
Update universe.json (#10490)
The project moved away from Rasa and into my personal GitHub account.
2022-03-15 11:12:04 +01:00
Adriane Boyd
0dc454ba95
Update docs for Vocab.get_vector (#10486)
* Update docs for Vocab.get_vector

* Clarify description of 0-vector dimensions
2022-03-15 09:10:47 +01:00
Edward
2eef47dd26
Save span candidates produced by spancat suggesters (#10413)
* Add save_candidates attribute

* Change spancat api

* Add unit test

* reimplement method to produce a list of doc

* Add method to docs

* Add new version tag

* Add intended use to docstring

* prettier formatting
2022-03-14 16:46:58 +01:00
Edward
b68bf43f5b
Add spans to doc.to_json (#10073)
* Add spans to to_json

* adjustments to_json

* Change docstring

* change doc key naming

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-14 15:47:57 +01:00
Sofie Van Landeghem
23bc93d3d2
limit pytest to <7.1 (#10488)
* limit pytest to <7.1

* 7.1.0
2022-03-14 15:17:22 +01:00
Paul O'Leary McCann
e6917d8dc4 Add util functions for wl-coref 2022-03-14 19:27:55 +09:00
Paul O'Leary McCann
dfec6993d6 Training works now 2022-03-14 19:27:23 +09:00
Paul O'Leary McCann
8eadf3781b Training runs now
Evaluation needs fixing, and code still needs cleanup.
2022-03-14 19:02:17 +09:00
Lj Miranda
6af6c2e86c
Add a note to the dev docs on mypy (#10485) 2022-03-14 09:41:31 +01:00
Paul O'Leary McCann
d22a002641 Forward/backward pass works
Evaluate does not work - predict hasn't been updated
2022-03-14 17:26:27 +09:00
github-actions[bot]
1bbf232074
Auto-format code with black (#10479)
* Auto-format code with black

* Update spacy/lang/hsb/lex_attrs.py

Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-03-11 12:20:23 +01:00
Adriane Boyd
297dd82c86
Fix initial special cases for Tokenizer.explain (#10460)
Add the missing initial check for special cases to `Tokenizer.explain`
to align with `Tokenizer._tokenize_affixes`.
2022-03-11 10:50:47 +01:00
Paul O'Leary McCann
c4f9c24738 The coref model is able to be loaded
The span predictor component is initialized but not used at all now.
Plan is to work on it after the word level clustering part is trainable
end-to-end.
2022-03-09 19:31:11 +09:00
Peter Baumgartner
01ec6349ea
Add path.mkdir to custom component examples of to_disk (#10348)
* add `path.mkdir` to examples

* add ensure_path + mkdir

* update highlights
2022-03-08 16:04:10 +01:00
Adriane Boyd
191e8b31fa
Remove English tokenizer exception May. (#10463) 2022-03-08 14:28:46 +01:00
Adriane Boyd
60520d8669
Fix types in API docs for moves in parser and ner (#10464) 2022-03-08 13:51:11 +01:00
Paul O'Leary McCann
35cc2b138f Add span predictor code
Accidentally omitted before
2022-03-08 18:13:26 +09:00
Paul O'Leary McCann
1c697b4011 Remove references to config
Replaced with model arguments
2022-03-08 18:13:09 +09:00
Adriane Boyd
b2bbefd0b5
Add Finnish, Korean, and Swedish models and Korean support notes (#10355)
* Add Finnish, Korean, and Swedish models to website

* Add Korean language support notes
2022-03-07 17:03:45 +01:00
jnphilipp
5ca0dbae76
Add Lower Sorbian support. (#10431)
* Add support basic support for lower sorbian.

* Add some test for dsb.

* Update spacy/lang/dsb/examples.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 16:57:14 +01:00
Paul O'Leary McCann
61ba5450ff
Fix get_matching_ents (#10451)
* Fix get_matching_ents

Not sure what happened here - the code prior to this commit simply does
not work. It's already covered by entity linker tests, which were
succeeding in the NEL PR, but couldn't possibly succeed on master.

* Fix test

Test was indented inside another test and so doesn't seem to have been
running properly.
2022-03-07 16:56:57 +01:00
jnphilipp
7ed7908716
Add Upper Sorbian support. (#10432)
* Add support basic support for upper sorbian.

* Add tokenizer exceptions and tests.

* Update spacy/lang/hsb/examples.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 16:20:39 +01:00
David Berenstein
a6d5824e5f
added classy-classification package to spacy universe (#10393)
* Update universe.json

added classy-classification to Spacy universe

* Update universe.json

added classy-classification to the spacy universe resources

* Update universe.json

corrected a small typo in json

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update universe.json

processed merge feedback

* Update universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-07 12:47:26 +01:00
Paul O'Leary McCann
6f4f57f317
Update Issue Templates (#10446)
* Remove mention of python 3.10 wheels

These were released a while ago, just forgot to remove this notice.

* Add note about Discussions
2022-03-07 10:41:03 +01:00
Paul O'Leary McCann
c0cd5025e3 Start bringin in wl-coref
This absolutely does not work. First step here is getting over most of
the code in roughly the files we want it in. After the code has been
pulled over it can be restructured to match spaCy and cleaned up.
2022-03-06 20:00:15 +09:00
Sofie Van Landeghem
d89dac4066
hook up meta in load_model_from_config (#10400) 2022-03-04 11:07:45 +01:00
Paul O'Leary McCann
91acc3ea75
Fix entity linker batching (#9669)
* Partial fix of entity linker batching

* Add import

* Better name

* Add `use_gold_ents` option, docs

* Change to v2, create stub v1, update docs etc.

* Fix error type

Honestly no idea what the right type to use here is.
ConfigValidationError seems wrong. Maybe a NotImplementedError?

* Make mypy happy

* Add hacky fix for init issue

* Add legacy pipeline entity linker

* Fix references to class name

* Add __init__.py for legacy

* Attempted fix for loss issue

* Remove placeholder V1

* formatting

* slightly more interesting train data

* Handle batches with no usable examples

This adds a test for batches that have docs but not entities, and a
check in the component that detects such cases and skips the update step
as thought the batch were empty.

* Remove todo about data verification

Check for empty data was moved further up so this should be OK now - the
case in question shouldn't be possible.

* Fix gradient calculation

The model doesn't know which entities are not in the kb, so it generates
embeddings for the context of all of them.

However, the loss does know which entities aren't in the kb, and it
ignores them, as there's no sensible gradient.

This has the issue that the gradient will not be calculated for some of
the input embeddings, which causes a dimension mismatch in backprop.
That should have caused a clear error, but with numpyops it was causing
nans to happen, which is another problem that should be addressed
separately.

This commit changes the loss to give a zero gradient for entities not in
the kb.

* add failing test for v1 EL legacy architecture

* Add nasty but simple working check for legacy arch

* Clarify why init hack works the way it does

* Clarify use_gold_ents use case

* Fix use gold ents related handling

* Add tests for no gold ents and fix other tests

* Use aligned ents function (not working)

This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.

* Use proper matching ent check

This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.

* Move get_matching_ents to Example

* Use model attribute to check for legacy arch

* Rename flag

* bump spacy-legacy to lower 3.0.9

Co-authored-by: svlandeg <svlandeg@github.com>
2022-03-04 09:17:36 +01:00
Adriane Boyd
8e93fa8507
Fix Vectors.n_keys for floret vectors (#10394)
Fix `Vectors.n_keys` for floret vectors to match docstring description
and avoid W007 warnings in similarity methods.
2022-03-01 09:21:25 +01:00
Sofie Van Landeghem
3f68bbcfec
Clean up loggers docs (#10351)
* update docs to point to spacy-loggers docs

* remove unused error code
2022-02-25 16:29:12 +01:00
github-actions[bot]
d637b34e2f
Auto-format code with black (#10377)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-02-25 10:00:21 +01:00
Sam Edwardes
5f568f7e41
Updated spaCy universe for spacytextblob (#10335)
* Updated spacytextblob in universe.json

* Fixed json

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Added spacy_version tag to spacytextblob

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-24 14:18:10 +09:00
Adriane Boyd
b16da378bb
Re-remove universe tests from test suite (#10357) 2022-02-23 21:08:56 +01:00
kadarakos
249b97184d
Bugfixes and test for rehearse (#10347)
* fixing argument order for rehearse

* rehearse test for ner and tagger

* rehearse bugfix

* added test for parser

* test for multilabel textcat

* rehearse fix

* remove debug line

* Update spacy/tests/training/test_rehearse.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/tests/training/test_rehearse.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Kádár Ákos <akos@onyx.uvt.nl>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-23 16:10:05 +01:00
Adriane Boyd
b7ba7f78a2
Merge pull request #10344 from adrianeboyd/chore/v3.2-backport-10324
Fix Tok2Vec for empty batches (#10324)
2022-02-21 16:40:53 +01:00
Daniël de Kok
78a8bec4d0
Make core projectivization functions cdef nogil (#10241)
* Make core projectivization methods cdef nogil

While profiling the parser, I noticed that relatively a lot of time is
spent in projectivization. This change rewrites the functions in the
core loops as cdef nogil for efficiency.

In C++-land, we use vector in place of Python lists and absent heads
are represented as -1 in place of None.

* _heads_to_c: add assertion

Validation should be performed by the caller, but this assertion ensures that
we are not reading/writing out of bounds with incorrect input.
2022-02-21 15:02:21 +01:00
Adriane Boyd
cf5b46b63e Fix Tok2Vec for empty batches (#10324)
* Add test for tok2vec with vectors and empty docs

* Add shortcut for empty batch in Tok2Vec.predict

* Avoid types
2022-02-21 14:29:36 +01:00
Adriane Boyd
30030176ee
Update Korean defaults for Tokenizer (#10322)
Update Korean defaults for `Tokenizer` for tokenization following UD
Korean Kaist.
2022-02-21 10:26:19 +01:00
Adriane Boyd
f32ee2e533
Fix NER check in CoNLL-U converter (#10302)
* Fix NER check in CoNLL-U converter

Leave ents unset if no NER annotation is found in the MISC column.

* Revert to global rather than per-sentence NER check

* Update spacy/training/converters/conllu_to_docs.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-21 10:24:52 +01:00
Peter Baumgartner
3358fb9bdd
Miscellaneous Minor SpanGroups/DocBin Improvements (#10250)
* MultiHashEmbed vector docs correction

* doc copy span test

* ignore empty lists in DocBin.span_groups

* serialized empty list const + SpanGroups.is_empty

* add conditional deserial on from_bytes

* clean up + reorganize

* rm test

* add constant as class attribute

* rename to _EMPTY_BYTES

* Update spacy/tests/doc/test_span.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-21 10:24:15 +01:00