Commit Graph

14800 Commits

Author SHA1 Message Date
svlandeg
235e9f5488 call replace_listener_cfg attr if it's available 2021-05-12 17:19:38 +02:00
svlandeg
44a3a58599 call replace_listener attr if it's available 2021-05-12 16:01:02 +02:00
svlandeg
ece8be4fec extend test to training with replaced tok2vec layer 2021-05-12 11:32:22 +02:00
Frederic R. Hopp
7bba9cdc14
Update universe.json 2021-05-11 19:18:19 -07:00
Adriane Boyd
d5bbd1f94f
Handle partial entities in Span.as_doc (#8055)
* Handle partial entities in Span.as_doc

In `Span.as_doc` replace partial entities at the beginning or end of the
span with missing entity annotation.

Fixes a bug where invalid entity annotation (no initial `B`) was
returned for an initial partial entity.

* Check for empty span in ents conversion

Note: `Span.as_doc()` will still fail on an empty span due to failures
in `Span.vector`.
2021-05-11 17:10:16 +02:00
Ines Montani
3883d49446 Fix default transformer in quickstart generator (resolves #8018) [ci skip] 2021-05-11 11:27:08 +10:00
Paul O'Leary McCann
bdeaf3a18b
Fix/fix en ordinals (#8028)
* Fix #8019

"th" is not the only ordinal ending.

* Add some more ordinal tests
2021-05-07 10:26:42 +02:00
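The ordinal fix above ("th" is not the only ordinal ending) can be illustrated with a minimal sketch. The pattern and the helper name `looks_like_ordinal` are hypothetical and are not taken from spaCy's English language data. The sketch only shows a check that accepts all four English ordinal suffixes rather than "th" alone.

```python
import re

# Hypothetical pattern: digits followed by any English ordinal suffix,
# not just "th". It does not verify that the suffix agrees with the number.
_ORDINAL_RE = re.compile(r"\d+(st|nd|rd|th)", re.IGNORECASE)


def looks_like_ordinal(text: str) -> bool:
    """Return True for strings such as "1st", "22nd", "3rd" or "10th"."""
    return _ORDINAL_RE.fullmatch(text) is not None
```

A check that only looked for "th" would miss "1st", "22nd" and "3rd".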
Adriane Boyd
71c2a3ab47
Fix new version for match_alignments (#8021) 2021-05-07 09:55:20 +02:00
Jeno Pizarro
5cf76ab608
Update negspacy example code for spaCy 3.0 (#8022) 2021-05-07 09:33:21 +02:00
Adriane Boyd
6788d90f61
Preserve existing ENT_KB_ID annotation in NER (#7988)
* Preserve existing ENT_KB_ID annotation in NER

Preserve `ent_kb_id` annotation on existing entity spans, which is not
preserved by the transition system.

* Simplify kb_id assignment

* Simplify further
2021-05-06 18:49:55 +10:00
Sofie Van Landeghem
02a6a5fea0
Fix 'debug model' for transformers + generalize (#7973)
* add overrides to docs

* fix debug model with transformer

* assume training data is set in config
2021-05-06 18:43:32 +10:00
Adriane Boyd
cc5aeaed29
Add Chinese PTB tags to glossary (#7993) 2021-05-06 18:43:03 +10:00
Adriane Boyd
0a22fed634
Fix span offsets for Matcher(as_spans) on spans (#7992)
Fix returned span offsets for `Matcher(as_spans=True)(span)`.
2021-05-06 18:42:44 +10:00
Adriane Boyd
7d5db41ac3
Skip vector ngram backoff if minn is not set (#7925) 2021-05-06 18:34:35 +10:00
Sofie Van Landeghem
e9037d8fc0
make EntityLinker robust for nO=None (#7930) 2021-05-06 18:14:47 +10:00
Paul O'Leary McCann
66bfabd839
Fix pretraining objectives fragment (#8005)
* Fix pretraining objectives fragment

The fragment here is reused from a heading higher up, so you couldn't
link to this section.

* Fix section link to new fragment
2021-05-06 08:27:36 +02:00
Adriane Boyd
a71194362f
Fix Docs.from_docs for all empty docs (#8009) 2021-05-05 18:44:14 +02:00
meghanabhange
debaab7021
Update details in universe denomme | Multilingual Name Detection (#7982)
* Add denomme

* spaCy contributor agreement

* Update install and thumb

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-05-05 17:12:13 +02:00
Adriane Boyd
31528f62ed
Add / to nb infixes (#7991) 2021-05-04 11:00:10 +02:00
Santiago Castro
e99ff6f255
Fix typo in Language docstrings (#7958) 2021-05-03 14:44:09 +02:00
Ines Montani
12d3d0fedd Fix quickstart default checked of conditional fields [ci skip] 2021-05-03 11:48:12 +10:00
Adriane Boyd
2320791f6d
Fix Transformer.initialize example (#7963) 2021-04-30 12:21:31 +02:00
Adriane Boyd
cf032ec31e
Update to catalogue>=2.0.4 (#7951) 2021-04-29 19:11:28 +02:00
Adriane Boyd
7cf5bd072f
Refactor util.to_ternary_int (#7944)
* Refactor to avoid literal comparison with `is`
* Extend tests
2021-04-29 16:58:54 +02:00
Sevdimali
49aed683cc
Azerbaijani language added (#7911) 2021-04-28 14:42:02 +02:00
Adriane Boyd
f4080983ea
Extend to cupy 9.0.0 (#7914) 2021-04-28 10:18:24 +02:00
Paul O'Leary McCann
8007d5c814
Check if the resume path points to a directory (#7919)
This came up in #7878, but if --resume-path is a directory then loading
the weights will fail. On Linux this will give a straightforward error
message, but on Windows it gives "Permission Denied", which is
confusing.
2021-04-28 09:17:15 +02:00
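The kind of check described in the commit above might look like the following sketch. The function name `check_resume_path` and the error message are assumptions made for illustration, not spaCy's actual code. The idea is to reject a directory before trying to read weights from it, so the error is the same on every operating system.

```python
from pathlib import Path


def check_resume_path(resume_path: Path) -> Path:
    """Reject directories up front instead of failing later while reading."""
    if resume_path.is_dir():
        # Hypothetical message; the real CLI wording may differ.
        raise ValueError(
            f"Expected a weights file but got a directory: {resume_path}"
        )
    return resume_path
```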
Paul O'Leary McCann
de6b5ed14d
Fix percent unk display in debug data (#7886)
* Fix percent unk display

This was showing (ratio %), so 10% would show as 0.10%. Fix by
multiplying the ratio by 100.

Might want to add a warning if this is over a threshold.

* Only show whole-integer percents
2021-04-27 09:16:35 +02:00
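The display bug described above is easy to reproduce in isolation. The two helpers below are hypothetical and only contrast printing a raw ratio with a percent sign against scaling it by 100 and showing a whole number; they are not the `debug data` code itself.

```python
def format_ratio_unscaled(ratio: float) -> str:
    """Buggy variant: prints the raw ratio followed by a percent sign."""
    return f"{ratio:.2f}%"


def format_percent(ratio: float) -> str:
    """Fixed variant: scale by 100 and show a whole-number percentage."""
    return f"{ratio * 100:.0f}%"


print(format_ratio_unscaled(0.10))  # 0.10%
print(format_percent(0.10))         # 10%
```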
Janis Klaise
1690595e4d
Update load_lookups return type and docstring (#7907)
* Update load_lookups return type and docstring

* Add contributor agreement
2021-04-27 09:13:39 +02:00
Adriane Boyd
946a4284be Set spacy-legacy to >=3.0.5 (#7897)
Set `spacy-legacy` to `>=3.0.5` due to `spacy.StaticVectors.v1` init bug.
2021-04-26 18:25:39 +02:00
Adriane Boyd
874cd02539
Set spacy-legacy to >=3.0.5 (#7897)
Set `spacy-legacy` to `>=3.0.5` due to `spacy.StaticVectors.v1` init bug.
2021-04-26 17:06:32 +02:00
Adriane Boyd
ae855a4625
Clean up Morphology imports and definitions (#7441)
* Clean up Morphology imports and definitions

* Whitespace formatting
2021-04-26 16:54:23 +02:00
Adriane Boyd
ceee1ecf17
Replace cpdef variables with cdef (#7834) 2021-04-26 16:54:02 +02:00
Adriane Boyd
95c0833656
Add training option to set annotations on update (#7767)
* Add training option to set annotations on update

Add a `[training]` option called `set_annotations_on_update` to specify
a list of components for which the predicted annotations should be set
on `example.predicted` immediately after that component has been
updated. The predicted annotations can be accessed by later components
in the pipeline during the processing of the batch in the same `update`
call.

* Rename to annotates / annotating_components

* Add test for `annotating_components` when training from config

* Add documentation
2021-04-26 16:53:53 +02:00
Jacopo Farina
c105ed10fd
Remove torino from stop words (#7634)
Torino is the proper name of a city and the token has no other meaning
2021-04-26 16:53:43 +02:00
Sofie Van Landeghem
e0b29f8ef7
Fix scoring normalization (#7629)
* fix scoring normalization

* score weights by total sum instead of per component

* cleanup

* more cleanup
2021-04-26 16:53:38 +02:00
Sofie Van Landeghem
95e3cf576b
Optionally append lang for packaged model name (#7417)
* Add empty lines at the end of Python files

* Only prepend the lang code if it's not there already

* Update spacy/cli/package.py

* fix whitespace stripping
2021-04-26 16:53:21 +02:00
Adriane Boyd
df3444421a
Update spacy-legacy to >=3.0.4 (#7865) 2021-04-23 12:16:12 +02:00
Adriane Boyd
8a95475b3d
Set version to v3.0.6 (#7854) 2021-04-22 16:33:26 +02:00
Adriane Boyd
36ecba224e
Set up GPU CI testing (#7293)
* Set up CI for tests with GPU agent

* Update tests for enabled GPU

* Fix steps filename

* Add parallel build jobs as a setting

* Fix test requirements

* Fix install test requirements condition

* Fix pipeline models test

* Reset current ops in prefer/require testing

* Fix more tests

* Remove separate test_models test

* Fix regression 5551

* fix StaticVectors for GPU use

* fix vocab tests

* Fix regression test 5082

* Move azure steps to .github and reenable default pool jobs

* Consolidate/rename azure steps

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-04-22 14:58:29 +02:00
Adriane Boyd
bdb485cc80
Add callback to copy vocab/tokenizer from model (#7750)
* Add callback to copy vocab/tokenizer from model

Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer
settings and/or vocab (including vectors) from a base model.

* Move spacy.copy_from_base_model.v1 to spacy.training.callbacks

* Add documentation

* Modify to specify model as tokenizer and vocab params
2021-04-22 12:36:50 +02:00
Adriane Boyd
f68fc29130
Update sent_starts in Example.from_dict (#7847)
* Update sent_starts in Example.from_dict

Update `sent_starts` for `Example.from_dict` so that `Optional[bool]`
values have the same meaning as for `Token.is_sent_start`.

Use `Optional[bool]` as the type for sent start values in the docs.

* Use helper function for conversion to ternary ints
2021-04-22 11:32:45 +02:00
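A conversion from `Optional[bool]` to a ternary integer, as mentioned in the commit above, could be sketched as follows. The helper name `to_ternary` and the encoding (1 for a sentence start, -1 for not a start, 0 for unknown) are assumptions made for illustration; the commit text does not spell out the values.

```python
from typing import Optional


def to_ternary(value: Optional[bool]) -> int:
    """Map True/False/None to 1/-1/0 (assumed encoding)."""
    if value is None:
        return 0  # unknown or missing annotation
    return 1 if value else -1


sent_starts = [True, False, None]
print([to_ternary(v) for v in sent_starts])  # [1, -1, 0]
```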
Adriane Boyd
f4339f9bff
Fix tokenizer cache flushing (#7836)
* Fix tokenizer cache flushing

Fix/simplify tokenizer init detection in order to fix cache flushing
when properties are modified.

* Remove init reloading logic

* Remove logic disabling `_reload_special_cases` on init
  * Setting `rules` last in `__init__` (as before) means that setting
    other properties doesn't reload any special cases
  * Reset `rules` first in `from_bytes` so that setting other properties
    during deserialization doesn't reload any special cases
    unnecessarily
* Reset all properties in `Tokenizer.from_bytes` to allow any settings
  to be `None`

* Also reset special matcher when special cache is flushed

* Remove duplicate special case validation

* Add test for special cases flushing

* Extend test for tokenizer deserialization of None values
2021-04-22 18:14:57 +10:00
Sofie Van Landeghem
cfad7e21d5
fix config parsing of ints/strings (#7755)
* add few failing tests for parsing integers and strings

* bump thinc to 8.0.3
2021-04-22 18:09:13 +10:00
Adriane Boyd
d2bdaa7823
Replace negative rows with 0 in StaticVectors (#7674)
* Replace negative rows with 0 in StaticVectors

Replace negative row indices with 0-vectors in `StaticVectors`.

* Increase versions related to StaticVectors

* Increase versions of all architectures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations

Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5

* Update config defaults to new versions

* Update docs
2021-04-22 18:04:15 +10:00
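The behaviour described above, negative row indices yielding 0-vectors, can be sketched in plain Python. The function `lookup_rows` is hypothetical and uses nested lists instead of the array operations of the actual layer. It shows why the replacement matters: with ordinary Python indexing, a negative index would silently select a row counted from the end of the table.

```python
from typing import List, Sequence


def lookup_rows(
    table: Sequence[Sequence[float]], rows: Sequence[int]
) -> List[List[float]]:
    """Return one vector per row index, using a 0-vector for negative rows."""
    width = len(table[0])
    return [
        list(table[row]) if row >= 0 else [0.0] * width
        for row in rows
    ]


table = [[1.0, 2.0], [3.0, 4.0]]
print(lookup_rows(table, [0, -1, 1]))  # [[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]]
```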
Sofie Van Landeghem
6f565cf39d
fix typo in entity_linker docs 2021-04-22 09:59:24 +02:00
Sofie Van Landeghem
2e746dbf32
update EL training data format in docs (#7839)
* update EL training data format

* fix typo

* all -1 because reasons
2021-04-22 08:50:09 +02:00
meghanabhange
49ff1126bf
Project Idea : denomme | Multilingual Name Detection (#7845)
* Add denomme

* spaCy contributor agreement

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-22 08:48:17 +02:00
Sam Edwardes
b8c6c10c6f
Added a logo to spaCyTextBlob (#7818)
* Added a logo to spaCyTextBlob

* Updated to better thumb
2021-04-22 08:41:55 +02:00
Diego Palma
bbade153ed
Add TRUNAJOD to spaCy universe. (#7754)
* Add TRUNAJOD to spaCy universe.

* Add trunajod logo and thumb.

Co-authored-by: Diego <dpalma@evernote.com>
2021-04-22 08:40:28 +02:00