324 Commits

Author SHA1 Message Date
Paul O'Leary McCann
77d698dcae
Fix check for RIGHT_ATTRS in dep matcher (#8807)
* Fix check for RIGHT_ATTRs in dep matcher

If a non-anchor node does not have RIGHT_ATTRS, the dep matcher throws
an E100, which says that non-anchor nodes must have LEFT_ID, REL_OP, and
RIGHT_ID. It specifically does not say RIGHT_ATTRS is required.

A blank RIGHT_ATTRS is also valid, and patterns with one will be
accepted. While not normal, sometimes a REL_OP is enough to specify a
non-anchor node - maybe you just want the head of another node
unconditionally, for example.

This change just sets RIGHT_ATTRS to {} if not present. Alternatively
changing E100 to state RIGHT_ATTRS is required could also be reasonable.

* Fix test

This test was written on the assumption that if `RIGHT_ATTRS` isn't
present an error will be raised. Since the proposed changes make it so
an error won't be raised this is no longer necessary.

* Revert test, update error message

Error message now lists missing keys, and RIGHT_ATTRS is required.

* Use list of required keys in error message

Also removes unused key param arg.
2021-08-04 09:20:41 +02:00
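
The end state of the commit above (every non-anchor node must carry a fixed set of keys, and the error lists the ones that are missing) can be sketched in plain Python. This is only an illustration under stated assumptions, not the DependencyMatcher source: `REQUIRED_KEYS`, `missing_keys` and `validate_node` are hypothetical names, and the error text is invented for the example.

```python
# Hypothetical sketch; not spaCy's actual implementation.
REQUIRED_KEYS = ("LEFT_ID", "REL_OP", "RIGHT_ID", "RIGHT_ATTRS")


def missing_keys(node):
    """Return the required keys that are absent from a non-anchor node."""
    return [key for key in REQUIRED_KEYS if key not in node]


def validate_node(node):
    """Raise a ValueError that names every missing key."""
    missing = missing_keys(node)
    if missing:
        raise ValueError(f"Non-anchor node is missing required keys: {missing}")


# A blank RIGHT_ATTRS is still present, so this node passes the check.
validate_node(
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "child", "RIGHT_ATTRS": {}}
)
```

Listing the missing keys in the message, rather than restating the full requirement, tells the user exactly what to add to the pattern.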
Julien Rossi
e117573822
Adding noun_chunks to the DUTCH language model (nl) (#8529)
* implement noun_chunks for Dutch language

* copy/paste FR and SV syntax iterators to accommodate UD tags
* added tests with Dutch text
* signed contributor agreement

* 🐛 fix noun chunks generator

* built from scratch
* define noun chunk as a single Noun-Phrase
* includes some corner cases debugging (incorrect POS tagging)
* test with provided annotated sample (POS, DEP)

* fix failing test

* CI pipeline did not like the added sample file
* add the sample as a pytest fixture

* Update spacy/lang/nl/syntax_iterators.py

* Update spacy/lang/nl/syntax_iterators.py

Code readability

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/tests/lang/nl/test_noun_chunks.py

correct comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* finalize code

* change "if next_word" into "if next_word is not None"

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-07-14 14:01:02 +02:00
Adriane Boyd
29906884c5
Raise an error for textcat with <2 labels (#8584)
* Raise an error for textcat with <2 labels

Raise an error if initializing a `textcat` component without at least
two labels.

* Add similar note to docs

* Update positive_label description in API docs
2021-07-06 12:35:22 +02:00
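
The guard described in the commit above amounts to a label-count check at initialization time. A minimal sketch, assuming a hypothetical helper `check_textcat_labels` and an invented error message (the real check lives inside the `textcat` component):

```python
# Hypothetical sketch of the "at least two labels" guard.
def check_textcat_labels(labels):
    """Return the labels as a list, or raise if fewer than two are given."""
    labels = list(labels)
    if len(labels) < 2:
        raise ValueError(
            f"textcat requires at least two labels, but got {len(labels)}: {labels}"
        )
    return labels


check_textcat_labels(["POSITIVE", "NEGATIVE"])  # passes
```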
Adrian Zuber
f5aee0bbdf
Raise custom error in EntityLinker when KB is not set (#8442)
* Raise custom error in EntityLinker when KB is not set

* add contributor agreement

* Update E1018 error message
2021-06-25 23:04:00 +02:00
Adriane Boyd
9fde258053
Use minor version for compatibility check (#8403)
* Use minor version for compatibility check

* Use minor version of compatibility table
* Soften warning message about incompatible models
* Add test for presence of current version in compatibility table

* Add test for download compatibility table

* Use minor version of lower pin in error message if possible

* Fall back to spacy_git_version if available

* Fix unknown version string
2021-06-21 09:39:22 +02:00
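
Comparing on the minor version, as the commit above does, means `3.1.0` and `3.1.4` are treated as compatible while `3.0.6` and `3.1.0` are not. The following sketch shows the idea with hypothetical helper names; it makes no claim about how spaCy parses versions internally, and it simply treats an unparseable string as incompatible.

```python
# Hypothetical sketch of a minor-version comparison.
def get_minor_version(version):
    """Return 'major.minor' for a version string, or None if it cannot be parsed."""
    parts = version.split(".")
    if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
        return None
    return f"{parts[0]}.{parts[1]}"


def is_minor_compatible(version_a, version_b):
    """True only if both versions parse and share the same major.minor prefix."""
    minor_a = get_minor_version(version_a)
    minor_b = get_minor_version(version_b)
    return minor_a is not None and minor_a == minor_b
```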
Matthew Honnibal
6f5e308d17
Support negative examples in partial NER annotations (#8106)
* Support a cfg field in transition system

* Make NER 'has gold' check use right alignment for span

* Pass 'negative_samples_key' property into NER transition system

* Add field for negative samples to NER transition system

* Check neg_key in NER has_gold

* Support negative examples in NER oracle

* Test for negative examples in NER

* Fix name of config variable in NER

* Remove vestiges of old-style partial annotation

* Remove obsolete tests

* Add comment noting lack of support for negative samples in parser

* Additions to "neg examples" PR (#8201)

* add custom error and test for deprecated format

* add test for unlearning an entity

* add break also for Begin's cost

* add negative_samples_key property on Parser

* rename

* extend docs & fix some older docs issues

* add subclass constructors, clean up tests, fix docs

* add flaky test with ValueError if gold parse was not found

* remove ValueError if n_gold == 0

* fix docstring

* Hack in environment variables to try out training

* Remove hack

* Remove NER hack, and support 'negative O' samples

* Fix O oracle

* Fix transition parser

* Remove 'not O' from oracle

* Fix NER oracle

* check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation

* use set instead of list in consistency check

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-17 17:33:00 +10:00
Adriane Boyd
5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
Paul O'Leary McCann
2c105cdbce
Raise error if deps not provided with heads (#8335)
* Fill in deps if not provided with heads

Before this change, if heads were passed without deps they would be
silently ignored, which could be confusing. See #8334.

* Use "dep" instead of a blank string

This is the customary placeholder dep. It might be better to show an
error here instead though.

* Throw error on heads without deps

* Add a test

* Fix tests

* Formatting

* Fix all tests

* Fix a test I missed

* Revise error message

* Clean up whitespace

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-06-15 13:23:32 +02:00
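
The behaviour the commit above settles on (raise instead of silently ignoring heads, or filling in a placeholder dep) can be sketched as a small argument check. `check_heads_and_deps` and its message are hypothetical and stand in for the validation done when a `Doc` is constructed.

```python
# Hypothetical sketch; the real validation happens during Doc construction.
def check_heads_and_deps(heads=None, deps=None):
    """Raise if heads are given without deps; otherwise return both unchanged."""
    if heads is not None and deps is None:
        raise ValueError("heads were provided without deps; provide both or neither")
    return heads, deps


check_heads_and_deps(heads=[1, 1], deps=["nsubj", "ROOT"])  # passes
check_heads_and_deps()                                      # passes
```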
Adriane Boyd
9dfd3c9484 Use warnings.warn instead of logger.warning 2021-06-04 17:44:08 +02:00
Sofie Van Landeghem
f0277bdeab Show warning if entity_ruler runs without patterns (#7807)
* Show warning if entity_ruler runs without patterns

* Show warning if matcher runs without patterns

* fix wording

* unit test for warning once (WIP)

* warn W036 only once

* cleanup

* create filter_warning helper
2021-06-04 17:37:38 +02:00
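
"Warn only once" can be expressed with the standard library's warning filters. The sketch below is an assumption about what a helper like the `filter_warning` mentioned above might do, not a copy of spaCy's code: it installs a filter for messages that begin with a bracketed code such as `[W036]`. The message format is also assumed.

```python
import re
import warnings


# Hypothetical sketch of a helper that filters warnings by their code prefix.
def filter_warning(action, code):
    """Apply `action` ("once", "ignore", "error", ...) to messages starting with [code]."""
    # The message argument is a regex matched against the start of the warning text.
    warnings.filterwarnings(action, message=re.escape(f"[{code}]"))


# Usage: after this call, repeated "[W036] ..." warnings are shown a single time.
filter_warning("once", "W036")
```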
Sofie Van Landeghem
ff91e6dac7
Show warning if entity_ruler runs without patterns (#7807)
* Show warning if entity_ruler runs without patterns

* Show warning if matcher runs without patterns

* fix wording

* unit test for warning once (WIP)

* warn W036 only once

* cleanup

* create filter_warning helper
2021-05-31 18:20:27 +10:00
Sofie Van Landeghem
0dffc5d9e2
Custom warning if the doc_bin is too large (#8069)
* custom warning if the doc_bin is too large

* cleanup

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* fix numbering

* fixing numbering once more

* fixing this seems to be pretty hard

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-05-17 15:48:40 +02:00
Adriane Boyd
b120fb3511
Handle errors while multiprocessing (#8004)
* Handle errors while multiprocessing

Handle errors while multiprocessing without hanging.

* Return the traceback for errors raised while processing a batch, which
  can be handled by the top-level error handler
* Allow for shortened batches due to custom error handlers that ignore
  errors and skip documents

* Define custom components at a higher level

* Also move up custom error handler

* Use simpler component for test

* Switch error type

* Adjust test

* Only call top-level error handler for exceptions

* Register custom test components within tests

Use global functions (so they can be pickled) but register the
components only within the individual tests.
2021-05-17 13:28:39 +02:00
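
The first bullet of the commit above describes returning the traceback for a failed batch instead of letting the worker hang. A minimal single-process sketch of that pattern, with a hypothetical `process_batch` function (the real code runs in worker processes and hands the error to spaCy's error handler):

```python
import traceback


# Hypothetical sketch: capture the error as text so the caller can handle it.
def process_batch(texts, func):
    """Apply func to every text; return (results, None) or (None, traceback_text)."""
    try:
        return [func(text) for text in texts], None
    except Exception:
        # A formatted traceback is a plain string, so it can be sent back
        # to a parent process without pickling the exception object.
        return None, traceback.format_exc()


results, error = process_batch(["a", "b"], str.upper)
```

Returning a string rather than re-raising keeps the worker alive, and the top-level handler can then decide whether to raise or to skip the batch.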
Adriane Boyd
82fa81d095
Make all Span attrs writable (#8062)
Also allow `Span` string properties `label_` and `kb_id_` to be writable
following #6696.
2021-05-17 18:05:45 +10:00
Adriane Boyd
bdb485cc80
Add callback to copy vocab/tokenizer from model (#7750)
* Add callback to copy vocab/tokenizer from model

Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer
settings and/or vocab (including vectors) from a base model.

* Move spacy.copy_from_base_model.v1 to spacy.training.callbacks

* Add documentation

* Modify to specify model as tokenizer and vocab params
2021-04-22 12:36:50 +02:00
Adriane Boyd
1ad646cbcf
Improve checks for sourced components (#7490)
* Improve checks for sourced components

* Remove language class checks

* Convert python warning to logger warning

* Remove unused warning

* Fix formatting
2021-04-19 18:36:32 +10:00
Bram Vanroy
ed561cf428
Terminology: deprecated vs obsolete (#7621)
* Terminology: deprecated vs obsolete

Typically, deprecated is used for functionality that is bound to become unavailable but that can still be used. Obsolete is used for features that have been removed. In E941, I think what is meant is "obsolete" since loading a model by a shortcut simply does not work anymore (and throws an error). This is different from downloading a model with a shortcut, which is deprecated but still works.

In light of this, perhaps all other error codes should be checked as well.

* clarify that the link command is removed and not just deprecated

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-04-12 14:37:00 +02:00
Paul O'Leary McCann
7944761ba7
Add warning if initial vectors are empty (#7641)
See #7637, where this came up.
2021-04-04 20:20:24 +02:00
Adriane Boyd
139f655f34
Merge doc.spans in Doc.from_docs() (#7497)
Merge data from `doc.spans` in `Doc.from_docs()`.

* Fix internal character offset set when merging empty docs (only
affects tokens and spans in `user_data` if an empty doc is in the list
of docs)
2021-03-29 22:34:01 +11:00
Adriane Boyd
39153ef90f Update lexeme_norm checks
* Add util method for check
* Add new languages to list with lexeme norm tables
* Add check to all relevant components
* Add config details to warning message

Note that we're not actually inspecting the model config to see if
`NORM` is used as an attribute, so it may warn in cases where it's not
relevant.
2021-03-19 10:59:27 +01:00
Adriane Boyd
d746ea6278
Add warning about GPU selection in Jupyter notebooks (#7075)
* Initial warning

* Update check

* Redo edit

* Move jupyter warning to helper method

* Add link with details to warnings
2021-03-09 15:35:21 +01:00
Sofie Van Landeghem
39de3602e0
return custom error in nlp.initialize (#7104)
* return custom error in nlp.initialize

* Rename error

Co-authored-by: Ines Montani <ines@ines.io>
2021-03-09 23:01:31 +11:00
Sofie Van Landeghem
cd70c3cb79
Fixing pretrain (#7342)
* initialize NLP with train corpus

* add more pretraining tests

* more tests

* function to fetch tok2vec layer for pretraining

* clarify parameter name

* test different objectives

* formatting

* fix check for static vectors when using vectors objective

* clarify docs

* logger statement

* fix init_tok2vec and proc.initialize order

* test training after pretraining

* add init_config tests for pretraining

* pop pretraining block to avoid config validation errors

* custom errors
2021-03-09 14:01:13 +11:00
Sofie Van Landeghem
212f0e779e
Support doc.spans in Example.from_dict (#7197)
* add support for spans in Example.from_dict

* add unit tests

* update error to E879
2021-03-03 01:12:54 +11:00
svlandeg
2010219a7f import wandb failure - UX 2021-02-26 18:00:39 +01:00
Sofie Van Landeghem
ba5a50f62b
NEL docs & UX (#7129)
* EL set_kb docs fix

* custom warning for set_kb mistake
2021-02-22 11:04:22 +11:00
Adriane Boyd
6108dabdc8 Rephrase error related to sample data initialization
Now that the initialize step is fully implemented, the source of E923 is
typically missing or improperly converted/formatted data rather than a
bug in spaCy, so rephrase the error and message and remove the prompt to
open an issue.
2021-02-08 09:21:36 +01:00
Ines Montani
d0c3775712 Replace links to nightly docs [ci skip] 2021-01-30 20:09:38 +11:00
Ines Montani
526b416118 Tidy up comments 2021-01-30 12:34:09 +11:00
Ines Montani
30765674d0 Merge branch 'master' into develop 2021-01-30 12:20:28 +11:00
Ines Montani
7694f76dd1 Update warning and mention replace_listeners 2021-01-29 23:46:01 +11:00
Ines Montani
94232aea08 Improve E889 2021-01-29 23:39:23 +11:00
Ines Montani
bbb94b37c6 Update error handling and docstring 2021-01-29 16:27:49 +11:00
Adriane Boyd
fcce3600ed
Forbid OP matching 2+ tokens in DependencyMatcher (#6824)
Instead of silently using only the first token in each matched span:

* Forbid `OP: ?/*/+` through `DependencyMatcher` validation
* As a fail-safe, add warning if a token match that's not exactly one
token long is found by a token pattern.
2021-01-29 08:52:01 +08:00
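
The validation step described in the commit above rejects quantifier operators up front. A sketch of such a check over a list of token attribute dicts, assuming hypothetical names `FORBIDDEN_OPS` and `check_no_quantifiers`:

```python
# Hypothetical sketch; not the DependencyMatcher validation code.
FORBIDDEN_OPS = {"?", "*", "+"}


def check_no_quantifiers(token_patterns):
    """Raise if any token pattern uses an OP that can match other than one token."""
    for index, attrs in enumerate(token_patterns):
        op = attrs.get("OP")
        if op in FORBIDDEN_OPS:
            raise ValueError(f"Token pattern {index} uses unsupported OP: {op!r}")


check_no_quantifiers([{"ORTH": "founded"}, {"POS": "NOUN"}])  # passes
```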
Sofie Van Landeghem
24a697abb8
avoid empty aliases and improve UX and docs (#6840) 2021-01-29 08:51:40 +08:00
Adriane Boyd
4096a79de7
Add alignment mode error and fix Doc.char_span docs (#6820)
* Raise an error on an unrecognized alignment mode rather than
defaulting to `strict`
* Fix the `Doc.char_span` API doc alignment mode details
2021-01-27 23:40:42 +11:00
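
Raising on an unrecognized mode, rather than quietly falling back to `strict`, can be sketched as below. The three mode names are the ones `Doc.char_span` documents; the helper name and error text are hypothetical.

```python
# Hypothetical sketch of the alignment mode check.
ALIGNMENT_MODES = ("strict", "contract", "expand")


def check_alignment_mode(mode):
    """Return the mode if it is recognized, otherwise raise a ValueError."""
    if mode not in ALIGNMENT_MODES:
        raise ValueError(
            f"Unrecognized alignment mode {mode!r}; expected one of {ALIGNMENT_MODES}"
        )
    return mode
```

Failing loudly here catches typos such as `"expnd"` that would otherwise change span boundaries without any visible signal.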
Ines Montani
c0926c9088
WIP: Various small training changes (#6818)
* Allow output_path to be None during training

* Fix cat scoring (?)

* Improve error message for weighted None score

* Improve messages

So we can call this in other places etc.

* Fix output path check

* Use latest wasabi

* Revert "Improve error message for weighted None score"

This reverts commit 7059926763.

* Exclude None scores from final score by default

It's otherwise very difficult to keep track of the score weights if we modify a config programmatically, source components etc.

* Update warnings and use logger.warning
2021-01-26 14:51:52 +11:00
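
"Exclude None scores from final score" can be illustrated with a weighted sum that skips missing values. This sketch assumes a hypothetical `weighted_score` function and does not renormalize the remaining weights; whether spaCy does so is not stated in the commit above.

```python
# Hypothetical sketch; weights are used as given, without renormalization.
def weighted_score(scores, weights):
    """Sum weight * score over all weighted keys, skipping scores that are None."""
    total = 0.0
    for name, weight in weights.items():
        value = scores.get(name)
        if value is None:
            continue  # a component that produced no score does not contribute
        total += weight * value
    return total


final = weighted_score({"ents_f": 0.5, "cats_score": None}, {"ents_f": 1.0, "cats_score": 1.0})
```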
Ines Montani
1090d3d675 Merge branch 'develop' into feature/spacy-legacy 2021-01-18 11:43:39 +11:00
Sofie Van Landeghem
fed8f48965
raise NotImplementedError when noun_chunks iterator is not implemented (#6711)
* raise NotImplementedError when noun_chunks iterator is not implemented

* bring back, fix and document span.noun_chunks

* formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2021-01-17 19:56:05 +08:00
Ines Montani
a552db2819 Include available registry names in error 2021-01-16 14:35:03 +11:00
Ines Montani
a203e3dbb8 Support spacy-legacy via the registry 2021-01-15 21:42:40 +11:00
Adriane Boyd
681a6195f7 Validate seed and gpu_allocator manually 2021-01-14 16:57:57 +01:00
Sofie Van Landeghem
afc5714d32
multi-label textcat component (#6474)
* multi-label textcat component

* formatting

* fix comment

* cleanup

* fix from #6481

* random edit to push the tests

* add explicit error when textcat is called with multi-label gold data

* fix error nr

* small fix
2021-01-06 13:07:14 +11:00
Adriane Boyd
5ca57d8221
Add logger warning when serializing user hooks (#6595)
Add a warning that user hooks are lost on serialization.

Add a `user_hooks` exclude to skip the warning with pickle.
2020-12-29 11:54:32 +01:00
Ines Montani
dfaef27f90
Merge pull request #6503 from adrianeboyd/feature/lemmatizer-rule-warning-pos
Warn on empty POS for the rule-based lemmatizer
2020-12-09 11:34:16 +11:00
Sofie Van Landeghem
de108ed3e8
Add specific error when StaticVectors can't read the vectors data (#6450) 2020-12-09 06:16:07 +08:00
Sofie Van Landeghem
f98a04434a
pretrain architectures (#6451)
* define new architectures for the pretraining objective

* add loss function as attr of the model

* cleanup

* cleanup

* shorten name

* fix typo

* remove unused error
2020-12-08 14:41:03 +08:00
Ines Montani
ee2ec52f48
Merge pull request #6409 from svlandeg/feature/trf-docs 2020-12-08 06:32:10 +01:00
Adriane Boyd
d70950605c Warn on empty POS for the rule-based lemmatizer
Add a warning to the rule-based lemmatizer for any tokens without POS
annotation.
2020-12-04 11:46:15 +01:00
Adriane Boyd
26296ab223
Add error message if DocBin zlib decompress fails (#6394)
Add a better error message if DocBin zlib decompress fails, indicating
that the data is not in `DocBin` format.
2020-11-27 14:39:49 +08:00
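
The idea in the commit above is to turn a low-level `zlib` failure into a message that points at the likely cause. A minimal sketch using only the standard library; `decompress_docbin_bytes` and the error wording are hypothetical, and the real `DocBin` goes on to deserialize the decompressed payload.

```python
import zlib


# Hypothetical sketch of wrapping the decompression error.
def decompress_docbin_bytes(data):
    """Decompress bytes, raising a clearer error if they are not zlib data."""
    try:
        return zlib.decompress(data)
    except zlib.error as err:
        raise ValueError(
            "Could not decompress the data; it does not appear to be in DocBin format"
        ) from err


payload = decompress_docbin_bytes(zlib.compress(b"example"))
```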