606 Commits

Author SHA1 Message Date
Adriane Boyd
d98d525bc8 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3 2021-10-14 09:41:46 +02:00
Elia Robyn Lake (Robyn Speer)
53b5f245ed
Allow IETF language codes, aliases, and close matches (#9342)
* use language-matching to allow language code aliases

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* link to "IETF language tags" in docs

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* Make requirements consistent

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* change "two-letter language ID" to "IETF language tag" in language docs

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* use langcodes 3.2 and handle language-tag errors better

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* all unknown language codes are ImportErrors

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
2021-10-05 09:52:22 +02:00
Adriane Boyd
4192e71599
Sync vocab in vectors and components sourced in configs (#9335)
Since a component may reference anything in the vocab, share the full
vocab when loading source components and vectors (which will include
`strings` as of #8909).

When loading a source component from a config, save and restore the
vocab state after loading source pipelines, in particular to preserve
the original state without vectors, since `[initialize.vectors] = null`
skips rather than resets the vectors.

The vocab references are not synced for components loaded with
`Language.add_pipe(source=)` because the pipelines are already loaded
and not necessarily with the same vocab. A warning could be added in
`Language.create_pipe_from_source` that it may be necessary to save and
reload before training, but it's a rare enough case that this kind of
warning may be too noisy overall.
2021-10-04 12:19:02 +02:00
Adriane Boyd
e750c1760c
Restore tokenization timing in Language.evaluate (#9305)
Restore tokenization timing steps that were accidentally removed in #6765.
2021-09-27 20:44:14 +02:00
Adriane Boyd
03f234b739 Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
Adriane Boyd
2f0bb77920
Accept Doc input in pipelines (#9069)
* Accept Doc input in pipelines

Allow `Doc` input to `Language.__call__` and `Language.pipe`, which
skips `Language.make_doc` and passes the doc directly to the pipeline.

* ensure_doc helper function

* avoid running multiple processes on GPU

* Update spacy/tests/test_language.py

Co-authored-by: svlandeg <svlandeg@github.com>
2021-09-22 09:41:05 +02:00
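
The `ensure_doc` idea from the commit above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `Doc`, `make_doc`, and `pipeline` here are toy stand-ins invented for the example, and the whitespace tokenizer is an assumption.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Doc:
    """Toy stand-in for a processed document (hypothetical, not spaCy's Doc)."""
    words: List[str]


def make_doc(text: str) -> Doc:
    # Stand-in tokenizer for the sketch: split on whitespace.
    return Doc(words=text.split())


def ensure_doc(doc_like: Union[str, Doc]) -> Doc:
    """Return Doc input unchanged, tokenize str input, reject anything else."""
    if isinstance(doc_like, Doc):
        return doc_like  # already a Doc: skip make_doc entirely
    if isinstance(doc_like, str):
        return make_doc(doc_like)
    raise ValueError(f"Expected str or Doc, got {type(doc_like).__name__}")


def pipeline(doc_like: Union[str, Doc]) -> Doc:
    doc = ensure_doc(doc_like)
    # Pipeline components would run on `doc` here.
    return doc
```

With a helper like this, callers can pass pre-built documents (for example with existing annotations) through the same entry point as raw text.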
Sofie Van Landeghem
1e974de837
config is not Optional (#9024) 2021-08-27 11:44:31 +02:00
Ines Montani
d94ddd5686
Auto-detect package dependencies in spacy package (#8948)
* Auto-detect package dependencies in spacy package

* Add simple get_third_party_dependencies test

* Import packages_distributions explicitly

* Inline packages_distributions

* Fix docstring [ci skip]

* Relax catalogue requirement

* Move importlib_metadata to spacy.compat with note

* Include license information [ci skip]
2021-08-17 14:05:13 +02:00
Adriane Boyd
941a591f3c
Pass excludes when serializing vocab (#8824)
* Pass excludes when serializing vocab

Additional minor bug fix:

* Deserialize vocab in `EntityLinker.from_disk`

* Add test for excluding strings on load

* Fix formatting
2021-08-03 14:42:44 +02:00
Adriane Boyd
9ad3b8cf8d
Only add sourced vectors hashes to meta if necessary (#8830) 2021-08-02 18:22:35 +02:00
Adriane Boyd
e532c69475
Update Language.replace_pipe for disabled components (#8729)
* Fix the index where the replacement is inserted to account for
disabled components
* Allow `Language.replace_pipe` to replace disabled components
2021-07-19 18:06:12 +10:00
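
The index issue described above can be illustrated with a toy pipeline. The sketch assumes a design where all components live in one ordered list and disabled ones are tracked by name; `ToyPipeline` and its methods are hypothetical and only mirror the naming in the commit message (`component_names` for all components, `pipe_names` for enabled ones).

```python
from typing import Callable, List, Set, Tuple

Component = Callable[[str], str]


class ToyPipeline:
    """Hypothetical pipeline that keeps disabled components in place."""

    def __init__(self) -> None:
        self._components: List[Tuple[str, Component]] = []
        self._disabled: Set[str] = set()

    @property
    def component_names(self) -> List[str]:
        # All components, enabled or not, in pipeline order.
        return [name for name, _ in self._components]

    @property
    def pipe_names(self) -> List[str]:
        # Only the enabled components.
        return [n for n, _ in self._components if n not in self._disabled]

    def add_pipe(self, name: str, component: Component) -> None:
        self._components.append((name, component))

    def disable_pipe(self, name: str) -> None:
        self._disabled.add(name)

    def replace_pipe(self, name: str, component: Component) -> None:
        if name not in self.component_names:
            raise ValueError(f"No component named {name!r}")
        # Look the index up in the full list. Using pipe_names here would
        # give an index that is too small whenever an earlier component
        # is disabled, and would fail for disabled components.
        index = self.component_names.index(name)
        self._components[index] = (name, component)
```

Looking up the position among all components is what lets the replacement land in the right slot and also makes disabled components replaceable.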
Ines Montani
f90482d077 Tidy up and auto-format 2021-07-18 15:44:56 +10:00
explosion-bot
334f1f98d8 Auto-format code with black 2021-07-09 08:06:06 +00:00
Luca Dorigo
e8ef4a46d5
Add the right return type for Language.pipe and an overload for the as_tuples case (#8441)
* Add the right return type for Language.pipe and an overload for the as_tuples version

* Reformat, tidy up

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-07-06 14:18:40 +02:00
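
A return type that depends on an `as_tuples` flag is typically expressed with `typing.overload`. The following is a minimal sketch of that pattern on a free function; the `Doc` class and the exact signature are assumptions made for the example and are not copied from the library.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Literal, Tuple, TypeVar, overload

Ctx = TypeVar("Ctx")


@dataclass
class Doc:
    """Toy stand-in for a processed document (hypothetical)."""
    text: str


@overload
def pipe(texts: Iterable[str], *, as_tuples: Literal[False] = ...) -> Iterator[Doc]: ...
@overload
def pipe(
    texts: Iterable[Tuple[str, Ctx]], *, as_tuples: Literal[True]
) -> Iterator[Tuple[Doc, Ctx]]: ...


def pipe(texts, *, as_tuples=False):
    # Single runtime implementation; the overloads above only guide
    # static type checkers.
    if as_tuples:
        for text, context in texts:
            yield Doc(text), context
    else:
        for text in texts:
            yield Doc(text)
```

A type checker can then infer `Iterator[Doc]` for `pipe(["a"])` and `Iterator[Tuple[Doc, int]]` for `pipe([("a", 1)], as_tuples=True)`.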
Adriane Boyd
5fd0b5207e
Fix vectors check for sourced components (#8559)
* Fix vectors check for sourced components

Since vectors are not loaded when components are sourced, store a hash
for the vectors of each sourced component and compare it to the loaded
vectors after the vectors are loaded from the `[initialize]` block.

* Pop temporary info

* Remove stored hash in remove_pipe

* Add default for pop

* Add additional convert/debug/assemble CLI tests
2021-07-06 12:43:17 +02:00
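
The hash comparison described above can be sketched as follows. This is an illustration under stated assumptions: vectors are modeled as plain nested lists, the hash is SHA-256 over their `repr`, and `vectors_hash` and `check_sourced_vectors` are hypothetical helper names.

```python
import hashlib
import warnings
from typing import Dict, List

Rows = List[List[float]]


def vectors_hash(rows: Rows) -> str:
    # Assumption for the sketch: hash a stable text form of the table.
    return hashlib.sha256(repr(rows).encode("utf8")).hexdigest()


def check_sourced_vectors(sourced_hashes: Dict[str, str], loaded: Rows) -> List[str]:
    """Compare each stored hash with the finally loaded vectors.

    Returns the names of sourced components whose source vectors differ,
    and emits one warning per mismatch.
    """
    loaded_hash = vectors_hash(loaded)
    mismatched = [name for name, h in sourced_hashes.items() if h != loaded_hash]
    for name in mismatched:
        warnings.warn(f"Sourced component {name!r} was trained with different vectors")
    return mismatched


# Store a hash when each component is sourced, then check after the
# vectors for the current pipeline have been loaded.
source_vectors = [[0.1, 0.2], [0.3, 0.4]]
stored = {"ner": vectors_hash(source_vectors), "tagger": vectors_hash([[1.0, 2.0]])}
```

Deferring the comparison until the current pipeline's vectors exist is the point of storing a hash instead of checking at sourcing time.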
Adriane Boyd
86d01e9229 Tidy up with flake8: imports, comparisons, etc. 2021-06-28 12:08:15 +02:00
Adriane Boyd
5eeb25f043 Tidy up code 2021-06-28 12:08:15 +02:00
Adriane Boyd
7abfa25035
Don't use the same vocab for source models (#8388)
* Don't use the same vocab for source models

The source models should not be loaded with the vocab from the current
pipeline because this loads the vectors from the source model into the
current vocab.

The strings are all copied in `Language.create_pipe_from_source`, so if
the vectors are configured correctly in the current pipeline, the
sourced component will work as expected. If there is a vector mismatch,
a warning is shown. (It's not possible to inspect whether the vectors
are actually used by the component, so a warning is the best option.)

* Update comment on source model loading
2021-06-21 09:33:33 +02:00
Adriane Boyd
5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
Adriane Boyd
9dfd3c9484 Use warnings.warn instead of logger.warning 2021-06-04 17:44:08 +02:00
Narayan Acharya
6b79714080
Address missing config overrides post load of models (#8208) 2021-05-31 18:36:52 +10:00
Ines Montani
5957ab74f7
Merge pull request #8112 from svlandeg/bugfix/replace-trf 2021-05-28 11:35:17 +10:00
Adriane Boyd
b120fb3511
Handle errors while multiprocessing (#8004)
* Handle errors while multiprocessing

Handle errors while multiprocessing without hanging.

* Return the traceback for errors raised while processing a batch, which
  can be handled by the top-level error handler
* Allow for shortened batches due to custom error handlers that ignore
  errors and skip documents

* Define custom components at a higher level

* Also move up custom error handler

* Use simpler component for test

* Switch error type

* Adjust test

* Only call top-level error handler for exceptions

* Register custom test components within tests

Use global functions (so they can be pickled) but register the
components only within the individual tests.
2021-05-17 13:28:39 +02:00
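
The pattern of returning a traceback from a worker instead of raising can be sketched like this. For brevity the sketch runs in a single process; in a real setup `process_batch` would run in a worker process and its return value would be sent back over a queue or pipe. All names here are hypothetical.

```python
import traceback
from typing import Callable, Iterable, List, Optional, Tuple

BatchResult = Tuple[List[str], Optional[str]]


def process_batch(func: Callable[[str], str], batch: List[str]) -> BatchResult:
    """Worker side: never raise, return results plus a traceback string."""
    try:
        return [func(item) for item in batch], None
    except Exception:
        # A formatted traceback is a plain string, so it can be pickled
        # and sent to the parent process, unlike many exception objects.
        return [], traceback.format_exc()


def collect(results: Iterable[BatchResult], error_handler: Callable[[str], None]) -> List[str]:
    """Parent side: hand tracebacks to one top-level handler."""
    out: List[str] = []
    for items, tb in results:
        if tb is not None:
            error_handler(tb)  # may raise, log, or ignore
        out.extend(items)  # a failed batch contributes fewer (here: no) items
    return out


def shout(text: str) -> str:
    if not text:
        raise ValueError("empty text")
    return text.upper()
```

Because the worker always returns, the parent never blocks waiting for a batch that died, and a handler that ignores errors simply yields shorter output.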
svlandeg
235e9f5488 call replace_listener_cfg attr if it's available 2021-05-12 17:19:38 +02:00
svlandeg
44a3a58599 call replace_listener attr if it's available 2021-05-12 16:01:02 +02:00
Santiago Castro
e99ff6f255
Fix typo in Language docstrings (#7958) 2021-05-03 14:44:09 +02:00
Adriane Boyd
95c0833656
Add training option to set annotations on update (#7767)
* Add training option to set annotations on update

Add a `[training]` option called `set_annotations_on_update` to specify
a list of components for which the predicted annotations should be set
on `example.predicted` immediately after that component has been
updated. The predicted annotations can be accessed by later components
in the pipeline during the processing of the batch in the same `update`
call.

* Rename to annotates / annotating_components

* Add test for `annotating_components` when training from config

* Add documentation
2021-04-26 16:53:53 +02:00
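
The effect of annotating components during an update can be shown with a toy training step. This is a sketch under assumptions: examples are plain dictionaries with a `"predicted"` entry, and `ToyTagger`, `ToyParser`, and `update` are hypothetical stand-ins rather than the library's classes.

```python
from typing import Dict, List, Sequence, Tuple


class ToyTagger:
    def update(self, examples: List[Dict]) -> None:
        pass  # a real component would compute a loss and update weights

    def set_annotations(self, predicted: Dict) -> None:
        predicted["tags"] = [w.upper() for w in predicted["words"]]


class ToyParser:
    def __init__(self) -> None:
        self.saw_tags: List[bool] = []

    def update(self, examples: List[Dict]) -> None:
        # Record whether the tagger's predictions were visible here.
        self.saw_tags = ["tags" in eg["predicted"] for eg in examples]

    def set_annotations(self, predicted: Dict) -> None:
        predicted["heads"] = [0] * len(predicted["words"])


def update(
    components: Sequence[Tuple[str, object]],
    examples: List[Dict],
    annotating_components: Sequence[str] = (),
) -> None:
    for name, component in components:
        component.update(examples)
        if name in annotating_components:
            # Write predictions back right away so later components in
            # the same update call can use them as features.
            for eg in examples:
                component.set_annotations(eg["predicted"])
```

Listing only `"tagger"` as annotating means the parser's update sees tags, while the parser's own predictions are not written back.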
Adriane Boyd
1ad646cbcf
Improve checks for sourced components (#7490)
* Improve checks for sourced components

* Remove language class checks

* Convert python warning to logger warning

* Remove unused warning

* Fix formatting
2021-04-19 18:36:32 +10:00
Adriane Boyd
82d3caf861
Implement replace_listeners for source in config (#7620)
Implement replace_listeners for sourced components loaded from a config.
2021-04-08 18:21:22 +10:00
Paul O'Leary McCann
40bc01e668 Proactively remove unused listeners
With this, the changes in initialize.py might be unnecessary.

Requires testing.
2021-03-17 22:41:41 +09:00
Adriane Boyd
d746ea6278
Add warning about GPU selection in Jupyter notebooks (#7075)
* Initial warning

* Update check

* Redo edit

* Move jupyter warning to helper method

* Add link with details to warnings
2021-03-09 15:35:21 +01:00
Sofie Van Landeghem
39de3602e0
return custom error in nlp.initialize (#7104)
* return custom error in nlp.initialize

* Rename error

Co-authored-by: Ines Montani <ines@ines.io>
2021-03-09 23:01:31 +11:00
Sofie Van Landeghem
cd70c3cb79
Fixing pretrain (#7342)
* initialize NLP with train corpus

* add more pretraining tests

* more tests

* function to fetch tok2vec layer for pretraining

* clarify parameter name

* test different objectives

* formatting

* fix check for static vectors when using vectors objective

* clarify docs

* logger statement

* fix init_tok2vec and proc.initialize order

* test training after pretraining

* add init_config tests for pretraining

* pop pretraining block to avoid config validation errors

* custom errors
2021-03-09 14:01:13 +11:00
Adriane Boyd
e43d43db32
Allow sourcing disabled components (#7215)
Check `component_names` instead of `pipe_names` to allow sourcing
disabled components.
2021-02-26 13:50:56 +01:00
Sofie Van Landeghem
f638306598
remove link_components flag again (#6883) 2021-02-02 10:08:40 +08:00
Sofie Van Landeghem
acabb284dd
Fix linking resumed components (#6859)
* link components across enabled, resumed and frozen

* revert renaming

* revert renaming, the sequel
2021-02-01 22:19:58 +11:00
Ines Montani
d0c3775712 Replace links to nightly docs [ci skip] 2021-01-30 20:09:38 +11:00
Ines Montani
e6accb3a9e Tidy up and auto-format 2021-01-30 12:52:33 +11:00
Ines Montani
7886d59c56 Add check for remove_listener method 2021-01-29 23:47:30 +11:00
Ines Montani
94232aea08 Improve E889 2021-01-29 23:39:23 +11:00
Ines Montani
e766e8c56d
Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-01-29 21:41:17 +11:00
Ines Montani
325f47500d Move replacement logic to Language.from_config 2021-01-29 19:37:04 +11:00
Ines Montani
99842387cb Remove default value 2021-01-29 18:45:37 +11:00
Ines Montani
44b5542d14 Change method order 2021-01-29 18:42:41 +11:00
Ines Montani
8c15d1daec Update and validate config first and exit early if paths don't exist 2021-01-29 18:24:47 +11:00
Ines Montani
bbb94b37c6 Update error handling and docstring 2021-01-29 16:27:49 +11:00
Ines Montani
01ecfbcc45 Merge branch 'develop' into feature/replace-listeners 2021-01-29 15:57:32 +11:00
Ines Montani
911dfcccfc Add option to replace listeners for sourced components 2021-01-29 15:57:04 +11:00
Sofie Van Landeghem
837a4f53c2
Error handling in nlp.pipe (#6817)
* add error handler for pipe methods

* add unit tests

* remove pipe methods that are the same as their base class

* have Language keep track of a default error handler

* cleanup

* formatting

* small refactor

* add documentation
2021-01-29 08:51:21 +08:00
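
A default error handler that a pipeline keeps track of, with a per-call override, might look roughly like the sketch below. `ToyLanguage` is hypothetical, and the four-argument handler signature `(name, component, docs, error)` is an assumption made for the example.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Component = Callable[[str], str]
ErrorHandler = Callable[[str, Component, List[str], Exception], None]


def raise_error(name: str, component: Component, docs: List[str], error: Exception) -> None:
    raise error  # default behavior: propagate the original exception


def ignore_error(name: str, component: Component, docs: List[str], error: Exception) -> None:
    pass  # skip the failing document and continue


class ToyLanguage:
    """Hypothetical pipeline that owns a default error handler."""

    def __init__(self) -> None:
        self.components: List[Tuple[str, Component]] = []
        self.default_error_handler: ErrorHandler = raise_error

    def pipe(self, texts: Iterable[str], error_handler: Optional[ErrorHandler] = None) -> Iterator[str]:
        handler = error_handler or self.default_error_handler
        for text in texts:
            doc, failed = text, False
            for name, component in self.components:
                try:
                    doc = component(doc)
                except Exception as err:
                    handler(name, component, [doc], err)
                    failed = True  # handler chose not to raise: drop this doc
                    break
            if not failed:
                yield doc
```

Passing `ignore_error` lets a long-running stream continue past bad inputs, while the default keeps failures visible.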
Ines Montani
fabd3a3394 Tidy up code comments [ci skip] 2021-01-27 12:40:03 +11:00