Commit Graph

14531 Commits

Author SHA1 Message Date
Stanislav Schmidt
2516896849
Make vocab update in get_docs deterministic (#7603)
* Make vocab update in get_docs deterministic

The attribute `DocBin.strings` is a set. In `DocBin.get_docs`
a given vocab is updated by iterating over this set.
Iteration over a python set produces an arbitrary ordering,
therefore vocab is updated non-deterministically.

When training (fine-tuning) a spacy model, the base model's
vocabulary will be updated with the new vocabulary in the
training data in exactly the way described above. After
serialization, the file `model/vocab/strings.json` will
be sorted in an arbitrary way. This prevents reproducible
model training.

* Revert "Make vocab update in get_docs deterministic"

This reverts commit d6b87a2f55.

* Sort strings in StringStore serialization

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-09 11:53:13 +02:00
Adriane Boyd
8008e2f75b
Use morph hash in lemmatizer cache key (#7690)
Use the morph hash rather than the `MorphAnalysis` object in the cache
key so that the `Lemmatizer` can be pickled.
2021-04-08 13:22:38 +02:00
Sofie Van Landeghem
3e5bd5055e
expand quickstart widget with cuda 11.1 and 11.2 (#7615) 2021-04-08 12:25:42 +02:00
Adriane Boyd
e6b7600adf
Fix parser sourcing in NER converter (#7631) 2021-04-08 12:25:03 +02:00
Sofie Van Landeghem
204c2f116b
Extend score_spans for overlapping & non-labeled spans (#7209)
* extend span scorer with consider_label and allow_overlap

* unit test for spans y2x overlap

* add score_spans unit test

* docs for new fields in scorer.score_spans

* rename to include_label

* spell out if-else for clarity

* rename to 'labeled'

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-08 12:19:17 +02:00
Paul O'Leary McCann
c362006cb9
Fix is_sent_start when converting from JSON (fix #7635) (#7655)
Data in the JSON format is split into sentences, and each sentence is
saved with is_sent_start flags. Currently the flags are 1 for the first
token and 0 for the others. When deserialized this results in a pattern
of True, None, None, None... which makes single-sentence documents look
as though they haven't had sentence boundaries set.

Since items saved in JSON format have been split into sentences already,
the is_sent_start values should all be True or False.
2021-04-08 18:24:52 +10:00
Adriane Boyd
82d3caf861
Implement replace_listeners for source in config (#7620)
Implement replace_listeners for sourced components loaded from a config.
2021-04-08 18:21:22 +10:00
broaddeep
ee159b8543
Support match alignments (#7321)
* Support match alignments

* change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case

* remove added errors, utilize bint type, cleanup whitespace

* fix no new line in end of file

* Minor formatting

* Skip alignments processing if as_spans is set

* Add with_alignments to Matcher API docs

* Update website/docs/api/matcher.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-04-08 18:10:14 +10:00
Adriane Boyd
ff84075839
Support large/infinite training corpora (#7208)
* Support infinite generators for training corpora

Support a training corpus with an infinite generator in the `spacy
train` training loop:

* Revert `create_train_batches` to the state where an infinite generator
can be used as the in the first epoch of exactly one epoch without
resulting in a memory leak (`max_epochs != 1` will still result in a
memory leak)
* Move the shuffling for the first epoch into the corpus reader,
renaming it to `spacy.Corpus.v2`.

* Switch to training option for shuffling in memory

Training loop:

* Add option `training.shuffle_train_corpus_in_memory` that controls
whether the corpus is loaded in memory once and shuffled in the training
loop
  * Revert changes to `create_train_batches` and rename to
`create_train_batches_with_shuffling` for use with `spacy.Corpus.v1` and
a corpus that should be loaded in memory
  * Add `create_train_batches_without_shuffling` for a corpus that
should not be shuffled in the training loop: the corpus is merely
batched during training

Corpus readers:

* Restore `spacy.Corpus.v1`
* Add `spacy.ShuffledCorpus.v1` for a corpus shuffled in memory in the
reader instead of the training loop
  * In combination with `shuffle_train_corpus_in_memory = False`, each
epoch could result in a different augmentation

* Refactor create_train_batches, validation

* Rename config setting to `training.shuffle_train_corpus`
* Refactor to use a single `create_train_batches` method with a
`shuffle` option
* Only validate `get_examples` in initialize step if:
  * labels are required
  * labels are not provided

* Switch back to max_epochs=-1 for streaming train corpus

* Use first 100 examples for stream train corpus init

* Always check validate_get_examples in initialize
2021-04-08 18:08:04 +10:00
graue70
81fd595223
Fix __add__ method of PRFScore (#7557)
* Add failing test for PRFScore

* Fix erroneous implementation of __add__

* Simplify constructor

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-04-08 17:34:14 +10:00
Ines Montani
de4f4c9b8a Add more link anchors [ci skip] 2021-04-06 14:15:21 +10:00
Ines Montani
5bbdd7dc4c Update pipeline design docs [ci skip] 2021-04-06 14:13:22 +10:00
Ines Montani
1d1cfadbca Fix formatting [ci skip] 2021-04-06 14:13:13 +10:00
Jaidev Deshpande
93ee74a0a6
Add Numerizer to SpaCy universe (#7650)
Numerizer is a spaCy extension that converts numbers written in natural language
into numeric strings.
2021-04-05 19:02:27 +02:00
Paul O'Leary McCann
7944761ba7
Add warning if initial vectors are empty (#7641)
See #7637, where this came up.
2021-04-04 20:20:24 +02:00
Sam Edwardes
f6ad4684bd
Updates to universe.json for spaCyTextBlob (#7647)
* Updates to universe.json for spaCyTextBlob

Updated the documentation for spaCy 3.0.

* SamEdwardes.md

* Update SamEdwardes.md
2021-04-04 20:17:57 +02:00
Ayush Chaurasia
3c2ce41dd8
W&B integration: Optional support for dataset and model checkpoint logging and versioning (#7429)
* Add optional artifacts logging

* Update docs

* Update spacy/training/loggers.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/training/loggers.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/training/loggers.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Bump WandbLogger Version

* Add documentation of v1 to legacy docs

* bump spacy-legacy to 3.0.2 (to be released)

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-04-01 19:36:23 +02:00
vincent d warmerdam
8b3eec6e62
Add Tokenwiser to Projects (#7541)
* Add tokenwiser

* Update universe.json
2021-04-01 14:39:36 +02:00
Sofie Van Landeghem
59c2069eb1
Legacy docs (#7601)
* document legacy Tok2Vec architectures

* add TextCatEnsemble.v1 legacy documentation

* Separate legacy section in side bar
2021-03-30 12:43:14 +02:00
Adriane Boyd
348d1829c7
Preserve user data for DependencyMatcher on spans (#7528)
* Preserve user data for DependencyMatcher on spans

* Clean underscore in test

* Modify test to use extensions stored in user data
2021-03-30 12:26:22 +02:00
m0canu1
921feee092
Added more exception to the italian language from https://forum.wordr… (#7246)
* Added more exception to the italian language from https://forum.wordreference.com/threads/le-abbreviazioni-nella-lingua-italiana-abbreviations-in-italian.2464189/

* Remove unnecessary exception

Co-authored-by: Alexandru Mocanu <alexandru.mocanu@augeos.it>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-03-30 10:23:32 +02:00
Adriane Boyd
27a48f2802
Fix/update extension copying in Span.as_doc and Doc.from_docs (#7574)
* Adjust custom extension data when copying user data in `Span.as_doc()`
* Restrict `Doc.from_docs()` to adjusting offsets for custom extension
data
  * Update test to use extension
  * (Duplicate bug fix for character offset from #7497)
2021-03-30 09:49:12 +02:00
Santiago Castro
af07fc3bc1
Add support for CUDA 11.2 (#7583)
* Add support for CUDA 11.2

* Update the docs

* Format

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-03-30 09:47:33 +02:00
Álvaro Abella Bascarán
5b4dde38a3
fix fn name: tokenizer.infixes_finditer -> tokenizer.infix_finditer (#7606) 2021-03-30 09:45:49 +02:00
Adriane Boyd
3ae8661085
Fix tensor retokenization for non-numpy ops (#7527)
Implement manual `append` and `delete` for non-numpy ops.
2021-03-29 22:34:48 +11:00
Adriane Boyd
139f655f34
Merge doc.spans in Doc.from_docs() (#7497)
Merge data from `doc.spans` in `Doc.from_docs()`.

* Fix internal character offset set when merging empty docs (only
affects tokens and spans in `user_data` if an empty doc is in the list
of docs)
2021-03-29 22:34:01 +11:00
Adriane Boyd
d59f968d08
Keep sent starts without parse in retokenization (#7424)
In the retokenizer, only reset sent starts (with
`set_children_from_head`) if the doc is parsed. If there is no parse,
merged tokens have the unset `token.is_sent_start == None` by default after
retokenization.
2021-03-29 22:32:00 +11:00
Paul O'Leary McCann
faed54d659
Merge pull request #7537 from polm/docs/patience-negative
Remove mention of -1 for early stopping (fix #7535)
2021-03-26 21:11:53 +09:00
Paul O'Leary McCann
cdab341a75 Remove mention of -1 for early stopping (fix #7535)
Maybe this used to work differently, but currently a negative patience
just causes immediate termination.
2021-03-23 11:50:35 +09:00
Ines Montani
4bd3d01aaf
Merge pull request #7471 from polm/fix/listener-warnings 2021-03-22 12:45:02 +01:00
Ines Montani
d545ab4ca4
Merge pull request #7495 from adrianeboyd/bugfix/norm-ux
Update lexeme_norm checks
2021-03-22 12:44:52 +01:00
Ines Montani
be55f43163
Merge pull request #7473 from adrianeboyd/docs/v3-pipeline-deps-order 2021-03-22 12:43:07 +01:00
Ines Montani
3ee2fcfba0
Merge pull request #7483 from adrianeboyd/docs/various-v3-4 [ci skip] 2021-03-22 12:37:06 +01:00
Ines Montani
88e5a0dc16
Merge pull request #7504 from polm/fix/lexeme-docs [ci skip]
Fix mismatched backtick in Lexeme docs
2021-03-22 12:36:44 +01:00
Ines Montani
66ebd5c69e
Merge pull request #7491 from adrianeboyd/bugfix/corpus-depr-props
Update deprecated doc.is_sentenced in Corpus
2021-03-21 02:17:24 +01:00
Ines Montani
e3c3dbdb15
Merge pull request #7492 from adrianeboyd/bugfix/ux-matcher-attributes
Update matcher errors and docs
2021-03-21 02:17:13 +01:00
Adriane Boyd
0d2b723e8d Update entity setting section 2021-03-20 11:38:55 +01:00
Paul O'Leary McCann
e39c0dcf33 Fix mismatched backtick in Lexeme docs 2021-03-20 18:40:00 +09:00
Adriane Boyd
39153ef90f Update lexeme_norm checks
* Add util method for check
* Add new languages to list with lexeme norm tables
* Add check to all relevant components
* Add config details to warning message

Note that we're not actually inspecting the model config to see if
`NORM` is used as an attribute, so it may warn in cases where it's not
relevant.
2021-03-19 10:59:27 +01:00
Adriane Boyd
c771ec22f0 Update matcher errors and docs
* Mention `tagger+attribute_ruler` in `POS`/`MORPH` error messages for
`Matcher` and `PhraseMatcher`
* Document `Matcher.__call__(allow_missing=)`
2021-03-19 10:11:18 +01:00
Adriane Boyd
48b90c8e1c Update deprecated doc.is_sentenced in Corpus 2021-03-19 09:43:52 +01:00
Adriane Boyd
6a9a467766
Update website/docs/usage/processing-pipelines.md
Co-authored-by: Ines Montani <ines@ines.io>
2021-03-19 08:12:49 +01:00
Ines Montani
34e13c1161
Merge pull request #7472 from erre-quadro/universe/spikex
Add SpikeX to spaCy universe
2021-03-19 02:08:36 +01:00
Ines Montani
4f9aaa2366
Merge pull request #7451 from adrianeboyd/chore/add-py.typed
Add py.typed
2021-03-19 02:08:16 +01:00
Ines Montani
66b900a76d
Merge pull request #7440 from adrianeboyd/bugfix/ru-pymorph2-lookup-lemmatize
Rename and update Russian pymorphy2 lookup lemmatize
2021-03-19 01:54:08 +01:00
Ines Montani
2c6fa8c890
Merge pull request #7489 from adrianeboyd/bugfix/callbacks-entry-points
Check for callbacks entry points
2021-03-19 01:53:53 +01:00
Ines Montani
b878bc74b9
Merge pull request #7488 from Findus23/no-is-not
replace "is not" with !=
2021-03-19 01:53:38 +01:00
Adriane Boyd
0ad9e16ec3 Check for callbacks entry points 2021-03-18 21:18:25 +01:00
Lukas Winkler
3c362ac520
replace "is not" with != 2021-03-18 21:09:11 +01:00
Adriane Boyd
6354b642c5
Fix typo 2021-03-18 19:01:10 +01:00