Commit Graph

14594 Commits

Author SHA1 Message Date
Paul O'Leary McCann
894caab475
Merge pull request #8507 from bryant1410/patch-3
Fix typo in `train_cli` docstring
2021-06-27 13:50:48 +09:00
Santiago Castro
ee63b2b199
Fix typo in train_cli docstring 2021-06-25 22:45:03 -07:00
Adrian Zuber
f5aee0bbdf
Raise custom error in EntityLinker when KB is not set (#8442)
* Raise custom error in EntityLinker when KB is not set

* add contributor agreement

* Update E1018 error message
2021-06-25 23:04:00 +02:00
Matthew Honnibal
f9946154d9
Add SpanCategorizer component (#6747)
* Draft spancat model

* Add spancat model

* Add test for extract_spans

* Add extract_spans layer

* Upd extract_spans

* Add spancat model

* Add test for spancat model

* Upd spancat model

* Update spancat component

* Upd spancat

* Update spancat model

* Add quick spancat test

* Import SpanCategorizer

* Fix SpanCategorizer component

* Import SpanGroup

* Fix span extraction

* Fix import

* Fix import

* Upd model

* Update spancat models

* Add scoring, update defaults

* Update and add docs

* Fix type

* Update spacy/ml/extract_spans.py

* Auto-format and fix import

* Fix comment

* Fix type

* Fix type

* Update website/docs/api/spancategorizer.md

* Fix comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Better defense

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix labels list

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/extract_spans.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Set annotations during update

* Set annotations in spancat

* fix imports in test

* Update spacy/pipeline/spancat.py

* replace MaxoutLogistic with LinearLogistic

* fix config

* various small fixes

* remove set_annotations parameter in update

* use our beloved tupley format with recent support for doc.spans

* bugfix to allow renaming the default span_key (scores weren't showing up)

* use different key in docs example

* change defaults to better-working parameters from project (WIP)

* register spacy.extract_spans.v1 for legacy purposes

* Upd dev version so can build wheel

* layers instead of architectures for smaller building blocks

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Include additional scores from overrides in combined score weights

* Parameterize spans key in scoring

Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.

* Use the (intentionally very short) default spans key `sc` in the
  `SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
  returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
  the key in the `getter` instead.

Note that for `spancat` components with a custom `span_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.

* Update website/docs/api/scorer.md

* Fix scorer for spans key containing underscore

* Increment version

* Add Spans to Evaluate CLI (#8439)

* Add Spans to Evaluate CLI

* Change to spans_key

* Add spans per_type output

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix spancat GPU issues (#8455)

* Fix GPU issues

* Require thinc >=8.0.6

* Switch to glorot_uniform_init

* Fix and test ngram suggester

* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
2021-06-24 12:35:27 +02:00
Adriane Boyd
172dfec4f2
Test download in CI with ca_core_news_sm (#8493) 2021-06-24 09:26:30 +02:00
Ines Montani
fb9b389f52
Merge pull request #8486 from adrianeboyd/bugfix/template-paths-vectors
Preserve paths.vectors/initialize.vectors setting in quickstart template
2021-06-24 13:12:18 +10:00
Ines Montani
a8e8d02ba7
Merge pull request #8465 from explosion/feature/spacy-package-readme 2021-06-24 13:11:08 +10:00
Ines Montani
3e3d87a068 Update maintainer info [ci skip] 2021-06-24 12:37:55 +10:00
Ines Montani
40f13c3f0c Add docs [ci skip] 2021-06-24 11:57:15 +10:00
Ines Montani
3982be14e8 Improve fallbacks 2021-06-24 11:55:50 +10:00
Adriane Boyd
393c3c70d7
Various fixes for spans in Docs.from_docs (#8487)
* Fix spans offsets if a doc ends in a single space and no space is
  inserted
* Also include spans key in merged doc for empty spans lists
2021-06-23 15:51:35 +02:00
Adriane Boyd
5aa099505f Preserve paths.vectors/initialize.vectors setting in quickstart template 2021-06-23 11:07:14 +02:00
Ines Montani
ed1ba13439
Merge pull request #8477 from themrmax/patch-1 [ci skip]
Fix broken link
2021-06-23 10:41:22 +10:00
themrmax
d96c422cfc
Fix broken link
change /api/registry to /api/top-level#registry
2021-06-22 15:34:06 -07:00
Nick Sorros
31504f5982
Switch model and data path in prodigy project.yml recipe (#8467) 2021-06-22 09:41:45 +02:00
Ines Montani
cdcbd1023a Auto-generate README in spacy packge 2021-06-22 12:06:25 +10:00
Adriane Boyd
caba63b74f
Set version to v3.1.0 (#8452)
* Update test for v3.1

* Set version to v3.1.0
2021-06-21 10:41:40 +02:00
Adriane Boyd
9fde258053
Use minor version for compatibility check (#8403)
* Use minor version for compatibility check

* Use minor version of compatibility table
* Soften warning message about incompatible models
* Add test for presence of current version in compatibility table

* Add test for download compatibility table

* Use minor version of lower pin in error message if possible

* Fall back to spacy_git_version if available

* Fix unknown version string
2021-06-21 09:39:22 +02:00
Adriane Boyd
ec71a6b572
Filter W036 for entity ruler, etc. (#8424) 2021-06-21 09:34:29 +02:00
Adriane Boyd
e39d1bd4ab
Various docs updates for v3.1 (#8406)
* Update for Catalan/Italian lemmatizer changes

* Add warning about relevance of section
2021-06-21 09:33:50 +02:00
Adriane Boyd
7abfa25035
Don't use the same vocab for source models (#8388)
* Don't use the same vocab for source models

The source models should not be loaded with the vocab from the current
pipeline because this loads the vectors from the source model into the
current vocab.

The strings are all copied in `Language.create_pipe_from_source`, so if
the vectors are configured correctly in the current pipeline, the
sourced component will work as expected. If there is a vector mismatch,
a warning is shown. (It's not possible to inspect whether the vectors
are actually used by the component, so a warning is the best option.)

* Update comment on source model loading
2021-06-21 09:33:33 +02:00
Ines Montani
02d2fdb123 Add link anchor [ci skip] 2021-06-20 11:29:19 +10:00
Adriane Boyd
83fd04dee5
Update package CLI handling of README and LICENSE (#8422)
* Copy rather than move files to top-level of package
* Add all files to `MANIFEST.in` (primarily for older versions of pip)
* Include the `README.md` contents as `long_description` in the setup
2021-06-18 15:48:53 +02:00
Adriane Boyd
30d4eb506a
Fix setting empty entities in Example.from_dict (#8426) 2021-06-18 10:41:50 +02:00
Adriane Boyd
59da26ddad
Update spacy-lookups-data in Makefile (#8408) 2021-06-17 09:56:36 +02:00
Matthew Honnibal
6f5e308d17
Support negative examples in partial NER annotations (#8106)
* Support a cfg field in transition system

* Make NER 'has gold' check use right alignment for span

* Pass 'negative_samples_key' property into NER transition system

* Add field for negative samples to NER transition system

* Check neg_key in NER has_gold

* Support negative examples in NER oracle

* Test for negative examples in NER

* Fix name of config variable in NER

* Remove vestiges of old-style partial annotation

* Remove obsolete tests

* Add comment noting lack of support for negative samples in parser

* Additions to "neg examples" PR (#8201)

* add custom error and test for deprecated format

* add test for unlearning an entity

* add break also for Begin's cost

* add negative_samples_key property on Parser

* rename

* extend docs & fix some older docs issues

* add subclass constructors, clean up tests, fix docs

* add flaky test with ValueError if gold parse was not found

* remove ValueError if n_gold == 0

* fix docstring

* Hack in environment variables to try out training

* Remove hack

* Remove NER hack, and support 'negative O' samples

* Fix O oracle

* Fix transition parser

* Remove 'not O' from oracle

* Fix NER oracle

* check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation

* use set instead of list in consistency check

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-17 17:33:00 +10:00
Adriane Boyd
02bac8f269
Fix non-deterministic deduplication in Greek lemmatizer (#8421) 2021-06-17 09:11:01 +02:00
Adriane Boyd
994bed2fe2
Update dependencies (#8409)
* Require `thinc>=8.0.5`
* Use `spacy-lookups-data>=1.0.2`
2021-06-16 19:50:28 +02:00
Sofie Van Landeghem
e796aab4b3
Resizable textcat (#7862)
* implement textcat resizing for TextCatCNN

* resizing textcat in-place

* simplify code

* ensure predictions for old textcat labels remain the same after resizing (WIP)

* fix for softmax

* store softmax as attr

* fix ensemble weight copy and cleanup

* restructure slightly

* adjust documentation, update tests and quickstart templates to use latest versions

* extend unit test slightly

* revert unnecessary edits

* fix typo

* ensemble architecture won't be resizable for now

* use resizable layer (WIP)

* revert using resizable layer

* resizable container while avoid shape inference trouble

* cleanup

* ensure model continues training after resizing

* use fill_b parameter

* use fill_defaults

* resize_layer callback

* format

* bump thinc to 8.0.4

* bump spacy-legacy to 3.0.6
2021-06-16 11:45:00 +02:00
Giovanni Toffoli
19521d525b
Added Italian POS-aware lemmatizer. (#8079)
* Added Italian POS-aware lemmatizer.

Also added the code used to build the lookup tables by POS.

* Create gtoffoli.md

* Add imports and format

* Remove helper script

* Use lemma_lookup instead of lemma_lookup_legacy

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-06-16 11:14:45 +02:00
Antti Ajanki
5a6125c227
[Finnish tokenizer] Handle conjunction contractions (#8105) 2021-06-16 10:56:47 +02:00
Adriane Boyd
b09be3e1cb
Merge pull request #8397 from adrianeboyd/chore/develop-into-master-v3.1
Merge develop into master for v3.1
2021-06-16 10:54:47 +02:00
Adriane Boyd
33240ed2c5 Temporarily skip model download test 2021-06-16 10:14:42 +02:00
Adriane Boyd
5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
Adriane Boyd
480a3bf3be
Make JsonlReader path optional (#8396)
To avoid config errors during training when `[corpora.pretrain.path]` is
`None` with the default `spacy.JsonlCorpus.v1` reader, make the reader
path optional, similar to `spacy.Corpus.v1`.
2021-06-15 14:55:15 +02:00
Paul O'Leary McCann
94e1346f44
Change span lemmas to use original whitespace (fix #8368) (#8391)
* Change span lemmas to use original whitespace (fix #8368)

This is a redo of #8371 based off master.

The test for this required some changes to existing tests. I don't think
the changes were significant but I'd like someone to check them.

* Remove mystery docstring

This sentence was uncompleted for years, and now we will never know how
it ends.
2021-06-15 13:24:54 +02:00
Paul O'Leary McCann
2c105cdbce
Raise error if deps not provided with heads (#8335)
* Fill in deps if not provided with heads

Before this change, if heads were passed without deps they would be
silently ignored, which could be confusing. See #8334.

* Use "dep" instead of a blank string

This is the customary placeholder dep. It might be better to show an
error here instead though.

* Throw error on heads without deps

* Add a test

* Fix tests

* Formatting

* Fix all tests

* Fix a test I missed

* Revise error message

* Clean up whitespace

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-06-15 13:23:32 +02:00
Sofie Van Landeghem
0fd0d949c4
fix 's typo's across code base (#8384) 2021-06-15 10:57:08 +02:00
Adriane Boyd
507422149f
Various docs updates for v3.0 (#8353)
* Update cats score names in Scorer API docs

* Refer to performance in meta

* Update package naming/versions, lemmatizer details

* Minor formatting fixes

* Provide more explanation for cats_score_desc

* Provide language-specific lemmatizer defaults in API docs

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-06-14 12:19:36 +02:00
Adriane Boyd
6b69b8934b
Set version to v3.1.0.dev0 (#8379) 2021-06-14 11:17:35 +02:00
Sofie Van Landeghem
8729307e67
register extract_ngrams layer (#8358)
* register extract_ngrams layer

* fix import

* bump spacy-legacy to 3.0.6

* revert bump (wrong PR)
2021-06-14 10:30:30 +02:00
Adriane Boyd
63d748f80e
Add Catalan and Danish trf to website models (#8378) 2021-06-14 09:50:13 +02:00
Ines Montani
3259faad42 Update YouTube embed [ci skip] 2021-06-14 10:21:01 +10:00
Ines Montani
7f0f674a1b Fix universe.json and auto-format [ci skip] 2021-06-14 10:18:06 +10:00
Adriane Boyd
b98d216205
Update Catalan language data (#8308)
* Update Catalan language data

Update Catalan language data based on contributions from the Text Mining
Unit at the Barcelona Supercomputing Center:

https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data

* Update tokenizer settings for UD Catalan AnCora

Update for UD Catalan AnCora v2.7 with merged multi-word tokens.

* Update test

* Move prefix patternt to more generic infix pattern

* Clean up
2021-06-11 10:21:22 +02:00
Adriane Boyd
d9be9e6cf9
Move README.md and LICENSES_SOURCES in package (#8297)
In addition to `LICENSE`, move the files `README.md` and
`LICENSES_SOURCES` to the top directory in `spacy package` if present in
the model directory.
2021-06-11 10:20:24 +02:00
Adriane Boyd
dbbeab2506
Merge pull request #8285 from adrianeboyd/feature/refactor-logger-warnings
Refactor warnings
2021-06-11 10:20:02 +02:00
Adriane Boyd
f4008bdb13
Restrict pymorphy2 requirement to pymorphy2 mode (#8299)
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
2021-06-11 10:19:22 +02:00
Francisco Aranda
0a1a4c665d
update spacy-wordnet code example (#8327)
* update spacy-wordnet code example

- include spaCy 2.x and 3.x init alternatives
- upgrade recognai logo

* fix escape chars
2021-06-10 21:53:11 +02:00
Adriane Boyd
6d2789452e
Restrict cython to <3.0 (#8337) 2021-06-10 11:03:30 +02:00