Commit Graph

14848 Commits

Author SHA1 Message Date
Adriane Boyd
172dfec4f2
Test download in CI with ca_core_news_sm (#8493) 2021-06-24 09:26:30 +02:00
Ines Montani
fb9b389f52
Merge pull request #8486 from adrianeboyd/bugfix/template-paths-vectors
Preserve paths.vectors/initialize.vectors setting in quickstart template
2021-06-24 13:12:18 +10:00
Ines Montani
528746129d Merge branch 'master' into docs/new-in-v3-1 2021-06-24 13:11:37 +10:00
Ines Montani
a8e8d02ba7
Merge pull request #8465 from explosion/feature/spacy-package-readme 2021-06-24 13:11:08 +10:00
Ines Montani
3e3d87a068 Update maintainer info [ci skip] 2021-06-24 12:37:55 +10:00
Ines Montani
3e058dee62 Update features [ci skip] 2021-06-24 12:36:04 +10:00
Ines Montani
40f13c3f0c Add docs [ci skip] 2021-06-24 11:57:15 +10:00
Ines Montani
3982be14e8 Improve fallbacks 2021-06-24 11:55:50 +10:00
Ines Montani
a1e4aca267 Fix sentence [ci skip] 2021-06-24 11:40:36 +10:00
Adriane Boyd
393c3c70d7
Various fixes for spans in Doc.from_docs (#8487)
* Fix spans offsets if a doc ends in a single space and no space is
  inserted
* Also include spans key in merged doc for empty spans lists
2021-06-23 15:51:35 +02:00
Adriane Boyd
5aa099505f Preserve paths.vectors/initialize.vectors setting in quickstart template 2021-06-23 11:07:14 +02:00
Ines Montani
ca0d904faa Update details [ci skip] 2021-06-23 13:05:56 +10:00
Ines Montani
ed1ba13439
Merge pull request #8477 from themrmax/patch-1 [ci skip]
Fix broken link
2021-06-23 10:41:22 +10:00
themrmax
d96c422cfc
Fix broken link
change /api/registry to /api/top-level#registry
2021-06-22 15:34:06 -07:00
Ines Montani
e9b68d4f4c Update details and add example [ci skip] 2021-06-22 17:51:03 +10:00
Nick Sorros
31504f5982
Switch model and data path in prodigy project.yml recipe (#8467) 2021-06-22 09:41:45 +02:00
Ines Montani
bc93c34f54 Add "New in v3.1" guide 2021-06-22 15:23:18 +10:00
Ines Montani
cdcbd1023a Auto-generate README in spacy package 2021-06-22 12:06:25 +10:00
Adriane Boyd
caba63b74f
Set version to v3.1.0 (#8452)
* Update test for v3.1

* Set version to v3.1.0
2021-06-21 10:41:40 +02:00
Adriane Boyd
9fde258053
Use minor version for compatibility check (#8403)
* Use minor version for compatibility check

* Use minor version of compatibility table
* Soften warning message about incompatible models
* Add test for presence of current version in compatibility table

* Add test for download compatibility table

* Use minor version of lower pin in error message if possible

* Fall back to spacy_git_version if available

* Fix unknown version string
2021-06-21 09:39:22 +02:00
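The idea of checking compatibility on the minor version only can be sketched as follows. This is an illustrative, standalone sketch and not spaCy's actual code: the helper names `minor_of` and `same_minor` are hypothetical, and the parsing is deliberately simple (it only looks at the first two dot-separated fields).

```python
from typing import Optional


def minor_of(version: str) -> Optional[str]:
    """Return the "major.minor" prefix of a version string.

    Returns None when the string does not start with two numeric fields,
    for example an unknown version string.
    """
    parts = version.split(".")
    if len(parts) < 2 or not parts[0].isdigit() or not parts[1].isdigit():
        return None
    return f"{parts[0]}.{parts[1]}"


def same_minor(version_a: str, version_b: str) -> bool:
    """Compare two versions on major.minor only, ignoring patch and suffixes."""
    minor_a = minor_of(version_a)
    minor_b = minor_of(version_b)
    return minor_a is not None and minor_a == minor_b
```

Under this sketch, `3.1.0` and `3.1.0.dev0` count as compatible with each other, while `3.0.6` and `3.1.0` do not.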
Adriane Boyd
ec71a6b572
Filter W036 for entity ruler, etc. (#8424) 2021-06-21 09:34:29 +02:00
Adriane Boyd
e39d1bd4ab
Various docs updates for v3.1 (#8406)
* Update for Catalan/Italian lemmatizer changes

* Add warning about relevance of section
2021-06-21 09:33:50 +02:00
Adriane Boyd
7abfa25035
Don't use the same vocab for source models (#8388)
* Don't use the same vocab for source models

The source models should not be loaded with the vocab from the current
pipeline because this loads the vectors from the source model into the
current vocab.

The strings are all copied in `Language.create_pipe_from_source`, so if
the vectors are configured correctly in the current pipeline, the
sourced component will work as expected. If there is a vector mismatch,
a warning is shown. (It's not possible to inspect whether the vectors
are actually used by the component, so a warning is the best option.)

* Update comment on source model loading
2021-06-21 09:33:33 +02:00
Ines Montani
02d2fdb123 Add link anchor [ci skip] 2021-06-20 11:29:19 +10:00
Adriane Boyd
83fd04dee5
Update package CLI handling of README and LICENSE (#8422)
* Copy rather than move files to top-level of package
* Add all files to `MANIFEST.in` (primarily for older versions of pip)
* Include the `README.md` contents as `long_description` in the setup
2021-06-18 15:48:53 +02:00
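The three steps in this commit message (copy instead of move, list the files in `MANIFEST.in`, reuse the README as `long_description`) might look roughly like the sketch below. It is an assumption-laden illustration, not the `spacy package` implementation: the function names and the `TOP_LEVEL_FILES` list are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List

# Hypothetical list for illustration; the commit names README and LICENSE.
TOP_LEVEL_FILES = ["README.md", "LICENSE"]


def copy_to_top_level(model_dir: Path, package_dir: Path) -> List[str]:
    """Copy (rather than move) known files to the top level of the package."""
    copied = []
    for name in TOP_LEVEL_FILES:
        source = model_dir / name
        if source.is_file():
            # shutil.copy leaves the original in the model directory.
            shutil.copy(source, package_dir / name)
            copied.append(name)
    return copied


def build_manifest(copied: List[str]) -> str:
    """Build MANIFEST.in content with one `include` line per copied file."""
    return "".join(f"include {name}\n" for name in copied)


def read_long_description(package_dir: Path) -> str:
    """Return the README contents for use as long_description, or ""."""
    readme = package_dir / "README.md"
    return readme.read_text(encoding="utf8") if readme.is_file() else ""
```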
Adriane Boyd
30d4eb506a
Fix setting empty entities in Example.from_dict (#8426) 2021-06-18 10:41:50 +02:00
Adriane Boyd
59da26ddad
Update spacy-lookups-data in Makefile (#8408) 2021-06-17 09:56:36 +02:00
Matthew Honnibal
6f5e308d17
Support negative examples in partial NER annotations (#8106)
* Support a cfg field in transition system

* Make NER 'has gold' check use right alignment for span

* Pass 'negative_samples_key' property into NER transition system

* Add field for negative samples to NER transition system

* Check neg_key in NER has_gold

* Support negative examples in NER oracle

* Test for negative examples in NER

* Fix name of config variable in NER

* Remove vestiges of old-style partial annotation

* Remove obsolete tests

* Add comment noting lack of support for negative samples in parser

* Additions to "neg examples" PR (#8201)

* add custom error and test for deprecated format

* add test for unlearning an entity

* add break also for Begin's cost

* add negative_samples_key property on Parser

* rename

* extend docs & fix some older docs issues

* add subclass constructors, clean up tests, fix docs

* add flaky test with ValueError if gold parse was not found

* remove ValueError if n_gold == 0

* fix docstring

* Hack in environment variables to try out training

* Remove hack

* Remove NER hack, and support 'negative O' samples

* Fix O oracle

* Fix transition parser

* Remove 'not O' from oracle

* Fix NER oracle

* check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation

* use set instead of list in consistency check

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-17 17:33:00 +10:00
Adriane Boyd
02bac8f269
Fix non-deterministic deduplication in Greek lemmatizer (#8421) 2021-06-17 09:11:01 +02:00
svlandeg
bb9d2f1546 extend example to ensure the text is preserved 2021-06-16 23:56:35 +02:00
Adriane Boyd
994bed2fe2
Update dependencies (#8409)
* Require `thinc>=8.0.5`
* Use `spacy-lookups-data>=1.0.2`
2021-06-16 19:50:28 +02:00
Sofie Van Landeghem
e796aab4b3
Resizable textcat (#7862)
* implement textcat resizing for TextCatCNN

* resizing textcat in-place

* simplify code

* ensure predictions for old textcat labels remain the same after resizing (WIP)

* fix for softmax

* store softmax as attr

* fix ensemble weight copy and cleanup

* restructure slightly

* adjust documentation, update tests and quickstart templates to use latest versions

* extend unit test slightly

* revert unnecessary edits

* fix typo

* ensemble architecture won't be resizable for now

* use resizable layer (WIP)

* revert using resizable layer

* resizable container while avoiding shape inference trouble

* cleanup

* ensure model continues training after resizing

* use fill_b parameter

* use fill_defaults

* resize_layer callback

* format

* bump thinc to 8.0.4

* bump spacy-legacy to 3.0.6
2021-06-16 11:45:00 +02:00
Giovanni Toffoli
19521d525b
Added Italian POS-aware lemmatizer. (#8079)
* Added Italian POS-aware lemmatizer.

Also added the code used to build the lookup tables by POS.

* Create gtoffoli.md

* Add imports and format

* Remove helper script

* Use lemma_lookup instead of lemma_lookup_legacy

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-06-16 11:14:45 +02:00
svlandeg
29d83dec0c adjust whitespace tokenizer to avoid sep in split() 2021-06-16 10:58:45 +02:00
Antti Ajanki
5a6125c227
[Finnish tokenizer] Handle conjunction contractions (#8105) 2021-06-16 10:56:47 +02:00
Adriane Boyd
b09be3e1cb
Merge pull request #8397 from adrianeboyd/chore/develop-into-master-v3.1
Merge develop into master for v3.1
2021-06-16 10:54:47 +02:00
Adriane Boyd
33240ed2c5 Temporarily skip model download test 2021-06-16 10:14:42 +02:00
Adriane Boyd
5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
Adriane Boyd
480a3bf3be
Make JsonlReader path optional (#8396)
To avoid config errors during training when `[corpora.pretrain.path]` is
`None` with the default `spacy.JsonlCorpus.v1` reader, make the reader
path optional, similar to `spacy.Corpus.v1`.
2021-06-15 14:55:15 +02:00
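A reader with an optional path can be sketched as below. This is a simplified stand-in, not the `spacy.JsonlCorpus.v1` reader itself: `read_jsonl_texts` is a hypothetical name, and it yields plain strings where the real reader produces training examples.

```python
import json
from pathlib import Path
from typing import Iterator, Optional, Union


def read_jsonl_texts(path: Optional[Union[str, Path]]) -> Iterator[str]:
    """Yield the "text" field of each JSONL record.

    When path is None (for example, no pretraining corpus is configured),
    the reader yields nothing instead of raising an error.
    """
    if path is None:
        return
    with Path(path).open(encoding="utf8") as file_:
        for line in file_:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if "text" in record:
                yield record["text"]
```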
Paul O'Leary McCann
94e1346f44
Change span lemmas to use original whitespace (fix #8368) (#8391)
* Change span lemmas to use original whitespace (fix #8368)

This is a redo of #8371 based off master.

The test for this required some changes to existing tests. I don't think
the changes were significant but I'd like someone to check them.

* Remove mystery docstring

This sentence was uncompleted for years, and now we will never know how
it ends.
2021-06-15 13:24:54 +02:00
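The difference between joining lemmas with a single space and joining them with each token's original trailing whitespace can be illustrated with a small sketch. Tokens are modelled here as hypothetical `(lemma, trailing_whitespace)` pairs rather than spaCy `Token` objects, and `span_lemma` is an illustrative name.

```python
from typing import List, Tuple

# Each token is (lemma, trailing whitespace in the original text).
TokenInfo = Tuple[str, str]


def span_lemma(tokens: List[TokenInfo]) -> str:
    """Join lemmas using the original whitespace that followed each token."""
    return "".join(lemma + whitespace for lemma, whitespace in tokens).strip()


def span_lemma_space_joined(tokens: List[TokenInfo]) -> str:
    """Join lemmas with a single space, ignoring the original whitespace."""
    return " ".join(lemma for lemma, _ in tokens)


# "dogs, cats": no space between "dogs" and ",", one space after ",".
tokens = [("dog", ""), (",", " "), ("cat", "")]
print(span_lemma(tokens))               # dog, cat
print(span_lemma_space_joined(tokens))  # dog , cat
```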
Paul O'Leary McCann
2c105cdbce
Raise error if deps not provided with heads (#8335)
* Fill in deps if not provided with heads

Before this change, if heads were passed without deps they would be
silently ignored, which could be confusing. See #8334.

* Use "dep" instead of a blank string

This is the customary placeholder dep. It might be better to show an
error here instead though.

* Throw error on heads without deps

* Add a test

* Fix tests

* Formatting

* Fix all tests

* Fix a test I missed

* Revise error message

* Clean up whitespace

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-06-15 13:23:32 +02:00
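The final behaviour described above, raising an error instead of silently ignoring heads passed without deps, can be sketched as a small validation step. The function name `validate_heads_and_deps` and the error text are hypothetical and do not reproduce spaCy's actual check or message.

```python
from typing import List, Optional


def validate_heads_and_deps(
    words: List[str],
    heads: Optional[List[int]] = None,
    deps: Optional[List[str]] = None,
) -> None:
    """Raise ValueError if heads are given without deps or lengths differ."""
    if heads is not None and deps is None:
        # Previously such heads would have been silently dropped.
        raise ValueError("heads were provided without deps")
    for name, values in (("heads", heads), ("deps", deps)):
        if values is not None and len(values) != len(words):
            raise ValueError(f"{name} must have one entry per word")
```

For example, `validate_heads_and_deps(["a", "b"], heads=[1, 1])` raises `ValueError`, while passing `deps=["dep", "ROOT"]` as well returns without error.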
Sofie Van Landeghem
0fd0d949c4
fix 's typo's across code base (#8384) 2021-06-15 10:57:08 +02:00
Adriane Boyd
507422149f
Various docs updates for v3.0 (#8353)
* Update cats score names in Scorer API docs

* Refer to performance in meta

* Update package naming/versions, lemmatizer details

* Minor formatting fixes

* Provide more explanation for cats_score_desc

* Provide language-specific lemmatizer defaults in API docs

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-06-14 12:19:36 +02:00
Adriane Boyd
6b69b8934b
Set version to v3.1.0.dev0 (#8379) 2021-06-14 11:17:35 +02:00
Sofie Van Landeghem
8729307e67
register extract_ngrams layer (#8358)
* register extract_ngrams layer

* fix import

* bump spacy-legacy to 3.0.6

* revert bump (wrong PR)
2021-06-14 10:30:30 +02:00
Adriane Boyd
63d748f80e
Add Catalan and Danish trf to website models (#8378) 2021-06-14 09:50:13 +02:00
Ines Montani
3259faad42 Update YouTube embed [ci skip] 2021-06-14 10:21:01 +10:00
Ines Montani
7f0f674a1b Fix universe.json and auto-format [ci skip] 2021-06-14 10:18:06 +10:00
Adriane Boyd
b98d216205
Update Catalan language data (#8308)
* Update Catalan language data

Update Catalan language data based on contributions from the Text Mining
Unit at the Barcelona Supercomputing Center:

https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data

* Update tokenizer settings for UD Catalan AnCora

Update for UD Catalan AnCora v2.7 with merged multi-word tokens.

* Update test

* Move prefix pattern to more generic infix pattern

* Clean up
2021-06-11 10:21:22 +02:00
Adriane Boyd
d9be9e6cf9
Move README.md and LICENSES_SOURCES in package (#8297)
In addition to `LICENSE`, move the files `README.md` and
`LICENSES_SOURCES` to the top directory in `spacy package` if present in
the model directory.
2021-06-11 10:20:24 +02:00